2021-09-24 18:50:03

by Shiyang Ruan

[permalink] [raw]
Subject: [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink

This patchset is aimed to support shared pages tracking for fsdax.

Changes from [V6 RESEND]:
- Move ->memory_failure() into the patch who implements it
- Change the parameter of ->memory_failure():
unsigned long nr_pfns -> size_t size
- Remove changes(2 patches) for mapped device
- Add some necessary comments for functions or interfaces
- P1: Make a pre-patch for changes for dax_{read,write}_lock()
- P2: Change the parameter of ->notify_failure():
void *data -> int flags
- P3: keep the original dax_lock_page() logic
- P5: Rewrite the lock function for file's mapping and index:
dax_lock_mapping_entry()
- P6: use the new dax_lock_mapping_entry() to lock dax entry, and add
size parameter to handle a range of failure
- P7: add a cross range calculation between memory_failure range and
founded extent range
- Rebased to v5.15-rc1

This patchset moves owner tracking from dax_assocaite_entry() to pmem
device driver, by introducing an interface ->memory_failure() for struct
pagemap. This interface is called by memory_failure() in mm, and
implemented by pmem device.

Then call holder operations to find the filesystem which the corrupted
data located in, and call filesystem handler to track files or metadata
associated with this page.

Finally we are able to try to fix the corrupted data in filesystem and
do other necessary processing, such as killing processes who are using
the files affected.

The call trace is like this:
memory_failure()
|* fsdax case
|------------
|pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
| dax_holder_notify_failure() =>
| dax_device->holder_ops->notify_failure() =>
| - xfs_dax_notify_failure()
| |* xfs_dax_notify_failure()
| |--------------------------
| | xfs_rmap_query_range()
| | xfs_dax_notify_failure_fn()
| | * corrupted on metadata
| | try to recover data, call xfs_force_shutdown()
| | * corrupted on file data
| | try to recover data, call mf_dax_kill_procs()
|* normal case
|-------------
mf_generic_kill_procs()

The fsdax & reflink support for XFS is not contained in this patchset.

(Rebased on v5.15-rc1)
==

Shiyang Ruan (8):
dax: Use rwsem for dax_{read,write}_lock()
dax: Introduce holder for dax_device
mm: factor helpers for memory_failure_dev_pagemap
pagemap,pmem: Introduce ->memory_failure()
fsdax: Introduce dax_lock_mapping_entry()
mm: Introduce mf_dax_kill_procs() for fsdax case
xfs: Implement ->notify_failure() for XFS
fsdax: add exception for reflinked files

drivers/dax/device.c | 11 +-
drivers/dax/super.c | 121 +++++++++++++++++---
drivers/md/dm-writecache.c | 7 +-
drivers/nvdimm/pmem.c | 11 ++
fs/dax.c | 115 ++++++++++++++-----
fs/xfs/xfs_fsops.c | 3 +
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 188 +++++++++++++++++++++++++++++++
include/linux/dax.h | 80 ++++++++++++-
include/linux/memremap.h | 9 ++
include/linux/mm.h | 2 +
mm/memory-failure.c | 225 ++++++++++++++++++++++++++-----------
12 files changed, 643 insertions(+), 130 deletions(-)

--
2.33.0




2021-09-24 18:51:20

by Shiyang Ruan

[permalink] [raw]
Subject: [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure()

When memory-failure occurs, we call this function which is implemented
by each kind of devices. For the fsdax case, pmem device driver
implements it. Pmem device driver will find out the filesystem in which
the corrupted page located in.

With dax_holder notify support, we are able to notify the memory failure
from pmem driver to upper layers. If there is something not support in
the notify routine, memory_failure will fall back to the generic hanlder.

Signed-off-by: Shiyang Ruan <[email protected]>
---
drivers/nvdimm/pmem.c | 11 +++++++++++
include/linux/memremap.h | 9 +++++++++
mm/memory-failure.c | 14 ++++++++++++++
3 files changed, 34 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 72de88ff0d30..0dfafad8fcc5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -362,9 +362,20 @@ static void pmem_release_disk(void *__pmem)
del_gendisk(pmem->disk);
}

+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+ unsigned long pfn, size_t size, int flags)
+{
+ struct pmem_device *pmem =
+ container_of(pgmap, struct pmem_device, pgmap);
+ loff_t offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
+
+ return dax_holder_notify_failure(pmem->dax_dev, offset, size, flags);
+}
+
static const struct dev_pagemap_ops fsdax_pagemap_ops = {
.kill = pmem_pagemap_kill,
.cleanup = pmem_pagemap_cleanup,
+ .memory_failure = pmem_pagemap_memory_failure,
};

static int pmem_attach_disk(struct device *dev,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c0e9d35889e8..36d47bacd46d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -87,6 +87,15 @@ struct dev_pagemap_ops {
* the page back to a CPU accessible page.
*/
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+ /*
+ * Handle the memory failure happens on a range of pfns. Notify the
+ * processes who are using these pfns, and try to recover the data on
+ * them if necessary. The flag is finally passed to the recover
+ * function through the whole notify routine.
+ */
+ int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+ size_t size, int flags);
};

#define PGMAP_ALTMAP_VALID (1 << 0)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8ff9b52823c0..85eab206b68f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1605,6 +1605,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
if (!pgmap_pfn_valid(pgmap, pfn))
goto out;

+ /*
+ * Call driver's implementation to handle the memory failure, otherwise
+ * fall back to generic handler.
+ */
+ if (pgmap->ops->memory_failure) {
+ rc = pgmap->ops->memory_failure(pgmap, pfn, PAGE_SIZE, flags);
+ /*
+ * Fall back to generic handler too if operation is not
+ * supported inside the driver/device/filesystem.
+ */
+ if (rc != EOPNOTSUPP)
+ goto out;
+ }
+
rc = mf_generic_kill_procs(pfn, flags, pgmap);
out:
/* drop pgmap ref acquired in caller */
--
2.33.0



2021-09-24 18:52:36

by Shiyang Ruan

[permalink] [raw]
Subject: [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS

This function is used to handle errors which may cause data lost in
filesystem. Such as memory failure in fsdax mode.

If the rmap feature of XFS enabled, we can query it to find files and
metadata which are associated with the corrupt data. For now all we do
is kill processes with that file mapped into their address spaces, but
future patches could actually do something about corrupt metadata.

After that, the memory failure needs to notify the processes who are
using those files.

Signed-off-by: Shiyang Ruan <[email protected]>
---
drivers/dax/super.c | 19 +++++
fs/xfs/xfs_fsops.c | 3 +
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 188 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 18 +++++
5 files changed, 229 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7d4a11dcba90..22091e7fb0ef 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -135,6 +135,25 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
}
EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);

+void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
+ const struct dax_holder_operations *ops)
+{
+ dax_set_holder(dax_dev, holder, ops);
+}
+EXPORT_SYMBOL_GPL(fs_dax_register_holder);
+
+void fs_dax_unregister_holder(struct dax_device *dax_dev)
+{
+ dax_set_holder(dax_dev, NULL, NULL);
+}
+EXPORT_SYMBOL_GPL(fs_dax_unregister_holder);
+
+void *fs_dax_get_holder(struct dax_device *dax_dev)
+{
+ return dax_get_holder(dax_dev);
+}
+EXPORT_SYMBOL_GPL(fs_dax_get_holder);
+
bool generic_fsdax_supported(struct dax_device *dax_dev,
struct block_device *bdev, int blocksize, sector_t start,
sector_t sectors)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 33e26690a8c4..4c2d3d4ca5a5 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -542,6 +542,9 @@ xfs_do_force_shutdown(
} else if (flags & SHUTDOWN_CORRUPT_INCORE) {
tag = XFS_PTAG_SHUTDOWN_CORRUPT;
why = "Corruption of in-memory data";
+ } else if (flags & SHUTDOWN_CORRUPT_META) {
+ tag = XFS_PTAG_SHUTDOWN_CORRUPT;
+ why = "Corruption of on-disk metadata";
} else {
tag = XFS_PTAG_SHUTDOWN_IOERROR;
why = "Metadata I/O Error";
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e091f3b3fa15..d0f6da23e3df 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -434,6 +434,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
#define SHUTDOWN_LOG_IO_ERROR 0x0002 /* write attempt to the log failed */
#define SHUTDOWN_FORCE_UMOUNT 0x0004 /* shutdown from a forced unmount */
#define SHUTDOWN_CORRUPT_INCORE 0x0008 /* corrupt in-memory data structures */
+#define SHUTDOWN_CORRUPT_META 0x0010 /* corrupt metadata on device */

#define XFS_SHUTDOWN_STRINGS \
{ SHUTDOWN_META_IO_ERROR, "metadata_io" }, \
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c4e0cd1c1c8c..46fdf44b5ec2 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -37,11 +37,19 @@
#include "xfs_reflink.h"
#include "xfs_pwork.h"
#include "xfs_ag.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bit.h"

#include <linux/magic.h>
#include <linux/fs_context.h>
#include <linux/fs_parser.h>
+#include <linux/mm.h>
+#include <linux/dax.h>

+static const struct dax_holder_operations xfs_dax_holder_operations;
static const struct super_operations xfs_super_operations;

static struct kset *xfs_kset; /* top-level xfs sysfs dir */
@@ -377,6 +385,8 @@ xfs_close_devices(

xfs_free_buftarg(mp->m_logdev_targp);
xfs_blkdev_put(logdev);
+ if (dax_logdev)
+ fs_dax_unregister_holder(dax_logdev);
fs_put_dax(dax_logdev);
}
if (mp->m_rtdev_targp) {
@@ -385,9 +395,13 @@ xfs_close_devices(

xfs_free_buftarg(mp->m_rtdev_targp);
xfs_blkdev_put(rtdev);
+ if (dax_rtdev)
+ fs_dax_unregister_holder(dax_rtdev);
fs_put_dax(dax_rtdev);
}
xfs_free_buftarg(mp->m_ddev_targp);
+ if (dax_ddev)
+ fs_dax_unregister_holder(dax_ddev);
fs_put_dax(dax_ddev);
}

@@ -411,6 +425,9 @@ xfs_open_devices(
struct block_device *logdev = NULL, *rtdev = NULL;
int error;

+ if (dax_ddev)
+ fs_dax_register_holder(dax_ddev, mp,
+ &xfs_dax_holder_operations);
/*
* Open real time and log devices - order is important.
*/
@@ -419,6 +436,9 @@ xfs_open_devices(
if (error)
goto out;
dax_logdev = fs_dax_get_by_bdev(logdev);
+ if (dax_logdev != dax_ddev && dax_logdev)
+ fs_dax_register_holder(dax_logdev, mp,
+ &xfs_dax_holder_operations);
}

if (mp->m_rtname) {
@@ -433,6 +453,9 @@ xfs_open_devices(
goto out_close_rtdev;
}
dax_rtdev = fs_dax_get_by_bdev(rtdev);
+ if (dax_rtdev)
+ fs_dax_register_holder(dax_rtdev, mp,
+ &xfs_dax_holder_operations);
}

/*
@@ -1110,6 +1133,171 @@ xfs_fs_free_cached_objects(
return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
}

+struct notify_failure_info {
+ xfs_agblock_t startblock;
+ xfs_filblks_t blockcount;
+ int flags;
+};
+
+static loff_t
+xfs_notify_failure_start(
+ struct xfs_mount *mp,
+ const struct xfs_rmap_irec *rec,
+ const struct notify_failure_info *notify)
+{
+ loff_t start = rec->rm_offset;
+
+ if (notify->startblock > rec->rm_startblock)
+ start += XFS_FSB_TO_B(mp,
+ notify->startblock - rec->rm_startblock);
+ return start;
+}
+
+static size_t
+xfs_notify_failure_size(
+ struct xfs_mount *mp,
+ const struct xfs_rmap_irec *rec,
+ const struct notify_failure_info *notify)
+{
+ xfs_agblock_t rec_start = rec->rm_startblock;
+ xfs_agblock_t rec_end = rec->rm_startblock + rec->rm_blockcount;
+ xfs_agblock_t notify_start = notify->startblock;
+ xfs_agblock_t notify_end = notify->startblock + notify->blockcount;
+ xfs_agblock_t cross_start = max(rec_start, notify_start);
+ xfs_agblock_t cross_end = min(rec_end, notify_end);
+
+ return XFS_FSB_TO_B(mp, cross_end - cross_start);
+}
+
+static int
+xfs_dax_notify_failure_fn(
+ struct xfs_btree_cur *cur,
+ const struct xfs_rmap_irec *rec,
+ void *data)
+{
+ struct xfs_mount *mp = cur->bc_mp;
+ struct xfs_inode *ip;
+ struct address_space *mapping;
+ struct notify_failure_info *notify = data;
+ int error = 0;
+
+ if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
+ (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+ // TODO check and try to fix metadata
+ xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
+ return -EFSCORRUPTED;
+ }
+
+ /* Get files that incore, filter out others that are not in use. */
+ error = xfs_iget(mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
+ 0, &ip);
+ if (error)
+ return error;
+
+ mapping = VFS_I(ip)->i_mapping;
+ if (IS_ENABLED(CONFIG_MEMORY_FAILURE)) {
+ loff_t offset = xfs_notify_failure_start(mp, rec, notify);
+ size_t size = xfs_notify_failure_size(mp, rec, notify);
+
+ error = mf_dax_kill_procs(mapping, offset >> PAGE_SHIFT, size,
+ notify->flags);
+ }
+ // TODO try to fix data
+ xfs_irele(ip);
+
+ return error;
+}
+
+static loff_t
+xfs_dax_bdev_offset(
+ struct xfs_mount *mp,
+ struct dax_device *dax_dev,
+ loff_t disk_offset)
+{
+ struct block_device *bdev;
+
+ if (mp->m_ddev_targp->bt_daxdev == dax_dev)
+ bdev = mp->m_ddev_targp->bt_bdev;
+ else if (mp->m_logdev_targp->bt_daxdev == dax_dev)
+ bdev = mp->m_logdev_targp->bt_bdev;
+ else
+ bdev = mp->m_rtdev_targp->bt_bdev;
+
+ return disk_offset - (get_start_sect(bdev) << SECTOR_SHIFT);
+}
+
+static int
+xfs_dax_notify_failure(
+ struct dax_device *dax_dev,
+ loff_t offset,
+ size_t len,
+ int flags)
+{
+ struct xfs_mount *mp = fs_dax_get_holder(dax_dev);
+ struct xfs_trans *tp = NULL;
+ struct xfs_btree_cur *cur = NULL;
+ struct xfs_buf *agf_bp = NULL;
+ struct xfs_rmap_irec rmap_low, rmap_high;
+ loff_t bdev_offset = xfs_dax_bdev_offset(mp, dax_dev,
+ offset);
+ xfs_fsblock_t fsbno = XFS_B_TO_FSB(mp, bdev_offset);
+ xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
+ int error = 0;
+ struct notify_failure_info notify = {
+ .startblock = XFS_FSB_TO_AGBNO(mp, fsbno),
+ .blockcount = XFS_B_TO_FSB(mp, len),
+ .flags = flags,
+ };
+
+ if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
+ xfs_warn(mp,
+ "notify_failure() not supported on realtime device!");
+ return -EOPNOTSUPP;
+ }
+
+ if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
+ mp->m_logdev_targp != mp->m_ddev_targp) {
+ xfs_err(mp, "ondisk log corrupt, shutting down fs!");
+ xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
+ return -EFSCORRUPTED;
+ }
+
+ if (!xfs_has_rmapbt(mp)) {
+ xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
+ return -EOPNOTSUPP;
+ }
+
+ error = xfs_trans_alloc_empty(mp, &tp);
+ if (error)
+ return error;
+
+ error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+ if (error)
+ goto out_cancel_tp;
+
+ cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agf_bp->b_pag);
+
+ /* Construct a range for rmap query */
+ memset(&rmap_low, 0, sizeof(rmap_low));
+ memset(&rmap_high, 0xFF, sizeof(rmap_high));
+ rmap_low.rm_startblock = rmap_high.rm_startblock = notify.startblock;
+ rmap_low.rm_blockcount = rmap_high.rm_blockcount = notify.blockcount;
+
+ error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
+ xfs_dax_notify_failure_fn, &notify);
+
+ xfs_btree_del_cursor(cur, error);
+ xfs_trans_brelse(tp, agf_bp);
+
+out_cancel_tp:
+ xfs_trans_cancel(tp);
+ return error;
+}
+
+static const struct dax_holder_operations xfs_dax_holder_operations = {
+ .notify_failure = xfs_dax_notify_failure,
+};
+
static const struct super_operations xfs_super_operations = {
.alloc_inode = xfs_fs_alloc_inode,
.destroy_inode = xfs_fs_destroy_inode,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3d90becbd160..0f1fa7a4a616 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -149,6 +149,10 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
}

struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
+ const struct dax_holder_operations *ops);
+void fs_dax_unregister_holder(struct dax_device *dax_dev);
+void *fs_dax_get_holder(struct dax_device *dax_dev);
int dax_writeback_mapping_range(struct address_space *mapping,
struct dax_device *dax_dev, struct writeback_control *wbc);

@@ -179,6 +183,20 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
return NULL;
}

+static inline void fs_dax_register_holder(struct dax_device *dax_dev,
+ void *holder, const struct dax_holder_operations *ops)
+{
+}
+
+static inline void fs_dax_unregister_holder(struct dax_device *dax_dev)
+{
+}
+
+static inline void *fs_dax_get_holder(struct dax_device *dax_dev)
+{
+ return NULL;
+}
+
static inline struct page *dax_layout_busy_page(struct address_space *mapping)
{
return NULL;
--
2.33.0



2021-09-24 20:23:35

by Shiyang Ruan

[permalink] [raw]
Subject: [PATCH v7 2/8] dax: Introduce holder for dax_device

To easily track filesystem from a pmem device, we introduce a holder for
dax_device structure, and also its operation. This holder is used to
remember who is using this dax_device:
- When it is the backend of a filesystem, the holder will be the
superblock of this filesystem.
- When this pmem device is one of the targets in a mapped device, the
holder will be this mapped device. In this case, the mapped device
has its own dax_device and it will follow the first rule. So that we
can finally track to the filesystem we needed.

The holder and holder_ops will be set when filesystem is being mounted,
or an target device is being activated.

Signed-off-by: Shiyang Ruan <[email protected]>
---
drivers/dax/super.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 29 ++++++++++++++++++++++
2 files changed, 88 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 48ce86501d93..7d4a11dcba90 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -23,7 +23,10 @@
* @cdev: optional character interface for "device dax"
* @host: optional name for lookups where the device path is not available
* @private: dax driver private data
+ * @holder_data: holder of a dax_device: could be filesystem or mapped device
* @flags: state and boolean properties
+ * @ops: operations for dax_device
+ * @holder_ops: operations for the inner holder
*/
struct dax_device {
struct hlist_node list;
@@ -31,8 +34,10 @@ struct dax_device {
struct cdev cdev;
const char *host;
void *private;
+ void *holder_data;
unsigned long flags;
const struct dax_operations *ops;
+ const struct dax_holder_operations *holder_ops;
};

static dev_t dax_devt;
@@ -374,6 +379,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
}
EXPORT_SYMBOL_GPL(dax_zero_page_range);

+int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
+ size_t size, int flags)
+{
+ int rc;
+
+ dax_read_lock();
+ if (!dax_alive(dax_dev)) {
+ rc = -ENXIO;
+ goto out;
+ }
+
+ if (!dax_dev->holder_data) {
+ rc = -EOPNOTSUPP;
+ goto out;
+ }
+
+ rc = dax_dev->holder_ops->notify_failure(dax_dev, offset, size, flags);
+out:
+ dax_read_unlock();
+ return rc;
+}
+EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
+
#ifdef CONFIG_ARCH_HAS_PMEM_API
void arch_wb_cache_pmem(void *addr, size_t size);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
@@ -618,6 +646,37 @@ void put_dax(struct dax_device *dax_dev)
}
EXPORT_SYMBOL_GPL(put_dax);

+void dax_set_holder(struct dax_device *dax_dev, void *holder,
+ const struct dax_holder_operations *ops)
+{
+ dax_write_lock();
+ if (!dax_alive(dax_dev)) {
+ dax_write_unlock();
+ return;
+ }
+
+ dax_dev->holder_data = holder;
+ dax_dev->holder_ops = ops;
+ dax_write_unlock();
+}
+EXPORT_SYMBOL_GPL(dax_set_holder);
+
+void *dax_get_holder(struct dax_device *dax_dev)
+{
+ void *holder;
+
+ dax_read_lock();
+ if (!dax_alive(dax_dev)) {
+ dax_read_unlock();
+ return NULL;
+ }
+
+ holder = dax_dev->holder_data;
+ dax_read_unlock();
+ return holder;
+}
+EXPORT_SYMBOL_GPL(dax_get_holder);
+
/**
* inode_dax: convert a public inode into its dax_dev
* @inode: An inode with i_cdev pointing to a dax_dev
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 097b3304f9b9..d273d59723cd 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,9 +38,24 @@ struct dax_operations {
int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
};

+struct dax_holder_operations {
+ /*
+ * notify_failure - notify memory failure into inner holder device
+ * @dax_dev: the dax device which contains the holder
+ * @offset: offset on this dax device where memory failure occurs
+ * @size: length of this memory failure event
+ * @flags: action flags for memory failure handler
+ */
+ int (*notify_failure)(struct dax_device *dax_dev, loff_t offset,
+ size_t size, int flags);
+};
+
extern struct attribute_group dax_attribute_group;

#if IS_ENABLED(CONFIG_DAX)
+void dax_set_holder(struct dax_device *dax_dev, void *holder,
+ const struct dax_holder_operations *ops);
+void *dax_get_holder(struct dax_device *dax_dev);
struct dax_device *alloc_dax(void *private, const char *host,
const struct dax_operations *ops, unsigned long flags);
void put_dax(struct dax_device *dax_dev);
@@ -70,6 +85,18 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
return dax_synchronous(dax_dev);
}
#else
+static inline struct dax_device *dax_get_by_host(const char *host)
+{
+ return NULL;
+}
+static inline void dax_set_holder(struct dax_device *dax_dev, void *holder,
+ const struct dax_holder_operations *ops)
+{
+}
+static inline void *dax_get_holder(struct dax_device *dax_dev)
+{
+ return NULL;
+}
static inline struct dax_device *alloc_dax(void *private, const char *host,
const struct dax_operations *ops, unsigned long flags)
{
@@ -198,6 +225,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i);
int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
size_t nr_pages);
+int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
+ size_t size, int flags);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);

ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
--
2.33.0



2021-09-24 20:38:21

by Shiyang Ruan

[permalink] [raw]
Subject: [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap

memory_failure_dev_pagemap code is a bit complex before introduce RMAP
feature for fsdax. So it is needed to factor some helper functions to
simplify these code.

Signed-off-by: Shiyang Ruan <[email protected]>
---
mm/memory-failure.c | 140 ++++++++++++++++++++++++--------------------
1 file changed, 76 insertions(+), 64 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 54879c339024..8ff9b52823c0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1430,6 +1430,79 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
return 0;
}

+static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
+ struct address_space *mapping, pgoff_t index, int flags)
+{
+ struct to_kill *tk;
+ unsigned long size = 0;
+
+ list_for_each_entry(tk, to_kill, nd)
+ if (tk->size_shift)
+ size = max(size, 1UL << tk->size_shift);
+ if (size) {
+ /*
+ * Unmap the largest mapping to avoid breaking up device-dax
+ * mappings which are constant size. The actual size of the
+ * mapping being torn down is communicated in siginfo, see
+ * kill_proc()
+ */
+ loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
+
+ unmap_mapping_range(mapping, start, size, 0);
+ }
+
+ kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
+}
+
+static int mf_generic_kill_procs(unsigned long long pfn, int flags,
+ struct dev_pagemap *pgmap)
+{
+ struct page *page = pfn_to_page(pfn);
+ LIST_HEAD(to_kill);
+ dax_entry_t cookie;
+
+ /*
+ * Prevent the inode from being freed while we are interrogating
+ * the address_space, typically this would be handled by
+ * lock_page(), but dax pages do not use the page lock. This
+ * also prevents changes to the mapping of this pfn until
+ * poison signaling is complete.
+ */
+ cookie = dax_lock_page(page);
+ if (!cookie)
+ return -EBUSY;
+
+ if (hwpoison_filter(page))
+ return 0;
+
+ if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+ /*
+ * TODO: Handle HMM pages which may need coordination
+ * with device-side memory.
+ */
+ return -EBUSY;
+ }
+
+ /*
+ * Use this flag as an indication that the dax page has been
+ * remapped UC to prevent speculative consumption of poison.
+ */
+ SetPageHWPoison(page);
+
+ /*
+ * Unlike System-RAM there is no possibility to swap in a
+ * different physical page at a given virtual address, so all
+ * userspace consumption of ZONE_DEVICE memory necessitates
+ * SIGBUS (i.e. MF_MUST_KILL)
+ */
+ flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+ collect_procs(page, &to_kill, true);
+
+ unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
+ dax_unlock_page(page, cookie);
+ return 0;
+}
+
static int memory_failure_hugetlb(unsigned long pfn, int flags)
{
struct page *p = pfn_to_page(pfn);
@@ -1519,12 +1592,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
struct dev_pagemap *pgmap)
{
struct page *page = pfn_to_page(pfn);
- unsigned long size = 0;
- struct to_kill *tk;
LIST_HEAD(tokill);
- int rc = -EBUSY;
- loff_t start;
- dax_entry_t cookie;
+ int rc = -ENXIO;

if (flags & MF_COUNT_INCREASED)
/*
@@ -1533,67 +1602,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
put_page(page);

/* device metadata space is not recoverable */
- if (!pgmap_pfn_valid(pgmap, pfn)) {
- rc = -ENXIO;
- goto out;
- }
-
- /*
- * Prevent the inode from being freed while we are interrogating
- * the address_space, typically this would be handled by
- * lock_page(), but dax pages do not use the page lock. This
- * also prevents changes to the mapping of this pfn until
- * poison signaling is complete.
- */
- cookie = dax_lock_page(page);
- if (!cookie)
+ if (!pgmap_pfn_valid(pgmap, pfn))
goto out;

- if (hwpoison_filter(page)) {
- rc = 0;
- goto unlock;
- }
-
- if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
- /*
- * TODO: Handle HMM pages which may need coordination
- * with device-side memory.
- */
- goto unlock;
- }
-
- /*
- * Use this flag as an indication that the dax page has been
- * remapped UC to prevent speculative consumption of poison.
- */
- SetPageHWPoison(page);
-
- /*
- * Unlike System-RAM there is no possibility to swap in a
- * different physical page at a given virtual address, so all
- * userspace consumption of ZONE_DEVICE memory necessitates
- * SIGBUS (i.e. MF_MUST_KILL)
- */
- flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
- collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
-
- list_for_each_entry(tk, &tokill, nd)
- if (tk->size_shift)
- size = max(size, 1UL << tk->size_shift);
- if (size) {
- /*
- * Unmap the largest mapping to avoid breaking up
- * device-dax mappings which are constant size. The
- * actual size of the mapping being torn down is
- * communicated in siginfo, see kill_proc()
- */
- start = (page->index << PAGE_SHIFT) & ~(size - 1);
- unmap_mapping_range(page->mapping, start, size, 0);
- }
- kill_procs(&tokill, flags & MF_MUST_KILL, false, pfn, flags);
- rc = 0;
-unlock:
- dax_unlock_page(page, cookie);
+ rc = mf_generic_kill_procs(pfn, flags, pgmap);
out:
/* drop pgmap ref acquired in caller */
put_dev_pagemap(pgmap);
--
2.33.0



2021-09-24 20:40:37

by Shiyang Ruan

[permalink] [raw]
Subject: [PATCH v7 8/8] fsdax: add exception for reflinked files

For reflinked files, one dax page may be associated more than once with
different fime mapping and index. It will report warning. Now, since
we have introduced dax-RMAP for this case and also have to keep its
functionality for other filesystems who are not support rmap, I add this
exception here.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2536c105ec7f..1a57211b1bc9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -352,9 +352,10 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- WARN_ON_ONCE(page->mapping);
- page->mapping = mapping;
- page->index = index + i++;
+ if (!page->mapping) {
+ page->mapping = mapping;
+ page->index = index + i++;
+ }
}
}

@@ -370,9 +371,10 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
struct page *page = pfn_to_page(pfn);

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
- WARN_ON_ONCE(page->mapping && page->mapping != mapping);
- page->mapping = NULL;
- page->index = 0;
+ if (page->mapping == mapping) {
+ page->mapping = NULL;
+ page->index = 0;
+ }
}
}

--
2.33.0



2021-09-24 20:40:37

by Shiyang Ruan

[permalink] [raw]
Subject: [PATCH v7 5/8] fsdax: Introduce dax_lock_mapping_entry()

The current dax_lock_page() locks dax entry by obtaining mapping and
index in page. To support 1-to-N RMAP in NVDIMM, we need a new function
to lock a specific dax entry corresponding to this file's mapping,index.
And BTW, output the page corresponding to the specific dax entry for
caller use.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 65 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/dax.h | 15 +++++++++++
2 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 798c43f09eee..509b65e60478 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -390,7 +390,7 @@ static struct page *dax_busy_page(void *entry)
}

/*
- * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
+ * dax_lock_page - Lock the DAX entry corresponding to a page
* @page: The page whose entry we want to lock
*
* Context: Process context.
@@ -455,6 +455,69 @@ void dax_unlock_page(struct page *page, dax_entry_t cookie)
dax_unlock_entry(&xas, (void *)cookie);
}

+/*
+ * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
+ * @mapping: the file's mapping whose entry we want to lock
+ * @index: the offset within this file
+ * @page: output the dax page corresponding to this dax entry
+ *
+ * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
+ * could not be locked.
+ */
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
+ struct page **page)
+{
+ XA_STATE(xas, NULL, 0);
+ void *entry;
+
+ rcu_read_lock();
+ for (;;) {
+ entry = NULL;
+ if (!dax_mapping(mapping))
+ break;
+
+ xas.xa = &mapping->i_pages;
+ xas_lock_irq(&xas);
+ xas_set(&xas, index);
+ entry = xas_load(&xas);
+ if (dax_is_locked(entry)) {
+ rcu_read_unlock();
+ wait_entry_unlocked(&xas, entry);
+ rcu_read_lock();
+ continue;
+ }
+ if (!entry ||
+ dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+ /*
+ * Because we are looking for entry from file's mapping
+ * and index, so the entry may not be inserted for now,
+ * or even a zero/empty entry. We don't think this is
+ * an error case. So, return a special value and do
+ * not output @page.
+ */
+ entry = (void *)~0UL;
+ } else {
+ *page = pfn_to_page(dax_to_pfn(entry));
+ dax_lock_entry(&xas, entry);
+ }
+ xas_unlock_irq(&xas);
+ break;
+ }
+ rcu_read_unlock();
+ return (dax_entry_t)entry;
+}
+
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
+ dax_entry_t cookie)
+{
+ XA_STATE(xas, &mapping->i_pages, index);
+
+ if (cookie == ~0UL)
+ return;
+
+ dax_unlock_entry(&xas, (void *)cookie);
+}
+
/*
* Find page cache entry at given index. If it is a DAX entry, return it
* with the entry locked. If the page cache doesn't contain an entry at
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d273d59723cd..65411bee4312 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -156,6 +156,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
dax_entry_t dax_lock_page(struct page *page);
void dax_unlock_page(struct page *page, dax_entry_t cookie);
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+ unsigned long index, struct page **page);
+void dax_unlock_mapping_entry(struct address_space *mapping,
+ unsigned long index, dax_entry_t cookie);
#else
#define generic_fsdax_supported NULL

@@ -201,6 +205,17 @@ static inline dax_entry_t dax_lock_page(struct page *page)
static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
{
}
+
+static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+ unsigned long index, struct page **page)
+{
+ return 0;
+}
+
+static inline void dax_unlock_mapping_entry(struct address_space *mapping,
+ unsigned long index, dax_entry_t cookie)
+{
+}
#endif

#if IS_ENABLED(CONFIG_DAX)
--
2.33.0



2021-09-24 20:40:37

by Shiyang Ruan

[permalink] [raw]
Subject: [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case

This function is called at the end of RMAP routine, i.e. filesystem
recovery function, to collect and kill processes using a shared page of
DAX file. The difference between mf_generic_kill_procs() is,
it accepts file's mapping,offset instead of struct page. Because
different file's mappings and offsets may share the same page in fsdax
mode. So, it is called when filesystem RMAP results are found.

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/dax.c | 10 ------
include/linux/dax.h | 9 +++++
include/linux/mm.h | 2 ++
mm/memory-failure.c | 83 ++++++++++++++++++++++++++++++++++++++++-----
4 files changed, 86 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 509b65e60478..2536c105ec7f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -852,16 +852,6 @@ static void *dax_insert_entry(struct xa_state *xas,
return entry;
}

-static inline
-unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
-{
- unsigned long address;
-
- address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
- VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
- return address;
-}
-
/* Walk all mappings of a given index of a file and writeprotect them */
static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
unsigned long pfn)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 65411bee4312..3d90becbd160 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -258,6 +258,15 @@ static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
}
+static inline unsigned long pgoff_address(pgoff_t pgoff,
+ struct vm_area_struct *vma)
+{
+ unsigned long address;
+
+ address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+ return address;
+}

#ifdef CONFIG_DEV_DAX_HMEM_DEVICES
void hmem_register_device(int target_nid, struct resource *r);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..d06af0051e53 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3114,6 +3114,8 @@ enum mf_flags {
MF_MUST_KILL = 1 << 2,
MF_SOFT_OFFLINE = 1 << 3,
};
+extern int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
+ size_t size, int flags);
extern int memory_failure(unsigned long pfn, int flags);
extern void memory_failure_queue(unsigned long pfn, int flags);
extern void memory_failure_queue_kick(int cpu);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 85eab206b68f..a9d0d487d205 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -302,10 +302,9 @@ void shake_page(struct page *p)
}
EXPORT_SYMBOL_GPL(shake_page);

-static unsigned long dev_pagemap_mapping_shift(struct page *page,
+static unsigned long dev_pagemap_mapping_shift(unsigned long address,
struct vm_area_struct *vma)
{
- unsigned long address = vma_address(page, vma);
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -345,7 +344,7 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
* Schedule a process for later kill.
* Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
*/
-static void add_to_kill(struct task_struct *tsk, struct page *p,
+static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,
struct vm_area_struct *vma,
struct list_head *to_kill)
{
@@ -358,9 +357,15 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
}

tk->addr = page_address_in_vma(p, vma);
- if (is_zone_device_page(p))
- tk->size_shift = dev_pagemap_mapping_shift(p, vma);
- else
+ if (is_zone_device_page(p)) {
+ /*
+ * Since page->mapping is no more used for fsdax, we should
+ * calculate the address in a fsdax way.
+ */
+ if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
+ tk->addr = pgoff_address(pgoff, vma);
+ tk->size_shift = dev_pagemap_mapping_shift(tk->addr, vma);
+ } else
tk->size_shift = page_shift(compound_head(p));

/*
@@ -508,7 +513,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
if (!page_mapped_in_vma(page, vma))
continue;
if (vma->vm_mm == t->mm)
- add_to_kill(t, page, vma, to_kill);
+ add_to_kill(t, page, 0, vma, to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -544,7 +549,32 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
* to be informed of all such data corruptions.
*/
if (vma->vm_mm == t->mm)
- add_to_kill(t, page, vma, to_kill);
+ add_to_kill(t, page, 0, vma, to_kill);
+ }
+ }
+ read_unlock(&tasklist_lock);
+ i_mmap_unlock_read(mapping);
+}
+
+/*
+ * Collect processes when the error hit a fsdax page.
+ */
+static void collect_procs_fsdax(struct page *page, struct address_space *mapping,
+ pgoff_t pgoff, struct list_head *to_kill)
+{
+ struct vm_area_struct *vma;
+ struct task_struct *tsk;
+
+ i_mmap_lock_read(mapping);
+ read_lock(&tasklist_lock);
+ for_each_process(tsk) {
+ struct task_struct *t = task_early_kill(tsk, true);
+
+ if (!t)
+ continue;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ if (vma->vm_mm == t->mm)
+ add_to_kill(t, page, pgoff, vma, to_kill);
}
}
read_unlock(&tasklist_lock);
@@ -1503,6 +1533,43 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
return 0;
}

+/**
+ * mf_dax_kill_procs - Collect and kill processes who are using this file range
+ * @mapping: the file in use
+ * @index: start offset of the range
+ * @size: length of the range
+ * @flags: memory failure flags
+ */
+int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
+ size_t size, int flags)
+{
+ LIST_HEAD(to_kill);
+ dax_entry_t cookie;
+ struct page *page;
+ size_t end = (index << PAGE_SHIFT) + size;
+
+ flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+
+ for (; (index << PAGE_SHIFT) < end; index++) {
+ page = NULL;
+ cookie = dax_lock_mapping_entry(mapping, index, &page);
+ if (!cookie)
+ return -EBUSY;
+ if (!page)
+ goto unlock;
+
+ SetPageHWPoison(page);
+
+ collect_procs_fsdax(page, mapping, index, &to_kill);
+ unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
+ index, flags);
+unlock:
+ dax_unlock_mapping_entry(mapping, index, cookie);
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
+
static int memory_failure_hugetlb(unsigned long pfn, int flags)
{
struct page *p = pfn_to_page(pfn);
--
2.33.0



2021-10-14 19:20:15

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap

On Fri, Sep 24, 2021 at 09:09:54PM +0800, Shiyang Ruan wrote:
> memory_failure_dev_pagemap code is a bit complex before introduce RMAP
> feature for fsdax. So it is needed to factor some helper functions to
> simplify these code.
>
> Signed-off-by: Shiyang Ruan <[email protected]>

This looks like a reasonable hoist...
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> ---
> mm/memory-failure.c | 140 ++++++++++++++++++++++++--------------------
> 1 file changed, 76 insertions(+), 64 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 54879c339024..8ff9b52823c0 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1430,6 +1430,79 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
> return 0;
> }
>
> +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
> + struct address_space *mapping, pgoff_t index, int flags)
> +{
> + struct to_kill *tk;
> + unsigned long size = 0;
> +
> + list_for_each_entry(tk, to_kill, nd)
> + if (tk->size_shift)
> + size = max(size, 1UL << tk->size_shift);
> + if (size) {
> + /*
> + * Unmap the largest mapping to avoid breaking up device-dax
> + * mappings which are constant size. The actual size of the
> + * mapping being torn down is communicated in siginfo, see
> + * kill_proc()
> + */
> + loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
> +
> + unmap_mapping_range(mapping, start, size, 0);
> + }
> +
> + kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
> +}
> +
> +static int mf_generic_kill_procs(unsigned long long pfn, int flags,
> + struct dev_pagemap *pgmap)
> +{
> + struct page *page = pfn_to_page(pfn);
> + LIST_HEAD(to_kill);
> + dax_entry_t cookie;
> +
> + /*
> + * Prevent the inode from being freed while we are interrogating
> + * the address_space, typically this would be handled by
> + * lock_page(), but dax pages do not use the page lock. This
> + * also prevents changes to the mapping of this pfn until
> + * poison signaling is complete.
> + */
> + cookie = dax_lock_page(page);
> + if (!cookie)
> + return -EBUSY;
> +
> + if (hwpoison_filter(page))
> + return 0;
> +
> + if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> + /*
> + * TODO: Handle HMM pages which may need coordination
> + * with device-side memory.
> + */
> + return -EBUSY;
> + }
> +
> + /*
> + * Use this flag as an indication that the dax page has been
> + * remapped UC to prevent speculative consumption of poison.
> + */
> + SetPageHWPoison(page);
> +
> + /*
> + * Unlike System-RAM there is no possibility to swap in a
> + * different physical page at a given virtual address, so all
> + * userspace consumption of ZONE_DEVICE memory necessitates
> + * SIGBUS (i.e. MF_MUST_KILL)
> + */
> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> + collect_procs(page, &to_kill, true);
> +
> + unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
> + dax_unlock_page(page, cookie);
> + return 0;
> +}
> +
> static int memory_failure_hugetlb(unsigned long pfn, int flags)
> {
> struct page *p = pfn_to_page(pfn);
> @@ -1519,12 +1592,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> struct dev_pagemap *pgmap)
> {
> struct page *page = pfn_to_page(pfn);
> - unsigned long size = 0;
> - struct to_kill *tk;
> LIST_HEAD(tokill);
> - int rc = -EBUSY;
> - loff_t start;
> - dax_entry_t cookie;
> + int rc = -ENXIO;
>
> if (flags & MF_COUNT_INCREASED)
> /*
> @@ -1533,67 +1602,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> put_page(page);
>
> /* device metadata space is not recoverable */
> - if (!pgmap_pfn_valid(pgmap, pfn)) {
> - rc = -ENXIO;
> - goto out;
> - }
> -
> - /*
> - * Prevent the inode from being freed while we are interrogating
> - * the address_space, typically this would be handled by
> - * lock_page(), but dax pages do not use the page lock. This
> - * also prevents changes to the mapping of this pfn until
> - * poison signaling is complete.
> - */
> - cookie = dax_lock_page(page);
> - if (!cookie)
> + if (!pgmap_pfn_valid(pgmap, pfn))
> goto out;
>
> - if (hwpoison_filter(page)) {
> - rc = 0;
> - goto unlock;
> - }
> -
> - if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> - /*
> - * TODO: Handle HMM pages which may need coordination
> - * with device-side memory.
> - */
> - goto unlock;
> - }
> -
> - /*
> - * Use this flag as an indication that the dax page has been
> - * remapped UC to prevent speculative consumption of poison.
> - */
> - SetPageHWPoison(page);
> -
> - /*
> - * Unlike System-RAM there is no possibility to swap in a
> - * different physical page at a given virtual address, so all
> - * userspace consumption of ZONE_DEVICE memory necessitates
> - * SIGBUS (i.e. MF_MUST_KILL)
> - */
> - flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> - collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
> -
> - list_for_each_entry(tk, &tokill, nd)
> - if (tk->size_shift)
> - size = max(size, 1UL << tk->size_shift);
> - if (size) {
> - /*
> - * Unmap the largest mapping to avoid breaking up
> - * device-dax mappings which are constant size. The
> - * actual size of the mapping being torn down is
> - * communicated in siginfo, see kill_proc()
> - */
> - start = (page->index << PAGE_SHIFT) & ~(size - 1);
> - unmap_mapping_range(page->mapping, start, size, 0);
> - }
> - kill_procs(&tokill, flags & MF_MUST_KILL, false, pfn, flags);
> - rc = 0;
> -unlock:
> - dax_unlock_page(page, cookie);
> + rc = mf_generic_kill_procs(pfn, flags, pgmap);
> out:
> /* drop pgmap ref acquired in caller */
> put_dev_pagemap(pgmap);
> --
> 2.33.0
>
>
>

2021-10-14 19:20:33

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] dax: Introduce holder for dax_device

On Fri, Sep 24, 2021 at 09:09:53PM +0800, Shiyang Ruan wrote:
> To easily track filesystem from a pmem device, we introduce a holder for
> dax_device structure, and also its operation. This holder is used to
> remember who is using this dax_device:
> - When it is the backend of a filesystem, the holder will be the
> superblock of this filesystem.
> - When this pmem device is one of the targets in a mapped device, the
> holder will be this mapped device. In this case, the mapped device
> has its own dax_device and it will follow the first rule. So that we
> can finally track to the filesystem we needed.
>
> The holder and holder_ops will be set when filesystem is being mounted,
> or an target device is being activated.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> drivers/dax/super.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
> include/linux/dax.h | 29 ++++++++++++++++++++++
> 2 files changed, 88 insertions(+)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 48ce86501d93..7d4a11dcba90 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -23,7 +23,10 @@
> * @cdev: optional character interface for "device dax"
> * @host: optional name for lookups where the device path is not available
> * @private: dax driver private data
> + * @holder_data: holder of a dax_device: could be filesystem or mapped device
> * @flags: state and boolean properties
> + * @ops: operations for dax_device
> + * @holder_ops: operations for the inner holder
> */
> struct dax_device {
> struct hlist_node list;
> @@ -31,8 +34,10 @@ struct dax_device {
> struct cdev cdev;
> const char *host;
> void *private;
> + void *holder_data;
> unsigned long flags;
> const struct dax_operations *ops;
> + const struct dax_holder_operations *holder_ops;
> };
>
> static dev_t dax_devt;
> @@ -374,6 +379,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> }
> EXPORT_SYMBOL_GPL(dax_zero_page_range);
>
> +int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
> + size_t size, int flags)
> +{
> + int rc;
> +
> + dax_read_lock();
> + if (!dax_alive(dax_dev)) {
> + rc = -ENXIO;
> + goto out;
> + }
> +
> + if (!dax_dev->holder_data) {
> + rc = -EOPNOTSUPP;
> + goto out;
> + }
> +
> + rc = dax_dev->holder_ops->notify_failure(dax_dev, offset, size, flags);

Shouldn't this check if dax_dev->holder_ops != NULL before dereferencing
it for the function call? Imagine an implementation that wants to
attach a ->notify_failure function to a dax_device, maintains its own
lookup table, and decides that it doesn't need to set holder_data.

(Or, imagine someone who writes a garbage into holder_data and *boom*)

How does the locking work here? If there's a media failure, we'll take
dax_rwsem and call ->notify_failure. If the ->notify_failure function
wants to access the pmem to handle the error by calling back into the
dax code, will that cause nested locking on dax_rwsem?

Jumping ahead a bit, I think the rmap btree accesses that the xfs
implementation performs can cause xfs_buf(fer) cache IO, which would
trigger that if the buffers aren't already in memory, if I'm reading
this correctly?

> +out:
> + dax_read_unlock();
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
> +
> #ifdef CONFIG_ARCH_HAS_PMEM_API
> void arch_wb_cache_pmem(void *addr, size_t size);
> void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> @@ -618,6 +646,37 @@ void put_dax(struct dax_device *dax_dev)
> }
> EXPORT_SYMBOL_GPL(put_dax);
>
> +void dax_set_holder(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *ops)
> +{
> + dax_write_lock();
> + if (!dax_alive(dax_dev)) {
> + dax_write_unlock();
> + return;
> + }
> +
> + dax_dev->holder_data = holder;
> + dax_dev->holder_ops = ops;
> + dax_write_unlock();

I guess this means that the holder has to detach itself before anyone
calls kill_dax, or else a dead dax device ends up with a dangling
reference to the holder?

> +}
> +EXPORT_SYMBOL_GPL(dax_set_holder);
> +
> +void *dax_get_holder(struct dax_device *dax_dev)
> +{
> + void *holder;
> +
> + dax_read_lock();
> + if (!dax_alive(dax_dev)) {
> + dax_read_unlock();
> + return NULL;
> + }
> +
> + holder = dax_dev->holder_data;
> + dax_read_unlock();
> + return holder;
> +}
> +EXPORT_SYMBOL_GPL(dax_get_holder);
> +
> /**
> * inode_dax: convert a public inode into its dax_dev
> * @inode: An inode with i_cdev pointing to a dax_dev
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 097b3304f9b9..d273d59723cd 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -38,9 +38,24 @@ struct dax_operations {
> int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
> };
>
> +struct dax_holder_operations {
> + /*
> + * notify_failure - notify memory failure into inner holder device
> + * @dax_dev: the dax device which contains the holder
> + * @offset: offset on this dax device where memory failure occurs
> + * @size: length of this memory failure event
> + * @flags: action flags for memory failure handler
> + */
> + int (*notify_failure)(struct dax_device *dax_dev, loff_t offset,
> + size_t size, int flags);

Shouldn't size be u64 or something? Let's say that 8GB of your pmem go
bad, wouldn't you want a single call? Though I guess the current
implementation only goes a single page at a time, doesn't it?

> +};
> +
> extern struct attribute_group dax_attribute_group;
>
> #if IS_ENABLED(CONFIG_DAX)
> +void dax_set_holder(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *ops);
> +void *dax_get_holder(struct dax_device *dax_dev);
> struct dax_device *alloc_dax(void *private, const char *host,
> const struct dax_operations *ops, unsigned long flags);
> void put_dax(struct dax_device *dax_dev);
> @@ -70,6 +85,18 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
> return dax_synchronous(dax_dev);
> }
> #else
> +static inline struct dax_device *dax_get_by_host(const char *host)

Not sure why this is being added here? AFAICT none of the patches call
this function...?

--D

> +{
> + return NULL;
> +}
> +static inline void dax_set_holder(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *ops)
> +{
> +}
> +static inline void *dax_get_holder(struct dax_device *dax_dev)
> +{
> + return NULL;
> +}
> static inline struct dax_device *alloc_dax(void *private, const char *host,
> const struct dax_operations *ops, unsigned long flags)
> {
> @@ -198,6 +225,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
> size_t bytes, struct iov_iter *i);
> int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
> size_t nr_pages);
> +int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
> + size_t size, int flags);
> void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
>
> ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> --
> 2.33.0
>
>
>

2021-10-14 22:11:08

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure()

On Fri, Sep 24, 2021 at 09:09:55PM +0800, Shiyang Ruan wrote:
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices. For the fsdax case, pmem device driver
> implements it. Pmem device driver will find out the filesystem in which
> the corrupted page located in.
>
> With dax_holder notify support, we are able to notify the memory failure
> from pmem driver to upper layers. If there is something not support in
> the notify routine, memory_failure will fall back to the generic hanlder.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> drivers/nvdimm/pmem.c | 11 +++++++++++
> include/linux/memremap.h | 9 +++++++++
> mm/memory-failure.c | 14 ++++++++++++++
> 3 files changed, 34 insertions(+)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 72de88ff0d30..0dfafad8fcc5 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -362,9 +362,20 @@ static void pmem_release_disk(void *__pmem)
> del_gendisk(pmem->disk);
> }
>
> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
> + unsigned long pfn, size_t size, int flags)
> +{
> + struct pmem_device *pmem =
> + container_of(pgmap, struct pmem_device, pgmap);
> + loff_t offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
> +
> + return dax_holder_notify_failure(pmem->dax_dev, offset, size, flags);
> +}
> +
> static const struct dev_pagemap_ops fsdax_pagemap_ops = {
> .kill = pmem_pagemap_kill,
> .cleanup = pmem_pagemap_cleanup,
> + .memory_failure = pmem_pagemap_memory_failure,
> };
>
> static int pmem_attach_disk(struct device *dev,
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index c0e9d35889e8..36d47bacd46d 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -87,6 +87,15 @@ struct dev_pagemap_ops {
> * the page back to a CPU accessible page.
> */
> vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> +
> + /*
> + * Handle the memory failure happens on a range of pfns. Notify the
> + * processes who are using these pfns, and try to recover the data on
> + * them if necessary. The flag is finally passed to the recover
> + * function through the whole notify routine.
> + */
> + int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> + size_t size, int flags);
> };
>
> #define PGMAP_ALTMAP_VALID (1 << 0)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 8ff9b52823c0..85eab206b68f 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1605,6 +1605,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> if (!pgmap_pfn_valid(pgmap, pfn))
> goto out;
>
> + /*
> + * Call driver's implementation to handle the memory failure, otherwise
> + * fall back to generic handler.
> + */
> + if (pgmap->ops->memory_failure) {
> + rc = pgmap->ops->memory_failure(pgmap, pfn, PAGE_SIZE, flags);
> + /*
> + * Fall back to generic handler too if operation is not
> + * supported inside the driver/device/filesystem.
> + */
> + if (rc != EOPNOTSUPP)

-EOPNOTSUPP? (negative errno)

--D

> + goto out;
> + }
> +
> rc = mf_generic_kill_procs(pfn, flags, pgmap);
> out:
> /* drop pgmap ref acquired in caller */
> --
> 2.33.0
>
>
>

2021-10-14 22:14:07

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] fsdax: Introduce dax_lock_mapping_entry()

On Fri, Sep 24, 2021 at 09:09:56PM +0800, Shiyang Ruan wrote:
> The current dax_lock_page() locks dax entry by obtaining mapping and
> index in page. To support 1-to-N RMAP in NVDIMM, we need a new function
> to lock a specific dax entry corresponding to this file's mapping,index.
> And BTW, output the page corresponding to the specific dax entry for
> caller use.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> fs/dax.c | 65 ++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/dax.h | 15 +++++++++++
> 2 files changed, 79 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 798c43f09eee..509b65e60478 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -390,7 +390,7 @@ static struct page *dax_busy_page(void *entry)
> }
>
> /*
> - * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
> + * dax_lock_page - Lock the DAX entry corresponding to a page
> * @page: The page whose entry we want to lock
> *
> * Context: Process context.
> @@ -455,6 +455,69 @@ void dax_unlock_page(struct page *page, dax_entry_t cookie)
> dax_unlock_entry(&xas, (void *)cookie);
> }
>
> +/*
> + * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
> + * @mapping: the file's mapping whose entry we want to lock
> + * @index: the offset within this file
> + * @page: output the dax page corresponding to this dax entry
> + *
> + * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
> + * could not be locked.
> + */
> +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
> + struct page **page)
> +{
> + XA_STATE(xas, NULL, 0);
> + void *entry;
> +
> + rcu_read_lock();
> + for (;;) {
> + entry = NULL;
> + if (!dax_mapping(mapping))
> + break;
> +
> + xas.xa = &mapping->i_pages;
> + xas_lock_irq(&xas);
> + xas_set(&xas, index);
> + entry = xas_load(&xas);
> + if (dax_is_locked(entry)) {
> + rcu_read_unlock();
> + wait_entry_unlocked(&xas, entry);
> + rcu_read_lock();
> + continue;
> + }
> + if (!entry ||
> + dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + /*
> + * Because we are looking for entry from file's mapping
> + * and index, so the entry may not be inserted for now,
> + * or even a zero/empty entry. We don't think this is
> + * an error case. So, return a special value and do
> + * not output @page.
> + */
> + entry = (void *)~0UL;

I kinda wonder if these open-coded magic values ~0UL (no entry) and 0
(cannot lock) should be #defines that force-cast the magic value to
dax_entry_t...

...but then I'm not really an expert in the design behind fs/dax.c --
this part looks reasonable enough to me, but I think Dan or Matthew
ought to look this over.

--D

> + } else {
> + *page = pfn_to_page(dax_to_pfn(entry));
> + dax_lock_entry(&xas, entry);
> + }
> + xas_unlock_irq(&xas);
> + break;
> + }
> + rcu_read_unlock();
> + return (dax_entry_t)entry;
> +}
> +
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
> + dax_entry_t cookie)
> +{
> + XA_STATE(xas, &mapping->i_pages, index);
> +
> + if (cookie == ~0UL)
> + return;
> +
> + dax_unlock_entry(&xas, (void *)cookie);
> +}
> +
> /*
> * Find page cache entry at given index. If it is a DAX entry, return it
> * with the entry locked. If the page cache doesn't contain an entry at
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index d273d59723cd..65411bee4312 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -156,6 +156,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
> struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
> dax_entry_t dax_lock_page(struct page *page);
> void dax_unlock_page(struct page *page, dax_entry_t cookie);
> +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
> + unsigned long index, struct page **page);
> +void dax_unlock_mapping_entry(struct address_space *mapping,
> + unsigned long index, dax_entry_t cookie);
> #else
> #define generic_fsdax_supported NULL
>
> @@ -201,6 +205,17 @@ static inline dax_entry_t dax_lock_page(struct page *page)
> static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
> {
> }
> +
> +static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
> + unsigned long index, struct page **page)
> +{
> + return 0;
> +}
> +
> +static inline void dax_unlock_mapping_entry(struct address_space *mapping,
> + unsigned long index, dax_entry_t cookie)
> +{
> +}
> #endif
>
> #if IS_ENABLED(CONFIG_DAX)
> --
> 2.33.0
>
>
>

2021-10-15 00:30:37

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS

On Fri, Sep 24, 2021 at 09:09:58PM +0800, Shiyang Ruan wrote:
> This function is used to handle errors which may cause data lost in
> filesystem. Such as memory failure in fsdax mode.
>
> If the rmap feature of XFS enabled, we can query it to find files and
> metadata which are associated with the corrupt data. For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
>
> After that, the memory failure needs to notify the processes who are
> using those files.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> drivers/dax/super.c | 19 +++++
> fs/xfs/xfs_fsops.c | 3 +
> fs/xfs/xfs_mount.h | 1 +
> fs/xfs/xfs_super.c | 188 ++++++++++++++++++++++++++++++++++++++++++++
> include/linux/dax.h | 18 +++++
> 5 files changed, 229 insertions(+)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 7d4a11dcba90..22091e7fb0ef 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -135,6 +135,25 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> }
> EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>
> +void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *ops)
> +{
> + dax_set_holder(dax_dev, holder, ops);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_register_holder);
> +
> +void fs_dax_unregister_holder(struct dax_device *dax_dev)
> +{
> + dax_set_holder(dax_dev, NULL, NULL);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_unregister_holder);
> +
> +void *fs_dax_get_holder(struct dax_device *dax_dev)
> +{
> + return dax_get_holder(dax_dev);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_get_holder);
> +
> bool generic_fsdax_supported(struct dax_device *dax_dev,
> struct block_device *bdev, int blocksize, sector_t start,
> sector_t sectors)
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 33e26690a8c4..4c2d3d4ca5a5 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -542,6 +542,9 @@ xfs_do_force_shutdown(
> } else if (flags & SHUTDOWN_CORRUPT_INCORE) {
> tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> why = "Corruption of in-memory data";
> + } else if (flags & SHUTDOWN_CORRUPT_META) {
> + tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> + why = "Corruption of on-disk metadata";
> } else {
> tag = XFS_PTAG_SHUTDOWN_IOERROR;
> why = "Metadata I/O Error";
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index e091f3b3fa15..d0f6da23e3df 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -434,6 +434,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int lags, char *fname,
> #define SHUTDOWN_LOG_IO_ERROR 0x0002 /* write attempt to the log failed */
> #define SHUTDOWN_FORCE_UMOUNT 0x0004 /* shutdown from a forced unmount */
> #define SHUTDOWN_CORRUPT_INCORE 0x0008 /* corrupt in-memory data structures */
> +#define SHUTDOWN_CORRUPT_META 0x0010 /* corrupt metadata on device */
>
> #define XFS_SHUTDOWN_STRINGS \
> { SHUTDOWN_META_IO_ERROR, "metadata_io" }, \
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index c4e0cd1c1c8c..46fdf44b5ec2 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -37,11 +37,19 @@
> #include "xfs_reflink.h"
> #include "xfs_pwork.h"
> #include "xfs_ag.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rtalloc.h"
> +#include "xfs_bit.h"
>
> #include <linux/magic.h>
> #include <linux/fs_context.h>
> #include <linux/fs_parser.h>
> +#include <linux/mm.h>
> +#include <linux/dax.h>
>
> +static const struct dax_holder_operations xfs_dax_holder_operations;
> static const struct super_operations xfs_super_operations;
>
> static struct kset *xfs_kset; /* top-level xfs sysfs dir */
> @@ -377,6 +385,8 @@ xfs_close_devices(
>
> xfs_free_buftarg(mp->m_logdev_targp);
> xfs_blkdev_put(logdev);
> + if (dax_logdev)
> + fs_dax_unregister_holder(dax_logdev);
> fs_put_dax(dax_logdev);
> }
> if (mp->m_rtdev_targp) {
> @@ -385,9 +395,13 @@ xfs_close_devices(
>
> xfs_free_buftarg(mp->m_rtdev_targp);
> xfs_blkdev_put(rtdev);
> + if (dax_rtdev)
> + fs_dax_unregister_holder(dax_rtdev);
> fs_put_dax(dax_rtdev);
> }
> xfs_free_buftarg(mp->m_ddev_targp);
> + if (dax_ddev)
> + fs_dax_unregister_holder(dax_ddev);
> fs_put_dax(dax_ddev);
> }
>
> @@ -411,6 +425,9 @@ xfs_open_devices(
> struct block_device *logdev = NULL, *rtdev = NULL;
> int error;
>
> + if (dax_ddev)
> + fs_dax_register_holder(dax_ddev, mp,
> + &xfs_dax_holder_operations);
> /*
> * Open real time and log devices - order is important.
> */
> @@ -419,6 +436,9 @@ xfs_open_devices(
> if (error)
> goto out;
> dax_logdev = fs_dax_get_by_bdev(logdev);
> + if (dax_logdev != dax_ddev && dax_logdev)
> + fs_dax_register_holder(dax_logdev, mp,
> + &xfs_dax_holder_operations);
> }
>
> if (mp->m_rtname) {
> @@ -433,6 +453,9 @@ xfs_open_devices(
> goto out_close_rtdev;
> }
> dax_rtdev = fs_dax_get_by_bdev(rtdev);
> + if (dax_rtdev)
> + fs_dax_register_holder(dax_rtdev, mp,
> + &xfs_dax_holder_operations);
> }
>
> /*
> @@ -1110,6 +1133,171 @@ xfs_fs_free_cached_objects(
> return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
> }
>
> +struct notify_failure_info {
> + xfs_agblock_t startblock;

Style nit: tabs between type name and variable name.

> + xfs_filblks_t blockcount;
> + int flags;

Are these MF_ flags to be passed from memory_failure to
mf_dax_kill_procs?

> +};
> +
> +static loff_t
> +xfs_notify_failure_start(
> + struct xfs_mount *mp,
> + const struct xfs_rmap_irec *rec,
> + const struct notify_failure_info *notify)
> +{
> + loff_t start = rec->rm_offset;
> +
> + if (notify->startblock > rec->rm_startblock)
> + start += XFS_FSB_TO_B(mp,
> + notify->startblock - rec->rm_startblock);

I'm confused, is this supposed to return the file pos(ition) of the
failed range in units of bytes or in fs blocks?

If it's units of bytes (like the loff_t return value implies) then this
should be called xfs_notify_failure_pos.

> + return start;
> +}
> +
> +static size_t
> +xfs_notify_failure_size(
> + struct xfs_mount *mp,
> + const struct xfs_rmap_irec *rec,
> + const struct notify_failure_info *notify)
> +{
> + xfs_agblock_t rec_start = rec->rm_startblock;
> + xfs_agblock_t rec_end = rec->rm_startblock + rec->rm_blockcount;

These are "next" variables, not "end". The end of the record is
startblock + blockcount - 1.

> + xfs_agblock_t notify_start = notify->startblock;
> + xfs_agblock_t notify_end = notify->startblock + notify->blockcount;
> + xfs_agblock_t cross_start = max(rec_start, notify_start);
> + xfs_agblock_t cross_end = min(rec_end, notify_end);
> +
> + return XFS_FSB_TO_B(mp, cross_end - cross_start);
> +}
> +
> +static int
> +xfs_dax_notify_failure_fn(
> + struct xfs_btree_cur *cur,
> + const struct xfs_rmap_irec *rec,
> + void *data)
> +{
> + struct xfs_mount *mp = cur->bc_mp;
> + struct xfs_inode *ip;
> + struct address_space *mapping;
> + struct notify_failure_info *notify = data;
> + int error = 0;
> +
> + if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> + (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + // TODO check and try to fix metadata
> + xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
> + return -EFSCORRUPTED;
> + }
> +
> + /* Get files that incore, filter out others that are not in use. */
> + error = xfs_iget(mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
> + 0, &ip);

If you're going to use _INCORE then you probably want to filter out the
-ENODATA or whatever error code means "inode wasn't loaded", because
returning any nonzero value to rmap_query_range causes it to stop
iterating.

> + if (error)
> + return error;
> +
> + mapping = VFS_I(ip)->i_mapping;
> + if (IS_ENABLED(CONFIG_MEMORY_FAILURE)) {
> + loff_t offset = xfs_notify_failure_start(mp, rec, notify);
> + size_t size = xfs_notify_failure_size(mp, rec, notify);
> +
> + error = mf_dax_kill_procs(mapping, offset >> PAGE_SHIFT, size,
> + notify->flags);
> + }
> + // TODO try to fix data
> + xfs_irele(ip);
> +
> + return error;
> +}
> +
> +static loff_t
> +xfs_dax_bdev_offset(
> + struct xfs_mount *mp,
> + struct dax_device *dax_dev,
> + loff_t disk_offset)
> +{
> + struct block_device *bdev;
> +
> + if (mp->m_ddev_targp->bt_daxdev == dax_dev)
> + bdev = mp->m_ddev_targp->bt_bdev;
> + else if (mp->m_logdev_targp->bt_daxdev == dax_dev)
> + bdev = mp->m_logdev_targp->bt_bdev;
> + else
> + bdev = mp->m_rtdev_targp->bt_bdev;
> +
> + return disk_offset - (get_start_sect(bdev) << SECTOR_SHIFT);
> +}
> +
> +static int
> +xfs_dax_notify_failure(
> + struct dax_device *dax_dev,
> + loff_t offset,
> + size_t len,
> + int flags)

Are @flags supposed to contain the MF_ flags that were passed to
memory_failure()? The variable name should probably be @mf_flags
throughout the patchset if that's the case.

> +{
> + struct xfs_mount *mp = fs_dax_get_holder(dax_dev);
> + struct xfs_trans *tp = NULL;
> + struct xfs_btree_cur *cur = NULL;
> + struct xfs_buf *agf_bp = NULL;
> + struct xfs_rmap_irec rmap_low, rmap_high;
> + loff_t bdev_offset = xfs_dax_bdev_offset(mp, dax_dev,
> + offset);

I don't like using loff_t to represent byte offsets into the physical
device. loff_t should be used only for file byte offsets, and that's
not what we're storing here.

> + xfs_fsblock_t fsbno = XFS_B_TO_FSB(mp, bdev_offset);

I think this is still wrong, since fsblocks are segmented (nonlinear)
addresses. Pass daddr into xfs_dax_notify_ddev_failure like I lay out
below, and then you can do:

start_fsbno = XFS_DADDR_TO_FSB(mp, daddr);
agno = XFS_FSB_TO_AGNO(mp, fsbno);
notify.startblock = XFS_FSB_TO_AGBNO(mp, fsbno);
notify.blockcount = XFS_BB_TO_FSB(mp, bblen);

(More on this below)

> + xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
> + int error = 0;
> + struct notify_failure_info notify = {
> + .startblock = XFS_FSB_TO_AGBNO(mp, fsbno),
> + .blockcount = XFS_B_TO_FSB(mp, len),
> + .flags = flags,
> + };
> +
> + if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
> + xfs_warn(mp,
> + "notify_failure() not supported on realtime device!");
> + return -EOPNOTSUPP;
> + }
> +
> + if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> + mp->m_logdev_targp != mp->m_ddev_targp) {
> + xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> + xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
> + return -EFSCORRUPTED;
> + }

Urk. xfs_dax_notify_failure should be a short function to dispatch the
notification to the proper handler. Everything from here down should be
in a separate function xfs_dax_notify_ddev_failure, so that the next
line of xfs_dax_notify_failure is just:

return xfs_dax_notify_ddev_failure(mp, BTOBB(offset),
BTOBB(len), flags);

And then we have

static int
xfs_dax_notify_ddev_failure(
struct xfs_mount *mp,
xfs_daddr_t daddr,
xfs_daddr_t bblen,
int mf_flags)
{
if (!xfs_has_rmapbt(mp)) {
xfs_warn(...);

Because otherwise xfs_dax_notify_failure gets cluttered.

> +
> + if (!xfs_has_rmapbt(mp)) {
> + xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
> + return -EOPNOTSUPP;
> + }
> +
> + error = xfs_trans_alloc_empty(mp, &tp);
> + if (error)
> + return error;
> +
> + error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
> + if (error)
> + goto out_cancel_tp;
> +
> + cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agf_bp->b_pag);

What happens if the failure range spans multiple AGs? I suppose it's
not technically possible today since we only report a single page at a
time. But for the general case, I think we actually want (building off
the sample code above) this function to do something like this:

start_fsbno = XFS_DADDR_TO_FSB(mp, daddr);
agno = XFS_FSB_TO_AGNO(mp, fsbno);
end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);

for (; agno <= end_agbno; agno++) {
struct xfs_rmap_irec rmap_low = { };
struct xfs_rmap_irec rmap_high;

notify.startblock = XFS_FSB_TO_AGBNO(mp, fsbno);
notify.blockcount = XFS_BB_TO_FSB(mp, bblen);

/*
* init transaction, read agf, init cursor...
*/

memset(&rmap_high, 0xFF, sizeof(rmap_high));
rmap_low.rm_startblock = XFS_FSB_TO_AGBNO(mp, fsbno);
if (agno == end_agbno)
rmap_high.rm_startblock = XFS_FSB_TO_AGBNO(mp,
end_fsbno);

error = xfs_rmap_query_range(...);
if (error)
fail;

fsbno = XFS_AGB_TO_FSB(mp, agno + 1, 0);
}

--D

> +
> + /* Construct a range for rmap query */
> + memset(&rmap_low, 0, sizeof(rmap_low));
> + memset(&rmap_high, 0xFF, sizeof(rmap_high));
> + rmap_low.rm_startblock = rmap_high.rm_startblock = notify.startblock;
> + rmap_low.rm_blockcount = rmap_high.rm_blockcount = notify.blockcount;
> +
> + error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
> + xfs_dax_notify_failure_fn, &notify);
> +
> + xfs_btree_del_cursor(cur, error);
> + xfs_trans_brelse(tp, agf_bp);
> +
> +out_cancel_tp:
> + xfs_trans_cancel(tp);
> + return error;
> +}
> +
> +static const struct dax_holder_operations xfs_dax_holder_operations = {
> + .notify_failure = xfs_dax_notify_failure,
> +};
> +
> static const struct super_operations xfs_super_operations = {
> .alloc_inode = xfs_fs_alloc_inode,
> .destroy_inode = xfs_fs_destroy_inode,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 3d90becbd160..0f1fa7a4a616 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -149,6 +149,10 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
> }
>
> struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
> +void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *ops);
> +void fs_dax_unregister_holder(struct dax_device *dax_dev);
> +void *fs_dax_get_holder(struct dax_device *dax_dev);
> int dax_writeback_mapping_range(struct address_space *mapping,
> struct dax_device *dax_dev, struct writeback_control *wbc);
>
> @@ -179,6 +183,20 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> return NULL;
> }
>
> +static inline void fs_dax_register_holder(struct dax_device *dax_dev,
> + void *holder, const struct dax_holder_operations *ops)
> +{
> +}
> +
> +static inline void fs_dax_unregister_holder(struct dax_device *dax_dev)
> +{
> +}
> +
> +static inline void *fs_dax_get_holder(struct dax_device *dax_dev)
> +{
> + return NULL;
> +}
> +
> static inline struct page *dax_layout_busy_page(struct address_space *mapping)
> {
> return NULL;
> --
> 2.33.0
>
>
>

2021-10-15 01:22:48

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 8/8] fsdax: add exception for reflinked files

On Fri, Sep 24, 2021 at 09:09:59PM +0800, Shiyang Ruan wrote:
> For reflinked files, one dax page may be associated more than once with
> different fime mapping and index. It will report warning. Now, since
> we have introduced dax-RMAP for this case and also have to keep its
> functionality for other filesystems who are not support rmap, I add this
> exception here.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> fs/dax.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 2536c105ec7f..1a57211b1bc9 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -352,9 +352,10 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
> for_each_mapped_pfn(entry, pfn) {
> struct page *page = pfn_to_page(pfn);
>
> - WARN_ON_ONCE(page->mapping);
> - page->mapping = mapping;
> - page->index = index + i++;
> + if (!page->mapping) {
> + page->mapping = mapping;
> + page->index = index + i++;

It feels a little dangerous to have page->mapping for shared storage
point to an actual address_space when there are really multiple
potential address_spaces out there. If the mm or dax folks are ok with
doing this this way then I'll live with it, but it seems like you'd want
to leave /some/ kind of marker once you know that the page has multiple
owners and therefore regular mm rmap via page->mapping won't work.

--D

> + }
> }
> }
>
> @@ -370,9 +371,10 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> struct page *page = pfn_to_page(pfn);
>
> WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> - page->mapping = NULL;
> - page->index = 0;
> + if (page->mapping == mapping) {
> + page->mapping = NULL;
> + page->index = 0;
> + }
> }
> }
>
> --
> 2.33.0
>
>
>

2021-10-15 01:31:29

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case

On Fri, Sep 24, 2021 at 09:09:57PM +0800, Shiyang Ruan wrote:
> This function is called at the end of RMAP routine, i.e. filesystem
> recovery function, to collect and kill processes using a shared page of
> DAX file. The difference between mf_generic_kill_procs() is,
> it accepts file's mapping,offset instead of struct page. Because
> different file's mappings and offsets may share the same page in fsdax
> mode. So, it is called when filesystem RMAP results are found.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> fs/dax.c | 10 ------
> include/linux/dax.h | 9 +++++
> include/linux/mm.h | 2 ++
> mm/memory-failure.c | 83 ++++++++++++++++++++++++++++++++++++++++-----
> 4 files changed, 86 insertions(+), 18 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 509b65e60478..2536c105ec7f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -852,16 +852,6 @@ static void *dax_insert_entry(struct xa_state *xas,
> return entry;
> }
>
> -static inline
> -unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
> -{
> - unsigned long address;
> -
> - address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> - return address;
> -}
> -
> /* Walk all mappings of a given index of a file and writeprotect them */
> static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
> unsigned long pfn)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 65411bee4312..3d90becbd160 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -258,6 +258,15 @@ static inline bool dax_mapping(struct address_space *mapping)
> {
> return mapping->host && IS_DAX(mapping->host);
> }
> +static inline unsigned long pgoff_address(pgoff_t pgoff,
> + struct vm_area_struct *vma)
> +{
> + unsigned long address;
> +
> + address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> + return address;
> +}
>
> #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
> void hmem_register_device(int target_nid, struct resource *r);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..d06af0051e53 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3114,6 +3114,8 @@ enum mf_flags {
> MF_MUST_KILL = 1 << 2,
> MF_SOFT_OFFLINE = 1 << 3,
> };
> +extern int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> + size_t size, int flags);
> extern int memory_failure(unsigned long pfn, int flags);
> extern void memory_failure_queue(unsigned long pfn, int flags);
> extern void memory_failure_queue_kick(int cpu);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 85eab206b68f..a9d0d487d205 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -302,10 +302,9 @@ void shake_page(struct page *p)
> }
> EXPORT_SYMBOL_GPL(shake_page);
>
> -static unsigned long dev_pagemap_mapping_shift(struct page *page,
> +static unsigned long dev_pagemap_mapping_shift(unsigned long address,
> struct vm_area_struct *vma)
> {
> - unsigned long address = vma_address(page, vma);
> pgd_t *pgd;
> p4d_t *p4d;
> pud_t *pud;
> @@ -345,7 +344,7 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
> * Schedule a process for later kill.
> * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> */
> -static void add_to_kill(struct task_struct *tsk, struct page *p,
> +static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,

Hm, so I guess you're passing the page and the pgoff now because
page->index is meaningless for shared dax pages? Ok.

> struct vm_area_struct *vma,
> struct list_head *to_kill)
> {
> @@ -358,9 +357,15 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
> }
>
> tk->addr = page_address_in_vma(p, vma);
> - if (is_zone_device_page(p))
> - tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> - else
> + if (is_zone_device_page(p)) {
> + /*
> + * Since page->mapping is no more used for fsdax, we should
> + * calculate the address in a fsdax way.
> + */
> + if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
> + tk->addr = pgoff_address(pgoff, vma);
> + tk->size_shift = dev_pagemap_mapping_shift(tk->addr, vma);
> + } else
> tk->size_shift = page_shift(compound_head(p));
>
> /*
> @@ -508,7 +513,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> if (!page_mapped_in_vma(page, vma))
> continue;
> if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, 0, vma, to_kill);
> }
> }
> read_unlock(&tasklist_lock);
> @@ -544,7 +549,32 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
> * to be informed of all such data corruptions.
> */
> if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, 0, vma, to_kill);
> + }
> + }
> + read_unlock(&tasklist_lock);
> + i_mmap_unlock_read(mapping);
> +}
> +
> +/*
> + * Collect processes when the error hit a fsdax page.
> + */
> +static void collect_procs_fsdax(struct page *page, struct address_space *mapping,
> + pgoff_t pgoff, struct list_head *to_kill)
> +{
> + struct vm_area_struct *vma;
> + struct task_struct *tsk;
> +
> + i_mmap_lock_read(mapping);
> + read_lock(&tasklist_lock);
> + for_each_process(tsk) {
> + struct task_struct *t = task_early_kill(tsk, true);
> +
> + if (!t)
> + continue;
> + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> + if (vma->vm_mm == t->mm)
> + add_to_kill(t, page, pgoff, vma, to_kill);
> }
> }
> read_unlock(&tasklist_lock);
> @@ -1503,6 +1533,43 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
> return 0;
> }
>
> +/**
> + * mf_dax_kill_procs - Collect and kill processes who are using this file range
> + * @mapping: the file in use
> + * @index: start offset of the range
> + * @size: length of the range

It feels odd that one argument is in units of pgoff_t but the other is
in bytes.

> + * @flags: memory failure flags
> + */
> +int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> + size_t size, int flags)
> +{
> + LIST_HEAD(to_kill);
> + dax_entry_t cookie;
> + struct page *page;
> + size_t end = (index << PAGE_SHIFT) + size;
> +
> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;

Hm. What flags will we be passing to the xfs_dax_notify_failure_fn?
Does XFS itself have to care about what's in the flags values, or is it
really just a magic cookie to be passed from the mm layer into the fs
and back to mf_dax_kill_procs?

--D

> +
> + for (; (index << PAGE_SHIFT) < end; index++) {
> + page = NULL;
> + cookie = dax_lock_mapping_entry(mapping, index, &page);
> + if (!cookie)
> + return -EBUSY;
> + if (!page)
> + goto unlock;
> +
> + SetPageHWPoison(page);
> +
> + collect_procs_fsdax(page, mapping, index, &to_kill);
> + unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> + index, flags);
> +unlock:
> + dax_unlock_mapping_entry(mapping, index, cookie);
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
> +
> static int memory_failure_hugetlb(unsigned long pfn, int flags)
> {
> struct page *p = pfn_to_page(pfn);
> --
> 2.33.0
>
>
>

2021-10-15 13:00:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap

On Fri, Sep 24, 2021 at 09:09:54PM +0800, Shiyang Ruan wrote:
> memory_failure_dev_pagemap code is a bit complex before introduce RMAP
> feature for fsdax. So it is needed to factor some helper functions to
> simplify these code.
>
> Signed-off-by: Shiyang Ruan <[email protected]>
> ---
> mm/memory-failure.c | 140 ++++++++++++++++++++++++--------------------
> 1 file changed, 76 insertions(+), 64 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 54879c339024..8ff9b52823c0 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1430,6 +1430,79 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
> return 0;
> }
>
> +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
> + struct address_space *mapping, pgoff_t index, int flags)
> +{
> + struct to_kill *tk;
> + unsigned long size = 0;
> +
> + list_for_each_entry(tk, to_kill, nd)
> + if (tk->size_shift)
> + size = max(size, 1UL << tk->size_shift);
> + if (size) {

Nit: an empty line here would be nice for readability.

> + if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> + /*
> + * TODO: Handle HMM pages which may need coordination
> + * with device-side memory.
> + */
> + return -EBUSY;

We've got rid of the HMM terminology for device private memory, so
I'd reword this update the comment to follow that while you're at it.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2021-10-15 13:03:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure()

Except for the error code inversion noticed by Darrick this looks fine
to me:

Reviewed-by: Christoph Hellwig <[email protected]>

2021-10-15 13:03:36

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS

On Fri, Sep 24, 2021 at 09:09:58PM +0800, Shiyang Ruan wrote:
> +void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
> + const struct dax_holder_operations *ops)
> +{
> + dax_set_holder(dax_dev, holder, ops);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_register_holder);
> +
> +void fs_dax_unregister_holder(struct dax_device *dax_dev)
> +{
> + dax_set_holder(dax_dev, NULL, NULL);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_unregister_holder);
> +
> +void *fs_dax_get_holder(struct dax_device *dax_dev)
> +{
> + return dax_get_holder(dax_dev);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_get_holder);

These should not be in a XFS patch. But why do we even need this
wrappers?

> @@ -377,6 +385,8 @@ xfs_close_devices(
>
> xfs_free_buftarg(mp->m_logdev_targp);
> xfs_blkdev_put(logdev);
> + if (dax_logdev)
> + fs_dax_unregister_holder(dax_logdev);
> fs_put_dax(dax_logdev);

I'd prefer to include the fs_dax_unregister_holder in the fs_put_dax
call to avoid callers failing to unregister it.

> @@ -411,6 +425,9 @@ xfs_open_devices(
> struct block_device *logdev = NULL, *rtdev = NULL;
> int error;
>
> + if (dax_ddev)
> + fs_dax_register_holder(dax_ddev, mp,
> + &xfs_dax_holder_operations);

I'd include the holder registration with fs_dax_get_by_bdev as well.

2021-10-15 13:04:03

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 8/8] fsdax: add exception for reflinked files

On Thu, Oct 14, 2021 at 12:24:50PM -0700, Darrick J. Wong wrote:
> It feels a little dangerous to have page->mapping for shared storage
> point to an actual address_space when there are really multiple
> potential address_spaces out there. If the mm or dax folks are ok with
> doing this this way then I'll live with it, but it seems like you'd want
> to leave /some/ kind of marker once you know that the page has multiple
> owners and therefore regular mm rmap via page->mapping won't work.

Yes, I thing poisoning page->mapping for the rmap enabled case seems
like a better idea.

2021-10-20 05:28:33

by Shiyang Ruan

[permalink] [raw]
Subject: Re: [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure()



在 2021/10/15 2:05, Darrick J. Wong 写道:
> On Fri, Sep 24, 2021 at 09:09:55PM +0800, Shiyang Ruan wrote:
>> When memory-failure occurs, we call this function which is implemented
>> by each kind of devices. For the fsdax case, pmem device driver
>> implements it. Pmem device driver will find out the filesystem in which
>> the corrupted page located in.
>>
>> With dax_holder notify support, we are able to notify the memory failure
>> from pmem driver to upper layers. If there is something not support in
>> the notify routine, memory_failure will fall back to the generic hanlder.
>>
>> Signed-off-by: Shiyang Ruan <[email protected]>
>> ---
>> drivers/nvdimm/pmem.c | 11 +++++++++++
>> include/linux/memremap.h | 9 +++++++++
>> mm/memory-failure.c | 14 ++++++++++++++
>> 3 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index 72de88ff0d30..0dfafad8fcc5 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -362,9 +362,20 @@ static void pmem_release_disk(void *__pmem)
>> del_gendisk(pmem->disk);
>> }
>>
>> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
>> + unsigned long pfn, size_t size, int flags)
>> +{
>> + struct pmem_device *pmem =
>> + container_of(pgmap, struct pmem_device, pgmap);
>> + loff_t offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
>> +
>> + return dax_holder_notify_failure(pmem->dax_dev, offset, size, flags);
>> +}
>> +
>> static const struct dev_pagemap_ops fsdax_pagemap_ops = {
>> .kill = pmem_pagemap_kill,
>> .cleanup = pmem_pagemap_cleanup,
>> + .memory_failure = pmem_pagemap_memory_failure,
>> };
>>
>> static int pmem_attach_disk(struct device *dev,
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index c0e9d35889e8..36d47bacd46d 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -87,6 +87,15 @@ struct dev_pagemap_ops {
>> * the page back to a CPU accessible page.
>> */
>> vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
>> +
>> + /*
>> + * Handle the memory failure happens on a range of pfns. Notify the
>> + * processes who are using these pfns, and try to recover the data on
>> + * them if necessary. The flag is finally passed to the recover
>> + * function through the whole notify routine.
>> + */
>> + int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
>> + size_t size, int flags);
>> };
>>
>> #define PGMAP_ALTMAP_VALID (1 << 0)
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 8ff9b52823c0..85eab206b68f 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1605,6 +1605,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>> if (!pgmap_pfn_valid(pgmap, pfn))
>> goto out;
>>
>> + /*
>> + * Call driver's implementation to handle the memory failure, otherwise
>> + * fall back to generic handler.
>> + */
>> + if (pgmap->ops->memory_failure) {
>> + rc = pgmap->ops->memory_failure(pgmap, pfn, PAGE_SIZE, flags);
>> + /*
>> + * Fall back to generic handler too if operation is not
>> + * supported inside the driver/device/filesystem.
>> + */
>> + if (rc != EOPNOTSUPP)
>
> -EOPNOTSUPP? (negative errno)

Yes, my mistake. Thanks for pointing out.


--
Thanks,
Ruan.

>
> --D
>
>> + goto out;
>> + }
>> +
>> rc = mf_generic_kill_procs(pfn, flags, pgmap);
>> out:
>> /* drop pgmap ref acquired in caller */
>> --
>> 2.33.0
>>
>>
>>


2021-10-20 05:50:08

by Shiyang Ruan

[permalink] [raw]
Subject: Re: [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case



在 2021/10/15 3:32, Darrick J. Wong 写道:
> On Fri, Sep 24, 2021 at 09:09:57PM +0800, Shiyang Ruan wrote:
>> This function is called at the end of RMAP routine, i.e. filesystem
>> recovery function, to collect and kill processes using a shared page of
>> DAX file. The difference between mf_generic_kill_procs() is,
>> it accepts file's mapping,offset instead of struct page. Because
>> different file's mappings and offsets may share the same page in fsdax
>> mode. So, it is called when filesystem RMAP results are found.
>>
>> Signed-off-by: Shiyang Ruan <[email protected]>
>> ---
>> fs/dax.c | 10 ------
>> include/linux/dax.h | 9 +++++
>> include/linux/mm.h | 2 ++
>> mm/memory-failure.c | 83 ++++++++++++++++++++++++++++++++++++++++-----
>> 4 files changed, 86 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 509b65e60478..2536c105ec7f 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -852,16 +852,6 @@ static void *dax_insert_entry(struct xa_state *xas,
>> return entry;
>> }
>>
>> -static inline
>> -unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
>> -{
>> - unsigned long address;
>> -
>> - address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>> - return address;
>> -}
>> -
>> /* Walk all mappings of a given index of a file and writeprotect them */
>> static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
>> unsigned long pfn)
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 65411bee4312..3d90becbd160 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -258,6 +258,15 @@ static inline bool dax_mapping(struct address_space *mapping)
>> {
>> return mapping->host && IS_DAX(mapping->host);
>> }
>> +static inline unsigned long pgoff_address(pgoff_t pgoff,
>> + struct vm_area_struct *vma)
>> +{
>> + unsigned long address;
>> +
>> + address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>> + return address;
>> +}
>>
>> #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
>> void hmem_register_device(int target_nid, struct resource *r);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 73a52aba448f..d06af0051e53 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3114,6 +3114,8 @@ enum mf_flags {
>> MF_MUST_KILL = 1 << 2,
>> MF_SOFT_OFFLINE = 1 << 3,
>> };
>> +extern int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> + size_t size, int flags);
>> extern int memory_failure(unsigned long pfn, int flags);
>> extern void memory_failure_queue(unsigned long pfn, int flags);
>> extern void memory_failure_queue_kick(int cpu);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 85eab206b68f..a9d0d487d205 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -302,10 +302,9 @@ void shake_page(struct page *p)
>> }
>> EXPORT_SYMBOL_GPL(shake_page);
>>
>> -static unsigned long dev_pagemap_mapping_shift(struct page *page,
>> +static unsigned long dev_pagemap_mapping_shift(unsigned long address,
>> struct vm_area_struct *vma)
>> {
>> - unsigned long address = vma_address(page, vma);
>> pgd_t *pgd;
>> p4d_t *p4d;
>> pud_t *pud;
>> @@ -345,7 +344,7 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
>> * Schedule a process for later kill.
>> * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>> */
>> -static void add_to_kill(struct task_struct *tsk, struct page *p,
>> +static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,
>
> Hm, so I guess you're passing the page and the pgoff now because
> page->index is meaningless for shared dax pages? Ok.

Yes, it is for that case.

>
>> struct vm_area_struct *vma,
>> struct list_head *to_kill)
>> {
>> @@ -358,9 +357,15 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>> }
>>
>> tk->addr = page_address_in_vma(p, vma);
>> - if (is_zone_device_page(p))
>> - tk->size_shift = dev_pagemap_mapping_shift(p, vma);
>> - else
>> + if (is_zone_device_page(p)) {
>> + /*
>> + * Since page->mapping is no more used for fsdax, we should
>> + * calculate the address in a fsdax way.
>> + */
>> + if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
>> + tk->addr = pgoff_address(pgoff, vma);
>> + tk->size_shift = dev_pagemap_mapping_shift(tk->addr, vma);
>> + } else
>> tk->size_shift = page_shift(compound_head(p));
>>
>> /*
>> @@ -508,7 +513,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>> if (!page_mapped_in_vma(page, vma))
>> continue;
>> if (vma->vm_mm == t->mm)
>> - add_to_kill(t, page, vma, to_kill);
>> + add_to_kill(t, page, 0, vma, to_kill);
>> }
>> }
>> read_unlock(&tasklist_lock);
>> @@ -544,7 +549,32 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>> * to be informed of all such data corruptions.
>> */
>> if (vma->vm_mm == t->mm)
>> - add_to_kill(t, page, vma, to_kill);
>> + add_to_kill(t, page, 0, vma, to_kill);
>> + }
>> + }
>> + read_unlock(&tasklist_lock);
>> + i_mmap_unlock_read(mapping);
>> +}
>> +
>> +/*
>> + * Collect processes when the error hit a fsdax page.
>> + */
>> +static void collect_procs_fsdax(struct page *page, struct address_space *mapping,
>> + pgoff_t pgoff, struct list_head *to_kill)
>> +{
>> + struct vm_area_struct *vma;
>> + struct task_struct *tsk;
>> +
>> + i_mmap_lock_read(mapping);
>> + read_lock(&tasklist_lock);
>> + for_each_process(tsk) {
>> + struct task_struct *t = task_early_kill(tsk, true);
>> +
>> + if (!t)
>> + continue;
>> + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> + if (vma->vm_mm == t->mm)
>> + add_to_kill(t, page, pgoff, vma, to_kill);
>> }
>> }
>> read_unlock(&tasklist_lock);
>> @@ -1503,6 +1533,43 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
>> return 0;
>> }
>>
>> +/**
>> + * mf_dax_kill_procs - Collect and kill processes who are using this file range
>> + * @mapping: the file in use
>> + * @index: start offset of the range
>> + * @size: length of the range
>
> It feels odd that one argument is in units of pgoff_t but the other is
> in bytes.

The index is page aligned but @size may not be. I will explain it in
detail in the comments.

>
>> + * @flags: memory failure flags
>> + */
>> +int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> + size_t size, int flags)
>> +{
>> + LIST_HEAD(to_kill);
>> + dax_entry_t cookie;
>> + struct page *page;
>> + size_t end = (index << PAGE_SHIFT) + size;
>> +
>> + flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>
> Hm. What flags will we be passing to the xfs_dax_notify_failure_fn?
> Does XFS itself have to care about what's in the flags values, or is it
> really just a magic cookie to be passed from the mm layer into the fs
> and back to mf_dax_kill_procs?
>

Just to pass the flag from mm layer to mf_dax_kill_procs(). No one
inside this RMAP progress will care about or change it. As you
mentioned in the next patch, I think this should be named with a "mf_"
prefix to make it easier to understand.


--
Thanks,
Ruan.

> --D
>
>> +
>> + for (; (index << PAGE_SHIFT) < end; index++) {
>> + page = NULL;
>> + cookie = dax_lock_mapping_entry(mapping, index, &page);
>> + if (!cookie)
>> + return -EBUSY;
>> + if (!page)
>> + goto unlock;
>> +
>> + SetPageHWPoison(page);
>> +
>> + collect_procs_fsdax(page, mapping, index, &to_kill);
>> + unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>> + index, flags);
>> +unlock:
>> + dax_unlock_mapping_entry(mapping, index, cookie);
>> + }
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
>> +
>> static int memory_failure_hugetlb(unsigned long pfn, int flags)
>> {
>> struct page *p = pfn_to_page(pfn);
>> --
>> 2.33.0
>>
>>
>>


2021-10-20 07:02:24

by Shiyang Ruan

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] dax: Introduce holder for dax_device



在 2021/10/15 2:00, Darrick J. Wong 写道:
> On Fri, Sep 24, 2021 at 09:09:53PM +0800, Shiyang Ruan wrote:
>> To easily track filesystem from a pmem device, we introduce a holder for
>> dax_device structure, and also its operation. This holder is used to
>> remember who is using this dax_device:
>> - When it is the backend of a filesystem, the holder will be the
>> superblock of this filesystem.
>> - When this pmem device is one of the targets in a mapped device, the
>> holder will be this mapped device. In this case, the mapped device
>> has its own dax_device and it will follow the first rule. So that we
>> can finally track to the filesystem we needed.
>>
>> The holder and holder_ops will be set when filesystem is being mounted,
>> or an target device is being activated.
>>
>> Signed-off-by: Shiyang Ruan <[email protected]>
>> ---
>> drivers/dax/super.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/dax.h | 29 ++++++++++++++++++++++
>> 2 files changed, 88 insertions(+)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 48ce86501d93..7d4a11dcba90 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -23,7 +23,10 @@
>> * @cdev: optional character interface for "device dax"
>> * @host: optional name for lookups where the device path is not available
>> * @private: dax driver private data
>> + * @holder_data: holder of a dax_device: could be filesystem or mapped device
>> * @flags: state and boolean properties
>> + * @ops: operations for dax_device
>> + * @holder_ops: operations for the inner holder
>> */
>> struct dax_device {
>> struct hlist_node list;
>> @@ -31,8 +34,10 @@ struct dax_device {
>> struct cdev cdev;
>> const char *host;
>> void *private;
>> + void *holder_data;
>> unsigned long flags;
>> const struct dax_operations *ops;
>> + const struct dax_holder_operations *holder_ops;
>> };
>>
>> static dev_t dax_devt;
>> @@ -374,6 +379,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>> }
>> EXPORT_SYMBOL_GPL(dax_zero_page_range);
>>
>> +int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
>> + size_t size, int flags)
>> +{
>> + int rc;
>> +
>> + dax_read_lock();
>> + if (!dax_alive(dax_dev)) {
>> + rc = -ENXIO;
>> + goto out;
>> + }
>> +
>> + if (!dax_dev->holder_data) {
>> + rc = -EOPNOTSUPP;
>> + goto out;
>> + }
>> +
>> + rc = dax_dev->holder_ops->notify_failure(dax_dev, offset, size, flags);
>
> Shouldn't this check if dax_dev->holder_ops != NULL before dereferencing
> it for the function call? Imagine an implementation that wants to
> attach a ->notify_failure function to a dax_device, maintains its own
> lookup table, and decides that it doesn't need to set holder_data.
>
> (Or, imagine someone who writes a garbage into holder_data and *boom*)

My mistake. I should check @holder_ops instead of @holder_data.

>
> How does the locking work here? If there's a media failure, we'll take
> dax_rwsem and call ->notify_failure. If the ->notify_failure function
> wants to access the pmem to handle the error by calling back into the
> dax code, will that cause nested locking on dax_rwsem?

Won't for now. I have tested it with my simple testcases.
>
> Jumping ahead a bit, I think the rmap btree accesses that the xfs
> implementation performs can cause xfs_buf(fer) cache IO, which would
> trigger that if the buffers aren't already in memory, if I'm reading
> this correctly?

I didn't think of this case. But I think this uses read lock too. It
won't be blocked. Only dax_set_holder() takes write lock.

>
>> +out:
>> + dax_read_unlock();
>> + return rc;
>> +}
>> +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
>> +
>> #ifdef CONFIG_ARCH_HAS_PMEM_API
>> void arch_wb_cache_pmem(void *addr, size_t size);
>> void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
>> @@ -618,6 +646,37 @@ void put_dax(struct dax_device *dax_dev)
>> }
>> EXPORT_SYMBOL_GPL(put_dax);
>>
>> +void dax_set_holder(struct dax_device *dax_dev, void *holder,
>> + const struct dax_holder_operations *ops)
>> +{
>> + dax_write_lock();
>> + if (!dax_alive(dax_dev)) {
>> + dax_write_unlock();
>> + return;
>> + }
>> +
>> + dax_dev->holder_data = holder;
>> + dax_dev->holder_ops = ops;
>> + dax_write_unlock();
>
> I guess this means that the holder has to detach itself before anyone
> calls kill_dax, or else a dead dax device ends up with a dangling
> reference to the holder?

Yes.

>
>> +}
>> +EXPORT_SYMBOL_GPL(dax_set_holder);
>> +
>> +void *dax_get_holder(struct dax_device *dax_dev)
>> +{
>> + void *holder;
>> +
>> + dax_read_lock();
>> + if (!dax_alive(dax_dev)) {
>> + dax_read_unlock();
>> + return NULL;
>> + }
>> +
>> + holder = dax_dev->holder_data;
>> + dax_read_unlock();
>> + return holder;
>> +}
>> +EXPORT_SYMBOL_GPL(dax_get_holder);
>> +
>> /**
>> * inode_dax: convert a public inode into its dax_dev
>> * @inode: An inode with i_cdev pointing to a dax_dev
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 097b3304f9b9..d273d59723cd 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -38,9 +38,24 @@ struct dax_operations {
>> int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
>> };
>>
>> +struct dax_holder_operations {
>> + /*
>> + * notify_failure - notify memory failure into inner holder device
>> + * @dax_dev: the dax device which contains the holder
>> + * @offset: offset on this dax device where memory failure occurs
>> + * @size: length of this memory failure event
>> + * @flags: action flags for memory failure handler
>> + */
>> + int (*notify_failure)(struct dax_device *dax_dev, loff_t offset,
>> + size_t size, int flags);
>
> Shouldn't size be u64 or something? Let's say that 8GB of your pmem go
> bad, wouldn't you want a single call? Though I guess the current
> implementation only goes a single page at a time, doesn't it?

Right.

>
>> +};
>> +
>> extern struct attribute_group dax_attribute_group;
>>
>> #if IS_ENABLED(CONFIG_DAX)
>> +void dax_set_holder(struct dax_device *dax_dev, void *holder,
>> + const struct dax_holder_operations *ops);
>> +void *dax_get_holder(struct dax_device *dax_dev);
>> struct dax_device *alloc_dax(void *private, const char *host,
>> const struct dax_operations *ops, unsigned long flags);
>> void put_dax(struct dax_device *dax_dev);
>> @@ -70,6 +85,18 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
>> return dax_synchronous(dax_dev);
>> }
>> #else
>> +static inline struct dax_device *dax_get_by_host(const char *host)
>
> Not sure why this is being added here? AFAICT none of the patches call
> this function...?

It's mistake when I rebase my code to the latest. These lines were
deleted but I didn't notice. Will fix it.


--
Thanks,
Ruan.

>
> --D
>
>> +{
>> + return NULL;
>> +}
>> +static inline void dax_set_holder(struct dax_device *dax_dev, void *holder,
>> + const struct dax_holder_operations *ops)
>> +{
>> +}
>> +static inline void *dax_get_holder(struct dax_device *dax_dev)
>> +{
>> + return NULL;
>> +}
>> static inline struct dax_device *alloc_dax(void *private, const char *host,
>> const struct dax_operations *ops, unsigned long flags)
>> {
>> @@ -198,6 +225,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
>> size_t bytes, struct iov_iter *i);
>> int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>> size_t nr_pages);
>> +int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
>> + size_t size, int flags);
>> void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
>>
>> ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>> --
>> 2.33.0
>>
>>
>>