2021-02-08 11:20:24

by Ruan Shiyang

Subject: [PATCH v3 00/11] fsdax: introduce fs query to support reflink

This patchset aims to support shared page tracking for fsdax.

Changes from V2:
- Split the 8th patch into the other related patches to make it easier
  to review
- Other small fixes

Changes from V1:
- Add the old memory-failure handler back as a fallback
- Add a callback in MD's ->rmap() to support multiple mappings of a dm device
- Add a check for CONFIG_SYSFS
- Add a pfn_valid() check in hwpoison_filter()
- Rebased to v5.11-rc5

This patchset moves owner tracking from dax_associate_entry() to the pmem
device driver by introducing a ->memory_failure() interface in struct
pagemap. This interface is called by memory_failure() in mm and
implemented by the pmem device. The pmem device then calls its
->corrupted_range() to find the filesystem in which the corrupted data is
located, and calls the filesystem handler to track the files or metadata
associated with this page. Finally, we are able to try to fix the
corrupted data in the filesystem and do other necessary processing, such
as killing the processes that are using the affected files.

The call trace is like this:
memory_failure()
 pgmap->ops->memory_failure()      => pmem_pgmap_memory_failure()
  gendisk->fops->corrupted_range() => - pmem_corrupted_range()
                                      - dm_blk_corrupted_range()
   sb->s_ops->corrupted_range()    => xfs_fs_corrupted_range()
    xfs_rmap_query_range()
     xfs_corrupt_helper()
      * corrupted on metadata
          try to recover data, call xfs_force_shutdown()
      * corrupted on file data
          try to recover data, call mf_dax_mapping_kill_procs()
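
The layering in the call trace can be sketched in userspace C. This is
only an illustration of the callback chain, not kernel code: every type
and function named mock_* is a hypothetical stand-in, and the return
values -1 and -2 are arbitrary placeholders for real error codes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel's types. */
enum owner_kind { OWNER_METADATA, OWNER_FILE_DATA };

struct mock_sb;
struct mock_sb_ops {
	int (*corrupted_range)(struct mock_sb *sb, long long off, size_t len);
};
struct mock_sb {
	const struct mock_sb_ops *ops;
	enum owner_kind owner;	/* what a (mock) rmap lookup would report */
	int shut_down;		/* set when metadata was hit */
	int procs_killed;	/* set when file data was hit */
};

struct mock_disk;
struct mock_disk_ops {
	int (*corrupted_range)(struct mock_disk *disk, long long off, size_t len);
};
struct mock_disk {
	const struct mock_disk_ops *ops;
	struct mock_sb *sb;	/* filesystem mounted on this disk, if any */
};

/* Filesystem layer: decide what to do based on who owns the range. */
static int mock_fs_corrupted_range(struct mock_sb *sb, long long off, size_t len)
{
	(void)off;
	(void)len;
	if (sb->owner == OWNER_METADATA) {
		sb->shut_down = 1;
		return -1;
	}
	sb->procs_killed = 1;
	return 0;
}

/* Block layer: find the superblock and forward the notification. */
static int mock_disk_corrupted_range(struct mock_disk *disk, long long off, size_t len)
{
	if (!disk->sb)
		return -2;
	assert(disk->sb->ops && disk->sb->ops->corrupted_range);
	return disk->sb->ops->corrupted_range(disk->sb, off, len);
}

/* Entry point, playing the role of the pagemap ->memory_failure() hook. */
static int mock_memory_failure(struct mock_disk *disk, long long off, size_t len)
{
	return disk->ops->corrupted_range(disk, off, len);
}

static const struct mock_sb_ops mock_sb_ops = {
	.corrupted_range = mock_fs_corrupted_range,
};
static const struct mock_disk_ops mock_disk_ops = {
	.corrupted_range = mock_disk_corrupted_range,
};
```

Each layer only knows the operations table of the layer below it, which
is why a stacked device can slot into the middle of the chain without
the memory-failure code or the filesystem having to change.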

The fsdax & reflink support for XFS is not contained in this patchset.

(Rebased on v5.11-rc5)
==

Shiyang Ruan (11):
pagemap: Introduce ->memory_failure()
blk: Introduce ->corrupted_range() for block device
fs: Introduce ->corrupted_range() for superblock
block_dev: Introduce bd_corrupted_range()
mm, fsdax: Refactor memory-failure handler for dax mapping
mm, pmem: Implement ->memory_failure() in pmem driver
pmem: Implement ->corrupted_range() for pmem driver
dm: Introduce ->rmap() to find bdev offset
md: Implement ->corrupted_range()
xfs: Implement ->corrupted_range() for XFS
fs/dax: Remove useless functions

block/genhd.c | 6 ++
drivers/md/dm-linear.c | 20 ++++
drivers/md/dm.c | 61 +++++++++++
drivers/nvdimm/pmem.c | 45 ++++++++
fs/block_dev.c | 47 ++++++++-
fs/dax.c | 63 ++++-------
fs/xfs/xfs_fsops.c | 5 +
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 112 ++++++++++++++++++++
include/linux/blkdev.h | 2 +
include/linux/dax.h | 1 +
include/linux/device-mapper.h | 5 +
include/linux/fs.h | 2 +
include/linux/genhd.h | 3 +
include/linux/memremap.h | 8 ++
include/linux/mm.h | 9 ++
mm/memory-failure.c | 190 +++++++++++++++++++++++-----------
17 files changed, 475 insertions(+), 105 deletions(-)

--
2.30.0




2021-02-08 11:20:45

by Ruan Shiyang

Subject: [PATCH v3 07/11] pmem: Implement ->corrupted_range() for pmem driver

Obtain the superblock of a pmem disk and call the filesystem's
->corrupted_range() to handle the corrupted data.
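
The offset arithmetic involved in going from a whole-disk byte offset to
an offset inside one partition can be sketched as follows. This is a
userspace illustration under stated assumptions: struct mock_part and
disk_to_part_offset() are hypothetical names, and only the 512-byte
sector size (SECTOR_SHIFT of 9) is taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Hypothetical partition covering sectors [start_sect, start_sect + nr_sects). */
struct mock_part {
	uint64_t start_sect;
	uint64_t nr_sects;
};

/*
 * Translate a byte offset on the whole disk into a byte offset inside
 * the partition. Returns 0 on success, -1 if the offset lies outside
 * the partition. The result is rounded down to a sector boundary.
 */
static int disk_to_part_offset(const struct mock_part *p, uint64_t disk_offset,
			       uint64_t *part_offset)
{
	uint64_t disk_sector = disk_offset >> SECTOR_SHIFT;

	assert(p && part_offset);
	if (disk_sector < p->start_sect ||
	    disk_sector >= p->start_sect + p->nr_sects)
		return -1;

	*part_offset = (disk_sector - p->start_sect) << SECTOR_SHIFT;
	return 0;
}
```

Because the offset passes through a sector number, any sub-sector
remainder in the input is dropped, which is harmless when the reported
range is page aligned.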

Signed-off-by: Shiyang Ruan <[email protected]>
---
block/genhd.c | 6 ++++++
drivers/nvdimm/pmem.c | 19 +++++++++++++++++++
include/linux/genhd.h | 1 +
3 files changed, 26 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 419548e92d82..fd7cf03b65a8 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -936,6 +936,12 @@ struct block_device *bdget_disk(struct gendisk *disk, int partno)
return bdev;
}

+struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector)
+{
+	return disk_map_sector_rcu(disk, sector);
+}
+EXPORT_SYMBOL(bdget_disk_sector);
+
/*
* print a full list of all partitions - intended for places where the root
* filesystem can't be mounted and thus to give the victim some idea of what
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c77c80e3d155..e38b9f9c7d97 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -253,6 +253,24 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
return blk_status_to_errno(rc);
}

+static int pmem_corrupted_range(struct gendisk *disk, struct block_device *bdev,
+				loff_t disk_offset, size_t len, void *data)
+{
+	loff_t bdev_offset;
+	sector_t disk_sector = disk_offset >> SECTOR_SHIFT;
+	int rc = -ENODEV;
+
+	bdev = bdget_disk_sector(disk, disk_sector);
+	if (!bdev)
+		return rc;
+
+	bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
+	rc = bd_corrupted_range(bdev, bdev_offset, bdev_offset, len, data);
+
+	bdput(bdev);
+	return rc;
+}
+
/* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
__weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn)
@@ -281,6 +299,7 @@ static const struct block_device_operations pmem_fops = {
.owner = THIS_MODULE,
.submit_bio = pmem_submit_bio,
.rw_page = pmem_rw_page,
+ .corrupted_range = pmem_corrupted_range,
};

static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 751cbd559bba..996f91b08d48 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -248,6 +248,7 @@ static inline void add_disk_no_queue_reg(struct gendisk *disk)

extern void del_gendisk(struct gendisk *gp);
extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
+extern struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector);

extern void set_disk_ro(struct gendisk *disk, int flag);

--
2.30.0



2021-02-08 11:23:08

by Ruan Shiyang

Subject: [PATCH v3 10/11] xfs: Implement ->corrupted_range() for XFS

This function is used to handle errors that may cause data loss in the
filesystem, such as a memory failure in fsdax mode.

If the rmap feature of XFS is enabled, we can query it to find the files
and metadata associated with the corrupted data. For now all we do
is kill processes with that file mapped into their address spaces, but
future patches could actually do something about corrupt metadata.

After that, the memory-failure code needs to notify the processes that
are using those files.

Only the data device is supported. The realtime device is not supported
for now.
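
The decision described above, made once per reverse-mapping record, can
be sketched in userspace C. All names here (struct mock_rmap_rec,
enum mock_action, classify()) are hypothetical, and the record is
reduced to a few booleans. Treating a file that is not cached in memory
as "nothing to notify" follows the intent stated in the patch comment
and is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of one reverse-mapping record. */
struct mock_rmap_rec {
	bool owner_is_inode;	/* false for free space, AG headers, btrees, ... */
	bool attr_fork;		/* extended-attribute blocks of an inode */
	bool bmbt_block;	/* block-mapping btree blocks of an inode */
	bool inode_incore;	/* is the owning inode currently cached? */
};

enum mock_action {
	ACT_SHUTDOWN,		/* metadata hit: shut the filesystem down */
	ACT_KILL_PROCS,		/* file data hit: notify the mapping processes */
	ACT_IGNORE,		/* file data hit, but the file is not in use */
};

/* Pick the action for one record that overlaps the corrupted range. */
static enum mock_action classify(const struct mock_rmap_rec *rec)
{
	assert(rec);
	if (!rec->owner_is_inode || rec->attr_fork || rec->bmbt_block)
		return ACT_SHUTDOWN;
	if (!rec->inode_incore)
		return ACT_IGNORE;
	return ACT_KILL_PROCS;
}
```

Anything that is not plain file data is treated as metadata, so the
conservative action (shutdown) is the default and only the well
understood case reaches the process-notification path.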

Signed-off-by: Shiyang Ruan <[email protected]>
---
fs/xfs/xfs_fsops.c | 5 ++
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 112 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 118 insertions(+)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 959ce91a3755..f03901a5c673 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -498,6 +498,11 @@ xfs_do_force_shutdown(
"Corruption of in-memory data detected. Shutting down filesystem");
if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
xfs_stack_trace();
+ } else if (flags & SHUTDOWN_CORRUPT_META) {
+ xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_CORRUPT,
+"Corruption of on-disk metadata detected. Shutting down filesystem");
+ if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
+ xfs_stack_trace();
} else if (logerror) {
xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_LOGERROR,
"Log I/O Error Detected. Shutting down filesystem");
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index dfa429b77ee2..8f0df67ffcc1 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -274,6 +274,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
#define SHUTDOWN_LOG_IO_ERROR 0x0002 /* write attempt to the log failed */
#define SHUTDOWN_FORCE_UMOUNT 0x0004 /* shutdown from a forced unmount */
#define SHUTDOWN_CORRUPT_INCORE 0x0008 /* corrupt in-memory data structures */
+#define SHUTDOWN_CORRUPT_META 0x0010 /* corrupt metadata on device */

/*
* Flags for xfs_mountfs
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 813be879a5e5..8906426a0f60 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,11 @@
#include "xfs_refcount_item.h"
#include "xfs_bmap_item.h"
#include "xfs_reflink.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bit.h"

#include <linux/magic.h>
#include <linux/fs_context.h>
@@ -1105,6 +1110,112 @@ xfs_fs_free_cached_objects(
return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
}

+static int
+xfs_corrupt_helper(
+	struct xfs_btree_cur *cur,
+	struct xfs_rmap_irec *rec,
+	void *data)
+{
+	struct xfs_inode *ip;
+	struct address_space *mapping;
+	int rc = 0;
+	int *flags = data;
+
+	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
+	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		// TODO check and try to fix metadata
+		rc = -EFSCORRUPTED;
+		xfs_force_shutdown(cur->bc_mp, SHUTDOWN_CORRUPT_META);
+	} else {
+		/*
+		 * Get files that incore, filter out others that are not in use.
+		 */
+		rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
+			      XFS_IGET_INCORE, 0, &ip);
+		if (rc || !ip)
+			return rc;
+		if (!VFS_I(ip)->i_mapping)
+			goto out;
+
+		mapping = VFS_I(ip)->i_mapping;
+		if (IS_DAX(VFS_I(ip)))
+			rc = mf_dax_mapping_kill_procs(mapping, rec->rm_offset,
+						       *flags);
+		else {
+			rc = -EIO;
+			mapping_set_error(mapping, rc);
+		}
+
+		// TODO try to fix data
+out:
+		xfs_irele(ip);
+	}
+
+	return rc;
+}
+
+static int
+xfs_fs_corrupted_range(
+	struct super_block *sb,
+	struct block_device *bdev,
+	loff_t offset,
+	size_t len,
+	void *data)
+{
+	struct xfs_mount *mp = XFS_M(sb);
+	struct xfs_trans *tp = NULL;
+	struct xfs_btree_cur *cur = NULL;
+	struct xfs_rmap_irec rmap_low, rmap_high;
+	struct xfs_buf *agf_bp = NULL;
+	xfs_fsblock_t fsbno = XFS_B_TO_FSB(mp, offset);
+	xfs_filblks_t bcnt = XFS_B_TO_FSB(mp, len);
+	xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	xfs_agblock_t agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	int error = 0;
+
+	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_bdev == bdev) {
+		xfs_warn(mp, "corrupted_range support not available for realtime device!");
+		return -EOPNOTSUPP;
+	}
+	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_bdev == bdev &&
+	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
+		return -EFSCORRUPTED;
+	}
+
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		xfs_warn(mp, "corrupted_range needs rmapbt enabled!");
+		return -EOPNOTSUPP;
+	}
+
+	error = xfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		return error;
+
+	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+	if (error)
+		goto out_cancel_tp;
+
+	cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
+
+	/* Construct a range for rmap query */
+	memset(&rmap_low, 0, sizeof(rmap_low));
+	memset(&rmap_high, 0xFF, sizeof(rmap_high));
+	rmap_low.rm_startblock = rmap_high.rm_startblock = agbno;
+	rmap_low.rm_blockcount = rmap_high.rm_blockcount = bcnt;
+
+	error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
+				     xfs_corrupt_helper, data);
+
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(tp, agf_bp);
+
+out_cancel_tp:
+	xfs_trans_cancel(tp);
+	return error;
+}
+
static const struct super_operations xfs_super_operations = {
.alloc_inode = xfs_fs_alloc_inode,
.destroy_inode = xfs_fs_destroy_inode,
@@ -1118,6 +1229,7 @@ static const struct super_operations xfs_super_operations = {
.show_options = xfs_fs_show_options,
.nr_cached_objects = xfs_fs_nr_cached_objects,
.free_cached_objects = xfs_fs_free_cached_objects,
+ .corrupted_range = xfs_fs_corrupted_range,
};

static int
--
2.30.0



2021-02-08 11:23:52

by Ruan Shiyang

Subject: [PATCH v3 09/11] md: Implement ->corrupted_range()

With the support of ->rmap(), it is possible to obtain the superblock on
a mapped device.
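
The first step on a mapped device is to find which target is backed by
the sector that failed. A minimal userspace sketch of that containment
test follows. struct mock_target and find_target() are hypothetical
names, and devices are identified by a plain integer rather than a
struct block_device pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical table entry: one target backed by sectors
 * [start, start + count) of the underlying device dev_id.
 */
struct mock_target {
	int dev_id;
	uint64_t start;
	uint64_t count;
};

/*
 * Return the index of the first target whose backing range on dev_id
 * contains sector, or -1 if no target uses that sector.
 */
static int find_target(const struct mock_target *tbl, size_t n,
		       int dev_id, uint64_t sector)
{
	assert(tbl || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].dev_id == dev_id &&
		    tbl[i].start <= sector &&
		    sector < tbl[i].start + tbl[i].count)
			return (int)i;
	}
	return -1;
}
```

The range is half-open, so a sector equal to start + count belongs to
the next target (if any) and not to this one.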

Signed-off-by: Shiyang Ruan <[email protected]>
---
drivers/md/dm.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7bac564f3faa..31b0c340b695 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -507,6 +507,66 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
#define dm_blk_report_zones NULL
#endif /* CONFIG_BLK_DEV_ZONED */

+struct corrupted_hit_info {
+	struct block_device *bdev;
+	sector_t offset;
+};
+
+static int dm_blk_corrupted_hit(struct dm_target *ti, struct dm_dev *dev,
+				sector_t start, sector_t count, void *data)
+{
+	struct corrupted_hit_info *bc = data;
+
+	return bc->bdev == (void *)dev->bdev &&
+			(start <= bc->offset && bc->offset < start + count);
+}
+
+struct corrupted_do_info {
+	size_t length;
+	void *data;
+};
+
+static int dm_blk_corrupted_do(struct dm_target *ti, struct block_device *bdev,
+			       sector_t disk_sect, void *data)
+{
+	struct corrupted_do_info *bc = data;
+	loff_t disk_off = to_bytes(disk_sect);
+	loff_t bdev_off = to_bytes(disk_sect - get_start_sect(bdev));
+
+	return bd_corrupted_range(bdev, disk_off, bdev_off, bc->length, bc->data);
+}
+
+static int dm_blk_corrupted_range(struct gendisk *disk,
+				  struct block_device *target_bdev,
+				  loff_t target_offset, size_t len, void *data)
+{
+	struct mapped_device *md = disk->private_data;
+	struct dm_table *map;
+	struct dm_target *ti;
+	sector_t target_sect = to_sector(target_offset);
+	struct corrupted_hit_info hi = {target_bdev, target_sect};
+	struct corrupted_do_info di = {len, data};
+	int srcu_idx, i, rc = -ENODEV;
+
+	map = dm_get_live_table(md, &srcu_idx);
+	if (!map)
+		return rc;
+
+	for (i = 0; i < dm_table_get_num_targets(map); i++) {
+		ti = dm_table_get_target(map, i);
+		if (!(ti->type->iterate_devices && ti->type->rmap))
+			continue;
+		if (!ti->type->iterate_devices(ti, dm_blk_corrupted_hit, &hi))
+			continue;
+
+		rc = ti->type->rmap(ti, target_sect, dm_blk_corrupted_do, &di);
+		break;
+	}
+
+	dm_put_live_table(md, srcu_idx);
+	return rc;
+}
+
static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
struct block_device **bdev)
{
@@ -3062,6 +3122,7 @@ static const struct block_device_operations dm_blk_dops = {
.getgeo = dm_blk_getgeo,
.report_zones = dm_blk_report_zones,
.pr_ops = &dm_pr_ops,
+ .corrupted_range = dm_blk_corrupted_range,
.owner = THIS_MODULE
};

--
2.30.0



2021-02-08 11:26:18

by Ruan Shiyang

Subject: [PATCH v3 08/11] dm: Introduce ->rmap() to find bdev offset

A pmem device can be a target of a mapped device. In order to obtain the
superblock on the mapped device, we introduce ->rmap() to translate an
offset on the target device into an offset on the mapped (dm) device.

Currently, it is implemented only for the linear target, where the
translation is straightforward. Other targets will be supported in the
future. However, some targets may never support it because their mapping
is non-linear.
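
For a linear mapping the reverse translation is just the inverse of the
forward one, which the following userspace sketch illustrates. The
names struct mock_linear, linear_forward() and linear_reverse() are
hypothetical; the fields loosely play the role of the target's start on
the mapped device, its length, and its start on the backing device.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical linear target: sectors [begin, begin + len) of the mapped
 * device are backed by sectors [dev_start, dev_start + len) of the
 * underlying device.
 */
struct mock_linear {
	uint64_t begin;
	uint64_t len;
	uint64_t dev_start;
};

/* Forward mapping: mapped-device sector -> backing-device sector. */
static uint64_t linear_forward(const struct mock_linear *lt, uint64_t md_sector)
{
	assert(md_sector >= lt->begin && md_sector - lt->begin < lt->len);
	return lt->dev_start + (md_sector - lt->begin);
}

/* Reverse mapping: backing-device sector -> mapped-device sector. */
static uint64_t linear_reverse(const struct mock_linear *lt, uint64_t dev_sector)
{
	assert(dev_sector >= lt->dev_start && dev_sector - lt->dev_start < lt->len);
	return lt->begin + (dev_sector - lt->dev_start);
}
```

The two functions are exact inverses inside the target's range. A
target that spreads or transforms data across devices has no such
one-line inverse, which is the difficulty the description mentions.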

Signed-off-by: Shiyang Ruan <[email protected]>
---
drivers/md/dm-linear.c | 20 ++++++++++++++++++++
include/linux/device-mapper.h | 5 +++++
2 files changed, 25 insertions(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 00774b5d7668..90fdb4700afd 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -5,6 +5,7 @@
*/

#include "dm.h"
+#include "dm-core.h"
#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
@@ -119,6 +120,24 @@ static void linear_status(struct dm_target *ti, status_type_t type,
}
}

+static int linear_rmap(struct dm_target *ti, sector_t offset,
+		       rmap_callout_fn fn, void *data)
+{
+	struct linear_c *lc = (struct linear_c *) ti->private;
+	struct mapped_device *md = ti->table->md;
+	struct block_device *bdev;
+	sector_t disk_sect = offset - dm_target_offset(ti, lc->start);
+	int rc = -ENODEV;
+
+	bdev = bdget_disk_sector(md->disk, offset);
+	if (!bdev)
+		return rc;
+
+	rc = fn(ti, bdev, disk_sect, data);
+	bdput(bdev);
+	return rc;
+}
+
static int linear_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
{
struct linear_c *lc = (struct linear_c *) ti->private;
@@ -238,6 +257,7 @@ static struct target_type linear_target = {
.ctr = linear_ctr,
.dtr = linear_dtr,
.map = linear_map,
+ .rmap = linear_rmap,
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 61a66fb8ebb3..c5cd1009a08d 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -58,6 +58,10 @@ typedef void (*dm_dtr_fn) (struct dm_target *ti);
* = 2: The target wants to push back the io
*/
typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio);
+typedef int (*rmap_callout_fn) (struct dm_target *ti, struct block_device *bdev,
+ sector_t sect, void *data);
+typedef int (*dm_rmap_fn) (struct dm_target *ti, sector_t offset,
+ rmap_callout_fn fn, void *data);
typedef int (*dm_clone_and_map_request_fn) (struct dm_target *ti,
struct request *rq,
union map_info *map_context,
@@ -175,6 +179,7 @@ struct target_type {
dm_ctr_fn ctr;
dm_dtr_fn dtr;
dm_map_fn map;
+ dm_rmap_fn rmap;
dm_clone_and_map_request_fn clone_and_map_rq;
dm_release_clone_request_fn release_clone_rq;
dm_endio_fn end_io;
--
2.30.0



2021-02-10 13:48:26

by Christoph Hellwig

Subject: Re: [PATCH v3 10/11] xfs: Implement ->corrupted_range() for XFS


> + if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> + (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + // TODO check and try to fix metadata
> + rc = -EFSCORRUPTED;
> + xfs_force_shutdown(cur->bc_mp, SHUTDOWN_CORRUPT_META);

Just return early here so that we can avoid the else later.

> + /*
> + * Get files that incore, filter out others that are not in use.
> + */
> + rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
> + XFS_IGET_INCORE, 0, &ip);

Can we rename rc to error?

> + if (rc || !ip)
> + return rc;

No need to check for ip here.

> + if (!VFS_I(ip)->i_mapping)
> + goto out;

This can't happen either.

> +
> + mapping = VFS_I(ip)->i_mapping;
> + if (IS_DAX(VFS_I(ip)))
> + rc = mf_dax_mapping_kill_procs(mapping, rec->rm_offset,
> + *flags);
> + else {
> + rc = -EIO;
> + mapping_set_error(mapping, rc);
> + }

By passing the method directly to the DAX device we should never get
this called for the non-DAX case.