2022-04-27 09:48:59

by Nitesh Shetty

Subject: [PATCH v4 00/10] Add Copy offload support

The patch series covers the points discussed in the November 2021 virtual call
[LSF/MM/BPF TOPIC] Storage: Copy Offload [0].
We have covered the initially agreed requirements in this patchset.
The patchset borrows Mikulas's token-based approach for the two-bdev
implementation.

Overall, the series supports:

1. Driver
- NVMe Copy command (single NS), including support in nvme-target (for
block and file backends)

2. Block layer
- Block-generic copy (REQ_COPY flag), with an interface accommodating
two block devices and multiple source/destination ranges
- Emulation when native offload is absent
- dm-linear support (for cases not requiring a split)

3. User-interface
- new ioctl
- copy_file_range for zonefs

4. In-kernel user
- dm-kcopyd
- copy_file_range in zonefs

For zonefs copy_file_range: it seems we cannot leverage fstests here. Limited
testing has been done at this point, using a custom application for unit testing.

We would appreciate input on the plumbing and on how to test this further.
Perhaps some of it can be discussed during LSF/MM too.
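
For reference, here is a minimal sketch of how an in-kernel user drives the
proposed two-bdev interface. The struct range_entry fields and the
blkdev_issue_copy() signature below match the patches in this series; the
wrapper function itself is only illustrative:

static int copy_one_range(struct block_device *src_bdev, sector_t src,
			  struct block_device *dst_bdev, sector_t dst,
			  size_t len)
{
	struct range_entry *rlist;
	int ret;

	rlist = kzalloc(sizeof(*rlist), GFP_KERNEL);
	if (!rlist)
		return -ENOMEM;

	/* Offsets and length are in bytes, as in the zonefs/nvmet callers. */
	rlist[0].src = src << SECTOR_SHIFT;
	rlist[0].dst = dst << SECTOR_SHIFT;
	rlist[0].len = len;
	rlist[0].comp_len = 0;	/* updated with the completed length */

	ret = blkdev_issue_copy(src_bdev, 1, rlist, dst_bdev, GFP_KERNEL);

	kfree(rlist);
	return ret;
}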

[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/

Changes in v4:
- added copy_file_range support for zonefs
- added documentation about new sysfs entries
- incorporated review comments on v3
- minor fixes


Arnav Dawn (2):
nvmet: add copy command support for bdev and file ns
fs: add support for copy file range in zonefs

Nitesh Shetty (7):
block: Introduce queue limits for copy-offload support
block: Add copy offload support infrastructure
block: Introduce a new ioctl for copy
block: add emulation for copy
nvme: add copy offload support
dm: Add support for copy offload.
dm: Enable copy offload for dm-linear target

SelvaKumar S (1):
dm kcopyd: use copy offload support

Documentation/ABI/stable/sysfs-block | 83 +++++++
block/blk-lib.c | 358 +++++++++++++++++++++++++++
block/blk-map.c | 2 +-
block/blk-settings.c | 59 +++++
block/blk-sysfs.c | 138 +++++++++++
block/blk.h | 2 +
block/ioctl.c | 32 +++
drivers/md/dm-kcopyd.c | 55 +++-
drivers/md/dm-linear.c | 1 +
drivers/md/dm-table.c | 45 ++++
drivers/md/dm.c | 6 +
drivers/nvme/host/core.c | 116 ++++++++-
drivers/nvme/host/fc.c | 4 +
drivers/nvme/host/nvme.h | 7 +
drivers/nvme/host/pci.c | 25 ++
drivers/nvme/host/rdma.c | 6 +
drivers/nvme/host/tcp.c | 14 ++
drivers/nvme/host/trace.c | 19 ++
drivers/nvme/target/admin-cmd.c | 8 +-
drivers/nvme/target/io-cmd-bdev.c | 65 +++++
drivers/nvme/target/io-cmd-file.c | 49 ++++
fs/zonefs/super.c | 178 ++++++++++++-
fs/zonefs/zonefs.h | 1 +
include/linux/blk_types.h | 21 ++
include/linux/blkdev.h | 17 ++
include/linux/device-mapper.h | 5 +
include/linux/nvme.h | 43 +++-
include/uapi/linux/fs.h | 23 ++
28 files changed, 1367 insertions(+), 15 deletions(-)


base-commit: e7d6987e09a328d4a949701db40ef63fbb970670
--
2.35.1.500.gb896f729e2


2022-04-27 10:13:57

by Damien Le Moal

Subject: Re: [PATCH v4 00/10] Add Copy offload support

On 4/26/22 19:12, Nitesh Shetty wrote:
> The patch series covers the points discussed in November 2021 virtual call
> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
> We have covered the Initial agreed requirements in this patchset.
> Patchset borrows Mikulas's token based approach for 2 bdev
> implementation.
>
> Overall series supports –
>
> 1. Driver
> - NVMe Copy command (single NS), including support in nvme-target (for
> block and file backend)

It would also be nice to have copy offload emulation in null_blk for testing.

>
> 2. Block layer
> - Block-generic copy (REQ_COPY flag), with interface accommodating
> two block-devs, and multi-source/destination interface
> - Emulation, when offload is natively absent
> - dm-linear support (for cases not requiring split)
>
> 3. User-interface
> - new ioctl
> - copy_file_range for zonefs
>
> 4. In-kernel user
> - dm-kcopyd
> - copy_file_range in zonefs
>
> For zonefs copy_file_range - Seems we cannot leverage fstests here. Limited
> testing is done at this point using a custom application for unit testing.
>
> Appreciate the inputs on plumbing and how to test this further?
> Perhaps some of it can be discussed during LSF/MM too.
>
> [0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
>
> Changes in v4:
> - added copy_file_range support for zonefs
> - added documentation about new sysfs entries
> - incorporated review comments on v3
> - minor fixes
>
>
> Arnav Dawn (2):
> nvmet: add copy command support for bdev and file ns
> fs: add support for copy file range in zonefs
>
> Nitesh Shetty (7):
> block: Introduce queue limits for copy-offload support
> block: Add copy offload support infrastructure
> block: Introduce a new ioctl for copy
> block: add emulation for copy
> nvme: add copy offload support
> dm: Add support for copy offload.
> dm: Enable copy offload for dm-linear target
>
> SelvaKumar S (1):
> dm kcopyd: use copy offload support
>
> Documentation/ABI/stable/sysfs-block | 83 +++++++
> block/blk-lib.c | 358 +++++++++++++++++++++++++++
> block/blk-map.c | 2 +-
> block/blk-settings.c | 59 +++++
> block/blk-sysfs.c | 138 +++++++++++
> block/blk.h | 2 +
> block/ioctl.c | 32 +++
> drivers/md/dm-kcopyd.c | 55 +++-
> drivers/md/dm-linear.c | 1 +
> drivers/md/dm-table.c | 45 ++++
> drivers/md/dm.c | 6 +
> drivers/nvme/host/core.c | 116 ++++++++-
> drivers/nvme/host/fc.c | 4 +
> drivers/nvme/host/nvme.h | 7 +
> drivers/nvme/host/pci.c | 25 ++
> drivers/nvme/host/rdma.c | 6 +
> drivers/nvme/host/tcp.c | 14 ++
> drivers/nvme/host/trace.c | 19 ++
> drivers/nvme/target/admin-cmd.c | 8 +-
> drivers/nvme/target/io-cmd-bdev.c | 65 +++++
> drivers/nvme/target/io-cmd-file.c | 49 ++++
> fs/zonefs/super.c | 178 ++++++++++++-
> fs/zonefs/zonefs.h | 1 +
> include/linux/blk_types.h | 21 ++
> include/linux/blkdev.h | 17 ++
> include/linux/device-mapper.h | 5 +
> include/linux/nvme.h | 43 +++-
> include/uapi/linux/fs.h | 23 ++
> 28 files changed, 1367 insertions(+), 15 deletions(-)
>
>
> base-commit: e7d6987e09a328d4a949701db40ef63fbb970670


--
Damien Le Moal
Western Digital Research

2022-04-27 10:16:10

by Nitesh Shetty

Subject: [PATCH v4 06/10] nvmet: add copy command support for bdev and file ns

From: Arnav Dawn <[email protected]>

Add support for handling the copy command on the NVMe target.
For bdev-ns we call into blkdev_issue_copy, which the block layer
completes either by an offloaded copy request to the backing bdev or by
emulating the request.

For file-ns we call vfs_copy_file_range to service the request.

Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS field.

Signed-off-by: Arnav Dawn <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
drivers/nvme/host/tcp.c | 2 +-
drivers/nvme/target/admin-cmd.c | 8 +++-
drivers/nvme/target/io-cmd-bdev.c | 65 +++++++++++++++++++++++++++++++
drivers/nvme/target/io-cmd-file.c | 49 +++++++++++++++++++++++
4 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4e4cdcf8210a..2c77e5b596bb 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2395,7 +2395,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
return ret;

if (unlikely((rq->cmd_flags & REQ_COPY) && (req_op(rq) == REQ_OP_READ))) {
- blk_mq_start_request(req);
+ blk_mq_start_request(rq);
return BLK_STS_OK;
}

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 397daaf51f1b..db32debdb528 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -431,8 +431,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
- NVME_CTRL_ONCS_WRITE_ZEROES);
-
+ NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;

@@ -534,6 +533,11 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)

if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+ else {
+ id->msrc = to0based(BIO_MAX_VECS);
+ id->mssrl = cpu_to_le16(BIO_MAX_VECS << (PAGE_SHIFT - SECTOR_SHIFT));
+ id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl) * BIO_MAX_VECS);
+ }

/*
* We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 27a72504d31c..18666d36423f 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -47,6 +47,30 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(ql->io_opt / ql->logical_block_size);
+
+ /*Copy limits*/
+ if (ql->max_copy_sectors) {
+ id->mcl = cpu_to_le32((ql->max_copy_sectors << 9) / ql->logical_block_size);
+ id->mssrl = cpu_to_le16((ql->max_copy_range_sectors << 9) /
+ ql->logical_block_size);
+ id->msrc = to0based(ql->max_copy_nr_ranges);
+ } else {
+ if (ql->zoned == BLK_ZONED_NONE) {
+ id->msrc = to0based(BIO_MAX_VECS);
+ id->mssrl = cpu_to_le16(
+ (BIO_MAX_VECS << PAGE_SHIFT) / ql->logical_block_size);
+ id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl) * BIO_MAX_VECS);
+#ifdef CONFIG_BLK_DEV_ZONED
+ } else {
+ /* TODO: get right values for zoned device */
+ id->msrc = to0based(BIO_MAX_VECS);
+ id->mssrl = cpu_to_le16(min((BIO_MAX_VECS << PAGE_SHIFT),
+ ql->chunk_sectors) / ql->logical_block_size);
+ id->mcl = cpu_to_le32(min(le16_to_cpu(id->mssrl) * BIO_MAX_VECS,
+ ql->chunk_sectors));
+#endif
+ }
+ }
}

void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -442,6 +466,43 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
}
}

+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+ struct nvme_copy_range range;
+ struct range_entry *rlist;
+ struct nvme_command *cmnd = req->cmd;
+ sector_t dest, dest_off = 0;
+ int ret, id, nr_range;
+
+ nr_range = cmnd->copy.nr_range + 1;
+ dest = le64_to_cpu(cmnd->copy.sdlba) << req->ns->blksize_shift;
+ rlist = kmalloc_array(nr_range, sizeof(*rlist), GFP_KERNEL);
+
+ for (id = 0 ; id < nr_range; id++) {
+ ret = nvmet_copy_from_sgl(req, id * sizeof(range), &range, sizeof(range));
+ if (ret)
+ goto out;
+
+ rlist[id].dst = dest + dest_off;
+ rlist[id].src = le64_to_cpu(range.slba) << req->ns->blksize_shift;
+ rlist[id].len = (le16_to_cpu(range.nlb) + 1) << req->ns->blksize_shift;
+ rlist[id].comp_len = 0;
+ dest_off += rlist[id].len;
+ }
+ ret = blkdev_issue_copy(req->ns->bdev, nr_range, rlist, req->ns->bdev, GFP_KERNEL);
+ if (ret) {
+ for (id = 0 ; id < nr_range; id++) {
+ if (rlist[id].len != rlist[id].comp_len) {
+ req->cqe->result.u32 = cpu_to_le32(id);
+ break;
+ }
+ }
+ }
+out:
+ kfree(rlist);
+ nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+}
+
u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
{
switch (req->cmd->common.opcode) {
@@ -460,6 +521,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
case nvme_cmd_write_zeroes:
req->execute = nvmet_bdev_execute_write_zeroes;
return 0;
+ case nvme_cmd_copy:
+ req->execute = nvmet_bdev_execute_copy;
+ return 0;
+
default:
return nvmet_report_invalid_opcode(req);
}
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index f3d58abf11e0..fe26a9120436 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -338,6 +338,46 @@ static void nvmet_file_dsm_work(struct work_struct *w)
}
}

+static void nvmet_file_copy_work(struct work_struct *w)
+{
+ struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+ int nr_range;
+ loff_t pos;
+ struct nvme_command *cmnd = req->cmd;
+ int ret = 0, len = 0, src, id;
+
+ nr_range = cmnd->copy.nr_range + 1;
+ pos = le64_to_cpu(req->cmd->copy.sdlba) << req->ns->blksize_shift;
+ if (unlikely(pos + req->transfer_len > req->ns->size)) {
+ nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+ return;
+ }
+
+ for (id = 0 ; id < nr_range; id++) {
+ struct nvme_copy_range range;
+
+ ret = nvmet_copy_from_sgl(req, id * sizeof(range), &range,
+ sizeof(range));
+ if (ret)
+ goto out;
+
+ len = (le16_to_cpu(range.nlb) + 1) << (req->ns->blksize_shift);
+ src = (le64_to_cpu(range.slba) << (req->ns->blksize_shift));
+ ret = vfs_copy_file_range(req->ns->file, src, req->ns->file, pos, len, 0);
+out:
+ if (ret != len) {
+ pos += ret;
+ req->cqe->result.u32 = cpu_to_le32(id);
+ nvmet_req_complete(req, ret < 0 ? errno_to_nvme_status(req, ret) :
+ errno_to_nvme_status(req, -EIO));
+ return;
+
+ } else
+ pos += len;
+}
+ nvmet_req_complete(req, ret);
+
+}
static void nvmet_file_execute_dsm(struct nvmet_req *req)
{
if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
@@ -346,6 +386,12 @@ static void nvmet_file_execute_dsm(struct nvmet_req *req)
queue_work(nvmet_wq, &req->f.work);
}

+static void nvmet_file_execute_copy(struct nvmet_req *req)
+{
+ INIT_WORK(&req->f.work, nvmet_file_copy_work);
+ schedule_work(&req->f.work);
+}
+
static void nvmet_file_write_zeroes_work(struct work_struct *w)
{
struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
@@ -392,6 +438,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
case nvme_cmd_write_zeroes:
req->execute = nvmet_file_execute_write_zeroes;
return 0;
+ case nvme_cmd_copy:
+ req->execute = nvmet_file_execute_copy;
+ return 0;
default:
return nvmet_report_invalid_opcode(req);
}
--
2.35.1.500.gb896f729e2

2022-04-27 10:40:25

by Damien Le Moal

Subject: Re: [PATCH v4 00/10] Add Copy offload support

On 4/26/22 19:12, Nitesh Shetty wrote:
> The patch series covers the points discussed in November 2021 virtual call
> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
> We have covered the Initial agreed requirements in this patchset.
> Patchset borrows Mikulas's token based approach for 2 bdev
> implementation.
>
> Overall series supports –
>
> 1. Driver
> - NVMe Copy command (single NS), including support in nvme-target (for
> block and file backend)
>
> 2. Block layer
> - Block-generic copy (REQ_COPY flag), with interface accommodating
> two block-devs, and multi-source/destination interface
> - Emulation, when offload is natively absent
> - dm-linear support (for cases not requiring split)
>
> 3. User-interface
> - new ioctl
> - copy_file_range for zonefs
>
> 4. In-kernel user
> - dm-kcopyd
> - copy_file_range in zonefs
>
> For zonefs copy_file_range - Seems we cannot leverage fstests here. Limited
> testing is done at this point using a custom application for unit testing.

https://github.com/westerndigitalcorporation/zonefs-tools

./configure --with-tests
make
sudo make install

Then run tests/zonefs-tests.sh

Adding a test case is simple. Just add script files under tests/scripts.

I just realized that the README file of this project is not documenting
this. I will update it.

>
> Appreciate the inputs on plumbing and how to test this further?
> Perhaps some of it can be discussed during LSF/MM too.
>
> [0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
>
> Changes in v4:
> - added copy_file_range support for zonefs
> - added documentation about new sysfs entries
> - incorporated review comments on v3
> - minor fixes
>
>
> Arnav Dawn (2):
> nvmet: add copy command support for bdev and file ns
> fs: add support for copy file range in zonefs
>
> Nitesh Shetty (7):
> block: Introduce queue limits for copy-offload support
> block: Add copy offload support infrastructure
> block: Introduce a new ioctl for copy
> block: add emulation for copy
> nvme: add copy offload support
> dm: Add support for copy offload.
> dm: Enable copy offload for dm-linear target
>
> SelvaKumar S (1):
> dm kcopyd: use copy offload support
>
> Documentation/ABI/stable/sysfs-block | 83 +++++++
> block/blk-lib.c | 358 +++++++++++++++++++++++++++
> block/blk-map.c | 2 +-
> block/blk-settings.c | 59 +++++
> block/blk-sysfs.c | 138 +++++++++++
> block/blk.h | 2 +
> block/ioctl.c | 32 +++
> drivers/md/dm-kcopyd.c | 55 +++-
> drivers/md/dm-linear.c | 1 +
> drivers/md/dm-table.c | 45 ++++
> drivers/md/dm.c | 6 +
> drivers/nvme/host/core.c | 116 ++++++++-
> drivers/nvme/host/fc.c | 4 +
> drivers/nvme/host/nvme.h | 7 +
> drivers/nvme/host/pci.c | 25 ++
> drivers/nvme/host/rdma.c | 6 +
> drivers/nvme/host/tcp.c | 14 ++
> drivers/nvme/host/trace.c | 19 ++
> drivers/nvme/target/admin-cmd.c | 8 +-
> drivers/nvme/target/io-cmd-bdev.c | 65 +++++
> drivers/nvme/target/io-cmd-file.c | 49 ++++
> fs/zonefs/super.c | 178 ++++++++++++-
> fs/zonefs/zonefs.h | 1 +
> include/linux/blk_types.h | 21 ++
> include/linux/blkdev.h | 17 ++
> include/linux/device-mapper.h | 5 +
> include/linux/nvme.h | 43 +++-
> include/uapi/linux/fs.h | 23 ++
> 28 files changed, 1367 insertions(+), 15 deletions(-)
>
>
> base-commit: e7d6987e09a328d4a949701db40ef63fbb970670


--
Damien Le Moal
Western Digital Research

2022-04-27 10:46:13

by Damien Le Moal

Subject: Re: [PATCH v4 00/10] Add Copy offload support

On 4/26/22 19:12, Nitesh Shetty wrote:
> The patch series covers the points discussed in November 2021 virtual call
> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
> We have covered the Initial agreed requirements in this patchset.
> Patchset borrows Mikulas's token based approach for 2 bdev
> implementation.

Please reduce the distribution list. List servers (and email clients) are
complaining about it being too large.

>
> Overall series supports –
>
> 1. Driver
> - NVMe Copy command (single NS), including support in nvme-target (for
> block and file backend)
>
> 2. Block layer
> - Block-generic copy (REQ_COPY flag), with interface accommodating
> two block-devs, and multi-source/destination interface
> - Emulation, when offload is natively absent
> - dm-linear support (for cases not requiring split)
>
> 3. User-interface
> - new ioctl
> - copy_file_range for zonefs
>
> 4. In-kernel user
> - dm-kcopyd
> - copy_file_range in zonefs
>
> For zonefs copy_file_range - Seems we cannot leverage fstests here. Limited
> testing is done at this point using a custom application for unit testing.
>
> Appreciate the inputs on plumbing and how to test this further?
> Perhaps some of it can be discussed during LSF/MM too.
>
> [0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
>
> Changes in v4:
> - added copy_file_range support for zonefs
> - added documentation about new sysfs entries
> - incorporated review comments on v3
> - minor fixes
>
>
> Arnav Dawn (2):
> nvmet: add copy command support for bdev and file ns
> fs: add support for copy file range in zonefs
>
> Nitesh Shetty (7):
> block: Introduce queue limits for copy-offload support
> block: Add copy offload support infrastructure
> block: Introduce a new ioctl for copy
> block: add emulation for copy
> nvme: add copy offload support
> dm: Add support for copy offload.
> dm: Enable copy offload for dm-linear target
>
> SelvaKumar S (1):
> dm kcopyd: use copy offload support
>
> Documentation/ABI/stable/sysfs-block | 83 +++++++
> block/blk-lib.c | 358 +++++++++++++++++++++++++++
> block/blk-map.c | 2 +-
> block/blk-settings.c | 59 +++++
> block/blk-sysfs.c | 138 +++++++++++
> block/blk.h | 2 +
> block/ioctl.c | 32 +++
> drivers/md/dm-kcopyd.c | 55 +++-
> drivers/md/dm-linear.c | 1 +
> drivers/md/dm-table.c | 45 ++++
> drivers/md/dm.c | 6 +
> drivers/nvme/host/core.c | 116 ++++++++-
> drivers/nvme/host/fc.c | 4 +
> drivers/nvme/host/nvme.h | 7 +
> drivers/nvme/host/pci.c | 25 ++
> drivers/nvme/host/rdma.c | 6 +
> drivers/nvme/host/tcp.c | 14 ++
> drivers/nvme/host/trace.c | 19 ++
> drivers/nvme/target/admin-cmd.c | 8 +-
> drivers/nvme/target/io-cmd-bdev.c | 65 +++++
> drivers/nvme/target/io-cmd-file.c | 49 ++++
> fs/zonefs/super.c | 178 ++++++++++++-
> fs/zonefs/zonefs.h | 1 +
> include/linux/blk_types.h | 21 ++
> include/linux/blkdev.h | 17 ++
> include/linux/device-mapper.h | 5 +
> include/linux/nvme.h | 43 +++-
> include/uapi/linux/fs.h | 23 ++
> 28 files changed, 1367 insertions(+), 15 deletions(-)
>
>
> base-commit: e7d6987e09a328d4a949701db40ef63fbb970670


--
Damien Le Moal
Western Digital Research

2022-04-27 11:18:49

by Nitesh Shetty

Subject: [PATCH v4 10/10] fs: add support for copy file range in zonefs

From: Arnav Dawn <[email protected]>

copy_file_range is implemented using copy offload;
copy offloading to the device is enabled by default.
To disable copy offloading, mount with the "no_copy_offload" mount option.
At present copy offload is only used if the source and destination files
are on the same block device; otherwise the copy is completed by the
generic copy_file_range path.

copy_file_range is implemented as follows:
- write out pending writes on the src and dest files
- drop the page cache for the dest file if it is a conventional zone
- copy the range using the offload
- update the dest file info

For all failure cases we fall back to the generic copy_file_range path.
At present this implementation does not support conventional zone aggregation.

Signed-off-by: Arnav Dawn <[email protected]>
---
fs/zonefs/super.c | 178 ++++++++++++++++++++++++++++++++++++++++++++-
fs/zonefs/zonefs.h | 1 +
2 files changed, 178 insertions(+), 1 deletion(-)

diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index b3b0b71fdf6c..60563b592bf2 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -901,6 +901,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
else
ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
&zonefs_write_dio_ops, 0, 0);
+
if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
(ret > 0 || ret == -EIOCBQUEUED)) {
if (ret > 0)
@@ -1189,6 +1190,171 @@ static int zonefs_file_release(struct inode *inode, struct file *file)
return 0;
}

+static int zonefs_is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
+ loff_t src_off, loff_t dst_off, size_t len)
+{
+ loff_t size, endoff;
+
+ size = i_size_read(src_inode);
+ /* Don't copy beyond source file EOF. */
+ if (src_off + len > size) {
+ zonefs_err(src_inode->i_sb, "Copy beyond EOF (%llu + %zu > %llu)\n",
+ src_off, len, size);
+ return -EOPNOTSUPP;
+ }
+
+ endoff = dst_off + len;
+ if (inode_newsize_ok(dst_inode, endoff))
+ return -EOPNOTSUPP;
+
+
+ return 0;
+}
+static ssize_t __zonefs_send_copy(struct zonefs_inode_info *src_zi, loff_t src_off,
+ struct zonefs_inode_info *dst_zi, loff_t dst_off, size_t len)
+{
+ struct block_device *src_bdev = src_zi->i_vnode.i_sb->s_bdev;
+ struct block_device *dst_bdev = dst_zi->i_vnode.i_sb->s_bdev;
+ struct range_entry *rlist;
+ int ret = -EIO;
+
+ rlist = kmalloc(sizeof(*rlist), GFP_KERNEL);
+ rlist[0].dst = (dst_zi->i_zsector << SECTOR_SHIFT) + dst_off;
+ rlist[0].src = (src_zi->i_zsector << SECTOR_SHIFT) + src_off;
+ rlist[0].len = len;
+ rlist[0].comp_len = 0;
+ ret = blkdev_issue_copy(src_bdev, 1, rlist, dst_bdev, GFP_KERNEL);
+ if (ret) {
+ if (rlist[0].comp_len != len) {
+ ret = rlist[0].comp_len;
+ kfree(rlist);
+ return ret;
+ }
+ }
+ kfree(rlist);
+ return len;
+}
+static ssize_t __zonefs_copy_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off,
+ size_t len, unsigned int flags)
+{
+ struct inode *src_inode = file_inode(src_file);
+ struct inode *dst_inode = file_inode(dst_file);
+ struct zonefs_inode_info *src_zi = ZONEFS_I(src_inode);
+ struct zonefs_inode_info *dst_zi = ZONEFS_I(dst_inode);
+ struct block_device *src_bdev = src_inode->i_sb->s_bdev;
+ struct block_device *dst_bdev = dst_inode->i_sb->s_bdev;
+ struct super_block *src_sb = src_inode->i_sb;
+ struct zonefs_sb_info *src_sbi = ZONEFS_SB(src_sb);
+ struct super_block *dst_sb = dst_inode->i_sb;
+ struct zonefs_sb_info *dst_sbi = ZONEFS_SB(dst_sb);
+ ssize_t ret = -EIO, bytes;
+
+ if (src_bdev != dst_bdev) {
+ zonefs_err(src_sb, "Copying files across two devices\n");
+ return -EXDEV;
+ }
+
+ /*
+ * Some of the checks below will return -EOPNOTSUPP,
+ * which will force a generic copy
+ */
+
+ if (!(src_sbi->s_mount_opts & ZONEFS_MNTOPT_COPY_FILE)
+ || !(dst_sbi->s_mount_opts & ZONEFS_MNTOPT_COPY_FILE))
+ return -EOPNOTSUPP;
+
+ /* Start by sync'ing the source and destination files for conv zones */
+ if (src_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
+ ret = file_write_and_wait_range(src_file, src_off, (src_off + len));
+ if (ret < 0) {
+ zonefs_err(src_sb, "failed to write source file (%zd)\n", ret);
+ goto out;
+ }
+ }
+ if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
+ ret = file_write_and_wait_range(dst_file, dst_off, (dst_off + len));
+ if (ret < 0) {
+ zonefs_err(dst_sb, "failed to write destination file (%zd)\n", ret);
+ goto out;
+ }
+ }
+ mutex_lock(&dst_zi->i_truncate_mutex);
+ if (len > dst_zi->i_max_size - dst_zi->i_wpoffset) {
+ /* Adjust length */
+ len -= dst_zi->i_max_size - dst_zi->i_wpoffset;
+ if (len <= 0) {
+ mutex_unlock(&dst_zi->i_truncate_mutex);
+ return -EOPNOTSUPP;
+ }
+ }
+ if (dst_off != dst_zi->i_wpoffset) {
+ mutex_unlock(&dst_zi->i_truncate_mutex);
+ return -EOPNOTSUPP; /* copy not at zone write ptr */
+ }
+ mutex_lock(&src_zi->i_truncate_mutex);
+ ret = zonefs_is_file_size_ok(src_inode, dst_inode, src_off, dst_off, len);
+ if (ret < 0) {
+ mutex_unlock(&src_zi->i_truncate_mutex);
+ mutex_unlock(&dst_zi->i_truncate_mutex);
+ goto out;
+ }
+ mutex_unlock(&src_zi->i_truncate_mutex);
+
+ /* Drop dst file cached pages for a conv zone*/
+ if (dst_zi->i_ztype == ZONEFS_ZTYPE_CNV) {
+ ret = invalidate_inode_pages2_range(dst_inode->i_mapping,
+ dst_off >> PAGE_SHIFT,
+ (dst_off + len) >> PAGE_SHIFT);
+ if (ret < 0) {
+ zonefs_err(dst_sb, "Failed to invalidate inode pages (%zd)\n", ret);
+ ret = 0;
+ }
+ }
+ bytes = __zonefs_send_copy(src_zi, src_off, dst_zi, dst_off, len);
+ ret += bytes;
+
+ file_update_time(dst_file);
+ zonefs_update_stats(dst_inode, dst_off + bytes);
+ zonefs_i_size_write(dst_inode, dst_off + bytes);
+ dst_zi->i_wpoffset += bytes;
+ mutex_unlock(&dst_zi->i_truncate_mutex);
+
+
+
+ /*
+ * if we still have some bytes left, do splice copy
+ */
+ if (bytes && (bytes < len)) {
+ zonefs_info(src_sb, "Final partial copy of %zu bytes\n", len);
+ bytes = do_splice_direct(src_file, &src_off, dst_file,
+ &dst_off, len, flags);
+ if (bytes > 0)
+ ret += bytes;
+ else
+ zonefs_info(src_sb, "Failed partial copy (%zd)\n", bytes);
+ }
+
+out:
+
+ return ret;
+}
+
+static ssize_t zonefs_copy_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off,
+ size_t len, unsigned int flags)
+{
+ ssize_t ret;
+
+ ret = __zonefs_copy_file_range(src_file, src_off, dst_file, dst_off,
+ len, flags);
+
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
+ ret = generic_copy_file_range(src_file, src_off, dst_file,
+ dst_off, len, flags);
+ return ret;
+}
+
static const struct file_operations zonefs_file_operations = {
.open = zonefs_file_open,
.release = zonefs_file_release,
@@ -1200,6 +1366,7 @@ static const struct file_operations zonefs_file_operations = {
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
.iopoll = iocb_bio_iopoll,
+ .copy_file_range = zonefs_copy_file_range,
};

static struct kmem_cache *zonefs_inode_cachep;
@@ -1262,7 +1429,7 @@ static int zonefs_statfs(struct dentry *dentry, struct kstatfs *buf)

enum {
Opt_errors_ro, Opt_errors_zro, Opt_errors_zol, Opt_errors_repair,
- Opt_explicit_open, Opt_err,
+ Opt_explicit_open, Opt_no_copy_offload, Opt_err,
};

static const match_table_t tokens = {
@@ -1271,6 +1438,7 @@ static const match_table_t tokens = {
{ Opt_errors_zol, "errors=zone-offline"},
{ Opt_errors_repair, "errors=repair"},
{ Opt_explicit_open, "explicit-open" },
+ { Opt_no_copy_offload, "no_copy_offload" },
{ Opt_err, NULL}
};

@@ -1280,6 +1448,7 @@ static int zonefs_parse_options(struct super_block *sb, char *options)
substring_t args[MAX_OPT_ARGS];
char *p;

+ sbi->s_mount_opts |= ZONEFS_MNTOPT_COPY_FILE;
if (!options)
return 0;

@@ -1310,6 +1479,9 @@ static int zonefs_parse_options(struct super_block *sb, char *options)
case Opt_explicit_open:
sbi->s_mount_opts |= ZONEFS_MNTOPT_EXPLICIT_OPEN;
break;
+ case Opt_no_copy_offload:
+ sbi->s_mount_opts &= ~ZONEFS_MNTOPT_COPY_FILE;
+ break;
default:
return -EINVAL;
}
@@ -1330,6 +1502,8 @@ static int zonefs_show_options(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",errors=zone-offline");
if (sbi->s_mount_opts & ZONEFS_MNTOPT_ERRORS_REPAIR)
seq_puts(seq, ",errors=repair");
+ if (sbi->s_mount_opts & ZONEFS_MNTOPT_COPY_FILE)
+ seq_puts(seq, ",copy_offload");

return 0;
}
@@ -1769,6 +1943,8 @@ static int zonefs_fill_super(struct super_block *sb, void *data, int silent)
atomic_set(&sbi->s_active_seq_files, 0);
sbi->s_max_active_seq_files = bdev_max_active_zones(sb->s_bdev);

+ /* set copy support by default */
+ sbi->s_mount_opts |= ZONEFS_MNTOPT_COPY_FILE;
ret = zonefs_read_super(sb);
if (ret)
return ret;
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3233..efa6632c4b6a 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -162,6 +162,7 @@ enum zonefs_features {
(ZONEFS_MNTOPT_ERRORS_RO | ZONEFS_MNTOPT_ERRORS_ZRO | \
ZONEFS_MNTOPT_ERRORS_ZOL | ZONEFS_MNTOPT_ERRORS_REPAIR)
#define ZONEFS_MNTOPT_EXPLICIT_OPEN (1 << 4) /* Explicit open/close of zones on open/close */
+#define ZONEFS_MNTOPT_COPY_FILE (1 << 5) /* enable copy file range offload to kernel */

/*
* In-memory Super block information.
--
2.35.1.500.gb896f729e2

2022-04-27 23:00:59

by Damien Le Moal

Subject: Re: [PATCH v4 00/10] Add Copy offload support

On 4/28/22 00:38, Nitesh Shetty wrote:
> On Wed, Apr 27, 2022 at 10:46:32AM +0900, Damien Le Moal wrote:
>> On 4/26/22 19:12, Nitesh Shetty wrote:
>>> The patch series covers the points discussed in November 2021 virtual call
>>> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
>>> We have covered the Initial agreed requirements in this patchset.
>>> Patchset borrows Mikulas's token based approach for 2 bdev
>>> implementation.
>>>
>>> Overall series supports –
>>>
>>> 1. Driver
>>> - NVMe Copy command (single NS), including support in nvme-target (for
>>> block and file backend)
>>>
>>> 2. Block layer
>>> - Block-generic copy (REQ_COPY flag), with interface accommodating
>>> two block-devs, and multi-source/destination interface
>>> - Emulation, when offload is natively absent
>>> - dm-linear support (for cases not requiring split)
>>>
>>> 3. User-interface
>>> - new ioctl
>>> - copy_file_range for zonefs
>>>
>>> 4. In-kernel user
>>> - dm-kcopyd
>>> - copy_file_range in zonefs
>>>
>>> For zonefs copy_file_range - Seems we cannot leverage fstests here. Limited
>>> testing is done at this point using a custom application for unit testing.
>>
>> https://protect2.fireeye.com/v1/url?k=b14bf8e1-d0361099-b14a73ae-74fe485fffb1-9bd9bbb269af18f9&q=1&e=b9714c29-ea22-4fa5-8a2a-eeb42ca4bdc1&u=https%3A%2F%2Fgithub.com%2Fwesterndigitalcorporation%2Fzonefs-tools
>>
>> ./configure --with-tests
>> make
>> sudo make install
>>
>> Then run tests/zonefs-tests.sh
>>
>> Adding test case is simple. Just add script files under tests/scripts
>>
>> I just realized that the README file of this project is not documenting
>> this. I will update it.
>>
>
> Thank you. We will try to use this.
> Any plans to integrate this testsuite with fstests(xfstest) ?

No. It is not a good fit since zonefs cannot pass most of the generic test
cases.

>
> --
> Nitesh Shetty
>
>


--
Damien Le Moal
Western Digital Research

2022-04-28 01:01:46

by Damien Le Moal

Subject: Re: [PATCH v4 00/10] Add Copy offload support

On 4/27/22 21:49, Nitesh Shetty wrote:
> O Wed, Apr 27, 2022 at 11:19:48AM +0900, Damien Le Moal wrote:
>> On 4/26/22 19:12, Nitesh Shetty wrote:
>>> The patch series covers the points discussed in November 2021 virtual call
>>> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
>>> We have covered the Initial agreed requirements in this patchset.
>>> Patchset borrows Mikulas's token based approach for 2 bdev
>>> implementation.
>>>
>>> Overall series supports –
>>>
>>> 1. Driver
>>> - NVMe Copy command (single NS), including support in nvme-target (for
>>> block and file backend)
>>
>> It would also be nice to have copy offload emulation in null_blk for testing.
>>
>
> We can plan this in next phase of copy support, once this series settles down.

So how can people test your series? Not a lot of drives out there with
copy support.

>
> --
> Nitesh Shetty
>
>


--
Damien Le Moal
Western Digital Research

2022-05-02 23:20:08

by Damien Le Moal

Subject: Re: [PATCH v4 00/10] Add Copy offload support

On 4/28/22 16:49, Nitesh Shetty wrote:
> On Thu, Apr 28, 2022 at 07:05:32AM +0900, Damien Le Moal wrote:
>> On 4/27/22 21:49, Nitesh Shetty wrote:
>>> O Wed, Apr 27, 2022 at 11:19:48AM +0900, Damien Le Moal wrote:
>>>> On 4/26/22 19:12, Nitesh Shetty wrote:
>>>>> The patch series covers the points discussed in November 2021 virtual call
>>>>> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
>>>>> We have covered the Initial agreed requirements in this patchset.
>>>>> Patchset borrows Mikulas's token based approach for 2 bdev
>>>>> implementation.
>>>>>
>>>>> Overall series supports –
>>>>>
>>>>> 1. Driver
>>>>> - NVMe Copy command (single NS), including support in nvme-target (for
>>>>> block and file backend)
>>>>
>>>> It would also be nice to have copy offload emulation in null_blk for testing.
>>>>
>>>
>>> We can plan this in next phase of copy support, once this series settles down.
>>
>> So how can people test your series ? Not a lot of drives out there with
>> copy support.
>>
>
> Yeah not many drives at present, Qemu can be used to test NVMe copy.

Upstream QEMU? What are the command line options? An example would be
nice. But I still think null_blk support would be easiest.


--
Damien Le Moal
Western Digital Research

2022-05-02 23:22:56

by Damien Le Moal

Subject: Re: [PATCH v4 00/10] Add Copy offload support

On 2022/05/02 13:09, Dave Chinner wrote:
> On Wed, Apr 27, 2022 at 06:19:51PM +0530, Nitesh Shetty wrote:
>> O Wed, Apr 27, 2022 at 11:19:48AM +0900, Damien Le Moal wrote:
>>> On 4/26/22 19:12, Nitesh Shetty wrote:
>>>> The patch series covers the points discussed in November 2021 virtual call
>>>> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
>>>> We have covered the Initial agreed requirements in this patchset.
>>>> Patchset borrows Mikulas's token based approach for 2 bdev
>>>> implementation.
>>>>
>>>> Overall series supports –
>>>>
>>>> 1. Driver
>>>> - NVMe Copy command (single NS), including support in nvme-target (for
>>>> block and file backend)
>>>
>>> It would also be nice to have copy offload emulation in null_blk for testing.
>>>
>>
>> We can plan this in next phase of copy support, once this series settles down.
>
> Why not just hook the loopback driver up to copy_file_range() so
> that the backend filesystem can just reflink copy the ranges being
> passed? That would enable testing on btrfs, XFS and NFSv4.2 hosted
> image files without needing any special block device setup at all...

That is a very good idea! But that will cover only the non-zoned case. For copy
offload on zoned devices, adding support in null_blk is probably the simplest
thing to do.

>
> i.e. I think you're doing this completely backwards by trying to
> target non-existent hardware first....
>
> Cheers,
>
> Dave.


--
Damien Le Moal
Western Digital Research

2022-05-02 23:43:03

by Dave Chinner

Subject: Re: [PATCH v4 00/10] Add Copy offload support

On Wed, Apr 27, 2022 at 06:19:51PM +0530, Nitesh Shetty wrote:
> O Wed, Apr 27, 2022 at 11:19:48AM +0900, Damien Le Moal wrote:
> > On 4/26/22 19:12, Nitesh Shetty wrote:
> > > The patch series covers the points discussed in November 2021 virtual call
> > > [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
> > > We have covered the Initial agreed requirements in this patchset.
> > > Patchset borrows Mikulas's token based approach for 2 bdev
> > > implementation.
> > >
> > > Overall series supports –
> > >
> > > 1. Driver
> > > - NVMe Copy command (single NS), including support in nvme-target (for
> > > block and file backend)
> >
> > It would also be nice to have copy offload emulation in null_blk for testing.
> >
>
> We can plan this in next phase of copy support, once this series settles down.

Why not just hook the loopback driver up to copy_file_range() so
that the backend filesystem can just reflink copy the ranges being
passed? That would enable testing on btrfs, XFS and NFSv4.2 hosted
image files without needing any special block device setup at all...
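
A rough sketch of what that might look like (the hook point into the loop
driver is assumed rather than existing code; struct range_entry comes from
this series and vfs_copy_file_range() is the existing VFS helper):

static int loop_handle_copy(struct loop_device *lo, struct range_entry *r)
{
	struct file *file = lo->lo_backing_file;
	ssize_t copied;

	/* Let the backing filesystem service the copy, e.g. via reflink. */
	copied = vfs_copy_file_range(file, r->src, file, r->dst, r->len, 0);
	if (copied < 0)
		return copied;

	/* Report partial completion the same way the series does. */
	r->comp_len = copied;
	return copied == r->len ? 0 : -EIO;
}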

i.e. I think you're doing this completely backwards by trying to
target non-existent hardware first....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-05-03 00:50:37

by Damien Le Moal

Subject: Re: [dm-devel] [PATCH v4 00/10] Add Copy offload support

On 2022/04/27 21:49, Nitesh Shetty wrote:
> O Wed, Apr 27, 2022 at 11:19:48AM +0900, Damien Le Moal wrote:
>> On 4/26/22 19:12, Nitesh Shetty wrote:
>>> The patch series covers the points discussed in November 2021 virtual call
>>> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
>>> We have covered the Initial agreed requirements in this patchset.
>>> Patchset borrows Mikulas's token based approach for 2 bdev
>>> implementation.
>>>
>>> Overall series supports –
>>>
>>> 1. Driver
>>> - NVMe Copy command (single NS), including support in nvme-target (for
>>> block and file backend)
>>
>> It would also be nice to have copy offload emulation in null_blk for testing.
>>
>
> We can plan this in next phase of copy support, once this series settles down.

Why? How do you expect people to test simply without null_blk? Sure, you said
QEMU can be used. But if copy offload is not upstream for QEMU either, there is
no easy way to test.

Adding that support to null_blk would not be hard at all.



--
Damien Le Moal
Western Digital Research

2022-05-03 00:52:41

by Bart Van Assche

Subject: Re: [dm-devel] [PATCH v4 00/10] Add Copy offload support

On 4/28/22 14:37, Damien Le Moal wrote:
> On 4/28/22 16:49, Nitesh Shetty wrote:
>> On Thu, Apr 28, 2022 at 07:05:32AM +0900, Damien Le Moal wrote:
>>> On 4/27/22 21:49, Nitesh Shetty wrote:
>>>> O Wed, Apr 27, 2022 at 11:19:48AM +0900, Damien Le Moal wrote:
>>>>> On 4/26/22 19:12, Nitesh Shetty wrote:
>>>>>> The patch series covers the points discussed in November 2021 virtual call
>>>>>> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
>>>>>> We have covered the Initial agreed requirements in this patchset.
>>>>>> Patchset borrows Mikulas's token based approach for 2 bdev
>>>>>> implementation.
>>>>>>
>>>>>> Overall series supports –
>>>>>>
>>>>>> 1. Driver
>>>>>> - NVMe Copy command (single NS), including support in nvme-target (for
>>>>>> block and file backend)
>>>>>
>>>>> It would also be nice to have copy offload emulation in null_blk for testing.
>>>>>
>>>>
>>>> We can plan this in next phase of copy support, once this series settles down.
>>>
>>> So how can people test your series ? Not a lot of drives out there with
>>> copy support.
>>>
>>
>> Yeah not many drives at present, Qemu can be used to test NVMe copy.
>
> Upstream QEMU ? What is the command line options ? An example would be
> nice. But I still think null_blk support would be easiest.

+1 for adding copy offloading support in null_blk. That enables running
copy offloading tests without depending on Qemu.

Thanks,

Bart.

2022-05-03 01:27:10

by Dave Chinner

Subject: Re: [PATCH v4 00/10] Add Copy offload support

On Mon, May 02, 2022 at 09:54:55PM +0900, Damien Le Moal wrote:
> On 2022/05/02 13:09, Dave Chinner wrote:
> > On Wed, Apr 27, 2022 at 06:19:51PM +0530, Nitesh Shetty wrote:
> >> O Wed, Apr 27, 2022 at 11:19:48AM +0900, Damien Le Moal wrote:
> >>> On 4/26/22 19:12, Nitesh Shetty wrote:
> >>>> The patch series covers the points discussed in November 2021 virtual call
> >>>> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
> >>>> We have covered the Initial agreed requirements in this patchset.
> >>>> Patchset borrows Mikulas's token based approach for 2 bdev
> >>>> implementation.
> >>>>
> >>>> Overall series supports –
> >>>>
> >>>> 1. Driver
> >>>> - NVMe Copy command (single NS), including support in nvme-target (for
> >>>> block and file backend)
> >>>
> >>> It would also be nice to have copy offload emulation in null_blk for testing.
> >>>
> >>
> >> We can plan this in next phase of copy support, once this series settles down.
> >
> > Why not just hook the loopback driver up to copy_file_range() so
> > that the backend filesystem can just reflink copy the ranges being
> > passed? That would enable testing on btrfs, XFS and NFSv4.2 hosted
> > image files without needing any special block device setup at all...
>
> That is a very good idea ! But that will cover only the non-zoned case. For copy
> offload on zoned devices, adding support in null_blk is probably the simplest
> thing to do.

Sure, but that's a zone device implementation issue, not a "how do
applications use this offload" issue.

i.e. zonefs support is not necessary to test the bio/block layer
interfaces at all. All we need is a block device that can decode the
bio-encoded offload packet and execute it to do full block layer
testing. We can build dm devices on top of loop devices, etc, so we
can test that the offload support is plumbed, sliced, diced, and
regurgitated correctly that way. We don't need actual low level
device drivers to test this.

And, unlike the nullblk device, using the loopback device w/
copy_file_range() will also allow data integrity testing if a
generic copy_file_range() offload implementation is added. That is,
we test a non-reflink capable filesystem on the loop device with the
image file hosted on a reflink-capable filesystem. The upper
filesystem copy then gets offloaded to reflinks in the lower
filesystem. We already have copy_file_range() support in fsx, so all
the data integrity fsx tests in fstests will exercise this offload
path and find all the data corruptions the initial block layer bugs
expose...

Further, fsstress also has copy_file_range() support, and so all the
fstests that generate stress tests or use fsstress as load for
failure testing will also exercise it.

Indeed, this then gives us fine-grained error injection capability
within fstests via devices like dm-flakey. What happens when
dm-flakey kills the device IO mid-offload? Does everything recover
correctly? Do we end up with data corruption? Are partial offload
completions when errors occur signalled correctly? Is there -any-
test coverage (or even capability for testing) of user driven copy
offload failure situations like this in any of the other test
suites?

I mean, once the loop device has cfr offload, we can use dm-flakey
to kill IO in the image file or even do a force shutdown of the
image host filesystem. Hence we can actually trash the copy offload
operation in mid-flight, not just error it out on full completion.
This is trivial to do with the fstests infrastructure - it just
relies on having generic copy_file_range() block offload support and
a loopback device offload of hardware copy bios back to
copy_file_range()....

This is what I mean about copy offload being designed the wrong way.
We have the high level hooks needed to implement it right through the
filesystems and block layer without any specific hardware support,
and we can test the whole stack without needing specific hardware
support. We already have filesystem level copy offload acceleration,
so the last thing we want to see is a block layer offload
implementation that is incompatible with the semantics we've already
exposed to userspace for copy offloads.

As I said:

> > i.e. I think you're doing this completely backwards by trying to
> > target non-existent hardware first....

Rather than tie the block layer offload function/implementation to
the specific quirks of a specific target hardware, we should be
adding generic support in the block layer for the copy offload
semantics we've already exposed to userspace. We already have test
coverage and infrastructure for this interface and is already in use
by applications.

Transparent hardware acceleration of data copies when the hardware
supports it is exactly where copy offloads are useful - implementing
support based around hardware made of unobtainium and then adding
high level user facing API support as an afterthought is putting the
cart before the horse. We need to make sure the high level
functionality is robust and handles errors correctly before we even
worry about what quirks the hardware might bring to the table.

Build a reference model first with the loop device and
copy-file-range, test it, validate it, make sure it all works. Then
hook up the hardware, and fix all the hardware bugs that are exposed
before the hardware is released to the general public....

Why haven't we learnt this lesson yet from all the problems we've
had with, say, broken discard/trim, zeroing, erase, etc in hardware
implementations, incompatible hardware protocol implementations of
equivalent functionality, etc? i.e. We haven't defined the OS
required behaviour that hardware must support and instead just tried
to make whatever has come from the hardware vendor's
"standarisation" process work ok?

In this case, we already have a functioning model, syscalls and user
applications making use of copy offloads at the OS level. Now we
need to implement those exact semantics at the block layer to build
a validated reference model for the block layer offload behaviour
that hardware must comply with. Then hardware offloads in actual
hardware can be compared and validated against the reference model
behaviour, and any hardware that doesn't match can be
quirked/blacklisted until the manufacturer fixes their firmware...

Cheers,

Dave.
--
Dave Chinner
[email protected]