The patch series covers the points discussed in past and most recently
in LSFMM'23[0].
We have covered the initial agreed requirements in this patchset and
further additional features suggested by community.
Patchset borrows Mikulas's token based approach for 2 bdev implementation.
This is next iteration of our previous patchset v10[1].
Overall series supports:
========================
1. Driver
- NVMe Copy command (single NS, TP 4065), including support
in nvme-target (for block and file backend).
2. Block layer
- Block-generic copy (REQ_COPY flag), with interface
accommodating two block-devs
- Emulation, for in-kernel user when offload is natively
absent
- dm-linear support (for cases not requiring split)
3. User-interface
- copy_file_range
Testing
=======
Copy offload can be tested on:
a. QEMU: NVME simple copy (TP 4065). By setting nvme-ns
parameters mssrl,mcl, msrc. For more info [2].
b. Null block device
c. NVMe Fabrics loopback.
d. blktests[3] (tests block/034-037, nvme/050-053)
Emulation can be tested on any device.
fio[4].
Infra and plumbing:
===================
We populate copy_file_range callback in def_blk_fops.
For devices that support copy-offload, use blkdev_copy_offload to
achieve in-device copy.
However for cases, where device doesn't support offload,
fallback to generic_copy_file_range.
For in-kernel users (like NVMe fabrics), we use blkdev_issue_copy
which implements its own emulation, as fd is not available.
Modify checks in generic_copy_file_range to support block-device.
Performance:
============
The major benefit of this copy-offload/emulation framework is
observed in fabrics setup, for copy workloads across the network.
The host will send offload command over the network and actual copy
can be achieved using emulation on the target.
This results in higher performance and lower network consumption,
as compared to read and write travelling across the network.
With async-design of copy-offload/emulation we are able to see the
following improvements as compared to userspace read + write on a
NVMeOF TCP setup:
Setup1: Network Speed: 1000Mb/s
Host PC: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Target PC: AMD Ryzen 9 5900X 12-Core Processor
block size 8k:
710% improvement in IO BW (108 MiB/s to 876 MiB/s).
Network utilisation drops from 97% to 15%.
block-size 1M:
2532% improvement in IO BW (101 MiB/s to 2659 MiB/s).
Network utilisation drops from 89% to 0.62%.
Setup2: Network Speed: 100Gb/s
Server: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 72 cores
(host and target have the same configuration)
block-size 8k:
17.5% improvement in IO BW (794 MiB/s to 933 MiB/s).
Network utilisation drops from 6.75% to 0.16%.
Blktests[3]
======================
tests/block/034,035: Runs copy offload and emulation on block
device.
tests/block/035,036: Runs copy offload and emulation on null
block device.
tests/nvme/050-053: Create a loop backed fabrics device and
run copy offload and emulation.
Future Work
===========
- loopback device copy offload support
- upstream fio to use copy offload
- upstream blktest to use copy offload
These are to be taken up after this minimal series is agreed upon.
Additional links:
=================
[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
https://lore.kernel.org/linux-nvme/[email protected]/
https://lore.kernel.org/linux-nvme/[email protected]/
[1] https://lore.kernel.org/all/[email protected]/
[2] https://qemu-project.gitlab.io/qemu/system/devices/nvme.html#simple-copy
[3] https://github.com/nitesh-shetty/blktests/tree/feat/copy_offload/v11
[4] https://github.com/vincentkfu/fio/commits/copyoffload-3.34-v11
Changes since v10:
=================
- NVMeOF: optimization in NVMe fabrics (Chaitanya Kulkarni)
- NVMeOF: sparse warnings (kernel test robot)
Changes since v9:
=================
- null_blk, improved documentation, minor fixes(Chaitanya Kulkarni)
- fio, expanded testing and minor fixes (Vincent Fu)
Changes since v8:
=================
- null_blk, copy_max_bytes_hw is made config fs parameter
(Damien Le Moal)
- Negative error handling in copy_file_range (Christian Brauner)
- minor fixes, better documentation (Damien Le Moal)
- fio upgraded to 3.34 (Vincent Fu)
Changes since v7:
=================
- null block copy offload support for testing (Damien Le Moal)
- adding direct flag check for copy offload to block device,
as we are using generic_copy_file_range for cached cases.
- Minor fixes
Changes since v6:
=================
- copy_file_range instead of ioctl for direct block device
- Remove support for multi range (vectored) copy
- Remove ioctl interface for copy.
- Remove offload support in dm kcopyd.
Changes since v5:
=================
- Addition of blktests (Chaitanya Kulkarni)
- Minor fix for fabrics file backed path
- Remove buggy zonefs copy file range implementation.
Changes since v4:
=================
- make the offload and emulation design asynchronous (Hannes
Reinecke)
- fabrics loopback support
- sysfs naming improvements (Damien Le Moal)
- use kfree() instead of kvfree() in cio_await_completion
(Damien Le Moal)
- use ranges instead of rlist to represent range_entry (Damien
Le Moal)
- change argument ordering in blk_copy_offload suggested (Damien
Le Moal)
- removed multiple copy limit and merged into only one limit
(Damien Le Moal)
- wrap overly long lines (Damien Le Moal)
- other naming improvements and cleanups (Damien Le Moal)
- correctly format the code example in description (Damien Le
Moal)
- mark blk_copy_offload as static (kernel test robot)
Changes since v3:
=================
- added copy_file_range support for zonefs
- added documentation about new sysfs entries
- incorporated review comments on v3
- minor fixes
Changes since v2:
=================
- fixed possible race condition reported by Damien Le Moal
- new sysfs controls as suggested by Damien Le Moal
- fixed possible memory leak reported by Dan Carpenter, lkp
- minor fixes
Changes since v1:
=================
- sysfs documentation (Greg KH)
- 2 bios for copy operation (Bart Van Assche, Mikulas Patocka,
Martin K. Petersen, Douglas Gilbert)
- better payload design (Darrick J. Wong)
Nitesh Shetty (9):
block: Introduce queue limits for copy-offload support
block: Add copy offload support infrastructure
block: add emulation for copy
fs, block: copy_file_range for def_blk_ops for direct block device
nvme: add copy offload support
nvmet: add copy command support for bdev and file ns
dm: Add support for copy offload
dm: Enable copy offload for dm-linear target
null_blk: add support for copy offload
Documentation/ABI/stable/sysfs-block | 33 ++
block/blk-lib.c | 431 +++++++++++++++++++++++++++
block/blk-map.c | 4 +-
block/blk-settings.c | 24 ++
block/blk-sysfs.c | 64 ++++
block/blk.h | 2 +
block/fops.c | 20 ++
drivers/block/null_blk/main.c | 108 ++++++-
drivers/block/null_blk/null_blk.h | 8 +
drivers/md/dm-linear.c | 1 +
drivers/md/dm-table.c | 41 +++
drivers/md/dm.c | 7 +
drivers/nvme/host/constants.c | 1 +
drivers/nvme/host/core.c | 103 ++++++-
drivers/nvme/host/fc.c | 5 +
drivers/nvme/host/nvme.h | 7 +
drivers/nvme/host/pci.c | 27 +-
drivers/nvme/host/rdma.c | 7 +
drivers/nvme/host/tcp.c | 16 +
drivers/nvme/host/trace.c | 19 ++
drivers/nvme/target/admin-cmd.c | 9 +-
drivers/nvme/target/io-cmd-bdev.c | 62 ++++
drivers/nvme/target/io-cmd-file.c | 52 ++++
drivers/nvme/target/loop.c | 6 +
drivers/nvme/target/nvmet.h | 1 +
fs/read_write.c | 11 +-
include/linux/blk_types.h | 25 ++
include/linux/blkdev.h | 21 ++
include/linux/device-mapper.h | 5 +
include/linux/nvme.h | 43 ++-
include/uapi/linux/fs.h | 3 +
mm/filemap.c | 11 +-
32 files changed, 1155 insertions(+), 22 deletions(-)
base-commit: dbd91ef4e91c1ce3a24429f5fb3876b7a0306733
--
2.35.1.500.gb896f729e2
Add support for handling nvme_cmd_copy command on target.
For bdev-ns we call into blkdev_issue_copy, which the block layer
completes by a offloaded copy request to backend bdev or by emulating the
request.
For file-ns we call vfs_copy_file_range to service our request.
Currently target always shows copy capability by setting
NVME_CTRL_ONCS_COPY in controller ONCS.
loop target has copy support, which can be used to test copy offload.
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
drivers/nvme/target/admin-cmd.c | 9 ++++-
drivers/nvme/target/io-cmd-bdev.c | 62 +++++++++++++++++++++++++++++++
drivers/nvme/target/io-cmd-file.c | 52 ++++++++++++++++++++++++++
drivers/nvme/target/loop.c | 6 +++
drivers/nvme/target/nvmet.h | 1 +
5 files changed, 128 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 39cb570f833d..8e644b8ec0fd 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -433,8 +433,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
- NVME_CTRL_ONCS_WRITE_ZEROES);
-
+ NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
@@ -536,6 +535,12 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+ else {
+ id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+ id->mssrl = cpu_to_le16(BIO_MAX_VECS <<
+ (PAGE_SHIFT - SECTOR_SHIFT));
+ id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl));
+ }
/*
* We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index c2d6cea0236b..d92dfe86c647 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,18 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(bdev_io_opt(bdev) / bdev_logical_block_size(bdev));
+
+ if (bdev_max_copy_sectors(bdev)) {
+ id->msrc = id->msrc;
+ id->mssrl = cpu_to_le16((bdev_max_copy_sectors(bdev) <<
+ SECTOR_SHIFT) / bdev_logical_block_size(bdev));
+ id->mcl = cpu_to_le32((__force u32)id->mssrl);
+ } else {
+ id->msrc = (__force u8)to0based(BIO_MAX_VECS - 1);
+ id->mssrl = cpu_to_le16((BIO_MAX_VECS << PAGE_SHIFT) /
+ bdev_logical_block_size(bdev));
+ id->mcl = cpu_to_le32((__force u32)id->mssrl);
+ }
}
void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -184,6 +196,21 @@ static void nvmet_bio_done(struct bio *bio)
nvmet_req_bio_put(req, bio);
}
+static void nvmet_bdev_copy_end_io(void *private, int comp_len)
+{
+ struct nvmet_req *req = (struct nvmet_req *)private;
+ u16 status;
+
+ if (comp_len == req->copy_len) {
+ req->cqe->result.u32 = cpu_to_le32(1);
+ status = errno_to_nvme_status(req, 0);
+ } else {
+ req->cqe->result.u32 = cpu_to_le32(0);
+ status = errno_to_nvme_status(req, (__force u16)BLK_STS_IOERR);
+ }
+ nvmet_req_complete(req, status);
+}
+
#ifdef CONFIG_BLK_DEV_INTEGRITY
static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio,
struct sg_mapping_iter *miter)
@@ -450,6 +477,37 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
}
}
+/* At present we handle only one range entry, since copy offload is aligned with
+ * copy_file_range, only one entry is passed from block layer.
+ */
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+ struct nvme_copy_range range;
+ struct nvme_command *cmd = req->cmd;
+ int ret;
+ u16 status;
+
+ status = nvmet_copy_from_sgl(req, 0, &range, sizeof(range));
+ if (status)
+ goto out;
+
+ ret = blkdev_issue_copy(req->ns->bdev,
+ le64_to_cpu(cmd->copy.sdlba) << req->ns->blksize_shift,
+ req->ns->bdev,
+ le64_to_cpu(range.slba) << req->ns->blksize_shift,
+ (le16_to_cpu(range.nlb) + 1) << req->ns->blksize_shift,
+ nvmet_bdev_copy_end_io, (void *)req, GFP_KERNEL);
+ if (ret) {
+ req->cqe->result.u32 = cpu_to_le32(0);
+ status = blk_to_nvme_status(req, BLK_STS_IOERR);
+ goto out;
+ }
+
+ return;
+out:
+ nvmet_req_complete(req, errno_to_nvme_status(req, status));
+}
+
u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
{
switch (req->cmd->common.opcode) {
@@ -468,6 +526,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
case nvme_cmd_write_zeroes:
req->execute = nvmet_bdev_execute_write_zeroes;
return 0;
+ case nvme_cmd_copy:
+ req->execute = nvmet_bdev_execute_copy;
+ return 0;
+
default:
return nvmet_report_invalid_opcode(req);
}
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 2d068439b129..f61aa834f7a5 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -322,6 +322,49 @@ static void nvmet_file_dsm_work(struct work_struct *w)
}
}
+static void nvmet_file_copy_work(struct work_struct *w)
+{
+ struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+ int nr_range = req->cmd->copy.nr_range + 1;
+ u16 status = 0;
+ int src, id;
+ ssize_t len, ret;
+ loff_t pos;
+
+ pos = le64_to_cpu(req->cmd->copy.sdlba) << req->ns->blksize_shift;
+ if (unlikely(pos + req->transfer_len > req->ns->size)) {
+ nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+ return;
+ }
+
+ for (id = 0 ; id < nr_range; id++) {
+ struct nvme_copy_range range;
+
+ status = nvmet_copy_from_sgl(req, id * sizeof(range), &range,
+ sizeof(range));
+ if (status)
+ goto out;
+
+ src = (le64_to_cpu(range.slba) << (req->ns->blksize_shift));
+ len = (le16_to_cpu(range.nlb) + 1) << (req->ns->blksize_shift);
+ ret = vfs_copy_file_range(req->ns->file, src, req->ns->file,
+ pos, len, 0);
+ if (ret != len) {
+ pos += ret;
+ req->cqe->result.u32 = cpu_to_le32(id);
+ if (ret < 0)
+ status = errno_to_nvme_status(req, ret);
+ else
+ status = errno_to_nvme_status(req, -EIO);
+ goto out;
+ } else
+ pos += len;
+ }
+
+out:
+ nvmet_req_complete(req, status);
+}
+
static void nvmet_file_execute_dsm(struct nvmet_req *req)
{
if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
@@ -330,6 +373,12 @@ static void nvmet_file_execute_dsm(struct nvmet_req *req)
queue_work(nvmet_wq, &req->f.work);
}
+static void nvmet_file_execute_copy(struct nvmet_req *req)
+{
+ INIT_WORK(&req->f.work, nvmet_file_copy_work);
+ queue_work(nvmet_wq, &req->f.work);
+}
+
static void nvmet_file_write_zeroes_work(struct work_struct *w)
{
struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
@@ -376,6 +425,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
case nvme_cmd_write_zeroes:
req->execute = nvmet_file_execute_write_zeroes;
return 0;
+ case nvme_cmd_copy:
+ req->execute = nvmet_file_execute_copy;
+ return 0;
default:
return nvmet_report_invalid_opcode(req);
}
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index f2d24b2d992f..d18ed8067a15 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -146,6 +146,12 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
return ret;
nvme_start_request(req);
+ if (unlikely((req->cmd_flags & REQ_COPY) &&
+ (req_op(req) == REQ_OP_READ))) {
+ blk_mq_set_request_complete(req);
+ blk_mq_end_request(req, BLK_STS_OK);
+ return BLK_STS_OK;
+ }
iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
iod->req.port = queue->ctrl->port;
if (!nvmet_req_init(&iod->req, &queue->nvme_cq,
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index dc60a22646f7..1615dc9194ba 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -393,6 +393,7 @@ struct nvmet_req {
struct device *p2p_client;
u16 error_loc;
u64 error_slba;
+ size_t copy_len;
};
#define NVMET_MAX_MPOOL_BVEC 16
--
2.35.1.500.gb896f729e2
Before enabling copy for dm target, check if underlying devices and
dm target support copy. Avoid split happening inside dm target.
Fail early if the request needs split, currently splitting copy
request is not supported.
Signed-off-by: Nitesh Shetty <[email protected]>
---
drivers/md/dm-table.c | 41 +++++++++++++++++++++++++++++++++++
drivers/md/dm.c | 7 ++++++
include/linux/device-mapper.h | 5 +++++
3 files changed, 53 insertions(+)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 1398f1d6e83e..b3269271e761 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1867,6 +1867,39 @@ static bool dm_table_supports_nowait(struct dm_table *t)
return true;
}
+static int device_not_copy_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+ struct request_queue *q = bdev_get_queue(dev->bdev);
+
+ return !blk_queue_copy(q);
+}
+
+static bool dm_table_supports_copy(struct dm_table *t)
+{
+ struct dm_target *ti;
+ unsigned int i;
+
+ for (i = 0; i < t->num_targets; i++) {
+ ti = dm_table_get_target(t, i);
+
+ if (!ti->copy_offload_supported)
+ return false;
+
+ /*
+ * target provides copy support (as implied by setting
+ * 'copy_offload_supported')
+ * and it relies on _all_ data devices having copy support.
+ */
+ if (!ti->type->iterate_devices ||
+ ti->type->iterate_devices(ti,
+ device_not_copy_capable, NULL))
+ return false;
+ }
+
+ return true;
+}
+
static int device_not_discard_capable(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
@@ -1949,6 +1982,14 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
q->limits.discard_misaligned = 0;
}
+ if (!dm_table_supports_copy(t)) {
+ blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+ q->limits.max_copy_sectors = 0;
+ q->limits.max_copy_sectors_hw = 0;
+ } else {
+ blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+ }
+
if (!dm_table_supports_secure_erase(t))
q->limits.max_secure_erase_sectors = 0;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3b694ba3a106..ab9069090a7d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1720,6 +1720,13 @@ static blk_status_t __split_and_process_bio(struct clone_info *ci)
if (unlikely(ci->is_abnormal_io))
return __process_abnormal_io(ci, ti);
+ if ((unlikely(op_is_copy(ci->bio->bi_opf)) &&
+ max_io_len(ti, ci->sector) < ci->sector_count)) {
+ DMERR("Error, IO size(%u) > max target size(%llu)\n",
+ ci->sector_count, max_io_len(ti, ci->sector));
+ return BLK_STS_IOERR;
+ }
+
/*
* Only support bio polling for normal IO, and the target io is
* exactly inside the dm_io instance (verified in dm_poll_dm_io)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a52d2b9a6846..04016bd76e73 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -398,6 +398,11 @@ struct dm_target {
* bio_set_dev(). NOTE: ideally a target should _not_ need this.
*/
bool needs_bio_set_dev:1;
+
+ /*
+ * copy offload is supported
+ */
+ bool copy_offload_supported:1;
};
void *dm_per_bio_data(struct bio *bio, size_t data_size);
--
2.35.1.500.gb896f729e2
Introduce blkdev_issue_copy which takes similar arguments as
copy_file_range and performs copy offload between two bdevs.
Introduce REQ_COPY copy offload operation flag. Create a read-write
bio pair with a token as payload and submitted to the device in order.
Read request populates token with source specific information which
is then passed with write request.
This design is courtesy Mikulas Patocka's token based copy
Larger copy will be divided, based on max_copy_sectors limit.
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
block/blk-lib.c | 235 ++++++++++++++++++++++++++++++++++++++
block/blk.h | 2 +
include/linux/blk_types.h | 25 ++++
include/linux/blkdev.h | 3 +
4 files changed, 265 insertions(+)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index e59c3069e835..ed089e703cb1 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -115,6 +115,241 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL(blkdev_issue_discard);
+/*
+ * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
+ * This must only be called once all bios have been issued so that the refcount
+ * can only decrease. This just waits for all bios to make it through
+ * blkdev_copy_write_endio.
+ */
+static int blkdev_copy_wait_completion(struct cio *cio)
+{
+ int ret;
+
+ if (cio->endio)
+ return 0;
+
+ if (atomic_read(&cio->refcount)) {
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ blk_io_schedule();
+ }
+
+ ret = cio->comp_len;
+ kfree(cio);
+
+ return ret;
+}
+
+static void blkdev_copy_offload_write_endio(struct bio *bio)
+{
+ struct copy_ctx *ctx = bio->bi_private;
+ struct cio *cio = ctx->cio;
+ sector_t clen;
+
+ if (bio->bi_status) {
+ clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
+ cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+ }
+ __free_page(bio->bi_io_vec[0].bv_page);
+ bio_put(bio);
+
+ kfree(ctx);
+ if (!atomic_dec_and_test(&cio->refcount))
+ return;
+ if (cio->endio) {
+ cio->endio(cio->private, cio->comp_len);
+ kfree(cio);
+ } else
+ blk_wake_io_task(cio->waiter);
+}
+
+static void blkdev_copy_offload_read_endio(struct bio *read_bio)
+{
+ struct copy_ctx *ctx = read_bio->bi_private;
+ struct cio *cio = ctx->cio;
+ sector_t clen;
+
+ if (read_bio->bi_status) {
+ clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT)
+ - cio->pos_in;
+ cio->comp_len = min_t(sector_t, clen, cio->comp_len);
+ __free_page(read_bio->bi_io_vec[0].bv_page);
+ bio_put(ctx->write_bio);
+ bio_put(read_bio);
+ kfree(ctx);
+ if (atomic_dec_and_test(&cio->refcount)) {
+ if (cio->endio) {
+ cio->endio(cio->private, cio->comp_len);
+ kfree(cio);
+ } else
+ blk_wake_io_task(cio->waiter);
+ }
+ return;
+ }
+
+ schedule_work(&ctx->dispatch_work);
+ bio_put(read_bio);
+}
+
+static void blkdev_copy_dispatch_work(struct work_struct *work)
+{
+ struct copy_ctx *ctx = container_of(work, struct copy_ctx,
+ dispatch_work);
+
+ submit_bio(ctx->write_bio);
+}
+
+/*
+ * __blkdev_copy_offload - Use device's native copy offload feature.
+ * we perform copy operation by sending 2 bio.
+ * 1. First we send a read bio with REQ_COPY flag along with a token and source
+ * and length. Once read bio reaches driver layer, device driver adds all the
+ * source info to token and does a fake completion.
+ * 2. Once read operation completes, we issue write with REQ_COPY flag with same
+ * token. In driver layer, token info is used to form a copy offload command.
+ *
+ * Returns the length of bytes copied or error if encountered
+ */
+static int __blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
+ struct block_device *bdev_out, loff_t pos_out, size_t len,
+ cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+ struct cio *cio;
+ struct copy_ctx *ctx;
+ struct bio *read_bio, *write_bio;
+ struct page *token;
+ sector_t copy_len;
+ sector_t rem, max_copy_len;
+
+ cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+ if (!cio)
+ return -ENOMEM;
+ atomic_set(&cio->refcount, 0);
+ cio->waiter = current;
+ cio->endio = endio;
+ cio->private = private;
+
+ max_copy_len = min(bdev_max_copy_sectors(bdev_in),
+ bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;
+
+ cio->pos_in = pos_in;
+ cio->pos_out = pos_out;
+ /* If there is a error, comp_len will be set to least successfully
+ * completed copied length
+ */
+ cio->comp_len = len;
+ for (rem = len; rem > 0; rem -= copy_len) {
+ copy_len = min(rem, max_copy_len);
+
+ token = alloc_page(gfp_mask);
+ if (unlikely(!token))
+ goto err_token;
+
+ ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+ if (!ctx)
+ goto err_ctx;
+ read_bio = bio_alloc(bdev_in, 1, REQ_OP_READ | REQ_COPY
+ | REQ_SYNC | REQ_NOMERGE, gfp_mask);
+ if (!read_bio)
+ goto err_read_bio;
+ write_bio = bio_alloc(bdev_out, 1, REQ_OP_WRITE
+ | REQ_COPY | REQ_SYNC | REQ_NOMERGE, gfp_mask);
+ if (!write_bio)
+ goto err_write_bio;
+
+ ctx->cio = cio;
+ ctx->write_bio = write_bio;
+ INIT_WORK(&ctx->dispatch_work, blkdev_copy_dispatch_work);
+
+ __bio_add_page(read_bio, token, PAGE_SIZE, 0);
+ read_bio->bi_iter.bi_size = copy_len;
+ read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
+ read_bio->bi_end_io = blkdev_copy_offload_read_endio;
+ read_bio->bi_private = ctx;
+
+ __bio_add_page(write_bio, token, PAGE_SIZE, 0);
+ write_bio->bi_iter.bi_size = copy_len;
+ write_bio->bi_end_io = blkdev_copy_offload_write_endio;
+ write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
+ write_bio->bi_private = ctx;
+
+ atomic_inc(&cio->refcount);
+ submit_bio(read_bio);
+ pos_in += copy_len;
+ pos_out += copy_len;
+ }
+
+ /* Wait for completion of all IO's*/
+ return blkdev_copy_wait_completion(cio);
+
+err_write_bio:
+ bio_put(read_bio);
+err_read_bio:
+ kfree(ctx);
+err_ctx:
+ __free_page(token);
+err_token:
+ cio->comp_len = min_t(sector_t, cio->comp_len, (len - rem));
+ if (!atomic_read(&cio->refcount))
+ return -ENOMEM;
+ /* Wait for submitted IOs to complete */
+ return blkdev_copy_wait_completion(cio);
+}
+
+static inline int blkdev_copy_sanity_check(struct block_device *bdev_in,
+ loff_t pos_in, struct block_device *bdev_out, loff_t pos_out,
+ size_t len)
+{
+ unsigned int align = max(bdev_logical_block_size(bdev_out),
+ bdev_logical_block_size(bdev_in)) - 1;
+
+ if (bdev_read_only(bdev_out))
+ return -EPERM;
+
+ if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
+ len >= COPY_MAX_BYTES)
+ return -EINVAL;
+
+ return 0;
+}
+
+/*
+ * @bdev_in: source block device
+ * @pos_in: source offset
+ * @bdev_out: destination block device
+ * @pos_out: destination offset
+ * @len: length in bytes to be copied
+ * @endio: endio function to be called on completion of copy operation,
+ * for synchronous operation this should be NULL
+ * @private: endio function will be called with this private data, should be
+ * NULL, if operation is synchronous in nature
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ *
+ * Returns the length of bytes copied or error if encountered
+ *
+ * Description:
+ * Copy source offset from source block device to destination block
+ * device. Max total length of copy is limited to MAX_COPY_TOTAL_LENGTH
+ */
+int blkdev_issue_copy(struct block_device *bdev_in, loff_t pos_in,
+ struct block_device *bdev_out, loff_t pos_out, size_t len,
+ cio_iodone_t endio, void *private, gfp_t gfp_mask)
+{
+ struct request_queue *q_in = bdev_get_queue(bdev_in);
+ struct request_queue *q_out = bdev_get_queue(bdev_out);
+ int ret;
+
+ ret = blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len);
+ if (ret)
+ return ret;
+
+ if (blk_queue_copy(q_in) && blk_queue_copy(q_out))
+ ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
+ len, endio, private, gfp_mask);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_issue_copy);
+
static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned flags)
diff --git a/block/blk.h b/block/blk.h
index 45547bcf1119..ec48a237fe12 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -303,6 +303,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
break;
}
+ if (unlikely(op_is_copy(bio->bi_opf)))
+ return false;
/*
* All drivers must accept single-segments bios that are <= PAGE_SIZE.
* This is a quick and dirty check that relies on the fact that
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 740afe80f297..0117e33087e1 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -420,6 +420,7 @@ enum req_flag_bits {
*/
/* for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */
+ __REQ_COPY, /* copy request */
__REQ_NR_BITS, /* stops here */
};
@@ -444,6 +445,7 @@ enum req_flag_bits {
#define REQ_POLLED (__force blk_opf_t)(1ULL << __REQ_POLLED)
#define REQ_ALLOC_CACHE (__force blk_opf_t)(1ULL << __REQ_ALLOC_CACHE)
#define REQ_SWAP (__force blk_opf_t)(1ULL << __REQ_SWAP)
+#define REQ_COPY ((__force blk_opf_t)(1ULL << __REQ_COPY))
#define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV)
#define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
@@ -474,6 +476,11 @@ static inline bool op_is_write(blk_opf_t op)
return !!(op & (__force blk_opf_t)1);
}
+static inline bool op_is_copy(blk_opf_t op)
+{
+ return op & REQ_COPY;
+}
+
/*
* Check if the bio or request is one that needs special treatment in the
* flush state machine.
@@ -533,4 +540,22 @@ struct blk_rq_stat {
u64 batch;
};
+typedef void (cio_iodone_t)(void *private, int comp_len);
+
+struct cio {
+ struct task_struct *waiter; /* waiting task (NULL if none) */
+ atomic_t refcount;
+ loff_t pos_in;
+ loff_t pos_out;
+ size_t comp_len;
+ cio_iodone_t *endio; /* applicable for async operation */
+ void *private; /* applicable for async operation */
+};
+
+struct copy_ctx {
+ struct cio *cio;
+ struct work_struct dispatch_work;
+ struct bio *write_bio;
+};
+
#endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c9bf11adccb3..6f2814ab4741 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1051,6 +1051,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp);
+int blkdev_issue_copy(struct block_device *bdev_in, loff_t pos_in,
+ struct block_device *bdev_out, loff_t pos_out, size_t len,
+ cio_iodone_t end_io, void *private, gfp_t gfp_mask);
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
--
2.35.1.500.gb896f729e2
For device supporting native copy, nvme driver receives read and
write request with BLK_COPY op flags.
For read request the nvme driver populates the payload with source
information.
For write request the driver converts it to nvme copy command using the
source information in the payload and submits to the device.
current design only supports single source range.
This design is courtesy Mikulas Patocka's token based copy
trace event support for nvme_copy_cmd.
Set the device copy limits to queue limits.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier González <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
drivers/nvme/host/constants.c | 1 +
drivers/nvme/host/core.c | 103 +++++++++++++++++++++++++++++++++-
drivers/nvme/host/fc.c | 5 ++
drivers/nvme/host/nvme.h | 7 +++
drivers/nvme/host/pci.c | 27 ++++++++-
drivers/nvme/host/rdma.c | 7 +++
drivers/nvme/host/tcp.c | 16 ++++++
drivers/nvme/host/trace.c | 19 +++++++
include/linux/nvme.h | 43 +++++++++++++-
9 files changed, 220 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index bc523ca02254..01be882b726f 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -19,6 +19,7 @@ static const char * const nvme_ops[] = {
[nvme_cmd_resv_report] = "Reservation Report",
[nvme_cmd_resv_acquire] = "Reservation Acquire",
[nvme_cmd_resv_release] = "Reservation Release",
+ [nvme_cmd_copy] = "Copy Offload",
[nvme_cmd_zone_mgmt_send] = "Zone Management Send",
[nvme_cmd_zone_mgmt_recv] = "Zone Management Receive",
[nvme_cmd_zone_append] = "Zone Management Append",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ccb6eb1282f8..aef7b59dbd61 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -754,6 +754,77 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
}
+static inline void nvme_setup_copy_read(struct nvme_ns *ns, struct request *req)
+{
+ struct bio *bio = req->bio;
+ struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+ token->subsys = "nvme";
+ token->ns = ns;
+ token->src_sector = bio->bi_iter.bi_sector;
+ token->sectors = bio->bi_iter.bi_size >> 9;
+}
+
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+ struct request *req, struct nvme_command *cmnd)
+{
+ struct nvme_copy_range *range = NULL;
+ struct bio *bio = req->bio;
+ struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+ sector_t src_sector, dst_sector, n_sectors;
+ u64 src_lba, dst_lba, n_lba;
+ unsigned short nr_range = 1;
+ u16 control = 0;
+
+ if (unlikely(memcmp(token->subsys, "nvme", 4)))
+ return BLK_STS_NOTSUPP;
+ if (unlikely(token->ns != ns))
+ return BLK_STS_NOTSUPP;
+
+ src_sector = token->src_sector;
+ dst_sector = bio->bi_iter.bi_sector;
+ n_sectors = token->sectors;
+ if (WARN_ON(n_sectors != bio->bi_iter.bi_size >> 9))
+ return BLK_STS_NOTSUPP;
+
+ src_lba = nvme_sect_to_lba(ns, src_sector);
+ dst_lba = nvme_sect_to_lba(ns, dst_sector);
+ n_lba = nvme_sect_to_lba(ns, n_sectors);
+
+ if (WARN_ON(!n_lba))
+ return BLK_STS_NOTSUPP;
+
+ if (req->cmd_flags & REQ_FUA)
+ control |= NVME_RW_FUA;
+
+ if (req->cmd_flags & REQ_FAILFAST_DEV)
+ control |= NVME_RW_LR;
+
+ memset(cmnd, 0, sizeof(*cmnd));
+ cmnd->copy.opcode = nvme_cmd_copy;
+ cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+ cmnd->copy.sdlba = cpu_to_le64(dst_lba);
+
+ range = kmalloc_array(nr_range, sizeof(*range),
+ GFP_ATOMIC | __GFP_NOWARN);
+ if (!range)
+ return BLK_STS_RESOURCE;
+
+ range[0].slba = cpu_to_le64(src_lba);
+ range[0].nlb = cpu_to_le16(n_lba - 1);
+
+ cmnd->copy.nr_range = 0;
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * nr_range;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ cmnd->copy.control = cpu_to_le16(control);
+
+ return BLK_STS_OK;
+}
+
static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
@@ -988,10 +1059,16 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
ret = nvme_setup_discard(ns, req, cmd);
break;
case REQ_OP_READ:
- ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
+ if (unlikely(req->cmd_flags & REQ_COPY))
+ nvme_setup_copy_read(ns, req);
+ else
+ ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
break;
case REQ_OP_WRITE:
- ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+ if (unlikely(req->cmd_flags & REQ_COPY))
+ ret = nvme_setup_copy_write(ns, req, cmd);
+ else
+ ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
break;
case REQ_OP_ZONE_APPEND:
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
@@ -1698,6 +1775,26 @@ static void nvme_config_discard(struct gendisk *disk, struct nvme_ns *ns)
blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
}
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+ struct nvme_id_ns *id)
+{
+ struct nvme_ctrl *ctrl = ns->ctrl;
+ struct request_queue *q = disk->queue;
+
+ if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
+ blk_queue_max_copy_sectors_hw(q, 0);
+ blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+ return;
+ }
+
+ /* setting copy limits */
+ if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))
+ return;
+
+ blk_queue_max_copy_sectors_hw(q,
+ nvme_lba_to_sect(ns, le16_to_cpu(id->mssrl)));
+}
+
static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
{
return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1897,6 +1994,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
set_capacity_and_notify(disk, capacity);
nvme_config_discard(disk, ns);
+ nvme_config_copy(disk, ns, id);
blk_queue_max_write_zeroes_sectors(disk->queue,
ns->ctrl->max_zeroes_sectors);
}
@@ -5343,6 +5441,7 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64);
BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64);
BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64);
+ BUILD_BUG_ON(sizeof(struct nvme_copy_command) != 64);
BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64);
BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64);
BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 2ed75923507d..db2e22b4ca7f 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2807,6 +2807,11 @@ nvme_fc_queue_rq(struct blk_mq_hw_ctx *hctx,
if (ret)
return ret;
+ if (unlikely((rq->cmd_flags & REQ_COPY) &&
+ (req_op(rq) == REQ_OP_READ))) {
+ blk_mq_end_request(rq, BLK_STS_OK);
+ return BLK_STS_OK;
+ }
/*
* nvme core doesn't quite treat the rq opaquely. Commands such
* as WRITE ZEROES will return a non-zero rq payload_bytes yet
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bf46f122e9e1..66af37170bff 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -500,6 +500,13 @@ struct nvme_ns {
};
+struct nvme_copy_token {
+ char *subsys;
+ struct nvme_ns *ns;
+ sector_t src_sector;
+ sector_t sectors;
+};
+
/* NVMe ns supports metadata actions by the controller (generate/strip) */
static inline bool nvme_ns_has_pi(struct nvme_ns *ns)
{
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7f25c0fe3a0b..d5d094fa2fd1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -495,16 +495,19 @@ static inline void nvme_sq_copy_cmd(struct nvme_queue *nvmeq,
nvmeq->sq_tail = 0;
}
-static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
+static inline void nvme_commit_sq_db(struct nvme_queue *nvmeq)
{
- struct nvme_queue *nvmeq = hctx->driver_data;
-
spin_lock(&nvmeq->sq_lock);
if (nvmeq->sq_tail != nvmeq->last_sq_tail)
nvme_write_sq_db(nvmeq, true);
spin_unlock(&nvmeq->sq_lock);
}
+static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
+{
+ nvme_commit_sq_db(hctx->driver_data);
+}
+
static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
int nseg)
{
@@ -848,6 +851,12 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
if (ret)
return ret;
+ if (unlikely((req->cmd_flags & REQ_COPY) &&
+ (req_op(req) == REQ_OP_READ))) {
+ blk_mq_start_request(req);
+ return BLK_STS_OK;
+ }
+
if (blk_rq_nr_phys_segments(req)) {
ret = nvme_map_data(dev, req, &iod->cmd);
if (ret)
@@ -894,6 +903,18 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
ret = nvme_prep_rq(dev, req);
if (unlikely(ret))
return ret;
+ if (unlikely((req->cmd_flags & REQ_COPY) &&
+ (req_op(req) == REQ_OP_READ))) {
+ blk_mq_set_request_complete(req);
+ blk_mq_end_request(req, BLK_STS_OK);
+ /* Commit the sq if copy read was the last req in the list,
+ * as copy read deoesn't update sq db
+ */
+ if (bd->last)
+ nvme_commit_sq_db(nvmeq);
+ return ret;
+ }
+
spin_lock(&nvmeq->sq_lock);
nvme_sq_copy_cmd(nvmeq, &iod->cmd);
nvme_write_sq_db(nvmeq, bd->last);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 0eb79696fb73..be1d20ac8bb0 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2038,6 +2038,13 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
nvme_start_request(rq);
+ if (unlikely((rq->cmd_flags & REQ_COPY) &&
+ (req_op(rq) == REQ_OP_READ))) {
+ blk_mq_end_request(rq, BLK_STS_OK);
+ ret = BLK_STS_OK;
+ goto unmap_qe;
+ }
+
if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
queue->pi_support &&
(c->common.opcode == nvme_cmd_write ||
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bf0230442d57..5ba1bb35c557 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2373,6 +2373,11 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
if (ret)
return ret;
+ if (unlikely((rq->cmd_flags & REQ_COPY) &&
+ (req_op(rq) == REQ_OP_READ))) {
+ return BLK_STS_OK;
+ }
+
req->state = NVME_TCP_SEND_CMD_PDU;
req->status = cpu_to_le16(NVME_SC_SUCCESS);
req->offset = 0;
@@ -2441,6 +2446,17 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
nvme_start_request(rq);
+ if (unlikely((rq->cmd_flags & REQ_COPY) &&
+ (req_op(rq) == REQ_OP_READ))) {
+ blk_mq_set_request_complete(rq);
+ blk_mq_end_request(rq, BLK_STS_OK);
+ /* if copy read is the last req queue tcp reqs */
+ if (bd->last && nvme_tcp_queue_more(queue))
+ queue_work_on(queue->io_cpu, nvme_tcp_wq,
+ &queue->io_work);
+ return ret;
+ }
+
nvme_tcp_queue_request(req, true, bd->last);
return BLK_STS_OK;
diff --git a/drivers/nvme/host/trace.c b/drivers/nvme/host/trace.c
index 1c36fcedea20..da4a7494e5a7 100644
--- a/drivers/nvme/host/trace.c
+++ b/drivers/nvme/host/trace.c
@@ -150,6 +150,23 @@ static const char *nvme_trace_read_write(struct trace_seq *p, u8 *cdw10)
return ret;
}
+static const char *nvme_trace_copy(struct trace_seq *p, u8 *cdw10)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u64 slba = get_unaligned_le64(cdw10);
+ u8 nr_range = get_unaligned_le16(cdw10 + 8);
+ u16 control = get_unaligned_le16(cdw10 + 10);
+ u32 dsmgmt = get_unaligned_le32(cdw10 + 12);
+ u32 reftag = get_unaligned_le32(cdw10 + 16);
+
+ trace_seq_printf(p,
+ "slba=%llu, nr_range=%u, ctrl=0x%x, dsmgmt=%u, reftag=%u",
+ slba, nr_range, control, dsmgmt, reftag);
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
static const char *nvme_trace_dsm(struct trace_seq *p, u8 *cdw10)
{
const char *ret = trace_seq_buffer_ptr(p);
@@ -243,6 +260,8 @@ const char *nvme_trace_parse_nvm_cmd(struct trace_seq *p,
return nvme_trace_zone_mgmt_send(p, cdw10);
case nvme_cmd_zone_mgmt_recv:
return nvme_trace_zone_mgmt_recv(p, cdw10);
+ case nvme_cmd_copy:
+ return nvme_trace_copy(p, cdw10);
default:
return nvme_trace_common(p, cdw10);
}
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 779507ac750b..6582b26e532c 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -337,7 +337,7 @@ struct nvme_id_ctrl {
__u8 nvscc;
__u8 nwpc;
__le16 acwu;
- __u8 rsvd534[2];
+ __le16 ocfs;
__le32 sgls;
__le32 mnan;
__u8 rsvd544[224];
@@ -365,6 +365,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_ONCS_RESERVATIONS = 1 << 5,
NVME_CTRL_ONCS_TIMESTAMP = 1 << 6,
+ NVME_CTRL_ONCS_COPY = 1 << 8,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
NVME_CTRL_OACS_NS_MNGT_SUPP = 1 << 3,
@@ -414,7 +415,10 @@ struct nvme_id_ns {
__le16 npdg;
__le16 npda;
__le16 nows;
- __u8 rsvd74[18];
+ __le16 mssrl;
+ __le32 mcl;
+ __u8 msrc;
+ __u8 rsvd91[11];
__le32 anagrpid;
__u8 rsvd96[3];
__u8 nsattr;
@@ -796,6 +800,7 @@ enum nvme_opcode {
nvme_cmd_resv_report = 0x0e,
nvme_cmd_resv_acquire = 0x11,
nvme_cmd_resv_release = 0x15,
+ nvme_cmd_copy = 0x19,
nvme_cmd_zone_mgmt_send = 0x79,
nvme_cmd_zone_mgmt_recv = 0x7a,
nvme_cmd_zone_append = 0x7d,
@@ -819,7 +824,8 @@ enum nvme_opcode {
nvme_opcode_name(nvme_cmd_resv_release), \
nvme_opcode_name(nvme_cmd_zone_mgmt_send), \
nvme_opcode_name(nvme_cmd_zone_mgmt_recv), \
- nvme_opcode_name(nvme_cmd_zone_append))
+ nvme_opcode_name(nvme_cmd_zone_append), \
+ nvme_opcode_name(nvme_cmd_copy))
@@ -996,6 +1002,36 @@ struct nvme_dsm_range {
__le64 slba;
};
+struct nvme_copy_command {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2;
+ __le64 metadata;
+ union nvme_data_ptr dptr;
+ __le64 sdlba;
+ __u8 nr_range;
+ __u8 rsvd12;
+ __le16 control;
+ __le16 rsvd13;
+ __le16 dspec;
+ __le32 ilbrt;
+ __le16 lbat;
+ __le16 lbatm;
+};
+
+struct nvme_copy_range {
+ __le64 rsvd0;
+ __le64 slba;
+ __le16 nlb;
+ __le16 rsvd18;
+ __le32 rsvd20;
+ __le32 eilbrt;
+ __le16 elbat;
+ __le16 elbatm;
+};
+
struct nvme_write_zeroes_cmd {
__u8 opcode;
__u8 flags;
@@ -1757,6 +1793,7 @@ struct nvme_command {
struct nvme_download_firmware dlfw;
struct nvme_format_cmd format;
struct nvme_dsm_cmd dsm;
+ struct nvme_copy_command copy;
struct nvme_write_zeroes_cmd write_zeroes;
struct nvme_zone_mgmt_send_cmd zms;
struct nvme_zone_mgmt_recv_cmd zmr;
--
2.35.1.500.gb896f729e2
Implementaion is based on existing read and write infrastructure.
copy_max_bytes: A new configfs and module parameter is introduced, which
can be used to set hardware/driver supported maximum copy limit.
Suggested-by: Damien Le Moal <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Vincent Fu <[email protected]>
---
drivers/block/null_blk/main.c | 108 ++++++++++++++++++++++++++++--
drivers/block/null_blk/null_blk.h | 8 +++
2 files changed, 111 insertions(+), 5 deletions(-)
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index b3fedafe301e..34e009b3ebd5 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -157,6 +157,10 @@ static int g_max_sectors;
module_param_named(max_sectors, g_max_sectors, int, 0444);
MODULE_PARM_DESC(max_sectors, "Maximum size of a command (in 512B sectors)");
+static unsigned long g_copy_max_bytes = COPY_MAX_BYTES;
+module_param_named(copy_max_bytes, g_copy_max_bytes, ulong, 0444);
+MODULE_PARM_DESC(copy_max_bytes, "Maximum size of a copy command (in bytes)");
+
static unsigned int nr_devices = 1;
module_param(nr_devices, uint, 0444);
MODULE_PARM_DESC(nr_devices, "Number of devices to register");
@@ -409,6 +413,7 @@ NULLB_DEVICE_ATTR(home_node, uint, NULL);
NULLB_DEVICE_ATTR(queue_mode, uint, NULL);
NULLB_DEVICE_ATTR(blocksize, uint, NULL);
NULLB_DEVICE_ATTR(max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(copy_max_bytes, uint, NULL);
NULLB_DEVICE_ATTR(irqmode, uint, NULL);
NULLB_DEVICE_ATTR(hw_queue_depth, uint, NULL);
NULLB_DEVICE_ATTR(index, uint, NULL);
@@ -550,6 +555,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
&nullb_device_attr_queue_mode,
&nullb_device_attr_blocksize,
&nullb_device_attr_max_sectors,
+ &nullb_device_attr_copy_max_bytes,
&nullb_device_attr_irqmode,
&nullb_device_attr_hw_queue_depth,
&nullb_device_attr_index,
@@ -656,7 +662,8 @@ static ssize_t memb_group_features_show(struct config_item *item, char *page)
"poll_queues,power,queue_mode,shared_tag_bitmap,size,"
"submit_queues,use_per_node_hctx,virt_boundary,zoned,"
"zone_capacity,zone_max_active,zone_max_open,"
- "zone_nr_conv,zone_offline,zone_readonly,zone_size\n");
+ "zone_nr_conv,zone_offline,zone_readonly,zone_size,"
+ "copy_max_bytes\n");
}
CONFIGFS_ATTR_RO(memb_group_, features);
@@ -722,6 +729,7 @@ static struct nullb_device *null_alloc_dev(void)
dev->queue_mode = g_queue_mode;
dev->blocksize = g_bs;
dev->max_sectors = g_max_sectors;
+ dev->copy_max_bytes = g_copy_max_bytes;
dev->irqmode = g_irqmode;
dev->hw_queue_depth = g_hw_queue_depth;
dev->blocking = g_blocking;
@@ -1271,6 +1279,78 @@ static int null_transfer(struct nullb *nullb, struct page *page,
return err;
}
+static inline void nullb_setup_copy_read(struct nullb *nullb, struct bio *bio)
+{
+ struct nullb_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+ token->subsys = "nullb";
+ token->sector_in = bio->bi_iter.bi_sector;
+ token->nullb = nullb;
+ token->sectors = bio->bi_iter.bi_size >> SECTOR_SHIFT;
+}
+
+static inline int nullb_setup_copy_write(struct nullb *nullb,
+ struct bio *bio, bool is_fua)
+{
+ struct nullb_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+ sector_t sector_in, sector_out;
+ void *in, *out;
+ size_t rem, temp;
+ unsigned long offset_in, offset_out;
+ struct nullb_page *t_page_in, *t_page_out;
+ int ret = -EIO;
+
+ if (unlikely(memcmp(token->subsys, "nullb", 5)))
+ return -EINVAL;
+ if (unlikely(token->nullb != nullb))
+ return -EINVAL;
+ if (WARN_ON(token->sectors != bio->bi_iter.bi_size >> SECTOR_SHIFT))
+ return -EINVAL;
+
+ sector_in = token->sector_in;
+ sector_out = bio->bi_iter.bi_sector;
+ rem = token->sectors << SECTOR_SHIFT;
+
+ spin_lock_irq(&nullb->lock);
+ while (rem > 0) {
+ temp = min_t(size_t, nullb->dev->blocksize, rem);
+ offset_in = (sector_in & SECTOR_MASK) << SECTOR_SHIFT;
+ offset_out = (sector_out & SECTOR_MASK) << SECTOR_SHIFT;
+
+ if (null_cache_active(nullb) && !is_fua)
+ null_make_cache_space(nullb, PAGE_SIZE);
+
+ t_page_in = null_lookup_page(nullb, sector_in, false,
+ !null_cache_active(nullb));
+ if (!t_page_in)
+ goto err;
+ t_page_out = null_insert_page(nullb, sector_out,
+ !null_cache_active(nullb) || is_fua);
+ if (!t_page_out)
+ goto err;
+
+ in = kmap_local_page(t_page_in->page);
+ out = kmap_local_page(t_page_out->page);
+
+ memcpy(out + offset_out, in + offset_in, temp);
+ kunmap_local(out);
+ kunmap_local(in);
+ __set_bit(sector_out & SECTOR_MASK, t_page_out->bitmap);
+
+ if (is_fua)
+ null_free_sector(nullb, sector_out, true);
+
+ rem -= temp;
+ sector_in += temp >> SECTOR_SHIFT;
+ sector_out += temp >> SECTOR_SHIFT;
+ }
+
+ ret = 0;
+err:
+ spin_unlock_irq(&nullb->lock);
+ return ret;
+}
+
static int null_handle_rq(struct nullb_cmd *cmd)
{
struct request *rq = cmd->rq;
@@ -1280,13 +1360,20 @@ static int null_handle_rq(struct nullb_cmd *cmd)
sector_t sector = blk_rq_pos(rq);
struct req_iterator iter;
struct bio_vec bvec;
+ bool fua = rq->cmd_flags & REQ_FUA;
+
+ if (rq->cmd_flags & REQ_COPY) {
+ if (op_is_write(req_op(rq)))
+ return nullb_setup_copy_write(nullb, rq->bio, fua);
+ nullb_setup_copy_read(nullb, rq->bio);
+ return 0;
+ }
spin_lock_irq(&nullb->lock);
rq_for_each_segment(bvec, rq, iter) {
len = bvec.bv_len;
err = null_transfer(nullb, bvec.bv_page, len, bvec.bv_offset,
- op_is_write(req_op(rq)), sector,
- rq->cmd_flags & REQ_FUA);
+ op_is_write(req_op(rq)), sector, fua);
if (err) {
spin_unlock_irq(&nullb->lock);
return err;
@@ -1307,13 +1394,20 @@ static int null_handle_bio(struct nullb_cmd *cmd)
sector_t sector = bio->bi_iter.bi_sector;
struct bio_vec bvec;
struct bvec_iter iter;
+ bool fua = bio->bi_opf & REQ_FUA;
+
+ if (bio->bi_opf & REQ_COPY) {
+ if (op_is_write(bio_op(bio)))
+ return nullb_setup_copy_write(nullb, bio, fua);
+ nullb_setup_copy_read(nullb, bio);
+ return 0;
+ }
spin_lock_irq(&nullb->lock);
bio_for_each_segment(bvec, bio, iter) {
len = bvec.bv_len;
err = null_transfer(nullb, bvec.bv_page, len, bvec.bv_offset,
- op_is_write(bio_op(bio)), sector,
- bio->bi_opf & REQ_FUA);
+ op_is_write(bio_op(bio)), sector, fua);
if (err) {
spin_unlock_irq(&nullb->lock);
return err;
@@ -2161,6 +2255,10 @@ static int null_add_dev(struct nullb_device *dev)
dev->max_sectors = queue_max_hw_sectors(nullb->q);
dev->max_sectors = min(dev->max_sectors, BLK_DEF_MAX_SECTORS);
blk_queue_max_hw_sectors(nullb->q, dev->max_sectors);
+ blk_queue_max_copy_sectors_hw(nullb->q,
+ dev->copy_max_bytes >> SECTOR_SHIFT);
+ if (dev->copy_max_bytes)
+ blk_queue_flag_set(QUEUE_FLAG_COPY, nullb->disk->queue);
if (dev->virt_boundary)
blk_queue_virt_boundary(nullb->q, PAGE_SIZE - 1);
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 929f659dd255..3dda593b0747 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -67,6 +67,13 @@ enum {
NULL_Q_MQ = 2,
};
+struct nullb_copy_token {
+ char *subsys;
+ struct nullb *nullb;
+ sector_t sector_in;
+ sector_t sectors;
+};
+
struct nullb_device {
struct nullb *nullb;
struct config_group group;
@@ -107,6 +114,7 @@ struct nullb_device {
unsigned int queue_mode; /* block interface */
unsigned int blocksize; /* block size */
unsigned int max_sectors; /* Max sectors per command */
+ unsigned long copy_max_bytes; /* Max copy offload length in bytes */
unsigned int irqmode; /* IRQ completion handler */
unsigned int hw_queue_depth; /* queue depth */
unsigned int index; /* index of the disk, only valid with a disk */
--
2.35.1.500.gb896f729e2
For direct block device opened with O_DIRECT, use copy_file_range to
issue device copy offload, and fallback to generic_copy_file_range incase
device copy offload capability is absent.
Modify checks to allow bdevs to use copy_file_range.
Suggested-by: Ming Lei <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
block/blk-lib.c | 23 +++++++++++++++++++++++
block/fops.c | 20 ++++++++++++++++++++
fs/read_write.c | 11 +++++++++--
include/linux/blkdev.h | 3 +++
mm/filemap.c | 11 ++++++++---
5 files changed, 63 insertions(+), 5 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index ba32545eb8d5..7d6ef85692a6 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -523,6 +523,29 @@ int blkdev_issue_copy(struct block_device *bdev_in, loff_t pos_in,
}
EXPORT_SYMBOL_GPL(blkdev_issue_copy);
+/* Returns the length of bytes copied */
+int blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
+ struct block_device *bdev_out, loff_t pos_out, size_t len,
+ gfp_t gfp_mask)
+{
+ struct request_queue *in_q = bdev_get_queue(bdev_in);
+ struct request_queue *out_q = bdev_get_queue(bdev_out);
+ int ret = 0;
+
+ if (blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len))
+ return 0;
+
+ if (blk_queue_copy(in_q) && blk_queue_copy(out_q)) {
+ ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
+ len, NULL, NULL, gfp_mask);
+ if (ret < 0)
+ return 0;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_copy_offload);
+
static int __blkdev_issue_write_zeroes(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned flags)
diff --git a/block/fops.c b/block/fops.c
index ab750e8a040f..df8985675ed1 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -614,6 +614,25 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
return ret;
}
+static ssize_t blkdev_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+ struct block_device *in_bdev = I_BDEV(bdev_file_inode(file_in));
+ struct block_device *out_bdev = I_BDEV(bdev_file_inode(file_out));
+ int comp_len = 0;
+
+ if ((file_in->f_iocb_flags & IOCB_DIRECT) &&
+ (file_out->f_iocb_flags & IOCB_DIRECT))
+ comp_len = blkdev_copy_offload(in_bdev, pos_in, out_bdev,
+ pos_out, len, GFP_KERNEL);
+ if (comp_len != len)
+ comp_len = generic_copy_file_range(file_in, pos_in + comp_len,
+ file_out, pos_out + comp_len, len - comp_len, flags);
+
+ return comp_len;
+}
+
#define BLKDEV_FALLOC_FL_SUPPORTED \
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \
FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
@@ -697,6 +716,7 @@ const struct file_operations def_blk_fops = {
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
.fallocate = blkdev_fallocate,
+ .copy_file_range = blkdev_copy_file_range,
};
static __init int blkdev_init(void)
diff --git a/fs/read_write.c b/fs/read_write.c
index a21ba3be7dbe..47e848fcfd42 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -20,6 +20,7 @@
#include <linux/compat.h>
#include <linux/mount.h>
#include <linux/fs.h>
+#include <linux/blkdev.h>
#include "internal.h"
#include <linux/uaccess.h>
@@ -1447,7 +1448,11 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
return -EOVERFLOW;
/* Shorten the copy to EOF */
- size_in = i_size_read(inode_in);
+ if (S_ISBLK(inode_in->i_mode))
+ size_in = bdev_nr_bytes(I_BDEV(file_in->f_mapping->host));
+ else
+ size_in = i_size_read(inode_in);
+
if (pos_in >= size_in)
count = 0;
else
@@ -1708,7 +1713,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
/* Don't copy dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
- if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+
+ if ((!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) &&
+ (!S_ISBLK(inode_in->i_mode) || !S_ISBLK(inode_out->i_mode)))
return -EINVAL;
if (!(file_in->f_mode & FMODE_READ) ||
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a95c26faa8b6..a9bb7e3a8c79 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1054,6 +1054,9 @@ int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
int blkdev_issue_copy(struct block_device *bdev_in, loff_t pos_in,
struct block_device *bdev_out, loff_t pos_out, size_t len,
cio_iodone_t end_io, void *private, gfp_t gfp_mask);
+int blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
+ struct block_device *bdev_out, loff_t pos_out, size_t len,
+ gfp_t gfp_mask);
struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
gfp_t gfp_mask);
void bio_map_kern_endio(struct bio *bio);
diff --git a/mm/filemap.c b/mm/filemap.c
index 570bc8c3db87..289f0c8229ec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -48,6 +48,7 @@
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "internal.h"
+#include <linux/blkdev.h>
#define CREATE_TRACE_POINTS
#include <trace/events/filemap.h>
@@ -2855,7 +2856,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
{
struct folio_batch fbatch;
struct kiocb iocb;
- size_t total_spliced = 0, used, npages;
+ size_t total_spliced = 0, used, npages, size_in;
loff_t isize, end_offset;
bool writably_mapped;
int i, error = 0;
@@ -2863,6 +2864,10 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
init_sync_kiocb(&iocb, in);
iocb.ki_pos = *ppos;
+ if (S_ISBLK(file_inode(in)->i_mode))
+ size_in = bdev_nr_bytes(I_BDEV(in->f_mapping->host));
+ else
+ size_in = i_size_read(file_inode(in));
/* Work out how much data we can actually add into the pipe */
used = pipe_occupancy(pipe->head, pipe->tail);
npages = max_t(ssize_t, pipe->max_usage - used, 0);
@@ -2873,7 +2878,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
do {
cond_resched();
- if (*ppos >= i_size_read(file_inode(in)))
+ if (*ppos >= size_in)
break;
iocb.ki_pos = *ppos;
@@ -2889,7 +2894,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
* part of the page is not copied back to userspace (unless
* another truncate extends the file - this is desired though).
*/
- isize = i_size_read(file_inode(in));
+ isize = size_in;
if (unlikely(*ppos >= isize))
break;
end_offset = min_t(loff_t, isize, *ppos + len);
--
2.35.1.500.gb896f729e2
Add device limits as sysfs entries,
- copy_offload (RW)
- copy_max_bytes (RW)
- copy_max_bytes_hw (RO)
Above limits help to split the copy payload in block layer.
copy_offload: used for setting copy offload(1) or emulation(0).
copy_max_bytes: maximum total length of copy in single payload.
copy_max_bytes_hw: Reflects the device supported maximum limit.
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Anuj Gupta <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 33 ++++++++++++++
block/blk-settings.c | 24 +++++++++++
block/blk-sysfs.c | 64 ++++++++++++++++++++++++++++
include/linux/blkdev.h | 12 ++++++
include/uapi/linux/fs.h | 3 ++
5 files changed, 136 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index c57e5b7cb532..e4d31132f77c 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -155,6 +155,39 @@ Description:
last zone of the device which may be smaller.
+What: /sys/block/<disk>/queue/copy_offload
+Date: April 2023
+Contact: [email protected]
+Description:
+ [RW] When read, this file shows whether offloading copy to a
+ device is enabled (1) or disabled (0). Writing '0' to this
+ file will disable offloading copies for this device.
+ Writing any '1' value will enable this feature. If the device
+ does not support offloading, then writing 1, will result in
+ error.
+
+
+What: /sys/block/<disk>/queue/copy_max_bytes
+Date: April 2023
+Contact: [email protected]
+Description:
+ [RW] This is the maximum number of bytes, that the block layer
+ will allow for copy request. This will be smaller or equal to
+ the maximum size allowed by the hardware, indicated by
+ 'copy_max_bytes_hw'. Attempt to set value higher than
+ 'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
+
+
+What: /sys/block/<disk>/queue/copy_max_bytes_hw
+Date: April 2023
+Contact: [email protected]
+Description:
+ [RO] This is the maximum number of bytes, that the hardware
+ will allow in a single data copy request.
+ A value of 0 means that the device does not support
+ copy offload.
+
+
What: /sys/block/<disk>/queue/crypto/
Date: February 2022
Contact: [email protected]
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 896b4654ab00..23aff2d4dcba 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
+ lim->max_copy_sectors_hw = 0;
+ lim->max_copy_sectors = 0;
}
/**
@@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+ lim->max_copy_sectors_hw = ULONG_MAX;
+ lim->max_copy_sectors = ULONG_MAX;
}
EXPORT_SYMBOL(blk_set_stacking_limits);
@@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
}
EXPORT_SYMBOL(blk_queue_max_discard_sectors);
+/**
+ * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
+ * @q: the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+ unsigned int max_copy_sectors)
+{
+ if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
+ max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
+
+ q->limits.max_copy_sectors_hw = max_copy_sectors;
+ q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
+
/**
* blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
* @q: the request queue for the device
@@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
b->max_segment_size);
+ t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+ t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
+ b->max_copy_sectors_hw);
+
t->misaligned |= b->misaligned;
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a64208583853..826ab29beba3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -212,6 +212,63 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
return queue_var_show(0, page);
}
+static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(blk_queue_copy(q), page);
+}
+
+static ssize_t queue_copy_offload_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ s64 copy_offload;
+ ssize_t ret = queue_var_store64(©_offload, page);
+
+ if (ret < 0)
+ return ret;
+
+ if (copy_offload && !q->limits.max_copy_sectors_hw)
+ return -EINVAL;
+
+ if (copy_offload)
+ blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+ else
+ blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+
+ return count;
+}
+
+static ssize_t queue_copy_max_hw_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%llu\n", (unsigned long long)
+ q->limits.max_copy_sectors_hw << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%llu\n", (unsigned long long)
+ q->limits.max_copy_sectors << SECTOR_SHIFT);
+}
+
+static ssize_t queue_copy_max_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ s64 max_copy;
+ ssize_t ret = queue_var_store64(&max_copy, page);
+
+ if (ret < 0)
+ return ret;
+
+ if (max_copy & (queue_logical_block_size(q) - 1))
+ return -EINVAL;
+
+ max_copy >>= SECTOR_SHIFT;
+ if (max_copy > q->limits.max_copy_sectors_hw)
+ max_copy = q->limits.max_copy_sectors_hw;
+
+ q->limits.max_copy_sectors = max_copy;
+ return count;
+}
+
static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
{
return queue_var_show(0, page);
@@ -590,6 +647,10 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
+QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
+QUEUE_RO_ENTRY(queue_copy_max_hw, "copy_max_bytes_hw");
+QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
+
QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
QUEUE_RW_ENTRY(queue_poll, "io_poll");
@@ -637,6 +698,9 @@ static struct attribute *queue_attrs[] = {
&queue_discard_max_entry.attr,
&queue_discard_max_hw_entry.attr,
&queue_discard_zeroes_data_entry.attr,
+ &queue_copy_offload_entry.attr,
+ &queue_copy_max_hw_entry.attr,
+ &queue_copy_max_entry.attr,
&queue_write_same_max_entry.attr,
&queue_write_zeroes_max_entry.attr,
&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b441e633f4dd..c9bf11adccb3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -295,6 +295,9 @@ struct queue_limits {
unsigned int discard_alignment;
unsigned int zone_write_granularity;
+ unsigned long max_copy_sectors_hw;
+ unsigned long max_copy_sectors;
+
unsigned short max_segments;
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
@@ -561,6 +564,7 @@ struct request_queue {
#define QUEUE_FLAG_NOWAIT 29 /* device supports NOWAIT */
#define QUEUE_FLAG_SQ_SCHED 30 /* single queue style io dispatch */
#define QUEUE_FLAG_SKIP_TAGSET_QUIESCE 31 /* quiesce_tagset skip the queue*/
+#define QUEUE_FLAG_COPY 32 /* supports copy offload */
#define QUEUE_FLAG_MQ_DEFAULT ((1UL << QUEUE_FLAG_IO_STAT) | \
(1UL << QUEUE_FLAG_SAME_COMP) | \
@@ -581,6 +585,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
+#define blk_queue_copy(q) test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
#define blk_queue_zone_resetall(q) \
test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
#define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
@@ -899,6 +904,8 @@ extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
extern void blk_queue_max_segments(struct request_queue *, unsigned short);
extern void blk_queue_max_discard_segments(struct request_queue *,
unsigned short);
+extern void blk_queue_max_copy_sectors_hw(struct request_queue *q,
+ unsigned int max_copy_sectors);
void blk_queue_max_secure_erase_sectors(struct request_queue *q,
unsigned int max_sectors);
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
@@ -1218,6 +1225,11 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
return bdev_get_queue(bdev)->limits.discard_granularity;
}
+static inline unsigned int bdev_max_copy_sectors(struct block_device *bdev)
+{
+ return bdev_get_queue(bdev)->limits.max_copy_sectors;
+}
+
static inline unsigned int
bdev_max_secure_erase_sectors(struct block_device *bdev)
{
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..8879567791fa 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,9 @@ struct fstrim_range {
__u64 minlen;
};
+/* maximum total copy length */
+#define COPY_MAX_BYTES (1 << 27)
+
/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
#define FILE_DEDUPE_RANGE_SAME 0
#define FILE_DEDUPE_RANGE_DIFFERS 1
--
2.35.1.500.gb896f729e2
Setting copy_offload_supported flag to enable offload.
Signed-off-by: Nitesh Shetty <[email protected]>
---
drivers/md/dm-linear.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index f4448d520ee9..1d1ee30bbefb 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -62,6 +62,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
ti->num_write_zeroes_bios = 1;
+ ti->copy_offload_supported = 1;
ti->private = lc;
return 0;
--
2.35.1.500.gb896f729e2
On 5/22/23 19:41, Nitesh Shetty wrote:
> Add device limits as sysfs entries,
> - copy_offload (RW)
> - copy_max_bytes (RW)
> - copy_max_bytes_hw (RO)
>
> Above limits help to split the copy payload in block layer.
> copy_offload: used for setting copy offload(1) or emulation(0).
> copy_max_bytes: maximum total length of copy in single payload.
> copy_max_bytes_hw: Reflects the device supported maximum limit.
>
> Reviewed-by: Hannes Reinecke <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Anuj Gupta <[email protected]>
> ---
> Documentation/ABI/stable/sysfs-block | 33 ++++++++++++++
> block/blk-settings.c | 24 +++++++++++
> block/blk-sysfs.c | 64 ++++++++++++++++++++++++++++
> include/linux/blkdev.h | 12 ++++++
> include/uapi/linux/fs.h | 3 ++
> 5 files changed, 136 insertions(+)
>
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index c57e5b7cb532..e4d31132f77c 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -155,6 +155,39 @@ Description:
> last zone of the device which may be smaller.
>
>
> +What: /sys/block/<disk>/queue/copy_offload
> +Date: April 2023
> +Contact: [email protected]
> +Description:
> + [RW] When read, this file shows whether offloading copy to a
> + device is enabled (1) or disabled (0). Writing '0' to this
> + file will disable offloading copies for this device.
> + Writing any '1' value will enable this feature. If the device
> + does not support offloading, then writing 1, will result in
> + error.
will result is an error.
> +
> +
> +What: /sys/block/<disk>/queue/copy_max_bytes
> +Date: April 2023
> +Contact: [email protected]
> +Description:
> + [RW] This is the maximum number of bytes, that the block layer
Please drop the comma after block.
> + will allow for copy request. This will be smaller or equal to
will allow for a copy request. This value is always smaller...
> + the maximum size allowed by the hardware, indicated by
> + 'copy_max_bytes_hw'. Attempt to set value higher than
An attempt to set a value higher than...
> + 'copy_max_bytes_hw' will truncate this to 'copy_max_bytes_hw'.
> +
> +
> +What: /sys/block/<disk>/queue/copy_max_bytes_hw
> +Date: April 2023
> +Contact: [email protected]
> +Description:
> + [RO] This is the maximum number of bytes, that the hardware
drop the comma after bytes
> + will allow in a single data copy request.
will allow for
> + A value of 0 means that the device does not support
> + copy offload.
Given that you do have copy emulation for devices that do not support hw
offload, how is the user supposed to know the maximum size of a copy request
when it is emulated ? This is not obvious from looking at these parameters.
> +
> +
> What: /sys/block/<disk>/queue/crypto/
> Date: February 2022
> Contact: [email protected]
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 896b4654ab00..23aff2d4dcba 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -59,6 +59,8 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->zoned = BLK_ZONED_NONE;
> lim->zone_write_granularity = 0;
> lim->dma_alignment = 511;
> + lim->max_copy_sectors_hw = 0;
> + lim->max_copy_sectors = 0;
> }
>
> /**
> @@ -82,6 +84,8 @@ void blk_set_stacking_limits(struct queue_limits *lim)
> lim->max_dev_sectors = UINT_MAX;
> lim->max_write_zeroes_sectors = UINT_MAX;
> lim->max_zone_append_sectors = UINT_MAX;
> + lim->max_copy_sectors_hw = ULONG_MAX;
> + lim->max_copy_sectors = ULONG_MAX;
UINT_MAX is not enough ?
> }
> EXPORT_SYMBOL(blk_set_stacking_limits);
>
> @@ -183,6 +187,22 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
> }
> EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>
> +/**
> + * blk_queue_max_copy_sectors_hw - set max sectors for a single copy payload
> + * @q: the request queue for the device
> + * @max_copy_sectors: maximum number of sectors to copy
> + **/
> +void blk_queue_max_copy_sectors_hw(struct request_queue *q,
> + unsigned int max_copy_sectors)
> +{
> + if (max_copy_sectors > (COPY_MAX_BYTES >> SECTOR_SHIFT))
> + max_copy_sectors = COPY_MAX_BYTES >> SECTOR_SHIFT;
> +
> + q->limits.max_copy_sectors_hw = max_copy_sectors;
> + q->limits.max_copy_sectors = max_copy_sectors;
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors_hw);
> +
> /**
> * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
> * @q: the request queue for the device
> @@ -578,6 +598,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> t->max_segment_size = min_not_zero(t->max_segment_size,
> b->max_segment_size);
>
> + t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
> + t->max_copy_sectors_hw = min(t->max_copy_sectors_hw,
> + b->max_copy_sectors_hw);
> +
> t->misaligned |= b->misaligned;
>
> alignment = queue_limit_alignment_offset(b, start);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index a64208583853..826ab29beba3 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -212,6 +212,63 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
> return queue_var_show(0, page);
> }
>
> +static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
> +{
> + return queue_var_show(blk_queue_copy(q), page);
> +}
> +
> +static ssize_t queue_copy_offload_store(struct request_queue *q,
> + const char *page, size_t count)
> +{
> + s64 copy_offload;
> + ssize_t ret = queue_var_store64(©_offload, page);
> +
> + if (ret < 0)
> + return ret;
> +
> + if (copy_offload && !q->limits.max_copy_sectors_hw)
> + return -EINVAL;
> +
> + if (copy_offload)
> + blk_queue_flag_set(QUEUE_FLAG_COPY, q);
> + else
> + blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
> +
> + return count;
> +}
> +
> +static ssize_t queue_copy_max_hw_show(struct request_queue *q, char *page)
> +{
> + return sprintf(page, "%llu\n", (unsigned long long)
> + q->limits.max_copy_sectors_hw << SECTOR_SHIFT);
> +}
> +
> +static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
> +{
> + return sprintf(page, "%llu\n", (unsigned long long)
> + q->limits.max_copy_sectors << SECTOR_SHIFT);
> +}
> +
> +static ssize_t queue_copy_max_store(struct request_queue *q,
> + const char *page, size_t count)
> +{
> + s64 max_copy;
> + ssize_t ret = queue_var_store64(&max_copy, page);
> +
> + if (ret < 0)
> + return ret;
> +
> + if (max_copy & (queue_logical_block_size(q) - 1))
> + return -EINVAL;
> +
> + max_copy >>= SECTOR_SHIFT;
> + if (max_copy > q->limits.max_copy_sectors_hw)
> + max_copy = q->limits.max_copy_sectors_hw;
> +
> + q->limits.max_copy_sectors = max_copy;
q->limits.max_copy_sectors =
min(max_copy, q->limits.max_copy_sectors_hw);
To remove the if above.
> + return count;
> +}
> +
> static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
> {
> return queue_var_show(0, page);
> @@ -590,6 +647,10 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
> QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
> QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>
> +QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
> +QUEUE_RO_ENTRY(queue_copy_max_hw, "copy_max_bytes_hw");
> +QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
> +
> QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
> QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
> QUEUE_RW_ENTRY(queue_poll, "io_poll");
> @@ -637,6 +698,9 @@ static struct attribute *queue_attrs[] = {
> &queue_discard_max_entry.attr,
> &queue_discard_max_hw_entry.attr,
> &queue_discard_zeroes_data_entry.attr,
> + &queue_copy_offload_entry.attr,
> + &queue_copy_max_hw_entry.attr,
> + &queue_copy_max_entry.attr,
> &queue_write_same_max_entry.attr,
> &queue_write_zeroes_max_entry.attr,
> &queue_zone_append_max_entry.attr,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index b441e633f4dd..c9bf11adccb3 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -295,6 +295,9 @@ struct queue_limits {
> unsigned int discard_alignment;
> unsigned int zone_write_granularity;
>
> + unsigned long max_copy_sectors_hw;
> + unsigned long max_copy_sectors;
Why unsigned long ? unsigned int gives you 4G * 512 = 2TB max copy. Isn't that a
large enough max limit ?
> +
> unsigned short max_segments;
> unsigned short max_integrity_segments;
> unsigned short max_discard_segments;
> @@ -561,6 +564,7 @@ struct request_queue {
> #define QUEUE_FLAG_NOWAIT 29 /* device supports NOWAIT */
> #define QUEUE_FLAG_SQ_SCHED 30 /* single queue style io dispatch */
> #define QUEUE_FLAG_SKIP_TAGSET_QUIESCE 31 /* quiesce_tagset skip the queue*/
> +#define QUEUE_FLAG_COPY 32 /* supports copy offload */
Nope. That is misleading. This flag is cleared in queue_copy_offload_store() if
the user writes 0. So this flag indicates that copy offload is enabled or
disabled, not that the device supports it. For the device support, one has to
look at max copy sectors hw being != 0.
>
> #define QUEUE_FLAG_MQ_DEFAULT ((1UL << QUEUE_FLAG_IO_STAT) | \
> (1UL << QUEUE_FLAG_SAME_COMP) | \
> @@ -581,6 +585,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
> test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags)
> #define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
> #define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
> +#define blk_queue_copy(q) test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
> #define blk_queue_zone_resetall(q) \
> test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
> #define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
> @@ -899,6 +904,8 @@ extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
> extern void blk_queue_max_segments(struct request_queue *, unsigned short);
> extern void blk_queue_max_discard_segments(struct request_queue *,
> unsigned short);
> +extern void blk_queue_max_copy_sectors_hw(struct request_queue *q,
> + unsigned int max_copy_sectors);
> void blk_queue_max_secure_erase_sectors(struct request_queue *q,
> unsigned int max_sectors);
> extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
> @@ -1218,6 +1225,11 @@ static inline unsigned int bdev_discard_granularity(struct block_device *bdev)
> return bdev_get_queue(bdev)->limits.discard_granularity;
> }
>
> +static inline unsigned int bdev_max_copy_sectors(struct block_device *bdev)
> +{
> + return bdev_get_queue(bdev)->limits.max_copy_sectors;
> +}
> +
> static inline unsigned int
> bdev_max_secure_erase_sectors(struct block_device *bdev)
> {
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..8879567791fa 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,9 @@ struct fstrim_range {
> __u64 minlen;
> };
>
> +/* maximum total copy length */
> +#define COPY_MAX_BYTES (1 << 27)
My brain bit shifter is not as good as a CPU one :) So it would be nice to
mention what value that is in MB and also explain where this magic value comes from.
> +
> /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
> #define FILE_DEDUPE_RANGE_SAME 0
> #define FILE_DEDUPE_RANGE_DIFFERS 1
--
Damien Le Moal
Western Digital Research
On Mon, May 22, 2023 at 04:11:33PM +0530, Nitesh Shetty wrote:
> Introduce blkdev_issue_copy which takes similar arguments as
> copy_file_range and performs copy offload between two bdevs.
> Introduce REQ_COPY copy offload operation flag. Create a read-write
> bio pair with a token as payload and submitted to the device in order.
> Read request populates token with source specific information which
> is then passed with write request.
> This design is courtesy Mikulas Patocka's token based copy
>
> Larger copy will be divided, based on max_copy_sectors limit.
>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Anuj Gupta <[email protected]>
> ---
> block/blk-lib.c | 235 ++++++++++++++++++++++++++++++++++++++
> block/blk.h | 2 +
> include/linux/blk_types.h | 25 ++++
> include/linux/blkdev.h | 3 +
> 4 files changed, 265 insertions(+)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e59c3069e835..ed089e703cb1 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -115,6 +115,241 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> }
> EXPORT_SYMBOL(blkdev_issue_discard);
>
> +/*
> + * For synchronous copy offload/emulation, wait and process all in-flight BIOs.
> + * This must only be called once all bios have been issued so that the refcount
> + * can only decrease. This just waits for all bios to make it through
> + * blkdev_copy_write_endio.
> + */
> +static int blkdev_copy_wait_completion(struct cio *cio)
> +{
> + int ret;
> +
> + if (cio->endio)
> + return 0;
> +
> + if (atomic_read(&cio->refcount)) {
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> + blk_io_schedule();
> + }
> +
> + ret = cio->comp_len;
> + kfree(cio);
> +
> + return ret;
> +}
> +
> +static void blkdev_copy_offload_write_endio(struct bio *bio)
> +{
> + struct copy_ctx *ctx = bio->bi_private;
> + struct cio *cio = ctx->cio;
> + sector_t clen;
> +
> + if (bio->bi_status) {
> + clen = (bio->bi_iter.bi_sector << SECTOR_SHIFT) - cio->pos_out;
> + cio->comp_len = min_t(sector_t, clen, cio->comp_len);
> + }
> + __free_page(bio->bi_io_vec[0].bv_page);
> + bio_put(bio);
> +
> + kfree(ctx);
> + if (!atomic_dec_and_test(&cio->refcount))
> + return;
> + if (cio->endio) {
> + cio->endio(cio->private, cio->comp_len);
> + kfree(cio);
> + } else
> + blk_wake_io_task(cio->waiter);
> +}
> +
> +static void blkdev_copy_offload_read_endio(struct bio *read_bio)
> +{
> + struct copy_ctx *ctx = read_bio->bi_private;
> + struct cio *cio = ctx->cio;
> + sector_t clen;
> +
> + if (read_bio->bi_status) {
> + clen = (read_bio->bi_iter.bi_sector << SECTOR_SHIFT)
> + - cio->pos_in;
> + cio->comp_len = min_t(sector_t, clen, cio->comp_len);
> + __free_page(read_bio->bi_io_vec[0].bv_page);
> + bio_put(ctx->write_bio);
> + bio_put(read_bio);
> + kfree(ctx);
> + if (atomic_dec_and_test(&cio->refcount)) {
> + if (cio->endio) {
> + cio->endio(cio->private, cio->comp_len);
> + kfree(cio);
> + } else
> + blk_wake_io_task(cio->waiter);
> + }
> + return;
> + }
> +
> + schedule_work(&ctx->dispatch_work);
> + bio_put(read_bio);
> +}
> +
> +static void blkdev_copy_dispatch_work(struct work_struct *work)
> +{
> + struct copy_ctx *ctx = container_of(work, struct copy_ctx,
> + dispatch_work);
> +
> + submit_bio(ctx->write_bio);
> +}
> +
> +/*
> + * __blkdev_copy_offload - Use device's native copy offload feature.
> + * we perform copy operation by sending 2 bio.
> + * 1. First we send a read bio with REQ_COPY flag along with a token and source
> + * and length. Once read bio reaches driver layer, device driver adds all the
> + * source info to token and does a fake completion.
> + * 2. Once read operation completes, we issue write with REQ_COPY flag with same
> + * token. In driver layer, token info is used to form a copy offload command.
> + *
> + * Returns the length of bytes copied or error if encountered
> + */
> +static int __blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
> + struct block_device *bdev_out, loff_t pos_out, size_t len,
> + cio_iodone_t endio, void *private, gfp_t gfp_mask)
> +{
> + struct cio *cio;
> + struct copy_ctx *ctx;
> + struct bio *read_bio, *write_bio;
> + struct page *token;
> + sector_t copy_len;
> + sector_t rem, max_copy_len;
> +
> + cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
> + if (!cio)
> + return -ENOMEM;
> + atomic_set(&cio->refcount, 0);
> + cio->waiter = current;
> + cio->endio = endio;
> + cio->private = private;
> +
> + max_copy_len = min(bdev_max_copy_sectors(bdev_in),
> + bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;
> +
> + cio->pos_in = pos_in;
> + cio->pos_out = pos_out;
> + /* If there is a error, comp_len will be set to least successfully
> + * completed copied length
> + */
> + cio->comp_len = len;
> + for (rem = len; rem > 0; rem -= copy_len) {
> + copy_len = min(rem, max_copy_len);
> +
> + token = alloc_page(gfp_mask);
> + if (unlikely(!token))
> + goto err_token;
> +
> + ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
> + if (!ctx)
> + goto err_ctx;
> + read_bio = bio_alloc(bdev_in, 1, REQ_OP_READ | REQ_COPY
> + | REQ_SYNC | REQ_NOMERGE, gfp_mask);
> + if (!read_bio)
> + goto err_read_bio;
> + write_bio = bio_alloc(bdev_out, 1, REQ_OP_WRITE
> + | REQ_COPY | REQ_SYNC | REQ_NOMERGE, gfp_mask);
> + if (!write_bio)
> + goto err_write_bio;
> +
> + ctx->cio = cio;
> + ctx->write_bio = write_bio;
> + INIT_WORK(&ctx->dispatch_work, blkdev_copy_dispatch_work);
> +
> + __bio_add_page(read_bio, token, PAGE_SIZE, 0);
> + read_bio->bi_iter.bi_size = copy_len;
> + read_bio->bi_iter.bi_sector = pos_in >> SECTOR_SHIFT;
> + read_bio->bi_end_io = blkdev_copy_offload_read_endio;
> + read_bio->bi_private = ctx;
> +
> + __bio_add_page(write_bio, token, PAGE_SIZE, 0);
> + write_bio->bi_iter.bi_size = copy_len;
> + write_bio->bi_end_io = blkdev_copy_offload_write_endio;
> + write_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
> + write_bio->bi_private = ctx;
> +
> + atomic_inc(&cio->refcount);
> + submit_bio(read_bio);
> + pos_in += copy_len;
> + pos_out += copy_len;
> + }
> +
> + /* Wait for completion of all IO's*/
> + return blkdev_copy_wait_completion(cio);
> +
> +err_write_bio:
> + bio_put(read_bio);
> +err_read_bio:
> + kfree(ctx);
> +err_ctx:
> + __free_page(token);
> +err_token:
> + cio->comp_len = min_t(sector_t, cio->comp_len, (len - rem));
> + if (!atomic_read(&cio->refcount))
> + return -ENOMEM;
> + /* Wait for submitted IOs to complete */
> + return blkdev_copy_wait_completion(cio);
> +}
> +
> +static inline int blkdev_copy_sanity_check(struct block_device *bdev_in,
> + loff_t pos_in, struct block_device *bdev_out, loff_t pos_out,
> + size_t len)
> +{
> + unsigned int align = max(bdev_logical_block_size(bdev_out),
> + bdev_logical_block_size(bdev_in)) - 1;
> +
> + if (bdev_read_only(bdev_out))
> + return -EPERM;
> +
> + if ((pos_in & align) || (pos_out & align) || (len & align) || !len ||
> + len >= COPY_MAX_BYTES)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +/*
> + * @bdev_in: source block device
> + * @pos_in: source offset
> + * @bdev_out: destination block device
> + * @pos_out: destination offset
> + * @len: length in bytes to be copied
> + * @endio: endio function to be called on completion of copy operation,
> + * for synchronous operation this should be NULL
> + * @private: endio function will be called with this private data, should be
> + * NULL, if operation is synchronous in nature
> + * @gfp_mask: memory allocation flags (for bio_alloc)
> + *
> + * Returns the length of bytes copied or error if encountered
> + *
> + * Description:
> + * Copy source offset from source block device to destination block
> + * device. Max total length of copy is limited to MAX_COPY_TOTAL_LENGTH
> + */
> +int blkdev_issue_copy(struct block_device *bdev_in, loff_t pos_in,
I'd have thought you'd return ssize_t here. If the two block devices
are loopmounted xfs files, we can certainly reflink "copy" more than 2GB
at a time.
--D
> + struct block_device *bdev_out, loff_t pos_out, size_t len,
> + cio_iodone_t endio, void *private, gfp_t gfp_mask)
> +{
> + struct request_queue *q_in = bdev_get_queue(bdev_in);
> + struct request_queue *q_out = bdev_get_queue(bdev_out);
> + int ret;
> +
> + ret = blkdev_copy_sanity_check(bdev_in, pos_in, bdev_out, pos_out, len);
> + if (ret)
> + return ret;
> +
> + if (blk_queue_copy(q_in) && blk_queue_copy(q_out))
> + ret = __blkdev_copy_offload(bdev_in, pos_in, bdev_out, pos_out,
> + len, endio, private, gfp_mask);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(blkdev_issue_copy);
> +
> static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
> struct bio **biop, unsigned flags)
> diff --git a/block/blk.h b/block/blk.h
> index 45547bcf1119..ec48a237fe12 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -303,6 +303,8 @@ static inline bool bio_may_exceed_limits(struct bio *bio,
> break;
> }
>
> + if (unlikely(op_is_copy(bio->bi_opf)))
> + return false;
> /*
> * All drivers must accept single-segments bios that are <= PAGE_SIZE.
> * This is a quick and dirty check that relies on the fact that
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 740afe80f297..0117e33087e1 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -420,6 +420,7 @@ enum req_flag_bits {
> */
> /* for REQ_OP_WRITE_ZEROES: */
> __REQ_NOUNMAP, /* do not free blocks when zeroing */
> + __REQ_COPY, /* copy request */
>
> __REQ_NR_BITS, /* stops here */
> };
> @@ -444,6 +445,7 @@ enum req_flag_bits {
> #define REQ_POLLED (__force blk_opf_t)(1ULL << __REQ_POLLED)
> #define REQ_ALLOC_CACHE (__force blk_opf_t)(1ULL << __REQ_ALLOC_CACHE)
> #define REQ_SWAP (__force blk_opf_t)(1ULL << __REQ_SWAP)
> +#define REQ_COPY ((__force blk_opf_t)(1ULL << __REQ_COPY))
> #define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV)
> #define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
>
> @@ -474,6 +476,11 @@ static inline bool op_is_write(blk_opf_t op)
> return !!(op & (__force blk_opf_t)1);
> }
>
> +static inline bool op_is_copy(blk_opf_t op)
> +{
> + return op & REQ_COPY;
> +}
> +
> /*
> * Check if the bio or request is one that needs special treatment in the
> * flush state machine.
> @@ -533,4 +540,22 @@ struct blk_rq_stat {
> u64 batch;
> };
>
> +typedef void (cio_iodone_t)(void *private, int comp_len);
> +
> +struct cio {
> + struct task_struct *waiter; /* waiting task (NULL if none) */
> + atomic_t refcount;
> + loff_t pos_in;
> + loff_t pos_out;
> + size_t comp_len;
> + cio_iodone_t *endio; /* applicable for async operation */
> + void *private; /* applicable for async operation */
> +};
> +
> +struct copy_ctx {
> + struct cio *cio;
> + struct work_struct dispatch_work;
> + struct bio *write_bio;
> +};
> +
> #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c9bf11adccb3..6f2814ab4741 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1051,6 +1051,9 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> sector_t nr_sects, gfp_t gfp_mask, struct bio **biop);
> int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
> sector_t nr_sects, gfp_t gfp);
> +int blkdev_issue_copy(struct block_device *bdev_in, loff_t pos_in,
> + struct block_device *bdev_out, loff_t pos_out, size_t len,
> + cio_iodone_t end_io, void *private, gfp_t gfp_mask);
>
> #define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
> #define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
> --
> 2.35.1.500.gb896f729e2
>
> --
> dm-devel mailing list
> [email protected]
> https://listman.redhat.com/mailman/listinfo/dm-devel
>
> > +/*
> > + * @bdev_in: source block device
> > + * @pos_in: source offset
> > + * @bdev_out: destination block device
> > + * @pos_out: destination offset
> > + * @len: length in bytes to be copied
> > + * @endio: endio function to be called on completion of copy operation,
> > + * for synchronous operation this should be NULL
> > + * @private: endio function will be called with this private data, should be
> > + * NULL, if operation is synchronous in nature
> > + * @gfp_mask: memory allocation flags (for bio_alloc)
> > + *
> > + * Returns the length of bytes copied or error if encountered
> > + *
> > + * Description:
> > + * Copy source offset from source block device to destination block
> > + * device. Max total length of copy is limited to MAX_COPY_TOTAL_LENGTH
> > + */
> > +int blkdev_issue_copy(struct block_device *bdev_in, loff_t pos_in,
>
> I'd have thought you'd return ssize_t here. If the two block devices
> are loopmounted xfs files, we can certainly reflink "copy" more than 2GB
> at a time.
>
> --D
>
Sure we will add this to make API future proof, but at present we do have
a limit for copy. COPY_MAX_BYTES(=128MB) at present. This limit is based
on our internal testing, we have plans to increase/remove with this
limit in future phases.
Thank you,
Nitesh Shetty
On Mon, May 22, 2023 at 04:11:33PM +0530, Nitesh Shetty wrote:
> + token = alloc_page(gfp_mask);
Why is PAGE_SIZE the right size for 'token'? That seems quite unlikely.
I could understand it being SECTOR_SIZE or something that's dependent on
the device, but I cannot fathom it being dependent on the host' page size.
po 22. 5. 2023 v 13:17 odesĂlatel Nitesh Shetty <[email protected]> napsal:
>
> +static int __blkdev_copy_offload(struct block_device *bdev_in, loff_t pos_in,
> + struct block_device *bdev_out, loff_t pos_out, size_t len,
> + cio_iodone_t endio, void *private, gfp_t gfp_mask)
> +{
> + struct cio *cio;
> + struct copy_ctx *ctx;
> + struct bio *read_bio, *write_bio;
> + struct page *token;
> + sector_t copy_len;
> + sector_t rem, max_copy_len;
> +
> + cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
> + if (!cio)
> + return -ENOMEM;
> + atomic_set(&cio->refcount, 0);
> + cio->waiter = current;
> + cio->endio = endio;
> + cio->private = private;
> +
> + max_copy_len = min(bdev_max_copy_sectors(bdev_in),
> + bdev_max_copy_sectors(bdev_out)) << SECTOR_SHIFT;
> +
> + cio->pos_in = pos_in;
> + cio->pos_out = pos_out;
> + /* If there is a error, comp_len will be set to least successfully
> + * completed copied length
> + */
> + cio->comp_len = len;
> + for (rem = len; rem > 0; rem -= copy_len) {
> + copy_len = min(rem, max_copy_len);
> +
> + token = alloc_page(gfp_mask);
> + if (unlikely(!token))
> + goto err_token;
[...]
> +err_token:
> + cio->comp_len = min_t(sector_t, cio->comp_len, (len - rem));
> + if (!atomic_read(&cio->refcount))
> + return -ENOMEM;
> + /* Wait for submitted IOs to complete */
> + return blkdev_copy_wait_completion(cio);
> +}
Suppose the first call to "token = alloc_page()" fails (and
cio->refcount == 0), isn't "cio" going to be leaked here?
Maurizio