The patch series covers the points discussed in the November 2021 virtual
call [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
We have covered the initially agreed requirements in this patchset.
The patchset borrows Mikulas's token-based approach for the two-bdev
implementation.
Overall, the series supports:
1. Driver
- NVMe Copy command (single NS), including support in nvme-target (for
block and file backends)
2. Block layer
- Block-generic copy (REQ_COPY flag), with interface accommodating
two block-devs, and multi-source/destination interface
- Emulation, when offload is natively absent
- dm-linear support (for cases not requiring split)
3. User-interface
- new ioctl
4. In-kernel user
- dm-kcopyd
[0] https://lore.kernel.org/linux-nvme/CA+1E3rJ7BZ7LjQXXTdX+-0Edz=zT14mmPGMiVCzUgB33C60tbQ@mail.gmail.com/
Changes in v3:
- fixed possible race condition reported by Damien Le Moal
- new sysfs controls as suggested by Damien Le Moal
- fixed possible memory leak reported by Dan Carpenter, lkp
- minor fixes
Arnav Dawn (1):
nvmet: add copy command support for bdev and file ns
Nitesh Shetty (6):
block: Introduce queue limits for copy-offload support
block: Add copy offload support infrastructure
block: Introduce a new ioctl for copy
block: add emulation for copy
dm: Add support for copy offload.
dm: Enable copy offload for dm-linear target
SelvaKumar S (3):
block: make bio_map_kern() non static
nvme: add copy offload support
dm kcopyd: use copy offload support
block/blk-lib.c | 346 ++++++++++++++++++++++++++++++
block/blk-map.c | 2 +-
block/blk-settings.c | 59 +++++
block/blk-sysfs.c | 138 ++++++++++++
block/blk.h | 2 +
block/ioctl.c | 32 +++
drivers/md/dm-kcopyd.c | 55 ++++-
drivers/md/dm-linear.c | 1 +
drivers/md/dm-table.c | 45 ++++
drivers/md/dm.c | 6 +
drivers/nvme/host/core.c | 119 +++++++++-
drivers/nvme/host/fc.c | 4 +
drivers/nvme/host/nvme.h | 7 +
drivers/nvme/host/pci.c | 9 +
drivers/nvme/host/rdma.c | 6 +
drivers/nvme/host/tcp.c | 8 +
drivers/nvme/host/trace.c | 19 ++
drivers/nvme/target/admin-cmd.c | 8 +-
drivers/nvme/target/io-cmd-bdev.c | 65 ++++++
drivers/nvme/target/io-cmd-file.c | 48 +++++
include/linux/blk_types.h | 21 ++
include/linux/blkdev.h | 17 ++
include/linux/device-mapper.h | 5 +
include/linux/nvme.h | 43 +++-
include/uapi/linux/fs.h | 23 ++
25 files changed, 1074 insertions(+), 14 deletions(-)
base-commit: 23a3fe5e6bb58304e662c604b86bc1264453e888
--
2.30.0-rc0
Add device limits as sysfs entries:
- copy_offload (RW)
- copy_max_bytes (RW)
- copy_max_hw_bytes (RO)
- copy_max_range_bytes (RW)
- copy_max_range_hw_bytes (RO)
- copy_max_nr_ranges (RW)
- copy_max_nr_ranges_hw (RO)
The above limits help split the copy payload in the block layer.
copy_offload: selects copy offload (1) or emulation (0).
copy_max_bytes: maximum total length of a copy in a single payload.
copy_max_range_bytes: maximum length of a single range entry.
copy_max_nr_ranges: maximum number of range entries in a payload.
copy_max_*_hw_*: reflect the device-supported maximum limits (read-only).
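For illustration only (not part of this patch), a driver that knows its
device's copy limits would be expected to wire them up roughly as below.
The function name and the 1 MiB / 256 KiB / 8 values are made up, not real
hardware data:

    /* hypothetical driver probe path: advertise copy offload limits */
    static void foo_setup_copy_limits(struct request_queue *q)
    {
            blk_queue_max_copy_sectors(q, 1 << 11);       /* 1 MiB per payload */
            blk_queue_max_copy_range_sectors(q, 1 << 9);  /* 256 KiB per range */
            blk_queue_max_copy_nr_ranges(q, 8);           /* up to 8 ranges */
            blk_queue_flag_set(QUEUE_FLAG_COPY, q);       /* enable offload */
    }

These helpers set both the hw and the effective limits; the effective
(non-hw) values can then be trimmed from user space via the sysfs entries
above.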
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
---
block/blk-settings.c | 59 ++++++++++++++++++
block/blk-sysfs.c | 138 +++++++++++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 13 ++++
3 files changed, 210 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index b880c70e22e4..4baccc93a294 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -57,6 +57,12 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->misaligned = 0;
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
+ lim->max_hw_copy_sectors = 0;
+ lim->max_copy_sectors = 0;
+ lim->max_hw_copy_nr_ranges = 0;
+ lim->max_copy_nr_ranges = 0;
+ lim->max_hw_copy_range_sectors = 0;
+ lim->max_copy_range_sectors = 0;
}
EXPORT_SYMBOL(blk_set_default_limits);
@@ -82,6 +88,12 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_write_same_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+ lim->max_hw_copy_sectors = ULONG_MAX;
+ lim->max_copy_sectors = ULONG_MAX;
+ lim->max_hw_copy_range_sectors = UINT_MAX;
+ lim->max_copy_range_sectors = UINT_MAX;
+ lim->max_hw_copy_nr_ranges = USHRT_MAX;
+ lim->max_copy_nr_ranges = USHRT_MAX;
}
EXPORT_SYMBOL(blk_set_stacking_limits);
@@ -178,6 +190,45 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
}
EXPORT_SYMBOL(blk_queue_max_discard_sectors);
+/**
+ * blk_queue_max_copy_sectors - set max sectors for a single copy payload
+ * @q: the request queue for the device
+ * @max_copy_sectors: maximum number of sectors to copy
+ **/
+void blk_queue_max_copy_sectors(struct request_queue *q,
+ unsigned int max_copy_sectors)
+{
+ q->limits.max_hw_copy_sectors = max_copy_sectors;
+ q->limits.max_copy_sectors = max_copy_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_copy_sectors);
+
+/**
+ * blk_queue_max_copy_range_sectors - set max sectors for a single range, in a copy payload
+ * @q: the request queue for the device
+ * @max_copy_range_sectors: maximum number of sectors to copy in a single range
+ **/
+void blk_queue_max_copy_range_sectors(struct request_queue *q,
+ unsigned int max_copy_range_sectors)
+{
+ q->limits.max_hw_copy_range_sectors = max_copy_range_sectors;
+ q->limits.max_copy_range_sectors = max_copy_range_sectors;
+}
+EXPORT_SYMBOL(blk_queue_max_copy_range_sectors);
+
+/**
+ * blk_queue_max_copy_nr_ranges - set max number of ranges, in a copy payload
+ * @q: the request queue for the device
+ * @max_copy_nr_ranges: maximum number of ranges
+ **/
+void blk_queue_max_copy_nr_ranges(struct request_queue *q,
+ unsigned int max_copy_nr_ranges)
+{
+ q->limits.max_hw_copy_nr_ranges = max_copy_nr_ranges;
+ q->limits.max_copy_nr_ranges = max_copy_nr_ranges;
+}
+EXPORT_SYMBOL(blk_queue_max_copy_nr_ranges);
+
/**
* blk_queue_max_write_same_sectors - set max sectors for a single write same
* @q: the request queue for the device
@@ -541,6 +592,14 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
b->max_segment_size);
+ t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
+ t->max_hw_copy_sectors = min(t->max_hw_copy_sectors, b->max_hw_copy_sectors);
+ t->max_copy_range_sectors = min(t->max_copy_range_sectors, b->max_copy_range_sectors);
+ t->max_hw_copy_range_sectors = min(t->max_hw_copy_range_sectors,
+ b->max_hw_copy_range_sectors);
+ t->max_copy_nr_ranges = min(t->max_copy_nr_ranges, b->max_copy_nr_ranges);
+ t->max_hw_copy_nr_ranges = min(t->max_hw_copy_nr_ranges, b->max_hw_copy_nr_ranges);
+
t->misaligned |= b->misaligned;
alignment = queue_limit_alignment_offset(b, start);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9f32882ceb2f..9ddd07f142d9 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -212,6 +212,129 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
return queue_var_show(0, page);
}
+static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(blk_queue_copy(q), page);
+}
+
+static ssize_t queue_copy_offload_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long copy_offload;
+ ssize_t ret = queue_var_store(&copy_offload, page, count);
+
+ if (ret < 0)
+ return ret;
+
+ if (copy_offload && !q->limits.max_hw_copy_sectors)
+ return -EINVAL;
+
+ if (copy_offload)
+ blk_queue_flag_set(QUEUE_FLAG_COPY, q);
+ else
+ blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+
+ return ret;
+}
+
+static ssize_t queue_copy_max_hw_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%llu\n",
+ (unsigned long long)q->limits.max_hw_copy_sectors << 9);
+}
+
+static ssize_t queue_copy_max_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%llu\n",
+ (unsigned long long)q->limits.max_copy_sectors << 9);
+}
+
+static ssize_t queue_copy_max_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long max_copy;
+ ssize_t ret = queue_var_store(&max_copy, page, count);
+
+ if (ret < 0)
+ return ret;
+
+ if (max_copy & (queue_logical_block_size(q) - 1))
+ return -EINVAL;
+
+ max_copy >>= 9;
+ if (max_copy > q->limits.max_hw_copy_sectors)
+ max_copy = q->limits.max_hw_copy_sectors;
+
+ q->limits.max_copy_sectors = max_copy;
+ return ret;
+}
+
+static ssize_t queue_copy_range_max_hw_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%llu\n",
+ (unsigned long long)q->limits.max_hw_copy_range_sectors << 9);
+}
+
+static ssize_t queue_copy_range_max_show(struct request_queue *q,
+ char *page)
+{
+ return sprintf(page, "%llu\n",
+ (unsigned long long)q->limits.max_copy_range_sectors << 9);
+}
+
+static ssize_t queue_copy_range_max_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long max_copy;
+ ssize_t ret = queue_var_store(&max_copy, page, count);
+
+ if (ret < 0)
+ return ret;
+
+ if (max_copy & (queue_logical_block_size(q) - 1))
+ return -EINVAL;
+
+ max_copy >>= 9;
+ if (max_copy > UINT_MAX)
+ return -EINVAL;
+
+ if (max_copy > q->limits.max_hw_copy_range_sectors)
+ max_copy = q->limits.max_hw_copy_range_sectors;
+
+ q->limits.max_copy_range_sectors = max_copy;
+ return ret;
+}
+
+static ssize_t queue_copy_nr_ranges_max_hw_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->limits.max_hw_copy_nr_ranges, page);
+}
+
+static ssize_t queue_copy_nr_ranges_max_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(q->limits.max_copy_nr_ranges, page);
+}
+
+static ssize_t queue_copy_nr_ranges_max_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long max_nr;
+ ssize_t ret = queue_var_store(&max_nr, page, count);
+
+ if (ret < 0)
+ return ret;
+
+ if (max_nr > USHRT_MAX)
+ return -EINVAL;
+
+ if (max_nr > q->limits.max_hw_copy_nr_ranges)
+ max_nr = q->limits.max_hw_copy_nr_ranges;
+
+ q->limits.max_copy_nr_ranges = max_nr;
+ return ret;
+}
+
static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
{
return sprintf(page, "%llu\n",
@@ -597,6 +720,14 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
+QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
+QUEUE_RO_ENTRY(queue_copy_max_hw, "copy_max_hw_bytes");
+QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
+QUEUE_RO_ENTRY(queue_copy_range_max_hw, "copy_max_range_hw_bytes");
+QUEUE_RW_ENTRY(queue_copy_range_max, "copy_max_range_bytes");
+QUEUE_RO_ENTRY(queue_copy_nr_ranges_max_hw, "copy_max_nr_ranges_hw");
+QUEUE_RW_ENTRY(queue_copy_nr_ranges_max, "copy_max_nr_ranges");
+
QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
QUEUE_RW_ENTRY(queue_poll, "io_poll");
@@ -643,6 +774,13 @@ static struct attribute *queue_attrs[] = {
&queue_discard_max_entry.attr,
&queue_discard_max_hw_entry.attr,
&queue_discard_zeroes_data_entry.attr,
+ &queue_copy_offload_entry.attr,
+ &queue_copy_max_hw_entry.attr,
+ &queue_copy_max_entry.attr,
+ &queue_copy_range_max_hw_entry.attr,
+ &queue_copy_range_max_entry.attr,
+ &queue_copy_nr_ranges_max_hw_entry.attr,
+ &queue_copy_nr_ranges_max_entry.attr,
&queue_write_same_max_entry.attr,
&queue_write_zeroes_max_entry.attr,
&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index efed3820cbf7..792e6d556589 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -254,6 +254,13 @@ struct queue_limits {
unsigned int discard_alignment;
unsigned int zone_write_granularity;
+ unsigned long max_hw_copy_sectors;
+ unsigned long max_copy_sectors;
+ unsigned int max_hw_copy_range_sectors;
+ unsigned int max_copy_range_sectors;
+ unsigned short max_hw_copy_nr_ranges;
+ unsigned short max_copy_nr_ranges;
+
unsigned short max_segments;
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
@@ -562,6 +569,7 @@ struct request_queue {
#define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */
#define QUEUE_FLAG_HCTX_ACTIVE 28 /* at least one blk-mq hctx is active */
#define QUEUE_FLAG_NOWAIT 29 /* device supports NOWAIT */
+#define QUEUE_FLAG_COPY 30 /* supports copy offload */
#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
@@ -585,6 +593,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
#define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
#define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
#define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
+#define blk_queue_copy(q) test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
#define blk_queue_zone_resetall(q) \
test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
#define blk_queue_secure_erase(q) \
@@ -958,6 +967,10 @@ extern void blk_queue_max_discard_segments(struct request_queue *,
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
extern void blk_queue_max_discard_sectors(struct request_queue *q,
unsigned int max_discard_sectors);
+extern void blk_queue_max_copy_sectors(struct request_queue *q, unsigned int max_copy_sectors);
+extern void blk_queue_max_copy_range_sectors(struct request_queue *q,
+ unsigned int max_copy_range_sectors);
+extern void blk_queue_max_copy_nr_ranges(struct request_queue *q, unsigned int max_copy_nr_ranges);
extern void blk_queue_max_write_same_sectors(struct request_queue *q,
unsigned int max_write_same_sectors);
extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
--
2.30.0-rc0
From: Arnav Dawn <[email protected]>
Add support for handling the copy command on the NVMe target.
For a bdev-backed namespace we call into blkdev_issue_copy, which the block
layer completes either by an offloaded copy request to the backing bdev or
by emulating the request.
For a file-backed namespace we call vfs_copy_file_range to service the
request.
Currently the target always advertises copy capability by setting
NVME_CTRL_ONCS_COPY in the controller ONCS field.
Signed-off-by: Arnav Dawn <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
drivers/nvme/target/admin-cmd.c | 8 +++-
drivers/nvme/target/io-cmd-bdev.c | 65 +++++++++++++++++++++++++++++++
drivers/nvme/target/io-cmd-file.c | 48 +++++++++++++++++++++++
3 files changed, 119 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 6fb24746de06..3577e8af8003 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -431,8 +431,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES);
id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM |
- NVME_CTRL_ONCS_WRITE_ZEROES);
-
+ NVME_CTRL_ONCS_WRITE_ZEROES | NVME_CTRL_ONCS_COPY);
/* XXX: don't report vwc if the underlying device is write through */
id->vwc = NVME_CTRL_VWC_PRESENT;
@@ -530,6 +529,11 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req)
if (req->ns->bdev)
nvmet_bdev_set_limits(req->ns->bdev, id);
+ else {
+ id->msrc = to0based(BIO_MAX_VECS);
+ id->mssrl = cpu_to_le16(BIO_MAX_VECS << (PAGE_SHIFT - SECTOR_SHIFT));
+ id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl) * BIO_MAX_VECS);
+ }
/*
* We just provide a single LBA format that matches what the
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 95c2bbb0b2f5..47504aec20ce 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -46,6 +46,30 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
id->npda = id->npdg;
/* NOWS = Namespace Optimal Write Size */
id->nows = to0based(ql->io_opt / ql->logical_block_size);
+
+ /*Copy limits*/
+ if (ql->max_copy_sectors) {
+ id->mcl = cpu_to_le32((ql->max_copy_sectors << 9) / ql->logical_block_size);
+ id->mssrl = cpu_to_le16((ql->max_copy_range_sectors << 9) /
+ ql->logical_block_size);
+ id->msrc = to0based(ql->max_copy_nr_ranges);
+ } else {
+ if (ql->zoned == BLK_ZONED_NONE) {
+ id->msrc = to0based(BIO_MAX_VECS);
+ id->mssrl = cpu_to_le16(
+ (BIO_MAX_VECS << PAGE_SHIFT) / ql->logical_block_size);
+ id->mcl = cpu_to_le32(le16_to_cpu(id->mssrl) * BIO_MAX_VECS);
+#ifdef CONFIG_BLK_DEV_ZONED
+ } else {
+ /* TODO: get right values for zoned device */
+ id->msrc = to0based(BIO_MAX_VECS);
+ id->mssrl = cpu_to_le16(min((BIO_MAX_VECS << PAGE_SHIFT),
+ ql->chunk_sectors) / ql->logical_block_size);
+ id->mcl = cpu_to_le32(min(le16_to_cpu(id->mssrl) * BIO_MAX_VECS,
+ ql->chunk_sectors));
+#endif
+ }
+ }
}
void nvmet_bdev_ns_disable(struct nvmet_ns *ns)
@@ -433,6 +457,43 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
}
}
+static void nvmet_bdev_execute_copy(struct nvmet_req *req)
+{
+ struct nvme_copy_range range;
+ struct range_entry *rlist;
+ struct nvme_command *cmnd = req->cmd;
+ sector_t dest, dest_off = 0;
+ int ret, id, nr_range;
+
+ nr_range = cmnd->copy.nr_range + 1;
+ dest = le64_to_cpu(cmnd->copy.sdlba) << req->ns->blksize_shift;
+ rlist = kmalloc_array(nr_range, sizeof(*rlist), GFP_KERNEL);
+
+ for (id = 0 ; id < nr_range; id++) {
+ ret = nvmet_copy_from_sgl(req, id * sizeof(range), &range, sizeof(range));
+ if (ret)
+ goto out;
+
+ rlist[id].dst = dest + dest_off;
+ rlist[id].src = le64_to_cpu(range.slba) << req->ns->blksize_shift;
+ rlist[id].len = (le16_to_cpu(range.nlb) + 1) << req->ns->blksize_shift;
+ rlist[id].comp_len = 0;
+ dest_off += rlist[id].len;
+ }
+ ret = blkdev_issue_copy(req->ns->bdev, nr_range, rlist, req->ns->bdev, GFP_KERNEL);
+ if (ret) {
+ for (id = 0 ; id < nr_range; id++) {
+ if (rlist[id].len != rlist[id].comp_len) {
+ req->cqe->result.u32 = cpu_to_le32(id);
+ break;
+ }
+ }
+ }
+out:
+ kfree(rlist);
+ nvmet_req_complete(req, errno_to_nvme_status(req, ret));
+}
+
u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
{
switch (req->cmd->common.opcode) {
@@ -451,6 +512,10 @@ u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req)
case nvme_cmd_write_zeroes:
req->execute = nvmet_bdev_execute_write_zeroes;
return 0;
+ case nvme_cmd_copy:
+ req->execute = nvmet_bdev_execute_copy;
+ return 0;
+
default:
return nvmet_report_invalid_opcode(req);
}
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 6be6e59d273b..cf51169cd71d 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -347,6 +347,46 @@ static void nvmet_file_dsm_work(struct work_struct *w)
}
}
+static void nvmet_file_copy_work(struct work_struct *w)
+{
+ struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
+ int nr_range;
+ loff_t pos;
+ struct nvme_command *cmnd = req->cmd;
+ int ret = 0, len = 0, src, id;
+
+ nr_range = cmnd->copy.nr_range + 1;
+ pos = le64_to_cpu(req->cmd->copy.sdlba) << req->ns->blksize_shift;
+ if (unlikely(pos + req->transfer_len > req->ns->size)) {
+ nvmet_req_complete(req, errno_to_nvme_status(req, -ENOSPC));
+ return;
+ }
+
+ for (id = 0 ; id < nr_range; id++) {
+ struct nvme_copy_range range;
+
+ ret = nvmet_copy_from_sgl(req, id * sizeof(range), &range,
+ sizeof(range));
+ if (ret)
+ goto out;
+
+ len = (le16_to_cpu(range.nlb) + 1) << (req->ns->blksize_shift);
+ src = (le64_to_cpu(range.slba) << (req->ns->blksize_shift));
+ ret = vfs_copy_file_range(req->ns->file, src, req->ns->file, pos, len, 0);
+out:
+ if (ret != len) {
+ pos += ret;
+ req->cqe->result.u32 = cpu_to_le32(id);
+ nvmet_req_complete(req, ret < 0 ? errno_to_nvme_status(req, ret) :
+ errno_to_nvme_status(req, -EIO));
+ return;
+
+ } else
+ pos += len;
+}
+ nvmet_req_complete(req, ret);
+
+}
static void nvmet_file_execute_dsm(struct nvmet_req *req)
{
if (!nvmet_check_data_len_lte(req, nvmet_dsm_len(req)))
@@ -355,6 +395,11 @@ static void nvmet_file_execute_dsm(struct nvmet_req *req)
schedule_work(&req->f.work);
}
+static void nvmet_file_execute_copy(struct nvmet_req *req)
+{
+ INIT_WORK(&req->f.work, nvmet_file_copy_work);
+ schedule_work(&req->f.work);
+}
static void nvmet_file_write_zeroes_work(struct work_struct *w)
{
struct nvmet_req *req = container_of(w, struct nvmet_req, f.work);
@@ -401,6 +446,9 @@ u16 nvmet_file_parse_io_cmd(struct nvmet_req *req)
case nvme_cmd_write_zeroes:
req->execute = nvmet_file_execute_write_zeroes;
return 0;
+ case nvme_cmd_copy:
+ req->execute = nvmet_file_execute_copy;
+ return 0;
default:
return nvmet_report_invalid_opcode(req);
}
--
2.30.0-rc0
From: SelvaKumar S <[email protected]>
For devices supporting native copy, the nvme driver receives read and
write requests with the REQ_COPY flag set.
For the read request, the driver populates the token payload with the
source information.
For the write request, the driver converts it to an NVMe copy command
using the source information in the payload and submits it to the device.
The current design only supports a single source range.
This design is courtesy of Mikulas Patocka's token-based copy approach.
Trace event support is added for the nvme copy command.
The device copy limits are applied to the request queue limits.
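For context (not part of this patch), the block layer's offload path
(blk_copy_offload, added later in this series) sends down roughly the
following bio pair, which the two setup helpers below consume. This is a
condensed, illustrative sketch with error handling omitted and made-up
variable names:

    struct page *token = alloc_page(GFP_KERNEL);
    struct bio *rbio, *wbio;

    /* 1) read: nvme_setup_copy_read() stashes source sector/len in token */
    rbio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY | REQ_NOMERGE,
                     GFP_KERNEL);
    rbio->bi_iter.bi_sector = src_sector;       /* 512B sectors */
    rbio->bi_iter.bi_size = copy_len;           /* bytes to copy */
    __bio_add_page(rbio, token, PAGE_SIZE, 0);
    submit_bio_wait(rbio);

    /* 2) write: nvme_setup_copy_write() reads the token, emits nvme_cmd_copy */
    wbio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE | REQ_COPY | REQ_NOMERGE,
                     GFP_KERNEL);
    wbio->bi_iter.bi_sector = dst_sector;
    wbio->bi_iter.bi_size = copy_len;
    __bio_add_page(wbio, token, PAGE_SIZE, 0);
    submit_bio(wbio);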
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier González <[email protected]>
Signed-off-by: Arnav Dawn <[email protected]>
---
drivers/nvme/host/core.c | 119 +++++++++++++++++++++++++++++++++++++-
drivers/nvme/host/fc.c | 4 ++
drivers/nvme/host/nvme.h | 7 +++
drivers/nvme/host/pci.c | 9 +++
drivers/nvme/host/rdma.c | 6 ++
drivers/nvme/host/tcp.c | 8 +++
drivers/nvme/host/trace.c | 19 ++++++
include/linux/nvme.h | 43 +++++++++++++-
8 files changed, 210 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 961a5f8a44d2..731a091f4bc3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -821,6 +821,90 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
}
+static inline blk_status_t nvme_setup_copy_read(struct nvme_ns *ns, struct request *req)
+{
+ struct bio *bio = req->bio;
+ struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+
+ memcpy(token->subsys, "nvme", 4);
+ token->ns = ns;
+ token->src_sector = bio->bi_iter.bi_sector;
+ token->sectors = bio->bi_iter.bi_size >> 9;
+
+ return 0;
+}
+
+static inline blk_status_t nvme_setup_copy_write(struct nvme_ns *ns,
+ struct request *req, struct nvme_command *cmnd)
+{
+ struct nvme_ctrl *ctrl = ns->ctrl;
+ struct nvme_copy_range *range = NULL;
+ struct bio *bio = req->bio;
+ struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);
+ sector_t src_sector, dst_sector, n_sectors;
+ u64 src_lba, dst_lba, n_lba;
+ unsigned short nr_range = 1;
+ u16 control = 0;
+ u32 dsmgmt = 0;
+
+ if (unlikely(memcmp(token->subsys, "nvme", 4)))
+ return BLK_STS_NOTSUPP;
+ if (unlikely(token->ns != ns))
+ return BLK_STS_NOTSUPP;
+
+ src_sector = token->src_sector;
+ dst_sector = bio->bi_iter.bi_sector;
+ n_sectors = token->sectors;
+ if (WARN_ON(n_sectors != bio->bi_iter.bi_size >> 9))
+ return BLK_STS_NOTSUPP;
+
+ src_lba = nvme_sect_to_lba(ns, src_sector);
+ dst_lba = nvme_sect_to_lba(ns, dst_sector);
+ n_lba = nvme_sect_to_lba(ns, n_sectors);
+
+ if (unlikely(nvme_lba_to_sect(ns, src_lba) != src_sector) ||
+ unlikely(nvme_lba_to_sect(ns, dst_lba) != dst_sector) ||
+ unlikely(nvme_lba_to_sect(ns, n_lba) != n_sectors))
+ return BLK_STS_NOTSUPP;
+
+ if (WARN_ON(!n_lba))
+ return BLK_STS_NOTSUPP;
+
+ if (req->cmd_flags & REQ_FUA)
+ control |= NVME_RW_FUA;
+
+ if (req->cmd_flags & REQ_FAILFAST_DEV)
+ control |= NVME_RW_LR;
+
+ memset(cmnd, 0, sizeof(*cmnd));
+ cmnd->copy.opcode = nvme_cmd_copy;
+ cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
+ cmnd->copy.sdlba = cpu_to_le64(blk_rq_pos(req) >> (ns->lba_shift - 9));
+
+ range = kmalloc_array(nr_range, sizeof(*range),
+ GFP_ATOMIC | __GFP_NOWARN);
+ if (!range)
+ return BLK_STS_RESOURCE;
+
+ range[0].slba = cpu_to_le64(src_lba);
+ range[0].nlb = cpu_to_le16(n_lba - 1);
+
+ cmnd->copy.nr_range = 0;
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * nr_range;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ if (ctrl->nr_streams)
+ nvme_assign_write_stream(ctrl, req, &control, &dsmgmt);
+
+ cmnd->copy.control = cpu_to_le16(control);
+ cmnd->copy.dspec = cpu_to_le32(dsmgmt);
+
+ return BLK_STS_OK;
+}
+
static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd)
{
@@ -1024,10 +1108,16 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
ret = nvme_setup_discard(ns, req, cmd);
break;
case REQ_OP_READ:
- ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
+ if (unlikely(req->cmd_flags & REQ_COPY))
+ ret = nvme_setup_copy_read(ns, req);
+ else
+ ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
break;
case REQ_OP_WRITE:
- ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+ if (unlikely(req->cmd_flags & REQ_COPY))
+ ret = nvme_setup_copy_write(ns, req, cmd);
+ else
+ ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
break;
case REQ_OP_ZONE_APPEND:
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
@@ -1682,6 +1772,29 @@ static void nvme_config_discard(struct gendisk *disk, struct nvme_ns *ns)
blk_queue_max_write_zeroes_sectors(queue, UINT_MAX);
}
+static void nvme_config_copy(struct gendisk *disk, struct nvme_ns *ns,
+ struct nvme_id_ns *id)
+{
+ struct nvme_ctrl *ctrl = ns->ctrl;
+ struct request_queue *q = disk->queue;
+
+ if (!(ctrl->oncs & NVME_CTRL_ONCS_COPY)) {
+ blk_queue_max_copy_sectors(q, 0);
+ blk_queue_max_copy_range_sectors(q, 0);
+ blk_queue_max_copy_nr_ranges(q, 0);
+ blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
+ return;
+ }
+
+ /* setting copy limits */
+ if (blk_queue_flag_test_and_set(QUEUE_FLAG_COPY, q))
+ return;
+
+ blk_queue_max_copy_sectors(q, nvme_lba_to_sect(ns, le32_to_cpu(id->mcl)));
+ blk_queue_max_copy_range_sectors(q, nvme_lba_to_sect(ns, le16_to_cpu(id->mssrl)));
+ blk_queue_max_copy_nr_ranges(q, id->msrc + 1);
+}
+
static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
{
return !uuid_is_null(&ids->uuid) ||
@@ -1864,6 +1977,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
nvme_config_discard(disk, ns);
blk_queue_max_write_zeroes_sectors(disk->queue,
ns->ctrl->max_zeroes_sectors);
+ nvme_config_copy(disk, ns, id);
set_disk_ro(disk, (id->nsattr & NVME_NS_ATTR_RO) ||
test_bit(NVME_NS_FORCE_RO, &ns->flags));
@@ -4728,6 +4842,7 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_download_firmware) != 64);
BUILD_BUG_ON(sizeof(struct nvme_format_cmd) != 64);
BUILD_BUG_ON(sizeof(struct nvme_dsm_cmd) != 64);
+ BUILD_BUG_ON(sizeof(struct nvme_copy_command) != 64);
BUILD_BUG_ON(sizeof(struct nvme_write_zeroes_cmd) != 64);
BUILD_BUG_ON(sizeof(struct nvme_abort_cmd) != 64);
BUILD_BUG_ON(sizeof(struct nvme_get_log_page_command) != 64);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 71b3108c22f0..5057ab1a1875 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2788,6 +2788,10 @@ nvme_fc_queue_rq(struct blk_mq_hw_ctx *hctx,
if (ret)
return ret;
+ if (unlikely((rq->cmd_flags & REQ_COPY) && (req_op(rq) == REQ_OP_READ))) {
+ blk_mq_end_request(rq, BLK_STS_OK);
+ return BLK_STS_OK;
+ }
/*
* nvme core doesn't quite treat the rq opaquely. Commands such
* as WRITE ZEROES will return a non-zero rq payload_bytes yet
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a162f6c6da6e..117658a8cf5f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -474,6 +474,13 @@ struct nvme_ns {
};
+struct nvme_copy_token {
+ char subsys[4];
+ struct nvme_ns *ns;
+ u64 src_sector;
+ u64 sectors;
+};
+
/* NVMe ns supports metadata actions by the controller (generate/strip) */
static inline bool nvme_ns_has_pi(struct nvme_ns *ns)
{
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a99ed680915..a7b0f129a19d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -916,6 +916,11 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
if (ret)
return ret;
+ if (unlikely((req->cmd_flags & REQ_COPY) && (req_op(req) == REQ_OP_READ))) {
+ blk_mq_end_request(req, BLK_STS_OK);
+ return BLK_STS_OK;
+ }
+
if (blk_rq_nr_phys_segments(req)) {
ret = nvme_map_data(dev, req, &iod->cmd);
if (ret)
@@ -929,6 +934,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
}
blk_mq_start_request(req);
+
return BLK_STS_OK;
out_unmap_data:
nvme_unmap_data(dev, req);
@@ -962,6 +968,9 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
ret = nvme_prep_rq(dev, req);
if (unlikely(ret))
return ret;
+ if (unlikely((req->cmd_flags & REQ_COPY) && (req_op(req) == REQ_OP_READ)))
+ return ret;
+
spin_lock(&nvmeq->sq_lock);
nvme_sq_copy_cmd(nvmeq, &iod->cmd);
nvme_write_sq_db(nvmeq, bd->last);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 9c55e4be8a39..060abf310fb8 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2070,6 +2070,12 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
if (ret)
goto unmap_qe;
+ if (unlikely((rq->cmd_flags & REQ_COPY) && (req_op(rq) == REQ_OP_READ))) {
+ blk_mq_end_request(rq, BLK_STS_OK);
+ ret = BLK_STS_OK;
+ goto unmap_qe;
+ }
+
blk_mq_start_request(rq);
if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 01e24b5703db..c36b727384a8 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2315,6 +2315,11 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
if (ret)
return ret;
+ if (unlikely((rq->cmd_flags & REQ_COPY) && (req_op(rq) == REQ_OP_READ))) {
+ blk_mq_end_request(rq, BLK_STS_OK);
+ return BLK_STS_OK;
+ }
+
req->state = NVME_TCP_SEND_CMD_PDU;
req->status = cpu_to_le16(NVME_SC_SUCCESS);
req->offset = 0;
@@ -2380,6 +2385,9 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
if (unlikely(ret))
return ret;
+ if (unlikely((rq->cmd_flags & REQ_COPY) && (req_op(rq) == REQ_OP_READ)))
+ return ret;
+
blk_mq_start_request(rq);
nvme_tcp_queue_request(req, true, bd->last);
diff --git a/drivers/nvme/host/trace.c b/drivers/nvme/host/trace.c
index 2a89c5aa0790..ab72bf546a13 100644
--- a/drivers/nvme/host/trace.c
+++ b/drivers/nvme/host/trace.c
@@ -150,6 +150,23 @@ static const char *nvme_trace_read_write(struct trace_seq *p, u8 *cdw10)
return ret;
}
+static const char *nvme_trace_copy(struct trace_seq *p, u8 *cdw10)
+{
+ const char *ret = trace_seq_buffer_ptr(p);
+ u64 slba = get_unaligned_le64(cdw10);
+ u8 nr_range = get_unaligned_le16(cdw10 + 8);
+ u16 control = get_unaligned_le16(cdw10 + 10);
+ u32 dsmgmt = get_unaligned_le32(cdw10 + 12);
+ u32 reftag = get_unaligned_le32(cdw10 + 16);
+
+ trace_seq_printf(p,
+ "slba=%llu, nr_range=%u, ctrl=0x%x, dsmgmt=%u, reftag=%u",
+ slba, nr_range, control, dsmgmt, reftag);
+ trace_seq_putc(p, 0);
+
+ return ret;
+}
+
static const char *nvme_trace_dsm(struct trace_seq *p, u8 *cdw10)
{
const char *ret = trace_seq_buffer_ptr(p);
@@ -243,6 +260,8 @@ const char *nvme_trace_parse_nvm_cmd(struct trace_seq *p,
return nvme_trace_zone_mgmt_send(p, cdw10);
case nvme_cmd_zone_mgmt_recv:
return nvme_trace_zone_mgmt_recv(p, cdw10);
+ case nvme_cmd_copy:
+ return nvme_trace_copy(p, cdw10);
default:
return nvme_trace_common(p, cdw10);
}
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 855dd9b3e84b..7ed966058f4c 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -309,7 +309,7 @@ struct nvme_id_ctrl {
__u8 nvscc;
__u8 nwpc;
__le16 acwu;
- __u8 rsvd534[2];
+ __le16 ocfs;
__le32 sgls;
__le32 mnan;
__u8 rsvd544[224];
@@ -335,6 +335,7 @@ enum {
NVME_CTRL_ONCS_WRITE_ZEROES = 1 << 3,
NVME_CTRL_ONCS_RESERVATIONS = 1 << 5,
NVME_CTRL_ONCS_TIMESTAMP = 1 << 6,
+ NVME_CTRL_ONCS_COPY = 1 << 8,
NVME_CTRL_VWC_PRESENT = 1 << 0,
NVME_CTRL_OACS_SEC_SUPP = 1 << 0,
NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
@@ -383,7 +384,10 @@ struct nvme_id_ns {
__le16 npdg;
__le16 npda;
__le16 nows;
- __u8 rsvd74[18];
+ __le16 mssrl;
+ __le32 mcl;
+ __u8 msrc;
+ __u8 rsvd91[11];
__le32 anagrpid;
__u8 rsvd96[3];
__u8 nsattr;
@@ -704,6 +708,7 @@ enum nvme_opcode {
nvme_cmd_resv_report = 0x0e,
nvme_cmd_resv_acquire = 0x11,
nvme_cmd_resv_release = 0x15,
+ nvme_cmd_copy = 0x19,
nvme_cmd_zone_mgmt_send = 0x79,
nvme_cmd_zone_mgmt_recv = 0x7a,
nvme_cmd_zone_append = 0x7d,
@@ -725,7 +730,8 @@ enum nvme_opcode {
nvme_opcode_name(nvme_cmd_resv_release), \
nvme_opcode_name(nvme_cmd_zone_mgmt_send), \
nvme_opcode_name(nvme_cmd_zone_mgmt_recv), \
- nvme_opcode_name(nvme_cmd_zone_append))
+ nvme_opcode_name(nvme_cmd_zone_append), \
+ nvme_opcode_name(nvme_cmd_copy))
@@ -898,6 +904,36 @@ struct nvme_dsm_range {
__le64 slba;
};
+struct nvme_copy_command {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2;
+ __le64 metadata;
+ union nvme_data_ptr dptr;
+ __le64 sdlba;
+ __u8 nr_range;
+ __u8 rsvd12;
+ __le16 control;
+ __le16 rsvd13;
+ __le16 dspec;
+ __le32 ilbrt;
+ __le16 lbat;
+ __le16 lbatm;
+};
+
+struct nvme_copy_range {
+ __le64 rsvd0;
+ __le64 slba;
+ __le16 nlb;
+ __le16 rsvd18;
+ __le32 rsvd20;
+ __le32 eilbrt;
+ __le16 elbat;
+ __le16 elbatm;
+};
+
struct nvme_write_zeroes_cmd {
__u8 opcode;
__u8 flags;
@@ -1449,6 +1485,7 @@ struct nvme_command {
struct nvme_download_firmware dlfw;
struct nvme_format_cmd format;
struct nvme_dsm_cmd dsm;
+ struct nvme_copy_command copy;
struct nvme_write_zeroes_cmd write_zeroes;
struct nvme_zone_mgmt_send_cmd zms;
struct nvme_zone_mgmt_recv_cmd zmr;
--
2.30.0-rc0
From: SelvaKumar S <[email protected]>
Make bio_map_kern() non-static and declare it in blkdev.h, so that the
copy offload/emulation code can use it.
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
block/blk-map.c | 2 +-
include/linux/blkdev.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index 4526adde0156..c110205df0d5 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -336,7 +336,7 @@ static void bio_map_kern_endio(struct bio *bio)
* Map the kernel address into a bio suitable for io to a block
* device. Returns an error pointer in case of error.
*/
-static struct bio *bio_map_kern(struct request_queue *q, void *data,
+struct bio *bio_map_kern(struct request_queue *q, void *data,
unsigned int len, gfp_t gfp_mask)
{
unsigned long kaddr = (unsigned long)data;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3bfc75a2a450..efed3820cbf7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1106,6 +1106,8 @@ extern int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask, int flags,
struct bio **biop);
+struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
+ gfp_t gfp_mask);
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
--
2.30.0-rc0
For devices which do not support copy offload, copy emulation is added.
Copy emulation is implemented by reading each source range into memory and
writing it to the corresponding destination synchronously.
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Vincent Fu <[email protected]>
---
block/blk-lib.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 119 insertions(+)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index efa7e2a5fab7..59c770814e72 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -286,6 +286,65 @@ int blk_copy_offload(struct block_device *src_bdev, int nr_srcs,
return cio_await_completion(cio);
}
+int blk_submit_rw_buf(struct block_device *bdev, void *buf, sector_t buf_len,
+ sector_t sector, unsigned int op, gfp_t gfp_mask)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+ struct bio *bio, *parent = NULL;
+ sector_t max_hw_len = min_t(unsigned int, queue_max_hw_sectors(q),
+ queue_max_segments(q) << (PAGE_SHIFT - SECTOR_SHIFT)) << SECTOR_SHIFT;
+ sector_t len, remaining;
+ int ret;
+
+ for (remaining = buf_len; remaining > 0; remaining -= len) {
+ len = min_t(int, max_hw_len, remaining);
+retry:
+ bio = bio_map_kern(q, buf, len, gfp_mask);
+ if (IS_ERR(bio)) {
+ len >>= 1;
+ if (len)
+ goto retry;
+ return PTR_ERR(bio);
+ }
+
+ bio->bi_iter.bi_sector = sector >> SECTOR_SHIFT;
+ bio->bi_opf = op;
+ bio_set_dev(bio, bdev);
+ bio->bi_end_io = NULL;
+ bio->bi_private = NULL;
+
+ if (parent) {
+ bio_chain(parent, bio);
+ submit_bio(parent);
+ }
+ parent = bio;
+ sector += len;
+ buf = (char *) buf + len;
+ }
+ ret = submit_bio_wait(bio);
+ bio_put(bio);
+
+ return ret;
+}
+
+static void *blk_alloc_buf(sector_t req_size, sector_t *alloc_size, gfp_t gfp_mask)
+{
+ int min_size = PAGE_SIZE;
+ void *buf;
+
+ while (req_size >= min_size) {
+ buf = kvmalloc(req_size, gfp_mask);
+ if (buf) {
+ *alloc_size = req_size;
+ return buf;
+ }
+ /* retry half the requested size */
+ req_size >>= 1;
+ }
+
+ return NULL;
+}
+
static inline int blk_copy_sanity_check(struct block_device *src_bdev,
struct block_device *dst_bdev, struct range_entry *rlist, int nr)
{
@@ -311,6 +370,64 @@ static inline int blk_copy_sanity_check(struct block_device *src_bdev,
return 0;
}
+static inline sector_t blk_copy_max_range(struct range_entry *rlist, int nr, sector_t *max_len)
+{
+ int i;
+ sector_t len = 0;
+
+ *max_len = 0;
+ for (i = 0; i < nr; i++) {
+ *max_len = max(*max_len, rlist[i].len);
+ len += rlist[i].len;
+ }
+
+ return len;
+}
+
+/*
+ * If native copy offload feature is absent, this function tries to emulate,
+ * by copying data from source to a temporary buffer and from buffer to
+ * destination device.
+ */
+static int blk_copy_emulate(struct block_device *src_bdev, int nr,
+ struct range_entry *rlist, struct block_device *dest_bdev, gfp_t gfp_mask)
+{
+ void *buf = NULL;
+ int ret, nr_i = 0;
+ sector_t src, dst, copy_len, buf_len, read_len, copied_len, max_len = 0, remaining = 0;
+
+ copy_len = blk_copy_max_range(rlist, nr, &max_len);
+ buf = blk_alloc_buf(max_len, &buf_len, gfp_mask);
+ if (!buf)
+ return -ENOMEM;
+
+ for (copied_len = 0; copied_len < copy_len; copied_len += read_len) {
+ if (!remaining) {
+ rlist[nr_i].comp_len = 0;
+ src = rlist[nr_i].src;
+ dst = rlist[nr_i].dst;
+ remaining = rlist[nr_i++].len;
+ }
+
+ read_len = min_t(sector_t, remaining, buf_len);
+ ret = blk_submit_rw_buf(src_bdev, buf, read_len, src, REQ_OP_READ, gfp_mask);
+ if (ret)
+ goto out;
+ src += read_len;
+ remaining -= read_len;
+ ret = blk_submit_rw_buf(dest_bdev, buf, read_len, dst, REQ_OP_WRITE,
+ gfp_mask);
+ if (ret)
+ goto out;
+ else
+ rlist[nr_i - 1].comp_len += read_len;
+ dst += read_len;
+ }
+out:
+ kvfree(buf);
+ return ret;
+}
+
static inline bool blk_check_copy_offload(struct request_queue *src_q,
struct request_queue *dest_q)
{
@@ -357,6 +474,8 @@ int blkdev_issue_copy(struct block_device *src_bdev, int nr,
if (blk_check_copy_offload(src_q, dest_q))
ret = blk_copy_offload(src_bdev, nr, rlist, dest_bdev, gfp_mask);
+ else
+ ret = blk_copy_emulate(src_bdev, nr, rlist, dest_bdev, gfp_mask);
return ret;
}
--
2.30.0-rc0
From: SelvaKumar S <[email protected]>
Introduce copy_jobs to use copy offload if it is supported by the underlying
devices; otherwise fall back to the existing method.
run_copy_job() calls the block layer copy offload API if both the source
and destination request queues are the same and support copy offload.
On successful completion, the destination region's copied count is set to
zero; failed regions are processed via the existing method.
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Arnav Dawn <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
drivers/md/dm-kcopyd.c | 55 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 6 deletions(-)
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 37b03ab7e5c9..214fadd6d71f 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -74,18 +74,20 @@ struct dm_kcopyd_client {
atomic_t nr_jobs;
/*
- * We maintain four lists of jobs:
+ * We maintain five lists of jobs:
*
- * i) jobs waiting for pages
- * ii) jobs that have pages, and are waiting for the io to be issued.
- * iii) jobs that don't need to do any IO and just run a callback
- * iv) jobs that have completed.
+ * i) jobs waiting to try copy offload
+ * ii) jobs waiting for pages
+ * iii) jobs that have pages, and are waiting for the io to be issued.
+ * iv) jobs that don't need to do any IO and just run a callback
+ * v) jobs that have completed.
*
- * All four of these are protected by job_lock.
+ * All five of these are protected by job_lock.
*/
spinlock_t job_lock;
struct list_head callback_jobs;
struct list_head complete_jobs;
+ struct list_head copy_jobs;
struct list_head io_jobs;
struct list_head pages_jobs;
};
@@ -579,6 +581,42 @@ static int run_io_job(struct kcopyd_job *job)
return r;
}
+static int run_copy_job(struct kcopyd_job *job)
+{
+ int r, i, count = 0;
+ struct range_entry range;
+
+ struct request_queue *src_q, *dest_q;
+
+ for (i = 0; i < job->num_dests; i++) {
+ range.dst = job->dests[i].sector << SECTOR_SHIFT;
+ range.src = job->source.sector << SECTOR_SHIFT;
+ range.len = job->source.count << SECTOR_SHIFT;
+
+ src_q = bdev_get_queue(job->source.bdev);
+ dest_q = bdev_get_queue(job->dests[i].bdev);
+
+ if (src_q != dest_q || !blk_queue_copy(src_q))
+ break;
+
+ r = blkdev_issue_copy(job->source.bdev, 1, &range, job->dests[i].bdev, GFP_KERNEL);
+ if (r)
+ break;
+
+ job->dests[i].count = 0;
+ count++;
+ }
+
+ if (count == job->num_dests) {
+ push(&job->kc->complete_jobs, job);
+ } else {
+ push(&job->kc->pages_jobs, job);
+ r = 0;
+ }
+
+ return r;
+}
+
static int run_pages_job(struct kcopyd_job *job)
{
int r;
@@ -659,6 +697,7 @@ static void do_work(struct work_struct *work)
spin_unlock_irq(&kc->job_lock);
blk_start_plug(&plug);
+ process_jobs(&kc->copy_jobs, kc, run_copy_job);
process_jobs(&kc->complete_jobs, kc, run_complete_job);
process_jobs(&kc->pages_jobs, kc, run_pages_job);
process_jobs(&kc->io_jobs, kc, run_io_job);
@@ -676,6 +715,8 @@ static void dispatch_job(struct kcopyd_job *job)
atomic_inc(&kc->nr_jobs);
if (unlikely(!job->source.count))
push(&kc->callback_jobs, job);
+ else if (job->source.bdev->bd_disk == job->dests[0].bdev->bd_disk)
+ push(&kc->copy_jobs, job);
else if (job->pages == &zero_page_list)
push(&kc->io_jobs, job);
else
@@ -916,6 +957,7 @@ struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *thro
spin_lock_init(&kc->job_lock);
INIT_LIST_HEAD(&kc->callback_jobs);
INIT_LIST_HEAD(&kc->complete_jobs);
+ INIT_LIST_HEAD(&kc->copy_jobs);
INIT_LIST_HEAD(&kc->io_jobs);
INIT_LIST_HEAD(&kc->pages_jobs);
kc->throttle = throttle;
@@ -971,6 +1013,7 @@ void dm_kcopyd_client_destroy(struct dm_kcopyd_client *kc)
BUG_ON(!list_empty(&kc->callback_jobs));
BUG_ON(!list_empty(&kc->complete_jobs));
+ WARN_ON(!list_empty(&kc->copy_jobs));
BUG_ON(!list_empty(&kc->io_jobs));
BUG_ON(!list_empty(&kc->pages_jobs));
destroy_workqueue(kc->kcopyd_wq);
--
2.30.0-rc0
Introduce blkdev_issue_copy which supports source and destination bdevs,
and an array of (source, destination and copy length) tuples.
Introduce the REQ_COPY copy offload operation flag. A read-write bio pair
is created with a token as payload and submitted to the device in order.
The read request populates the token with source-specific information,
which is then passed along with the write request.
This design is courtesy of Mikulas Patocka's token-based copy approach.
Larger copies are divided based on the max_copy_sectors and
max_copy_range_sectors limits.
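For illustration, a minimal in-kernel caller sketch (assumed, not part of
this patch): src_bdev/dst_bdev are already-opened block devices, and the
offsets/lengths are byte values that must be logical-block aligned, as
enforced by blk_copy_sanity_check():

    struct range_entry rlist[] = {
            {
                    .src = 0,         /* source offset, bytes */
                    .dst = 1 << 20,   /* destination offset, bytes */
                    .len = 1 << 20,   /* length, bytes (must be non-zero) */
            },
    };
    int ret;

    ret = blkdev_issue_copy(src_bdev, 1, rlist, dst_bdev, GFP_KERNEL);
    if (ret)
            /* comp_len reports how many bytes of the range completed */
            pr_err("copy failed after %llu bytes\n",
                   (unsigned long long)rlist[0].comp_len);

The same calling convention is used by the dm-kcopyd and nvmet users in
this series.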
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Arnav Dawn <[email protected]>
---
block/blk-lib.c | 227 ++++++++++++++++++++++++++++++++++++++
block/blk.h | 2 +
include/linux/blk_types.h | 21 ++++
include/linux/blkdev.h | 2 +
include/uapi/linux/fs.h | 14 +++
5 files changed, 266 insertions(+)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 1b8ced45e4e5..efa7e2a5fab7 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -135,6 +135,233 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL(blkdev_issue_discard);
+/*
+ * Wait on and process all in-flight BIOs. This must only be called once
+ * all bios have been issued so that the refcount can only decrease.
+ * This just waits for all bios to make it through bio_copy_end_io. IO
+ * errors are propagated through cio->io_error.
+ */
+static int cio_await_completion(struct cio *cio)
+{
+ int ret = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cio->lock, flags);
+ if (cio->refcount) {
+ cio->waiter = current;
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ spin_unlock_irqrestore(&cio->lock, flags);
+ blk_io_schedule();
+ /* wake up sets us TASK_RUNNING */
+ spin_lock_irqsave(&cio->lock, flags);
+ cio->waiter = NULL;
+ ret = cio->io_err;
+ }
+ spin_unlock_irqrestore(&cio->lock, flags);
+ kvfree(cio);
+
+ return ret;
+}
+
+static void bio_copy_end_io(struct bio *bio)
+{
+ struct copy_ctx *ctx = bio->bi_private;
+ struct cio *cio = ctx->cio;
+ sector_t clen;
+ int ri = ctx->range_idx;
+ unsigned long flags;
+
+ if (bio->bi_status) {
+ cio->io_err = bio->bi_status;
+ clen = (bio->bi_iter.bi_sector - ctx->start_sec) << SECTOR_SHIFT;
+ cio->rlist[ri].comp_len = min_t(sector_t, clen, cio->rlist[ri].comp_len);
+ }
+ __free_page(bio->bi_io_vec[0].bv_page);
+ kfree(ctx);
+ bio_put(bio);
+
+ spin_lock_irqsave(&cio->lock, flags);
+ if (!--cio->refcount && cio->waiter)
+ wake_up_process(cio->waiter);
+ spin_unlock_irqrestore(&cio->lock, flags);
+}
+
+/*
+ * blk_copy_offload - Use device's native copy offload feature
+ * Go through user provide payload, prepare new payload based on device's copy offload limits.
+ */
+int blk_copy_offload(struct block_device *src_bdev, int nr_srcs,
+ struct range_entry *rlist, struct block_device *dst_bdev, gfp_t gfp_mask)
+{
+ struct request_queue *sq = bdev_get_queue(src_bdev);
+ struct request_queue *dq = bdev_get_queue(dst_bdev);
+ struct bio *read_bio, *write_bio;
+ struct copy_ctx *ctx;
+ struct cio *cio;
+ struct page *token;
+ sector_t src_blk, copy_len, dst_blk;
+ sector_t remaining, max_copy_len = LONG_MAX;
+ unsigned long flags;
+ int ri = 0, ret = 0;
+
+ cio = kzalloc(sizeof(struct cio), GFP_KERNEL);
+ if (!cio)
+ return -ENOMEM;
+ cio->rlist = rlist;
+ spin_lock_init(&cio->lock);
+
+ max_copy_len = min_t(sector_t, sq->limits.max_copy_sectors, dq->limits.max_copy_sectors);
+ max_copy_len = min3(max_copy_len, (sector_t)sq->limits.max_copy_range_sectors,
+ (sector_t)dq->limits.max_copy_range_sectors) << SECTOR_SHIFT;
+
+ for (ri = 0; ri < nr_srcs; ri++) {
+ cio->rlist[ri].comp_len = rlist[ri].len;
+ src_blk = rlist[ri].src;
+ dst_blk = rlist[ri].dst;
+ for (remaining = rlist[ri].len; remaining > 0; remaining -= copy_len) {
+ copy_len = min(remaining, max_copy_len);
+
+ token = alloc_page(gfp_mask);
+ if (unlikely(!token)) {
+ ret = -ENOMEM;
+ goto err_token;
+ }
+
+ ctx = kzalloc(sizeof(struct copy_ctx), gfp_mask);
+ if (!ctx) {
+ ret = -ENOMEM;
+ goto err_ctx;
+ }
+ ctx->cio = cio;
+ ctx->range_idx = ri;
+ ctx->start_sec = rlist[ri].src;
+
+ read_bio = bio_alloc(src_bdev, 1, REQ_OP_READ | REQ_COPY | REQ_NOMERGE,
+ gfp_mask);
+ if (!read_bio) {
+ ret = -ENOMEM;
+ goto err_read_bio;
+ }
+ read_bio->bi_iter.bi_sector = src_blk >> SECTOR_SHIFT;
+ read_bio->bi_iter.bi_size = copy_len;
+ __bio_add_page(read_bio, token, PAGE_SIZE, 0);
+ ret = submit_bio_wait(read_bio);
+ bio_put(read_bio);
+ if (ret)
+ goto err_read_bio;
+
+ write_bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE | REQ_COPY | REQ_NOMERGE,
+ gfp_mask);
+ if (!write_bio) {
+ ret = -ENOMEM;
+ goto err_read_bio;
+ }
+ write_bio->bi_iter.bi_sector = dst_blk >> SECTOR_SHIFT;
+ write_bio->bi_iter.bi_size = copy_len;
+ __bio_add_page(write_bio, token, PAGE_SIZE, 0);
+ write_bio->bi_end_io = bio_copy_end_io;
+ write_bio->bi_private = ctx;
+
+ spin_lock_irqsave(&cio->lock, flags);
+ ++cio->refcount;
+ spin_unlock_irqrestore(&cio->lock, flags);
+
+ submit_bio(write_bio);
+ src_blk += copy_len;
+ dst_blk += copy_len;
+ }
+ }
+
+ /* Wait for completion of all IO's*/
+ return cio_await_completion(cio);
+
+err_read_bio:
+ kfree(ctx);
+err_ctx:
+ __free_page(token);
+err_token:
+ rlist[ri].comp_len = min_t(sector_t, rlist[ri].comp_len, (rlist[ri].len - remaining));
+
+ cio->io_err = ret;
+ return cio_await_completion(cio);
+}
+
+static inline int blk_copy_sanity_check(struct block_device *src_bdev,
+ struct block_device *dst_bdev, struct range_entry *rlist, int nr)
+{
+ unsigned int align_mask = max(
+ bdev_logical_block_size(dst_bdev), bdev_logical_block_size(src_bdev)) - 1;
+ sector_t len = 0;
+ int i;
+
+ for (i = 0; i < nr; i++) {
+ if (rlist[i].len)
+ len += rlist[i].len;
+ else
+ return -EINVAL;
+ if ((rlist[i].dst & align_mask) || (rlist[i].src & align_mask) ||
+ (rlist[i].len & align_mask))
+ return -EINVAL;
+ rlist[i].comp_len = 0;
+ }
+
+ if (len && len >= MAX_COPY_TOTAL_LENGTH)
+ return -EINVAL;
+
+ return 0;
+}
+
+static inline bool blk_check_copy_offload(struct request_queue *src_q,
+ struct request_queue *dest_q)
+{
+ if (blk_queue_copy(dest_q) && blk_queue_copy(src_q))
+ return true;
+
+ return false;
+}
+
+/*
+ * blkdev_issue_copy - queue a copy
+ * @src_bdev: source block device
+ * @nr_srcs: number of source ranges to copy
+ * @rlist: array of source/dest/len
+ * @dest_bdev: destination block device
+ * @gfp_mask: memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ * Copy source ranges from source block device to destination block device.
+ * length of a source range cannot be zero.
+ */
+int blkdev_issue_copy(struct block_device *src_bdev, int nr,
+ struct range_entry *rlist, struct block_device *dest_bdev, gfp_t gfp_mask)
+{
+ struct request_queue *src_q = bdev_get_queue(src_bdev);
+ struct request_queue *dest_q = bdev_get_queue(dest_bdev);
+ int ret = -EINVAL;
+
+ if (!src_q || !dest_q)
+ return -ENXIO;
+
+ if (!nr)
+ return -EINVAL;
+
+ if (nr >= MAX_COPY_NR_RANGE)
+ return -EINVAL;
+
+ if (bdev_read_only(dest_bdev))
+ return -EPERM;
+
+ ret = blk_copy_sanity_check(src_bdev, dest_bdev, rlist, nr);
+ if (ret)
+ return ret;
+
+ if (blk_check_copy_offload(src_q, dest_q))
+ ret = blk_copy_offload(src_bdev, nr, rlist, dest_bdev, gfp_mask);
+
+ return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_copy);
+
/**
* __blkdev_issue_write_same - generate number of bios with same page
* @bdev: target blockdev
diff --git a/block/blk.h b/block/blk.h
index abb663a2a147..94d2b055750b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,6 +292,8 @@ static inline bool blk_may_split(struct request_queue *q, struct bio *bio)
break;
}
+ if (unlikely(op_is_copy(bio->bi_opf)))
+ return false;
/*
* All drivers must accept single-segments bios that are <= PAGE_SIZE.
* This is a quick and dirty check that relies on the fact that
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 5561e58d158a..e5c967c9f174 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -418,6 +418,7 @@ enum req_flag_bits {
/* for driver use */
__REQ_DRV,
__REQ_SWAP, /* swapping request. */
+ __REQ_COPY, /* copy request*/
__REQ_NR_BITS, /* stops here */
};
@@ -442,6 +443,7 @@ enum req_flag_bits {
#define REQ_DRV (1ULL << __REQ_DRV)
#define REQ_SWAP (1ULL << __REQ_SWAP)
+#define REQ_COPY (1ULL << __REQ_COPY)
#define REQ_FAILFAST_MASK \
(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
@@ -498,6 +500,11 @@ static inline bool op_is_discard(unsigned int op)
return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
}
+static inline bool op_is_copy(unsigned int op)
+{
+ return (op & REQ_COPY);
+}
+
/*
* Check if a bio or request operation is a zone management operation, with
* the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
@@ -532,4 +539,18 @@ struct blk_rq_stat {
u64 batch;
};
+struct cio {
+ struct range_entry *rlist;
+ struct task_struct *waiter; /* waiting task (NULL if none) */
+ spinlock_t lock; /* protects refcount and waiter */
+ int refcount;
+ blk_status_t io_err;
+};
+
+struct copy_ctx {
+ int range_idx;
+ sector_t start_sec;
+ struct cio *cio;
+};
+
#endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 792e6d556589..1c2f65f1f143 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1121,6 +1121,8 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
struct bio **biop);
struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
gfp_t gfp_mask);
+int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
+ struct range_entry *src_rlist, struct block_device *dest_bdev, gfp_t gfp_mask);
#define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */
#define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..55bca8f6e8ed 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -64,6 +64,20 @@ struct fstrim_range {
__u64 minlen;
};
+/* Maximum no of entries supported */
+#define MAX_COPY_NR_RANGE (1 << 12)
+
+/* maximum total copy length */
+#define MAX_COPY_TOTAL_LENGTH (1 << 21)
+
+/* Source range entry for copy */
+struct range_entry {
+ __u64 src;
+ __u64 dst;
+ __u64 len;
+ __u64 comp_len;
+};
+
/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
#define FILE_DEDUPE_RANGE_SAME 0
#define FILE_DEDUPE_RANGE_DIFFERS 1
--
2.30.0-rc0
On Mon, Feb 14, 2022 at 01:29:50PM +0530, Nitesh Shetty wrote:
> The patch series covers the points discussed in November 2021 virtual call
> [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
> We have covered the Initial agreed requirements in this patchset.
> Patchset borrows Mikulas's token based approach for 2 bdev
> implementation.
>
> Overall series supports –
>
> 1. Driver
> - NVMe Copy command (single NS), including support in nvme-target (for
> block and file backend)
>
> 2. Block layer
> - Block-generic copy (REQ_COPY flag), with interface accommodating
> two block-devs, and multi-source/destination interface
> - Emulation, when offload is natively absent
> - dm-linear support (for cases not requiring split)
>
> 3. User-interface
> - new ioctl
>
> 4. In-kernel user
> - dm-kcopyd
The biggest missing piece - and arguably the single most useful
piece of this functionality for users - is hooking this up to the
copy_file_range() syscall so that user file copies can be offloaded
to the hardware efficiently.
This seems like it would be relatively easy to do with an fs/iomap iter
loop that maps src + dst file ranges and issues block copy offload
commands on the extents. We already do similar "read from source,
write to destination" operations in iomap, so it's not a huge
stretch to extend the iomap interfaces to provide a copy offload
mechanism using this infrastructure.
Also, hooking this up to copy_file_range() will get you
immediate data integrity testing right down to the hardware via fsx
in fstests - it uses copy_file_range() as one of its operations and
it will find all the off-by-one failures in both the linux IO stack
implementation and the hardware itself.
And, in reality, I wouldn't trust a block copy offload mechanism
until it is integrated with filesystems, the page cache and has
solid end-to-end data integrity testing available to shake out all
the bugs that will inevitably exist in this stack....
Cheers,
Dave.
--
Dave Chinner
[email protected]
The subject says limits for copy-offload...
On Mon, Feb 14, 2022 at 01:29:52PM +0530, Nitesh Shetty wrote:
> Add device limits as sysfs entries,
> - copy_offload (RW)
> - copy_max_bytes (RW)
> - copy_max_hw_bytes (RO)
> - copy_max_range_bytes (RW)
> - copy_max_range_hw_bytes (RO)
> - copy_max_nr_ranges (RW)
> - copy_max_nr_ranges_hw (RO)
Some of these seem like generic... and also I see a few more max_hw ones
not listed above...
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> +/**
> + * blk_queue_max_copy_sectors - set max sectors for a single copy payload
> + * @q: the request queue for the device
> + * @max_copy_sectors: maximum number of sectors to copy
> + **/
> +void blk_queue_max_copy_sectors(struct request_queue *q,
> + unsigned int max_copy_sectors)
> +{
> + q->limits.max_hw_copy_sectors = max_copy_sectors;
> + q->limits.max_copy_sectors = max_copy_sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_max_copy_sectors);
Please use EXPORT_SYMBOL_GPL() for all new things.
Why is this setting both? The documentation doesn't seem to say.
What's the point?
> +
> +/**
> + * blk_queue_max_copy_range_sectors - set max sectors for a single range, in a copy payload
> + * @q: the request queue for the device
> + * @max_copy_range_sectors: maximum number of sectors to copy in a single range
> + **/
> +void blk_queue_max_copy_range_sectors(struct request_queue *q,
> + unsigned int max_copy_range_sectors)
> +{
> + q->limits.max_hw_copy_range_sectors = max_copy_range_sectors;
> + q->limits.max_copy_range_sectors = max_copy_range_sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_max_copy_range_sectors);
Same here.
> +/**
> + * blk_queue_max_copy_nr_ranges - set max number of ranges, in a copy payload
> + * @q: the request queue for the device
> + * @max_copy_nr_ranges: maximum number of ranges
> + **/
> +void blk_queue_max_copy_nr_ranges(struct request_queue *q,
> + unsigned int max_copy_nr_ranges)
> +{
> + q->limits.max_hw_copy_nr_ranges = max_copy_nr_ranges;
> + q->limits.max_copy_nr_ranges = max_copy_nr_ranges;
> +}
> +EXPORT_SYMBOL(blk_queue_max_copy_nr_ranges);
Same.
> +
> /**
> * blk_queue_max_write_same_sectors - set max sectors for a single write same
> * @q: the request queue for the device
> @@ -541,6 +592,14 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> t->max_segment_size = min_not_zero(t->max_segment_size,
> b->max_segment_size);
>
> + t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
> + t->max_hw_copy_sectors = min(t->max_hw_copy_sectors, b->max_hw_copy_sectors);
> + t->max_copy_range_sectors = min(t->max_copy_range_sectors, b->max_copy_range_sectors);
> + t->max_hw_copy_range_sectors = min(t->max_hw_copy_range_sectors,
> + b->max_hw_copy_range_sectors);
> + t->max_copy_nr_ranges = min(t->max_copy_nr_ranges, b->max_copy_nr_ranges);
> + t->max_hw_copy_nr_ranges = min(t->max_hw_copy_nr_ranges, b->max_hw_copy_nr_ranges);
> +
> t->misaligned |= b->misaligned;
>
> alignment = queue_limit_alignment_offset(b, start);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 9f32882ceb2f..9ddd07f142d9 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -212,6 +212,129 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
> return queue_var_show(0, page);
> }
>
> +static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
> +{
> + return queue_var_show(blk_queue_copy(q), page);
> +}
> +
> +static ssize_t queue_copy_offload_store(struct request_queue *q,
> + const char *page, size_t count)
> +{
> + unsigned long copy_offload;
> + ssize_t ret = queue_var_store(&copy_offload, page, count);
> +
> + if (ret < 0)
> + return ret;
> +
> + if (copy_offload && !q->limits.max_hw_copy_sectors)
> + return -EINVAL;
If the kernel schedules, copy_offload may still be true and
max_hw_copy_sectors may be set to 0. Is that an issue?
> +
> + if (copy_offload)
> + blk_queue_flag_set(QUEUE_FLAG_COPY, q);
> + else
> + blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
Similarly, by the time we get here the queue flag may already have been
set or cleared by someone else. Is that an issue?
> @@ -597,6 +720,14 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
> QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
> QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>
> +QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
> +QUEUE_RO_ENTRY(queue_copy_max_hw, "copy_max_hw_bytes");
> +QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
> +QUEUE_RO_ENTRY(queue_copy_range_max_hw, "copy_max_range_hw_bytes");
> +QUEUE_RW_ENTRY(queue_copy_range_max, "copy_max_range_bytes");
> +QUEUE_RO_ENTRY(queue_copy_nr_ranges_max_hw, "copy_max_nr_ranges_hw");
> +QUEUE_RW_ENTRY(queue_copy_nr_ranges_max, "copy_max_nr_ranges");
Seems like you need to update Documentation/ABI/stable/sysfs-block.
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index efed3820cbf7..792e6d556589 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -254,6 +254,13 @@ struct queue_limits {
> unsigned int discard_alignment;
> unsigned int zone_write_granularity;
>
> + unsigned long max_hw_copy_sectors;
> + unsigned long max_copy_sectors;
> + unsigned int max_hw_copy_range_sectors;
> + unsigned int max_copy_range_sectors;
> + unsigned short max_hw_copy_nr_ranges;
> + unsigned short max_copy_nr_ranges;
Before limits start growing more.. I wonder if we should just
stuff hw offload stuff to its own struct within queue_limits.
Christoph?
Luis
On Mon, Feb 14, 2022 at 01:29:51PM +0530, Nitesh Shetty wrote:
> From: SelvaKumar S <[email protected]>
>
> Make bio_map_kern() non static
>
> Signed-off-by: SelvaKumar S <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
This patch makes no sense on its own. I'd just merge it with
its first user.
Luis
On 2/17/22 1:07 AM, Luis Chamberlain wrote:
> The subject says limits for copy-offload...
>
> On Mon, Feb 14, 2022 at 01:29:52PM +0530, Nitesh Shetty wrote:
>> Add device limits as sysfs entries,
>> - copy_offload (RW)
>> - copy_max_bytes (RW)
>> - copy_max_hw_bytes (RO)
>> - copy_max_range_bytes (RW)
>> - copy_max_range_hw_bytes (RO)
>> - copy_max_nr_ranges (RW)
>> - copy_max_nr_ranges_hw (RO)
>
> Some of these seem like generic... and also I see a few more max_hw ones
> not listed above...
>
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> +/**
>> + * blk_queue_max_copy_sectors - set max sectors for a single copy payload
>> + * @q: the request queue for the device
>> + * @max_copy_sectors: maximum number of sectors to copy
>> + **/
>> +void blk_queue_max_copy_sectors(struct request_queue *q,
>> + unsigned int max_copy_sectors)
>> +{
>> + q->limits.max_hw_copy_sectors = max_copy_sectors;
>> + q->limits.max_copy_sectors = max_copy_sectors;
>> +}
>> +EXPORT_SYMBOL(blk_queue_max_copy_sectors);
>
> Please use EXPORT_SYMBOL_GPL() for all new things.
>
> Why is this setting both? The documentation doesn't seem to say.
> What's the point?
>
>> +
>> +/**
>> + * blk_queue_max_copy_range_sectors - set max sectors for a single range, in a copy payload
>> + * @q: the request queue for the device
>> + * @max_copy_range_sectors: maximum number of sectors to copy in a single range
>> + **/
>> +void blk_queue_max_copy_range_sectors(struct request_queue *q,
>> + unsigned int max_copy_range_sectors)
>> +{
>> + q->limits.max_hw_copy_range_sectors = max_copy_range_sectors;
>> + q->limits.max_copy_range_sectors = max_copy_range_sectors;
>> +}
>> +EXPORT_SYMBOL(blk_queue_max_copy_range_sectors);
>
> Same here.
>
>> +/**
>> + * blk_queue_max_copy_nr_ranges - set max number of ranges, in a copy payload
>> + * @q: the request queue for the device
>> + * @max_copy_nr_ranges: maximum number of ranges
>> + **/
>> +void blk_queue_max_copy_nr_ranges(struct request_queue *q,
>> + unsigned int max_copy_nr_ranges)
>> +{
>> + q->limits.max_hw_copy_nr_ranges = max_copy_nr_ranges;
>> + q->limits.max_copy_nr_ranges = max_copy_nr_ranges;
>> +}
>> +EXPORT_SYMBOL(blk_queue_max_copy_nr_ranges);
>
> Same.
>
>> +
>> /**
>> * blk_queue_max_write_same_sectors - set max sectors for a single write same
>> * @q: the request queue for the device
>> @@ -541,6 +592,14 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>> t->max_segment_size = min_not_zero(t->max_segment_size,
>> b->max_segment_size);
>>
>> + t->max_copy_sectors = min(t->max_copy_sectors, b->max_copy_sectors);
>> + t->max_hw_copy_sectors = min(t->max_hw_copy_sectors, b->max_hw_copy_sectors);
>> + t->max_copy_range_sectors = min(t->max_copy_range_sectors, b->max_copy_range_sectors);
>> + t->max_hw_copy_range_sectors = min(t->max_hw_copy_range_sectors,
>> + b->max_hw_copy_range_sectors);
>> + t->max_copy_nr_ranges = min(t->max_copy_nr_ranges, b->max_copy_nr_ranges);
>> + t->max_hw_copy_nr_ranges = min(t->max_hw_copy_nr_ranges, b->max_hw_copy_nr_ranges);
>> +
>> t->misaligned |= b->misaligned;
>>
>> alignment = queue_limit_alignment_offset(b, start);
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 9f32882ceb2f..9ddd07f142d9 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -212,6 +212,129 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
>> return queue_var_show(0, page);
>> }
>>
>> +static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
>> +{
>> + return queue_var_show(blk_queue_copy(q), page);
>> +}
>> +
>> +static ssize_t queue_copy_offload_store(struct request_queue *q,
>> + const char *page, size_t count)
>> +{
>> + unsigned long copy_offload;
>> + ssize_t ret = queue_var_store(&copy_offload, page, count);
>> +
>> + if (ret < 0)
>> + return ret;
>> +
>> + if (copy_offload && !q->limits.max_hw_copy_sectors)
>> + return -EINVAL;
>
>
> If the kernel schedules, copy_offload may still be true and
> max_hw_copy_sectors may be set to 0. Is that an issue?
>
>> +
>> + if (copy_offload)
>> + blk_queue_flag_set(QUEUE_FLAG_COPY, q);
>> + else
>> + blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
>
> Similarly, by the time we get here the queue flag may already have been
> set or cleared by someone else. Is that an issue?
>
>> @@ -597,6 +720,14 @@ QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
>> QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
>> QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
>>
>> +QUEUE_RW_ENTRY(queue_copy_offload, "copy_offload");
>> +QUEUE_RO_ENTRY(queue_copy_max_hw, "copy_max_hw_bytes");
>> +QUEUE_RW_ENTRY(queue_copy_max, "copy_max_bytes");
>> +QUEUE_RO_ENTRY(queue_copy_range_max_hw, "copy_max_range_hw_bytes");
>> +QUEUE_RW_ENTRY(queue_copy_range_max, "copy_max_range_bytes");
>> +QUEUE_RO_ENTRY(queue_copy_nr_ranges_max_hw, "copy_max_nr_ranges_hw");
>> +QUEUE_RW_ENTRY(queue_copy_nr_ranges_max, "copy_max_nr_ranges");
>
> Seems like you need to update Documentation/ABI/stable/sysfs-block.
>
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index efed3820cbf7..792e6d556589 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -254,6 +254,13 @@ struct queue_limits {
>> unsigned int discard_alignment;
>> unsigned int zone_write_granularity;
>>
>> + unsigned long max_hw_copy_sectors;
>> + unsigned long max_copy_sectors;
>> + unsigned int max_hw_copy_range_sectors;
>> + unsigned int max_copy_range_sectors;
>> + unsigned short max_hw_copy_nr_ranges;
>> + unsigned short max_copy_nr_ranges;
>
> Before limits start growing more.. I wonder if we should just
> stuff hw offload stuff to its own struct within queue_limits.
>
> Christoph?
>
Potentially use a pointer to a structure and maybe make it configurable,
although I'm not sure about the latter part; I'll let Christoph decide
that.
> Luis
>
-ck
On Thu, Feb 17, 2022 at 10:16:21AM +0000, Chaitanya Kulkarni wrote:
> On 2/17/22 1:07 AM, Luis Chamberlain wrote:
> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >> index efed3820cbf7..792e6d556589 100644
> >> --- a/include/linux/blkdev.h
> >> +++ b/include/linux/blkdev.h
> >> @@ -254,6 +254,13 @@ struct queue_limits {
> >> unsigned int discard_alignment;
> >> unsigned int zone_write_granularity;
> >>
> >> + unsigned long max_hw_copy_sectors;
> >> + unsigned long max_copy_sectors;
> >> + unsigned int max_hw_copy_range_sectors;
> >> + unsigned int max_copy_range_sectors;
> >> + unsigned short max_hw_copy_nr_ranges;
> >> + unsigned short max_copy_nr_ranges;
> >
> > Before limits start growing more.. I wonder if we should just
> > stuff hw offload stuff to its own struct within queue_limits.
> >
> > Christoph?
> >
>
> Potentially use a pointer to a structure and maybe make it configurable,
Did you mean to make queue limits local or for hw offload, and make that
a pointer? If so, that seems odd, because even for hw copy offload we
still need the other limits, no?
So what I meant was that struct queue_limits seems to be getting large,
and that hw copy offload seems like an example use case where we should
probably use a separate struct for it. And while at it, well, start
adding kdocs for these things, because there are tons of things that
could use kdoc love.
> although I'm not sure about the latter part; I'll let Christoph decide
> that.
Luis
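A minimal sketch of the kind of grouping Luis is suggesting; the struct name
and field layout below are made up for illustration and are not taken from
the series:

/* All copy-offload limits grouped together instead of six flat fields. */
struct blk_copy_limits {
	unsigned long	max_hw_sectors;		/* device limit per payload */
	unsigned long	max_sectors;		/* soft limit per payload */
	unsigned int	max_hw_range_sectors;	/* device limit per range */
	unsigned int	max_range_sectors;	/* soft limit per range */
	unsigned short	max_hw_nr_ranges;	/* device limit on nr of ranges */
	unsigned short	max_nr_ranges;		/* soft limit on nr of ranges */
};

struct queue_limits {
	/* ... existing fields ... */
	struct blk_copy_limits	copy;	/* q->limits.copy.max_sectors etc. */
};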
On Thu, Feb 17, 2022 at 06:32:15PM +0530, Nitesh Shetty wrote:
> Tue, Feb 15, 2022 at 09:08:12AM +1100, Dave Chinner wrote:
> > On Mon, Feb 14, 2022 at 01:29:50PM +0530, Nitesh Shetty wrote:
> > > [LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
> > The biggest missing piece - and arguably the single most useful
> > piece of this functionality for users - is hooking this up to the
> > copy_file_range() syscall so that user file copies can be offloaded
> > to the hardware efficiently.
> >
> > This seems like it would be relatively easy to do with an fs/iomap iter
> > loop that maps src + dst file ranges and issues block copy offload
> > commands on the extents. We already do similar "read from source,
> > write to destination" operations in iomap, so it's not a huge
> > stretch to extend the iomap interfaces to provide a copy offload
> > mechanism using this infrastructure.
> >
> > Also, hooking this up to copy_file_range() will get you
> > immediate data integrity testing right down to the hardware via fsx
> > in fstests - it uses copy_file_range() as one of its operations and
> > it will find all the off-by-one failures in both the linux IO stack
> > implementation and the hardware itself.
> >
> > And, in reality, I wouldn't trust a block copy offload mechanism
> > until it is integrated with filesystems and the page cache, and has
> > solid end-to-end data integrity testing available to shake out all
> > the bugs that will inevitably exist in this stack....
>
> We had planned copy_file_range (CFR) for the next phase of the copy offload
> patch series, thinking that we would get to CFR once everything else is robust.
> But if that is needed to make things robust, we will start looking into it.
How do you make it robust when there is no locking/serialisation to
prevent overlapping concurrent IO while the copy-offload is in
progress? Or that you don't have overlapping concurrent
copy-offloads running at the same time?
You've basically created a block dev ioctl interface that looks
impossible to use safely. It doesn't appear to be coherent with the
blockdev page cache nor does it appear to have any documented data
integrity semantics, either. e.g. how does this interact with the
guarantees that fsync_bdev() and/or sync_blockdev() are supposed to
provide?
IOWs, if you don't have either CFR or some other strictly bound
kernel user with well defined access, synchronisation and integrity
semantics, how can anyone actually robustly test these ioctls to be
working correctly in all situations they might be called?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Feb 17, 2022 at 06:29:01PM +0530, Nitesh Shetty wrote:
> Thu, Feb 17, 2022 at 01:07:00AM -0800, Luis Chamberlain wrote:
> > The subject says limits for copy-offload...
> >
> > On Mon, Feb 14, 2022 at 01:29:52PM +0530, Nitesh Shetty wrote:
> > > Add device limits as sysfs entries,
> > > - copy_offload (RW)
> > > - copy_max_bytes (RW)
> > > - copy_max_hw_bytes (RO)
> > > - copy_max_range_bytes (RW)
> > > - copy_max_range_hw_bytes (RO)
> > > - copy_max_nr_ranges (RW)
> > > - copy_max_nr_ranges_hw (RO)
> >
> > Some of these seem like generic... and also I see a few more max_hw ones
> > not listed above...
> >
> The queue_limits fields and the sysfs entries are named differently.
> All sysfs entries start with a copy_* prefix, which also makes it easy
> to look up all the copy sysfs attributes.
> For the queue limits naming, I tried to follow the existing queue limit
> convention (like discard).
My point was that your subject seems to indicate the changes are just
for copy-offload, but you seem to be adding generic queue limits as
well. Is that correct? If so then perhaps the subject should be changed
or the patch split up.
> > > +static ssize_t queue_copy_offload_store(struct request_queue *q,
> > > + const char *page, size_t count)
> > > +{
> > > + unsigned long copy_offload;
> > > + ssize_t ret = queue_var_store(&copy_offload, page, count);
> > > +
> > > + if (ret < 0)
> > > + return ret;
> > > +
> > > + if (copy_offload && !q->limits.max_hw_copy_sectors)
> > > + return -EINVAL;
> >
> >
> > If the kernel schedules, copy_offload may still be true and
> > max_hw_copy_sectors may be set to 0. Is that an issue?
> >
>
> This check ensures that we don't enable offload if the device doesn't
> support offload. I feel it shouldn't be an issue.
My point was this:
Time  CPU1                                     CPU2
1)    if (copy_offload
2)      ---> preemption so it schedules
3)                                             some other high priority task sets
                                               q->limits.max_hw_copy_sectors to 0
4)    && !q->limits.max_hw_copy_sectors)
Can something bad happen if we allow for this?
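One possible way to close the window described above, sketched on the
assumption that the hardware limit can really change concurrently and that
taking q->queue_lock in the store handler is acceptable (the series confirms
neither), is to re-check the limit and update the flag in a single critical
section:

static ssize_t queue_copy_offload_store(struct request_queue *q,
					const char *page, size_t count)
{
	unsigned long copy_offload;
	ssize_t ret = queue_var_store(&copy_offload, page, count);

	if (ret < 0)
		return ret;

	spin_lock_irq(&q->queue_lock);
	/* Re-check the limit and flip the flag in one critical section so a
	 * concurrent update of max_hw_copy_sectors cannot slip in between
	 * the check and the flag change. */
	if (copy_offload && !q->limits.max_hw_copy_sectors) {
		spin_unlock_irq(&q->queue_lock);
		return -EINVAL;
	}
	if (copy_offload)
		blk_queue_flag_set(QUEUE_FLAG_COPY, q);
	else
		blk_queue_flag_clear(QUEUE_FLAG_COPY, q);
	spin_unlock_irq(&q->queue_lock);

	return ret;
}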
On 2/23/22 09:55, Luis Chamberlain wrote:
> On Thu, Feb 17, 2022 at 06:29:01PM +0530, Nitesh Shetty wrote:
>> Thu, Feb 17, 2022 at 01:07:00AM -0800, Luis Chamberlain wrote:
>>> The subject says limits for copy-offload...
>>>
>>> On Mon, Feb 14, 2022 at 01:29:52PM +0530, Nitesh Shetty wrote:
>>>> Add device limits as sysfs entries,
>>>> - copy_offload (RW)
>>>> - copy_max_bytes (RW)
>>>> - copy_max_hw_bytes (RO)
>>>> - copy_max_range_bytes (RW)
>>>> - copy_max_range_hw_bytes (RO)
>>>> - copy_max_nr_ranges (RW)
>>>> - copy_max_nr_ranges_hw (RO)
>>>
>>> Some of these seem like generic... and also I see a few more max_hw ones
>>> not listed above...
>>>
>> The queue_limits fields and the sysfs entries are named differently.
>> All sysfs entries start with a copy_* prefix, which also makes it easy
>> to look up all the copy sysfs attributes.
>> For the queue limits naming, I tried to follow the existing queue limit
>> convention (like discard).
>
> My point was that your subject seems to indicate the changes are just
> for copy-offload, but you seem to be adding generic queue limits as
> well. Is that correct? If so then perhaps the subject should be changed
> or the patch split up.
>
>>>> +static ssize_t queue_copy_offload_store(struct request_queue *q,
>>>> + const char *page, size_t count)
>>>> +{
>>>> + unsigned long copy_offload;
>>>> + ssize_t ret = queue_var_store(&copy_offload, page, count);
>>>> +
>>>> + if (ret < 0)
>>>> + return ret;
>>>> +
>>>> + if (copy_offload && !q->limits.max_hw_copy_sectors)
>>>> + return -EINVAL;
>>>
>>>
>>> If the kernel schedules, copy_offload may still be true and
>>> max_hw_copy_sectors may be set to 0. Is that an issue?
>>>
>>
>> This check ensures that we don't enable offload if the device doesn't
>> support offload. I feel it shouldn't be an issue.
>
> My point was this:
>
> Time  CPU1                                     CPU2
> 1)    if (copy_offload
> 2)      ---> preemption so it schedules
> 3)                                             some other high priority task sets
>                                                q->limits.max_hw_copy_sectors to 0
> 4)    && !q->limits.max_hw_copy_sectors)
>
> Can something bad happen if we allow for this?
max_hw_copy_sectors describes the device capability to offload copy. So
this is read-only and "max_hw_copy_sectors != 0" means that the device
supports copy offload (this attribute should really be named
max_hw_copy_offload_sectors).
The actual loop to issue copy offload BIOs, however, must use the soft
version of the attribute: max_copy_sectors, which defaults to
max_hw_copy_sectors if copy offload is turned on and, I guess, to
max_sectors for the emulation case.
Now, with this in mind, I do not see how allowing max_copy_sectors to be
0 makes sense, since:
1) If copy_offload is true, we will rely on the device and chunk copy
offload BIOs up to max_copy_sectors
2) If copy_offload is false (or device does not support it), emulation
will be used by issuing read/write BIOs of up to max_copy_sectors.
Thus max_copy_sectors must always be at least equal to the device
minimum IO size, that is, the logical block size.
--
Damien Le Moal
Western Digital Research
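To make the hard/soft split concrete, a copy_max_bytes store handler could
clamp roughly as sketched below: reject anything under a logical block and cap
at the hardware limit. The exact clamping policy is an assumption for
illustration, not what the posted series implements:

static ssize_t queue_copy_max_store(struct request_queue *q,
				    const char *page, size_t count)
{
	unsigned long max_copy_bytes;
	ssize_t ret = queue_var_store(&max_copy_bytes, page, count);

	if (ret < 0)
		return ret;

	/* The soft limit can never go below one logical block... */
	if (max_copy_bytes < queue_logical_block_size(q))
		return -EINVAL;

	/* ...and never above what the hardware reported at probe time. */
	if (max_copy_bytes >> SECTOR_SHIFT > q->limits.max_hw_copy_sectors)
		max_copy_bytes = q->limits.max_hw_copy_sectors << SECTOR_SHIFT;

	q->limits.max_copy_sectors = max_copy_bytes >> SECTOR_SHIFT;
	return ret;
}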