2020-06-28 23:02:31

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH 0/2] Zone Descriptor Extension for Zoned Block Devices

Hi,

This patchset adds support for the Zone Descriptor Extension feature
that is defined in the NVMe Zoned Namespace Command Set.

The feature adds support for associating data to a zone that is in
the Empty state. Upon successful completion, the specified zone
transitions to the Closed state and further writes can be issued
to the zone. The data is lost when the zone at some point transitions
to the Empty state, the Read Only state, or the Offline state. For
example, the lifetime of the data is valid until a zone reset is
issued on the specific zone.

The first patch adds support for the zone_desc_ext_bytes queue sysfs
entry, and the second patch adds a ioctl to allow user-space to
associate data to a specific zone.

Support for the feature can be detected through the zone_desc_ext_bytes
queue sysfs. A value larger than zero indicates support, and zero value
indicates no support.

Best, Matias

Matias Bjørling (2):
block: add zone_desc_ext_bytes to sysfs
block: add BLKSETDESCZONE ioctl for Zoned Block Devices

Documentation/block/queue-sysfs.rst | 6 ++
block/blk-sysfs.c | 15 +++-
block/blk-zoned.c | 108 ++++++++++++++++++++++++++++
block/ioctl.c | 2 +
drivers/nvme/host/core.c | 3 +
drivers/nvme/host/nvme.h | 9 +++
drivers/nvme/host/zns.c | 12 ++++
drivers/scsi/sd_zbc.c | 1 +
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 31 +++++++-
include/uapi/linux/blkzoned.h | 20 +++++-
11 files changed, 206 insertions(+), 3 deletions(-)

--
2.17.1


2020-06-28 23:04:38

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH 1/2] block: add zone_desc_ext_bytes to sysfs

The NVMe Zoned Namespace Command Set adds support for associating
data to a zone through the Zone Descriptor Extension feature.

The Zone Descriptor Extension size is fixed to a multiple of 64
bytes. A value of zero communicates the feature is not available.
A value larger than zero communites the feature is available, and
the specified Zone Descriptor Extension size in bytes.

The Zone Descriptor Extension feature is only available in the
NVMe Zoned Namespaces Command Set. Devices that supports ZAC/ZBC
therefore reports this value as zero, where as the NVMe device
driver reports the Zone Descriptor Extension size from the
specific device.

Signed-off-by: Matias Bjørling <[email protected]>
---
Documentation/block/queue-sysfs.rst | 6 ++++++
block/blk-sysfs.c | 15 ++++++++++++++-
drivers/nvme/host/zns.c | 1 +
drivers/scsi/sd_zbc.c | 1 +
include/linux/blkdev.h | 22 ++++++++++++++++++++++
5 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
index f261a5c84170..c4fa195c87b4 100644
--- a/Documentation/block/queue-sysfs.rst
+++ b/Documentation/block/queue-sysfs.rst
@@ -265,4 +265,10 @@ devices are described in the ZBC (Zoned Block Commands) and ZAC
do not support zone commands, they will be treated as regular block devices
and zoned will report "none".

+zone_desc_ext_bytes (RO)
+-------------------------
+This indicates the zone description extension (ZDE) size, in bytes, of a zoned
+block device. A value of '0' means that zone description extension is not
+supported.
+
Jens Axboe <[email protected]>, February 2009
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 624bb4d85fc7..0c99454823b7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -315,6 +315,12 @@ static ssize_t queue_max_active_zones_show(struct request_queue *q, char *page)
return queue_var_show(queue_max_active_zones(q), page);
}

+static ssize_t queue_zone_desc_ext_bytes_show(struct request_queue *q,
+ char *page)
+{
+ return queue_var_show(queue_zone_desc_ext_bytes(q), page);
+}
+
static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
{
return queue_var_show((blk_queue_nomerges(q) << 1) |
@@ -687,6 +693,11 @@ static struct queue_sysfs_entry queue_max_active_zones_entry = {
.show = queue_max_active_zones_show,
};

+static struct queue_sysfs_entry queue_zone_desc_ext_bytes_entry = {
+ .attr = {.name = "zone_desc_ext_bytes", .mode = 0444 },
+ .show = queue_zone_desc_ext_bytes_show,
+};
+
static struct queue_sysfs_entry queue_nomerges_entry = {
.attr = {.name = "nomerges", .mode = 0644 },
.show = queue_nomerges_show,
@@ -787,6 +798,7 @@ static struct attribute *queue_attrs[] = {
&queue_nr_zones_entry.attr,
&queue_max_open_zones_entry.attr,
&queue_max_active_zones_entry.attr,
+ &queue_zone_desc_ext_bytes_entry.attr,
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
@@ -815,7 +827,8 @@ static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr,
return 0;

if ((attr == &queue_max_open_zones_entry.attr ||
- attr == &queue_max_active_zones_entry.attr) &&
+ attr == &queue_max_active_zones_entry.attr ||
+ attr == &queue_zone_desc_ext_bytes_entry.attr) &&
!blk_queue_is_zoned(q))
return 0;

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 502070763266..5792d953a8f3 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -84,6 +84,7 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
blk_queue_max_open_zones(q, le32_to_cpu(id->mor) + 1);
blk_queue_max_active_zones(q, le32_to_cpu(id->mar) + 1);
+ blk_queue_zone_desc_ext_bytes(q, id->lbafe[lbaf].zdes << 6);
free_data:
kfree(id);
return status;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index d8b2c49d645b..a4b6d6cf5457 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -722,6 +722,7 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
else
blk_queue_max_open_zones(q, sdkp->zones_max_open);
blk_queue_max_active_zones(q, 0);
+ blk_queue_zone_desc_ext_bytes(q, 0);
nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);

/* READ16/WRITE16 is mandatory for ZBC disks */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3776140f8f20..2ed55055f68d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -522,6 +522,7 @@ struct request_queue {
unsigned long *seq_zones_wlock;
unsigned int max_open_zones;
unsigned int max_active_zones;
+ unsigned int zone_desc_ext_bytes;
#endif /* CONFIG_BLK_DEV_ZONED */

/*
@@ -753,6 +754,18 @@ static inline unsigned int queue_max_active_zones(const struct request_queue *q)
{
return q->max_active_zones;
}
+
+static inline void blk_queue_zone_desc_ext_bytes(struct request_queue *q,
+ unsigned int zone_desc_ext_bytes)
+{
+ q->zone_desc_ext_bytes = zone_desc_ext_bytes;
+}
+
+static inline unsigned int queue_zone_desc_ext_bytes(
+ const struct request_queue *q)
+{
+ return q->zone_desc_ext_bytes;
+}
#else /* CONFIG_BLK_DEV_ZONED */
static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
{
@@ -784,6 +797,15 @@ static inline unsigned int queue_max_active_zones(const struct request_queue *q)
{
return 0;
}
+static inline void blk_queue_zone_desc_ext_bytes(struct request_queue *q,
+ unsigned int zone_desc_ext_bytes)
+{
+}
+static inline unsigned int queue_zone_desc_ext_bytes(
+ const struct request_queue *q)
+{
+ return 0;
+}
#endif /* CONFIG_BLK_DEV_ZONED */

static inline bool rq_is_sync(struct request *rq)
--
2.17.1

2020-06-28 23:06:32

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices

The NVMe Zoned Namespace Command Set adds support for associating
data to a zone through the Zone Descriptor Extension feature.

To allow user-space to associate data to a zone, add support through
the BLKSETDESCZONE ioctl. The ioctl requires that it is issued to
a zoned block device, and that it supports the Zone Descriptor
Extension feature. Support is detected through the
the zone_desc_ext_bytes sysfs queue entry for the specific block
device. A value larger than zero communicates that the device supports
the feature.

The ioctl associates data to a zone by issuing a Zone Management Send
command with the Zone Send Action set as the Set Zone Descriptor
Extension.

For the command to complete successfully, the specified zone must be
in the Empty state, and active resources must be available. On
success, the specified zone is transioned to Closed state by the
device. If less data is supplied by user-space then reported by the
the Zone Descriptor Extension size, the rest is zero-filled. If more
data or no data is supplied by user-space, the ioctl fails.

To issue the ioctl, a new blk_zone_set_desc data structure is defined.
It has following parameters:

* the sector of the specific zone.
* the length of the data to be associated to the zone.
* any flags be used by the ioctl. None is defined.
* data associated to the zone.

The data is laid out after the flags parameter, and it is the caller's
responsibility to allocate memory for the data that is specified in the
length parameter.

Signed-off-by: Matias Bjørling <[email protected]>
---
block/blk-zoned.c | 108 ++++++++++++++++++++++++++++++++++
block/ioctl.c | 2 +
drivers/nvme/host/core.c | 3 +
drivers/nvme/host/nvme.h | 9 +++
drivers/nvme/host/zns.c | 11 ++++
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 9 ++-
include/uapi/linux/blkzoned.h | 20 ++++++-
8 files changed, 162 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 81152a260354..4dc40ec006a2 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -259,6 +259,50 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
}
EXPORT_SYMBOL_GPL(blkdev_zone_mgmt);

+/**
+ * blkdev_zone_set_desc - Execute a zone management set zone descriptor
+ * extension operation on a zone
+ * @bdev: Target block device
+ * @sector: Start sector of the zone to operate on
+ * @data: Pointer to the data that is to be associated to the zone
+ * @gfp_mask: Memory allocation flags (for bio_alloc)
+ *
+ * Description:
+ * Associate zone descriptor extension data to a specified zone.
+ * The block device must support zone descriptor extensions.
+ * i.e., by exposing a positive zone descriptor extension size.
+ */
+int blkdev_zone_set_desc(struct block_device *bdev, sector_t sector,
+ struct page *data, gfp_t gfp_mask)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+ sector_t zone_sectors = blk_queue_zone_sectors(q);
+ struct bio_vec bio_vec;
+ struct bio bio;
+
+ if (!blk_queue_is_zoned(q))
+ return -EOPNOTSUPP;
+
+ if (bdev_read_only(bdev))
+ return -EPERM;
+
+ /* Check alignment (handle eventual smaller last zone) */
+ if (sector & (zone_sectors - 1))
+ return -EINVAL;
+
+ bio_init(&bio, &bio_vec, 1);
+ bio.bi_opf = REQ_OP_ZONE_SET_DESC | REQ_SYNC;
+ bio.bi_iter.bi_sector = sector;
+ bio_set_dev(&bio, bdev);
+ bio_add_page(&bio, data, queue_zone_desc_ext_bytes(q), 0);
+
+ /* This may take a while, so be nice to others */
+ cond_resched();
+
+ return submit_bio_wait(&bio);
+}
+EXPORT_SYMBOL_GPL(blkdev_zone_set_desc);
+
struct zone_report_args {
struct blk_zone __user *zones;
};
@@ -370,6 +414,70 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
GFP_KERNEL);
}

+/*
+ * BLKSETDESCZONE ioctl processing.
+ * Called from blkdev_ioctl.
+ */
+int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t mode,
+ unsigned int cmd, unsigned long arg)
+{
+ void __user *argp = (void __user *)arg;
+ struct request_queue *q;
+ struct blk_zone_set_desc zsd;
+ void *zsd_data;
+ int ret;
+
+ if (!argp)
+ return -EINVAL;
+
+ q = bdev_get_queue(bdev);
+ if (!q)
+ return -ENXIO;
+
+ if (!blk_queue_is_zoned(q))
+ return -ENOTTY;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EACCES;
+
+ if (!(mode & FMODE_WRITE))
+ return -EBADF;
+
+ if (!queue_zone_desc_ext_bytes(q))
+ return -EOPNOTSUPP;
+
+ if (copy_from_user(&zsd, argp, sizeof(struct blk_zone_set_desc)))
+ return -EFAULT;
+
+ /* no flags is currently supported */
+ if (zsd.flags)
+ return -ENOTTY;
+
+ if (!zsd.len || zsd.len > queue_zone_desc_ext_bytes(q))
+ return -ENOTTY;
+
+ /* allocate the size of the zone descriptor extension and fill
+ * with the data in the user data buffer. If the data size is less
+ * than the zone descriptor extension size, then the rest of the
+ * zone description extension data buffer is zero-filled.
+ */
+ zsd_data = (void *) get_zeroed_page(GFP_KERNEL);
+ if (!zsd_data)
+ return -ENOMEM;
+
+ if (copy_from_user(zsd_data, argp + sizeof(struct blk_zone_set_desc),
+ zsd.len)) {
+ ret = -EFAULT;
+ goto free;
+ }
+
+ ret = blkdev_zone_set_desc(bdev, zsd.sector, virt_to_page(zsd_data),
+ GFP_KERNEL);
+free:
+ free_page((unsigned long) zsd_data);
+ return ret;
+}
+
static inline unsigned long *blk_alloc_zone_bitmap(int node,
unsigned int nr_zones)
{
diff --git a/block/ioctl.c b/block/ioctl.c
index bdb3bbb253d9..b9744705835b 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -515,6 +515,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
case BLKCLOSEZONE:
case BLKFINISHZONE:
return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
+ case BLKSETDESCZONE:
+ return blkdev_zone_set_desc_ioctl(bdev, mode, cmd, arg);
case BLKGETZONESZ:
return put_uint(argp, bdev_zone_sectors(bdev));
case BLKGETNRZONES:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e961910da4ac..b8f25b0d00ad 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
case REQ_OP_ZONE_FINISH:
ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
break;
+ case REQ_OP_ZONE_SET_DESC:
+ ret = nvme_setup_zone_set_desc(ns, req, cmd);
+ break;
case REQ_OP_WRITE_ZEROES:
ret = nvme_setup_write_zeroes(ns, req, cmd);
break;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 662f95fbd909..5bd5a437b038 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -708,6 +708,9 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd,
enum nvme_zone_mgmt_action action);
+
+blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct request *req,
+ struct nvme_command *cmnd);
#else
#define nvme_report_zones NULL

@@ -718,6 +721,12 @@ static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
return BLK_STS_NOTSUPP;
}

+static inline blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns,
+ struct request *req, struct nvme_command *cmnd)
+{
+ return BLK_STS_NOTSUPP;
+}
+
static inline int nvme_update_zone_info(struct gendisk *disk,
struct nvme_ns *ns,
unsigned lbaf)
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 5792d953a8f3..bfa64cc685d3 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -239,3 +239,14 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,

return BLK_STS_OK;
}
+
+blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct request *req,
+ struct nvme_command *c)
+{
+ c->zms.opcode = nvme_cmd_zone_mgmt_send;
+ c->zms.nsid = cpu_to_le32(ns->head->ns_id);
+ c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
+ c->zms.action = NVME_ZONE_SET_DESC_EXT;
+
+ return BLK_STS_OK;
+}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ccb895f911b1..53b7b05b0004 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -316,6 +316,8 @@ enum req_opf {
REQ_OP_ZONE_FINISH = 12,
/* write data at the current zone write pointer */
REQ_OP_ZONE_APPEND = 13,
+ /* associate zone desc extension data to a zone */
+ REQ_OP_ZONE_SET_DESC = 14,

/* SCSI passthrough using struct scsi_request */
REQ_OP_SCSI_IN = 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2ed55055f68d..c5f092dd5aa3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -370,7 +370,8 @@ extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg);
extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg);
-
+extern int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t mode,
+ unsigned int cmd, unsigned long arg);
#else /* CONFIG_BLK_DEV_ZONED */

static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
@@ -392,6 +393,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
return -ENOTTY;
}

+static inline int blkdev_zone_set_desc_ioctl(struct block_device *bdev,
+ fmode_t mode, unsigned int cmd,
+ unsigned long arg)
+{
+ return -ENOTTY;
+}
#endif /* CONFIG_BLK_DEV_ZONED */

struct request_queue {
diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
index 42c3366cc25f..68abda9abf33 100644
--- a/include/uapi/linux/blkzoned.h
+++ b/include/uapi/linux/blkzoned.h
@@ -142,6 +142,20 @@ struct blk_zone_range {
__u64 nr_sectors;
};

+/**
+ * struct blk_zone_set_desc - BLKSETDESCZONE ioctl requests
+ * @sector: Starting sector of the zone to operate on.
+ * @flags: Feature flags.
+ * @len: size, in bytes, of the data to be associated to the zone.
+ * @data: data to be associated.
+ */
+struct blk_zone_set_desc {
+ __u64 sector;
+ __u32 flags;
+ __u32 len;
+ __u8 data[0];
+};
+
/**
* Zoned block device ioctl's:
*
@@ -158,6 +172,10 @@ struct blk_zone_range {
* The 512 B sector range must be zone aligned.
* @BLKFINISHZONE: Mark the zones as full in the specified sector range.
* The 512 B sector range must be zone aligned.
+ * @BLKSETDESCZONE: Set zone description extension data for the zone
+ * in the specified sector. On success, the zone
+ * will transition to the closed zone state.
+ * The 512B sector must be zone aligned.
*/
#define BLKREPORTZONE _IOWR(0x12, 130, struct blk_zone_report)
#define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)
@@ -166,5 +184,5 @@ struct blk_zone_range {
#define BLKOPENZONE _IOW(0x12, 134, struct blk_zone_range)
#define BLKCLOSEZONE _IOW(0x12, 135, struct blk_zone_range)
#define BLKFINISHZONE _IOW(0x12, 136, struct blk_zone_range)
-
+#define BLKSETDESCZONE _IOW(0x12, 137, struct blk_zone_set_desc)
#endif /* _UAPI_BLKZONED_H */
--
2.17.1

2020-06-29 00:54:12

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 1/2] block: add zone_desc_ext_bytes to sysfs

On 2020/06/29 8:01, Matias Bjorling wrote:
> The NVMe Zoned Namespace Command Set adds support for associating
> data to a zone through the Zone Descriptor Extension feature.
>
> The Zone Descriptor Extension size is fixed to a multiple of 64
> bytes. A value of zero communicates the feature is not available.
> A value larger than zero communites the feature is available, and
> the specified Zone Descriptor Extension size in bytes.
>
> The Zone Descriptor Extension feature is only available in the
> NVMe Zoned Namespaces Command Set. Devices that supports ZAC/ZBC
> therefore reports this value as zero, where as the NVMe device
> driver reports the Zone Descriptor Extension size from the
> specific device.
>
> Signed-off-by: Matias Bj?rling <[email protected]>
> ---
> Documentation/block/queue-sysfs.rst | 6 ++++++
> block/blk-sysfs.c | 15 ++++++++++++++-
> drivers/nvme/host/zns.c | 1 +
> drivers/scsi/sd_zbc.c | 1 +
> include/linux/blkdev.h | 22 ++++++++++++++++++++++
> 5 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
> index f261a5c84170..c4fa195c87b4 100644
> --- a/Documentation/block/queue-sysfs.rst
> +++ b/Documentation/block/queue-sysfs.rst
> @@ -265,4 +265,10 @@ devices are described in the ZBC (Zoned Block Commands) and ZAC
> do not support zone commands, they will be treated as regular block devices
> and zoned will report "none".
>
> +zone_desc_ext_bytes (RO)
> +-------------------------
> +This indicates the zone description extension (ZDE) size, in bytes, of a zoned
> +block device. A value of '0' means that zone description extension is not
> +supported.
> +
> Jens Axboe <[email protected]>, February 2009
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 624bb4d85fc7..0c99454823b7 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -315,6 +315,12 @@ static ssize_t queue_max_active_zones_show(struct request_queue *q, char *page)
> return queue_var_show(queue_max_active_zones(q), page);
> }
>
> +static ssize_t queue_zone_desc_ext_bytes_show(struct request_queue *q,
> + char *page)
> +{
> + return queue_var_show(queue_zone_desc_ext_bytes(q), page);
> +}
> +
> static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
> {
> return queue_var_show((blk_queue_nomerges(q) << 1) |
> @@ -687,6 +693,11 @@ static struct queue_sysfs_entry queue_max_active_zones_entry = {
> .show = queue_max_active_zones_show,
> };
>
> +static struct queue_sysfs_entry queue_zone_desc_ext_bytes_entry = {
> + .attr = {.name = "zone_desc_ext_bytes", .mode = 0444 },
> + .show = queue_zone_desc_ext_bytes_show,
> +};
> +
> static struct queue_sysfs_entry queue_nomerges_entry = {
> .attr = {.name = "nomerges", .mode = 0644 },
> .show = queue_nomerges_show,
> @@ -787,6 +798,7 @@ static struct attribute *queue_attrs[] = {
> &queue_nr_zones_entry.attr,
> &queue_max_open_zones_entry.attr,
> &queue_max_active_zones_entry.attr,

Which tree is this patch based on ? Not I have seen any patch introducing max
active zones.

> + &queue_zone_desc_ext_bytes_entry.attr,
> &queue_nomerges_entry.attr,
> &queue_rq_affinity_entry.attr,
> &queue_iostats_entry.attr,
> @@ -815,7 +827,8 @@ static umode_t queue_attr_visible(struct kobject *kobj, struct attribute *attr,
> return 0;
>
> if ((attr == &queue_max_open_zones_entry.attr ||
> - attr == &queue_max_active_zones_entry.attr) &&
> + attr == &queue_max_active_zones_entry.attr ||
> + attr == &queue_zone_desc_ext_bytes_entry.attr) &&
> !blk_queue_is_zoned(q))
> return 0;
>
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 502070763266..5792d953a8f3 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -84,6 +84,7 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
> blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> blk_queue_max_open_zones(q, le32_to_cpu(id->mor) + 1);
> blk_queue_max_active_zones(q, le32_to_cpu(id->mar) + 1);
> + blk_queue_zone_desc_ext_bytes(q, id->lbafe[lbaf].zdes << 6);
> free_data:
> kfree(id);
> return status;
> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
> index d8b2c49d645b..a4b6d6cf5457 100644
> --- a/drivers/scsi/sd_zbc.c
> +++ b/drivers/scsi/sd_zbc.c
> @@ -722,6 +722,7 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
> else
> blk_queue_max_open_zones(q, sdkp->zones_max_open);
> blk_queue_max_active_zones(q, 0);
> + blk_queue_zone_desc_ext_bytes(q, 0);
> nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
>
> /* READ16/WRITE16 is mandatory for ZBC disks */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 3776140f8f20..2ed55055f68d 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -522,6 +522,7 @@ struct request_queue {
> unsigned long *seq_zones_wlock;
> unsigned int max_open_zones;
> unsigned int max_active_zones;
> + unsigned int zone_desc_ext_bytes;

Why is this not a queue limit ? This may need to be to be gracefully handled by
device mapper for a target device using multiple zoned drives.

> #endif /* CONFIG_BLK_DEV_ZONED */
>
> /*
> @@ -753,6 +754,18 @@ static inline unsigned int queue_max_active_zones(const struct request_queue *q)
> {
> return q->max_active_zones;
> }
> +
> +static inline void blk_queue_zone_desc_ext_bytes(struct request_queue *q,
> + unsigned int zone_desc_ext_bytes)
> +{
> + q->zone_desc_ext_bytes = zone_desc_ext_bytes;
> +}
> +
> +static inline unsigned int queue_zone_desc_ext_bytes(
> + const struct request_queue *q)
> +{
> + return q->zone_desc_ext_bytes;
> +}
> #else /* CONFIG_BLK_DEV_ZONED */
> static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
> {
> @@ -784,6 +797,15 @@ static inline unsigned int queue_max_active_zones(const struct request_queue *q)
> {
> return 0;
> }
> +static inline void blk_queue_zone_desc_ext_bytes(struct request_queue *q,
> + unsigned int zone_desc_ext_bytes)
> +{
> +}
> +static inline unsigned int queue_zone_desc_ext_bytes(
> + const struct request_queue *q)
> +{
> + return 0;
> +}
> #endif /* CONFIG_BLK_DEV_ZONED */
>
> static inline bool rq_is_sync(struct request *rq)
>


--
Damien Le Moal
Western Digital Research

2020-06-29 01:03:20

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices

On 2020/06/29 8:01, Matias Bjorling wrote:
> The NVMe Zoned Namespace Command Set adds support for associating
> data to a zone through the Zone Descriptor Extension feature.
>
> To allow user-space to associate data to a zone, add support through
> the BLKSETDESCZONE ioctl. The ioctl requires that it is issued to
> a zoned block device, and that it supports the Zone Descriptor
> Extension feature. Support is detected through the
> the zone_desc_ext_bytes sysfs queue entry for the specific block
> device. A value larger than zero communicates that the device supports
> the feature.
>
> The ioctl associates data to a zone by issuing a Zone Management Send
> command with the Zone Send Action set as the Set Zone Descriptor
> Extension.
>
> For the command to complete successfully, the specified zone must be
> in the Empty state, and active resources must be available. On
> success, the specified zone is transioned to Closed state by the
> device. If less data is supplied by user-space then reported by the
> the Zone Descriptor Extension size, the rest is zero-filled. If more
> data or no data is supplied by user-space, the ioctl fails.
>
> To issue the ioctl, a new blk_zone_set_desc data structure is defined.
> It has following parameters:
>
> * the sector of the specific zone.
> * the length of the data to be associated to the zone.
> * any flags be used by the ioctl. None is defined.
> * data associated to the zone.
>
> The data is laid out after the flags parameter, and it is the caller's
> responsibility to allocate memory for the data that is specified in the
> length parameter.
>
> Signed-off-by: Matias Bj?rling <[email protected]>
> ---
> block/blk-zoned.c | 108 ++++++++++++++++++++++++++++++++++
> block/ioctl.c | 2 +
> drivers/nvme/host/core.c | 3 +
> drivers/nvme/host/nvme.h | 9 +++
> drivers/nvme/host/zns.c | 11 ++++
> include/linux/blk_types.h | 2 +
> include/linux/blkdev.h | 9 ++-
> include/uapi/linux/blkzoned.h | 20 ++++++-
> 8 files changed, 162 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 81152a260354..4dc40ec006a2 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -259,6 +259,50 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
> }
> EXPORT_SYMBOL_GPL(blkdev_zone_mgmt);
>
> +/**
> + * blkdev_zone_set_desc - Execute a zone management set zone descriptor
> + * extension operation on a zone
> + * @bdev: Target block device
> + * @sector: Start sector of the zone to operate on
> + * @data: Pointer to the data that is to be associated to the zone
> + * @gfp_mask: Memory allocation flags (for bio_alloc)
> + *
> + * Description:
> + * Associate zone descriptor extension data to a specified zone.
> + * The block device must support zone descriptor extensions.
> + * i.e., by exposing a positive zone descriptor extension size.
> + */
> +int blkdev_zone_set_desc(struct block_device *bdev, sector_t sector,
> + struct page *data, gfp_t gfp_mask)

struct page for the data ? Why not just a "void *" to allow for kmalloc/vmalloc
data ? And no length for the data ? This is a bit odd.

> +{
> + struct request_queue *q = bdev_get_queue(bdev);
> + sector_t zone_sectors = blk_queue_zone_sectors(q);
> + struct bio_vec bio_vec;
> + struct bio bio;
> +
> + if (!blk_queue_is_zoned(q))
> + return -EOPNOTSUPP;
> +
> + if (bdev_read_only(bdev))
> + return -EPERM;

You are not checking the zone_desc_ext_bytes limit here. You should.
> +
> + /* Check alignment (handle eventual smaller last zone) */
> + if (sector & (zone_sectors - 1))
> + return -EINVAL;

The comment is incorrect. There is nothing special for handling the last zone in
this test.

> +
> + bio_init(&bio, &bio_vec, 1);
> + bio.bi_opf = REQ_OP_ZONE_SET_DESC | REQ_SYNC;
> + bio.bi_iter.bi_sector = sector;
> + bio_set_dev(&bio, bdev);
> + bio_add_page(&bio, data, queue_zone_desc_ext_bytes(q), 0);
> +
> + /* This may take a while, so be nice to others */
> + cond_resched();

This is not a loop, so you do not need this.

> +
> + return submit_bio_wait(&bio);
> +}
> +EXPORT_SYMBOL_GPL(blkdev_zone_set_desc);
> +
> struct zone_report_args {
> struct blk_zone __user *zones;
> };
> @@ -370,6 +414,70 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
> GFP_KERNEL);
> }
>
> +/*
> + * BLKSETDESCZONE ioctl processing.
> + * Called from blkdev_ioctl.
> + */
> +int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t mode,
> + unsigned int cmd, unsigned long arg)
> +{
> + void __user *argp = (void __user *)arg;
> + struct request_queue *q;
> + struct blk_zone_set_desc zsd;
> + void *zsd_data;
> + int ret;
> +
> + if (!argp)
> + return -EINVAL;
> +
> + q = bdev_get_queue(bdev);
> + if (!q)
> + return -ENXIO;
> +
> + if (!blk_queue_is_zoned(q))
> + return -ENOTTY;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EACCES;
> +
> + if (!(mode & FMODE_WRITE))
> + return -EBADF;
> +
> + if (!queue_zone_desc_ext_bytes(q))
> + return -EOPNOTSUPP;
> +
> + if (copy_from_user(&zsd, argp, sizeof(struct blk_zone_set_desc)))
> + return -EFAULT;
> +
> + /* no flags is currently supported */
> + if (zsd.flags)
> + return -ENOTTY;
> +
> + if (!zsd.len || zsd.len > queue_zone_desc_ext_bytes(q))
> + return -ENOTTY;

This should go into blkdev_zone_set_desc() as well so that in-kernel users are
checked. So there may be no need to check this here.

> +
> + /* allocate the size of the zone descriptor extension and fill
> + * with the data in the user data buffer. If the data size is less
> + * than the zone descriptor extension size, then the rest of the
> + * zone description extension data buffer is zero-filled.
> + */
> + zsd_data = (void *) get_zeroed_page(GFP_KERNEL);
> + if (!zsd_data)
> + return -ENOMEM;
> +
> + if (copy_from_user(zsd_data, argp + sizeof(struct blk_zone_set_desc),
> + zsd.len)) {
> + ret = -EFAULT;
> + goto free;
> + }
> +
> + ret = blkdev_zone_set_desc(bdev, zsd.sector, virt_to_page(zsd_data),
> + GFP_KERNEL);
> +free:
> + free_page((unsigned long) zsd_data);
> + return ret;
> +}
> +
> static inline unsigned long *blk_alloc_zone_bitmap(int node,
> unsigned int nr_zones)
> {
> diff --git a/block/ioctl.c b/block/ioctl.c
> index bdb3bbb253d9..b9744705835b 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -515,6 +515,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
> case BLKCLOSEZONE:
> case BLKFINISHZONE:
> return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
> + case BLKSETDESCZONE:
> + return blkdev_zone_set_desc_ioctl(bdev, mode, cmd, arg);
> case BLKGETZONESZ:
> return put_uint(argp, bdev_zone_sectors(bdev));
> case BLKGETNRZONES:
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index e961910da4ac..b8f25b0d00ad 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
> case REQ_OP_ZONE_FINISH:
> ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
> break;
> + case REQ_OP_ZONE_SET_DESC:
> + ret = nvme_setup_zone_set_desc(ns, req, cmd);
> + break;
> case REQ_OP_WRITE_ZEROES:
> ret = nvme_setup_write_zeroes(ns, req, cmd);
> break;
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 662f95fbd909..5bd5a437b038 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -708,6 +708,9 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
> blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
> struct nvme_command *cmnd,
> enum nvme_zone_mgmt_action action);
> +
> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct request *req,
> + struct nvme_command *cmnd);
> #else
> #define nvme_report_zones NULL
>
> @@ -718,6 +721,12 @@ static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
> return BLK_STS_NOTSUPP;
> }
>
> +static inline blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns,
> + struct request *req, struct nvme_command *cmnd)
> +{
> + return BLK_STS_NOTSUPP;
> +}
> +
> static inline int nvme_update_zone_info(struct gendisk *disk,
> struct nvme_ns *ns,
> unsigned lbaf)
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 5792d953a8f3..bfa64cc685d3 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -239,3 +239,14 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>
> return BLK_STS_OK;
> }
> +
> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct request *req,
> + struct nvme_command *c)
> +{
> + c->zms.opcode = nvme_cmd_zone_mgmt_send;
> + c->zms.nsid = cpu_to_le32(ns->head->ns_id);
> + c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
> + c->zms.action = NVME_ZONE_SET_DESC_EXT;
> +
> + return BLK_STS_OK;
> +}
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index ccb895f911b1..53b7b05b0004 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -316,6 +316,8 @@ enum req_opf {
> REQ_OP_ZONE_FINISH = 12,
> /* write data at the current zone write pointer */
> REQ_OP_ZONE_APPEND = 13,
> + /* associate zone desc extension data to a zone */
> + REQ_OP_ZONE_SET_DESC = 14,
>
> /* SCSI passthrough using struct scsi_request */
> REQ_OP_SCSI_IN = 32,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2ed55055f68d..c5f092dd5aa3 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -370,7 +370,8 @@ extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
> unsigned int cmd, unsigned long arg);
> extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
> unsigned int cmd, unsigned long arg);
> -
> +extern int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t mode,
> + unsigned int cmd, unsigned long arg);
> #else /* CONFIG_BLK_DEV_ZONED */
>
> static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
> @@ -392,6 +393,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
> return -ENOTTY;
> }
>
> +static inline int blkdev_zone_set_desc_ioctl(struct block_device *bdev,
> + fmode_t mode, unsigned int cmd,
> + unsigned long arg)
> +{
> + return -ENOTTY;
> +}
> #endif /* CONFIG_BLK_DEV_ZONED */
>
> struct request_queue {
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index 42c3366cc25f..68abda9abf33 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -142,6 +142,20 @@ struct blk_zone_range {
> __u64 nr_sectors;
> };
>
> +/**
> + * struct blk_zone_set_desc - BLKSETDESCZONE ioctl requests
> + * @sector: Starting sector of the zone to operate on.
> + * @flags: Feature flags.
> + * @len: size, in bytes, of the data to be associated to the zone.
> + * @data: data to be associated.
> + */
> +struct blk_zone_set_desc {
> + __u64 sector;
> + __u32 flags;
> + __u32 len;
> + __u8 data[0];
> +};
> +
> /**
> * Zoned block device ioctl's:
> *
> @@ -158,6 +172,10 @@ struct blk_zone_range {
> * The 512 B sector range must be zone aligned.
> * @BLKFINISHZONE: Mark the zones as full in the specified sector range.
> * The 512 B sector range must be zone aligned.
> + * @BLKSETDESCZONE: Set zone description extension data for the zone
> + * in the specified sector. On success, the zone
> + * will transition to the closed zone state.
> + * The 512B sector must be zone aligned.
> */
> #define BLKREPORTZONE _IOWR(0x12, 130, struct blk_zone_report)
> #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)
> @@ -166,5 +184,5 @@ struct blk_zone_range {
> #define BLKOPENZONE _IOW(0x12, 134, struct blk_zone_range)
> #define BLKCLOSEZONE _IOW(0x12, 135, struct blk_zone_range)
> #define BLKFINISHZONE _IOW(0x12, 136, struct blk_zone_range)
> -
> +#define BLKSETDESCZONE _IOW(0x12, 137, struct blk_zone_set_desc)
> #endif /* _UAPI_BLKZONED_H */
>

How to you retreive an extended descriptor that was set ? I do not see any code
doing that. Report zones is not changed, but I would leave that one as is and
implement a BLKGETDESCZONE ioctl & in-kernel API.

--
Damien Le Moal
Western Digital Research

2020-06-29 01:37:04

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices

On 2020-06-28 16:01, Matias Bjørling wrote:
> + /* This may take a while, so be nice to others */
> + cond_resched();
> +
> + return submit_bio_wait(&bio);

A cond_resched() call before a submit_bio_wait() call? I think it's the
first time that I see this. Is that call really necessary? Isn't the
wait_for_completion() call inside submit_bio_wait() sufficient?

> + /* no flags is currently supported */
^^
are?

> + /* allocate the size of the zone descriptor extension and fill
> + * with the data in the user data buffer. If the data size is less
> + * than the zone descriptor extension size, then the rest of the
> + * zone description extension data buffer is zero-filled.
> + */
> + zsd_data = (void *) get_zeroed_page(GFP_KERNEL);
> + if (!zsd_data)
> + return -ENOMEM;
> +
> + if (copy_from_user(zsd_data, argp + sizeof(struct blk_zone_set_desc),
> + zsd.len)) {
> + ret = -EFAULT;
> + goto free;
> + }

Has it been considered to use kmalloc() instead of get_zeroed_page()?

> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index ccb895f911b1..53b7b05b0004 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -316,6 +316,8 @@ enum req_opf {
> REQ_OP_ZONE_FINISH = 12,
> /* write data at the current zone write pointer */
> REQ_OP_ZONE_APPEND = 13,
> + /* associate zone desc extension data to a zone */
> + REQ_OP_ZONE_SET_DESC = 14,
>
> /* SCSI passthrough using struct scsi_request */
> REQ_OP_SCSI_IN = 32,

Does REQ_OP_ZONE_SET_DESC count as a read or as a write operation? See also:

static inline bool op_is_write(unsigned int op)
{
return (op & 1);
}

> +/**
> + * struct blk_zone_set_desc - BLKSETDESCZONE ioctl requests
> + * @sector: Starting sector of the zone to operate on.
> + * @flags: Feature flags.
> + * @len: size, in bytes, of the data to be associated to the zone.
> + * @data: data to be associated.
> + */
> +struct blk_zone_set_desc {
> + __u64 sector;
> + __u32 flags;
> + __u32 len;
> + __u8 data[0];
> +};

Isn't the recommended style to use a flexible array ([] instead of [0])?
See also https://lore.kernel.org/lkml/20200608213711.GA22271@embeddedor/.

Thanks,

Bart.

2020-06-29 18:55:35

by Niklas Cassel

[permalink] [raw]
Subject: Re: [PATCH 1/2] block: add zone_desc_ext_bytes to sysfs

On Mon, Jun 29, 2020 at 12:52:46AM +0000, Damien Le Moal wrote:
> On 2020/06/29 8:01, Matias Bjorling wrote:
> > The NVMe Zoned Namespace Command Set adds support for associating
> > data to a zone through the Zone Descriptor Extension feature.
> >
> > The Zone Descriptor Extension size is fixed to a multiple of 64
> > bytes. A value of zero communicates the feature is not available.
> > A value larger than zero communites the feature is available, and
> > the specified Zone Descriptor Extension size in bytes.
> >
> > The Zone Descriptor Extension feature is only available in the
> > NVMe Zoned Namespaces Command Set. Devices that supports ZAC/ZBC
> > therefore reports this value as zero, where as the NVMe device
> > driver reports the Zone Descriptor Extension size from the
> > specific device.
> >
> > Signed-off-by: Matias Bj?rling <[email protected]>
> > ---
> > Documentation/block/queue-sysfs.rst | 6 ++++++
> > block/blk-sysfs.c | 15 ++++++++++++++-
> > drivers/nvme/host/zns.c | 1 +
> > drivers/scsi/sd_zbc.c | 1 +
> > include/linux/blkdev.h | 22 ++++++++++++++++++++++
> > 5 files changed, 44 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
> > index f261a5c84170..c4fa195c87b4 100644

(snip)

> > static struct queue_sysfs_entry queue_nomerges_entry = {
> > .attr = {.name = "nomerges", .mode = 0644 },
> > .show = queue_nomerges_show,
> > @@ -787,6 +798,7 @@ static struct attribute *queue_attrs[] = {
> > &queue_nr_zones_entry.attr,
> > &queue_max_open_zones_entry.attr,
> > &queue_max_active_zones_entry.attr,
>
> Which tree is this patch based on ? Not I have seen any patch introducing max
> active zones.

The cover letter forgot to mention this patch series' dependencies.
I assume that it is based on the following patch series:
https://lore.kernel.org/linux-block/[email protected]/
https://lore.kernel.org/linux-block/[email protected]/


Kind regards,
Niklas

2020-06-29 19:39:55

by Javier González

[permalink] [raw]
Subject: Re: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices

On 29.06.2020 01:00, Damien Le Moal wrote:
>On 2020/06/29 8:01, Matias Bjorling wrote:
>> The NVMe Zoned Namespace Command Set adds support for associating
>> data to a zone through the Zone Descriptor Extension feature.
>>
>> To allow user-space to associate data to a zone, add support through
>> the BLKSETDESCZONE ioctl. The ioctl requires that it is issued to
>> a zoned block device, and that it supports the Zone Descriptor
>> Extension feature. Support is detected through the
>> the zone_desc_ext_bytes sysfs queue entry for the specific block
>> device. A value larger than zero communicates that the device supports
>> the feature.

Have you considered the possibility of adding this an action to a IOCTL
that looks like the zone management one we discussed last week? We would
start saving IOCTLs already if we count the offline transition and this
one.

>>
>> The ioctl associates data to a zone by issuing a Zone Management Send
>> command with the Zone Send Action set as the Set Zone Descriptor
>> Extension.
>>
>> For the command to complete successfully, the specified zone must be
>> in the Empty state, and active resources must be available. On
>> success, the specified zone is transioned to Closed state by the
>> device. If less data is supplied by user-space then reported by the
>> the Zone Descriptor Extension size, the rest is zero-filled. If more
>> data or no data is supplied by user-space, the ioctl fails.
>>
>> To issue the ioctl, a new blk_zone_set_desc data structure is defined.
>> It has following parameters:
>>
>> * the sector of the specific zone.
>> * the length of the data to be associated to the zone.
>> * any flags be used by the ioctl. None is defined.
>> * data associated to the zone.
>>
>> The data is laid out after the flags parameter, and it is the caller's
>> responsibility to allocate memory for the data that is specified in the
>> length parameter.
>>
>> Signed-off-by: Matias Bjørling <[email protected]>
>> ---
>> block/blk-zoned.c | 108 ++++++++++++++++++++++++++++++++++
>> block/ioctl.c | 2 +
>> drivers/nvme/host/core.c | 3 +
>> drivers/nvme/host/nvme.h | 9 +++
>> drivers/nvme/host/zns.c | 11 ++++
>> include/linux/blk_types.h | 2 +
>> include/linux/blkdev.h | 9 ++-
>> include/uapi/linux/blkzoned.h | 20 ++++++-
>> 8 files changed, 162 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 81152a260354..4dc40ec006a2 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -259,6 +259,50 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
>> }
>> EXPORT_SYMBOL_GPL(blkdev_zone_mgmt);
>>
>> +/**
>> + * blkdev_zone_set_desc - Execute a zone management set zone descriptor
>> + * extension operation on a zone
>> + * @bdev: Target block device
>> + * @sector: Start sector of the zone to operate on
>> + * @data: Pointer to the data that is to be associated to the zone
>> + * @gfp_mask: Memory allocation flags (for bio_alloc)
>> + *
>> + * Description:
>> + * Associate zone descriptor extension data to a specified zone.
>> + * The block device must support zone descriptor extensions.
>> + * i.e., by exposing a positive zone descriptor extension size.
>> + */
>> +int blkdev_zone_set_desc(struct block_device *bdev, sector_t sector,
>> + struct page *data, gfp_t gfp_mask)
>
>struct page for the data ? Why not just a "void *" to allow for kmalloc/vmalloc
>data ? And no length for the data ? This is a bit odd.
>
>> +{
>> + struct request_queue *q = bdev_get_queue(bdev);
>> + sector_t zone_sectors = blk_queue_zone_sectors(q);
>> + struct bio_vec bio_vec;
>> + struct bio bio;
>> +
>> + if (!blk_queue_is_zoned(q))
>> + return -EOPNOTSUPP;
>> +
>> + if (bdev_read_only(bdev))
>> + return -EPERM;
>
>You are not checking the zone_desc_ext_bytes limit here. You should.
>> +
>> + /* Check alignment (handle eventual smaller last zone) */
>> + if (sector & (zone_sectors - 1))
>> + return -EINVAL;
>
>The comment is incorrect. There is nothing special for handling the last zone in
>this test.
>
>> +
>> + bio_init(&bio, &bio_vec, 1);
>> + bio.bi_opf = REQ_OP_ZONE_SET_DESC | REQ_SYNC;
>> + bio.bi_iter.bi_sector = sector;
>> + bio_set_dev(&bio, bdev);
>> + bio_add_page(&bio, data, queue_zone_desc_ext_bytes(q), 0);
>> +
>> + /* This may take a while, so be nice to others */
>> + cond_resched();
>
>This is not a loop, so you do not need this.
>
>> +
>> + return submit_bio_wait(&bio);
>> +}
>> +EXPORT_SYMBOL_GPL(blkdev_zone_set_desc);
>> +
>> struct zone_report_args {
>> struct blk_zone __user *zones;
>> };
>> @@ -370,6 +414,70 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>> GFP_KERNEL);
>> }
>>
>> +/*
>> + * BLKSETDESCZONE ioctl processing.
>> + * Called from blkdev_ioctl.
>> + */
>> +int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t mode,
>> + unsigned int cmd, unsigned long arg)
>> +{
>> + void __user *argp = (void __user *)arg;
>> + struct request_queue *q;
>> + struct blk_zone_set_desc zsd;
>> + void *zsd_data;
>> + int ret;
>> +
>> + if (!argp)
>> + return -EINVAL;
>> +
>> + q = bdev_get_queue(bdev);
>> + if (!q)
>> + return -ENXIO;
>> +
>> + if (!blk_queue_is_zoned(q))
>> + return -ENOTTY;
>> +
>> + if (!capable(CAP_SYS_ADMIN))
>> + return -EACCES;
>> +
>> + if (!(mode & FMODE_WRITE))
>> + return -EBADF;
>> +
>> + if (!queue_zone_desc_ext_bytes(q))
>> + return -EOPNOTSUPP;
>> +
>> + if (copy_from_user(&zsd, argp, sizeof(struct blk_zone_set_desc)))
>> + return -EFAULT;
>> +
>> + /* no flags is currently supported */
>> + if (zsd.flags)
>> + return -ENOTTY;
>> +
>> + if (!zsd.len || zsd.len > queue_zone_desc_ext_bytes(q))
>> + return -ENOTTY;
>
>This should go into blkdev_zone_set_desc() as well so that in-kernel users are
>checked. So there may be no need to check this here.
>
>> +
>> + /* allocate the size of the zone descriptor extension and fill
>> + * with the data in the user data buffer. If the data size is less
>> + * than the zone descriptor extension size, then the rest of the
>> + * zone description extension data buffer is zero-filled.
>> + */
>> + zsd_data = (void *) get_zeroed_page(GFP_KERNEL);
>> + if (!zsd_data)
>> + return -ENOMEM;
>> +
>> + if (copy_from_user(zsd_data, argp + sizeof(struct blk_zone_set_desc),
>> + zsd.len)) {
>> + ret = -EFAULT;
>> + goto free;
>> + }
>> +
>> + ret = blkdev_zone_set_desc(bdev, zsd.sector, virt_to_page(zsd_data),
>> + GFP_KERNEL);
>> +free:
>> + free_page((unsigned long) zsd_data);
>> + return ret;
>> +}
>> +
>> static inline unsigned long *blk_alloc_zone_bitmap(int node,
>> unsigned int nr_zones)
>> {
>> diff --git a/block/ioctl.c b/block/ioctl.c
>> index bdb3bbb253d9..b9744705835b 100644
>> --- a/block/ioctl.c
>> +++ b/block/ioctl.c
>> @@ -515,6 +515,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
>> case BLKCLOSEZONE:
>> case BLKFINISHZONE:
>> return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>> + case BLKSETDESCZONE:
>> + return blkdev_zone_set_desc_ioctl(bdev, mode, cmd, arg);
>> case BLKGETZONESZ:
>> return put_uint(argp, bdev_zone_sectors(bdev));
>> case BLKGETNRZONES:
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index e961910da4ac..b8f25b0d00ad 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>> case REQ_OP_ZONE_FINISH:
>> ret = nvme_setup_zone_mgmt_send(ns, req, cmd, NVME_ZONE_FINISH);
>> break;
>> + case REQ_OP_ZONE_SET_DESC:
>> + ret = nvme_setup_zone_set_desc(ns, req, cmd);
>> + break;
>> case REQ_OP_WRITE_ZEROES:
>> ret = nvme_setup_write_zeroes(ns, req, cmd);
>> break;
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index 662f95fbd909..5bd5a437b038 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -708,6 +708,9 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
>> blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>> struct nvme_command *cmnd,
>> enum nvme_zone_mgmt_action action);
>> +
>> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct request *req,
>> + struct nvme_command *cmnd);
>> #else
>> #define nvme_report_zones NULL
>>
>> @@ -718,6 +721,12 @@ static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>> return BLK_STS_NOTSUPP;
>> }
>>
>> +static inline blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns,
>> + struct request *req, struct nvme_command *cmnd)
>> +{
>> + return BLK_STS_NOTSUPP;
>> +}
>> +
>> static inline int nvme_update_zone_info(struct gendisk *disk,
>> struct nvme_ns *ns,
>> unsigned lbaf)
>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>> index 5792d953a8f3..bfa64cc685d3 100644
>> --- a/drivers/nvme/host/zns.c
>> +++ b/drivers/nvme/host/zns.c
>> @@ -239,3 +239,14 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
>>
>> return BLK_STS_OK;
>> }
>> +
>> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct request *req,
>> + struct nvme_command *c)
>> +{
>> + c->zms.opcode = nvme_cmd_zone_mgmt_send;
>> + c->zms.nsid = cpu_to_le32(ns->head->ns_id);
>> + c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
>> + c->zms.action = NVME_ZONE_SET_DESC_EXT;
>> +
>> + return BLK_STS_OK;
>> +}
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index ccb895f911b1..53b7b05b0004 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -316,6 +316,8 @@ enum req_opf {
>> REQ_OP_ZONE_FINISH = 12,
>> /* write data at the current zone write pointer */
>> REQ_OP_ZONE_APPEND = 13,
>> + /* associate zone desc extension data to a zone */
>> + REQ_OP_ZONE_SET_DESC = 14,
>>
>> /* SCSI passthrough using struct scsi_request */
>> REQ_OP_SCSI_IN = 32,
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 2ed55055f68d..c5f092dd5aa3 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -370,7 +370,8 @@ extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
>> unsigned int cmd, unsigned long arg);
>> extern int blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>> unsigned int cmd, unsigned long arg);
>> -
>> +extern int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t mode,
>> + unsigned int cmd, unsigned long arg);
>> #else /* CONFIG_BLK_DEV_ZONED */
>>
>> static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
>> @@ -392,6 +393,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev,
>> return -ENOTTY;
>> }
>>
>> +static inline int blkdev_zone_set_desc_ioctl(struct block_device *bdev,
>> + fmode_t mode, unsigned int cmd,
>> + unsigned long arg)
>> +{
>> + return -ENOTTY;
>> +}
>> #endif /* CONFIG_BLK_DEV_ZONED */
>>
>> struct request_queue {
>> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
>> index 42c3366cc25f..68abda9abf33 100644
>> --- a/include/uapi/linux/blkzoned.h
>> +++ b/include/uapi/linux/blkzoned.h
>> @@ -142,6 +142,20 @@ struct blk_zone_range {
>> __u64 nr_sectors;
>> };
>>
>> +/**
>> + * struct blk_zone_set_desc - BLKSETDESCZONE ioctl requests
>> + * @sector: Starting sector of the zone to operate on.
>> + * @flags: Feature flags.
>> + * @len: size, in bytes, of the data to be associated to the zone.
>> + * @data: data to be associated.
>> + */
>> +struct blk_zone_set_desc {
>> + __u64 sector;
>> + __u32 flags;
>> + __u32 len;
>> + __u8 data[0];
>> +};

Would it make sense to add nr_sectors if the host wants to associate the
same metadata to several zones. The use case would be the grouping of
larger zones in software.

>> +
>> /**
>> * Zoned block device ioctl's:
>> *
>> @@ -158,6 +172,10 @@ struct blk_zone_range {
>> * The 512 B sector range must be zone aligned.
>> * @BLKFINISHZONE: Mark the zones as full in the specified sector range.
>> * The 512 B sector range must be zone aligned.
>> + * @BLKSETDESCZONE: Set zone description extension data for the zone
>> + * in the specified sector. On success, the zone
>> + * will transition to the closed zone state.
>> + * The 512B sector must be zone aligned.
>> */
>> #define BLKREPORTZONE _IOWR(0x12, 130, struct blk_zone_report)
>> #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)
>> @@ -166,5 +184,5 @@ struct blk_zone_range {
>> #define BLKOPENZONE _IOW(0x12, 134, struct blk_zone_range)
>> #define BLKCLOSEZONE _IOW(0x12, 135, struct blk_zone_range)
>> #define BLKFINISHZONE _IOW(0x12, 136, struct blk_zone_range)
>> -
>> +#define BLKSETDESCZONE _IOW(0x12, 137, struct blk_zone_set_desc)
>> #endif /* _UAPI_BLKZONED_H */
>>
>
>How to you retreive an extended descriptor that was set ? I do not see any code
>doing that. Report zones is not changed, but I would leave that one as is and
>implement a BLKGETDESCZONE ioctl & in-kernel API.
>

In any case, this is needed. I also could not find a way to read the
extended descriptor back.

Javier

2020-06-29 21:18:44

by Matias Bjørling

[permalink] [raw]
Subject: RE: [PATCH 1/2] block: add zone_desc_ext_bytes to sysfs

> -----Original Message-----
> From: Niklas Cassel <[email protected]>
> Sent: Monday, 29 June 2020 11.04
> To: Damien Le Moal <[email protected]>
> Cc: Matias Bjorling <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Hans Holmberg <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Subject: Re: [PATCH 1/2] block: add zone_desc_ext_bytes to sysfs
>
> On Mon, Jun 29, 2020 at 12:52:46AM +0000, Damien Le Moal wrote:
> > On 2020/06/29 8:01, Matias Bjorling wrote:
> > > The NVMe Zoned Namespace Command Set adds support for associating
> > > data to a zone through the Zone Descriptor Extension feature.
> > >
> > > The Zone Descriptor Extension size is fixed to a multiple of 64
> > > bytes. A value of zero communicates the feature is not available.
> > > A value larger than zero communites the feature is available, and
> > > the specified Zone Descriptor Extension size in bytes.
> > >
> > > The Zone Descriptor Extension feature is only available in the NVMe
> > > Zoned Namespaces Command Set. Devices that supports ZAC/ZBC
> > > therefore reports this value as zero, where as the NVMe device
> > > driver reports the Zone Descriptor Extension size from the specific
> > > device.
> > >
> > > Signed-off-by: Matias Bj?rling <[email protected]>
> > > ---
> > > Documentation/block/queue-sysfs.rst | 6 ++++++
> > > block/blk-sysfs.c | 15 ++++++++++++++-
> > > drivers/nvme/host/zns.c | 1 +
> > > drivers/scsi/sd_zbc.c | 1 +
> > > include/linux/blkdev.h | 22 ++++++++++++++++++++++
> > > 5 files changed, 44 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/block/queue-sysfs.rst
> > > b/Documentation/block/queue-sysfs.rst
> > > index f261a5c84170..c4fa195c87b4 100644
>
> (snip)
>
> > > static struct queue_sysfs_entry queue_nomerges_entry = {
> > > .attr = {.name = "nomerges", .mode = 0644 },
> > > .show = queue_nomerges_show,
> > > @@ -787,6 +798,7 @@ static struct attribute *queue_attrs[] = {
> > > &queue_nr_zones_entry.attr,
> > > &queue_max_open_zones_entry.attr,
> > > &queue_max_active_zones_entry.attr,
> >
> > Which tree is this patch based on ? Not I have seen any patch
> > introducing max active zones.
>
> The cover letter forgot to mention this patch series' dependencies.
> I assume that it is based on the following patch series:
> https://lore.kernel.org/linux-block/20200622162530.1287650-1-
> [email protected]/
> https://lore.kernel.org/linux-block/20200616102546.491961-1-
> [email protected]/
>

Very true. Thanks, Niklas.

I'll rebase this work when the above patchsets have been picked up.

>
> Kind regards,
> Niklas

2020-06-30 13:51:49

by Matias Bjørling

[permalink] [raw]
Subject: RE: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices

> -----Original Message-----
> From: Bart Van Assche <[email protected]>
> Sent: Monday, 29 June 2020 03.36
> To: Matias Bjorling <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Damien Le Moal <[email protected]>;
> Niklas Cassel <[email protected]>; Hans Holmberg
> <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Subject: Re: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block
> Devices
>
> On 2020-06-28 16:01, Matias Bjørling wrote:
> > + /* This may take a while, so be nice to others */
> > + cond_resched();
> > +
> > + return submit_bio_wait(&bio);
>
> A cond_resched() call before a submit_bio_wait() call? I think it's the first time
> that I see this. Is that call really necessary? Isn't the
> wait_for_completion() call inside submit_bio_wait() sufficient?
>

One can't be too careful these days, heh. I'll fix it up. Thanks!

> > + /* no flags is currently supported */
> ^^
> are?
>
> > + /* allocate the size of the zone descriptor extension and fill
> > + * with the data in the user data buffer. If the data size is less
> > + * than the zone descriptor extension size, then the rest of the
> > + * zone description extension data buffer is zero-filled.
> > + */
> > + zsd_data = (void *) get_zeroed_page(GFP_KERNEL);
> > + if (!zsd_data)
> > + return -ENOMEM;
> > +
> > + if (copy_from_user(zsd_data, argp + sizeof(struct blk_zone_set_desc),
> > + zsd.len)) {
> > + ret = -EFAULT;
> > + goto free;
> > + }
>
> Has it been considered to use kmalloc() instead of get_zeroed_page()?
>
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index ccb895f911b1..53b7b05b0004 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -316,6 +316,8 @@ enum req_opf {
> > REQ_OP_ZONE_FINISH = 12,
> > /* write data at the current zone write pointer */
> > REQ_OP_ZONE_APPEND = 13,
> > + /* associate zone desc extension data to a zone */
> > + REQ_OP_ZONE_SET_DESC = 14,
> >
> > /* SCSI passthrough using struct scsi_request */
> > REQ_OP_SCSI_IN = 32,
>
> Does REQ_OP_ZONE_SET_DESC count as a read or as a write operation? See
> also:
>
> static inline bool op_is_write(unsigned int op) {
> return (op & 1);
> }
>
> > +/**
> > + * struct blk_zone_set_desc - BLKSETDESCZONE ioctl requests
> > + * @sector: Starting sector of the zone to operate on.
> > + * @flags: Feature flags.
> > + * @len: size, in bytes, of the data to be associated to the zone.
> > + * @data: data to be associated.
> > + */
> > +struct blk_zone_set_desc {
> > + __u64 sector;
> > + __u32 flags;
> > + __u32 len;
> > + __u8 data[0];
> > +};
>
> Isn't the recommended style to use a flexible array ([] instead of [0])?
> See also
> https://lore.kernel.org/lkml/20200608213711.GA22271@embeddedor/.
>
> Thanks,
>
> Bart.

2020-07-03 09:47:05

by Matias Bjørling

[permalink] [raw]
Subject: RE: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices

> -----Original Message-----
> From: Javier González <[email protected]>
> Sent: Monday, 29 June 2020 21.39
> To: Damien Le Moal <[email protected]>
> Cc: Matias Bjorling <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Niklas Cassel <[email protected]>; Hans
> Holmberg <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: Re: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block
> Devices
>
> On 29.06.2020 01:00, Damien Le Moal wrote:
> >On 2020/06/29 8:01, Matias Bjorling wrote:
> >> The NVMe Zoned Namespace Command Set adds support for associating
> >> data to a zone through the Zone Descriptor Extension feature.
> >>
> >> To allow user-space to associate data to a zone, add support through
> >> the BLKSETDESCZONE ioctl. The ioctl requires that it is issued to a
> >> zoned block device, and that it supports the Zone Descriptor
> >> Extension feature. Support is detected through the the
> >> zone_desc_ext_bytes sysfs queue entry for the specific block device.
> >> A value larger than zero communicates that the device supports the
> >> feature.
>
> Have you considered the possibility of adding this an action to a IOCTL that
> looks like the zone management one we discussed last week? We would start
> saving IOCTLs already if we count the offline transition and this one.

Yes, I considered it. Damien and Christoph have asked for it to separate ioctls.

>
> >>
> >> The ioctl associates data to a zone by issuing a Zone Management Send
> >> command with the Zone Send Action set as the Set Zone Descriptor
> >> Extension.
> >>
> >> For the command to complete successfully, the specified zone must be
> >> in the Empty state, and active resources must be available. On
> >> success, the specified zone is transioned to Closed state by the
> >> device. If less data is supplied by user-space then reported by the
> >> the Zone Descriptor Extension size, the rest is zero-filled. If more
> >> data or no data is supplied by user-space, the ioctl fails.
> >>
> >> To issue the ioctl, a new blk_zone_set_desc data structure is defined.
> >> It has following parameters:
> >>
> >> * the sector of the specific zone.
> >> * the length of the data to be associated to the zone.
> >> * any flags be used by the ioctl. None is defined.
> >> * data associated to the zone.
> >>
> >> The data is laid out after the flags parameter, and it is the
> >> caller's responsibility to allocate memory for the data that is
> >> specified in the length parameter.
> >>
> >> Signed-off-by: Matias Bjørling <[email protected]>
> >> ---
> >> block/blk-zoned.c | 108 ++++++++++++++++++++++++++++++++++
> >> block/ioctl.c | 2 +
> >> drivers/nvme/host/core.c | 3 +
> >> drivers/nvme/host/nvme.h | 9 +++
> >> drivers/nvme/host/zns.c | 11 ++++
> >> include/linux/blk_types.h | 2 +
> >> include/linux/blkdev.h | 9 ++-
> >> include/uapi/linux/blkzoned.h | 20 ++++++-
> >> 8 files changed, 162 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/block/blk-zoned.c b/block/blk-zoned.c index
> >> 81152a260354..4dc40ec006a2 100644
> >> --- a/block/blk-zoned.c
> >> +++ b/block/blk-zoned.c
> >> @@ -259,6 +259,50 @@ int blkdev_zone_mgmt(struct block_device *bdev,
> >> enum req_opf op, } EXPORT_SYMBOL_GPL(blkdev_zone_mgmt);
> >>
> >> +/**
> >> + * blkdev_zone_set_desc - Execute a zone management set zone
> descriptor
> >> + * extension operation on a zone
> >> + * @bdev: Target block device
> >> + * @sector: Start sector of the zone to operate on
> >> + * @data: Pointer to the data that is to be associated to the zone
> >> + * @gfp_mask: Memory allocation flags (for bio_alloc)
> >> + *
> >> + * Description:
> >> + * Associate zone descriptor extension data to a specified zone.
> >> + * The block device must support zone descriptor extensions.
> >> + * i.e., by exposing a positive zone descriptor extension size.
> >> + */
> >> +int blkdev_zone_set_desc(struct block_device *bdev, sector_t sector,
> >> + struct page *data, gfp_t gfp_mask)
> >
> >struct page for the data ? Why not just a "void *" to allow for
> >kmalloc/vmalloc data ? And no length for the data ? This is a bit odd.
> >
> >> +{
> >> + struct request_queue *q = bdev_get_queue(bdev);
> >> + sector_t zone_sectors = blk_queue_zone_sectors(q);
> >> + struct bio_vec bio_vec;
> >> + struct bio bio;
> >> +
> >> + if (!blk_queue_is_zoned(q))
> >> + return -EOPNOTSUPP;
> >> +
> >> + if (bdev_read_only(bdev))
> >> + return -EPERM;
> >
> >You are not checking the zone_desc_ext_bytes limit here. You should.
> >> +
> >> + /* Check alignment (handle eventual smaller last zone) */
> >> + if (sector & (zone_sectors - 1))
> >> + return -EINVAL;
> >
> >The comment is incorrect. There is nothing special for handling the
> >last zone in this test.
> >
> >> +
> >> + bio_init(&bio, &bio_vec, 1);
> >> + bio.bi_opf = REQ_OP_ZONE_SET_DESC | REQ_SYNC;
> >> + bio.bi_iter.bi_sector = sector;
> >> + bio_set_dev(&bio, bdev);
> >> + bio_add_page(&bio, data, queue_zone_desc_ext_bytes(q), 0);
> >> +
> >> + /* This may take a while, so be nice to others */
> >> + cond_resched();
> >
> >This is not a loop, so you do not need this.
> >
> >> +
> >> + return submit_bio_wait(&bio);
> >> +}
> >> +EXPORT_SYMBOL_GPL(blkdev_zone_set_desc);
> >> +
> >> struct zone_report_args {
> >> struct blk_zone __user *zones;
> >> };
> >> @@ -370,6 +414,70 @@ int blkdev_zone_mgmt_ioctl(struct block_device
> *bdev, fmode_t mode,
> >> GFP_KERNEL);
> >> }
> >>
> >> +/*
> >> + * BLKSETDESCZONE ioctl processing.
> >> + * Called from blkdev_ioctl.
> >> + */
> >> +int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t
> mode,
> >> + unsigned int cmd, unsigned long arg) {
> >> + void __user *argp = (void __user *)arg;
> >> + struct request_queue *q;
> >> + struct blk_zone_set_desc zsd;
> >> + void *zsd_data;
> >> + int ret;
> >> +
> >> + if (!argp)
> >> + return -EINVAL;
> >> +
> >> + q = bdev_get_queue(bdev);
> >> + if (!q)
> >> + return -ENXIO;
> >> +
> >> + if (!blk_queue_is_zoned(q))
> >> + return -ENOTTY;
> >> +
> >> + if (!capable(CAP_SYS_ADMIN))
> >> + return -EACCES;
> >> +
> >> + if (!(mode & FMODE_WRITE))
> >> + return -EBADF;
> >> +
> >> + if (!queue_zone_desc_ext_bytes(q))
> >> + return -EOPNOTSUPP;
> >> +
> >> + if (copy_from_user(&zsd, argp, sizeof(struct blk_zone_set_desc)))
> >> + return -EFAULT;
> >> +
> >> + /* no flags is currently supported */
> >> + if (zsd.flags)
> >> + return -ENOTTY;
> >> +
> >> + if (!zsd.len || zsd.len > queue_zone_desc_ext_bytes(q))
> >> + return -ENOTTY;
> >
> >This should go into blkdev_zone_set_desc() as well so that in-kernel
> >users are checked. So there may be no need to check this here.
> >
> >> +
> >> + /* allocate the size of the zone descriptor extension and fill
> >> + * with the data in the user data buffer. If the data size is less
> >> + * than the zone descriptor extension size, then the rest of the
> >> + * zone description extension data buffer is zero-filled.
> >> + */
> >> + zsd_data = (void *) get_zeroed_page(GFP_KERNEL);
> >> + if (!zsd_data)
> >> + return -ENOMEM;
> >> +
> >> + if (copy_from_user(zsd_data, argp + sizeof(struct blk_zone_set_desc),
> >> + zsd.len)) {
> >> + ret = -EFAULT;
> >> + goto free;
> >> + }
> >> +
> >> + ret = blkdev_zone_set_desc(bdev, zsd.sector, virt_to_page(zsd_data),
> >> + GFP_KERNEL);
> >> +free:
> >> + free_page((unsigned long) zsd_data);
> >> + return ret;
> >> +}
> >> +
> >> static inline unsigned long *blk_alloc_zone_bitmap(int node,
> >> unsigned int nr_zones)
> >> {
> >> diff --git a/block/ioctl.c b/block/ioctl.c index
> >> bdb3bbb253d9..b9744705835b 100644
> >> --- a/block/ioctl.c
> >> +++ b/block/ioctl.c
> >> @@ -515,6 +515,8 @@ static int blkdev_common_ioctl(struct block_device
> *bdev, fmode_t mode,
> >> case BLKCLOSEZONE:
> >> case BLKFINISHZONE:
> >> return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
> >> + case BLKSETDESCZONE:
> >> + return blkdev_zone_set_desc_ioctl(bdev, mode, cmd, arg);
> >> case BLKGETZONESZ:
> >> return put_uint(argp, bdev_zone_sectors(bdev));
> >> case BLKGETNRZONES:
> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >> index e961910da4ac..b8f25b0d00ad 100644
> >> --- a/drivers/nvme/host/core.c
> >> +++ b/drivers/nvme/host/core.c
> >> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns,
> struct request *req,
> >> case REQ_OP_ZONE_FINISH:
> >> ret = nvme_setup_zone_mgmt_send(ns, req, cmd,
> NVME_ZONE_FINISH);
> >> break;
> >> + case REQ_OP_ZONE_SET_DESC:
> >> + ret = nvme_setup_zone_set_desc(ns, req, cmd);
> >> + break;
> >> case REQ_OP_WRITE_ZEROES:
> >> ret = nvme_setup_write_zeroes(ns, req, cmd);
> >> break;
> >> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> >> index 662f95fbd909..5bd5a437b038 100644
> >> --- a/drivers/nvme/host/nvme.h
> >> +++ b/drivers/nvme/host/nvme.h
> >> @@ -708,6 +708,9 @@ int nvme_report_zones(struct gendisk *disk,
> >> sector_t sector, blk_status_t nvme_setup_zone_mgmt_send(struct
> nvme_ns *ns, struct request *req,
> >> struct nvme_command *cmnd,
> >> enum nvme_zone_mgmt_action action);
> >> +
> >> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct
> request *req,
> >> + struct nvme_command *cmnd);
> >> #else
> >> #define nvme_report_zones NULL
> >>
> >> @@ -718,6 +721,12 @@ static inline blk_status_t
> nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
> >> return BLK_STS_NOTSUPP;
> >> }
> >>
> >> +static inline blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns,
> >> + struct request *req, struct nvme_command *cmnd) {
> >> + return BLK_STS_NOTSUPP;
> >> +}
> >> +
> >> static inline int nvme_update_zone_info(struct gendisk *disk,
> >> struct nvme_ns *ns,
> >> unsigned lbaf)
> >> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c index
> >> 5792d953a8f3..bfa64cc685d3 100644
> >> --- a/drivers/nvme/host/zns.c
> >> +++ b/drivers/nvme/host/zns.c
> >> @@ -239,3 +239,14 @@ blk_status_t
> nvme_setup_zone_mgmt_send(struct
> >> nvme_ns *ns, struct request *req,
> >>
> >> return BLK_STS_OK;
> >> }
> >> +
> >> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct
> request *req,
> >> + struct nvme_command *c)
> >> +{
> >> + c->zms.opcode = nvme_cmd_zone_mgmt_send;
> >> + c->zms.nsid = cpu_to_le32(ns->head->ns_id);
> >> + c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
> >> + c->zms.action = NVME_ZONE_SET_DESC_EXT;
> >> +
> >> + return BLK_STS_OK;
> >> +}
> >> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> >> index ccb895f911b1..53b7b05b0004 100644
> >> --- a/include/linux/blk_types.h
> >> +++ b/include/linux/blk_types.h
> >> @@ -316,6 +316,8 @@ enum req_opf {
> >> REQ_OP_ZONE_FINISH = 12,
> >> /* write data at the current zone write pointer */
> >> REQ_OP_ZONE_APPEND = 13,
> >> + /* associate zone desc extension data to a zone */
> >> + REQ_OP_ZONE_SET_DESC = 14,
> >>
> >> /* SCSI passthrough using struct scsi_request */
> >> REQ_OP_SCSI_IN = 32,
> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index
> >> 2ed55055f68d..c5f092dd5aa3 100644
> >> --- a/include/linux/blkdev.h
> >> +++ b/include/linux/blkdev.h
> >> @@ -370,7 +370,8 @@ extern int blkdev_report_zones_ioctl(struct
> block_device *bdev, fmode_t mode,
> >> unsigned int cmd, unsigned long arg);
> extern int
> >> blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
> >> unsigned int cmd, unsigned long arg);
> >> -
> >> +extern int blkdev_zone_set_desc_ioctl(struct block_device *bdev,
> fmode_t mode,
> >> + unsigned int cmd, unsigned long arg);
> >> #else /* CONFIG_BLK_DEV_ZONED */
> >>
> >> static inline unsigned int blkdev_nr_zones(struct gendisk *disk) @@
> >> -392,6 +393,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct
> block_device *bdev,
> >> return -ENOTTY;
> >> }
> >>
> >> +static inline int blkdev_zone_set_desc_ioctl(struct block_device *bdev,
> >> + fmode_t mode, unsigned int cmd,
> >> + unsigned long arg)
> >> +{
> >> + return -ENOTTY;
> >> +}
> >> #endif /* CONFIG_BLK_DEV_ZONED */
> >>
> >> struct request_queue {
> >> diff --git a/include/uapi/linux/blkzoned.h
> >> b/include/uapi/linux/blkzoned.h index 42c3366cc25f..68abda9abf33
> >> 100644
> >> --- a/include/uapi/linux/blkzoned.h
> >> +++ b/include/uapi/linux/blkzoned.h
> >> @@ -142,6 +142,20 @@ struct blk_zone_range {
> >> __u64 nr_sectors;
> >> };
> >>
> >> +/**
> >> + * struct blk_zone_set_desc - BLKSETDESCZONE ioctl requests
> >> + * @sector: Starting sector of the zone to operate on.
> >> + * @flags: Feature flags.
> >> + * @len: size, in bytes, of the data to be associated to the zone.
> >> + * @data: data to be associated.
> >> + */
> >> +struct blk_zone_set_desc {
> >> + __u64 sector;
> >> + __u32 flags;
> >> + __u32 len;
> >> + __u8 data[0];
> >> +};
>
> Would it make sense to add nr_sectors if the host wants to associate the same
> metadata to several zones. The use case would be the grouping of larger zones
> in software.

I believe we get into atomicity concerns if we do ranges, and action only supports a single zone. I like to align with the ZNS API as much as possible, and let the user-space application track errors per set desc ext action.

>
> >> +
> >> /**
> >> * Zoned block device ioctl's:
> >> *
> >> @@ -158,6 +172,10 @@ struct blk_zone_range {
> >> * The 512 B sector range must be zone aligned.
> >> * @BLKFINISHZONE: Mark the zones as full in the specified sector range.
> >> * The 512 B sector range must be zone aligned.
> >> + * @BLKSETDESCZONE: Set zone description extension data for the zone
> >> + * in the specified sector. On success, the zone
> >> + * will transition to the closed zone state.
> >> + * The 512B sector must be zone aligned.
> >> */
> >> #define BLKREPORTZONE _IOWR(0x12, 130, struct blk_zone_report)
> >> #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)
> >> @@ -166,5 +184,5 @@ struct blk_zone_range {
> >> #define BLKOPENZONE _IOW(0x12, 134, struct blk_zone_range)
> >> #define BLKCLOSEZONE _IOW(0x12, 135, struct blk_zone_range)
> >> #define BLKFINISHZONE _IOW(0x12, 136, struct blk_zone_range)
> >> -
> >> +#define BLKSETDESCZONE _IOW(0x12, 137, struct blk_zone_set_desc)
> >> #endif /* _UAPI_BLKZONED_H */
> >>
> >
> >How to you retreive an extended descriptor that was set ? I do not see
> >any code doing that. Report zones is not changed, but I would leave
> >that one as is and implement a BLKGETDESCZONE ioctl & in-kernel API.
> >
>
> In any case, this is needed. I also could not find a way to read the extended
> descriptor back.

Thanks for the feedback.

Besides the users I have in mind, do you have users for which you need the ability for user-space to access zone descriptor extensions data? Is this a need to have or nice to have feature from your point of view?

>
> Javier

2020-07-07 08:44:02

by Javier González

[permalink] [raw]
Subject: Re: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices

On 03.07.2020 09:44, Matias Bjorling wrote:
>> -----Original Message-----
>> From: Javier González <[email protected]>
>> Sent: Monday, 29 June 2020 21.39
>> To: Damien Le Moal <[email protected]>
>> Cc: Matias Bjorling <[email protected]>; [email protected];
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Niklas Cassel <[email protected]>; Hans
>> Holmberg <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; linux-
>> [email protected]
>> Subject: Re: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block
>> Devices
>>
>> On 29.06.2020 01:00, Damien Le Moal wrote:
>> >On 2020/06/29 8:01, Matias Bjorling wrote:
>> >> The NVMe Zoned Namespace Command Set adds support for associating
>> >> data to a zone through the Zone Descriptor Extension feature.
>> >>
>> >> To allow user-space to associate data to a zone, add support through
>> >> the BLKSETDESCZONE ioctl. The ioctl requires that it is issued to a
>> >> zoned block device, and that it supports the Zone Descriptor
>> >> Extension feature. Support is detected through the the
>> >> zone_desc_ext_bytes sysfs queue entry for the specific block device.
>> >> A value larger than zero communicates that the device supports the
>> >> feature.
>>
>> Have you considered the possibility of adding this an action to a IOCTL that
>> looks like the zone management one we discussed last week? We would start
>> saving IOCTLs already if we count the offline transition and this one.
>
>Yes, I considered it. Damien and Christoph have asked for it to separate ioctls.

Ok.

>
>>
>> >>
>> >> The ioctl associates data to a zone by issuing a Zone Management Send
>> >> command with the Zone Send Action set as the Set Zone Descriptor
>> >> Extension.
>> >>
>> >> For the command to complete successfully, the specified zone must be
>> >> in the Empty state, and active resources must be available. On
>> >> success, the specified zone is transioned to Closed state by the
>> >> device. If less data is supplied by user-space then reported by the
>> >> the Zone Descriptor Extension size, the rest is zero-filled. If more
>> >> data or no data is supplied by user-space, the ioctl fails.
>> >>
>> >> To issue the ioctl, a new blk_zone_set_desc data structure is defined.
>> >> It has following parameters:
>> >>
>> >> * the sector of the specific zone.
>> >> * the length of the data to be associated to the zone.
>> >> * any flags be used by the ioctl. None is defined.
>> >> * data associated to the zone.
>> >>
>> >> The data is laid out after the flags parameter, and it is the
>> >> caller's responsibility to allocate memory for the data that is
>> >> specified in the length parameter.
>> >>
>> >> Signed-off-by: Matias Bjørling <[email protected]>
>> >> ---
>> >> block/blk-zoned.c | 108 ++++++++++++++++++++++++++++++++++
>> >> block/ioctl.c | 2 +
>> >> drivers/nvme/host/core.c | 3 +
>> >> drivers/nvme/host/nvme.h | 9 +++
>> >> drivers/nvme/host/zns.c | 11 ++++
>> >> include/linux/blk_types.h | 2 +
>> >> include/linux/blkdev.h | 9 ++-
>> >> include/uapi/linux/blkzoned.h | 20 ++++++-
>> >> 8 files changed, 162 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/block/blk-zoned.c b/block/blk-zoned.c index
>> >> 81152a260354..4dc40ec006a2 100644
>> >> --- a/block/blk-zoned.c
>> >> +++ b/block/blk-zoned.c
>> >> @@ -259,6 +259,50 @@ int blkdev_zone_mgmt(struct block_device *bdev,
>> >> enum req_opf op, } EXPORT_SYMBOL_GPL(blkdev_zone_mgmt);
>> >>
>> >> +/**
>> >> + * blkdev_zone_set_desc - Execute a zone management set zone
>> descriptor
>> >> + * extension operation on a zone
>> >> + * @bdev: Target block device
>> >> + * @sector: Start sector of the zone to operate on
>> >> + * @data: Pointer to the data that is to be associated to the zone
>> >> + * @gfp_mask: Memory allocation flags (for bio_alloc)
>> >> + *
>> >> + * Description:
>> >> + * Associate zone descriptor extension data to a specified zone.
>> >> + * The block device must support zone descriptor extensions.
>> >> + * i.e., by exposing a positive zone descriptor extension size.
>> >> + */
>> >> +int blkdev_zone_set_desc(struct block_device *bdev, sector_t sector,
>> >> + struct page *data, gfp_t gfp_mask)
>> >
>> >struct page for the data ? Why not just a "void *" to allow for
>> >kmalloc/vmalloc data ? And no length for the data ? This is a bit odd.
>> >
>> >> +{
>> >> + struct request_queue *q = bdev_get_queue(bdev);
>> >> + sector_t zone_sectors = blk_queue_zone_sectors(q);
>> >> + struct bio_vec bio_vec;
>> >> + struct bio bio;
>> >> +
>> >> + if (!blk_queue_is_zoned(q))
>> >> + return -EOPNOTSUPP;
>> >> +
>> >> + if (bdev_read_only(bdev))
>> >> + return -EPERM;
>> >
>> >You are not checking the zone_desc_ext_bytes limit here. You should.
>> >> +
>> >> + /* Check alignment (handle eventual smaller last zone) */
>> >> + if (sector & (zone_sectors - 1))
>> >> + return -EINVAL;
>> >
>> >The comment is incorrect. There is nothing special for handling the
>> >last zone in this test.
>> >
>> >> +
>> >> + bio_init(&bio, &bio_vec, 1);
>> >> + bio.bi_opf = REQ_OP_ZONE_SET_DESC | REQ_SYNC;
>> >> + bio.bi_iter.bi_sector = sector;
>> >> + bio_set_dev(&bio, bdev);
>> >> + bio_add_page(&bio, data, queue_zone_desc_ext_bytes(q), 0);
>> >> +
>> >> + /* This may take a while, so be nice to others */
>> >> + cond_resched();
>> >
>> >This is not a loop, so you do not need this.
>> >
>> >> +
>> >> + return submit_bio_wait(&bio);
>> >> +}
>> >> +EXPORT_SYMBOL_GPL(blkdev_zone_set_desc);
>> >> +
>> >> struct zone_report_args {
>> >> struct blk_zone __user *zones;
>> >> };
>> >> @@ -370,6 +414,70 @@ int blkdev_zone_mgmt_ioctl(struct block_device
>> *bdev, fmode_t mode,
>> >> GFP_KERNEL);
>> >> }
>> >>
>> >> +/*
>> >> + * BLKSETDESCZONE ioctl processing.
>> >> + * Called from blkdev_ioctl.
>> >> + */
>> >> +int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t
>> mode,
>> >> + unsigned int cmd, unsigned long arg) {
>> >> + void __user *argp = (void __user *)arg;
>> >> + struct request_queue *q;
>> >> + struct blk_zone_set_desc zsd;
>> >> + void *zsd_data;
>> >> + int ret;
>> >> +
>> >> + if (!argp)
>> >> + return -EINVAL;
>> >> +
>> >> + q = bdev_get_queue(bdev);
>> >> + if (!q)
>> >> + return -ENXIO;
>> >> +
>> >> + if (!blk_queue_is_zoned(q))
>> >> + return -ENOTTY;
>> >> +
>> >> + if (!capable(CAP_SYS_ADMIN))
>> >> + return -EACCES;
>> >> +
>> >> + if (!(mode & FMODE_WRITE))
>> >> + return -EBADF;
>> >> +
>> >> + if (!queue_zone_desc_ext_bytes(q))
>> >> + return -EOPNOTSUPP;
>> >> +
>> >> + if (copy_from_user(&zsd, argp, sizeof(struct blk_zone_set_desc)))
>> >> + return -EFAULT;
>> >> +
>> >> + /* no flags is currently supported */
>> >> + if (zsd.flags)
>> >> + return -ENOTTY;
>> >> +
>> >> + if (!zsd.len || zsd.len > queue_zone_desc_ext_bytes(q))
>> >> + return -ENOTTY;
>> >
>> >This should go into blkdev_zone_set_desc() as well so that in-kernel
>> >users are checked. So there may be no need to check this here.
>> >
>> >> +
>> >> + /* allocate the size of the zone descriptor extension and fill
>> >> + * with the data in the user data buffer. If the data size is less
>> >> + * than the zone descriptor extension size, then the rest of the
>> >> + * zone description extension data buffer is zero-filled.
>> >> + */
>> >> + zsd_data = (void *) get_zeroed_page(GFP_KERNEL);
>> >> + if (!zsd_data)
>> >> + return -ENOMEM;
>> >> +
>> >> + if (copy_from_user(zsd_data, argp + sizeof(struct blk_zone_set_desc),
>> >> + zsd.len)) {
>> >> + ret = -EFAULT;
>> >> + goto free;
>> >> + }
>> >> +
>> >> + ret = blkdev_zone_set_desc(bdev, zsd.sector, virt_to_page(zsd_data),
>> >> + GFP_KERNEL);
>> >> +free:
>> >> + free_page((unsigned long) zsd_data);
>> >> + return ret;
>> >> +}
>> >> +
>> >> static inline unsigned long *blk_alloc_zone_bitmap(int node,
>> >> unsigned int nr_zones)
>> >> {
>> >> diff --git a/block/ioctl.c b/block/ioctl.c index
>> >> bdb3bbb253d9..b9744705835b 100644
>> >> --- a/block/ioctl.c
>> >> +++ b/block/ioctl.c
>> >> @@ -515,6 +515,8 @@ static int blkdev_common_ioctl(struct block_device
>> *bdev, fmode_t mode,
>> >> case BLKCLOSEZONE:
>> >> case BLKFINISHZONE:
>> >> return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
>> >> + case BLKSETDESCZONE:
>> >> + return blkdev_zone_set_desc_ioctl(bdev, mode, cmd, arg);
>> >> case BLKGETZONESZ:
>> >> return put_uint(argp, bdev_zone_sectors(bdev));
>> >> case BLKGETNRZONES:
>> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> >> index e961910da4ac..b8f25b0d00ad 100644
>> >> --- a/drivers/nvme/host/core.c
>> >> +++ b/drivers/nvme/host/core.c
>> >> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns,
>> struct request *req,
>> >> case REQ_OP_ZONE_FINISH:
>> >> ret = nvme_setup_zone_mgmt_send(ns, req, cmd,
>> NVME_ZONE_FINISH);
>> >> break;
>> >> + case REQ_OP_ZONE_SET_DESC:
>> >> + ret = nvme_setup_zone_set_desc(ns, req, cmd);
>> >> + break;
>> >> case REQ_OP_WRITE_ZEROES:
>> >> ret = nvme_setup_write_zeroes(ns, req, cmd);
>> >> break;
>> >> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> >> index 662f95fbd909..5bd5a437b038 100644
>> >> --- a/drivers/nvme/host/nvme.h
>> >> +++ b/drivers/nvme/host/nvme.h
>> >> @@ -708,6 +708,9 @@ int nvme_report_zones(struct gendisk *disk,
>> >> sector_t sector, blk_status_t nvme_setup_zone_mgmt_send(struct
>> nvme_ns *ns, struct request *req,
>> >> struct nvme_command *cmnd,
>> >> enum nvme_zone_mgmt_action action);
>> >> +
>> >> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct
>> request *req,
>> >> + struct nvme_command *cmnd);
>> >> #else
>> >> #define nvme_report_zones NULL
>> >>
>> >> @@ -718,6 +721,12 @@ static inline blk_status_t
>> nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
>> >> return BLK_STS_NOTSUPP;
>> >> }
>> >>
>> >> +static inline blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns,
>> >> + struct request *req, struct nvme_command *cmnd) {
>> >> + return BLK_STS_NOTSUPP;
>> >> +}
>> >> +
>> >> static inline int nvme_update_zone_info(struct gendisk *disk,
>> >> struct nvme_ns *ns,
>> >> unsigned lbaf)
>> >> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c index
>> >> 5792d953a8f3..bfa64cc685d3 100644
>> >> --- a/drivers/nvme/host/zns.c
>> >> +++ b/drivers/nvme/host/zns.c
>> >> @@ -239,3 +239,14 @@ blk_status_t
>> nvme_setup_zone_mgmt_send(struct
>> >> nvme_ns *ns, struct request *req,
>> >>
>> >> return BLK_STS_OK;
>> >> }
>> >> +
>> >> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct
>> request *req,
>> >> + struct nvme_command *c)
>> >> +{
>> >> + c->zms.opcode = nvme_cmd_zone_mgmt_send;
>> >> + c->zms.nsid = cpu_to_le32(ns->head->ns_id);
>> >> + c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
>> >> + c->zms.action = NVME_ZONE_SET_DESC_EXT;
>> >> +
>> >> + return BLK_STS_OK;
>> >> +}
>> >> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> >> index ccb895f911b1..53b7b05b0004 100644
>> >> --- a/include/linux/blk_types.h
>> >> +++ b/include/linux/blk_types.h
>> >> @@ -316,6 +316,8 @@ enum req_opf {
>> >> REQ_OP_ZONE_FINISH = 12,
>> >> /* write data at the current zone write pointer */
>> >> REQ_OP_ZONE_APPEND = 13,
>> >> + /* associate zone desc extension data to a zone */
>> >> + REQ_OP_ZONE_SET_DESC = 14,
>> >>
>> >> /* SCSI passthrough using struct scsi_request */
>> >> REQ_OP_SCSI_IN = 32,
>> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index
>> >> 2ed55055f68d..c5f092dd5aa3 100644
>> >> --- a/include/linux/blkdev.h
>> >> +++ b/include/linux/blkdev.h
>> >> @@ -370,7 +370,8 @@ extern int blkdev_report_zones_ioctl(struct
>> block_device *bdev, fmode_t mode,
>> >> unsigned int cmd, unsigned long arg);
>> extern int
>> >> blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
>> >> unsigned int cmd, unsigned long arg);
>> >> -
>> >> +extern int blkdev_zone_set_desc_ioctl(struct block_device *bdev,
>> fmode_t mode,
>> >> + unsigned int cmd, unsigned long arg);
>> >> #else /* CONFIG_BLK_DEV_ZONED */
>> >>
>> >> static inline unsigned int blkdev_nr_zones(struct gendisk *disk) @@
>> >> -392,6 +393,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct
>> block_device *bdev,
>> >> return -ENOTTY;
>> >> }
>> >>
>> >> +static inline int blkdev_zone_set_desc_ioctl(struct block_device *bdev,
>> >> + fmode_t mode, unsigned int cmd,
>> >> + unsigned long arg)
>> >> +{
>> >> + return -ENOTTY;
>> >> +}
>> >> #endif /* CONFIG_BLK_DEV_ZONED */
>> >>
>> >> struct request_queue {
>> >> diff --git a/include/uapi/linux/blkzoned.h
>> >> b/include/uapi/linux/blkzoned.h index 42c3366cc25f..68abda9abf33
>> >> 100644
>> >> --- a/include/uapi/linux/blkzoned.h
>> >> +++ b/include/uapi/linux/blkzoned.h
>> >> @@ -142,6 +142,20 @@ struct blk_zone_range {
>> >> __u64 nr_sectors;
>> >> };
>> >>
>> >> +/**
>> >> + * struct blk_zone_set_desc - BLKSETDESCZONE ioctl requests
>> >> + * @sector: Starting sector of the zone to operate on.
>> >> + * @flags: Feature flags.
>> >> + * @len: size, in bytes, of the data to be associated to the zone.
>> >> + * @data: data to be associated.
>> >> + */
>> >> +struct blk_zone_set_desc {
>> >> + __u64 sector;
>> >> + __u32 flags;
>> >> + __u32 len;
>> >> + __u8 data[0];
>> >> +};
>>
>> Would it make sense to add nr_sectors if the host wants to associate the same
>> metadata to several zones. The use case would be the grouping of larger zones
>> in software.
>
>I believe we get into atomicity concerns if we do ranges, and action
>only supports a single zone. I like to align with the ZNS API as much
>as possible, and let the user-space application track errors per set
>desc ext action.

The atomicity concerns should be the same as current zone management
using nr_sectors, and these are already being supported for open, close,
etc.

TBH, the comment is most about making sure that the IOCTL is extendable
from conception.

>
>>
>> >> +
>> >> /**
>> >> * Zoned block device ioctl's:
>> >> *
>> >> @@ -158,6 +172,10 @@ struct blk_zone_range {
>> >> * The 512 B sector range must be zone aligned.
>> >> * @BLKFINISHZONE: Mark the zones as full in the specified sector range.
>> >> * The 512 B sector range must be zone aligned.
>> >> + * @BLKSETDESCZONE: Set zone description extension data for the zone
>> >> + * in the specified sector. On success, the zone
>> >> + * will transition to the closed zone state.
>> >> + * The 512B sector must be zone aligned.
>> >> */
>> >> #define BLKREPORTZONE _IOWR(0x12, 130, struct blk_zone_report)
>> >> #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)
>> >> @@ -166,5 +184,5 @@ struct blk_zone_range {
>> >> #define BLKOPENZONE _IOW(0x12, 134, struct blk_zone_range)
>> >> #define BLKCLOSEZONE _IOW(0x12, 135, struct blk_zone_range)
>> >> #define BLKFINISHZONE _IOW(0x12, 136, struct blk_zone_range)
>> >> -
>> >> +#define BLKSETDESCZONE _IOW(0x12, 137, struct blk_zone_set_desc)
>> >> #endif /* _UAPI_BLKZONED_H */
>> >>
>> >
>> >How to you retreive an extended descriptor that was set ? I do not see
>> >any code doing that. Report zones is not changed, but I would leave
>> >that one as is and implement a BLKGETDESCZONE ioctl & in-kernel API.
>> >
>>
>> In any case, this is needed. I also could not find a way to read the extended
>> descriptor back.
>
>Thanks for the feedback.
>
>Besides the users I have in mind, do you have users for which you need
>the ability for user-space to access zone descriptor extensions data?
>Is this a need to have or nice to have feature from your point of view?

At this moment most of the users we have in mind are user-space
applications that are zone-aware, so I would way this is necessary.

This said, I understand if you prioritize pushing the code that enables
in-kernel users and then re-iterate on this.

Thanks,
Javier

2020-07-07 16:04:38

by Matias Bjørling

[permalink] [raw]
Subject: RE: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block Devices



> -----Original Message-----
> From: Javier González <[email protected]>
> Sent: Tuesday, 7 July 2020 10.43
> To: Matias Bjorling <[email protected]>
> Cc: Damien Le Moal <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Niklas Cassel <[email protected]>; Hans
> Holmberg <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: Re: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned Block
> Devices
>
> On 03.07.2020 09:44, Matias Bjorling wrote:
> >> -----Original Message-----
> >> From: Javier González <[email protected]>
> >> Sent: Monday, 29 June 2020 21.39
> >> To: Damien Le Moal <[email protected]>
> >> Cc: Matias Bjorling <[email protected]>; [email protected];
> >> [email protected]; [email protected]; [email protected];
> >> [email protected]; Niklas Cassel <[email protected]>;
> >> Hans Holmberg <[email protected]>; [email protected];
> >> linux- [email protected]; [email protected]; linux-
> >> [email protected]
> >> Subject: Re: [PATCH 2/2] block: add BLKSETDESCZONE ioctl for Zoned
> >> Block Devices
> >>
> >> On 29.06.2020 01:00, Damien Le Moal wrote:
> >> >On 2020/06/29 8:01, Matias Bjorling wrote:
> >> >> The NVMe Zoned Namespace Command Set adds support for associating
> >> >> data to a zone through the Zone Descriptor Extension feature.
> >> >>
> >> >> To allow user-space to associate data to a zone, add support
> >> >> through the BLKSETDESCZONE ioctl. The ioctl requires that it is
> >> >> issued to a zoned block device, and that it supports the Zone
> >> >> Descriptor Extension feature. Support is detected through the the
> >> >> zone_desc_ext_bytes sysfs queue entry for the specific block device.
> >> >> A value larger than zero communicates that the device supports the
> >> >> feature.
> >>
> >> Have you considered the possibility of adding this an action to a
> >> IOCTL that looks like the zone management one we discussed last week?
> >> We would start saving IOCTLs already if we count the offline transition and
> this one.
> >
> >Yes, I considered it. Damien and Christoph have asked for it to separate ioctls.
>
> Ok.
>
> >
> >>
> >> >>
> >> >> The ioctl associates data to a zone by issuing a Zone Management
> >> >> Send command with the Zone Send Action set as the Set Zone
> >> >> Descriptor Extension.
> >> >>
> >> >> For the command to complete successfully, the specified zone must
> >> >> be in the Empty state, and active resources must be available. On
> >> >> success, the specified zone is transioned to Closed state by the
> >> >> device. If less data is supplied by user-space then reported by
> >> >> the the Zone Descriptor Extension size, the rest is zero-filled.
> >> >> If more data or no data is supplied by user-space, the ioctl fails.
> >> >>
> >> >> To issue the ioctl, a new blk_zone_set_desc data structure is defined.
> >> >> It has following parameters:
> >> >>
> >> >> * the sector of the specific zone.
> >> >> * the length of the data to be associated to the zone.
> >> >> * any flags be used by the ioctl. None is defined.
> >> >> * data associated to the zone.
> >> >>
> >> >> The data is laid out after the flags parameter, and it is the
> >> >> caller's responsibility to allocate memory for the data that is
> >> >> specified in the length parameter.
> >> >>
> >> >> Signed-off-by: Matias Bjørling <[email protected]>
> >> >> ---
> >> >> block/blk-zoned.c | 108 ++++++++++++++++++++++++++++++++++
> >> >> block/ioctl.c | 2 +
> >> >> drivers/nvme/host/core.c | 3 +
> >> >> drivers/nvme/host/nvme.h | 9 +++
> >> >> drivers/nvme/host/zns.c | 11 ++++
> >> >> include/linux/blk_types.h | 2 +
> >> >> include/linux/blkdev.h | 9 ++-
> >> >> include/uapi/linux/blkzoned.h | 20 ++++++-
> >> >> 8 files changed, 162 insertions(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/block/blk-zoned.c b/block/blk-zoned.c index
> >> >> 81152a260354..4dc40ec006a2 100644
> >> >> --- a/block/blk-zoned.c
> >> >> +++ b/block/blk-zoned.c
> >> >> @@ -259,6 +259,50 @@ int blkdev_zone_mgmt(struct block_device
> >> >> *bdev, enum req_opf op, } EXPORT_SYMBOL_GPL(blkdev_zone_mgmt);
> >> >>
> >> >> +/**
> >> >> + * blkdev_zone_set_desc - Execute a zone management set zone
> >> descriptor
> >> >> + * extension operation on a zone
> >> >> + * @bdev: Target block device
> >> >> + * @sector: Start sector of the zone to operate on
> >> >> + * @data: Pointer to the data that is to be associated to the zone
> >> >> + * @gfp_mask: Memory allocation flags (for bio_alloc)
> >> >> + *
> >> >> + * Description:
> >> >> + * Associate zone descriptor extension data to a specified zone.
> >> >> + * The block device must support zone descriptor extensions.
> >> >> + * i.e., by exposing a positive zone descriptor extension size.
> >> >> + */
> >> >> +int blkdev_zone_set_desc(struct block_device *bdev, sector_t sector,
> >> >> + struct page *data, gfp_t gfp_mask)
> >> >
> >> >struct page for the data ? Why not just a "void *" to allow for
> >> >kmalloc/vmalloc data ? And no length for the data ? This is a bit odd.
> >> >
> >> >> +{
> >> >> + struct request_queue *q = bdev_get_queue(bdev);
> >> >> + sector_t zone_sectors = blk_queue_zone_sectors(q);
> >> >> + struct bio_vec bio_vec;
> >> >> + struct bio bio;
> >> >> +
> >> >> + if (!blk_queue_is_zoned(q))
> >> >> + return -EOPNOTSUPP;
> >> >> +
> >> >> + if (bdev_read_only(bdev))
> >> >> + return -EPERM;
> >> >
> >> >You are not checking the zone_desc_ext_bytes limit here. You should.
> >> >> +
> >> >> + /* Check alignment (handle eventual smaller last zone) */
> >> >> + if (sector & (zone_sectors - 1))
> >> >> + return -EINVAL;
> >> >
> >> >The comment is incorrect. There is nothing special for handling the
> >> >last zone in this test.
> >> >
> >> >> +
> >> >> + bio_init(&bio, &bio_vec, 1);
> >> >> + bio.bi_opf = REQ_OP_ZONE_SET_DESC | REQ_SYNC;
> >> >> + bio.bi_iter.bi_sector = sector;
> >> >> + bio_set_dev(&bio, bdev);
> >> >> + bio_add_page(&bio, data, queue_zone_desc_ext_bytes(q), 0);
> >> >> +
> >> >> + /* This may take a while, so be nice to others */
> >> >> + cond_resched();
> >> >
> >> >This is not a loop, so you do not need this.
> >> >
> >> >> +
> >> >> + return submit_bio_wait(&bio);
> >> >> +}
> >> >> +EXPORT_SYMBOL_GPL(blkdev_zone_set_desc);
> >> >> +
> >> >> struct zone_report_args {
> >> >> struct blk_zone __user *zones;
> >> >> };
> >> >> @@ -370,6 +414,70 @@ int blkdev_zone_mgmt_ioctl(struct
> >> >> block_device
> >> *bdev, fmode_t mode,
> >> >> GFP_KERNEL);
> >> >> }
> >> >>
> >> >> +/*
> >> >> + * BLKSETDESCZONE ioctl processing.
> >> >> + * Called from blkdev_ioctl.
> >> >> + */
> >> >> +int blkdev_zone_set_desc_ioctl(struct block_device *bdev, fmode_t
> >> mode,
> >> >> + unsigned int cmd, unsigned long arg) {
> >> >> + void __user *argp = (void __user *)arg;
> >> >> + struct request_queue *q;
> >> >> + struct blk_zone_set_desc zsd;
> >> >> + void *zsd_data;
> >> >> + int ret;
> >> >> +
> >> >> + if (!argp)
> >> >> + return -EINVAL;
> >> >> +
> >> >> + q = bdev_get_queue(bdev);
> >> >> + if (!q)
> >> >> + return -ENXIO;
> >> >> +
> >> >> + if (!blk_queue_is_zoned(q))
> >> >> + return -ENOTTY;
> >> >> +
> >> >> + if (!capable(CAP_SYS_ADMIN))
> >> >> + return -EACCES;
> >> >> +
> >> >> + if (!(mode & FMODE_WRITE))
> >> >> + return -EBADF;
> >> >> +
> >> >> + if (!queue_zone_desc_ext_bytes(q))
> >> >> + return -EOPNOTSUPP;
> >> >> +
> >> >> + if (copy_from_user(&zsd, argp, sizeof(struct blk_zone_set_desc)))
> >> >> + return -EFAULT;
> >> >> +
> >> >> + /* no flags is currently supported */
> >> >> + if (zsd.flags)
> >> >> + return -ENOTTY;
> >> >> +
> >> >> + if (!zsd.len || zsd.len > queue_zone_desc_ext_bytes(q))
> >> >> + return -ENOTTY;
> >> >
> >> >This should go into blkdev_zone_set_desc() as well so that in-kernel
> >> >users are checked. So there may be no need to check this here.
> >> >
> >> >> +
> >> >> + /* allocate the size of the zone descriptor extension and fill
> >> >> + * with the data in the user data buffer. If the data size is less
> >> >> + * than the zone descriptor extension size, then the rest of the
> >> >> + * zone description extension data buffer is zero-filled.
> >> >> + */
> >> >> + zsd_data = (void *) get_zeroed_page(GFP_KERNEL);
> >> >> + if (!zsd_data)
> >> >> + return -ENOMEM;
> >> >> +
> >> >> + if (copy_from_user(zsd_data, argp + sizeof(struct blk_zone_set_desc),
> >> >> + zsd.len)) {
> >> >> + ret = -EFAULT;
> >> >> + goto free;
> >> >> + }
> >> >> +
> >> >> + ret = blkdev_zone_set_desc(bdev, zsd.sector, virt_to_page(zsd_data),
> >> >> + GFP_KERNEL);
> >> >> +free:
> >> >> + free_page((unsigned long) zsd_data);
> >> >> + return ret;
> >> >> +}
> >> >> +
> >> >> static inline unsigned long *blk_alloc_zone_bitmap(int node,
> >> >> unsigned int nr_zones)
> >> >> {
> >> >> diff --git a/block/ioctl.c b/block/ioctl.c index
> >> >> bdb3bbb253d9..b9744705835b 100644
> >> >> --- a/block/ioctl.c
> >> >> +++ b/block/ioctl.c
> >> >> @@ -515,6 +515,8 @@ static int blkdev_common_ioctl(struct
> >> >> block_device
> >> *bdev, fmode_t mode,
> >> >> case BLKCLOSEZONE:
> >> >> case BLKFINISHZONE:
> >> >> return blkdev_zone_mgmt_ioctl(bdev, mode, cmd, arg);
> >> >> + case BLKSETDESCZONE:
> >> >> + return blkdev_zone_set_desc_ioctl(bdev, mode, cmd, arg);
> >> >> case BLKGETZONESZ:
> >> >> return put_uint(argp, bdev_zone_sectors(bdev));
> >> >> case BLKGETNRZONES:
> >> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >> >> index e961910da4ac..b8f25b0d00ad 100644
> >> >> --- a/drivers/nvme/host/core.c
> >> >> +++ b/drivers/nvme/host/core.c
> >> >> @@ -776,6 +776,9 @@ blk_status_t nvme_setup_cmd(struct nvme_ns
> >> >> *ns,
> >> struct request *req,
> >> >> case REQ_OP_ZONE_FINISH:
> >> >> ret = nvme_setup_zone_mgmt_send(ns, req, cmd,
> >> NVME_ZONE_FINISH);
> >> >> break;
> >> >> + case REQ_OP_ZONE_SET_DESC:
> >> >> + ret = nvme_setup_zone_set_desc(ns, req, cmd);
> >> >> + break;
> >> >> case REQ_OP_WRITE_ZEROES:
> >> >> ret = nvme_setup_write_zeroes(ns, req, cmd);
> >> >> break;
> >> >> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> >> >> index 662f95fbd909..5bd5a437b038 100644
> >> >> --- a/drivers/nvme/host/nvme.h
> >> >> +++ b/drivers/nvme/host/nvme.h
> >> >> @@ -708,6 +708,9 @@ int nvme_report_zones(struct gendisk *disk,
> >> >> sector_t sector, blk_status_t nvme_setup_zone_mgmt_send(struct
> >> nvme_ns *ns, struct request *req,
> >> >> struct nvme_command *cmnd,
> >> >> enum nvme_zone_mgmt_action action);
> >> >> +
> >> >> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct
> >> request *req,
> >> >> + struct nvme_command *cmnd);
> >> >> #else
> >> >> #define nvme_report_zones NULL
> >> >>
> >> >> @@ -718,6 +721,12 @@ static inline blk_status_t
> >> nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
> >> >> return BLK_STS_NOTSUPP;
> >> >> }
> >> >>
> >> >> +static inline blk_status_t nvme_setup_zone_set_desc(struct nvme_ns
> *ns,
> >> >> + struct request *req, struct nvme_command *cmnd) {
> >> >> + return BLK_STS_NOTSUPP;
> >> >> +}
> >> >> +
> >> >> static inline int nvme_update_zone_info(struct gendisk *disk,
> >> >> struct nvme_ns *ns,
> >> >> unsigned lbaf)
> >> >> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> >> >> index
> >> >> 5792d953a8f3..bfa64cc685d3 100644
> >> >> --- a/drivers/nvme/host/zns.c
> >> >> +++ b/drivers/nvme/host/zns.c
> >> >> @@ -239,3 +239,14 @@ blk_status_t
> >> nvme_setup_zone_mgmt_send(struct
> >> >> nvme_ns *ns, struct request *req,
> >> >>
> >> >> return BLK_STS_OK;
> >> >> }
> >> >> +
> >> >> +blk_status_t nvme_setup_zone_set_desc(struct nvme_ns *ns, struct
> >> request *req,
> >> >> + struct nvme_command *c)
> >> >> +{
> >> >> + c->zms.opcode = nvme_cmd_zone_mgmt_send;
> >> >> + c->zms.nsid = cpu_to_le32(ns->head->ns_id);
> >> >> + c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
> >> >> + c->zms.action = NVME_ZONE_SET_DESC_EXT;
> >> >> +
> >> >> + return BLK_STS_OK;
> >> >> +}
> >> >> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> >> >> index ccb895f911b1..53b7b05b0004 100644
> >> >> --- a/include/linux/blk_types.h
> >> >> +++ b/include/linux/blk_types.h
> >> >> @@ -316,6 +316,8 @@ enum req_opf {
> >> >> REQ_OP_ZONE_FINISH = 12,
> >> >> /* write data at the current zone write pointer */
> >> >> REQ_OP_ZONE_APPEND = 13,
> >> >> + /* associate zone desc extension data to a zone */
> >> >> + REQ_OP_ZONE_SET_DESC = 14,
> >> >>
> >> >> /* SCSI passthrough using struct scsi_request */
> >> >> REQ_OP_SCSI_IN = 32,
> >> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index
> >> >> 2ed55055f68d..c5f092dd5aa3 100644
> >> >> --- a/include/linux/blkdev.h
> >> >> +++ b/include/linux/blkdev.h
> >> >> @@ -370,7 +370,8 @@ extern int blkdev_report_zones_ioctl(struct
> >> block_device *bdev, fmode_t mode,
> >> >> unsigned int cmd, unsigned long arg);
> >> extern int
> >> >> blkdev_zone_mgmt_ioctl(struct block_device *bdev, fmode_t mode,
> >> >> unsigned int cmd, unsigned long arg);
> >> >> -
> >> >> +extern int blkdev_zone_set_desc_ioctl(struct block_device *bdev,
> >> fmode_t mode,
> >> >> + unsigned int cmd, unsigned long arg);
> >> >> #else /* CONFIG_BLK_DEV_ZONED */
> >> >>
> >> >> static inline unsigned int blkdev_nr_zones(struct gendisk *disk)
> >> >> @@
> >> >> -392,6 +393,12 @@ static inline int blkdev_zone_mgmt_ioctl(struct
> >> block_device *bdev,
> >> >> return -ENOTTY;
> >> >> }
> >> >>
> >> >> +static inline int blkdev_zone_set_desc_ioctl(struct block_device *bdev,
> >> >> + fmode_t mode, unsigned int cmd,
> >> >> + unsigned long arg)
> >> >> +{
> >> >> + return -ENOTTY;
> >> >> +}
> >> >> #endif /* CONFIG_BLK_DEV_ZONED */
> >> >>
> >> >> struct request_queue {
> >> >> diff --git a/include/uapi/linux/blkzoned.h
> >> >> b/include/uapi/linux/blkzoned.h index 42c3366cc25f..68abda9abf33
> >> >> 100644
> >> >> --- a/include/uapi/linux/blkzoned.h
> >> >> +++ b/include/uapi/linux/blkzoned.h
> >> >> @@ -142,6 +142,20 @@ struct blk_zone_range {
> >> >> __u64 nr_sectors;
> >> >> };
> >> >>
> >> >> +/**
> >> >> + * struct blk_zone_set_desc - BLKSETDESCZONE ioctl requests
> >> >> + * @sector: Starting sector of the zone to operate on.
> >> >> + * @flags: Feature flags.
> >> >> + * @len: size, in bytes, of the data to be associated to the zone.
> >> >> + * @data: data to be associated.
> >> >> + */
> >> >> +struct blk_zone_set_desc {
> >> >> + __u64 sector;
> >> >> + __u32 flags;
> >> >> + __u32 len;
> >> >> + __u8 data[0];
> >> >> +};
> >>
> >> Would it make sense to add nr_sectors if the host wants to associate
> >> the same metadata to several zones. The use case would be the
> >> grouping of larger zones in software.
> >
> >I believe we get into atomicity concerns if we do ranges, and action
> >only supports a single zone. I like to align with the ZNS API as much
> >as possible, and let the user-space application track errors per set
> >desc ext action.
>
> The atomicity concerns should be the same as current zone management
> using nr_sectors, and these are already being supported for open, close, etc.
>
> TBH, the comment is most about making sure that the IOCTL is extendable
> from conception.
>
> >
> >>
> >> >> +
> >> >> /**
> >> >> * Zoned block device ioctl's:
> >> >> *
> >> >> @@ -158,6 +172,10 @@ struct blk_zone_range {
> >> >> * The 512 B sector range must be zone aligned.
> >> >> * @BLKFINISHZONE: Mark the zones as full in the specified sector
> range.
> >> >> * The 512 B sector range must be zone aligned.
> >> >> + * @BLKSETDESCZONE: Set zone description extension data for the
> zone
> >> >> + * in the specified sector. On success, the zone
> >> >> + * will transition to the closed zone state.
> >> >> + * The 512B sector must be zone aligned.
> >> >> */
> >> >> #define BLKREPORTZONE _IOWR(0x12, 130, struct blk_zone_report)
> >> >> #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)
> >> >> @@ -166,5 +184,5 @@ struct blk_zone_range {
> >> >> #define BLKOPENZONE _IOW(0x12, 134, struct blk_zone_range)
> >> >> #define BLKCLOSEZONE _IOW(0x12, 135, struct blk_zone_range)
> >> >> #define BLKFINISHZONE _IOW(0x12, 136, struct blk_zone_range)
> >> >> -
> >> >> +#define BLKSETDESCZONE _IOW(0x12, 137, struct
> blk_zone_set_desc)
> >> >> #endif /* _UAPI_BLKZONED_H */
> >> >>
> >> >
> >> >How to you retreive an extended descriptor that was set ? I do not
> >> >see any code doing that. Report zones is not changed, but I would
> >> >leave that one as is and implement a BLKGETDESCZONE ioctl & in-kernel
> API.
> >> >
> >>
> >> In any case, this is needed. I also could not find a way to read the
> >> extended descriptor back.
> >
> >Thanks for the feedback.
> >
> >Besides the users I have in mind, do you have users for which you need
> >the ability for user-space to access zone descriptor extensions data?
> >Is this a need to have or nice to have feature from your point of view?
>
> At this moment most of the users we have in mind are user-space applications
> that are zone-aware, so I would way this is necessary.

Thanks - I appreciate the feedback.

>
> This said, I understand if you prioritize pushing the code that enables in-kernel
> users and then re-iterate on this.
>
> Thanks,
> Javier