2020-08-18 05:33:54

by Kanchan Joshi

Subject: [PATCH 0/2] enable append-emulation for ZNS

Currently the NVMe driver rejects a ZNS device if zone-append is not
supported natively.
Make it accept the device and activate append-emulation instead. This
is mostly borrowed from the SCSI zone-append emulation.
The other patch enforces a zone-friendly I/O scheduler for ZNS.

Kanchan Joshi (2):
nvme: set io-scheduler requirement for ZNS
nvme: add emulation for zone-append

drivers/nvme/host/core.c | 41 +++++-
drivers/nvme/host/nvme.h | 60 ++++++++
drivers/nvme/host/zns.c | 307 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 399 insertions(+), 9 deletions(-)

--
2.17.1


2020-08-18 05:34:57

by Kanchan Joshi

Subject: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.

Signed-off-by: Kanchan Joshi <[email protected]>
---
drivers/nvme/host/zns.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 57cfd78731fb..cabd870fb64e 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -96,6 +96,7 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,

q->limits.zoned = BLK_ZONED_HM;
blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
+ blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE);
blk_queue_max_open_zones(q, le32_to_cpu(id->mor) + 1);
blk_queue_max_active_zones(q, le32_to_cpu(id->mar) + 1);
free_data:
--
2.17.1

2020-08-18 05:35:47

by Kanchan Joshi

Subject: [PATCH 2/2] nvme: add emulation for zone-append

If drive does not support zone-append natively, enable emulation using
regular write.
Make emulated zone-append cmd write-lock the zone, preventing
concurrent append/write on the same zone.

To determine the start LBA for such writes, an array of 32-bit
zone-relative write-pointer (WP) positions is attached to the namespace.
The cached WP position is updated on successful completion as follows:
- APPEND/WRITE/WRITE_ZEROES/WRITE_SAME advance it by the number of
512-byte sectors written.
- ZONE_RESET sets it to 0 for the target zone; ZONE_RESET_ALL does the
same for all zones.
- ZONE_FINISH sets it to the zone size.

On failed completion of the above requests, the cached WP position of the
target zone is marked invalid. On a subsequent zone-append to that zone,
the WP position is refreshed by querying the device (i.e. a zone report).

If an emulated append cannot proceed immediately due to the zone write-lock
or an invalid WP position, the block layer is asked to retry it.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Javier Gonzalez <[email protected]>
---
drivers/nvme/host/core.c | 41 +++++-
drivers/nvme/host/nvme.h | 60 ++++++++
drivers/nvme/host/zns.c | 306 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 398 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 88cff309d8e4..78faddf444c3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -287,10 +287,17 @@ void nvme_complete_rq(struct request *req)
nvme_retry_req(req);
return;
}
- } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
- req_op(req) == REQ_OP_ZONE_APPEND) {
- req->__sector = nvme_lba_to_sect(req->q->queuedata,
- le64_to_cpu(nvme_req(req)->result.u64));
+ } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+ bool need_wp_offset_update = false;
+ struct nvme_ns *ns = req->q->queuedata;
+ /* append-emulation requires a wp update for some commands */
+ if (ns && nvme_is_append_emulated(ns))
+ need_wp_offset_update = nvme_need_zone_wp_update(req);
+ if (need_wp_offset_update)
+ nvme_zone_wp_update(ns, req, status);
+ else if (req_op(req) == REQ_OP_ZONE_APPEND)
+ req->__sector = nvme_lba_to_sect(ns,
+ le64_to_cpu(nvme_req(req)->result.u64));
}

nvme_trace_bio_complete(req, status);
@@ -456,6 +463,8 @@ static void nvme_free_ns(struct kref *kref)
{
struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);

+ if (nvme_is_append_emulated(ns))
+ nvme_teardown_append_emulate(ns);
if (ns->ndev)
nvme_nvm_unregister(ns);

@@ -809,7 +818,15 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
break;
case REQ_OP_ZONE_APPEND:
- ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+ if (!nvme_is_append_emulated(ns))
+ ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+ else {
+ /* prepare append like write, and adjust lba afterwards */
+ ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+ if (ret)
+ break;
+ ret = nvme_append_to_write(ns, req, cmd);
+ }
break;
default:
WARN_ON_ONCE(1);
@@ -2150,7 +2167,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
struct nvme_ns *ns = disk->private_data;
struct nvme_ctrl *ctrl = ns->ctrl;

- ret = blk_revalidate_disk_zones(disk, NULL);
+ ret = nvme_revalidate_disk_zones(disk);
if (!ret)
blk_queue_max_zone_append_sectors(disk->queue,
ctrl->max_zone_append);
@@ -3900,6 +3917,18 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
if (__nvme_revalidate_disk(disk, id))
goto out_put_disk;

+ /* setup append-emulation if required */
+ if (nvme_is_append_emulated(ns)) {
+ ret = nvme_setup_append_emulate(ns);
+ if (ret) {
+ dev_warn(ns->ctrl->device,
+ "append-emulation failed, zoned namespace:%d\n",
+ ns->head->ns_id);
+ nvme_clear_append_emulated(ns);
+ goto out_put_disk;
+ }
+ }
+
if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
ret = nvme_nvm_register(ns, disk_name, node);
if (ret) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ebb8c3ed3885..c84d418fb001 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -421,6 +421,19 @@ enum nvme_ns_features {
NVME_NS_METADATA_SUPPORTED = 1 << 1, /* support getting generated md */
};

+#ifdef CONFIG_BLK_DEV_ZONED
+struct nvme_za_emul {
+ unsigned int nr_zones;
+ spinlock_t zones_wp_offset_lock;
+ u32 *zones_wp_offset;
+ u32 *rev_wp_offset;
+ struct work_struct zone_wp_offset_work;
+ char *zone_wp_update_buf;
+ struct mutex rev_mutex;
+ struct nvme_ns *ns;
+};
+#endif
+
struct nvme_ns {
struct list_head list;

@@ -443,6 +456,10 @@ struct nvme_ns {
u8 pi_type;
#ifdef CONFIG_BLK_DEV_ZONED
u64 zsze;
+ /* set if append needs to be emulated */
+ u8 append_emulate;
+ /* contains all other append-emulation fields */
+ struct nvme_za_emul *za_emul;
#endif
unsigned long features;
unsigned long flags;
@@ -759,9 +776,52 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector,
blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,
struct nvme_command *cmnd,
enum nvme_zone_mgmt_action action);
+
+int nvme_revalidate_disk_zones(struct gendisk *disk);
+/* append-emulation only helpers */
+int nvme_setup_append_emulate(struct nvme_ns *ns);
+void nvme_teardown_append_emulate(struct nvme_ns *ns);
+blk_status_t nvme_append_to_write(struct nvme_ns *ns, struct request *req,
+ struct nvme_command *cmd);
+bool nvme_need_zone_wp_update(struct request *rq);
+void nvme_zone_wp_update(struct nvme_ns *ns, struct request *rq,
+ blk_status_t status);
+void nvme_set_append_emulated(struct nvme_ns *ns);
+void nvme_clear_append_emulated(struct nvme_ns *ns);
+int nvme_is_append_emulated(struct nvme_ns *ns);
#else
#define nvme_report_zones NULL

+static inline void nvme_set_append_emulated(struct nvme_ns *ns) {}
+
+static inline void nvme_clear_append_emulated(struct nvme_ns *ns) {}
+
+static inline int nvme_is_append_emulated(struct nvme_ns *ns)
+{
+ return 0;
+}
+
+static inline int nvme_setup_append_emulate(struct nvme_ns *ns)
+{
+ return 0;
+}
+
+static inline void nvme_teardown_append_emulate(struct nvme_ns *ns) {}
+
+static inline blk_status_t nvme_append_to_write(struct nvme_ns *ns, struct request *req,
+ struct nvme_command *cmd)
+{
+ return BLK_STS_NOTSUPP;
+}
+
+static inline bool nvme_need_zone_wp_update(struct request *rq)
+{
+ return false;
+}
+
+static inline void nvme_zone_wp_update(struct nvme_ns *ns, struct request *rq,
+ blk_status_t status) {}
+
static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns,
struct request *req, struct nvme_command *cmnd,
enum nvme_zone_mgmt_action action)
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index cabd870fb64e..0b1e9f62045a 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -7,6 +7,10 @@
#include <linux/vmalloc.h>
#include "nvme.h"

+/* used for append-emulation */
+#define ZNS_INVALID_WP_OFST (~0u)
+#define ZNS_UPDATING_WP_OFST (ZNS_INVALID_WP_OFST - 1)
+
static int nvme_set_max_append(struct nvme_ctrl *ctrl)
{
struct nvme_command c = { };
@@ -44,13 +48,14 @@ int nvme_update_zone_info(struct gendisk *disk, struct nvme_ns *ns,
struct nvme_id_ns_zns *id;
int status;

- /* Driver requires zone append support */
+ /* Driver does append-emulation if drive does not support zone-append */
if (!(le32_to_cpu(log->iocs[nvme_cmd_zone_append]) &
NVME_CMD_EFFECTS_CSUPP)) {
dev_warn(ns->ctrl->device,
- "append not supported for zoned namespace:%d\n",
+ "append is emulated for zoned namespace:%d\n",
ns->head->ns_id);
- return -EINVAL;
+ /* activate append-emulation */
+ nvme_set_append_emulated(ns);
}

/* Lazily query controller append limit for the first zoned namespace */
@@ -255,3 +260,298 @@ blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req,

return BLK_STS_OK;
}
+
+static void nvme_revalidate_zones_cb(struct gendisk *disk)
+{
+ struct nvme_ns_head *head = NULL;
+ struct nvme_ns *ns;
+ int srcu_idx;
+
+ ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx);
+ swap(ns->za_emul->zones_wp_offset, ns->za_emul->rev_wp_offset);
+ nvme_put_ns_from_disk(head, srcu_idx);
+}
+
+inline int nvme_is_append_emulated(struct nvme_ns *ns)
+{
+ return (ns->append_emulate == 1);
+}
+
+inline void nvme_set_append_emulated(struct nvme_ns *ns)
+{
+ ns->append_emulate = 1;
+}
+
+inline void nvme_clear_append_emulated(struct nvme_ns *ns)
+{
+ ns->append_emulate = 0;
+}
+
+int nvme_revalidate_disk_zones(struct gendisk *disk)
+{
+ int ret = 0;
+ struct nvme_ns *ns = disk->private_data;
+
+ if (!nvme_is_append_emulated(ns))
+ ret = blk_revalidate_disk_zones(disk, NULL);
+ else {
+ struct nvme_za_emul *za_emul = ns->za_emul;
+ unsigned int nr_zones;
+
+ /* serialize multiple revalidate calls */
+ mutex_lock(&za_emul->rev_mutex);
+ nr_zones = get_capacity(disk) >> ilog2(ns->zsze);
+
+ /* avoid rescan zones if possible */
+ if (nr_zones == za_emul->nr_zones &&
+ disk->queue->nr_zones == nr_zones) {
+ mutex_unlock(&za_emul->rev_mutex);
+ goto out;
+ }
+ za_emul->rev_wp_offset = kvcalloc(nr_zones,
+ sizeof(u32), GFP_NOIO);
+ if (!za_emul->rev_wp_offset) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+ ret = blk_revalidate_disk_zones(disk,
+ nvme_revalidate_zones_cb);
+ /* rev_wp_offset has been swapped with zones_wp_offset */
+ kvfree(za_emul->rev_wp_offset);
+ za_emul->rev_wp_offset = NULL;
+unlock:
+ mutex_unlock(&za_emul->rev_mutex);
+ }
+out:
+ return ret;
+}
+
+static unsigned int nvme_get_zone_wp_offset(struct blk_zone *zone)
+{
+ switch (zone->cond) {
+ case BLK_ZONE_COND_IMP_OPEN:
+ case BLK_ZONE_COND_EXP_OPEN:
+ case BLK_ZONE_COND_CLOSED:
+ return zone->wp - zone->start;
+ case BLK_ZONE_COND_FULL:
+ return zone->len;
+ case BLK_ZONE_COND_EMPTY:
+ case BLK_ZONE_COND_OFFLINE:
+ case BLK_ZONE_COND_READONLY:
+ default:
+ /*
+ * Offline and read-only zones do not have a valid
+ * write pointer. Use 0 as for an empty zone.
+ */
+ return 0;
+ }
+}
+
+static int nvme_update_wp_offset_cb(struct blk_zone *zone, unsigned int idx,
+ void *data)
+{
+ struct nvme_za_emul *za_emul = data;
+
+ lockdep_assert_held(&za_emul->zones_wp_offset_lock);
+ za_emul->zones_wp_offset[idx] = nvme_get_zone_wp_offset(zone);
+ return 0;
+}
+
+static void nvme_update_wp_offset_workfn(struct work_struct *work)
+{
+ struct nvme_za_emul *za_emul;
+ struct nvme_ns *ns;
+ unsigned int zno;
+ unsigned long flags;
+ struct nvme_zone_report *report;
+ int buflen, ret;
+
+ buflen = sizeof(struct nvme_zone_report) +
+ sizeof(struct nvme_zone_descriptor);
+ za_emul = container_of(work, struct nvme_za_emul, zone_wp_offset_work);
+ ns = za_emul->ns;
+
+ spin_lock_irqsave(&za_emul->zones_wp_offset_lock, flags);
+
+ for (zno = 0; zno < za_emul->nr_zones; zno++) {
+ if (za_emul->zones_wp_offset[zno] != ZNS_UPDATING_WP_OFST)
+ continue;
+ spin_unlock_irqrestore(&za_emul->zones_wp_offset_lock, flags);
+
+ report = (struct nvme_zone_report *)za_emul->zone_wp_update_buf;
+ memset(report, 0, buflen);
+ ret = __nvme_ns_report_zones(ns, (zno * ns->zsze),
+ report,
+ buflen);
+
+ spin_lock_irqsave(&za_emul->zones_wp_offset_lock, flags);
+ if (ret > 0)
+ nvme_zone_parse_entry(ns, &report->entries[0],
+ zno, nvme_update_wp_offset_cb,
+ za_emul);
+ }
+ spin_unlock_irqrestore(&za_emul->zones_wp_offset_lock, flags);
+ /* remove the reference obtained earlier */
+ nvme_put_ns(ns);
+}
+
+blk_status_t nvme_append_to_write(struct nvme_ns *ns, struct request *req,
+ struct nvme_command *cmd)
+{
+ blk_status_t ret = 0;
+ struct nvme_za_emul *za_emul = ns->za_emul;
+ unsigned int nr_sectors = (blk_rq_bytes(req) >> SECTOR_SHIFT);
+ unsigned int wp_offset, zno = blk_rq_zone_no(req);
+ sector_t lba = blk_rq_pos(req);
+ unsigned long flags;
+
+ if (!blk_req_zone_write_trylock(req))
+ return BLK_STS_RESOURCE;
+
+ spin_lock_irqsave(&za_emul->zones_wp_offset_lock, flags);
+ wp_offset = za_emul->zones_wp_offset[zno];
+ switch (wp_offset) {
+ case ZNS_INVALID_WP_OFST:
+ /*
+ * update zone wp-offset in a deferred worker.
+ * postpone processing current request until worker manages
+ * to refresh wp by querying from device.
+ */
+ kref_get(&ns->kref);
+ za_emul->zones_wp_offset[zno] = ZNS_UPDATING_WP_OFST;
+ queue_work(nvme_wq, &za_emul->zone_wp_offset_work);
+ fallthrough;
+ case ZNS_UPDATING_WP_OFST:
+ ret = BLK_STS_RESOURCE;
+ break;
+ default:
+ if (wp_offset + nr_sectors > ns->zsze) {
+ ret = BLK_STS_IOERR;
+ break;
+ }
+ lba += wp_offset;
+ }
+ spin_unlock_irqrestore(&za_emul->zones_wp_offset_lock, flags);
+ /* unlock zone in case of error, update lba otherwise */
+ if (ret)
+ blk_req_zone_write_unlock(req);
+ else
+ cmd->rw.slba = cpu_to_le64(nvme_sect_to_lba(ns, lba));
+ return ret;
+}
+
+bool nvme_need_zone_wp_update(struct request *rq)
+{
+ switch (req_op(rq)) {
+ case REQ_OP_ZONE_APPEND:
+ case REQ_OP_ZONE_FINISH:
+ case REQ_OP_ZONE_RESET:
+ case REQ_OP_ZONE_RESET_ALL:
+ return true;
+ case REQ_OP_WRITE:
+ case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_WRITE_SAME:
+ return blk_rq_zone_is_seq(rq);
+ default:
+ return false;
+ }
+}
+
+void nvme_zone_wp_update(struct nvme_ns *ns, struct request *rq,
+ blk_status_t status)
+{
+ struct nvme_za_emul *za_emul = ns->za_emul;
+ unsigned long flags;
+ unsigned int zno = blk_rq_zone_no(rq);
+ enum req_opf op = req_op(rq);
+ unsigned int res_bytes = blk_rq_bytes(rq);
+
+ spin_lock_irqsave(&za_emul->zones_wp_offset_lock, flags);
+ /*
+ * Failure handling first, mark wp_offset invalid.
+ * This will force updating wp from device on subsequent access
+ */
+ if (status) {
+ if (op != REQ_OP_ZONE_RESET_ALL) {
+ if (za_emul->zones_wp_offset[zno] !=
+ ZNS_UPDATING_WP_OFST)
+ za_emul->zones_wp_offset[zno] = ZNS_INVALID_WP_OFST;
+
+ } else
+ memset(za_emul->zones_wp_offset, ZNS_INVALID_WP_OFST,
+ za_emul->nr_zones * sizeof(unsigned int));
+ goto unlock;
+ }
+ /* success case handling, update wp-offset */
+ switch (op) {
+ case REQ_OP_ZONE_APPEND:
+ rq->__sector += za_emul->zones_wp_offset[zno];
+ fallthrough;
+ case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_WRITE_SAME:
+ case REQ_OP_WRITE:
+ /* every write should update the wp_offset */
+ if (za_emul->zones_wp_offset[zno] < ns->zsze)
+ za_emul->zones_wp_offset[zno] +=
+ res_bytes >> SECTOR_SHIFT;
+ break;
+ case REQ_OP_ZONE_RESET:
+ za_emul->zones_wp_offset[zno] = 0;
+ break;
+ case REQ_OP_ZONE_FINISH:
+ za_emul->zones_wp_offset[zno] = ns->zsze;
+ break;
+ case REQ_OP_ZONE_RESET_ALL:
+ memset(za_emul->zones_wp_offset, 0,
+ za_emul->nr_zones * sizeof(unsigned int));
+ break;
+ default:
+ break;
+ }
+unlock:
+ spin_unlock_irqrestore(&za_emul->zones_wp_offset_lock, flags);
+ /* release zone write-lock for append */
+ if (op == REQ_OP_ZONE_APPEND)
+ blk_req_zone_write_unlock(rq);
+}
+
+int nvme_setup_append_emulate(struct nvme_ns *ns)
+{
+ struct nvme_za_emul *za_emul;
+ size_t bufsize;
+
+ WARN_ON(ns->za_emul);
+ za_emul = kmalloc(sizeof(struct nvme_za_emul), GFP_KERNEL);
+ if (!za_emul)
+ return -ENOMEM;
+
+ za_emul->zones_wp_offset = NULL;
+ spin_lock_init(&za_emul->zones_wp_offset_lock);
+ za_emul->rev_wp_offset = NULL;
+ mutex_init(&za_emul->rev_mutex);
+ INIT_WORK(&za_emul->zone_wp_offset_work, nvme_update_wp_offset_workfn);
+ /* preallocate buffer for single zone-report */
+ bufsize = sizeof(struct nvme_zone_report) +
+ sizeof(struct nvme_zone_descriptor);
+ za_emul->zone_wp_update_buf = kzalloc(bufsize, GFP_KERNEL);
+ if (!za_emul->zone_wp_update_buf) {
+ kfree(za_emul);
+ return -ENOMEM;
+ }
+ za_emul->nr_zones = get_capacity(ns->disk) >> ilog2(ns->zsze);
+
+ ns->za_emul = za_emul;
+ za_emul->ns = ns;
+
+ return 0;
+}
+
+void nvme_teardown_append_emulate(struct nvme_ns *ns)
+{
+ WARN_ON(!ns->za_emul);
+ kvfree(ns->za_emul->zones_wp_offset);
+ kfree(ns->za_emul->zone_wp_update_buf);
+ ns->za_emul->zones_wp_offset = NULL;
+ ns->za_emul->rev_wp_offset = NULL;
+ kfree(ns->za_emul);
+}
--
2.17.1

2020-08-18 07:12:57

by Christoph Hellwig

Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.

No, it is not.

2020-08-18 07:15:52

by Christoph Hellwig

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
> If drive does not support zone-append natively, enable emulation using
> regular write.
> Make emulated zone-append cmd write-lock the zone, preventing
> concurrent append/write on the same zone.

I really don't think we should add this. ZNS and the Linux support
were all designed with Zone Append in mind, and then your company did
the nastiest possible move violating the normal NVMe procedures to make
it optional. But that doesn't change the fact that Linux should keep
requiring it, especially with the amount of code added here and how it
hooks in the fast path.

2020-08-18 09:52:21

by Javier González

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 18.08.2020 09:12, Christoph Hellwig wrote:
>On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>> If drive does not support zone-append natively, enable emulation using
>> regular write.
>> Make emulated zone-append cmd write-lock the zone, preventing
>> concurrent append/write on the same zone.
>
>I really don't think we should add this. ZNS and the Linux support
>were all designed with Zone Append in mind, and then your company did
>the nastiest possible move violating the normal NVMe procedures to make
>it optional. But that doesn't change the fact that Linux should keep
>requiring it, especially with the amount of code added here and how it
>hooks in the fast path.

I understand that the NVMe process was agitated and that the current ZNS
implementation in Linux relies on append support from the device
perspective. However, the current TP does allow for not implementing
append, and a number of customers are requiring the use of normal
writes, which we want to support.

During the initial patch review we discussed this and we agreed that the
block layer is designed for append on zone devices, and that for the
foreseeable future this was not going to change. We therefore took the
feedback and followed a similar approach as in the SCSI driver for
implementing append emulation.

We are happy to do more characterization on the impact of these hooks in
the non-zoned fast path and eventually change the approach if this
proves to be a problem. Our thought is to isolate any potential
performance degradation to the zoned path using the emulation (we do not
see any ATM).

Do you have any early suggestion on how this patch should look
to be upstreamable?

Javier

2020-08-18 10:54:17

by Matias Bjørling

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 18/08/2020 11.50, Javier Gonzalez wrote:
> On 18.08.2020 09:12, Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>>> If drive does not support zone-append natively, enable emulation using
>>> regular write.
>>> Make emulated zone-append cmd write-lock the zone, preventing
>>> concurrent append/write on the same zone.
>>
>> I really don't think we should add this.  ZNS and the Linux support
>> were all designed with Zone Append in mind, and then your company did
>> the nastiest possible move violating the normal NVMe procedures to make
>> it optional.  But that doesn't change the fact that Linux should keep
>> requiring it, especially with the amount of code added here and how it
>> hooks in the fast path.
>
> I understand that the NVMe process was agitated and that the current ZNS
> implementation in Linux relies on append support from the device
> perspective. However, the current TP does allow for not implementing
> append, and a number of customers are requiring the use of normal
> writes, which we want to support.

There are a lot of things that are specified in NVMe, but not implemented
in the Linux kernel. That your company is not able to efficiently
implement the Zone Append command (this is the only reason I can think
of that makes you and your company cause such a fuss), shouldn't mean
that everyone else has to suffer.

In any case, SPDK offers adequate support and can be used today.

2020-08-18 15:52:21

by Christoph Hellwig

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
> I understand that the NVMe process was agitated and that the current ZNS
> implementation in Linux relies on append support from the device
> perspective. However, the current TP does allow for not implementing
> append, and a number of customers are requiring the use of normal
> writes, which we want to support.

The NVMe TPs allow for lots of things, but that doesn't mean we have
to support it.

> Do you have any early suggestion on how this patch should look
> to be upstreamable?

My position is that at this point in time we should not consider it.
Zone Append is the major feature in ZNS that solves the issue in ZAC/ZBC.
I want to see broad industry support for it instead of having to add more
code just for zone append emulation than for the actual current ZNS support. If
in a few years the market place has decided and has lots of drives
available in the consumer market or OEM channels we'll have to reconsider
and potentially merge Zone Append emulation. But my deep hope is that
this does not happen, as it sets us back 10 years in the standards of
zoned storage support again.

2020-08-18 17:00:35

by Keith Busch

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
> a number of customers are requiring the use of normal writes, which we
> want to support.

A device that supports append is completely usable for those customers,
too. There's no need to create divergence in this driver.

2020-08-18 17:32:32

by Javier González

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 18.08.2020 09:58, Keith Busch wrote:
>On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
>> a number of customers are requiring the use of normal writes, which we
>> want to support.
>
>A device that supports append is completely usable for those customers,
>too. There's no need to create divergence in this driver.

Not really. You know as well as I do that some features are disabled for
a particular SSD model on customer requirements. Generic models
implementing append can submit both I/Os, but those that remove append
are left out.

I would like to understand how we can enable these NVMe-compatible
models in Linux. If it is a performance concern, we will address it.

2020-08-18 17:40:50

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Tue, Aug 18, 2020 at 07:29:12PM +0200, Javier Gonzalez wrote:
> On 18.08.2020 09:58, Keith Busch wrote:
> > On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
> > > a number of customers are requiring the use of normal writes, which we
> > > want to support.
> >
> > A device that supports append is completely usable for those customers,
> > too. There's no need to create divergence in this driver.
>
> Not really. You know as well as I do that some features are disabled for
> a particular SSD model on customer requirements. Generic models
> implementing append can submit both I/Os, but those that remove append
> are left out.

You are only supporting my point: if your device supports append, you
get to work in every ZNS use case, otherwise you only get to work in a
subset.

2020-08-18 18:07:49

by Javier González

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 18.08.2020 17:50, Christoph Hellwig wrote:
>On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
>> I understand that the NVMe process was agitated and that the current ZNS
>> implementation in Linux relies on append support from the device
>> perspective. However, the current TP does allow for not implementing
>> append, and a number of customers are requiring the use of normal
>> writes, which we want to support.
>
>The NVMe TPs allow for lots of things, but that doesn't mean we have
>to support it.

Agree. As I replied to Keith, I am just interested in enabling the ZNS
models that come with append disabled.

>
>> Do you have any early suggestion on how this patch should look
>> to be upstreamable?
>
>My position is that at this point in time we should not consider it.
>Zone Append is the major feature in ZNS that solves the issue in ZAC/ZBC.
>I want to see broad industry support for it instead of having to add more
>code just for zone append emulation than for the actual current ZNS support. If
>in a few years the market place has decided and has lots of drives
>available in the consumer market or OEM channels we'll have to reconsider
>and potentially merge Zone Append emulation. But my deep hope is that
>this does not happen, as it sets us back 10 years in the standards of
>zoned storage support again.

I understand that you want vendor alignment in the NVMe driver and I
agree. We are not pushing for a non-append model - you can see that we
are investing effort in implementing the append path in the block layer
and io_uring and we will continue doing so as patches get merged.

This said, we do have some OEM models that do not implement append and I
would like them to be supported in Linux. As you know, new TPs are being
standardized now and the append emulation is the basis for adding
support for them. I do not believe it is unreasonable to find a way to
add support for these SSDs.

If you completely close the door on this approach, the alternative is
carrying off-tree patches to the several OEMs that use these devices.
This is not good for the zoned ecosystem nor for the future of Zone
Append.

Are you open to us doing some characterization and if the impact
to the fast path is not significant, moving ahead to a Zone Append
emulation like in SCSI? I will promise that we will remove this path if
requests for these devices terminate.

2020-08-18 18:12:51

by Javier González

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 18.08.2020 12:51, Matias Bjørling wrote:
>On 18/08/2020 11.50, Javier Gonzalez wrote:
>>On 18.08.2020 09:12, Christoph Hellwig wrote:
>>>On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>>>>If drive does not support zone-append natively, enable emulation using
>>>>regular write.
>>>>Make emulated zone-append cmd write-lock the zone, preventing
>>>>concurrent append/write on the same zone.
>>>
>>>I really don't think we should add this.  ZNS and the Linux support
>>>were all designed with Zone Append in mind, and then your company did
>>>the nastiest possible move violating the normal NVMe procedures to make
>>>it optional.  But that doesn't change the fact that Linux should keep
>>>requiring it, especially with the amount of code added here and how it
>>>hooks in the fast path.
>>
>>I understand that the NVMe process was agitated and that the current ZNS
>>implementation in Linux relies on append support from the device
>>perspective. However, the current TP does allow for not implementing
>>append, and a number of customers are requiring the use of normal
>>writes, which we want to support.
>
>There are a lot of things that are specified in NVMe, but not
>implemented in the Linux kernel. That your company is not able to
>efficiently implement the Zone Append command (this is the only reason
>I can think of that makes you and your company cause such a fuss),

This comment is out of place and I will choose to ignore it.

>shouldn't mean that everyone else has to suffer.

This is not a quirk, nor a software work-around. This is a design
decision affecting the storage stack of several OEMs. As such, I would
like to find a way to implement this functionality.

Last time we discussed this in the mailing list, you among others
pointed to append emulation as the best way to enable this path and here
we are. Can you explain what changed?

>
>In any case, SPDK offers adequate support and can be used today.

We can take the SPDK discussion to the appropriate mailing lists and Slack
channels.

2020-08-18 18:14:35

by Javier González

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 18.08.2020 10:39, Keith Busch wrote:
>On Tue, Aug 18, 2020 at 07:29:12PM +0200, Javier Gonzalez wrote:
>> On 18.08.2020 09:58, Keith Busch wrote:
>> > On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
>> > > a number of customers are requiring the use of normal writes, which we
>> > > want to support.
>> >
>> > A device that supports append is completely usable for those customers,
>> > too. There's no need to create divergence in this driver.
>>
>> Not really. You know as well as I do that some features are disabled for
>> a particular SSD model on customer requirements. Generic models
>> implementing append can submit both I/Os, but those that remove append
>> are left out.
>
>You are only supporting my point: if your device supports append, you
>get to work in every ZNS use case, otherwise you only get to work in a
>subset.

There are several devices. Some of them enable append and some do not. I
would like to support the latter too.

2020-08-19 07:41:58

by Christoph Hellwig

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote:
> I understand that you want vendor alignment in the NVMe driver and I
> agree. We are not pushing for a non-append model - you can see that we
> are investing effort in implementing the append path in the block layer
> and io_uring and we will continue doing so as patches get merged.
>
> This said, we do have some OEM models that do not implement append and I
> would like them to be supported in Linux. As you know, new TPs are being
> standardized now and the append emulation is the basis for adding
> support for them. I do not believe it is unreasonable to find a way to
> add support for these SSDs.

I do not think we should support anything but Zone Append, especially not
the new TP, which is going to add even more horrible code for absolutely
no good reason.

> If you completely close the door on this approach, the alternative is
> carrying off-tree patches to the several OEMs that use these devices.
> This is not good for the zoned ecosystem nor for the future of Zone
> Append.

I really don't have a problem with that. If these OEMs want to use
an inferior access model only, they have to pay the price for it.
I also don't think that proxy arguments are very useful. If you OEMs
are troubled by carrying patches becomes they decided to buy inferior
drivers they are perfectly happy to argue their cause here on the list.

> Are you open to us doing some characterization and if the impact
> to the fast path is not significant, moving ahead to a Zone Append
> emulation like in SCSI? I will promise that we will remove this path if
> requests for these devices terminate.

As said I do not think implementing zone append emulation or the TP that
shall not be named are a good idea for Linux.

2020-08-19 08:36:12

by Javier González

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 19.08.2020 09:40, Christoph Hellwig wrote:
>On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote:
>> I understand that you want vendor alignment in the NVMe driver and I
>> agree. We are not pushing for a non-append model - you can see that we
>> are investing effort in implementing the append path in the block layer
>> and io_uring and we will continue doing so as patches get merged.
>>
>> This said, we do have some OEM models that do not implement append and I
>> would like them to be supported in Linux. As you know, new TPs are being
>> standardized now and the append emulation is the basis for adding
>> support for them. I do not believe it is unreasonable to find a way to
>> add support for these SSDs.
>
>I do not think we should support anything but Zone Append, especially not
>the new TP, which is going to add even more horrible code for absolutely
>no good reason.

I must admit that this is a bit frustrating. The new TP adds
functionality beyond operating as an Append alternative that I would
very much like to see upstream (do want to discuss details here).

I understand the concerns about deviating from the Append model, but I
believe we should find a way to add these new features. We are hiding
all the logic in the NVMe driver and not touching the interface with the
block layer, so the overall model is really not changed.

>
>> If you completely close the door on this approach, the alternative is
>> carrying off-tree patches to the several OEMs that use these devices.
>> This is not good for the zoned ecosystem nor for the future of Zone
>> Append.
>
>I really don't have a problem with that. If these OEMs want to use
>an inferior access model only, they have to pay the price for it.
>I also don't think that proxy arguments are very useful. If you OEMs
>are troubled by carrying patches becomes they decided to buy inferior
>drivers they are perfectly happy to argue their cause here on the list.

I am not arguing as a proxy; I am stating the trouble we see from our
perspective in having to diverge from mainline when our approach is
being upstream first.

Whether the I/O mode is inferior or superior, they can answer that
themselves if they read this list.
>
>> Are you open to us doing some characterization and if the impact
>> to the fast path is not significant, moving ahead to a Zone Append
>> emulation like in SCSI? I will promise that we will remove this path if
>> requests for these devices terminate.
>
>As said I do not think implementing zone append emulation or the TP that
>shall not be named are a good idea for Linux.

I would ask you to reconsider this position. I have a hard time
understanding how zone append emulation is a good idea in SCSI and not
in NVMe, when there is no performance penalty.

2020-08-19 09:16:14

by Damien Le Moal

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 2020/08/19 17:34, Javier Gonzalez wrote:
> On 19.08.2020 09:40, Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote:
>>> I understand that you want vendor alignment in the NVMe driver and I
>>> agree. We are not pushing for a non-append model - you can see that we
>>> are investing effort in implementing the append path in the block layer
>>> and io_uring and we will continue doing so as patches get merged.
>>>
>>> This said, we do have some OEM models that do not implement append and I
>>> would like them to be supported in Linux. As you know, new TPs are being
>>> standardized now and the append emulation is the basis for adding
>>> support for them. I do not believe it is unreasonable to find a way to
>>> add support for these SSDs.
>>
>> I do not think we should support anything but Zone Append, especially not
>> the new TP, which is going to add even more horrible code for absolutely
>> no good reason.
>
> I must admit that this is a bit frustrating. The new TP adds
> functionality beyond operating as an Append alternative that I would
> very much like to see upstream (do want to discuss details here).
>
> I understand the concerns about deviating from the Append model, but I
> believe we should find a way to add these new features. We are hiding
> all the logic in the NVMe driver and not touching the interface with the
> block layer, so the overall model is really not changed.
>
>>
>>> If you completely close the door on this approach, the alternative is
>>> carrying off-tree patches to the several OEMs that use these devices.
>>> This is not good for the zoned ecosystem nor for the future of Zone
>>> Append.
>>
>> I really don't have a problem with that. If these OEMs want to use
>> an inferior access model only, they have to pay the price for it.
>> I also don't think that proxy arguments are very useful. If you OEMs
>> are troubled by carrying patches becomes they decided to buy inferior
>> drivers they are perfectly happy to argue their cause here on the list.
>
> I am not arguing as a proxy; I am stating the trouble we see from our
> perspective in having to diverge from mainline when our approach is
> being upstream first.
>
> Whether the I/O mode is inferior or superior, they can answer that
> themselves if they read this list.
>>
>>> Are you open to us doing some characterization and if the impact
>>> to the fast path is not significant, moving ahead to a Zone Append
>>> emulation like in SCSI? I will promise that we will remove this path if
>>> requests for these devices terminate.
>>
>> As said I do not think implementing zone append emulation or the TP that
>> shall not be named are a good idea for Linux.
>
> I would ask you to reconsider this position. I have a hard time
> understanding how zone append emulation is a good idea in SCSI and not
> in NVMe, when there is no performance penalty.

While defining a zone append command for SCSI/ZBC is possible (using sense data
for returning the written offset), there is no way to define zone append for
SATA/ZAC without entirely breaking the ATA command model. This is why we went
after an emulation implementation instead of trying to standardized native
commands. That implementation does not have any performance impact over regular
writes *and* zone write locking does not in general degrade HDD write
performance (only a few corner cases suffer from it). Comparing things equally,
the same could be said of NVMe drives that do not have zone append native
support: performance will be essentially the same using regular writes and
emulated zone append. But mq-deadline and zone write locking will significantly
lower performance for emulated zone append compared to a native zone append
support by the drive.


--
Damien Le Moal
Western Digital Research

2020-08-19 09:29:00

by Kanchan Joshi

Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
> > Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>
> No, it is not.

Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
I see that null-block zoned and SCSI-ZBC both set this requirement. I
wonder how it became different for NVMe.

--
Kanchan

2020-08-19 09:41:02

by Damien Le Moal

Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On 2020/08/19 18:27, Kanchan Joshi wrote:
> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig <[email protected]> wrote:
>>
>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>>
>> No, it is not.
>
> Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
> I see that null-block zoned and SCSI-ZBC both set this requirement. I
> wonder how it became different for NVMe.

It is not required for an NVMe ZNS drive that has zone append native support.
zonefs and upcoming btrfs do not use regular writes, removing the requirement
for zone write locking.

In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be set if
and only if the drive does not have native zone append support. And even in that
case, since for an emulated zone append the zone write lock is taken and
released by the emulation driver itself, ELEVATOR_F_ZBD_SEQ_WRITE is required
only if the user will also be issuing regular writes at high QD. And that is
trivially controllable by the user by simply setting the drive elevator to
mq-deadline. Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.
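
For illustration only (not part of the posted series), the conditional form
described above would look roughly like this in nvme_update_zone_info(),
reusing the nvme_is_append_emulated() helper from patch 2; note that the
paragraph above goes on to argue that even this is unnecessary:

	/* hypothetical: require mq-deadline only when append is emulated */
	if (nvme_is_append_emulated(ns))
		blk_queue_required_elevator_features(q,
					ELEVATOR_F_ZBD_SEQ_WRITE);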


--
Damien Le Moal
Western Digital Research

2020-08-19 10:36:00

by Kanchan Joshi

Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal <[email protected]> wrote:
>
> On 2020/08/19 18:27, Kanchan Joshi wrote:
> > On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig <[email protected]> wrote:
> >>
> >> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
> >>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
> >>
> >> No, it is not.
> >
> > Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
> > I see that null-block zoned and SCSI-ZBC both set this requirement. I
> > wonder how it became different for NVMe.
>
> It is not required for an NVMe ZNS drive that has zone append native support.
> zonefs and upcoming btrfs do not use regular writes, removing the requirement
> for zone write locking.

I understand that if a particular user (zonefs, btrfs etc) is not
sending regular-write and sending append instead, write-lock is not
required.
But if that particular user or some other user (say F2FS) sends
regular write(s), write-lock is needed.
Above block-layer, both the opcodes REQ_OP_WRITE and
REQ_OP_ZONE_APPEND are available to be used by users. And I thought
write-lock is taken or not is a per-opcode thing and not per-user (FS,
MD/DM, user-space etc.), is not that correct? And MQ-deadline can
cater to both the opcodes, while other schedulers cannot serve
REQ_OP_WRITE well for zoned-device.

> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be set if
> and only if the drive does not have native zone append support.

Sure I can keep it that way, once I get it right. If it is really not
required for native-append drive, it should not be here at the place
where I added.

> And even in that
> case, since for an emulated zone append the zone write lock is taken and
> released by the emulation driver itself, ELEVATOR_F_ZBD_SEQ_WRITE is required
> only if the user will also be issuing regular writes at high QD. And that is
> trivially controllable by the user by simply setting the drive elevator to
> mq-deadline. Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.

Are we saying applications should switch schedulers based on the write
QD (use any-scheduler for QD1 and mq-deadline for QD-N).
Even if it does that, it does not know what other applications would
be doing. That seems hard-to-get-right and possible only in a
tightly-controlled environment.



--
Joshi

2020-08-19 10:44:45

by Christoph Hellwig

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Wed, Aug 19, 2020 at 09:14:13AM +0000, Damien Le Moal wrote:
> While defining a zone append command for SCSI/ZBC is possible (using sense data
> for returning the written offset), there is no way to define zone append for
> SATA/ZAC without entirely breaking the ATA command model. This is why we went
> after an emulation implementation instead of trying to standardize native
> commands. That implementation does not have any performance impact over regular
> writes *and* zone write locking does not in general degrade HDD write
> performance (only a few corner cases suffer from it). Comparing things equally,
> the same could be said of NVMe drives that do not have zone append native
> support: performance will be essentially the same using regular writes and
> emulated zone append. But mq-deadline and zone write locking will significantly
> lower performance for emulated zone append compared to a native zone append
> support by the drive.

And to summarize the most important point - Zone Append doesn't exist
in ZAC/ZBC. For people that spent the last years trying to make zoned
storage work, the lack of such a primitive has been the major pain point.
That's why I came up with the Zone Append design in response to a
request for such an operation from another company that is now heavily
involved in both Linux development and hosting Linux VMs. For ZAC and
ZBC the best we can do is to emulate the approach in the driver, but
for NVMe we can do it. ZNS until just before the release had Zone
Append mandatory, and it did so for a very good reason. While making
it optional allows OEMs to request drives without it, I fundamentally
think we should not support that in Linux and request vendors do
implement writes to zones the right way.

And just as some OEMs can request certain TPs or optional features to
be implemented, so can Linux. Just to give an example from the zone
world - Linux requires uniform and power of two zone sizes, which in
ZAC and ZBC are not required.

2020-08-19 10:50:49

by Christoph Hellwig

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Wed, Aug 19, 2020 at 10:33:53AM +0200, Javier Gonzalez wrote:
> I would ask you to reconsider this position. I have a hard time
> understanding how zone append emulation is a good idea in SCSI and not
> in NVMe, when there is no performance penalty.

Per the btrfs and zonefs numbers, zone append emulation is
faster than using serialized writes, but also way slower than native
zone append. Zone append emulation for SCSI is the way to bring
support for pre-existing standards into the brave new world. Not
requiring Zone Append in a standard designed around it because of
bowing down to last minute political interventions in the standards
committee is shooting ourselves in the foot.

2020-08-19 11:19:18

by Damien Le Moal

Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On 2020/08/19 19:32, Kanchan Joshi wrote:
> On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal <[email protected]> wrote:
>>
>> On 2020/08/19 18:27, Kanchan Joshi wrote:
>>> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig <[email protected]> wrote:
>>>>
>>>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>>>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>>>>
>>>> No, it is not.
>>>
>>> Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
>>> I see that null-block zoned and SCSI-ZBC both set this requirement. I
>>> wonder how it became different for NVMe.
>>
>> It is not required for an NVMe ZNS drive that has zone append native support.
>> zonefs and upcoming btrfs do not use regular writes, removing the requirement
>> for zone write locking.
>
> I understand that if a particular user (zonefs, btrfs etc) is not
> sending regular-write and sending append instead, write-lock is not
> required.
> But if that particular user or some other user (say F2FS) sends
> regular write(s), write-lock is needed.

And that can be trivially enabled by setting the drive elevator to mq-deadline.

> Above block-layer, both the opcodes REQ_OP_WRITE and
> REQ_OP_ZONE_APPEND are available to be used by users. And I thought
> write-lock is taken or not is a per-opcode thing and not per-user (FS,
> MD/DM, user-space etc.), is not that correct? And MQ-deadline can
> cater to both the opcodes, while other schedulers cannot serve
> REQ_OP_WRITE well for zoned-device.

mq-deadline ignores zone append commands. No zone lock is taken for these. In
scsi, the emulation takes the zone lock before transforming the zone append into
a regular write. That locking is consistent with the mq-scheduler level locking
since the same lock bitmap is used. So if the user only issues zone append
writes, mq-deadline is not needed and there is no reason to force its use by
setting ELEVATOR_F_ZBD_SEQ_WRITE. E.g. the user may want to use kyber...
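
For reference, the "same lock bitmap" point refers to the block layer's
per-zone write-lock bitmap: blk_req_zone_write_trylock(), which the
emulation in patch 2 calls, sets the same per-zone bit that mq-deadline
uses when dispatching regular writes. Roughly (paraphrased from the block
layer of this era; details may differ):

	bool blk_req_zone_write_trylock(struct request *rq)
	{
		unsigned int zno = blk_rq_zone_no(rq);

		/* same bitmap that mq-deadline checks for regular writes */
		if (test_and_set_bit(zno, rq->q->seq_zones_wlock))
			return false;

		rq->rq_flags |= RQF_ZONE_WRITE_LOCKED;
		return true;
	}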

>> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be set if
>> and only if the drive does not have native zone append support.
>
> Sure I can keep it that way, once I get it right. If it is really not
> required for native-append drive, it should not be here at the place
> where I added.
>
>> And even in that
>> case, since for an emulated zone append the zone write lock is taken and
>> released by the emulation driver itself, ELEVATOR_F_ZBD_SEQ_WRITE is required
>> only if the user will also be issuing regular writes at high QD. And that is
>> trivially controllable by the user by simply setting the drive elevator to
>> mq-deadline. Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.
>
> Are we saying applications should switch schedulers based on the write
> QD (use any-scheduler for QD1 and mq-deadline for QD-N).
> Even if it does that, it does not know what other applications would
> be doing. That seems hard-to-get-right and possible only in a
> tightly-controlled environment.

Even for SMR, the user is free to set the elevator to none, which disables zone
write locking. Issuing writes correctly then becomes the responsibility of the
application. This can be useful for settings that for instance use NCQ I/O
priorities, which give better results when "none" is used.

As far as I know, zoned drives are always used in tightly controlled
environments. Problems like "does not know what other applications would be
doing" are non-existent. Setting up the drive correctly for the use case at hand
is a sysadmin/server setup problem, based on *the* application (singular)
requirements.


--
Damien Le Moal
Western Digital Research

2020-08-19 19:15:43

by David Fugate

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Tue, 2020-08-18 at 07:12 +0000, Christoph Hellwig wrote:
> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
> > If drive does not support zone-append natively, enable emulation
> > using
> > regular write.
> > Make emulated zone-append cmd write-lock the zone, preventing
> > concurrent append/write on the same zone.
>
> I really don't think we should add this. ZNS and the Linux support
> were all designed with Zone Append in mind, and then your company did
> the nastiest possible move violating the normal NVMe procedures to
> make
> it optional. But that doesn't change the fact that Linux should keep
> requiring it, especially with the amount of code added here and how
> it
> hooks in the fast path.

Intel does not support making *optional* NVMe spec features *required*
by the NVMe driver.

It's forgivable WDC's accepted contribution didn't work with other
vendors' devices choosing not to implement the optional Zone Append,
but it's not OK to reject contributions remedying this. Provided
there are no glaring technical issues, Samsung's contribution should be
accepted to maintain both spec compliance as well as vendor neutrality.


2020-08-19 19:29:00

by Jens Axboe

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 8/19/20 12:11 PM, David Fugate wrote:
> On Tue, 2020-08-18 at 07:12 +0000, Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>>> If drive does not support zone-append natively, enable emulation
>>> using
>>> regular write.
>>> Make emulated zone-append cmd write-lock the zone, preventing
>>> concurrent append/write on the same zone.
>>
>> I really don't think we should add this. ZNS and the Linux support
>> were all designed with Zone Append in mind, and then your company did
>> the nastiest possible move violating the normal NVMe procedures to
>> make
>> it optional. But that doesn't change the fact that Linux should keep
>> requiring it, especially with the amount of code added here and how
>> it
>> hooks in the fast path.
>
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver.

It's not required, the driver will function quite fine without it. If you
want to use ZNS it's required. The Linux driver thankfully doesn't need
any vendor to sign off on what it can or cannot do, or what features
are acceptable.

> It's forgivable WDC's accepted contribution didn't work with other
> vendors' devices choosing not to implement the optional Zone Append,
> but it's not OK to reject contributions remedying this. Provided
> there are no glaring technical issues, Samsung's contribution should be
> accepted to maintain both spec compliance as well as vendor neutrality.

It's *always* ok to reject contributions, if those contributions cause
maintainability issues, unacceptable slowdowns, or whatever other issue
that the maintainers of said driver don't want to deal with. Any
contribution should be judged on merit, not based on political decisions
or opinions. Obviously this thread reeks of it.

--
Jens Axboe

2020-08-19 21:44:04

by Keith Busch

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver.

This is inaccurate. As another example, the spec optionally allows a
zone size to be a power of 2, but Linux requires a power of 2 if you
want to use it with this driver.
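
For reference, the power-of-2 requirement is enforced in
nvme_update_zone_info() in zns.c with a check roughly like this
(paraphrased; the exact code and message may differ):

	if (!is_power_of_2(ns->zsze)) {
		dev_warn(ns->ctrl->device,
			"invalid zone size:%llu for namespace:%d\n",
			ns->zsze, ns->head->ns_id);
		status = -ENODEV;
		goto free_data;
	}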

> Provided there are no glaring technical issues

There are many. Some may be improved through a serious review process,
but the mess it makes out of the fast path is measurably harmful to
devices that don't subscribe to this. That issue is not so easily
remedied.

Since this patch is a copy of the scsi implementation, the reasonable
thing is to take this fight to the generic block layer for a common
solution. We're not taking this in the nvme driver.

2020-08-19 21:58:06

by David Fugate

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Wed, 2020-08-19 at 13:25 -0600, Jens Axboe wrote:
> It's not required, the driver will function quite fine without it. If
> you
> want to use ZNS it's required.

The NVMe spec does not require Zone Append for ZNS; a *vendor-neutral*
Linux driver should not either.

> The Linux driver thankfully doesn't need
> any vendor to sign off on what it can or cannot do, or what features
> are acceptable.

The problem is the driver needs one *particular* vendor to sign off.
Existing driver behavior aligns with WDC drives instead of the spec,
giving WDC an unfair advantage. Couple this with NVMe maintainer(s?)
working for WDC, and there's a conflict of interest.

> It's *always* ok to reject contributions, if those contributions
> cause
> maintainability issues, unacceptable slowdowns, or whatever other
> issue
> that the maintainers of said driver don't want to deal with. Any
> contribution should be judged on merit, not based on political
> decisions
> or opinions.

Agreed, but this standard needs to be applied equally to everyone.
E.g., harmless contributions such as
https://lore.kernel.org/linux-nvme/[email protected]/ get
rejected yet clear spec violations from maintainers are accepted? This
type of behavior encourages forking, vendor-specific drivers, etc.
which is somewhere I hope none of us want to go.


2020-08-19 22:13:06

by Keith Busch

Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Wed, Aug 19, 2020 at 03:54:20PM -0600, David Fugate wrote:
> On Wed, 2020-08-19 at 13:25 -0600, Jens Axboe wrote:
> > It's not required, the driver will function quite fine without it. If
> > you
> > want to use ZNS it's required.
>
> The NVMe spec does not require Zone Append for ZNS; a *vendor-neutral*
> Linux driver should not either.

The spec was developed over the course of years with your employer's
involvement, and the software enabling efforts occurred in parallel. The
"optional" part was made that way at the final hour, so please align
your expectations accordingly.

> Agreed, but this standard needs to be applied equally to everyone.
> E.g., harmless contributions such as
> https://lore.kernel.org/linux-nvme/[email protected]/ get
> rejected yet clear spec violations from maintainers are accepted?
> type of behavior encourages forking, vendor-specific drivers, etc.
> which is somewhere I hope none of us want to go.

You're the one who left that thread dangling. You offered to have your
firmware accommodate the Intel sponsored feature that makes your patch
unnecessary in the first place. Your follow up made no sense and you
have not responded to the queries about it.

2020-08-19 23:44:26

by David Fugate

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Wed, 2020-08-19 at 15:10 -0700, Keith Busch wrote:
> You're the one who left that thread dangling. You offered to have
> your
> firmware accommodate the Intel sponsored feature that makes your
> patch
> unnecessary in the first place. Your follow up made no sense and you
> have not responded to the queries about it.

There were queries? My key takeaways were a maintainer NAK followed by
instructions to make the Intel drive align with the driver by
implementing NOIOB. While I disagree with the rejection as it appeared
to be based entirely on politics, I can accept it as the quirk wasn't
in the spec.

It's not fair to make this same "your drive should align with the
driver" demand of Samsung because we *are* talking about a spec'ed
feature here. Technical critiques of their patches and real performance
degradations observed are fair game and objective; "your company did
the nastiest possible move violating the normal NVMe procedures to make
it optional" is not.

2020-08-20 03:46:43

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Wed, Aug 19, 2020 at 05:43:29PM -0600, David Fugate wrote:
> There were queries? My key takeaways were a maintainer NAK followed by
> instructions to make the Intel drive align with the driver by
> implementing NOIOB. While I disagree with the rejection as it appeared
> to be based entirely on politics, I can accept it as the quirk wasn't
> in the spec.

For the record, the suggestion provided, which you agreed to look into,
most broadly enables your hardware on Linux and was entirely to your
benefit. Not quite as dramatic as a political conspiracy.

You later responded with a technical argument against that suggestion;
however, your reason didn't add up, and that's where you left the
thread.

> It's not fair to make this same "your drive should align with the
> driver" demand of Samsung because we *are* talking about a spec'ed
> feature here. Technical critques of their patches and real performance
> degrades observed are fair game and objective; "your company did
> the nastiest possible move violating the normal NVMe procedures to make
> it optional" is not.

Sure, but you're cherry-picking comments from the discussion. The
performance impact exists, and it's generally not acceptable from a
maintenance standpoint to duplicate significant code without at least
trying to provide a common solution.

2020-08-20 05:33:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> On Tue, 2020-08-18 at 07:12 +0000, Christoph Hellwig wrote:
> > On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
> > > If drive does not support zone-append natively, enable emulation
> > > using
> > > regular write.
> > > Make emulated zone-append cmd write-lock the zone, preventing
> > > concurrent append/write on the same zone.
> >
> > I really don't think we should add this. ZNS and the Linux support
> > were all designed with Zone Append in mind, and then your company did
> > the nastiest possible move violating the normal NVMe procedures to
> > make
> > it optional. But that doesn't change the fact the Linux should keep
> > requiring it, especially with the amount of code added here and how
> > it
> > hooks in the fast path.
>
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver.

That is a great demand, but please stop talking about companies here,
because companies simply don't matter for Linux development. People
and use cases do, companies don't, and your mail talking about companies
really can't be taken seriously.

And I'm not sure why you think Linux is different from any other NVMe
host OS. If random NVMe host A decided they need feature X they are
going to require it, because why not. Especially if the developers of
host A helped to drive the development of feature X. I'm pretty sure
with a little background in storage standards you've seen that play out
a lot before. And it is no different here.

2020-08-20 05:55:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Wed, Aug 19, 2020 at 05:43:29PM -0600, David Fugate wrote:
> There were queries? My key takeaways were a maintainer NAK followed by
> instructions to make the Intel drive align with the driver by
> implementing NOIOB. While I disagree with the rejection as it appeared
> to be based entirely on politics, I can accept it as the quirk wasn't
> in the spec.

Stop the crap now, please. I'm really offended. I'm offended by you
implying that I play corporate politics. I'm primarily a Linux developer,
and I stand for what I think is good for Linux. I fight inside the companies
that pay me for what is good for Linux, and I work hard to convince the
companies that do not pay me of what is right. I waste a lot of precious
time sitting in standards meetings to do what is right for Linux. And now
you come here with the politics argument. Stop it, please.

The Intel NOIOB quirk has been in the driver since day 1, added by Matthew
probably with no bad intent, but we kept growing the lists and finally
got Intel to standardize it because it was good for Linux, and to be
honest good for Intel, as updating quirks in half a dozen drivers
simply does not scale. We got a really nice trivial TP out of it. But
we really should not see any new drives that do not implement one damn
trivial field a long time after the TP has been published. That is not
politics.

Making a fuzz and playing the corporate card on a Linux mailing list
is politics, really bad politics and you need to stop it.

> It's not fair to make this same "your drive should align with the
> driver" demand of Samsung because we *are* talking about a spec'ed
> feature here.

Huh. The next time Dell or NetApp or Facebook are going to require an
optional NVMe feature, an OCP feature, or even a vendor-specific feature,
are you going to go to them and say "you don't play fair"?

> Technical critques of their patches and real performance
> degrades observed are fair game and objective; "your company did
> the nastiest possible move violating the normal NVMe procedures to make
> it optional" is not.

It is pretty fair game if you know the background, which is that I spent
countless hours making this feature fit the Linux I/O stack requirements
perfectly and worked to make sure it is there. We might as well rely on it.

2020-08-20 06:41:01

by Javier González

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 19.08.2020 13:25, Jens Axboe wrote:
>On 8/19/20 12:11 PM, David Fugate wrote:
>> On Tue, 2020-08-18 at 07:12 +0000, Christoph Hellwig wrote:
>>> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>>>> If drive does not support zone-append natively, enable emulation
>>>> using
>>>> regular write.
>>>> Make emulated zone-append cmd write-lock the zone, preventing
>>>> concurrent append/write on the same zone.
>>>
>>> I really don't think we should add this. ZNS and the Linux support
>>> were all designed with Zone Append in mind, and then your company did
>>> the nastiest possible move violating the normal NVMe procedures to
>>> make
>>> it optional. But that doesn't change the fact the Linux should keep
>>> requiring it, especially with the amount of code added here and how
>>> it
>>> hooks in the fast path.
>>
>> Intel does not support making *optional* NVMe spec features *required*
>> by the NVMe driver.
>
>It's not required, the driver will function quite fine without it. If you
>want to use ZNS it's required. The Linux driver thankfully doesn't need
>any vendor to sign off on what it can or cannot do, or what features
>are acceptable.
>
>> It's forgivable WDC's accepted contribution didn't work with other
>> vendors' devices choosing not to implement the optional Zone Append,
>> but it's not OK to reject contributions remedying this. Provided
>> there's no glaring technical issues, Samsung's contribution should be
>> accepted to maintain both spec compliance as well as vendor neutrality.
>
>It's *always* ok to reject contributions, if those contributions cause
>maintainability issues, unacceptable slowdowns, or whatever other issue
>that the maintainers of said driver don't want to deal with. Any
>contribution should be judged on merit, not based on political decisions
>or opinions. Obviously this thread reeks of it.
>

I'll reply here, where the discussion diverges from this particular
patch.

We will stop pushing for this emulation. We have a couple of SSDs where
we disabled Append, we implemented support for them, and we wanted to
push the changes upstream. That's it. This is neither politics nor a
conspiracy against the current ZNS spec. We spent a lot of time working
on this spec and are actually doing a fair amount of work to support
Append in other places in the stack. In any case, the fuzz stops here.

Javier

2020-08-20 06:46:55

by Javier González

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 19.08.2020 12:43, Christoph Hellwig wrote:
>On Wed, Aug 19, 2020 at 09:14:13AM +0000, Damien Le Moal wrote:
>> While defining a zone append command for SCSI/ZBC is possible (using sense data
>> for returning the written offset), there is no way to define zone append for
>> SATA/ZAC without entirely breaking the ATA command model. This is why we went
>> after an emulation implementation instead of trying to standardized native
>> commands. That implementation does not have any performance impact over regular
>> writes *and* zone write locking does not in general degrade HDD write
>> performance (only a few corner cases suffer from it). Comparing things equally,
>> the same could be said of NVMe drives that do not have zone append native
>> support: performance will be essentially the same using regular writes and
>> emulated zone append. But mq-deadline and zone write locking will significantly
>> lower performance for emulated zone append compared to a native zone append
>> support by the drive.
>
>And to summarize the most important point - Zone Append doesn't exist
>in ZAC/ABC. For people that spent the last years trying to make zoned
>storage work, the lack of such a primite has been the major pain point.
>That's why I came up with the Zone Append design in response to a
>request for such an operation from another company that is now heavily
>involved in both Linux development and hosting Linux VMs. For ZAC and
>ZBC the best we can do is to emulate the approach in the driver, but
>for NVMe we can do it. ZNS until just before the release had Zone
>Append mandatory, and it did so for a very good reason. While making
>it optional allows OEMs to request drives without it, I fundamentally
>think we should not support that in Linux and request vendors do
>implement writes to zones the right way.

OK. We will just pursue Linux support for ZNS following the append
model.

>
>And just as some OEMs can request certain TPs or optional features to
>be implemented, so can Linux. Just to give an example from the zone
>world - Linux requires uniform and power of two zone sizes, which in
>ZAC and ZBC are not required.

2020-08-20 06:56:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Thu, Aug 20, 2020 at 08:37:19AM +0200, Javier Gonzalez wrote:
> We will stop pushing for this emulation. We have a couple of SSDs where
> we disabled Append, we implemented support for them, and we wanted to
> push the changes upstream. That's it. This is no politics not a
> conspiracy against the current ZNS spec. We spent a lot of time working
> on this spec and are actually doing a fair amount of work to support
> Append other places in the stack. In any case, the fuzz stops here.

FYI, from knowing you personally I'm pretty confident you are not
part of a conspiracy and you are just doing your best given the context,
and I appreciate all your work!

I'm a lot less happy about things that happen in other groups not
involving you directly, and I'm still pretty mad at how the games were
played there, and especially at other actors that seem to be reading the
list here and, instead of taking part in the discussion, are causing fuzz
in completely unrelated venues.

2020-08-20 07:39:22

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Thu, Aug 20, 2020 at 3:22 AM Keith Busch <[email protected]> wrote:
>
> On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> > Intel does not support making *optional* NVMe spec features *required*
> > by the NVMe driver.
>
> This is inaccurate. As another example, the spec optionally allows a
> zone size to be a power of 2, but linux requires a power of 2 if you
> want to use it with this driver.
>
> > Provided there's no glaring technical issues
>
> There are many. Some may be improved through a serious review process,
> but the mess it makes out of the fast path is measurably harmful to
> devices that don't subscribe to this. That issue is not so easily
> remedied.
>
> Since this patch is a copy of the scsi implementation, the reasonable
> thing is take this fight to the generic block layer for a common
> solution. We're not taking this in the nvme driver.

I sincerely want to minimize any adverse impact to the fast-path of
non-zoned devices.
My understanding of that aspect is (I feel it is good to confirm,
irrespective of the future of this patch):

1. Submission path:
There is no extra code for non-zoned device IO. For append, there is
this "ns->append_emulate = 1" check.
Snippet -
	case REQ_OP_ZONE_APPEND:
-		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+		if (!nvme_is_append_emulated(ns))
+			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+		else {
+			/* prepare append like write, and adjust lba afterwards */

2. Completion:
Not as clean as submission for sure.
The extra check is "ns && ns->append_emulate == 1" for completions if
CONFIG_ZONED is enabled.
A slightly better completion-handling version (than the submitted patch) is -

-	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-		   req_op(req) == REQ_OP_ZONE_APPEND) {
-		req->__sector = nvme_lba_to_sect(req->q->queuedata,
-				le64_to_cpu(nvme_req(req)->result.u64));
+	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+		struct nvme_ns *ns = req->q->queuedata;
+
+		/* append-emulation requires wp update for some cmds */
+		if (ns && nvme_is_append_emulated(ns)) {
+			if (nvme_need_zone_wp_update(req))
+				nvme_zone_wp_update(ns, req, status);
+		} else if (req_op(req) == REQ_OP_ZONE_APPEND)
+			req->__sector = nvme_lba_to_sect(ns,
+					le64_to_cpu(nvme_req(req)->result.u64));
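
(If the remaining branch is still a concern, one further option -- purely a
sketch from my side, not something this patch implements -- would be a static
key that gets enabled only when an append-emulated namespace is attached, so
the emulation check costs a patched-in no-op on systems without such a
namespace:)

/* hypothetical key, e.g. enabled from nvme_update_zone_info() when emulating */
DEFINE_STATIC_KEY_FALSE(nvme_append_emul_enabled);

	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
		struct nvme_ns *ns = req->q->queuedata;

		if (static_branch_unlikely(&nvme_append_emul_enabled) &&
		    ns && nvme_is_append_emulated(ns)) {
			if (nvme_need_zone_wp_update(req))
				nvme_zone_wp_update(ns, req, status);
		} else if (req_op(req) == REQ_OP_ZONE_APPEND)
			req->__sector = nvme_lba_to_sect(ns,
					le64_to_cpu(nvme_req(req)->result.u64));
	}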

Am I missing any other fast-path pain-points?

A quick 1-minute 4K randwrite run (QD 64, 4 jobs, libaio) shows:
before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
after: IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)

This may not be the best test to see the cost, and I am happy to
conduct more and improve on them.

As for the volume of the code - it is the same as the SCSI emulation. And I
can make efforts to reduce that by moving common code to blk-zone and
reusing it in the SCSI/NVMe emulation.
In the patch I tried to isolate append-emulation by keeping everything
in "nvme_za_emul". It contains nothing nvme'ish except nvme_ns,
which can be turned into "driver_data".

+#ifdef CONFIG_BLK_DEV_ZONED
+struct nvme_za_emul {
+	unsigned int		nr_zones;
+	spinlock_t		zones_wp_offset_lock;
+	u32			*zones_wp_offset;
+	u32			*rev_wp_offset;
+	struct work_struct	zone_wp_offset_work;
+	char			*zone_wp_update_buf;
+	struct mutex		rev_mutex;
+	struct nvme_ns		*ns;
+};
+#endif
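
(Purely to illustrate that "driver_data" point -- a hypothetical sketch, not
proposed code -- the same state could live in common block-layer code once the
only driver-specific member becomes an opaque pointer:)

struct blk_zone_append_emul {			/* hypothetical name */
	unsigned int		nr_zones;
	spinlock_t		zones_wp_offset_lock;
	u32			*zones_wp_offset;
	u32			*rev_wp_offset;
	struct work_struct	zone_wp_offset_work;
	char			*zone_wp_update_buf;
	struct mutex		rev_mutex;
	void			*driver_data;	/* was struct nvme_ns *ns */
};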

Will that be an acceptable line of further work/discussions?

--
Kanchan

2020-08-20 08:07:46

by Javier González

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 20.08.2020 08:52, Christoph Hellwig wrote:
>On Thu, Aug 20, 2020 at 08:37:19AM +0200, Javier Gonzalez wrote:
>> We will stop pushing for this emulation. We have a couple of SSDs where
>> we disabled Append, we implemented support for them, and we wanted to
>> push the changes upstream. That's it. This is no politics not a
>> conspiracy against the current ZNS spec. We spent a lot of time working
>> on this spec and are actually doing a fair amount of work to support
>> Append other places in the stack. In any case, the fuzz stops here.
>
>FYI, from knowing your personally I'm pretty confident you are not
>part of a conspiracy and you are just doing your best given the context,
>and I appreciate all your work!

Thanks Christoph.

>
>I'm a lot less happy about thinks that happen in other groups not
>involving you directly, and I'm still pretty mad at how the games were
>played there, and especially other actors the seem to be reading the
>list here, and instead of taking part in the discussion are causing fuzz
>in completely unrelated venues.

Yes. Hopefully, we will start keeping things separate and using this
list for Linux technical conversations only.

2020-08-20 08:16:25

by Javier González

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 20.08.2020 13:07, Kanchan Joshi wrote:
>On Thu, Aug 20, 2020 at 3:22 AM Keith Busch <[email protected]> wrote:
>>
>> On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
>> > Intel does not support making *optional* NVMe spec features *required*
>> > by the NVMe driver.
>>
>> This is inaccurate. As another example, the spec optionally allows a
>> zone size to be a power of 2, but linux requires a power of 2 if you
>> want to use it with this driver.
>>
>> > Provided there's no glaring technical issues
>>
>> There are many. Some may be improved through a serious review process,
>> but the mess it makes out of the fast path is measurably harmful to
>> devices that don't subscribe to this. That issue is not so easily
>> remedied.
>>
>> Since this patch is a copy of the scsi implementation, the reasonable
>> thing is take this fight to the generic block layer for a common
>> solution. We're not taking this in the nvme driver.
>
>I sincerely want to minimize any adverse impact to the fast-path of
>non-zoned devices.
>My understanding of that aspect is (I feel it is good to confirm,
>irrespective of the future of this patch):
>
>1. Submission path:
>There is no extra code for non-zoned device IO. For append, there is
>this "ns->append_emulate = 1" check.
>Snippet -
> case REQ_OP_ZONE_APPEND:
>- ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
>+ if (!nvme_is_append_emulated(ns))
>+ ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
>+ else {
>+ /* prepare append like write, and adjust lba
>afterwards */
>
>2. Completion:
> Not as clean as submission for sure.
>The extra check is "ns && ns->append_emulate == 1" for completions if
>CONFIG_ZONED is enabled.
>A slightly better completion-handling version (than the submitted patch) is -
>
>- } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
>- req_op(req) == REQ_OP_ZONE_APPEND) {
>- req->__sector = nvme_lba_to_sect(req->q->queuedata,
>- le64_to_cpu(nvme_req(req)->result.u64));
>+ } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
>+ struct nvme_ns *ns = req->q->queuedata;
>+ /* append-emulation requires wp update for some cmds*/
>+ if (ns && nvme_is_append_emulated(ns)) {
>+ if (nvme_need_zone_wp_update(req))
>+ nvme_zone_wp_update(ns, req, status);
>+ }
>+ else if (req_op(req) == REQ_OP_ZONE_APPEND)
>+ req->__sector = nvme_lba_to_sect(ns,
>+ le64_to_cpu(nvme_req(req)->result.u64));
>
>Am I missing any other fast-path pain-points?
>
>A quick 1 minute 4K randwrite run (QD 64, 4 jobs,libaio) shows :
>before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
>after: IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)

It is good to use the QEMU "simulation" path that we implemented to test
performance with different delays, etc., but for these numbers to make
sense we need to put them in contrast to the simulated NAND speed, etc.

>
>This maynot be the best test to see the cost, and I am happy to
>conduct more and improvise.
>
>As for the volume of the code - it is same as SCSI emulation. And I
>can make efforts to reduce that by moving common code to blk-zone, and
>reuse in SCSI/NVMe emulation.
>In the patch I tried to isolate append-emulation by keeping everything
>into "nvme_za_emul". It contains nothing nvme'ish except nvme_ns,
>which can be turned into "driver_data".
>
>+#ifdef CONFIG_BLK_DEV_ZONED
>+struct nvme_za_emul {
>+ unsigned int nr_zones;
>+ spinlock_t zones_wp_offset_lock;
>+ u32 *zones_wp_offset;
>+ u32 *rev_wp_offset;
>+ struct work_struct zone_wp_offset_work;
>+ char *zone_wp_update_buf;
>+ struct mutex rev_mutex;
>+ struct nvme_ns *ns;
>+};
>+#endif
>
>Will that be an acceptable line of further work/discussions?

I know we spent time enabling this path, but I don't think that moving
the discussion to the block layer will have much more benefit.

Let's keep the support for these non-append devices in xNVMe and focus
on the support for the append-enabled ones in Linux. We have a lot of
good stuff in the backlog that we can start pushing.

2020-08-21 00:00:45

by David Fugate

[permalink] [raw]
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On Thu, 2020-08-20 at 12:45 +0900, Keith Busch wrote:
> For the record, the suggestion provided, which you agreed to look
> into,
> most broadly enables your hardware on Linux and was entirely to your
> benefit. Not quite as dramatic as a political conspiracy.
>
> You later responded with a technical argument against that
> suggestion;
> however, your reason didn't add up, and that's where you left the
> thread.

The suggestion regarding the rejected patch was passed on to the related FW
team, and the "technical argument" was our FW team's response to that
suggestion, which I relayed to the list. At this point, there's no
closure on whether the device will get NOIOB.

My point in bringing up this example was that a one-line, highly
maintainable patch which improves the performance of Linux should not
have been immediately NAK'ed as it was. If you believe it should have
been, we'll agree to disagree.

2020-09-07 07:02:53

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On Wed, Aug 19, 2020 at 4:47 PM Damien Le Moal <[email protected]> wrote:
>
> On 2020/08/19 19:32, Kanchan Joshi wrote:
> > On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal <[email protected]> wrote:
> >>
> >> On 2020/08/19 18:27, Kanchan Joshi wrote:
> >>> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig <[email protected]> wrote:
> >>>>
> >>>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
> >>>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
> >>>>
> >>>> No, it is not.
> >>>
> >>> Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
> >>> I see that null-block zoned and SCSI-ZBC both set this requirement. I
> >>> wonder how it became different for NVMe.
> >>
> >> It is not required for an NVMe ZNS drive that has zone append native support.
> >> zonefs and upcoming btrfs do not use regular writes, removing the requirement
> >> for zone write locking.
> >
> > I understand that if a particular user (zonefs, btrfs etc) is not
> > sending regular-write and sending append instead, write-lock is not
> > required.
> > But if that particular user or some other user (say F2FS) sends
> > regular write(s), write-lock is needed.
>
> And that can be trivially enabled by setting the drive elevator to mq-deadline.
>
> > Above block-layer, both the opcodes REQ_OP_WRITE and
> > REQ_OP_ZONE_APPEND are available to be used by users. And I thought
> > write-lock is taken or not is a per-opcode thing and not per-user (FS,
> > MD/DM, user-space etc.), is not that correct? And MQ-deadline can
> > cater to both the opcodes, while other schedulers cannot serve
> > REQ_OP_WRITE well for zoned-device.
>
> mq-deadline ignores zone append commands. No zone lock is taken for these. In
> scsi, the emulation takes the zone lock before transforming the zone append into
> a regular write. That locking is consistent with the mq-scheduler level locking
> since the same lock bitmap is used. So if the user only issues zone append
> writes, mq-deadline is not needed and there is no reasons to force its use by
> setting ELEVATOR_F_ZBD_SEQ_WRITE. E.g. the user may want to use kyber...

Right, got your point.

> >> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be set only
> >> and only if the drive does not have native zone append support.
> >
> > Sure I can keep it that way, once I get it right. If it is really not
> > required for native-append drive, it should not be here at the place
> > where I added.
> >
> >> And even in that
> >> case, since for an emulated zone append the zone write lock is taken and
> >> released by the emulation driver itself, ELEVATOR_F_ZBD_SEQ_WRITE is required
> >> only if the user will also be issuing regular writes at high QD. And that is
> >> trivially controllable by the user by simply setting the drive elevator to
> >> mq-deadline. Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.
> >
> > Are we saying applications should switch schedulers based on the write
> > QD (use any-scheduler for QD1 and mq-deadline for QD-N).
> > Even if it does that, it does not know what other applications would
> > be doing. That seems hard-to-get-right and possible only in a
> > tightly-controlled environment.
>
> Even for SMR, the user is free to set the elevator to none, which disables zone
> write locking. Issuing writes correctly then becomes the responsibility of the
> application. This can be useful for settings that for instance use NCQ I/O
> priorities, which give better results when "none" is used.

Was it not a problem that, even if the application is sending writes
correctly, the scheduler may not preserve the order?
And even when "none" is being used, a requeue can happen, which may lead
to a different ordering.

> As far as I know, zoned drives are always used in tightly controlled
> environments. Problems like "does not know what other applications would be
> doing" are non-existent. Setting up the drive correctly for the use case at hand
> is a sysadmin/server setup problem, based on *the* application (singular)
> requirements.

Fine.
But what about the null-block zoned mode, which sets mq-deadline but does
not actually use the write-lock to avoid races among multiple appends on a
zone?
Does that deserve a fix?

--
Kanchan

2020-09-07 08:23:27

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On 2020/09/07 16:01, Kanchan Joshi wrote:
>> Even for SMR, the user is free to set the elevator to none, which disables zone
>> write locking. Issuing writes correctly then becomes the responsibility of the
>> application. This can be useful for settings that for instance use NCQ I/O
>> priorities, which give better results when "none" is used.
>
> Was it not a problem that even if the application is sending writes
> correctly, scheduler may not preserve the order.
> And even when none is being used, re-queue can happen which may lead
> to different ordering.

"Issuing writes correctly" means doing small writes, one per zone at most. In
that case, it does not matter if the block layer reorders writes. Per zone, they
will still be sequential.
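
(As a toy illustration only, not taken from any real application: "at most one
write in flight per zone" just means tracking a per-zone cursor and submitting
the next write for a zone only after the previous one completed, so per-zone
ordering survives any cross-zone reordering.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct zone_cursor {
	uint64_t next_off;	/* next offset to write within this zone */
	bool in_flight;		/* at most one outstanding write per zone */
};

/* submission side: start a write only if the zone has none in flight */
static bool can_submit(const struct zone_cursor *zc)
{
	return !zc->in_flight;
}

/* completion side: advance the cursor, then allow the next write */
static void on_write_done(struct zone_cursor *zc, size_t len)
{
	zc->next_off += len;
	zc->in_flight = false;
}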

>> As far as I know, zoned drives are always used in tightly controlled
>> environments. Problems like "does not know what other applications would be
>> doing" are non-existent. Setting up the drive correctly for the use case at hand
>> is a sysadmin/server setup problem, based on *the* application (singular)
>> requirements.
>
> Fine.
> But what about the null-block-zone which sets MQ-deadline but does not
> actually use write-lock to avoid race among multiple appends on a
> zone.
> Does that deserve a fix?

In nullblk, commands are executed under a spinlock. So there is no concurrency
problem. The spinlock serializes the execution of all commands. null_blk zone
append emulation thus does not need to take the scheduler level zone write lock
like scsi does.



--
Damien Le Moal
Western Digital Research

2020-09-07 11:30:46

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On Mon, Sep 7, 2020 at 1:52 PM Damien Le Moal <[email protected]> wrote:
>
> On 2020/09/07 16:01, Kanchan Joshi wrote:
> >> Even for SMR, the user is free to set the elevator to none, which disables zone
> >> write locking. Issuing writes correctly then becomes the responsibility of the
> >> application. This can be useful for settings that for instance use NCQ I/O
> >> priorities, which give better results when "none" is used.
> >
> > Was it not a problem that even if the application is sending writes
> > correctly, scheduler may not preserve the order.
> > And even when none is being used, re-queue can happen which may lead
> > to different ordering.
>
> "Issuing writes correctly" means doing small writes, one per zone at most. In
> that case, it does not matter if the block layer reorders writes. Per zone, they
> will still be sequential.
>
> >> As far as I know, zoned drives are always used in tightly controlled
> >> environments. Problems like "does not know what other applications would be
> >> doing" are non-existent. Setting up the drive correctly for the use case at hand
> >> is a sysadmin/server setup problem, based on *the* application (singular)
> >> requirements.
> >
> > Fine.
> > But what about the null-block-zone which sets MQ-deadline but does not
> > actually use write-lock to avoid race among multiple appends on a
> > zone.
> > Does that deserve a fix?
>
> In nullblk, commands are executed under a spinlock. So there is no concurrency
> problem. The spinlock serializes the execution of all commands. null_blk zone
> append emulation thus does not need to take the scheduler level zone write lock
> like scsi does.

I do not see a spinlock for that. There is one "nullb->lock", but its
scope is limited to memory-backed handling.
For concurrent zone-appends on a zone, multiple threads may set the
"same" write-pointer into incoming request(s).
Are you referring to any other spinlock that can avoid the "same wp being
returned to multiple threads" situation?

--
Kanchan

2020-09-07 11:43:05

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On 2020/09/07 20:24, Kanchan Joshi wrote:
> On Mon, Sep 7, 2020 at 1:52 PM Damien Le Moal <[email protected]> wrote:
>>
>> On 2020/09/07 16:01, Kanchan Joshi wrote:
>>>> Even for SMR, the user is free to set the elevator to none, which disables zone
>>>> write locking. Issuing writes correctly then becomes the responsibility of the
>>>> application. This can be useful for settings that for instance use NCQ I/O
>>>> priorities, which give better results when "none" is used.
>>>
>>> Was it not a problem that even if the application is sending writes
>>> correctly, scheduler may not preserve the order.
>>> And even when none is being used, re-queue can happen which may lead
>>> to different ordering.
>>
>> "Issuing writes correctly" means doing small writes, one per zone at most. In
>> that case, it does not matter if the block layer reorders writes. Per zone, they
>> will still be sequential.
>>
>>>> As far as I know, zoned drives are always used in tightly controlled
>>>> environments. Problems like "does not know what other applications would be
>>>> doing" are non-existent. Setting up the drive correctly for the use case at hand
>>>> is a sysadmin/server setup problem, based on *the* application (singular)
>>>> requirements.
>>>
>>> Fine.
>>> But what about the null-block-zone which sets MQ-deadline but does not
>>> actually use write-lock to avoid race among multiple appends on a
>>> zone.
>>> Does that deserve a fix?
>>
>> In nullblk, commands are executed under a spinlock. So there is no concurrency
>> problem. The spinlock serializes the execution of all commands. null_blk zone
>> append emulation thus does not need to take the scheduler level zone write lock
>> like scsi does.
>
> I do not see spinlock for that. There is one "nullb->lock", but its
> scope is limited to memory-backed handling.
> For concurrent zone-appends on a zone, multiple threads may set the
> "same" write-pointer into incoming request(s).
> Are you referring to any other spinlock that can avoid "same wp being
> returned to multiple threads".

Checking again, it looks like you are correct. nullb->lock is indeed only used
for processing reads/writes with memory backing turned on.
We either need to extend the use of that spinlock, or add one to protect the
zone array when doing zoned commands and checks of reads/writes against a
zone wp.
Care to send a patch? I can send one too.


--
Damien Le Moal
Western Digital Research

2020-09-07 12:00:53

by Kanchan Joshi

[permalink] [raw]
Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On Mon, Sep 7, 2020 at 5:07 PM Damien Le Moal <[email protected]> wrote:
>
> On 2020/09/07 20:24, Kanchan Joshi wrote:
> > On Mon, Sep 7, 2020 at 1:52 PM Damien Le Moal <[email protected]> wrote:
> >>
> >> On 2020/09/07 16:01, Kanchan Joshi wrote:
> >>>> Even for SMR, the user is free to set the elevator to none, which disables zone
> >>>> write locking. Issuing writes correctly then becomes the responsibility of the
> >>>> application. This can be useful for settings that for instance use NCQ I/O
> >>>> priorities, which give better results when "none" is used.
> >>>
> >>> Was it not a problem that even if the application is sending writes
> >>> correctly, scheduler may not preserve the order.
> >>> And even when none is being used, re-queue can happen which may lead
> >>> to different ordering.
> >>
> >> "Issuing writes correctly" means doing small writes, one per zone at most. In
> >> that case, it does not matter if the block layer reorders writes. Per zone, they
> >> will still be sequential.
> >>
> >>>> As far as I know, zoned drives are always used in tightly controlled
> >>>> environments. Problems like "does not know what other applications would be
> >>>> doing" are non-existent. Setting up the drive correctly for the use case at hand
> >>>> is a sysadmin/server setup problem, based on *the* application (singular)
> >>>> requirements.
> >>>
> >>> Fine.
> >>> But what about the null-block-zone which sets MQ-deadline but does not
> >>> actually use write-lock to avoid race among multiple appends on a
> >>> zone.
> >>> Does that deserve a fix?
> >>
> >> In nullblk, commands are executed under a spinlock. So there is no concurrency
> >> problem. The spinlock serializes the execution of all commands. null_blk zone
> >> append emulation thus does not need to take the scheduler level zone write lock
> >> like scsi does.
> >
> > I do not see spinlock for that. There is one "nullb->lock", but its
> > scope is limited to memory-backed handling.
> > For concurrent zone-appends on a zone, multiple threads may set the
> > "same" write-pointer into incoming request(s).
> > Are you referring to any other spinlock that can avoid "same wp being
> > returned to multiple threads".
>
> Checking again, it looks like you are correct. nullb->lock is indeed only used
> for processing read/write with memory backing turned on.
> We either need to extend that spinlock use, or add one to protect the zone array
> when doing zoned commands and checks of read/write against a zone wp.
> Care to send a patch ? I can send one too.

Sure, I can send one.
Do you think it is not OK to use the zone write-lock (as the SCSI
emulation does) instead of introducing a new spinlock?


--
Kanchan

2020-09-07 13:00:52

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

On 2020/09/07 20:54, Kanchan Joshi wrote:
> On Mon, Sep 7, 2020 at 5:07 PM Damien Le Moal <[email protected]> wrote:
>>
>> On 2020/09/07 20:24, Kanchan Joshi wrote:
>>> On Mon, Sep 7, 2020 at 1:52 PM Damien Le Moal <[email protected]> wrote:
>>>>
>>>> On 2020/09/07 16:01, Kanchan Joshi wrote:
>>>>>> Even for SMR, the user is free to set the elevator to none, which disables zone
>>>>>> write locking. Issuing writes correctly then becomes the responsibility of the
>>>>>> application. This can be useful for settings that for instance use NCQ I/O
>>>>>> priorities, which give better results when "none" is used.
>>>>>
>>>>> Was it not a problem that even if the application is sending writes
>>>>> correctly, scheduler may not preserve the order.
>>>>> And even when none is being used, re-queue can happen which may lead
>>>>> to different ordering.
>>>>
>>>> "Issuing writes correctly" means doing small writes, one per zone at most. In
>>>> that case, it does not matter if the block layer reorders writes. Per zone, they
>>>> will still be sequential.
>>>>
>>>>>> As far as I know, zoned drives are always used in tightly controlled
>>>>>> environments. Problems like "does not know what other applications would be
>>>>>> doing" are non-existent. Setting up the drive correctly for the use case at hand
>>>>>> is a sysadmin/server setup problem, based on *the* application (singular)
>>>>>> requirements.
>>>>>
>>>>> Fine.
>>>>> But what about the null-block-zone which sets MQ-deadline but does not
>>>>> actually use write-lock to avoid race among multiple appends on a
>>>>> zone.
>>>>> Does that deserve a fix?
>>>>
>>>> In nullblk, commands are executed under a spinlock. So there is no concurrency
>>>> problem. The spinlock serializes the execution of all commands. null_blk zone
>>>> append emulation thus does not need to take the scheduler level zone write lock
>>>> like scsi does.
>>>
>>> I do not see spinlock for that. There is one "nullb->lock", but its
>>> scope is limited to memory-backed handling.
>>> For concurrent zone-appends on a zone, multiple threads may set the
>>> "same" write-pointer into incoming request(s).
>>> Are you referring to any other spinlock that can avoid "same wp being
>>> returned to multiple threads".
>>
>> Checking again, it looks like you are correct. nullb->lock is indeed only used
>> for processing read/write with memory backing turned on.
>> We either need to extend that spinlock use, or add one to protect the zone array
>> when doing zoned commands and checks of read/write against a zone wp.
>> Care to send a patch ? I can send one too.
>
> Sure, I can send.
> Do you think it is not OK to use zone write-lock (same like SCSI
> emulation) instead of introducing a new spinlock?

The zone write lock will not protect against reads or zone management commands
executed concurrently with writes. Only concurrent writes to the same zone will
be serialized by the scheduler zone write locking, which may not be used at
all if the user sets the scheduler to none. A lock for exclusive access to,
and changes of, the zone array is needed.
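
(Roughly -- with names that are illustrative only and not the actual null_blk
symbols -- the idea is a per-device lock held around every access to the zone
array, whatever the command type:)

/* Illustrative sketch only, not the actual null_blk code. */
static blk_status_t null_zoned_do_cmd(struct nullb_device *dev, enum req_opf op,
				      sector_t sector, sector_t nr_sectors)
{
	struct blk_zone *zone;
	blk_status_t ret = BLK_STS_OK;

	spin_lock_irq(&dev->zone_lock);		/* hypothetical per-device lock */
	zone = &dev->zones[null_zone_no(dev, sector)];

	switch (op) {
	case REQ_OP_ZONE_APPEND:
		sector = zone->wp;	/* choose the write position under the lock;
					 * the emulation would also store it back
					 * into the request */
		fallthrough;
	case REQ_OP_WRITE:
		if (sector != zone->wp) {
			ret = BLK_STS_IOERR;	/* unaligned regular write */
			break;
		}
		zone->wp += nr_sectors;		/* advance the wp before unlocking */
		break;
	case REQ_OP_ZONE_RESET:
		zone->wp = zone->start;
		break;
	default:
		break;
	}

	spin_unlock_irq(&dev->zone_lock);
	return ret;
}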




--
Damien Le Moal
Western Digital Research