From: Hans Holmberg <[email protected]>
Introduce a new target: lzbd - LightNVM Zoned Block Device
The new target makes it possible to expose an
Open-Channel 2.0 SSD as one or more zoned block devices consisting of
BLK_ZONE_TYPE_SEQWRITE_REQ zones.
I've been playing around with this for the last couple of months, and
now I'd love to get some feedback.
It's been very useful to look at null_blk's zone support when doing the
plumbing work, and Simon and Klaus have also been very helpful when
figuring out the design. Thanks, guys!
Naming is sometimes the hardest part. I named this lzbd, as I found that
the most descriptive acronym.
NOTE: This is an early prototype and is lacking some vital
features at the moment. It is worth looking at and playing
around with for those interested, but beware of dragons :)
See the lzbd documentation (Documentation/lightnvm/lzbd.txt) for my ideas on
what a full implementation would look like.
What is supported (for now):
* Reads
* Sequential writes
* Unaligned writes (a per-zone ws_opt alignment buffer is used; see the
  sketch after this list)
* Zone resets
* Zone reporting
* Wear leveling (sort of; wear indices are not updated on reset yet)
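
To give a rough idea of how the per-zone alignment buffering behaves, here
is a simplified, stand-alone sketch. This is not the lzbd code itself (the
real logic lives in lzbd_zone_write()/lzbd_add_to_align_buf() in lzbd-zone.c
and also has to deal with bios, locking and chunk striping); it only models
the idea of staging sub-ws_opt writes until a full, aligned unit can be
flushed:

  /* Simplified model of the per-zone ws_opt alignment buffer: user data is
   * staged in a ws_opt-sized buffer and only flushed once a full, aligned
   * write unit has accumulated.
   */
  #include <stdio.h>
  #include <string.h>

  #define WS_OPT     8          /* optimal write size, in 4k sectors */
  #define SECTOR_SZ  4096

  struct align_buf {
          char data[WS_OPT * SECTOR_SZ];
          int secs;             /* 4k sectors currently buffered */
  };

  /* Stand-in for the real thing: a ws_opt-sized vector write to the
   * zone's current chunk.
   */
  static void flush_to_chunk(struct align_buf *ab)
  {
          printf("flushing %d aligned sectors to a chunk\n", ab->secs);
          ab->secs = 0;
  }

  /* Stage 'secs' sectors of user data, flushing whenever ws_opt is reached */
  static void zone_write(struct align_buf *ab, const char *data, int secs)
  {
          while (secs > 0) {
                  int n = WS_OPT - ab->secs;

                  if (n > secs)
                          n = secs;
                  memcpy(ab->data + ab->secs * SECTOR_SZ, data, n * SECTOR_SZ);
                  ab->secs += n;
                  data += n * SECTOR_SZ;
                  secs -= n;

                  if (ab->secs == WS_OPT)
                          flush_to_chunk(ab);
          }
  }

  int main(void)
  {
          static char payload[3 * SECTOR_SZ];
          struct align_buf ab = { .secs = 0 };
          int i;

          /* Nine unaligned 3-sector writes -> three ws_opt flushes, with
           * 3 sectors left buffered until the next write (or a sync, once
           * sync handling is implemented).
           */
          for (i = 0; i < 9; i++)
                  zone_write(&ab, payload, 3);
          printf("%d sectors still buffered\n", ab.secs);
          return 0;
  }

Flushing on ws_opt boundaries is what keeps the actual chunk writes within
the ws_min/ws_opt constraints; the price is the sync handling gap listed
under "not supported" below.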
I've mainly tested in QEMU (cunits=0, ws_min=4, ws_opt=8).
The zoned block device tests in blktests (tests/zbd) pass, and I've done
a bunch of general smoke testing (aligned/unaligned writes with verification
using dd and fio, ...), so the general plumbing seems to hold up, but
more testing is needed.
Performance is definitely not what it should be yet. Only one chunk per zone
is being written to at a time, effectively rate-limiting writes per zone,
which is an interesting constraint, but probably not what we want.
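
As a rough, back-of-the-envelope illustration (my own estimate, not a
measurement): if each PU programs at bandwidth B and a zone is striped over
N chunks on N different PUs, then allowing only one outstanding ws_opt write
per zone caps per-zone write bandwidth at roughly B, while keeping one write
in flight per chunk in the stripe could approach N * B. On a 4-PU instance
that is about a 4x difference for a single-zone writer.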
What is not supported (yet):
* Metadata persistence (when the instance is removed, data is lost)
  - The zone-to-chunk mapping needs to be stored
* Sync handling (flushing alignment buffers)
  - The zone alignment buffer needs to be flushed to disk
* Write error handling (see the sketch after this list)
  - Write errors will require zone -> chunk remapping
    of the failing chunk.
* Chunk reset error handling (chunks going offline)
* Updating wear indices on chunk resets
  - This is low-hanging fruit to fix
* Cunits read buffering
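
For the write error handling item above, this is roughly the shape I have in
mind. It is only a sketch: it reuses the structures from lzbd.h and the chunk
allocation helper from lzbd-zone.c, lzbd_replay_chunk() is hypothetical, and
a real version would also have to handle locking, in-flight I/O and wear
index updates:

  /* Sketch: swap a failing chunk out of a zone's stripe and re-write the
   * data that had already been committed to it. lzbd_replay_chunk() does
   * not exist yet; it stands in for the copy/replay step.
   */
  static int lzbd_remap_failed_chunk(struct lzbd *lzbd, struct lzbd_zone *zone,
                                     int stripe_idx)
  {
          struct lzbd_chunk *bad = zone->chunks[stripe_idx];
          struct lzbd_chunk *fresh;

          /* Prefer a replacement on the same PU to keep the striping intact */
          fresh = lzbd_get_chunk(lzbd, bad->pu);
          if (!fresh)
                  return -ENOSPC;

          zone->chunks[stripe_idx] = fresh;
          bad->meta->state = NVM_CHK_ST_OFFLINE;  /* retire it, for now */

          /* Copy whatever made it to the bad chunk into its replacement */
          return lzbd_replay_chunk(lzbd, bad, fresh);
  }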
Final thoughts, for now:
Since lzbd (and pblk, for that matter) is not entirely unlike a file system,
it would be nice to create a mkfs/fsck/dmzadm-like tool that would:
* Format the drive and persist the instance configuration in a superblock
  contained in the instance metadata (see the layout sketch below)
* Repair broken (i.e. power-failed) instances
Per-sector metadata is currently not utilized in lzbd, but would
be helpful in recovery scenarios.
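
To make the metadata persistence items above a bit more concrete, here is one
possible on-disk layout for the superblock and Z2C table that such a
mkfs-style tool could write. This is entirely hypothetical and not part of
this patch; the field names and sizes are just a strawman (all fields
little-endian on disk):

  /* Hypothetical on-disk metadata layout -- what a mkfs-style tool could
   * write; not something this patch implements.
   */
  #include <stdint.h>

  #define LZBD_SB_MAGIC   0x6c7a6264u     /* "lzbd" */
  #define LZBD_SB_VERSION 1u

  struct lzbd_ondisk_sb {
          uint32_t magic;         /* LZBD_SB_MAGIC */
          uint32_t version;       /* on-disk format version */
          uint8_t  guid[16];      /* instance identifier */
          uint32_t zones;         /* number of zones */
          uint32_t zone_chunks;   /* chunks striped per zone */
          uint64_t zone_size;     /* zone size in 512b sectors */
          uint32_t op;            /* over-provisioning percentage */
          uint32_t crc;           /* crc32 over the fields above */
  };

  /* Z2C table: one packed chunk address per (zone, stripe position),
   * roughly the 4 bytes per chunk estimated in lzbd.txt.
   */
  struct lzbd_ondisk_z2c_entry {
          uint32_t ppa;           /* packed grp/pu/chk of the backing chunk */
  };

A version field plus a crc should be enough for an fsck-style tool to at
least detect a half-written superblock after a power failure.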
The patch is based on Matias' for-5.2/core branch in the openchannel
github project. It is also available at [1] (branch for-5.2/lzbd).
Thanks,
Hans
[1] CNEX Labs linux github project: https://github.com/CNEX-Labs/linux
Hans Holmberg (1):
lightnvm: add lzbd - a zoned block device target
Documentation/lightnvm/lzbd.txt | 122 +++++++++++
drivers/lightnvm/Kconfig | 11 +
drivers/lightnvm/Makefile | 3 +
drivers/lightnvm/lzbd-io.c | 342 +++++++++++++++++++++++++++++++
drivers/lightnvm/lzbd-target.c | 392 +++++++++++++++++++++++++++++++++++
drivers/lightnvm/lzbd-user.c | 310 ++++++++++++++++++++++++++++
drivers/lightnvm/lzbd-zone.c | 444 ++++++++++++++++++++++++++++++++++++++++
drivers/lightnvm/lzbd.h | 139 +++++++++++++
8 files changed, 1763 insertions(+)
create mode 100644 Documentation/lightnvm/lzbd.txt
create mode 100644 drivers/lightnvm/lzbd-io.c
create mode 100644 drivers/lightnvm/lzbd-target.c
create mode 100644 drivers/lightnvm/lzbd-user.c
create mode 100644 drivers/lightnvm/lzbd-zone.c
create mode 100644 drivers/lightnvm/lzbd.h
--
2.7.4
From: Hans Holmberg <[email protected]>
Introduce a new target: lzbd - LightNVM Zoned Block Device
The new target makes it possible to expose an
Open-Channel 2.0 SSD as one or more zoned block devices.
See Documentation/lightnvm/lzbd.txt for more information.
Experimental in its present state of implementation.
Signed-off-by: Hans Holmberg <[email protected]>
---
Documentation/lightnvm/lzbd.txt | 122 +++++++++++
drivers/lightnvm/Kconfig | 11 +
drivers/lightnvm/Makefile | 3 +
drivers/lightnvm/lzbd-io.c | 342 +++++++++++++++++++++++++++++++
drivers/lightnvm/lzbd-target.c | 392 +++++++++++++++++++++++++++++++++++
drivers/lightnvm/lzbd-user.c | 310 ++++++++++++++++++++++++++++
drivers/lightnvm/lzbd-zone.c | 444 ++++++++++++++++++++++++++++++++++++++++
drivers/lightnvm/lzbd.h | 139 +++++++++++++
8 files changed, 1763 insertions(+)
create mode 100644 Documentation/lightnvm/lzbd.txt
create mode 100644 drivers/lightnvm/lzbd-io.c
create mode 100644 drivers/lightnvm/lzbd-target.c
create mode 100644 drivers/lightnvm/lzbd-user.c
create mode 100644 drivers/lightnvm/lzbd-zone.c
create mode 100644 drivers/lightnvm/lzbd.h
diff --git a/Documentation/lightnvm/lzbd.txt b/Documentation/lightnvm/lzbd.txt
new file mode 100644
index 000000000000..8bdbc01a25be
--- /dev/null
+++ b/Documentation/lightnvm/lzbd.txt
@@ -0,0 +1,122 @@
+lzbd: A Zoned Block Device LightNVM Target
+==========================================
+
+The lzbd lightnvm target makes it possible to expose an Open-Channel 2.0 SSD
+as one or more zoned block devices.
+
+Each lightnvm target is assigned a range of parallel units. Parallel units(PUs)
+are not shared among targets avoiding I/O QoS disturbances between targets as
+far as possible.
+
+For more information on lightnvm, see [1]
+For more information on Open-Channel 2.0, see [2].
+For more information on zoned block devices see [3].
+
+lzbd is designed to act as a slim adaptor, making it possible to plug
+OCSSD 2.0 SSDs into the zone block device ecosystem.
+
+lzbd manages zone to chunk mapping, read/write restrictions, wear leveling
+and write errors.
+
+Zone geometry
+-------------
+
+From a user perspective, lzbd targets form a number of sequential-write-required
+(BLK_ZONE_TYPE_SEQWRITE_REQ) zones.
+
+Not all of the target's capacity is exposed to the user.
+Some chunks are reserved for metadata and over-provisioning.
+
+The zones follow the same constraints as described in [3].
+
+All zones are of the same size (SZ).
+
+Simple example:
+
+Sector Zone type
+ _______________________
+0 --> | Sequential write req. |
+ | |
+ |_______________________|
+SZ --> | Sequential write req. |
+ | |
+ |_______________________|
+SZ*2..--> | Sequential write req. |
+ | |
+.......... .........................
+ |_______________________|
+SZ*N-1 --> | Sequential write req. |
+ |_______________________|
+
+
+SZ is configurable, but is restricted to a multiple of
+(chunk size (CLBA) * Number of PUs).
+
+Zone to chunk mapping
+---------------------
+
+Zones are spread across PUs to allow maximum write throughput through striping.
+One or more chunks (CHK) per PU is assigned.
+
+Example:
+
+OCSSD 2.0 Geometry: 4 PUs, 16 chunks per PU.
+Zones: 3
+
+ Zone PU0 PU1 PU2 PU3
+_______ _____ _____ _____ _____
+ |CHK 0|CHK 0|CHK A|CHK 0|
+ 0 |CHK 2|CHK 3|CHK 3|CHK 1|
+_______ |_____|_____|_____|_____|
+ |CHK 3|CHK B|CHK 8|CHK A|
+ 1 |CHK 7|CHK F|CHK 2|CHK 3|
+_______ |_____|_____|_____|_____|
+ |CHK 8|CHK 2|CHK 7|CHK 4|
+ 2 |CHK 1|CHK A|CHK 5|CHK 2|
+_______ |_____|_____|_____|_____|
+
+Chunks are assigned to a zone when it is opened based on the chunk wear index.
+
+Note: The disk's Maximum Open Chunks (MAXOC) limit puts an upper bound on
+maximum simultaneously open zones (unless MAXOC = 0).
+
+Meta data and over-provisioning
+-------------------------------
+
+lzbd needs the following meta data to be persisted:
+
+* a zone-to chunk mapping (Z2C) table, size: 4 bytes * Number of chunks
+* a superblock containing target configuration, guuid, on-disk format version,
+ etc.
+
+Additionally, chunks need to be reserved for handling:
+
+* write errors
+* chunks wearing out and going offline
+* persisting data not aligned with the minimal write constraint
+
+The meta data is stored a separate set of chunks from the user data.
+
+Host memory requirements
+------------------------
+
+The Z2C mapping table needs to be kept in host memory (see above), and:
+
+* in order to achieve maximum throughput and alignment requirements,
+ a small write buffer is needed
+ Size: Optimal Write Size (WS_OPT) * Maximum number of open zones.
+
+* to satisify OCSSD 2.0 read restrictions, a read buffer is needed.
+ Size: Number of PUs * Cache Minimum Write Size Units (MW_CUNITS) *
+ Maximum number of open zones.
+
+If MW_CUNITS = 0, no read buffer is needed and data can be written without
+any host copying/buffering (except for handling WS_OPT alignment).
+
+References
+----------
+
+[1] Lightnvm website: http://lightnvm.io/
+[2] OCSSD 2.0 Specification: http://lightnvm.io/docs/OCSSD-2_0-20180129.pdf
+[3] ZBC / Zoned block device support: https://lwn.net/Articles/703871/
+
diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index a872cd720967..98882874bda6 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -16,6 +16,17 @@ menuconfig NVM
if NVM
+config NVM_LZBD
+ tristate "Zoned Block Device Open-Channel SSD target"
+ depends on BLK_DEV_ZONED
+ help
+ Allows an open-channel SSD to be exposed as a zoned block device to the
+ host.
+
+ Highly EXPERIMENTAL for now.
+
+ Only say Y if you want to play with it.
+
config NVM_PBLK
tristate "Physical Block Device Open-Channel SSD target"
help
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
index 97d9d7c71550..f9eea8b23b33 100644
--- a/drivers/lightnvm/Makefile
+++ b/drivers/lightnvm/Makefile
@@ -9,3 +9,6 @@ pblk-y := pblk-init.o pblk-core.o pblk-rb.o \
pblk-write.o pblk-cache.o pblk-read.o \
pblk-gc.o pblk-recovery.o pblk-map.o \
pblk-rl.o pblk-sysfs.o
+
+obj-$(CONFIG_NVM_LZBD) += lzbd.o
+lzbd-y := lzbd-target.o lzbd-user.o lzbd-io.o lzbd-zone.o
diff --git a/drivers/lightnvm/lzbd-io.c b/drivers/lightnvm/lzbd-io.c
new file mode 100644
index 000000000000..b210ab33fdd3
--- /dev/null
+++ b/drivers/lightnvm/lzbd-io.c
@@ -0,0 +1,342 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *
+ * Zoned block device lightnvm target
+ * Copyright (C) 2019 CNEX Labs
+ *
+ * Disk I/O
+ */
+
+#include "lzbd.h"
+
+static inline void lzbd_chunk_log(char *message, int err,
+ struct lzbd_chunk *lzbd_chunk)
+{
+
+ /* TODO: create trace points in stead */
+ pr_err("lzbd: %s: err: %d grp: %d pu: %d chk: %d slba: %llu state: %d wp: %llu\n",
+ message,
+ err,
+ lzbd_chunk->ppa.m.grp,
+ lzbd_chunk->ppa.m.pu,
+ lzbd_chunk->ppa.m.chk,
+ lzbd_chunk->meta->slba,
+ lzbd_chunk->meta->state,
+ lzbd_chunk->meta->wp);
+}
+
+int lzbd_reset_chunk(struct lzbd *lzbd, struct lzbd_chunk *chunk)
+{
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_rq rqd = {NULL};
+ int ret;
+
+ if ((chunk->meta->state & (NVM_CHK_ST_FREE | NVM_CHK_ST_OFFLINE))) {
+ pr_err("lzbd: reset of chunk in illegal state: %d\n",
+ chunk->meta->state);
+ return -EINVAL;
+ }
+
+ rqd.opcode = NVM_OP_ERASE;
+ rqd.ppa_addr = chunk->ppa;
+ rqd.nr_ppas = 1;
+ rqd.is_seq = 1;
+
+ ret = nvm_submit_io_sync(dev, &rqd);
+
+ /* For now, set the chunk offline if the request fails
+ * TODO: Pass a buffer in the request so we get a full
+ * meta update from the device
+ */
+
+ if (!ret) {
+ if (rqd.error) {
+ if ((rqd.error & 0xfff) == 0x2c0) {
+ lzbd_chunk_log("chunk went offline", 0, chunk);
+ chunk->meta->state = NVM_CHK_ST_OFFLINE;
+ } else {
+ if ((rqd.error & 0xfff) == 0x2c1) {
+ lzbd_chunk_log("invalid reset",
+ -EINVAL, chunk);
+ } else {
+ lzbd_chunk_log("unknown error",
+ -EINVAL, chunk);
+ }
+ return -EINVAL;
+ }
+ } else {
+ chunk->meta->state = NVM_CHK_ST_FREE;
+ chunk->meta->wp = 0;
+ }
+ }
+
+ return ret;
+}
+
+/* Prepare a write request to a chunk. If the function call succeeds
+ * the call must be paired with a lzbd_free_wr_rq
+ */
+static int lzbd_init_wr_rq(struct lzbd *lzbd, struct lzbd_chunk *chunk,
+ struct bio *bio, struct nvm_rq *rq)
+{
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ struct ppa_addr ppa;
+ struct ppa_addr *ppa_list;
+ int metadata_sz = geo->sos * NVM_MAX_VLBA;
+ int nr_ppas = geo->ws_opt;
+ int i;
+
+ memset(rq, 0, sizeof(struct nvm_rq));
+
+ rq->bio = bio;
+ rq->opcode = NVM_OP_PWRITE;
+ rq->nr_ppas = nr_ppas;
+ rq->is_seq = 1;
+ rq->private = &chunk->wr_ctx;
+
+ /* Do we respect the write size restrictions? */
+ if (nr_ppas > geo->ws_opt || (nr_ppas % geo->ws_min)) {
+ pr_err("lzbd: write size violation size: %d\n", nr_ppas);
+ return -EINVAL;
+ }
+
+ /* Is the chunk in the right state? */
+ if (!(chunk->meta->state & (NVM_CHK_ST_FREE | NVM_CHK_ST_OPEN))) {
+ pr_err("lzbd: write to chunk in wrong state: %d\n",
+ chunk->meta->state);
+ return -EINVAL;
+ }
+
+ /* Do we have room for the write? */
+ if ((chunk->meta->wp + nr_ppas) > geo->clba) {
+ pr_err("lzbd: cant fit write into chunk size %d\n", nr_ppas);
+ return -EINVAL;
+ }
+
+ rq->meta_list = nvm_dev_dma_alloc(dev->parent, GFP_KERNEL,
+ &rq->dma_meta_list);
+ if (!rq->meta_list)
+ return -ENOMEM;
+
+ /* We don't care about metadata. yet. */
+ memset(rq->meta_list, 42, metadata_sz);
+
+ if (nr_ppas > 1) {
+ rq->ppa_list = rq->meta_list + metadata_sz;
+ rq->dma_ppa_list = rq->dma_meta_list + metadata_sz;
+ }
+
+ //pr_err("lzbd: writing %d sectors\n", nr_ppas);
+
+ ppa.ppa = chunk->ppa.ppa;
+
+ mutex_lock(&chunk->wr_ctx.wr_lock);
+
+ ppa.m.sec = chunk->meta->wp;
+
+ ppa_list = nvm_rq_to_ppa_list(rq);
+ for (i = 0; i < nr_ppas; i++) {
+ ppa_list[i].ppa = ppa.ppa;
+ ppa.m.sec++;
+ }
+
+ return 0;
+}
+
+static void lzbd_free_wr_rq(struct lzbd *lzbd, struct nvm_rq *rq)
+{
+ struct lzbd_wr_ctx *wr_ctx = rq->private;
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct lzbd_chunk *chunk;
+
+ chunk = container_of(wr_ctx, struct lzbd_chunk, wr_ctx);
+
+ mutex_unlock(&chunk->wr_ctx.wr_lock);
+ nvm_dev_dma_free(dev->parent, rq->meta_list, rq->dma_meta_list);
+}
+
+static inline void lzbd_wr_rq_post(struct nvm_rq *rq)
+{
+ struct lzbd_wr_ctx *wr_ctx = rq->private;
+ struct lzbd *lzbd = wr_ctx->lzbd;
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ struct lzbd_chunk *chunk;
+
+ chunk = container_of(wr_ctx, struct lzbd_chunk, wr_ctx);
+
+ if (!rq->error) {
+ if (chunk->meta->wp == 0)
+ chunk->meta->state = NVM_CHK_ST_OPEN;
+
+ chunk->meta->wp += rq->nr_ppas;
+ if (chunk->meta->wp == geo->clba)
+ chunk->meta->state = NVM_CHK_ST_CLOSED;
+ }
+}
+
+int lzbd_write_to_chunk_sync(struct lzbd *lzbd, struct lzbd_chunk *chunk,
+ struct bio *bio)
+{
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_rq rq;
+ int ret;
+
+ ret = lzbd_init_wr_rq(lzbd, chunk, bio, &rq);
+ if (ret)
+ return ret;
+
+ ret = nvm_submit_io_sync(dev, &rq);
+ if (ret) {
+ ret = rq.error;
+ pr_err("lzbd: sync write request submit failed: %d\n", ret);
+ } else {
+ lzbd_wr_rq_post(&rq);
+ }
+
+ lzbd_free_wr_rq(lzbd, &rq);
+
+ return ret;
+}
+
+static void lzbd_read_endio(struct nvm_rq *rq)
+{
+ struct lzbd_rd_ctx *rd_ctx = container_of(rq, struct lzbd_rd_ctx, rqd);
+ struct lzbd *lzbd = rd_ctx->lzbd;
+ struct lzbd_user_read *read = rd_ctx->read;
+ struct nvm_tgt_dev *dev = lzbd->dev;
+
+ if (unlikely(rq->error))
+ read->error = true;
+
+ if (rq->meta_list)
+ nvm_dev_dma_free(dev->parent, rq->meta_list, rq->dma_meta_list);
+
+ kref_put(&read->ref, lzbd_user_read_put);
+ kfree(rd_ctx);
+}
+
+static int lzbd_read_from_chunk_async(struct lzbd *lzbd,
+ struct lzbd_chunk *chunk,
+ struct bio *bio,
+ struct lzbd_user_read *user_read,
+ int start)
+{
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ struct lzbd_rd_ctx *rd_ctx;
+ struct nvm_rq *rq;
+ struct ppa_addr ppa;
+ struct ppa_addr *ppa_list;
+ int metadata_sz = geo->sos * NVM_MAX_VLBA;
+ int nr_ppas = lzbd_get_bio_len(bio);
+ int ret;
+ int i;
+
+ /* Do we respect the read size restrictions? */
+ if (nr_ppas >= NVM_MAX_VLBA) {
+ pr_err("lzbd: read size violation size: %d\n", nr_ppas);
+ return -EINVAL;
+ }
+
+ /* Is the chunk in the right state? */
+ if (!(chunk->meta->state & (NVM_CHK_ST_OPEN | NVM_CHK_ST_CLOSED))) {
+ pr_err("lzbd: read from chunk in wrong state: %d\n",
+ chunk->meta->state);
+ return -EINVAL;
+ }
+
+ /*Are we reading within bounds? */
+ if ((start + nr_ppas) > geo->clba) {
+ pr_err("lzbd: read past the chunk size %d start: %d\n",
+ nr_ppas, start);
+ return -EINVAL;
+ }
+
+ rd_ctx = kzalloc(sizeof(struct lzbd_rd_ctx), GFP_KERNEL);
+ if (!rd_ctx)
+ return -ENOMEM;
+
+ rd_ctx->read = user_read;
+ rd_ctx->lzbd = lzbd;
+
+ rq = &rd_ctx->rqd;
+ rq->bio = bio;
+ rq->opcode = NVM_OP_PREAD;
+ rq->nr_ppas = nr_ppas;
+ rq->end_io = lzbd_read_endio;
+ rq->private = lzbd;
+ rq->meta_list = nvm_dev_dma_alloc(dev->parent, GFP_KERNEL,
+ &rq->dma_meta_list);
+ if (!rq->meta_list) {
+ kfree(rd_ctx);
+ return -ENOMEM;
+ }
+
+ if (nr_ppas > 1) {
+ rq->ppa_list = rq->meta_list + metadata_sz;
+ rq->dma_ppa_list = rq->dma_meta_list + metadata_sz;
+ }
+
+ ppa.ppa = chunk->ppa.ppa;
+ ppa.m.sec = start;
+
+ ppa_list = nvm_rq_to_ppa_list(rq);
+ for (i = 0; i < nr_ppas; i++) {
+ ppa_list[i].ppa = ppa.ppa;
+ ppa.m.sec++;
+ }
+
+ ret = nvm_submit_io(dev, rq);
+
+ if (ret) {
+ pr_err("lzbd: read request submit failed: %d\n", ret);
+ nvm_dev_dma_free(dev->parent, rq->meta_list, rq->dma_meta_list);
+ kfree(rd_ctx);
+ }
+
+ return ret;
+}
+
+int lzbd_write_to_chunk_user(struct lzbd *lzbd, struct lzbd_chunk *chunk,
+ struct bio *user_bio)
+{
+ struct bio *write_bio;
+ int ret = 0;
+
+ write_bio = bio_clone_fast(user_bio, GFP_KERNEL, &lzbd_bio_set);
+ if (!write_bio)
+ return -ENOMEM;
+
+ ret = lzbd_write_to_chunk_sync(lzbd, chunk, write_bio);
+ if (ret) {
+ ret = -EIO;
+ bio_io_error(user_bio);
+ } else {
+ ret = 0;
+ bio_endio(user_bio);
+ }
+
+ return ret;
+}
+
+int lzbd_read_from_chunk_user(struct lzbd *lzbd, struct lzbd_chunk *chunk,
+ struct bio *bio, struct lzbd_user_read *user_read,
+ int start)
+{
+ struct bio *read_bio;
+ int ret = 0;
+
+ read_bio = bio_clone_fast(bio, GFP_KERNEL, &lzbd_bio_set);
+ if (!read_bio) {
+ pr_err("lzbd: bio clone failed!\n");
+ return -ENOMEM;
+ }
+
+ ret = lzbd_read_from_chunk_async(lzbd, chunk,
+ read_bio, user_read, start);
+
+ return ret;
+}
+
diff --git a/drivers/lightnvm/lzbd-target.c b/drivers/lightnvm/lzbd-target.c
new file mode 100644
index 000000000000..04dd22873eeb
--- /dev/null
+++ b/drivers/lightnvm/lzbd-target.c
@@ -0,0 +1,392 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *
+ * Zoned block device lightnvm target
+ * Copyright (C) 2019 CNEX Labs
+ *
+ * Target handling: module boilerplate, init and remove
+ */
+
+#include <linux/module.h>
+
+#include "lzbd.h"
+
+struct bio_set lzbd_bio_set;
+
+static sector_t lzbd_capacity(void *private)
+{
+ struct lzbd *lzbd = private;
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+
+ return dl->capacity;
+}
+
+static void lzbd_free_chunks(struct lzbd *lzbd)
+{
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ struct lzbd_chunks *chunks = &lzbd->chunks;
+ int parallel_units = geo->all_luns;
+ int i;
+
+ for (i = 0; i < parallel_units; i++) {
+ struct lzbd_pu *pu = &chunks->pus[i];
+ struct list_head *pos, *n;
+ struct lzbd_chunk *chunk;
+
+ mutex_destroy(&pu->lock);
+
+ list_for_each_safe(pos, n, &pu->chk_list) {
+ chunk = list_entry(pos, struct lzbd_chunk, list);
+
+ list_del(pos);
+ mutex_destroy(&chunk->wr_ctx.wr_lock);
+ kfree(chunk);
+ }
+ }
+
+ kfree(chunks->pus);
+ vfree(chunks->meta);
+}
+
+/* Add chunk to chunklist in falling wi order */
+void lzbd_add_chunk(struct lzbd_chunk *chunk,
+ struct list_head *head)
+{
+ struct lzbd_chunk *c = NULL;
+
+ list_for_each_entry(c, head, list) {
+ if (chunk->meta->wi < c->meta->wi)
+ break;
+ }
+
+ list_add_tail(&chunk->list, &c->list);
+}
+
+
+static int lzbd_init_chunks(struct lzbd *lzbd)
+{
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ struct nvm_chk_meta *meta;
+ struct lzbd_chunks *chunks = &lzbd->chunks;
+ int parallel_units = geo->all_luns;
+ struct ppa_addr ppa;
+ int ret;
+ int chk;
+ int i;
+
+ chunks->pus = kcalloc(parallel_units, sizeof(struct lzbd_pu),
+ GFP_KERNEL);
+ if (!chunks->pus)
+ return -ENOMEM;
+
+ meta = vzalloc(geo->all_chunks * sizeof(*meta));
+ if (!meta) {
+ kfree(chunks->pus);
+ return -ENOMEM;
+ }
+
+ chunks->meta = meta;
+
+ for (i = 0; i < parallel_units; i++) {
+ struct lzbd_pu *lzbd_pu = &chunks->pus[i];
+
+ INIT_LIST_HEAD(&lzbd_pu->chk_list);
+ mutex_init(&lzbd_pu->lock);
+ }
+
+ ppa.ppa = 0; /* get all chunks */
+ ret = nvm_get_chunk_meta(dev, ppa, geo->all_chunks, meta);
+ if (ret) {
+ lzbd_free_chunks(lzbd);
+ return -EIO;
+ }
+
+ for (chk = 0; chk < geo->num_chk; chk++) {
+ for (i = 0; i < parallel_units; i++) {
+ struct lzbd_pu *lzbd_pu = &chunks->pus[i];
+ struct nvm_chk_meta *chk_meta;
+ int grp = i / geo->num_lun;
+ int pu = i % geo->num_lun;
+ int offset = 0;
+
+ offset += grp * geo->num_lun * geo->num_chk;
+ offset += pu * geo->num_chk;
+ offset += chk;
+
+ chk_meta = &meta[offset];
+
+ if (!(chk_meta->state & NVM_CHK_ST_OFFLINE)) {
+ struct lzbd_chunk *chunk;
+
+ chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
+ if (!chunk) {
+ lzbd_free_chunks(lzbd);
+ return -ENOMEM;
+ }
+
+ INIT_LIST_HEAD(&chunk->list);
+ chunk->meta = chk_meta;
+ chunk->ppa.m.grp = grp;
+ chunk->ppa.m.pu = pu;
+ chunk->ppa.m.chk = chk;
+ chunk->pu = i;
+
+ lzbd_add_chunk(chunk, &lzbd_pu->chk_list);
+
+ mutex_init(&chunk->wr_ctx.wr_lock);
+ chunk->wr_ctx.lzbd = lzbd;
+ } else {
+ lzbd_pu->offline_chks++;
+ }
+ }
+ }
+
+ return 0;
+}
+
+static struct lzbd_zone *lzbd_init_zones(struct lzbd *lzbd)
+{
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ int i;
+ struct lzbd_zone *zones;
+ u64 zone_offset = 0;
+
+ zones = kmalloc_array(dl->zones, sizeof(*zones), GFP_KERNEL);
+ if (!zones)
+ return NULL;
+
+ /* Sequential zones */
+ for (i = 0; i < dl->zones; i++, zone_offset += dl->zone_size) {
+ struct lzbd_zone *zone = &zones[i];
+ struct blk_zone *bz = &zone->blk_zone;
+
+ bz->start = zone_offset;
+ bz->len = dl->zone_size;
+ bz->wp = zone_offset + dl->zone_size;
+ bz->type = BLK_ZONE_TYPE_SEQWRITE_REQ;
+ bz->cond = BLK_ZONE_COND_FULL;
+
+ bz->non_seq = 0;
+ bz->reset = 1;
+
+ /* zero-out reserved bytes to be forward-compatible */
+ memset(bz->reserved, 0, sizeof(bz->reserved));
+
+ zones[i].chunks = NULL;
+ mutex_init(&zone->lock);
+
+ zone->wr_align.buffer = NULL;
+ mutex_init(&zone->wr_align.lock);
+ }
+
+ return zones;
+}
+
+
+static void lzbd_config_disk_queue(struct lzbd *lzbd)
+{
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct gendisk *disk = lzbd->disk;
+ struct nvm_geo *geo = &dev->geo;
+ struct request_queue *bqueue = dev->q;
+ struct request_queue *dqueue = disk->queue;
+
+ blk_queue_logical_block_size(dqueue, queue_physical_block_size(bqueue));
+ blk_queue_max_hw_sectors(dqueue, queue_max_hw_sectors(bqueue));
+
+ blk_queue_write_cache(dqueue, true, false);
+
+ dqueue->limits.discard_granularity = geo->clba * geo->csecs;
+ dqueue->limits.discard_alignment = 0;
+ blk_queue_max_discard_sectors(dqueue, UINT_MAX >> 9);
+ blk_queue_flag_set(QUEUE_FLAG_DISCARD, dqueue);
+
+ dqueue->limits.zoned = BLK_ZONED_HM;
+ dqueue->nr_zones = dl->zones;
+ dqueue->limits.chunk_sectors = dl->zone_size;
+}
+
+
+static int lzbd_dev_is_supported(struct nvm_tgt_dev *dev)
+{
+ struct nvm_geo *geo = &dev->geo;
+
+ if (geo->major_ver_id != 2) {
+ pr_err("lzbd only supports Open Channel 2.x devices\n");
+ return 0;
+ }
+
+ if (geo->csecs != LZBD_SECTOR_SIZE) {
+ pr_err("lzbd: unsupported block size %d", geo->csecs);
+ return 0;
+ }
+
+ /* We will need to check(some of) these parameters later on,
+ * but for now, just print them. TODO: check cunit, maxoc
+ */
+ pr_info("lzbd: ws_min:%d ws_opt:%d cunits:%d maxoc:%d maxocpu:%d\n",
+ geo->ws_min, geo->ws_opt, geo->mw_cunits,
+ geo->maxoc, geo->maxocpu);
+
+ return 1;
+}
+
+
+static const struct block_device_operations lzbd_fops = {
+ .report_zones = lzbd_report_zones,
+ .owner = THIS_MODULE,
+};
+
+static void lzbd_dump_geo(struct nvm_tgt_dev *dev)
+{
+ struct nvm_geo *geo = &dev->geo;
+
+ pr_info("lzbd: target geo: num_grp: %d num_pu: %d num_chk: %d ws_opt: %d\n",
+ geo->num_ch, geo->all_luns, geo->num_chk, geo->ws_opt);
+}
+
+static void lzbd_create_layout(struct lzbd *lzbd)
+{
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ int user_chunks;
+
+ /* Default to 20% over-provisioning if not specified
+ * (better safe than sorry)
+ */
+ if (geo->op == NVM_TARGET_DEFAULT_OP)
+ dl->op = 20;
+ else
+ dl->op = geo->op;
+
+ dl->meta_chunks = 4;
+ dl->zone_chunks = geo->all_luns;
+ dl->zone_size = (geo->clba * dl->zone_chunks) << 3;
+
+ user_chunks = geo->all_chunks * (100 - dl->op);
+ sector_div(user_chunks, 100);
+
+ dl->zones = user_chunks / dl->zone_chunks;
+ dl->capacity = dl->zones * dl->zone_size;
+}
+
+static void lzbd_dump_layout(struct lzbd *lzbd)
+{
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+
+ pr_info("lzbd: layout: op: %d zones: %d per zone chks: %d secs: %llu\n",
+ dl->op, dl->zones, dl->zone_chunks,
+ (unsigned long long)dl->zone_size);
+}
+
+static void *lzbd_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
+ int flags)
+{
+ struct lzbd *lzbd;
+
+ lzbd_dump_geo(dev);
+
+ if (!lzbd_dev_is_supported(dev))
+ return ERR_PTR(-EINVAL);
+
+
+ if (!(flags & NVM_TARGET_FACTORY)) {
+ pr_err("lzbd: metadata not persisted, only factory init supported\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ lzbd = kzalloc(sizeof(struct lzbd), GFP_KERNEL);
+ if (!lzbd)
+ return ERR_PTR(-ENOMEM);
+
+ lzbd->dev = dev;
+ lzbd->disk = tdisk;
+
+ lzbd_create_layout(lzbd);
+ lzbd_dump_layout(lzbd);
+
+ lzbd->zones = lzbd_init_zones(lzbd);
+
+ if (!lzbd->zones)
+ goto err_free_lzbd;
+
+ if (lzbd_init_chunks(lzbd))
+ goto err_free_zones;
+ lzbd_config_disk_queue(lzbd);
+
+ /* Override the fops to enable zone reporting support */
+ lzbd->disk->fops = &lzbd_fops;
+
+ return lzbd;
+
+err_free_zones:
+ kfree(lzbd->zones);
+err_free_lzbd:
+ kfree(lzbd);
+
+ return ERR_PTR(-ENOMEM);
+}
+
+static void lzbd_exit(void *private, bool graceful)
+{
+ struct lzbd *lzbd = private;
+
+ lzbd_free_chunks(lzbd);
+ kfree(lzbd->zones);
+ kfree(lzbd);
+}
+
+
+static int lzbd_sysfs_init(struct gendisk *tdisk)
+{
+ /* Crickets */
+ return 0;
+}
+
+static void lzbd_sysfs_exit(struct gendisk *tdisk)
+{
+ /* Tumbleweed */
+}
+
+static struct nvm_tgt_type tt_lzbd = {
+ .name = "lzbd",
+ .version = {0, 0, 1},
+
+ .init = lzbd_init,
+ .exit = lzbd_exit,
+
+ .capacity = lzbd_capacity,
+ .make_rq = lzbd_make_rq,
+
+ .sysfs_init = lzbd_sysfs_init,
+ .sysfs_exit = lzbd_sysfs_exit,
+
+ .owner = THIS_MODULE,
+};
+
+static int __init lzbd_module_init(void)
+{
+ int ret;
+
+ ret = bioset_init(&lzbd_bio_set, BIO_POOL_SIZE, 0, 0);
+ if (ret)
+ return ret;
+
+ return nvm_register_tgt_type(&tt_lzbd);
+}
+
+static void lzbd_module_exit(void)
+{
+ bioset_exit(&lzbd_bio_set);
+ nvm_unregister_tgt_type(&tt_lzbd);
+}
+
+module_init(lzbd_module_init);
+module_exit(lzbd_module_exit);
+MODULE_AUTHOR("Hans Holmberg <[email protected]>");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Zoned Block-Device for Open-Channel SSDs");
diff --git a/drivers/lightnvm/lzbd-user.c b/drivers/lightnvm/lzbd-user.c
new file mode 100644
index 000000000000..e38ec763941e
--- /dev/null
+++ b/drivers/lightnvm/lzbd-user.c
@@ -0,0 +1,310 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *
+ * Zoned block device lightnvm target
+ * Copyright (C) 2019 CNEX Labs
+ *
+ * User interfacing code: read/write/reset requests
+ */
+
+#include "lzbd.h"
+
+static void lzbd_fail_bio(struct bio *bio, char *op)
+{
+ pr_err("lzbd: failing %s. start lba: %lu length: %lu\n", op,
+ lzbd_get_bio_lba(bio), lzbd_get_bio_len(bio));
+
+ bio_io_error(bio);
+}
+
+static struct lzbd_zone *lzbd_get_zone(struct lzbd *lzbd, sector_t sector)
+{
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ struct lzbd_zone *zone;
+ struct blk_zone *bz;
+
+ sector_div(sector, dl->zone_size);
+
+ if (sector >= dl->zones)
+ return NULL;
+
+ zone = &lzbd->zones[sector];
+ bz = &zone->blk_zone;
+
+ return zone;
+}
+
+static int lzbd_write_rq(struct lzbd *lzbd, struct lzbd_zone *zone,
+ struct bio *bio)
+{
+ sector_t sector = bio->bi_iter.bi_sector;
+ sector_t nr_secs = lzbd_get_bio_len(bio);
+ struct blk_zone *bz;
+ int left;
+
+ mutex_lock(&zone->lock);
+
+ bz = &zone->blk_zone;
+
+ if (bz->cond == BLK_ZONE_COND_OFFLINE) {
+ mutex_unlock(&zone->lock);
+ return -EIO;
+ }
+
+ if (bz->cond == BLK_ZONE_COND_EMPTY)
+ bz->cond = BLK_ZONE_COND_IMP_OPEN;
+
+ if (sector != bz->wp) {
+ if (sector == bz->start) {
+ if (lzbd_zone_reset(lzbd, zone)) {
+ pr_err("lzbd: zone reset failed");
+ bz->cond = BLK_ZONE_COND_OFFLINE;
+ mutex_unlock(&zone->lock);
+ return -EIO;
+ }
+ bz->cond = BLK_ZONE_COND_IMP_OPEN;
+ bz->wp = bz->start;
+ } else {
+ pr_err("lzbd: write pointer error");
+ mutex_unlock(&zone->lock);
+ return -EIO;
+ }
+ }
+
+ left = lzbd_zone_write(lzbd, zone, bio);
+
+ bz->wp += (nr_secs - left) << 3;
+ if (bz->wp == (bz->start + bz->len)) {
+ lzbd_zone_free_wr_buffer(zone);
+ bz->cond = BLK_ZONE_COND_FULL;
+ }
+
+ mutex_unlock(&zone->lock);
+
+ if (left > 0) {
+ pr_err("lzbd: write did not complete");
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static int lzbd_read_rq(struct lzbd *lzbd, struct lzbd_zone *zone,
+ struct bio *bio)
+{
+ struct blk_zone *bz;
+ sector_t read_end, data_end;
+ sector_t data_start = bio->bi_iter.bi_sector;
+ int ret;
+
+ if (!zone) {
+ lzbd_fail_bio(bio, "lzbd: no zone mapped to read sector");
+ return -EIO;
+ }
+
+ bz = &zone->blk_zone;
+
+ if (!zone->chunks || bz->cond == BLK_ZONE_COND_OFFLINE) {
+ /* No valid data in this zone */
+ zero_fill_bio(bio);
+ bio_endio(bio);
+ return 0;
+ }
+
+ if (data_start >= bz->wp) {
+ zero_fill_bio(bio);
+ bio_endio(bio);
+ return 0;
+ }
+
+ read_end = bio_end_sector(bio);
+ data_end = min_t(sector_t, bz->wp, read_end);
+
+ if (read_end > data_end) {
+ sector_t split_sz = data_end - data_start;
+ struct bio *split;
+
+ if (data_end <= data_start) {
+ lzbd_fail_bio(bio, "internal error(read)");
+ return -EIO;
+ }
+
+ split = bio_split(bio, split_sz,
+ GFP_KERNEL, &lzbd_bio_set);
+
+ ret = lzbd_zone_read(lzbd, zone, split);
+ if (ret) {
+ lzbd_fail_bio(bio, "split read");
+ return -EIO;
+ }
+
+ zero_fill_bio(bio);
+ bio_endio(bio);
+
+ } else {
+ lzbd_zone_read(lzbd, zone, bio);
+ }
+
+ return 0;
+}
+
+static void lzbd_zone_reset_rq(struct lzbd *lzbd, struct request_queue *q,
+ struct bio *bio)
+{
+ sector_t sector = bio->bi_iter.bi_sector;
+ struct lzbd_zone *zone;
+
+ zone = lzbd_get_zone(lzbd, sector);
+
+ if (zone) {
+ struct blk_zone *bz = &zone->blk_zone;
+ int ret;
+
+ mutex_lock(&zone->lock);
+
+ ret = lzbd_zone_reset(lzbd, zone);
+ if (ret) {
+ bz->cond = BLK_ZONE_COND_OFFLINE;
+ lzbd_fail_bio(bio, "zone reset");
+ mutex_unlock(&zone->lock);
+ return;
+ }
+
+ bz->cond = BLK_ZONE_COND_EMPTY;
+ bz->wp = bz->start;
+
+ mutex_unlock(&zone->lock);
+
+ bio_endio(bio);
+ } else {
+ bio_io_error(bio);
+ }
+}
+
+static void lzbd_discard_rq(struct lzbd *lzbd, struct request_queue *q,
+ struct bio *bio)
+{
+ /* TODO: Implement discard */
+ bio_endio(bio);
+}
+
+static struct bio *lzbd_zplit(struct lzbd *lzbd, struct bio *bio,
+ struct lzbd_zone **first_zone)
+{
+ sector_t bio_start = bio->bi_iter.bi_sector;
+ sector_t bio_end, zone_end;
+ struct lzbd_zone *zone;
+ struct blk_zone *bz;
+ struct bio *zone_bio;
+
+ zone = lzbd_get_zone(lzbd, bio_start);
+ if (!zone)
+ return NULL;
+
+ bio_end = bio_end_sector(bio);
+ bz = &zone->blk_zone;
+ zone_end = bz->start + bz->len;
+
+ if (bio_end > zone_end) {
+ zone_bio = bio_split(bio, zone_end - bio_start,
+ GFP_KERNEL, &lzbd_bio_set);
+ } else {
+ zone_bio = bio;
+ }
+
+ *first_zone = zone;
+ return zone_bio;
+}
+
+blk_qc_t lzbd_make_rq(struct request_queue *q, struct bio *bio)
+{
+ struct lzbd *lzbd = q->queuedata;
+
+ if (bio->bi_opf & REQ_PREFLUSH) {
+ /* TODO: Implement syncs */
+ pr_err("lzbd: ignoring sync!\n");
+ }
+
+ if (bio_op(bio) == REQ_OP_READ || bio_op(bio) == REQ_OP_WRITE) {
+ struct bio *zplit;
+ struct lzbd_zone *zone;
+
+ if (!lzbd_get_bio_len(bio)) {
+ bio_endio(bio);
+ return BLK_QC_T_NONE;
+ }
+
+ do {
+ zplit = lzbd_zplit(lzbd, bio, &zone);
+ if (!zplit || !zone) {
+ lzbd_fail_bio(bio, "zone split");
+ return BLK_QC_T_NONE;
+ }
+
+ if (op_is_write(bio_op(bio))) {
+ if (lzbd_write_rq(lzbd, zone, zplit)) {
+ lzbd_fail_bio(zplit, "write");
+ if (zplit != bio)
+ lzbd_fail_bio(bio,
+ "write");
+
+ return BLK_QC_T_NONE;
+ }
+ } else {
+ if (lzbd_read_rq(lzbd, zone, zplit)) {
+ lzbd_fail_bio(zplit, "read");
+ if (zplit != bio)
+ lzbd_fail_bio(bio,
+ "read");
+ return BLK_QC_T_NONE;
+ }
+ }
+ } while (bio != zplit);
+
+ return BLK_QC_T_NONE;
+ }
+
+ switch (bio_op(bio)) {
+ case REQ_OP_DISCARD:
+ lzbd_discard_rq(lzbd, q, bio);
+ break;
+ case REQ_OP_ZONE_RESET:
+ lzbd_zone_reset_rq(lzbd, q, bio);
+ break;
+ default:
+ pr_err("lzbd: unsupported operation: %d", bio_op(bio));
+ bio_io_error(bio);
+ break;
+ }
+
+ return BLK_QC_T_NONE;
+}
+
+int lzbd_report_zones(struct gendisk *disk, sector_t sector,
+ struct blk_zone *zones, unsigned int *nr_zones,
+ gfp_t gfp_mask)
+{
+ struct lzbd *lzbd = disk->private_data;
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ unsigned int max_zones = *nr_zones;
+ unsigned int reported = 0;
+ struct lzbd_zone *zone;
+
+ sector_div(sector, dl->zone_size);
+
+ while ((zone = lzbd_get_zone(lzbd, sector))) {
+ struct blk_zone *bz = &zone->blk_zone;
+
+ if (reported >= max_zones)
+ break;
+
+ memcpy(&zones[reported], bz, sizeof(*bz));
+
+ sector = sector + dl->zone_size;
+ reported++;
+ }
+
+ *nr_zones = reported;
+
+ return 0;
+}
diff --git a/drivers/lightnvm/lzbd-zone.c b/drivers/lightnvm/lzbd-zone.c
new file mode 100644
index 000000000000..813f7b006ef1
--- /dev/null
+++ b/drivers/lightnvm/lzbd-zone.c
@@ -0,0 +1,444 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *
+ * Zoned block device lightnvm target
+ * Copyright (C) 2019 CNEX Labs
+ *
+ * Internal zone handling
+ */
+
+#include "lzbd.h"
+
+static struct lzbd_chunk *lzbd_get_chunk(struct lzbd *lzbd, int pref_pu)
+{
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ int parallel_units = geo->all_luns;
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ struct lzbd_chunks *chunks = &lzbd->chunks;
+ int i = pref_pu;
+ int retries = dl->zone_chunks - 1;
+
+ do {
+ struct lzbd_pu *pu = &chunks->pus[i];
+ struct list_head *chk_list = &pu->chk_list;
+
+ mutex_lock(&pu->lock);
+
+ if (!list_empty(&pu->chk_list)) {
+ struct lzbd_chunk *chunk;
+
+ chunk = list_first_entry(chk_list,
+ struct lzbd_chunk, list);
+ list_del(&chunk->list);
+ mutex_unlock(&pu->lock);
+ return chunk;
+ }
+ mutex_unlock(&pu->lock);
+
+ if (++i == parallel_units)
+ i = 0;
+
+ } while (retries--);
+
+ return NULL;
+}
+
+void lzbd_zone_free_wr_buffer(struct lzbd_zone *zone)
+{
+ kfree(zone->wr_align.buffer);
+ zone->wr_align.buffer = NULL;
+ zone->wr_align.secs = 0;
+}
+
+static void lzbd_zone_deallocate(struct lzbd *lzbd, struct lzbd_zone *zone)
+{
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ struct lzbd_chunks *chunks = &lzbd->chunks;
+ int i;
+
+ if (!zone->chunks)
+ return;
+
+ for (i = 0; i < dl->zone_chunks; i++) {
+ struct lzbd_chunk *chunk = zone->chunks[i];
+
+ if (chunk) {
+ struct lzbd_pu *pu = &chunks->pus[chunk->pu];
+
+ mutex_lock(&pu->lock);
+
+ /* TODO: implement proper wear leveling
+ * The wear indices do not get updated right now
+ * so just add the chunk at the bottom of the list
+ */
+ list_add_tail(&chunk->list, &pu->chk_list);
+ mutex_unlock(&pu->lock);
+ }
+ }
+
+ lzbd_zone_free_wr_buffer(zone);
+ kfree(zone->chunks);
+ zone->chunks = NULL;
+}
+
+int lzbd_zone_allocate(struct lzbd *lzbd, struct lzbd_zone *zone)
+{
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ int to_allocate = dl->zone_chunks;
+ int i;
+
+ zone->chunks = kmalloc_array(to_allocate,
+ sizeof(struct lzbd_chunk *),
+ GFP_KERNEL | __GFP_ZERO);
+
+ if (!zone->chunks)
+ return -ENOMEM;
+
+ zone->wr_align.secs = 0;
+
+ zone->wr_align.buffer = kzalloc(geo->ws_opt << LZBD_SECTOR_BITS,
+ GFP_KERNEL);
+ if (!zone->wr_align.buffer) {
+ kfree(zone->chunks);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < to_allocate; i++) {
+ struct lzbd_chunk *chunk = lzbd_get_chunk(lzbd, i);
+
+ if (!chunk) {
+ pr_err("failed to allocate zone!\n");
+ lzbd_zone_deallocate(lzbd, zone);
+ return -ENOSPC;
+ }
+
+ zone->chunks[i] = chunk;
+ }
+
+ return 0;
+}
+
+static int lzbd_zone_reset_chunks(struct lzbd *lzbd, struct lzbd_zone *zone)
+{
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ int i = 0;
+
+ /* TODO: Do parallel resetting and handle reset failures */
+ for (i = 0; i < dl->zone_chunks; i++) {
+ struct lzbd_chunk *chunk = zone->chunks[i];
+ int state = chunk->meta->state;
+ int ret;
+
+ if (state & (NVM_CHK_ST_CLOSED | NVM_CHK_ST_OPEN)) {
+ ret = lzbd_reset_chunk(lzbd, chunk);
+ if (ret) {
+ pr_err("lzbd: reset failed!\n");
+ return -EIO; /* Fail for now if reset fails */
+ }
+ }
+ }
+
+ return 0;
+}
+
+int lzbd_zone_reset(struct lzbd *lzbd, struct lzbd_zone *zone)
+{
+ int ret;
+
+ lzbd_zone_deallocate(lzbd, zone);
+ ret = lzbd_zone_allocate(lzbd, zone);
+ if (ret)
+ return ret;
+
+ ret = lzbd_zone_reset_chunks(lzbd, zone);
+
+ zone->wi = 0;
+ atomic_set(&zone->s_wp, 0);
+
+ return ret;
+}
+
+
+static void lzbd_add_to_align_buf(struct lzbd_wr_align *wr_align,
+ struct bio *bio, int secs)
+{
+ char *buffer = wr_align->buffer;
+
+ buffer += (wr_align->secs * LZBD_SECTOR_SIZE);
+
+ mutex_lock(&wr_align->lock);
+ while (secs--) {
+ char *data = bio_data(bio);
+
+ memcpy(buffer, data, LZBD_SECTOR_SIZE);
+ buffer += LZBD_SECTOR_SIZE;
+ wr_align->secs++;
+ bio_advance(bio, LZBD_SECTOR_SIZE);
+
+ }
+
+ mutex_unlock(&wr_align->lock);
+}
+
+static void lzbd_read_from_align_buf(struct lzbd_wr_align *wr_align,
+ struct bio *bio, int start, int secs)
+{
+ char *buffer = wr_align->buffer;
+
+ buffer += (start * LZBD_SECTOR_SIZE);
+
+ mutex_lock(&wr_align->lock);
+ while (secs--) {
+ char *data = bio_data(bio);
+
+ memcpy(data, buffer, LZBD_SECTOR_SIZE);
+ buffer += LZBD_SECTOR_SIZE;
+
+ bio_advance(bio, LZBD_SECTOR_SIZE);
+ }
+
+ mutex_unlock(&wr_align->lock);
+}
+
+int lzbd_zone_write(struct lzbd *lzbd, struct lzbd_zone *zone, struct bio *bio)
+{
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ struct lzbd_wr_align *wr_align = &zone->wr_align;
+ int sectors_left = lzbd_get_bio_len(bio);
+ int ret;
+
+ /* Unaligned write? */
+ if (wr_align->secs) {
+ int secs;
+
+ secs = min_t(int, geo->ws_opt - wr_align->secs, sectors_left);
+ lzbd_add_to_align_buf(wr_align, bio, secs);
+ sectors_left -= secs;
+
+ /* Time to flush the alignment buffer ? */
+ if (wr_align->secs == geo->ws_opt) {
+ struct bio *bio;
+
+ bio = bio_map_kern(dev->q, wr_align->buffer,
+ geo->ws_opt * LZBD_SECTOR_SIZE,
+ GFP_KERNEL);
+ if (!bio) {
+ pr_err("lzbd: failed to map align bio\n");
+ return -EIO;
+ }
+
+ ret = lzbd_write_to_chunk_user(lzbd,
+ zone->chunks[zone->wi], bio);
+
+ if (ret) {
+ pr_err("lzbd: alignment write failed\n");
+ return sectors_left;
+ }
+
+ wr_align->secs = 0;
+ zone->wi = (zone->wi + 1) % dl->zone_chunks;
+ atomic_add(geo->ws_opt, &zone->s_wp);
+ }
+ }
+
+ if (sectors_left == 0) {
+ bio_endio(bio);
+ return 0;
+ }
+
+ while (sectors_left > geo->ws_opt) {
+ struct bio *split;
+
+ split = bio_split(bio, geo->ws_opt << 3,
+ GFP_KERNEL, &lzbd_bio_set);
+
+ if (split == NULL) {
+ pr_err("lzbd: split failed!\n");
+ return sectors_left;
+ }
+
+ ret = lzbd_write_to_chunk_user(lzbd,
+ zone->chunks[zone->wi], split);
+
+ if (ret)
+ return sectors_left;
+
+ zone->wi = (zone->wi + 1) % dl->zone_chunks;
+ atomic_add(geo->ws_opt, &zone->s_wp);
+
+ sectors_left -= geo->ws_opt;
+ }
+
+ if (sectors_left == geo->ws_opt) {
+ ret = lzbd_write_to_chunk_user(lzbd,
+ zone->chunks[zone->wi], bio);
+ if (ret) {
+ pr_err("lzbd: last aligned write failed\n");
+ return sectors_left;
+ }
+
+ zone->wi = (zone->wi + 1) % dl->zone_chunks;
+ atomic_add(geo->ws_opt, &zone->s_wp);
+ sectors_left -= geo->ws_opt;
+ } else {
+ wr_align->secs = 0;
+ lzbd_add_to_align_buf(wr_align, bio, sectors_left);
+ bio_endio(bio);
+ sectors_left = 0;
+ }
+
+ return sectors_left;
+}
+
+void lzbd_user_read_put(struct kref *ref)
+{
+ struct lzbd_user_read *read;
+
+ read = container_of(ref, struct lzbd_user_read, ref);
+
+ if (unlikely(read->error))
+ bio_io_error(read->user_bio);
+ else
+ bio_endio(read->user_bio);
+
+ kfree(read);
+}
+
+
+static struct lzbd_user_read *lzbd_init_user_read(struct bio *bio)
+{
+ struct lzbd_user_read *rd;
+
+ rd = kmalloc(sizeof(struct lzbd_user_read), GFP_KERNEL);
+ if (!rd)
+ return NULL;
+
+ rd->user_bio = bio;
+ kref_init(&rd->ref);
+ rd->error = false;
+
+ return rd;
+}
+
+
+int lzbd_zone_read(struct lzbd *lzbd, struct lzbd_zone *zone, struct bio *bio)
+{
+ struct lzbd_disk_layout *dl = &lzbd->disk_layout;
+ struct nvm_tgt_dev *dev = lzbd->dev;
+ struct nvm_geo *geo = &dev->geo;
+ struct blk_zone *bz = &zone->blk_zone;
+ struct lzbd_chunk *read_chunk;
+ sector_t lba = lzbd_get_bio_lba(bio);
+ int to_read = lzbd_get_bio_len(bio);
+ struct lzbd_user_read *read;
+ int readsize;
+ int zsi, zso, csi, co;
+ int pu;
+ int ret;
+
+ read = lzbd_init_user_read(bio);
+ if (!read) {
+ pr_err("lzbd: failed to init read\n");
+ bio_io_error(bio);
+ return -EIO;
+ }
+
+ if (!zone->chunks) {
+ /* No data has been written to this zone */
+ zero_fill_bio(bio);
+ bio_endio(bio);
+ kfree(read);
+ return 0;
+ }
+
+ lba -= bz->start >> 3;
+
+ /* TODO: use sector_div instead */
+
+ /* Zone stripe index and offset */
+ zsi = lba / geo->ws_opt; /* zone stripe index */
+ zso = lba % geo->ws_opt; /* zone stripe offset */
+
+ pu = zsi % dl->zone_chunks;
+ read_chunk = zone->chunks[pu];
+
+ /* Chunk stripe index and chunk offset */
+ csi = lba / (dl->zone_chunks * geo->ws_opt);
+ co = csi * geo->ws_opt + zso;
+
+ readsize = min_t(int, geo->ws_opt - zso, to_read);
+
+ while (to_read > 0) {
+ struct bio *rbio = bio;
+ int s_wp = atomic_read(&zone->s_wp);
+
+ if (lba >= s_wp) {
+ /* Grab the write lock to prevent races
+ * with writes
+ */
+ mutex_lock(&zone->lock);
+ if (lba >= atomic_read(&zone->s_wp)) {
+ lzbd_read_from_align_buf(&zone->wr_align, bio,
+ zso, to_read);
+ mutex_unlock(&zone->lock);
+ ret = 0;
+ goto done;
+ }
+ mutex_unlock(&zone->lock);
+ }
+
+ if ((zso + to_read) > geo->ws_opt) {
+
+ rbio = bio_split(bio, readsize << 3, GFP_KERNEL,
+ &lzbd_bio_set);
+
+ if (!rbio) {
+ read->error = true;
+ ret = -EIO;
+ goto done;
+ }
+
+ }
+
+ if (lba + to_read >= s_wp)
+ readsize = s_wp - lba;
+
+ kref_get(&read->ref);
+ ret = lzbd_read_from_chunk_user(lzbd, zone->chunks[pu],
+ rbio, read, co);
+ if (ret) {
+ pr_err("lzbd: user disk read failed!\n");
+ read->error = true;
+ kref_put(&read->ref, lzbd_user_read_put);
+ ret = -EIO;
+ goto done;
+ }
+
+ lba += readsize;
+
+ if (zso) {
+ co -= zso;
+ zso = 0;
+ }
+
+ if (++pu == dl->zone_chunks) {
+ pu = 0;
+ co += geo->ws_opt;
+ }
+
+ to_read -= readsize;
+ readsize = min_t(int, geo->ws_opt, to_read);
+ read_chunk = zone->chunks[pu];
+ }
+
+ ret = 0;
+done:
+ kref_put(&read->ref, lzbd_user_read_put);
+ return ret;
+}
+
diff --git a/drivers/lightnvm/lzbd.h b/drivers/lightnvm/lzbd.h
new file mode 100644
index 000000000000..97cca99a49bf
--- /dev/null
+++ b/drivers/lightnvm/lzbd.h
@@ -0,0 +1,139 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ *
+ * Zoned block device lightnvm target
+ * Copyright (C) 2019 CNEX Labs
+ *
+ */
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/bio.h>
+#include <linux/lightnvm.h>
+
+#define LZBD_SECTOR_BITS (12) /* 4096 */
+#define LZBD_SECTOR_SIZE (4096UL)
+
+/* sector unit to lzbd sector shift*/
+#define LZBD_SECTOR_SHIFT (3)
+
+extern struct bio_set lzbd_bio_set;
+
+
+/* Get length, in lzbd sectors, of bio */
+static inline sector_t lzbd_get_bio_len(struct bio *bio)
+{
+ return bio->bi_iter.bi_size >> LZBD_SECTOR_BITS;
+}
+
+/* Get bio start lba in lzbd sectors */
+static inline sector_t lzbd_get_bio_lba(struct bio *bio)
+{
+ return bio->bi_iter.bi_sector >> LZBD_SECTOR_SHIFT;
+}
+
+struct lzbd_wr_ctx {
+ struct lzbd *lzbd;
+ struct mutex wr_lock; /* Max one outstanding write */
+
+ void *private;
+ /* bio completion list goes here, along with lock*/
+};
+
+struct lzbd_user_read {
+ struct bio *user_bio;
+ struct kref ref;
+ bool error;
+};
+
+struct lzbd_rd_ctx {
+ struct lzbd *lzbd;
+ struct lzbd_user_read *read;
+ struct nvm_rq rqd;
+};
+
+struct lzbd_chunk {
+ struct nvm_chk_meta *meta; /* Metadata for the chunk */
+ struct ppa_addr ppa; /* Start ppa */
+ int pu; /* Parallel unit */
+
+ struct lzbd_wr_ctx wr_ctx;
+ struct list_head list; /* A chunk is offline or
+ * part of a PU free list or
+ * part of a zone chunk list or
+ * part of a metadata list
+ */
+
+ /* a cuinits buffer should go here */
+};
+
+struct lzbd_pu {
+ struct list_head chk_list; /* One list per parallel unit */
+ struct mutex lock; /* Protecting list */
+ int offline_chks;
+};
+
+struct lzbd_chunks {
+ struct lzbd_pu *pus; /* Chunks organized per parallel unit*/
+ struct nvm_chk_meta *meta; /* Metadata for all chunks */
+};
+
+struct lzbd_wr_align {
+ void *buffer; /* Buffer data */
+ int secs; /* Number of 4k secs in buffer */
+ struct mutex lock;
+};
+
+struct lzbd_zone {
+ struct blk_zone blk_zone;
+ struct lzbd_chunk **chunks;
+
+ int wi; /* Write chunk index */
+ atomic_t s_wp; /* Sync write pointer */
+
+ struct lzbd_wr_align wr_align; /* Write alignment buffer */
+
+ struct mutex lock; /* Write lock */
+};
+
+struct lzbd_disk_layout {
+ int op; /* Over provision ratio */
+ int meta_chunks; /* Metadata chunks */
+
+ int zones; /* Number of zones */
+ int zone_chunks; /* Zone per chunk */
+ sector_t zone_size; /* Number of 512b sectors per zone */
+
+ sector_t capacity; /* Disk capacity in 512b sectors */
+};
+
+struct lzbd {
+ struct nvm_tgt_dev *dev;
+ struct gendisk *disk;
+
+ struct lzbd_zone *zones;
+
+ struct lzbd_chunks chunks;
+ struct lzbd_disk_layout disk_layout;
+};
+
+blk_qc_t lzbd_make_rq(struct request_queue *q, struct bio *bio);
+
+int lzbd_report_zones(struct gendisk *disk, sector_t sector,
+ struct blk_zone *zones, unsigned int *nr_zones,
+ gfp_t gfp_mask);
+
+int lzbd_reset_chunk(struct lzbd *lzbd, struct lzbd_chunk *chunk);
+int lzbd_write_to_chunk_sync(struct lzbd *lzbd, struct lzbd_chunk *chunk,
+ struct bio *bio);
+int lzbd_write_to_chunk_user(struct lzbd *lzbd, struct lzbd_chunk *chunk,
+ struct bio *user_bio);
+int lzbd_read_from_chunk_user(struct lzbd *lzbd, struct lzbd_chunk *chunk,
+ struct bio *bio, struct lzbd_user_read *user_read,
+ int start);
+int lzbd_zone_reset(struct lzbd *lzbd, struct lzbd_zone *zone);
+int lzbd_zone_write(struct lzbd *lzbd, struct lzbd_zone *zone, struct bio *bio);
+int lzbd_zone_read(struct lzbd *lzbd, struct lzbd_zone *zone, struct bio *bio);
+void lzbd_zone_free_wr_buffer(struct lzbd_zone *zone);
+void lzbd_user_read_put(struct kref *ref);
+
--
2.7.4
Hi,
On 4/18/19 5:01 AM, [email protected] wrote:
> diff --git a/Documentation/lightnvm/lzbd.txt b/Documentation/lightnvm/lzbd.txt
> new file mode 100644
> index 000000000000..8bdbc01a25be
> --- /dev/null
> +++ b/Documentation/lightnvm/lzbd.txt
> @@ -0,0 +1,122 @@
> +lzbd: A Zoned Block Device LightNVM Target
> +==========================================
> +
> +The lzbd lightnvm target makes it possible to expose an Open-Channel 2.0 SSD
> +as one or more zoned block devices.
> +
> +Each lightnvm target is assigned a range of parallel units. Parallel units(PUs)
> +are not shared among targets avoiding I/O QoS disturbances between targets as
(prefer:) targets,
> +far as possible.
> +
> +For more information on lightnvm, see [1]
end with period above, as the 2 below are done.
> +For more information on Open-Channel 2.0, see [2].
> +For more information on zoned block devices see [3].
> +
> +lzbd is designed to act as a slim adaptor, making it possible to plug
> +OCSSD 2.0 SSDs into the zone block device ecosystem.
> +
> +lzbd manages zone to chunk mapping, read/write restrictions, wear leveling
> +and write errors.
> +
> +Zone geometry
> +-------------
> +
> +From a user perspective, lzbd targets form a number of sequential-write-required
> +(BLK_ZONE_TYPE_SEQWRITE_REQ) zones.
> +
> +Not all of the target's capacity is exposed to the user.
> +Some chunks are reserved for metadata and over-provisioning.
> +
> +The zones follow the same constraints as described in [3].
> +
> +All zones are of the same size (SZ).
> +
> +Simple example:
> +
> +Sector Zone type
> + _______________________
> +0 --> | Sequential write req. |
> + | |
> + |_______________________|
> +SZ --> | Sequential write req. |
> + | |
> + |_______________________|
> +SZ*2..--> | Sequential write req. |
> + | |
> +.......... .........................
> + |_______________________|
> +SZ*N-1 --> | Sequential write req. |
> + |_______________________|
> +
> +
> +SZ is configurable, but is restricted to a multiple of
> +(chunk size (CLBA) * Number of PUs).
> +
> +Zone to chunk mapping
> +---------------------
> +
> +Zones are spread across PUs to allow maximum write throughput through striping.
> +One or more chunks (CHK) per PU is assigned.
> +
> +Example:
> +
> +OCSSD 2.0 Geometry: 4 PUs, 16 chunks per PU.
> +Zones: 3
> +
> + Zone PU0 PU1 PU2 PU3
> +_______ _____ _____ _____ _____
> + |CHK 0|CHK 0|CHK A|CHK 0|
> + 0 |CHK 2|CHK 3|CHK 3|CHK 1|
> +_______ |_____|_____|_____|_____|
> + |CHK 3|CHK B|CHK 8|CHK A|
> + 1 |CHK 7|CHK F|CHK 2|CHK 3|
> +_______ |_____|_____|_____|_____|
> + |CHK 8|CHK 2|CHK 7|CHK 4|
> + 2 |CHK 1|CHK A|CHK 5|CHK 2|
> +_______ |_____|_____|_____|_____|
> +
> +Chunks are assigned to a zone when it is opened based on the chunk wear index.
> +
> +Note: The disk's Maximum Open Chunks (MAXOC) limit puts an upper bound on
> +maximum simultaneously open zones (unless MAXOC = 0).
> +
> +Meta data and over-provisioning
> +-------------------------------
My dictionary searches all use metadata as one word, not two.
> +
> +lzbd needs the following meta data to be persisted:
> +
> +* a zone-to chunk mapping (Z2C) table, size: 4 bytes * Number of chunks
> +* a superblock containing target configuration, guuid, on-disk format version,
what is guuid, please?
> + etc.
> +
> +Additionally, chunks need to be reserved for handling:
> +
> +* write errors
> +* chunks wearing out and going offline
> +* persisting data not aligned with the minimal write constraint
> +
> +The meta data is stored a separate set of chunks from the user data.
> +
> +Host memory requirements
> +------------------------
> +
> +The Z2C mapping table needs to be kept in host memory (see above), and:
> +
> +* in order to achieve maximum throughput and alignment requirements,
> + a small write buffer is needed
> + Size: Optimal Write Size (WS_OPT) * Maximum number of open zones.
> +
> +* to satisify OCSSD 2.0 read restrictions, a read buffer is needed.
satisfy
> + Size: Number of PUs * Cache Minimum Write Size Units (MW_CUNITS) *
> + Maximum number of open zones.
> +
> +If MW_CUNITS = 0, no read buffer is needed and data can be written without
> +any host copying/buffering (except for handling WS_OPT alignment).
> +
> +References
> +----------
> +
> +[1] Lightnvm website: http://lightnvm.io/
> +[2] OCSSD 2.0 Specification: http://lightnvm.io/docs/OCSSD-2_0-20180129.pdf
> +[3] ZBC / Zoned block device support: https://lwn.net/Articles/703871/
> +
thanks.
--
~Randy