2014-10-08 15:57:08

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH 0/5] Support for Open-Channel SSDs (was dm-lightnvm)

Hi,

Here is an update on the common layer for Open-Channel SSDs (LightNVM). A
previous patch was posted here:

http://www.redhat.com/archives/dm-devel/2014-March/msg00115.html

Thanks for all the constructive feedback.

Architectural changes
---------------------
* Moved LightNVM to sit between device drivers and the blk-mq layer. Currently,
drivers hook into LightNVM; it will be integrated directly into the block layer
later. Why the block layer? Because LightNVM is tightly coupled with blk-mq,
uses its per-request private storage, and benefits from its scalability.
Furthermore, read/write commands are piggy-backed with additional information,
such as flash block health, translation table metadata, etc. (see the driver
hook-up sketch after this list).

* A device has a number of physical blocks. These can now be exposed through a
number of targets. A target can be a typical block device, but also a
specialization, such as a key-value store, object-based storage, and so forth.
This allows file systems and databases to write directly to physical blocks
without multiple translation layers in between.

* Allow experimentation through QEMU. LightNVM is now initialized when the
hardware identifies itself as a LightNVM-compatible device.

* The development has been moved to https://github.com/OpenChannelSSD
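
To make the first point above concrete, the sketch below shows how a blk-mq
driver is expected to reserve per-request space for LightNVM and route requests
through it, using the nvm_cmd_size(), nvm_queue_rq() and nvm_end_io() helpers
introduced in patch 3 (drv_cmd_size is the field consumed by get_per_rq_data()
in nvm.h). The my_drv_* names and the surrounding blk-mq boilerplate are
illustrative only and not part of this series:

#include <linux/blk-mq.h>
#include <linux/lightnvm.h>

/* Illustrative driver-private per-request data, stored before LightNVM's. */
struct my_drv_cmd {
        int tag;
};

struct my_drv {
        struct nvm_dev *nvm_dev;
        struct blk_mq_tag_set tag_set;
};

static void my_drv_setup_tag_set(struct my_drv *drv)
{
        /* Reserve room for both the driver's and LightNVM's per-request data */
        drv->nvm_dev->drv_cmd_size = sizeof(struct my_drv_cmd);
        drv->tag_set.cmd_size = sizeof(struct my_drv_cmd) + nvm_cmd_size();
}

/* ->queue_rq: let LightNVM map the logical address before issuing. */
static int my_drv_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
        struct my_drv *drv = hctx->queue->queuedata;
        int ret;

        ret = nvm_queue_rq(drv->nvm_dev, rq);
        if (ret != BLK_MQ_RQ_QUEUE_OK)
                return ret;

        /* ... translate rq into a device command and issue it ... */
        return BLK_MQ_RQ_QUEUE_OK;
}

/* Completion: LightNVM updates its maps/GC state, then the request is ended. */
static void my_drv_complete_rq(struct my_drv *drv, struct request *rq, int err)
{
        nvm_end_io(drv->nvm_dev, rq, err);
}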

Background
----------
Open-channel SSDs are devices that expose direct access to their physical
flash storage, while keeping a subset of the internal features of SSDs.

A common SSD consists of a flash translation layer (FTL), bad block
management, and hardware units such as the flash controller, the host
interface controller and a large number of flash chips.

LightNVM moves part of the FTL responsibility into the host, allowing the
host to manage data placement, garbage collection and parallelism. The device
continues to maintain bad block information and implements a simpler FTL that
allows extensions such as atomic IOs, metadata persistence and similar to be
implemented.
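
To illustrate what host-managed data placement means in practice, here is the
read path, condensed from nvm_read_rq() in patch 3: the target looks up the
logical sector in its translation map and rewrites the request to the current
physical placement before it reaches the device.

/* Condensed from nvm_read_rq() in drivers/lightnvm/core.c (patch 3). */
static int read_path_sketch(struct nvm_stor *s, struct request *rq)
{
        sector_t l_addr = blk_rq_pos(rq) / NR_PHY_IN_LOG;
        struct nvm_addr *p;

        nvm_lock_laddr_range(s, l_addr, 1);

        p = s->type->lookup_ltop(s, l_addr);    /* logical -> physical */
        if (!p) {
                /* no mapping available; kick GC and ask blk-mq to retry */
                nvm_unlock_laddr_range(s, l_addr, 1);
                nvm_gc_kick(s);
                return BLK_MQ_RQ_QUEUE_BUSY;
        }

        rq->__sector = p->addr * NR_PHY_IN_LOG +
                        (blk_rq_pos(rq) % NR_PHY_IN_LOG);

        nvm_setup_rq(s, rq, p, l_addr, NVM_RQ_NONE);
        return BLK_MQ_RQ_QUEUE_OK;
}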

The architecture of LightNVM consists of a core and multiple targets. The core
implements functionality shared across targets, such as initialization, teardown
and statistics. Targets define how physical flash is exposed to user-land. This
can be as a block device, key-value store, object store, or anything else.
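
As an example of a non-block target, the key-value target added in this series
is driven from user-space through an ioctl. Below is a rough user-space sketch;
the exact field types of struct lightnvm_cmd_kv live in the new
include/uapi/linux/lightnvm.h, and which device node accepts LIGHTNVM_IOCTL_KV
depends on the driver integration (null_blk/NVMe), so treat the snippet as
illustrative only:

#include <string.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/lightnvm.h>     /* uapi header added by this series */

/* Store a value under a key through the LightNVM KV target (sketch). */
static int kv_put(int fd, const char *key, const void *val, uint32_t val_len)
{
        struct lightnvm_cmd_kv cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode   = LIGHTNVM_KV_PUT;
        cmd.key_addr = (uint64_t)(uintptr_t)key;
        cmd.key_len  = strlen(key);
        cmd.val_addr = (uint64_t)(uintptr_t)val;
        cmd.val_len  = val_len;

        return ioctl(fd, LIGHTNVM_IOCTL_KV, &cmd);
}

LIGHTNVM_KV_GET, LIGHTNVM_KV_UPDATE and LIGHTNVM_KV_DEL follow the same
pattern; a lookup miss is reported through cmd.errcode as
LIGHTNVM_KV_ERR_NOKEY.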

LightNVM is currently hooked up through the null_blk and NVMe drivers. The NVMe
extension allows development using the LightNVM-extended QEMU implementation,
based on Keith Busch's qemu-nvme branch.

Try it out
----------

To try LightNVM, a device is required to register as an open-channel SSD.

Currently, two implementations exist: the null_blk and NVMe drivers. The
null_blk driver is for performance testing, while the NVMe driver can be
initialized using a patched version of Keith Busch's QEMU NVMe simulator, or
on real hardware if available.

The QEMU branch is available at:

https://github.com/OpenChannelSSD/qemu-nvme

Follow the guide at

https://github.com/OpenChannelSSD/linux/wiki

Available Hardware
------------------

A couple of open platforms are currently being ported to utilize LightNVM:

IIT Madras (https://bitbucket.org/casl/ssd-controller)
An open-source implementation of an NVMe controller in Bluespec. It can run on
Xilinx FPGAs, such as the Artix 7, Kintex 7 and Virtex 7.

OpenSSD Jasmine (http://www.openssd-project.org/)
An open-firmware SSD that allows the user to implement their own FTL within
the controller.

An experimental patch for the firmware can be found in the lightnvm branch:
https://github.com/ClydeProjects/OpenSSD/

Todo: Requires bad block management and storing of host FTL metadata to be
useful.

OpenSSD Cosmos (http://www.openssd-project.org/wiki/Cosmos_OpenSSD_Platform)
A complete development board with FPGA, ARM Cortex A9 and FPGA-accelerated
host access.

Draft Specification
-------------------

We are currently creating a draft specification as more and more of the
host/device interface stabilizes. Please see the Google document below; it is
open for comments.

http://goo.gl/BYTjLI

In the making
-------------

* The QEMU implementation doesn't yet support loading of translation tables,
so the logical-to-physical sector mapping is lost on reboot.
* Bad block management. This is kept on the device side; however, the host
still requires bad block information to prevent writing to dead flash blocks.
* Space-efficient algorithms for translation tables.

Matias Bjørling (5):
NVMe: Convert to blk-mq
block: extend rq_flag_bits
lightnvm: Support for Open-Channel SSDs
lightnvm: NVMe integration
lightnvm: null_blk integration

Documentation/block/null_blk.txt | 9 +
drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/block/null_blk.c | 149 +++-
drivers/block/nvme-core.c | 1469 +++++++++++++++++++-------------------
drivers/block/nvme-scsi.c | 8 +-
drivers/lightnvm/Kconfig | 20 +
drivers/lightnvm/Makefile | 5 +
drivers/lightnvm/core.c | 212 ++++++
drivers/lightnvm/gc.c | 233 ++++++
drivers/lightnvm/kv.c | 513 +++++++++++++
drivers/lightnvm/nvm.c | 540 ++++++++++++++
drivers/lightnvm/nvm.h | 632 ++++++++++++++++
drivers/lightnvm/sysfs.c | 79 ++
drivers/lightnvm/targets.c | 246 +++++++
include/linux/blk_types.h | 4 +
include/linux/lightnvm.h | 130 ++++
include/linux/nvme.h | 19 +-
include/uapi/linux/lightnvm.h | 45 ++
include/uapi/linux/nvme.h | 57 ++
20 files changed, 3603 insertions(+), 770 deletions(-)
create mode 100644 drivers/lightnvm/Kconfig
create mode 100644 drivers/lightnvm/Makefile
create mode 100644 drivers/lightnvm/core.c
create mode 100644 drivers/lightnvm/gc.c
create mode 100644 drivers/lightnvm/kv.c
create mode 100644 drivers/lightnvm/nvm.c
create mode 100644 drivers/lightnvm/nvm.h
create mode 100644 drivers/lightnvm/sysfs.c
create mode 100644 drivers/lightnvm/targets.c
create mode 100644 include/linux/lightnvm.h
create mode 100644 include/uapi/linux/lightnvm.h

--
1.9.1


2014-10-08 15:57:14

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH 3/5] lightnvm: Support for Open-Channel SSDs

Open-channel SSDs are devices that expose direct access to their physical
flash storage, while keeping a subset of the internal features of SSDs.

A common SSD consists of a flash translation layer (FTL), bad block
management, and hardware units such as the flash controller, the host
interface controller and a large number of flash chips.

LightNVM moves part of the FTL responsibility into the host, allowing the
host to manage data placement, garbage collection and parallelism. The device
continues to maintain bad block information and implements a simpler FTL that
allows extensions such as atomic IOs, metadata persistence and similar to be
implemented.

The architecture of LightNVM consists of a core and multiple targets. The core
implements the part of the driver that is shared across targets, such as
initialization, teardown and statistics. Targets define how physical flash is
exposed to user-land. This can be as a block device, key-value store, object
store, or anything else.

LightNVM is currently hooked up through the null_blk and NVMe drivers. The NVMe
extension allows development using the LightNVM-extended QEMU implementation,
based on Keith Busch's qemu-nvme branch.

Contributions in this patch from:

Jesper Madsen <[email protected]>

Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/lightnvm/Kconfig | 20 ++
drivers/lightnvm/Makefile | 5 +
drivers/lightnvm/core.c | 212 ++++++++++++++
drivers/lightnvm/gc.c | 233 ++++++++++++++++
drivers/lightnvm/kv.c | 513 ++++++++++++++++++++++++++++++++++
drivers/lightnvm/nvm.c | 540 ++++++++++++++++++++++++++++++++++++
drivers/lightnvm/nvm.h | 632 ++++++++++++++++++++++++++++++++++++++++++
drivers/lightnvm/sysfs.c | 79 ++++++
drivers/lightnvm/targets.c | 246 ++++++++++++++++
include/linux/lightnvm.h | 130 +++++++++
include/uapi/linux/lightnvm.h | 45 +++
13 files changed, 2658 insertions(+)
create mode 100644 drivers/lightnvm/Kconfig
create mode 100644 drivers/lightnvm/Makefile
create mode 100644 drivers/lightnvm/core.c
create mode 100644 drivers/lightnvm/gc.c
create mode 100644 drivers/lightnvm/kv.c
create mode 100644 drivers/lightnvm/nvm.c
create mode 100644 drivers/lightnvm/nvm.h
create mode 100644 drivers/lightnvm/sysfs.c
create mode 100644 drivers/lightnvm/targets.c
create mode 100644 include/linux/lightnvm.h
create mode 100644 include/uapi/linux/lightnvm.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index 622fa26..24815f8 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -38,6 +38,8 @@ source "drivers/message/i2o/Kconfig"

source "drivers/macintosh/Kconfig"

+source "drivers/lightnvm/Kconfig"
+
source "drivers/net/Kconfig"

source "drivers/isdn/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index ebee555..278c31e 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_MTD) += mtd/
obj-$(CONFIG_SPI) += spi/
obj-$(CONFIG_SPMI) += spmi/
obj-y += hsi/
+obj-$(CONFIG_LIGHTNVM) += lightnvm/
obj-y += net/
obj-$(CONFIG_ATM) += atm/
obj-$(CONFIG_FUSION) += message/
diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
new file mode 100644
index 0000000..3ee597a
--- /dev/null
+++ b/drivers/lightnvm/Kconfig
@@ -0,0 +1,20 @@
+#
+# LightNVM configuration
+#
+
+menuconfig LIGHTNVM
+ bool "LightNVM support"
+ depends on BLK_DEV
+ default y
+ ---help---
+ Say Y here to enable Open-channel SSDs compatible with LightNVM to be
+ recognized.
+
+ LightNVM implements some internals of SSDs within the host.
+ Devices are required to support LightNVM, which allows them to be managed by
+ the host. LightNVM is used together with an open-channel firmware that
+ exposes direct access to the underlying non-volatile memory.
+
+ If you say N, all options in this submenu will be skipped and disabled;
+ only do this if you know what you are doing.
+
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
new file mode 100644
index 0000000..ad31d9b
--- /dev/null
+++ b/drivers/lightnvm/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for LightNVM.
+#
+
+obj-$(CONFIG_LIGHTNVM) += nvm.o core.o gc.o sysfs.o kv.o targets.o
diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
new file mode 100644
index 0000000..ec9851a
--- /dev/null
+++ b/drivers/lightnvm/core.c
@@ -0,0 +1,212 @@
+#include <linux/lightnvm.h>
+#include <trace/events/block.h>
+#include "nvm.h"
+
+static void invalidate_block_page(struct nvm_stor *s, struct nvm_addr *p)
+{
+ struct nvm_block *block = p->block;
+ unsigned int page_offset;
+
+ NVM_ASSERT(spin_is_locked(&s->rev_lock));
+
+ spin_lock(&block->lock);
+
+ page_offset = p->addr % s->nr_pages_per_blk;
+ WARN_ON(test_and_set_bit(page_offset, block->invalid_pages));
+ block->nr_invalid_pages++;
+
+ spin_unlock(&block->lock);
+}
+
+void nvm_update_map(struct nvm_stor *s, sector_t l_addr, struct nvm_addr *p,
+ int is_gc)
+{
+ struct nvm_addr *gp;
+ struct nvm_rev_addr *rev;
+
+ BUG_ON(l_addr >= s->nr_pages);
+ BUG_ON(p->addr >= s->nr_pages);
+
+ gp = &s->trans_map[l_addr];
+ spin_lock(&s->rev_lock);
+ if (gp->block) {
+ invalidate_block_page(s, gp);
+ s->rev_trans_map[gp->addr].addr = LTOP_POISON;
+ }
+
+ gp->addr = p->addr;
+ gp->block = p->block;
+
+ rev = &s->rev_trans_map[p->addr];
+ rev->addr = l_addr;
+ spin_unlock(&s->rev_lock);
+}
+
+/* requires pool->lock lock */
+void nvm_reset_block(struct nvm_block *block)
+{
+ struct nvm_stor *s = block->pool->s;
+
+ spin_lock(&block->lock);
+ bitmap_zero(block->invalid_pages, s->nr_pages_per_blk);
+ block->ap = NULL;
+ block->next_page = 0;
+ block->nr_invalid_pages = 0;
+ atomic_set(&block->gc_running, 0);
+ atomic_set(&block->data_size, 0);
+ atomic_set(&block->data_cmnt_size, 0);
+ spin_unlock(&block->lock);
+}
+
+sector_t nvm_alloc_phys_addr(struct nvm_block *block)
+{
+ sector_t addr = LTOP_EMPTY;
+
+ spin_lock(&block->lock);
+
+ if (block_is_full(block))
+ goto out;
+
+ addr = block_to_addr(block) + block->next_page;
+
+ block->next_page++;
+
+out:
+ spin_unlock(&block->lock);
+ return addr;
+}
+
+/* requires ap->lock taken */
+void nvm_set_ap_cur(struct nvm_ap *ap, struct nvm_block *block)
+{
+ BUG_ON(!block);
+
+ if (ap->cur) {
+ spin_lock(&ap->cur->lock);
+ WARN_ON(!block_is_full(ap->cur));
+ spin_unlock(&ap->cur->lock);
+ ap->cur->ap = NULL;
+ }
+ ap->cur = block;
+ ap->cur->ap = ap;
+}
+
+/* Send erase command to device */
+int nvm_erase_block(struct nvm_stor *s, struct nvm_block *block)
+{
+ struct nvm_dev *dev = s->dev;
+
+ if (dev->ops->nvm_erase_block)
+ return dev->ops->nvm_erase_block(dev, block->id);
+
+ return 0;
+}
+
+void nvm_endio(struct nvm_dev *nvm_dev, struct request *rq, int err)
+{
+ struct nvm_stor *s = nvm_dev->stor;
+ struct per_rq_data *pb = get_per_rq_data(nvm_dev, rq);
+ struct nvm_addr *p = pb->addr;
+ struct nvm_block *block = p->block;
+ unsigned int data_cnt;
+
+ /* pr_debug("p: %p s: %llu l: %u pp:%p e:%u (%u)\n",
+ p, p->addr, pb->l_addr, p, err, rq_data_dir(rq)); */
+ nvm_unlock_laddr_range(s, pb->l_addr, 1);
+
+ if (rq_data_dir(rq) == WRITE) {
+ /* maintain data in buffer until block is full */
+ data_cnt = atomic_inc_return(&block->data_cmnt_size);
+ if (data_cnt == s->nr_pages_per_blk) {
+ /*defer scheduling of the block for recycling*/
+ queue_work(s->kgc_wq, &block->ws_eio);
+ }
+ }
+
+ /* all submitted requests allocate their own addr,
+ * except GC reads */
+ if (pb->flags & NVM_RQ_GC)
+ return;
+
+ mempool_free(pb->addr, s->addr_pool);
+}
+
+/* remember to lock l_addr before calling nvm_submit_rq */
+void nvm_setup_rq(struct nvm_stor *s, struct request *rq, struct nvm_addr *p,
+ sector_t l_addr, unsigned int flags)
+{
+ struct nvm_block *block = p->block;
+ struct nvm_ap *ap;
+ struct per_rq_data *pb;
+
+ if (block)
+ ap = block_to_ap(s, block);
+ else
+ ap = &s->aps[0];
+
+ pb = get_per_rq_data(s->dev, rq);
+ pb->ap = ap;
+ pb->addr = p;
+ pb->l_addr = l_addr;
+ pb->flags = flags;
+}
+
+int nvm_read_rq(struct nvm_stor *s, struct request *rq)
+{
+ struct nvm_addr *p;
+ sector_t l_addr;
+
+ l_addr = blk_rq_pos(rq) / NR_PHY_IN_LOG;
+
+ nvm_lock_laddr_range(s, l_addr, 1);
+
+ p = s->type->lookup_ltop(s, l_addr);
+ if (!p) {
+ nvm_unlock_laddr_range(s, l_addr, 1);
+ nvm_gc_kick(s);
+ return BLK_MQ_RQ_QUEUE_BUSY;
+ }
+
+ rq->__sector = p->addr * NR_PHY_IN_LOG +
+ (blk_rq_pos(rq) % NR_PHY_IN_LOG);
+
+ if (!p->block)
+ rq->__sector = 0;
+
+ nvm_setup_rq(s, rq, p, l_addr, NVM_RQ_NONE);
+ return BLK_MQ_RQ_QUEUE_OK;
+}
+
+
+int __nvm_write_rq(struct nvm_stor *s, struct request *rq, int is_gc)
+{
+ struct nvm_addr *p;
+ sector_t l_addr = blk_rq_pos(rq) / NR_PHY_IN_LOG;
+
+ nvm_lock_laddr_range(s, l_addr, 1);
+ p = s->type->map_page(s, l_addr, is_gc);
+ if (!p) {
+ BUG_ON(is_gc);
+ nvm_unlock_laddr_range(s, l_addr, 1);
+ nvm_gc_kick(s);
+
+ return BLK_MQ_RQ_QUEUE_BUSY;
+ }
+
+ /*
+ * MB: Should be revised. We need a different hook into device
+ * driver
+ */
+ rq->__sector = p->addr * NR_PHY_IN_LOG;
+ /*printk("nvm: W %llu(%llu) B: %u\n", p->addr, p->addr * NR_PHY_IN_LOG,
+ p->block->id);*/
+
+ nvm_setup_rq(s, rq, p, l_addr, NVM_RQ_NONE);
+
+ return BLK_MQ_RQ_QUEUE_OK;
+}
+
+int nvm_write_rq(struct nvm_stor *s, struct request *rq)
+{
+ return __nvm_write_rq(s, rq, 0);
+}
diff --git a/drivers/lightnvm/gc.c b/drivers/lightnvm/gc.c
new file mode 100644
index 0000000..3735979
--- /dev/null
+++ b/drivers/lightnvm/gc.c
@@ -0,0 +1,233 @@
+#include <linux/lightnvm.h>
+#include "nvm.h"
+
+/* Run only GC if less than 1/X blocks are free */
+#define GC_LIMIT_INVERSE 10
+
+static void queue_pool_gc(struct nvm_pool *pool)
+{
+ struct nvm_stor *s = pool->s;
+
+ queue_work(s->krqd_wq, &pool->gc_ws);
+}
+
+void nvm_gc_cb(unsigned long data)
+{
+ struct nvm_stor *s = (struct nvm_stor *)data;
+ struct nvm_pool *pool;
+ int i;
+
+ nvm_for_each_pool(s, pool, i)
+ queue_pool_gc(pool);
+
+ mod_timer(&s->gc_timer,
+ jiffies + msecs_to_jiffies(s->config.gc_time));
+}
+
+/* The block with the highest number of invalid pages will be at the beginning
+ * of the list */
+static struct nvm_block *block_max_invalid(struct nvm_block *a,
+ struct nvm_block *b)
+{
+ BUG_ON(!a || !b);
+
+ if (a->nr_invalid_pages == b->nr_invalid_pages)
+ return a;
+
+ return (a->nr_invalid_pages < b->nr_invalid_pages) ? b : a;
+}
+
+/* linearly find the block with highest number of invalid pages
+ * requires pool->lock */
+static struct nvm_block *block_prio_find_max(struct nvm_pool *pool)
+{
+ struct list_head *list = &pool->prio_list;
+ struct nvm_block *block, *max;
+
+ BUG_ON(list_empty(list));
+
+ max = list_first_entry(list, struct nvm_block, prio);
+ list_for_each_entry(block, list, prio)
+ max = block_max_invalid(max, block);
+
+ return max;
+}
+
+/* Move data away from flash block to be erased. Additionally update the
+ * l to p and p to l mappings. */
+static void nvm_move_valid_pages(struct nvm_stor *s, struct nvm_block *block)
+{
+ struct nvm_dev *dev = s->dev;
+ struct request_queue *q = dev->q;
+ struct nvm_addr src;
+ struct nvm_rev_addr *rev;
+ struct bio *src_bio;
+ struct request *src_rq, *dst_rq = NULL;
+ struct page *page;
+ int slot;
+ DECLARE_COMPLETION(sync);
+
+ if (bitmap_full(block->invalid_pages, s->nr_pages_per_blk))
+ return;
+
+ while ((slot = find_first_zero_bit(block->invalid_pages,
+ s->nr_pages_per_blk)) <
+ s->nr_pages_per_blk) {
+ /* Perform read */
+ src.addr = block_to_addr(block) + slot;
+ src.block = block;
+
+ BUG_ON(src.addr >= s->nr_pages);
+
+ src_bio = bio_alloc(GFP_NOIO, 1);
+ if (!src_bio) {
+ pr_err("nvm: failed to alloc gc bio request");
+ break;
+ }
+ src_bio->bi_iter.bi_sector = src.addr * NR_PHY_IN_LOG;
+ page = mempool_alloc(s->page_pool, GFP_NOIO);
+
+ /* TODO: may fail when EXP_PG_SIZE > PAGE_SIZE */
+ bio_add_pc_page(q, src_bio, page, EXPOSED_PAGE_SIZE, 0);
+
+ src_rq = blk_mq_alloc_request(q, READ, GFP_KERNEL, false);
+ if (!src_rq) {
+ mempool_free(page, s->page_pool);
+ pr_err("nvm: failed to alloc gc request");
+ break;
+ }
+
+ blk_init_request_from_bio(src_rq, src_bio);
+
+ /* We take the reverse lock here, and make sure that we only
+ * release it when we have locked its logical address. If
+ * another write on the same logical address is
+ * occurring, we just let it stall the pipeline.
+ *
+ * We do this for both the read and write. Fixing it after each
+ * IO.
+ */
+ spin_lock(&s->rev_lock);
+ /* We use the physical address to go to the logical page addr,
+ * and then update its mapping to its new place. */
+ rev = &s->rev_trans_map[src.addr];
+
+ /* already updated by previous regular write */
+ if (rev->addr == LTOP_POISON) {
+ spin_unlock(&s->rev_lock);
+ goto overwritten;
+ }
+
+ /* unlocked by nvm_submit_bio nvm_endio */
+ __nvm_lock_laddr_range(s, 1, rev->addr, 1);
+ spin_unlock(&s->rev_lock);
+
+ nvm_setup_rq(s, src_rq, &src, rev->addr, NVM_RQ_GC);
+ blk_execute_rq(q, dev->disk, src_rq, 0);
+ blk_put_request(src_rq);
+
+ dst_rq = blk_mq_alloc_request(q, WRITE, GFP_KERNEL, false);
+ blk_init_request_from_bio(dst_rq, src_bio);
+
+ /* ok, now fix the write and make sure that it hasn't been
+ * moved in the meantime. */
+ spin_lock(&s->rev_lock);
+
+ /* already updated by previous regular write */
+ if (rev->addr == LTOP_POISON) {
+ spin_unlock(&s->rev_lock);
+ goto overwritten;
+ }
+
+ src_bio->bi_iter.bi_sector = rev->addr * NR_PHY_IN_LOG;
+
+ /* again, unlocked by nvm_endio */
+ __nvm_lock_laddr_range(s, 1, rev->addr, 1);
+
+ spin_unlock(&s->rev_lock);
+
+ __nvm_write_rq(s, dst_rq, 1);
+ blk_execute_rq(q, dev->disk, dst_rq, 0);
+
+overwritten:
+ blk_put_request(dst_rq);
+ bio_put(src_bio);
+ mempool_free(page, s->page_pool);
+ }
+
+ WARN_ON(!bitmap_full(block->invalid_pages, s->nr_pages_per_blk));
+}
+
+void nvm_gc_collect(struct work_struct *work)
+{
+ struct nvm_pool *pool = container_of(work, struct nvm_pool, gc_ws);
+ struct nvm_stor *s = pool->s;
+ struct nvm_block *block;
+ unsigned int nr_blocks_need;
+ unsigned long flags;
+
+ nr_blocks_need = pool->nr_blocks / 10;
+
+ if (nr_blocks_need < s->nr_aps)
+ nr_blocks_need = s->nr_aps;
+
+ local_irq_save(flags);
+ spin_lock(&pool->lock);
+ while (nr_blocks_need > pool->nr_free_blocks &&
+ !list_empty(&pool->prio_list)) {
+ block = block_prio_find_max(pool);
+
+ if (!block->nr_invalid_pages) {
+ __show_pool(pool);
+ pr_err("No invalid pages");
+ break;
+ }
+
+ list_del_init(&block->prio);
+
+ BUG_ON(!block_is_full(block));
+ BUG_ON(atomic_inc_return(&block->gc_running) != 1);
+
+ queue_work(s->kgc_wq, &block->ws_gc);
+
+ nr_blocks_need--;
+ }
+ spin_unlock(&pool->lock);
+ s->next_collect_pool++;
+ local_irq_restore(flags);
+
+ /* TODO: Hint that request queue can be started again */
+}
+
+void nvm_gc_block(struct work_struct *work)
+{
+ struct nvm_block *block = container_of(work, struct nvm_block, ws_gc);
+ struct nvm_stor *s = block->pool->s;
+
+ /* TODO: move outside lock to allow multiple pages
+ * in parallel to be erased. */
+ nvm_move_valid_pages(s, block);
+ nvm_erase_block(s, block);
+ s->type->pool_put_blk(block);
+}
+
+void nvm_gc_recycle_block(struct work_struct *work)
+{
+ struct nvm_block *block = container_of(work, struct nvm_block, ws_eio);
+ struct nvm_pool *pool = block->pool;
+
+ spin_lock(&pool->lock);
+ list_add_tail(&block->prio, &pool->prio_list);
+ spin_unlock(&pool->lock);
+}
+
+void nvm_gc_kick(struct nvm_stor *s)
+{
+ struct nvm_pool *pool;
+ unsigned int i;
+
+ BUG_ON(!s);
+
+ nvm_for_each_pool(s, pool, i)
+ queue_pool_gc(pool);
+}
diff --git a/drivers/lightnvm/kv.c b/drivers/lightnvm/kv.c
new file mode 100644
index 0000000..9c30143
--- /dev/null
+++ b/drivers/lightnvm/kv.c
@@ -0,0 +1,513 @@
+#define DEBUG
+#include <linux/crypto.h>
+#include <linux/scatterlist.h>
+#include <linux/jhash.h>
+#include <linux/slab.h>
+#include "nvm.h"
+
+/* inflight uses jenkins hash - less to compare, collisions only result
+ * in unnecessary serialisation.
+ *
+ * could restrict oneself to only grabbing table lock whenever you specifically
+ * want a new entry -- otherwise no concurrent threads will ever be interested
+ * in the same entry
+ */
+
+#define BUCKET_LEN 16
+#define BUCKET_OCCUPANCY_AVG (BUCKET_LEN / 4)
+
+struct kv_inflight {
+ struct list_head list;
+ u32 h1;
+};
+
+struct kv_entry {
+ u64 hash[2];
+ struct nvm_block *blk;
+};
+
+struct nvmkv_io {
+ int offset;
+ unsigned npages;
+ unsigned length;
+ struct page **pages;
+ struct nvm_block *block;
+ int write;
+};
+
+enum {
+ EXISTING_ENTRY = 0,
+ NEW_ENTRY = 1,
+};
+
+static inline unsigned bucket_idx(struct nvmkv_tbl *tbl, u32 hash)
+{
+ return hash % (tbl->tbl_len / BUCKET_LEN);
+}
+
+static void inflight_lock(struct nvmkv_inflight *ilist,
+ struct kv_inflight *ientry)
+{
+ struct kv_inflight *lentry;
+ unsigned long flags;
+
+retry:
+ spin_lock_irqsave(&ilist->lock, flags);
+
+ list_for_each_entry(lentry, &ilist->list, list) {
+ if (lentry->h1 == ientry->h1) {
+ spin_unlock_irqrestore(&ilist->lock, flags);
+ schedule();
+ goto retry;
+ }
+ }
+
+ list_add_tail(&ientry->list, &ilist->list);
+ spin_unlock_irqrestore(&ilist->lock, flags);
+}
+
+static void inflight_unlock(struct nvmkv_inflight *ilist, u32 h1)
+{
+ struct kv_inflight *lentry;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ilist->lock, flags);
+ BUG_ON(list_empty(&ilist->list));
+
+ list_for_each_entry(lentry, &ilist->list, list) {
+ if (lentry->h1 == h1) {
+ list_del(&lentry->list);
+ goto out;
+ }
+ }
+
+ BUG();
+out:
+ spin_unlock_irqrestore(&ilist->lock, flags);
+}
+
+/*TODO reserving '0' for empty entries is technically a no-go as it could be
+ a hash value.*/
+static int __tbl_get_idx(struct nvmkv_tbl *tbl, u32 h1, const u64 *h2,
+ unsigned int type)
+{
+ unsigned b_idx = bucket_idx(tbl, h1);
+ unsigned idx = BUCKET_LEN * b_idx;
+ struct kv_entry *entry = &tbl->entries[idx];
+ unsigned i;
+
+ for (i = 0; i < BUCKET_LEN; i++, entry++) {
+ if (!memcmp(entry->hash, h2, sizeof(u64) * 2)) {
+ if (type == NEW_ENTRY)
+ entry->hash[0] = 1;
+ idx += i;
+ break;
+ }
+ }
+
+ if (i == BUCKET_LEN)
+ idx = -1;
+
+ return idx;
+}
+
+static int tbl_get_idx(struct nvmkv_tbl *tbl, u32 h1, const u64 *h2,
+ unsigned int type)
+{
+ int idx;
+
+ spin_lock(&tbl->lock);
+ idx = __tbl_get_idx(tbl, h1, h2, type);
+ spin_unlock(&tbl->lock);
+ return idx;
+}
+
+
+static int tbl_new_entry(struct nvmkv_tbl *tbl, u32 h1)
+{
+ const u64 empty[2] = { 0, 0 };
+
+ return tbl_get_idx(tbl, h1, empty, NEW_ENTRY);
+}
+
+static u32 hash1(void *key, unsigned key_len)
+{
+ u32 hash;
+ u32 *p = (u32 *) key;
+ u32 len = key_len / sizeof(u32);
+ u32 offset = key_len % sizeof(u32);
+
+ if (offset) {
+ memcpy(&offset, p + len, offset);
+ hash = jhash2(p, len, 0);
+ return jhash2(&offset, 1, hash);
+ }
+ return jhash2(p, len, 0);
+}
+
+static void hash2(void *dst, void *src, size_t src_len)
+{
+ struct scatterlist sg;
+ struct hash_desc hdesc;
+
+ sg_init_one(&sg, src, src_len);
+ hdesc.tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
+ crypto_hash_init(&hdesc);
+ crypto_hash_update(&hdesc, &sg, src_len);
+ crypto_hash_final(&hdesc, (u8 *)dst);
+ crypto_free_hash(hdesc.tfm);
+}
+
+static void *cpy_val(u64 addr, size_t len)
+{
+ void *buf;
+
+ buf = kmalloc(len, GFP_KERNEL);
+ if (!buf)
+ return ERR_PTR(-ENOMEM);
+
+ if (copy_from_user(buf, (void *)addr, len)) {
+ kfree(buf);
+ return ERR_PTR(-EFAULT);
+ }
+ return buf;
+}
+
+static int do_io(struct nvm_stor *s, int rw, u64 blk_addr, void __user *ubuf,
+ unsigned long len)
+{
+ struct nvm_dev *dev = s->dev;
+ struct request_queue *q = dev->q;
+ struct request *rq;
+ struct bio *orig_bio;
+ int ret;
+
+ rq = blk_mq_alloc_request(q, rw, GFP_KERNEL, false);
+ if (!rq) {
+ pr_err("lightnvm: failed to allocate request\n");
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = blk_rq_map_user(q, rq, NULL, ubuf, len, GFP_KERNEL);
+ if (ret) {
+ pr_err("lightnvm: failed to map userspace memory into request\n");
+ goto err_umap;
+ }
+ orig_bio = rq->bio;
+
+ rq->cmd_flags |= REQ_NVM;
+ rq->__sector = blk_addr * NR_PHY_IN_LOG;
+ rq->errors = 0;
+
+ ret = blk_execute_rq(q, dev->disk, rq, 0);
+ if (ret)
+ pr_err("lightnvm: failed to execute request..\n");
+
+ blk_rq_unmap_user(orig_bio);
+
+err_umap:
+ blk_put_request(rq);
+out:
+ return ret;
+}
+
+/**
+ * get - get value from KV store
+ * @s: nvm stor
+ * @cmd: LightNVM KV command
+ * @key: copy of key supplied from userspace.
+ * @h1: hash of key value using hash function 1
+ *
+ * Fetch value identified by the supplied key.
+ */
+static int get(struct nvm_stor *s, struct lightnvm_cmd_kv *cmd, void *key,
+ u32 h1)
+{
+ struct nvmkv_tbl *tbl = &s->kv.tbl;
+ struct kv_entry *entry;
+
+ u64 h2[2];
+ int idx;
+
+ hash2(&h2, key, cmd->key_len);
+
+ idx = tbl_get_idx(tbl, h1, h2, EXISTING_ENTRY);
+ if (idx < 0)
+ return LIGHTNVM_KV_ERR_NOKEY;
+
+ entry = &tbl->entries[idx];
+
+ return do_io(s, READ, block_to_addr(entry->blk),
+ (void __user *)cmd->val_addr, cmd->val_len);
+}
+
+static struct nvm_block *acquire_block(struct nvm_stor *s)
+{
+ struct nvm_ap *ap;
+ struct nvm_pool *pool;
+ struct nvm_block *block = NULL;
+ int i;
+
+ for (i = 0; i < s->nr_aps; i++) {
+ ap = get_next_ap(s);
+ pool = ap->pool;
+
+ block = s->type->pool_get_blk(pool, 0);
+ if (block)
+ break;
+ }
+ return block;
+}
+
+static int update_entry(struct nvm_stor *s, struct lightnvm_cmd_kv *cmd,
+ struct kv_entry *entry)
+{
+ struct nvm_block *block;
+ int ret;
+
+ BUG_ON(!s);
+ BUG_ON(!cmd);
+ BUG_ON(!entry);
+
+ block = acquire_block(s);
+ if (!block) {
+ pr_err("lightnvm: failed to acquire a block\n");
+ ret = -ENOSPC;
+ goto no_block;
+ }
+ ret = do_io(s, WRITE, block_to_addr(block),
+ (void __user *)cmd->val_addr, cmd->val_len);
+ if (ret) {
+ pr_err("lightnvm: failed to write entry\n");
+ ret = -EIO;
+ goto io_err;
+ }
+
+ if (entry->blk)
+ s->type->pool_put_blk(entry->blk);
+
+ entry->blk = block;
+
+ return ret;
+io_err:
+ s->type->pool_put_blk(block);
+no_block:
+ return ret;
+}
+
+/**
+ * put - put/update value in KV store
+ * @s: nvm stor
+ * @cmd: LightNVM KV command
+ * @key: copy of key supplied from userspace.
+ * @h1: hash of key value using hash function 1
+ *
+ * Store the supplied value in an entry identified by the
+ * supplied key. Will overwrite existing entry if necessary.
+ */
+static int put(struct nvm_stor *s, struct lightnvm_cmd_kv *cmd, void *key,
+ u32 h1)
+{
+ struct nvmkv_tbl *tbl = &s->kv.tbl;
+ struct kv_entry *entry;
+ u64 h2[2];
+ int idx, ret;
+
+ hash2(&h2, key, cmd->key_len);
+
+ idx = tbl_get_idx(tbl, h1, h2, EXISTING_ENTRY);
+ if (idx != -1) {
+ entry = &tbl->entries[idx];
+ } else {
+ idx = tbl_new_entry(tbl, h1);
+ if (idx != -1) {
+ entry = &tbl->entries[idx];
+ memcpy(entry->hash, &h2, sizeof(entry->hash));
+ } else {
+ pr_err("lightnvm: no empty entries\n");
+ BUG();
+ }
+ }
+
+ ret = update_entry(s, cmd, entry);
+
+ /* If update_entry failed, we reset the entry->hash, as it was updated
+ * by the previous statements and is no longer valid */
+ if (!entry->blk)
+ memset(entry->hash, 0, sizeof(entry->hash));
+
+ return ret;
+}
+
+/**
+ * update - update existing entry
+ * @s: nvm stor
+ * @cmd: LightNVM KV command
+ * @key: copy of key supplied from userspace.
+ * @h1: hash of key value using hash function 1
+ *
+ * Updates existing value identified by 'k' to the new value.
+ * Operation only succeeds if k points to an existing value.
+ */
+static int update(struct nvm_stor *s, struct lightnvm_cmd_kv *cmd, void *key,
+ u32 h1)
+{
+ struct nvmkv_tbl *tbl = &s->kv.tbl;
+ struct kv_entry *entry;
+ u64 h2[2];
+ int ret;
+
+ hash2(&h2, key, cmd->key_len);
+
+ ret = tbl_get_idx(tbl, h1, h2, EXISTING_ENTRY);
+ if (ret < 0) {
+ pr_debug("lightnvm: no entry, skipping\n");
+ return 0;
+ }
+
+ entry = &tbl->entries[ret];
+
+ ret = update_entry(s, cmd, entry);
+ if (ret)
+ memset(entry->hash, 0, sizeof(entry->hash));
+ return ret;
+}
+
+/**
+ * del - delete entry.
+ * @s: nvm stor
+ * @cmd: LightNVM KV command
+ * @key: copy of key supplied from userspace.
+ * @h1: hash of key value using hash function 1
+ *
+ * Removes the value associated the supplied key.
+ */
+static int del(struct nvm_stor *s, struct lightnvm_cmd_kv *cmd, void *key,
+ u32 h1)
+{
+ struct nvmkv_tbl *tbl = &s->kv.tbl;
+ struct kv_entry *entry;
+ u64 h2[2];
+ int idx;
+
+ hash2(&h2, key, cmd->key_len);
+
+ idx = tbl_get_idx(tbl, h1, h2, EXISTING_ENTRY);
+ if (idx != -1) {
+ entry = &tbl->entries[idx];
+ s->type->pool_put_blk(entry->blk);
+ memset(entry, 0, sizeof(struct kv_entry));
+ } else {
+ pr_debug("lightnvm: could not find entry!\n");
+ }
+
+ return 0;
+}
+
+int nvmkv_unpack(struct nvm_dev *dev, struct lightnvm_cmd_kv __user *ucmd)
+{
+ struct nvm_stor *s = dev->stor;
+ struct nvmkv_inflight *inflight = &s->kv.inflight;
+ struct kv_inflight *ientry;
+ struct lightnvm_cmd_kv cmd;
+ u32 h1;
+ void *key;
+ int ret = 0;
+
+ if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+ return -EFAULT;
+
+ key = cpy_val(cmd.key_addr, cmd.key_len);
+ if (IS_ERR(key)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ h1 = hash1(key, cmd.key_len);
+ ientry = kmem_cache_alloc(inflight->entry_pool, GFP_KERNEL);
+ if (!ientry) {
+ ret = -ENOMEM;
+ goto err_ientry;
+ }
+ ientry->h1 = h1;
+
+ inflight_lock(inflight, ientry);
+
+ switch (cmd.opcode) {
+ case LIGHTNVM_KV_GET:
+ ret = get(s, &cmd, key, h1);
+ break;
+ case LIGHTNVM_KV_PUT:
+ ret = put(s, &cmd, key, h1);
+ break;
+ case LIGHTNVM_KV_UPDATE:
+ ret = update(s, &cmd, key, h1);
+ break;
+ case LIGHTNVM_KV_DEL:
+ ret = del(s, &cmd, key, h1);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ inflight_unlock(inflight, h1);
+
+err_ientry:
+ kfree(key);
+out:
+ if (ret > 0) {
+ ucmd->errcode = ret;
+ ret = 0;
+ }
+ return ret;
+}
+
+int nvmkv_init(struct nvm_stor *s, unsigned long size)
+{
+ struct nvmkv_tbl *tbl = &s->kv.tbl;
+ struct nvmkv_inflight *inflight = &s->kv.inflight;
+ int ret = 0;
+
+ unsigned long buckets = s->total_blocks
+ / (BUCKET_LEN / BUCKET_OCCUPANCY_AVG);
+
+ tbl->bucket_len = BUCKET_LEN;
+ tbl->tbl_len = buckets * tbl->bucket_len;
+
+ tbl->entries = vzalloc(tbl->tbl_len * sizeof(struct kv_entry));
+ if (!tbl->entries) {
+ ret = -ENOMEM;
+ goto err_tbl_entries;
+ }
+
+ inflight->entry_pool = kmem_cache_create("nvmkv_inflight_pool",
+ sizeof(struct kv_inflight), 0,
+ SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+ NULL);
+ if (!inflight->entry_pool) {
+ ret = -ENOMEM;
+ goto err_inflight_pool;
+ }
+
+ spin_lock_init(&tbl->lock);
+ INIT_LIST_HEAD(&inflight->list);
+ spin_lock_init(&inflight->lock);
+
+ return 0;
+
+err_inflight_pool:
+ vfree(tbl->entries);
+err_tbl_entries:
+ return ret;
+}
+
+void nvmkv_exit(struct nvm_stor *s)
+{
+ struct nvmkv_tbl *tbl = &s->kv.tbl;
+ struct nvmkv_inflight *inflight = &s->kv.inflight;
+
+ vfree(tbl->entries);
+ kmem_cache_destroy(inflight->entry_pool);
+}
diff --git a/drivers/lightnvm/nvm.c b/drivers/lightnvm/nvm.c
new file mode 100644
index 0000000..aaf3aca
--- /dev/null
+++ b/drivers/lightnvm/nvm.c
@@ -0,0 +1,540 @@
+/*
+ * Copyright (C) 2014 Matias Bjørling.
+ *
+ * Todo
+ *
+ * - Implement fetching of bad pages from flash
+ * - configurable sector size
+ * - handle case of in-page bv_offset (currently hidden assumption of offset=0,
+ * and bv_len spans entire page)
+ *
+ * Optimization possibilities
+ * - Implement per-cpu nvm_block data structure ownership. Removes need
+ * for taking lock on block next_write_id function. I.e. page allocation
+ * becomes nearly lockless, with occasionally movement of blocks on
+ * nvm_block lists.
+ */
+
+#include <linux/blk-mq.h>
+#include <linux/list.h>
+#include <linux/sem.h>
+#include <linux/types.h>
+#include <linux/lightnvm.h>
+
+#include <linux/ktime.h>
+#include <trace/events/block.h>
+
+#include "nvm.h"
+
+/* Defaults
+ * Number of append points per pool. We assume that accesses within a pool is
+ * serial (NAND flash/PCM/etc.)
+ */
+#define APS_PER_POOL 1
+
+/* Run GC every X seconds */
+#define GC_TIME 10
+
+/* Minimum pages needed within a pool */
+#define MIN_POOL_PAGES 16
+
+extern struct nvm_target_type nvm_target_rrpc;
+
+static struct kmem_cache *_addr_cache;
+
+static LIST_HEAD(_targets);
+static DECLARE_RWSEM(_lock);
+
+struct nvm_target_type *find_nvm_target_type(const char *name)
+{
+ struct nvm_target_type *tt;
+
+ list_for_each_entry(tt, &_targets, list)
+ if (!strcmp(name, tt->name))
+ return tt;
+
+ return NULL;
+}
+
+int nvm_register_target(struct nvm_target_type *tt)
+{
+ int ret = 0;
+
+ down_write(&_lock);
+ if (find_nvm_target_type(tt->name))
+ ret = -EEXIST;
+ else
+ list_add(&tt->list, &_targets);
+ up_write(&_lock);
+ return ret;
+}
+
+void nvm_unregister_target(struct nvm_target_type *tt)
+{
+ if (!tt)
+ return;
+
+ down_write(&_lock);
+ list_del(&tt->list);
+ up_write(&_lock);
+}
+
+int nvm_queue_rq(struct nvm_dev *dev, struct request *rq)
+{
+ struct nvm_stor *s = dev->stor;
+ int ret;
+
+ if (rq->cmd_flags & REQ_NVM_MAPPED)
+ return BLK_MQ_RQ_QUEUE_OK;
+
+ if (blk_rq_pos(rq) / NR_PHY_IN_LOG > s->nr_pages) {
+ pr_err("lightnvm: out-of-bound address: %llu",
+ (unsigned long long) blk_rq_pos(rq));
+ return BLK_MQ_RQ_QUEUE_ERROR;
+ }
+
+
+ if (rq_data_dir(rq) == WRITE)
+ ret = s->type->write_rq(s, rq);
+ else
+ ret = s->type->read_rq(s, rq);
+
+ if (ret == BLK_MQ_RQ_QUEUE_OK)
+ rq->cmd_flags |= (REQ_NVM|REQ_NVM_MAPPED);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(nvm_queue_rq);
+
+void nvm_end_io(struct nvm_dev *nvm_dev, struct request *rq, int error)
+{
+ if (rq->cmd_flags & (REQ_NVM|REQ_NVM_MAPPED))
+ nvm_endio(nvm_dev, rq, error);
+
+ if (!(rq->cmd_flags & REQ_NVM))
+ pr_info("lightnvm: request outside lightnvm detected.\n");
+
+ blk_mq_end_io(rq, error);
+}
+EXPORT_SYMBOL_GPL(nvm_end_io);
+
+void nvm_complete_request(struct nvm_dev *nvm_dev, struct request *rq)
+{
+ if (rq->cmd_flags & (REQ_NVM|REQ_NVM_MAPPED))
+ nvm_endio(nvm_dev, rq, 0);
+
+ if (!(rq->cmd_flags & REQ_NVM))
+ pr_info("lightnvm: request outside lightnvm.\n");
+
+ blk_mq_complete_request(rq);
+}
+EXPORT_SYMBOL_GPL(nvm_complete_request);
+
+unsigned int nvm_cmd_size(void)
+{
+ return sizeof(struct per_rq_data);
+}
+EXPORT_SYMBOL_GPL(nvm_cmd_size);
+
+static void nvm_pools_free(struct nvm_stor *s)
+{
+ struct nvm_pool *pool;
+ int i;
+
+ if (s->krqd_wq)
+ destroy_workqueue(s->krqd_wq);
+
+ if (s->kgc_wq)
+ destroy_workqueue(s->kgc_wq);
+
+ nvm_for_each_pool(s, pool, i) {
+ if (!pool->blocks)
+ break;
+ kfree(pool->blocks);
+ }
+ kfree(s->pools);
+ kfree(s->aps);
+}
+
+static int nvm_pools_init(struct nvm_stor *s)
+{
+ struct nvm_pool *pool;
+ struct nvm_block *block;
+ struct nvm_ap *ap;
+ int i, j;
+
+ spin_lock_init(&s->rev_lock);
+
+ s->pools = kcalloc(s->nr_pools, sizeof(struct nvm_pool), GFP_KERNEL);
+ if (!s->pools)
+ goto err_pool;
+
+ nvm_for_each_pool(s, pool, i) {
+ spin_lock_init(&pool->lock);
+ spin_lock_init(&pool->waiting_lock);
+
+ init_completion(&pool->gc_finished);
+
+ INIT_WORK(&pool->gc_ws, nvm_gc_collect);
+
+ INIT_LIST_HEAD(&pool->free_list);
+ INIT_LIST_HEAD(&pool->used_list);
+ INIT_LIST_HEAD(&pool->prio_list);
+
+ pool->id = i;
+ pool->s = s;
+ pool->phy_addr_start = i * s->nr_blks_per_pool;
+ pool->phy_addr_end = (i + 1) * s->nr_blks_per_pool - 1;
+ pool->nr_free_blocks = pool->nr_blocks =
+ pool->phy_addr_end - pool->phy_addr_start + 1;
+ bio_list_init(&pool->waiting_bios);
+ atomic_set(&pool->is_active, 0);
+
+ pool->blocks = vzalloc(sizeof(struct nvm_block) *
+ pool->nr_blocks);
+ if (!pool->blocks)
+ goto err_blocks;
+
+ pool_for_each_block(pool, block, j) {
+ spin_lock_init(&block->lock);
+ atomic_set(&block->gc_running, 0);
+ INIT_LIST_HEAD(&block->list);
+ INIT_LIST_HEAD(&block->prio);
+
+ block->pool = pool;
+ block->id = (i * s->nr_blks_per_pool) + j;
+
+ list_add_tail(&block->list, &pool->free_list);
+ INIT_WORK(&block->ws_gc, nvm_gc_block);
+ INIT_WORK(&block->ws_eio, nvm_gc_recycle_block);
+ }
+ }
+
+ s->nr_aps = s->nr_aps_per_pool * s->nr_pools;
+ s->aps = kcalloc(s->nr_aps, sizeof(struct nvm_ap), GFP_KERNEL);
+ if (!s->aps)
+ goto err_blocks;
+
+ nvm_for_each_ap(s, ap, i) {
+ spin_lock_init(&ap->lock);
+ ap->parent = s;
+ ap->pool = &s->pools[i / s->nr_aps_per_pool];
+
+ block = s->type->pool_get_blk(ap->pool, 0);
+ nvm_set_ap_cur(ap, block);
+
+ /* Emergency gc block */
+ block = s->type->pool_get_blk(ap->pool, 1);
+ ap->gc_cur = block;
+
+ ap->t_read = s->config.t_read;
+ ap->t_write = s->config.t_write;
+ ap->t_erase = s->config.t_erase;
+ }
+
+ /* we make room for each pool context. */
+ s->krqd_wq = alloc_workqueue("knvm-work", WQ_MEM_RECLAIM|WQ_UNBOUND,
+ s->nr_pools);
+ if (!s->krqd_wq) {
+ pr_err("Couldn't alloc knvm-work");
+ goto err_blocks;
+ }
+
+ s->kgc_wq = alloc_workqueue("knvm-gc", WQ_MEM_RECLAIM, 1);
+ if (!s->kgc_wq) {
+ pr_err("Couldn't alloc knvm-gc");
+ goto err_blocks;
+ }
+
+ return 0;
+err_blocks:
+ nvm_pools_free(s);
+err_pool:
+ pr_err("lightnvm: cannot allocate lightnvm data structures");
+ return -ENOMEM;
+}
+
+static int nvm_stor_init(struct nvm_dev *dev, struct nvm_stor *s)
+{
+ int i;
+
+ s->trans_map = vzalloc(sizeof(struct nvm_addr) * s->nr_pages);
+ if (!s->trans_map)
+ return -ENOMEM;
+
+ s->rev_trans_map = vmalloc(sizeof(struct nvm_rev_addr)
+ * s->nr_pages);
+ if (!s->rev_trans_map)
+ goto err_rev_trans_map;
+
+ for (i = 0; i < s->nr_pages; i++) {
+ struct nvm_addr *p = &s->trans_map[i];
+ struct nvm_rev_addr *r = &s->rev_trans_map[i];
+
+ p->addr = LTOP_EMPTY;
+ r->addr = 0xDEADBEEF;
+ }
+
+ s->page_pool = mempool_create_page_pool(MIN_POOL_PAGES, 0);
+ if (!s->page_pool)
+ goto err_dev_lookup;
+
+ s->addr_pool = mempool_create_slab_pool(64, _addr_cache);
+ if (!s->addr_pool)
+ goto err_page_pool;
+
+ s->sector_size = EXPOSED_PAGE_SIZE;
+
+ /* inflight maintenance */
+ percpu_ida_init(&s->free_inflight, NVM_INFLIGHT_TAGS);
+
+ for (i = 0; i < NVM_INFLIGHT_PARTITIONS; i++) {
+ spin_lock_init(&s->inflight_map[i].lock);
+ INIT_LIST_HEAD(&s->inflight_map[i].reqs);
+ }
+
+ /* simple round-robin strategy */
+ atomic_set(&s->next_write_ap, -1);
+
+ s->dev = (void *)dev;
+ dev->stor = s;
+
+ /* Initialize pools. */
+ nvm_pools_init(s);
+
+ if (s->type->init && s->type->init(s))
+ goto err_addr_pool;
+
+ /* FIXME: Clean up pool init on failure. */
+ setup_timer(&s->gc_timer, nvm_gc_cb, (unsigned long)s);
+ mod_timer(&s->gc_timer, jiffies + msecs_to_jiffies(1000));
+
+ return 0;
+err_addr_pool:
+ nvm_pools_free(s);
+ mempool_destroy(s->addr_pool);
+err_page_pool:
+ mempool_destroy(s->page_pool);
+err_dev_lookup:
+ vfree(s->rev_trans_map);
+err_rev_trans_map:
+ vfree(s->trans_map);
+ return -ENOMEM;
+}
+
+#define NVM_TARGET_TYPE "rrpc"
+#define NVM_NUM_POOLS 8
+#define NVM_NUM_BLOCKS 256
+#define NVM_NUM_PAGES 256
+
+struct nvm_dev *nvm_alloc()
+{
+ return kmalloc(sizeof(struct nvm_dev), GFP_KERNEL);
+}
+EXPORT_SYMBOL_GPL(nvm_alloc);
+
+void nvm_free(struct nvm_dev *dev)
+{
+ kfree(dev);
+}
+EXPORT_SYMBOL_GPL(nvm_free);
+
+int nvm_queue_init(struct nvm_dev *dev)
+{
+ int nr_sectors_per_page = 8; /* 512 bytes */
+
+ if (queue_logical_block_size(dev->q) > (nr_sectors_per_page << 9)) {
+ pr_err("nvm: logical page size not supported by hardware");
+ return false;
+ }
+
+ return true;
+}
+
+int nvm_init(struct gendisk *disk, struct nvm_dev *dev)
+{
+ struct nvm_stor *s;
+ struct nvm_id nvm_id;
+ struct nvm_id_chnl *nvm_id_chnl;
+ int ret = 0;
+
+ unsigned long size;
+
+ if (!dev->ops->identify)
+ return -EINVAL;
+
+ if (!nvm_queue_init(dev))
+ return -EINVAL;
+
+ nvm_id_chnl = kmalloc(sizeof(struct nvm_id_chnl), GFP_KERNEL);
+ if (!nvm_id_chnl) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ _addr_cache = kmem_cache_create("nvm_addr_cache",
+ sizeof(struct nvm_addr), 0, 0, NULL);
+ if (!_addr_cache) {
+ ret = -ENOMEM;
+ goto err_memcache;
+ }
+
+ nvm_register_target(&nvm_target_rrpc);
+
+ s = kzalloc(sizeof(struct nvm_stor), GFP_KERNEL);
+ if (!s) {
+ ret = -ENOMEM;
+ goto err_stor;
+ }
+
+ /* hardcode initialization values until user-space util is avail. */
+ s->type = &nvm_target_rrpc;
+ if (!s->type) {
+ pr_err("nvm: %s doesn't exist.", NVM_TARGET_TYPE);
+ ret = -EINVAL;
+ goto err_target;
+ }
+
+ if (dev->ops->identify(dev, &nvm_id)) {
+ ret = -EINVAL;
+ goto err_target;
+ }
+
+ s->nr_pools = nvm_id.nchannels;
+
+ /* TODO: We're limited to the same setup for each channel */
+ if (dev->ops->identify_channel(dev, 0, nvm_id_chnl)) {
+ ret = -EINVAL;
+ goto err_target;
+ }
+
+ s->gran_blk = le64_to_cpu(nvm_id_chnl->gran_erase);
+ s->gran_read = le64_to_cpu(nvm_id_chnl->gran_read);
+ s->gran_write = le64_to_cpu(nvm_id_chnl->gran_write);
+
+ size = (nvm_id_chnl->laddr_end - nvm_id_chnl->laddr_begin)
+ * min(s->gran_read, s->gran_write);
+
+ s->total_blocks = size / s->gran_blk;
+ s->nr_blks_per_pool = s->total_blocks / nvm_id.nchannels;
+ /* TODO: gran_{read,write} may differ */
+ s->nr_pages_per_blk = s->gran_blk / s->gran_read *
+ (s->gran_read / EXPOSED_PAGE_SIZE);
+
+ s->nr_aps_per_pool = APS_PER_POOL;
+ /* s->config.flags = NVM_OPT_* */
+ s->config.gc_time = GC_TIME;
+ s->config.t_read = le32_to_cpu(nvm_id_chnl->t_r) / 1000;
+ s->config.t_write = le32_to_cpu(nvm_id_chnl->t_w) / 1000;
+ s->config.t_erase = le32_to_cpu(nvm_id_chnl->t_e) / 1000;
+
+ /* Constants */
+ s->nr_pages = s->nr_pools * s->nr_blks_per_pool * s->nr_pages_per_blk;
+
+ ret = nvmkv_init(s, size);
+ if (ret) {
+ pr_err("lightnvm: kv init failed.\n");
+ goto err_target;
+ }
+
+ if (s->nr_pages_per_blk > MAX_INVALID_PAGES_STORAGE * BITS_PER_LONG) {
+ pr_err("lightnvm: Num. pages per block too high. Increase MAX_INVALID_PAGES_STORAGE.");
+ ret = -EINVAL;
+ goto err_target;
+ }
+
+ ret = nvm_stor_init(dev, s);
+ if (ret < 0) {
+ pr_err("lightnvm: cannot initialize nvm structure.");
+ goto err_map;
+ }
+
+ pr_info("lightnvm: pools: %u\n", s->nr_pools);
+ pr_info("lightnvm: blocks: %u\n", s->nr_blks_per_pool);
+ pr_info("lightnvm: pages per block: %u\n", s->nr_pages_per_blk);
+ pr_info("lightnvm: append points: %u\n", s->nr_aps);
+ pr_info("lightnvm: append points per pool: %u\n", s->nr_aps_per_pool);
+ pr_info("lightnvm: timings: %u/%u/%u\n",
+ s->config.t_read,
+ s->config.t_write,
+ s->config.t_erase);
+ pr_info("lightnvm: target sector size=%d\n", s->sector_size);
+ pr_info("lightnvm: disk flash size=%d map size=%d\n",
+ s->gran_read, EXPOSED_PAGE_SIZE);
+ pr_info("lightnvm: allocated %lu physical pages (%lu KB)\n",
+ s->nr_pages, s->nr_pages * s->sector_size / 1024);
+
+ dev->stor = s;
+ kfree(nvm_id_chnl);
+ return 0;
+
+err_map:
+ nvmkv_exit(s);
+err_target:
+ kfree(s);
+err_stor:
+ kmem_cache_destroy(_addr_cache);
+err_memcache:
+ kfree(nvm_id_chnl);
+err:
+ pr_err("lightnvm: failed to initialize nvm\n");
+ return ret;
+}
+EXPORT_SYMBOL_GPL(nvm_init);
+
+void nvm_exit(struct nvm_dev *dev)
+{
+ struct nvm_stor *s = dev->stor;
+
+ if (!s)
+ return;
+
+ if (s->type->exit)
+ s->type->exit(s);
+
+ del_timer(&s->gc_timer);
+
+ /* TODO: remember outstanding block refs, waiting to be erased... */
+ nvm_pools_free(s);
+
+ vfree(s->trans_map);
+ vfree(s->rev_trans_map);
+
+ mempool_destroy(s->page_pool);
+ mempool_destroy(s->addr_pool);
+
+ percpu_ida_destroy(&s->free_inflight);
+
+ kfree(s);
+
+ kmem_cache_destroy(_addr_cache);
+
+ pr_info("lightnvm: successfully unloaded");
+}
+
+int nvm_ioctl(struct nvm_dev *dev, fmode_t mode, unsigned int cmd,
+ unsigned long arg)
+{
+ switch (cmd) {
+ case LIGHTNVM_IOCTL_KV:
+ return nvmkv_unpack(dev, (void __user *)arg);
+ default:
+ return -ENOTTY;
+ }
+}
+EXPORT_SYMBOL_GPL(nvm_ioctl);
+
+#ifdef CONFIG_COMPAT
+int nvm_compat_ioctl(struct nvm_dev *dev, fmode_t mode,
+ unsigned int cmd, unsigned long arg)
+{
+ return nvm_ioctl(dev, mode, cmd, arg);
+}
+EXPORT_SYMBOL_GPL(nvm_compat_ioctl);
+#else
+#define nvm_compat_ioctl NULL
+#endif
+
+MODULE_DESCRIPTION("LightNVM");
+MODULE_AUTHOR("Matias Bjorling <[email protected]>");
+MODULE_LICENSE("GPL");
diff --git a/drivers/lightnvm/nvm.h b/drivers/lightnvm/nvm.h
new file mode 100644
index 0000000..7acf34d
--- /dev/null
+++ b/drivers/lightnvm/nvm.h
@@ -0,0 +1,632 @@
+/*
+ * Copyright (C) 2014 Matias Bjørling.
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef NVM_H_
+#define NVM_H_
+
+#include <linux/blkdev.h>
+#include <linux/list.h>
+#include <linux/list_sort.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/atomic.h>
+#include <linux/delay.h>
+#include <linux/time.h>
+#include <linux/workqueue.h>
+#include <linux/kthread.h>
+#include <linux/mempool.h>
+#include <linux/kref.h>
+#include <linux/completion.h>
+#include <linux/hashtable.h>
+#include <linux/percpu_ida.h>
+#include <linux/lightnvm.h>
+#include <linux/blk-mq.h>
+#include <linux/slab.h>
+
+#ifdef NVM_DEBUG
+#define NVM_ASSERT(c) BUG_ON((c) == 0)
+#else
+#define NVM_ASSERT(c)
+#endif
+
+#define NVM_MSG_PREFIX "nvm"
+#define LTOP_EMPTY -1
+#define LTOP_POISON 0xD3ADB33F
+
+/*
+ * For now we hardcode some of the configuration for the LightNVM device that we
+ * have. In the future this should be made configurable.
+ *
+ * Configuration:
+ * EXPOSED_PAGE_SIZE - the page size that we tell the layers above the
+ * driver to issue. This is usually 512 bytes or 4K for simplicity.
+ */
+
+#define EXPOSED_PAGE_SIZE 4096
+
+/* We currently assume that the lightnvm device accepts data in 512
+ * byte chunks. This should be set to the smallest command size available for a
+ * given device.
+ */
+#define NR_PHY_IN_LOG (EXPOSED_PAGE_SIZE / 512)
+
+/* We partition the namespace of translation map into these pieces for tracking
+ * in-flight addresses. */
+#define NVM_INFLIGHT_PARTITIONS 8
+#define NVM_INFLIGHT_TAGS 256
+
+#define NVM_OPT_MISC_OFFSET 15
+
+enum ltop_flags {
+ /* Update primary mapping (and init secondary mapping as a result) */
+ MAP_PRIMARY = 1 << 0,
+ /* Update only shadow mapping */
+ MAP_SHADOW = 1 << 1,
+ /* Update only the relevant mapping (primary/shadow) */
+ MAP_SINGLE = 1 << 2,
+};
+
+enum target_flags {
+ /* No hints applied */
+ NVM_OPT_ENGINE_NONE = 0 << 0,
+ /* Swap aware hints. Detected from block request type */
+ NVM_OPT_ENGINE_SWAP = 1 << 0,
+ /* IOCTL aware hints. Applications may submit direct hints */
+ NVM_OPT_ENGINE_IOCTL = 1 << 1,
+ /* Latency aware hints. Detected from file type or directly from app */
+ NVM_OPT_ENGINE_LATENCY = 1 << 2,
+ /* Pack aware hints. Detected from file type or directly from app */
+ NVM_OPT_ENGINE_PACK = 1 << 3,
+
+ /* Control accesses to append points in the host. Enable this for
+ * devices that doesn't have an internal queue that only lets one
+ * command run at a time within an append point */
+ NVM_OPT_POOL_SERIALIZE = 1 << NVM_OPT_MISC_OFFSET,
+ /* Use fast/slow page access pattern */
+ NVM_OPT_FAST_SLOW_PAGES = 1 << (NVM_OPT_MISC_OFFSET+1),
+ /* Disable dev waits */
+ NVM_OPT_NO_WAITS = 1 << (NVM_OPT_MISC_OFFSET+2),
+};
+
+/* Pool descriptions */
+struct nvm_block {
+ struct {
+ spinlock_t lock;
+ /* points to the next writable page within a block */
+ unsigned int next_page;
+ /* number of pages that are invalid, wrt host page size */
+ unsigned int nr_invalid_pages;
+#define MAX_INVALID_PAGES_STORAGE 8
+ /* Bitmap for invalid page entries */
+ unsigned long invalid_pages[MAX_INVALID_PAGES_STORAGE];
+ } ____cacheline_aligned_in_smp;
+
+ unsigned int id;
+ struct nvm_pool *pool;
+ struct nvm_ap *ap;
+
+ /* Management and GC structures */
+ struct list_head list;
+ struct list_head prio;
+
+ /* Persistent data structures */
+ atomic_t data_size; /* data pages inserted into data variable */
+ atomic_t data_cmnt_size; /* data pages committed to stable storage */
+
+ /* Block state handling */
+ atomic_t gc_running;
+ struct work_struct ws_gc;
+ struct work_struct ws_eio;
+};
+
+/* Logical to physical mapping */
+struct nvm_addr {
+ sector_t addr;
+ struct nvm_block *block;
+};
+
+/* Physical to logical mapping */
+struct nvm_rev_addr {
+ sector_t addr;
+};
+
+struct nvm_pool {
+ /* Pool block lists */
+ struct {
+ spinlock_t lock;
+ } ____cacheline_aligned_in_smp;
+
+ struct list_head used_list; /* In-use blocks */
+ struct list_head free_list; /* Not used blocks i.e. released
+ * and ready for use */
+ struct list_head prio_list; /* Blocks that may be GC'ed. */
+
+ unsigned int id;
+ /* References the physical start block */
+ unsigned long phy_addr_start;
+ /* References the physical end block */
+ unsigned int phy_addr_end;
+
+ unsigned int nr_blocks; /* end_block - start_block. */
+ unsigned int nr_free_blocks; /* Number of unused blocks */
+
+ struct nvm_block *blocks;
+ struct nvm_stor *s;
+
+ /* Postpone issuing I/O if append point is active */
+ atomic_t is_active;
+
+ spinlock_t waiting_lock;
+ struct work_struct waiting_ws;
+ struct bio_list waiting_bios;
+
+ struct bio *cur_bio;
+
+ unsigned int gc_running;
+ struct completion gc_finished;
+ struct work_struct gc_ws;
+
+ void *private;
+};
+
+/*
+ * nvm_ap. ap is an append point. A pool can have 1..X append points attached.
+ * An append point has a current block that it writes to, and when it is full,
+ * it requests a new block, to which it continues its writes.
+ *
+ * one ap per pool may be reserved for pack-hints related writes.
+ * In those that are not, private is NULL.
+ */
+struct nvm_ap {
+ spinlock_t lock;
+ struct nvm_stor *parent;
+ struct nvm_pool *pool;
+ struct nvm_block *cur;
+ struct nvm_block *gc_cur;
+
+ /* Timings used for end_io waiting */
+ unsigned long t_read;
+ unsigned long t_write;
+ unsigned long t_erase;
+
+ unsigned long io_delayed;
+
+ /* Private field for submodules */
+ void *private;
+};
+
+struct nvm_config {
+ unsigned long flags;
+
+ unsigned int gc_time; /* GC every X microseconds */
+
+ unsigned int t_read;
+ unsigned int t_write;
+ unsigned int t_erase;
+};
+
+struct nvm_inflight_request {
+ struct list_head list;
+ sector_t l_start;
+ sector_t l_end;
+ int tag;
+};
+
+struct nvm_inflight {
+ spinlock_t lock;
+ struct list_head reqs;
+};
+
+struct nvm_stor;
+struct per_rq_data;
+struct nvm_block;
+struct nvm_pool;
+
+/* overridable functionality */
+typedef struct nvm_addr *(*nvm_lookup_ltop_fn)(struct nvm_stor *, sector_t);
+typedef struct nvm_addr *(*nvm_map_ltop_page_fn)(struct nvm_stor *, sector_t,
+ int);
+typedef struct nvm_block *(*nvm_map_ltop_block_fn)(struct nvm_stor *, sector_t,
+ int);
+typedef int (*nvm_write_rq_fn)(struct nvm_stor *, struct request *);
+typedef int (*nvm_read_rq_fn)(struct nvm_stor *, struct request *);
+typedef void (*nvm_alloc_phys_addr_fn)(struct nvm_stor *, struct nvm_block *);
+typedef struct nvm_block *(*nvm_pool_get_blk_fn)(struct nvm_pool *pool,
+ int is_gc);
+typedef void (*nvm_pool_put_blk_fn)(struct nvm_block *block);
+typedef int (*nvm_ioctl_fn)(struct nvm_stor *,
+ unsigned int cmd, unsigned long arg);
+typedef int (*nvm_init_fn)(struct nvm_stor *);
+typedef void (*nvm_exit_fn)(struct nvm_stor *);
+typedef void (*nvm_endio_fn)(struct nvm_stor *, struct request *,
+ struct per_rq_data *, unsigned long *delay);
+
+struct nvm_target_type {
+ const char *name;
+ unsigned int version[3];
+ unsigned int per_rq_size;
+
+ /* lookup functions */
+ nvm_lookup_ltop_fn lookup_ltop;
+
+ /* handling of request */
+ nvm_write_rq_fn write_rq;
+ nvm_read_rq_fn read_rq;
+ nvm_ioctl_fn ioctl;
+ nvm_endio_fn end_rq;
+
+ /* engine-specific overrides */
+ nvm_pool_get_blk_fn pool_get_blk;
+ nvm_pool_put_blk_fn pool_put_blk;
+ nvm_map_ltop_page_fn map_page;
+ nvm_map_ltop_block_fn map_block;
+
+ /* module specific init/teardown */
+ nvm_init_fn init;
+ nvm_exit_fn exit;
+
+ /* For lightnvm internal use */
+ struct list_head list;
+};
+
+struct kv_entry;
+
+struct nvmkv_tbl {
+ u8 bucket_len;
+ u64 tbl_len;
+ struct kv_entry *entries;
+ spinlock_t lock;
+};
+
+struct nvmkv_inflight {
+ struct kmem_cache *entry_pool;
+ spinlock_t lock;
+ struct list_head list;
+};
+
+struct nvmkv {
+ struct nvmkv_tbl tbl;
+ struct nvmkv_inflight inflight;
+};
+
+/* Main structure */
+struct nvm_stor {
+ struct nvm_dev *dev;
+ uint32_t sector_size;
+
+ struct nvm_target_type *type;
+
+ /* Simple translation map of logical addresses to physical addresses.
+ * The logical addresses are known by the host system, while the physical
+ * addresses are used when writing to the disk block device. */
+ struct nvm_addr *trans_map;
+ /* also store a reverse map for garbage collection */
+ struct nvm_rev_addr *rev_trans_map;
+ spinlock_t rev_lock;
+ /* Usually instantiated to the number of available parallel channels
+ * within the hardware device. i.e. a controller with 4 flash channels,
+ * would have 4 pools.
+ *
+ * We assume that the device exposes its channels as a linear address
+ * space. A pool therefore has a phy_addr_start and phy_addr_end that
+ * denotes the start and end. This abstraction is used to let the
+ * lightnvm (or any other device) expose its read/write/erase interface
+ * and be administrated by the host system.
+ */
+ struct nvm_pool *pools;
+
+ /* Append points */
+ struct nvm_ap *aps;
+
+ mempool_t *addr_pool;
+ mempool_t *page_pool;
+
+ /* Frequently used config variables */
+ int nr_pools;
+ int nr_blks_per_pool;
+ int nr_pages_per_blk;
+ int nr_aps;
+ int nr_aps_per_pool;
+ unsigned gran_blk;
+ unsigned gran_read;
+ unsigned gran_write;
+
+ /* Calculated/Cached values. These do not reflect the actual usable
+ * blocks at run-time. */
+ unsigned long nr_pages;
+ unsigned long total_blocks;
+
+ unsigned int next_collect_pool;
+
+ /* Write strategy variables. Move these into a separate structure for each
+ * strategy */
+ atomic_t next_write_ap; /* Whenever a page is written, this is updated
+ * to point to the next write append point */
+ struct workqueue_struct *krqd_wq;
+ struct workqueue_struct *kgc_wq;
+
+ struct timer_list gc_timer;
+
+ /* in-flight data lookup, lookup by logical address. Remember the
+ * overhead of cachelines being used. Keep it low for better cache
+ * utilization. */
+ struct percpu_ida free_inflight;
+ struct nvm_inflight inflight_map[NVM_INFLIGHT_PARTITIONS];
+ struct nvm_inflight_request inflight_addrs[NVM_INFLIGHT_TAGS];
+
+ /* nvm module specific data */
+ void *private;
+
+ /* User configuration */
+ struct nvm_config config;
+
+ unsigned int per_rq_offset;
+
+ struct nvmkv kv;
+};
+
+struct per_rq_data_nvm {
+ struct nvm_dev *dev;
+};
+
+enum {
+ NVM_RQ_NONE = 0,
+ NVM_RQ_GC = 1,
+};
+
+struct per_rq_data {
+ struct nvm_ap *ap;
+ struct nvm_addr *addr;
+ sector_t l_addr;
+ unsigned int flags;
+};
+
+/* reg.c */
+int nvm_register_target(struct nvm_target_type *t);
+void nvm_unregister_target(struct nvm_target_type *t);
+struct nvm_target_type *find_nvm_target_type(const char *name);
+
+/* core.c */
+/* Helpers */
+void nvm_set_ap_cur(struct nvm_ap *, struct nvm_block *);
+sector_t nvm_alloc_phys_addr(struct nvm_block *);
+
+/* Naive implementations */
+void nvm_delayed_bio_submit(struct work_struct *);
+void nvm_deferred_bio_submit(struct work_struct *);
+void nvm_gc_block(struct work_struct *);
+void nvm_gc_recycle_block(struct work_struct *);
+
+/* Allocation of physical addresses from block
+ * when increasing responsibility. */
+struct nvm_addr *nvm_alloc_addr_from_ap(struct nvm_ap *, int is_gc);
+
+/* I/O request related */
+int nvm_write_rq(struct nvm_stor *, struct request *);
+int __nvm_write_rq(struct nvm_stor *, struct request *, int);
+int nvm_read_rq(struct nvm_stor *, struct request *rq);
+int nvm_erase_block(struct nvm_stor *, struct nvm_block *);
+void nvm_update_map(struct nvm_stor *, sector_t, struct nvm_addr *, int);
+void nvm_setup_rq(struct nvm_stor *, struct request *, struct nvm_addr *, sector_t, unsigned int flags);
+
+/* Block maintenance */
+void nvm_reset_block(struct nvm_block *);
+
+void nvm_endio(struct nvm_dev *, struct request *, int);
+
+/* gc.c */
+void nvm_block_erase(struct kref *);
+void nvm_gc_cb(unsigned long data);
+void nvm_gc_collect(struct work_struct *work);
+void nvm_gc_kick(struct nvm_stor *s);
+
+/* targets.c */
+struct nvm_block *nvm_pool_get_block(struct nvm_pool *, int is_gc);
+
+/* nvmkv.c */
+int nvmkv_init(struct nvm_stor *s, unsigned long size);
+void nvmkv_exit(struct nvm_stor *s);
+int nvmkv_unpack(struct nvm_dev *dev, struct lightnvm_cmd_kv __user *ucmd);
+void nvm_pool_put_block(struct nvm_block *);
+
+
+#define nvm_for_each_pool(n, pool, i) \
+ for ((i) = 0, pool = &(n)->pools[0]; \
+ (i) < (n)->nr_pools; (i)++, pool = &(n)->pools[(i)])
+
+#define nvm_for_each_ap(n, ap, i) \
+ for ((i) = 0, ap = &(n)->aps[0]; \
+ (i) < (n)->nr_aps; (i)++, ap = &(n)->aps[(i)])
+
+#define pool_for_each_block(p, b, i) \
+ for ((i) = 0, b = &(p)->blocks[0]; \
+ (i) < (p)->nr_blocks; (i)++, b = &(p)->blocks[(i)])
+
+#define block_for_each_page(b, p) \
+ for ((p)->addr = block_to_addr((b)), (p)->block = (b); \
+ (p)->addr < block_to_addr((b)) \
+ + (b)->pool->s->nr_pages_per_blk; \
+ (p)->addr++)
+
+static inline struct nvm_ap *get_next_ap(struct nvm_stor *s)
+{
+ return &s->aps[atomic_inc_return(&s->next_write_ap) % s->nr_aps];
+}
+
+static inline int block_is_full(struct nvm_block *block)
+{
+ struct nvm_stor *s = block->pool->s;
+
+ return block->next_page == s->nr_pages_per_blk;
+}
+
+static inline sector_t block_to_addr(struct nvm_block *block)
+{
+ struct nvm_stor *s = block->pool->s;
+
+ return block->id * s->nr_pages_per_blk;
+}
+
+static inline struct nvm_pool *paddr_to_pool(struct nvm_stor *s,
+ sector_t p_addr)
+{
+ return &s->pools[p_addr / (s->nr_pages / s->nr_pools)];
+}
+
+static inline struct nvm_ap *block_to_ap(struct nvm_stor *s,
+ struct nvm_block *b)
+{
+ unsigned int ap_idx, div, mod;
+
+ div = b->id / s->nr_blks_per_pool;
+ mod = b->id % s->nr_blks_per_pool;
+ ap_idx = div + (mod / (s->nr_blks_per_pool / s->nr_aps_per_pool));
+
+ return &s->aps[ap_idx];
+}
+
+static inline int physical_to_slot(struct nvm_stor *s, sector_t phys)
+{
+ return phys % s->nr_pages_per_blk;
+}
+
+static inline void *get_per_rq_data(struct nvm_dev *dev, struct request *rq)
+{
+ BUG_ON(!dev);
+ return blk_mq_rq_to_pdu(rq) + dev->drv_cmd_size;
+}
+
+static inline struct nvm_inflight *nvm_laddr_to_inflight(struct nvm_stor *s,
+ sector_t l_addr)
+{
+ return &s->inflight_map[l_addr % NVM_INFLIGHT_PARTITIONS];
+}
+
+static inline int request_equals(struct nvm_inflight_request *r,
+ sector_t laddr_start, sector_t laddr_end)
+{
+ return (r->l_end == laddr_end && r->l_start == laddr_start);
+}
+
+static inline int request_intersects(struct nvm_inflight_request *r,
+ sector_t laddr_start, sector_t laddr_end)
+{
+ return (laddr_end >= r->l_start && laddr_end <= r->l_end) &&
+ (laddr_start >= r->l_start && laddr_start <= r->l_end);
+}
+
+/* TODO: make compatible with multi-block requests */
+static inline void __nvm_lock_laddr_range(struct nvm_stor *s, int spin,
+ sector_t laddr_start, unsigned nsectors)
+{
+ struct nvm_inflight *inflight;
+ struct nvm_inflight_request *r;
+ sector_t laddr_end = laddr_start + nsectors - 1;
+ int tag;
+ unsigned long flags;
+
+ NVM_ASSERT(nsectors >= 1);
+ BUG_ON(laddr_end >= s->nr_pages);
+ /* FIXME: Not yet supported */
+ BUG_ON(nsectors > s->nr_pages_per_blk);
+
+ inflight = nvm_laddr_to_inflight(s, laddr_start);
+ tag = percpu_ida_alloc(&s->free_inflight, __GFP_WAIT);
+
+retry:
+ spin_lock_irqsave(&inflight->lock, flags);
+
+ list_for_each_entry(r, &inflight->reqs, list) {
+ if (request_intersects(r, laddr_start, laddr_end)) {
+ /* existing, overlapping request, come back later */
+ spin_unlock_irqrestore(&inflight->lock, flags);
+ if (!spin)
+ schedule();
+ goto retry;
+ }
+ }
+
+ r = &s->inflight_addrs[tag];
+
+ r->l_start = laddr_start;
+ r->l_end = laddr_end;
+ r->tag = tag;
+
+ list_add_tail(&r->list, &inflight->reqs);
+ spin_unlock_irqrestore(&inflight->lock, flags);
+}
+
+static inline void nvm_lock_laddr_range(struct nvm_stor *s, sector_t laddr_start,
+ unsigned int nsectors)
+{
+ return __nvm_lock_laddr_range(s, 0, laddr_start, nsectors);
+}
+
+static inline void nvm_unlock_laddr_range(struct nvm_stor *s,
+ sector_t laddr_start,
+ unsigned int nsectors)
+{
+ struct nvm_inflight *inflight = nvm_laddr_to_inflight(s, laddr_start);
+ struct nvm_inflight_request *r = NULL;
+ sector_t laddr_end = laddr_start + nsectors - 1;
+ unsigned long flags;
+
+ NVM_ASSERT(nsectors >= 1);
+ NVM_ASSERT(laddr_end >= laddr_start);
+
+ spin_lock_irqsave(&inflight->lock, flags);
+ BUG_ON(list_empty(&inflight->reqs));
+
+ list_for_each_entry(r, &inflight->reqs, list)
+ if (request_equals(r, laddr_start, laddr_end))
+ break;
+
+ BUG_ON(!r || !request_equals(r, laddr_start, laddr_end));
+
+ r->l_start = r->l_end = LTOP_POISON;
+
+ list_del_init(&r->list);
+ spin_unlock_irqrestore(&inflight->lock, flags);
+ percpu_ida_free(&s->free_inflight, r->tag);
+}
+
+static inline void __show_pool(struct nvm_pool *pool)
+{
+ struct list_head *head, *cur;
+ unsigned int free_cnt = 0, used_cnt = 0, prio_cnt = 0;
+
+ NVM_ASSERT(spin_is_locked(&pool->lock));
+
+ list_for_each_safe(head, cur, &pool->free_list)
+ free_cnt++;
+ list_for_each_safe(head, cur, &pool->used_list)
+ used_cnt++;
+ list_for_each_safe(head, cur, &pool->prio_list)
+ prio_cnt++;
+
+ pr_err("lightnvm: P-%d F:%u U:%u P:%u",
+ pool->id, free_cnt, used_cnt, prio_cnt);
+}
+
+static inline void show_pool(struct nvm_pool *pool)
+{
+ spin_lock(&pool->lock);
+ __show_pool(pool);
+ spin_unlock(&pool->lock);
+}
+
+static inline void show_all_pools(struct nvm_stor *s)
+{
+ struct nvm_pool *pool;
+ unsigned int i;
+
+ nvm_for_each_pool(s, pool, i)
+ show_pool(pool);
+}
+
+#endif /* NVM_H_ */
+
diff --git a/drivers/lightnvm/sysfs.c b/drivers/lightnvm/sysfs.c
new file mode 100644
index 0000000..3abe1e4
--- /dev/null
+++ b/drivers/lightnvm/sysfs.c
@@ -0,0 +1,79 @@
+#include <linux/lightnvm.h>
+#include <linux/sysfs.h>
+
+#include "nvm.h"
+
+static ssize_t nvm_attr_free_blocks_show(struct nvm_dev *nvm, char *buf)
+{
+ char *buf_start = buf;
+ struct nvm_stor *stor = nvm->stor;
+ struct nvm_pool *pool;
+ unsigned int i;
+
+ nvm_for_each_pool(stor, pool, i)
+ buf += sprintf(buf, "%8u\t%u\n", i, pool->nr_free_blocks);
+
+ return buf - buf_start;
+}
+
+static ssize_t nvm_attr_show(struct device *dev, char *page,
+ ssize_t (*fn)(struct nvm_dev *, char *))
+{
+ struct gendisk *disk = dev_to_disk(dev);
+ struct nvm_dev *nvm = disk->private_data;
+
+ return fn(nvm, page);
+}
+
+#define NVM_ATTR_RO(_name) \
+static ssize_t nvm_attr_##_name##_show(struct nvm_dev *, char *); \
+static ssize_t nvm_attr_do_show_##_name(struct device *d, \
+ struct device_attribute *attr, char *b) \
+{ \
+ return nvm_attr_show(d, b, nvm_attr_##_name##_show); \
+} \
+static struct device_attribute nvm_attr_##_name = \
+ __ATTR(_name, S_IRUGO, nvm_attr_do_show_##_name, NULL)
+
+NVM_ATTR_RO(free_blocks);
+
+static struct attribute *nvm_attrs[] = {
+ &nvm_attr_free_blocks.attr,
+ NULL,
+};
+
+static struct attribute_group nvm_attribute_group = {
+ .name = "nvm",
+ .attrs = nvm_attrs,
+};
+
+void nvm_remove_sysfs(struct nvm_dev *nvm)
+{
+ struct device *dev;
+
+ if (!nvm || !nvm->disk)
+ return;
+
+ dev = disk_to_dev(nvm->disk);
+ sysfs_remove_group(&dev->kobj, &nvm_attribute_group);
+}
+EXPORT_SYMBOL_GPL(nvm_remove_sysfs);
+
+int nvm_add_sysfs(struct nvm_dev *nvm)
+{
+ int ret;
+ struct device *dev;
+
+ if (!nvm || !nvm->disk)
+ return 0;
+
+ dev = disk_to_dev(nvm->disk);
+ ret = sysfs_create_group(&dev->kobj, &nvm_attribute_group);
+ if (ret)
+ return ret;
+
+ kobject_uevent(&dev->kobj, KOBJ_CHANGE);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(nvm_add_sysfs);
diff --git a/drivers/lightnvm/targets.c b/drivers/lightnvm/targets.c
new file mode 100644
index 0000000..016bf46
--- /dev/null
+++ b/drivers/lightnvm/targets.c
@@ -0,0 +1,246 @@
+#include "nvm.h"
+
+/* Use pool_[get/put]_block to administer the blocks in use for each pool.
+ * Whenever a block is in use by an append point, we store it on the
+ * used_list. We then move it back when it is free to be used by another
+ * append point.
+ *
+ * The newly claimed block is always added to the back of used_list, as we
+ * assume that the start of the used list is the oldest block, and therefore
+ * more likely to contain invalidated pages.
+ */
+struct nvm_block *nvm_pool_get_block(struct nvm_pool *pool, int is_gc)
+{
+ struct nvm_stor *s;
+ struct nvm_block *block = NULL;
+ unsigned long flags;
+
+ BUG_ON(!pool);
+
+ s = pool->s;
+ spin_lock_irqsave(&pool->lock, flags);
+
+ if (list_empty(&pool->free_list)) {
+ pr_err_ratelimited("Pool have no free pages available");
+ __show_pool(pool);
+ spin_unlock_irqrestore(&pool->lock, flags);
+ goto out;
+ }
+
+ if (!is_gc && pool->nr_free_blocks < s->nr_aps) {
+ spin_unlock_irqrestore(&pool->lock, flags);
+ goto out;
+ }
+
+ block = list_first_entry(&pool->free_list, struct nvm_block, list);
+ list_move_tail(&block->list, &pool->used_list);
+
+ pool->nr_free_blocks--;
+
+ spin_unlock_irqrestore(&pool->lock, flags);
+
+ nvm_reset_block(block);
+
+out:
+ return block;
+}
+
+/* We assume that all valid pages have already been moved by the time a block
+ * is added back to the free list. The block is added last to allow round-robin
+ * use of all blocks, thereby providing simple (naive) wear-leveling.
+ */
+void nvm_pool_put_block(struct nvm_block *block)
+{
+ struct nvm_pool *pool = block->pool;
+ unsigned long flags;
+
+ spin_lock_irqsave(&pool->lock, flags);
+
+ list_move_tail(&block->list, &pool->free_list);
+ pool->nr_free_blocks++;
+
+ spin_unlock_irqrestore(&pool->lock, flags);
+}
+
+/* Look up the primary translation table. If there is no block associated with
+ * the address, we assume that it holds no data and do not take a ref. */
+struct nvm_addr *nvm_lookup_ltop(struct nvm_stor *s, sector_t l_addr)
+{
+ struct nvm_addr *gp, *p;
+
+ BUG_ON(l_addr >= s->nr_pages);
+
+ p = mempool_alloc(s->addr_pool, GFP_ATOMIC);
+ if (!p)
+ return NULL;
+
+ gp = &s->trans_map[l_addr];
+
+ p->addr = gp->addr;
+ p->block = gp->block;
+
+ /* if it has not been written, p is initialized to 0. */
+ if (p->block) {
+ /* during gc, the mapping will be updated accordingly. We
+ * therefore stop submitting new reads to the address until it
+ * has been copied to its new place. */
+ if (atomic_read(&p->block->gc_running))
+ goto err;
+ }
+
+ return p;
+err:
+ mempool_free(p, s->addr_pool);
+ return NULL;
+
+}
+
+static inline unsigned int nvm_rq_sectors(const struct request *rq)
+{
+ /*TODO: remove hardcoding, query nvm_dev for setting*/
+ return blk_rq_bytes(rq) >> 9;
+}
+
+static struct nvm_ap *__nvm_get_ap_rr(struct nvm_stor *s, int is_gc)
+{
+ unsigned int i;
+ struct nvm_pool *pool, *max_free;
+
+ if (!is_gc)
+ return get_next_ap(s);
+
+ /* during GC, we don't care about RR; instead we want to make
+ * sure that we maintain evenness between the block pools. */
+ max_free = &s->pools[0];
+ /* prevent the GC'ing pool from devouring pages of a pool with
+ * few free blocks. We don't take the lock as we only need an
+ * estimate. */
+ nvm_for_each_pool(s, pool, i) {
+ if (pool->nr_free_blocks > max_free->nr_free_blocks)
+ max_free = pool;
+ }
+
+ return &s->aps[max_free->id];
+}
+
+/*read/write RQ has locked addr range already*/
+
+static struct nvm_block *nvm_map_block_rr(struct nvm_stor *s, sector_t l_addr,
+ int is_gc)
+{
+ struct nvm_ap *ap = NULL;
+ struct nvm_block *block;
+
+ ap = __nvm_get_ap_rr(s, is_gc);
+
+ spin_lock(&ap->lock);
+ block = s->type->pool_get_blk(ap->pool, is_gc);
+ spin_unlock(&ap->lock);
+ return block; /*NULL iff. no free blocks*/
+}
+
+/* Simple round-robin Logical to physical address translation.
+ *
+ * Retrieve the mapping using the active append point. Then update the ap for
+ * the next write to the disk.
+ *
+ * Returns nvm_addr with the physical address and block. Remember to return it
+ * to s->addr_pool when the request is finished.
+ */
+static struct nvm_addr *nvm_map_page_rr(struct nvm_stor *s, sector_t l_addr,
+ int is_gc)
+{
+ struct nvm_addr *p;
+ struct nvm_ap *ap;
+ struct nvm_pool *pool;
+ struct nvm_block *p_block;
+ sector_t p_addr;
+
+ p = mempool_alloc(s->addr_pool, GFP_ATOMIC);
+ if (!p)
+ return NULL;
+
+ ap = __nvm_get_ap_rr(s, is_gc);
+ pool = ap->pool;
+
+ spin_lock(&ap->lock);
+
+ p_block = ap->cur;
+ p_addr = nvm_alloc_phys_addr(p_block);
+
+ if (p_addr == LTOP_EMPTY) {
+ p_block = s->type->pool_get_blk(pool, 0);
+
+ if (!p_block) {
+ if (is_gc) {
+ p_addr = nvm_alloc_phys_addr(ap->gc_cur);
+ if (p_addr == LTOP_EMPTY) {
+ p_block = s->type->pool_get_blk(pool, 1);
+ if (!p_block) {
+ show_all_pools(ap->parent);
+ pr_err("nvm: no more blocks");
+ goto finished;
+ }
+
+ ap->gc_cur = p_block;
+ ap->gc_cur->ap = ap;
+ p_addr =
+ nvm_alloc_phys_addr(ap->gc_cur);
+ }
+ p_block = ap->gc_cur;
+ }
+ goto finished;
+ }
+
+ nvm_set_ap_cur(ap, p_block);
+ p_addr = nvm_alloc_phys_addr(p_block);
+ }
+
+finished:
+ if (p_addr == LTOP_EMPTY) {
+ mempool_free(p, s->addr_pool);
+ return NULL;
+ }
+
+ p->addr = p_addr;
+ p->block = p_block;
+
+ if (!p_block)
+ WARN_ON(is_gc);
+
+ spin_unlock(&ap->lock);
+ if (p)
+ nvm_update_map(s, l_addr, p, is_gc);
+ return p;
+}
+
+/* none target type, round robin, page-based FTL, and cost-based GC */
+struct nvm_target_type nvm_target_rrpc = {
+ .name = "rrpc",
+ .version = {1, 0, 0},
+ .lookup_ltop = nvm_lookup_ltop,
+ .map_page = nvm_map_page_rr,
+ .map_block = nvm_map_block_rr,
+
+ .write_rq = nvm_write_rq,
+ .read_rq = nvm_read_rq,
+
+ .pool_get_blk = nvm_pool_get_block,
+ .pool_put_blk = nvm_pool_put_block,
+};
+
+/* none target type, round robin, block-based FTL, and cost-based GC */
+struct nvm_target_type nvm_target_rrbc = {
+ .name = "rrbc",
+ .version = {1, 0, 0},
+ .lookup_ltop = nvm_lookup_ltop,
+ .map_page = NULL,
+ .map_block = nvm_map_block_rr,
+
+ /*rewrite these to support multi-page writes*/
+ .write_rq = nvm_write_rq,
+ .read_rq = nvm_read_rq,
+
+ .pool_get_blk = nvm_pool_get_block,
+ .pool_put_blk = nvm_pool_put_block,
+};
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
new file mode 100644
index 0000000..d6d0a2c
--- /dev/null
+++ b/include/linux/lightnvm.h
@@ -0,0 +1,130 @@
+#ifndef LIGHTNVM_H
+#define LIGHTNVM_H
+
+#include <uapi/linux/lightnvm.h>
+#include <linux/types.h>
+#include <linux/blk-mq.h>
+#include <linux/genhd.h>
+
+/* HW Responsibilities */
+enum {
+ NVM_RSP_L2P = 0x00,
+ NVM_RSP_P2L = 0x01,
+ NVM_RSP_GC = 0x02,
+ NVM_RSP_ECC = 0x03,
+};
+
+/* Physical NVM Type */
+enum {
+ NVM_NVMT_BLK = 0,
+ NVM_NVMT_BYTE = 1,
+};
+
+/* Internal IO Scheduling algorithm */
+enum {
+ NVM_IOSCHED_CHANNEL = 0,
+ NVM_IOSCHED_CHIP = 1,
+};
+
+/* Status codes */
+enum {
+ NVM_SUCCESS = 0x0000,
+ NVM_INVALID_OPCODE = 0x0001,
+ NVM_INVALID_FIELD = 0x0002,
+ NVM_INTERNAL_DEV_ERROR = 0x0006,
+ NVM_INVALID_CHNLID = 0x000b,
+ NVM_LBA_RANGE = 0x0080,
+ NVM_MAX_QSIZE_EXCEEDED = 0x0102,
+ NVM_RESERVED = 0x0104,
+ NVM_CONFLICTING_ATTRS = 0x0180,
+ NVM_RID_NOT_SAVEABLE = 0x010d,
+ NVM_RID_NOT_CHANGEABLE = 0x010e,
+ NVM_ACCESS_DENIED = 0x0286,
+ NVM_MORE = 0x2000,
+ NVM_DNR = 0x4000,
+ NVM_NO_COMPLETE = 0xffff,
+};
+
+struct nvm_id {
+ u16 ver_id;
+ u8 nvm_type;
+ u16 nchannels;
+ u8 reserved[11];
+};
+
+struct nvm_id_chnl {
+ u64 queue_size;
+ u64 gran_read;
+ u64 gran_write;
+ u64 gran_erase;
+ u64 oob_size;
+ u32 t_r;
+ u32 t_sqr;
+ u32 t_w;
+ u32 t_sqw;
+ u32 t_e;
+ u8 io_sched;
+ u64 laddr_begin;
+ u64 laddr_end;
+ u8 reserved[4034];
+};
+
+struct nvm_get_features {
+ u64 rsp[4];
+ u64 ext[4];
+};
+
+struct nvm_dev;
+
+typedef int (nvm_id_fn)(struct nvm_dev *dev, struct nvm_id *);
+typedef int (nvm_id_chnl_fn)(struct nvm_dev *dev, int chnl_num, struct nvm_id_chnl *);
+typedef int (nvm_get_features_fn)(struct nvm_dev *dev, struct nvm_get_features *);
+typedef int (nvm_set_rsp_fn)(struct nvm_dev *dev, u8 rsp, u8 val);
+typedef int (nvm_queue_rq_fn)(struct nvm_dev *, struct request *);
+typedef int (nvm_erase_blk_fn)(struct nvm_dev *, sector_t);
+
+struct lightnvm_dev_ops {
+ nvm_id_fn *identify;
+ nvm_id_chnl_fn *identify_channel;
+ nvm_get_features_fn *get_features;
+ nvm_set_rsp_fn *set_responsibility;
+
+ /* Requests */
+ nvm_queue_rq_fn *nvm_queue_rq;
+
+ /* LightNVM commands */
+ nvm_erase_blk_fn *nvm_erase_block;
+};
+
+struct nvm_dev {
+ struct lightnvm_dev_ops *ops;
+
+ struct request_queue *q;
+ struct gendisk *disk;
+
+ unsigned int drv_cmd_size;
+
+ void *driver_data;
+ void *stor;
+};
+
+/* LightNVM configuration */
+unsigned int nvm_cmd_size(void);
+
+int nvm_init(struct gendisk *disk, struct nvm_dev *);
+void nvm_exit(struct nvm_dev *);
+struct nvm_dev *nvm_alloc(void);
+void nvm_free(struct nvm_dev *);
+
+int nvm_add_sysfs(struct nvm_dev *);
+void nvm_remove_sysfs(struct nvm_dev *);
+
+/* LightNVM blk-mq request management */
+int nvm_queue_rq(struct nvm_dev *, struct request *);
+void nvm_end_io(struct nvm_dev *, struct request *, int);
+void nvm_complete_request(struct nvm_dev *, struct request *);
+
+int nvm_ioctl(struct nvm_dev *dev, fmode_t mode, unsigned int cmd, unsigned long arg);
+int nvm_compat_ioctl(struct nvm_dev *dev, fmode_t mode, unsigned int cmd, unsigned long arg);
+
+#endif
diff --git a/include/uapi/linux/lightnvm.h b/include/uapi/linux/lightnvm.h
new file mode 100644
index 0000000..3e9051b
--- /dev/null
+++ b/include/uapi/linux/lightnvm.h
@@ -0,0 +1,45 @@
+/*
+ * Definitions for the LightNVM host interface
+ * Copyright (c) 2014, IT University of Copenhagen, Matias Bjorling.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _UAPI_LINUX_LIGHTNVM_H
+#define _UAPI_LINUX_LIGHTNVM_H
+
+#include <linux/types.h>
+
+enum {
+ LIGHTNVM_KV_GET = 0x00,
+ LIGHTNVM_KV_PUT = 0x01,
+ LIGHTNVM_KV_UPDATE = 0x02,
+ LIGHTNVM_KV_DEL = 0x03,
+};
+
+enum {
+ LIGHTNVM_KV_ERR_NONE = 0,
+ LIGHTNVM_KV_ERR_NOKEY = 1,
+};
+
+struct lightnvm_cmd_kv {
+ __u8 opcode;
+ __u8 errcode;
+ __u8 res[6];
+ __u32 key_len;
+ __u32 val_len;
+ __u64 key_addr;
+ __u64 val_addr;
+};
+
+#define LIGHTNVM_IOCTL_ID _IO('O', 0x40)
+#define LIGHTNVM_IOCTL_KV _IOWR('O', 0x50, struct lightnvm_cmd_kv)
+
+#endif /* _UAPI_LINUX_LIGHTNVM_H */
--
1.9.1

2014-10-08 15:57:16

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH 4/5] lightnvm: NVMe integration

NVMe devices are identified by the vendor-specific bits:

Bit 3 in OACS (device-wide). The check is currently made per device, as the
nvme namespace is not available in the completion path.
Bit 1 in DSM (per-namespace).

The OACS change can be removed when the namespace is resolvable from the
completion path.
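
In code, the detection added by this patch boils down to checks of the
following form (a sketch only; NVME_CTRL_OACS_LIGHTNVM, NVME_NS_FEAT_LIGHTNVM
and the force_lightnvm module parameter are all introduced further down in the
diff):

	/* device-wide: OACS bit 3, checked while setting up the tagset */
	if (dev->oacs & NVME_CTRL_OACS_LIGHTNVM || force_lightnvm)
		dev->tagset.cmd_size += nvm_cmd_size();

	/* per-namespace: nsfeat bit, checked when a namespace is allocated */
	if (id->nsfeat & NVME_NS_FEAT_LIGHTNVM || force_lightnvm)
		nvm_dev = nvm_alloc();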

From there, the NVMe specification is extended with the following
commands:

LightNVM Identify
LightNVM Channel identify
LightNVM Synchronous/Asynchronous erase
LightNVM Get Features
LightNVM Set Responsibility
LightNVM Get Logical to Physical map
LightNVM Get Physical to Logical map

The NVMe integration can be tested using Keith Busch's NVMe qemu simulator
with LightNVM patches on top. This can be found at:
https://github.com/LightNVM/qemu-nvme

Contributions in this patch from:

Jesper Madsen <[email protected]>

Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/block/nvme-core.c | 256 ++++++++++++++++++++++++++++++++++++++++++++--
include/linux/nvme.h | 4 +
include/uapi/linux/nvme.h | 57 +++++++++++
3 files changed, 311 insertions(+), 6 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 337878b..22319a5 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -38,6 +38,7 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/types.h>
+#include <linux/lightnvm.h>
#include <scsi/sg.h>
#include <asm-generic/io-64-nonatomic-lo-hi.h>

@@ -60,6 +61,10 @@ static unsigned char retry_time = 30;
module_param(retry_time, byte, 0644);
MODULE_PARM_DESC(retry_time, "time in seconds to retry failed I/O");

+static unsigned char force_lightnvm;
+module_param(force_lightnvm, byte, 0644);
+MODULE_PARM_DESC(force_lightnvm, "force initialization of lightnvm");
+
static int nvme_major;
module_param(nvme_major, int, 0);

@@ -139,8 +144,19 @@ struct nvme_cmd_info {
void *ctx;
int aborted;
struct nvme_queue *nvmeq;
+ struct nvme_ns *ns;
};

+static void host_lba_set(struct nvme_command *cmd, u32 val)
+{
+ __le32 *cdw12 = &cmd->common.cdw10[2];
+ __le32 *cdw13 = &cmd->common.cdw10[3];
+
+ val = cpu_to_le32(val);
+ *cdw12 = ((*cdw12) & 0xff00ffff) | ((val & 0xff) << 16);
+ *cdw13 = ((*cdw13) & 0xff) | (val & 0xffffff00);
+}
+
static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
unsigned int hctx_idx)
{
@@ -405,7 +421,10 @@ static void req_completion(struct nvme_queue *nvmeq, void *ctx,
rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
nvme_free_iod(nvmeq->dev, iod);

- blk_mq_complete_request(req);
+ if (nvmeq->dev->oacs & NVME_CTRL_OACS_LIGHTNVM || force_lightnvm)
+ nvm_complete_request(cmd_rq->ns->nvm_dev, req);
+ else
+ blk_mq_complete_request(req);
}

/* length is in bytes. gfp flags indicates whether we may sleep. */
@@ -576,6 +595,10 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
enum dma_data_direction dma_dir;
int psegs = req->nr_phys_segments;
int result = BLK_MQ_RQ_QUEUE_BUSY;
+
+ if (ns->nvm_dev)
+ nvm_queue_rq(ns->nvm_dev, req);
+
/*
* Requeued IO has already been prepped
*/
@@ -591,6 +614,7 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
req->special = iod;

nvme_set_info(cmd, iod, req_completion);
+ cmd->ns = ns;

if (req->cmd_flags & REQ_DISCARD) {
void *range;
@@ -895,11 +919,61 @@ static int adapter_delete_sq(struct nvme_dev *dev, u16 sqid)
return adapter_delete_queue(dev, nvme_admin_delete_sq, sqid);
}

+int lnvm_identify(struct nvme_dev *dev, dma_addr_t dma_addr)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+ c.common.opcode = lnvm_admin_identify;
+ c.common.nsid = cpu_to_le32(0);
+ c.common.prp1 = cpu_to_le64(dma_addr);
+
+ return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int lnvm_identify_channel(struct nvme_dev *dev, unsigned nsid,
+ dma_addr_t dma_addr)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+ c.common.opcode = lnvm_admin_identify_channel;
+ c.common.nsid = cpu_to_le32(nsid);
+ c.common.cdw10[0] = cpu_to_le32(nsid);
+ c.common.prp1 = cpu_to_le64(dma_addr);
+
+ return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int lnvm_get_features(struct nvme_dev *dev, unsigned nsid, dma_addr_t dma_addr)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+ c.common.opcode = lnvm_admin_get_features;
+ c.common.nsid = cpu_to_le32(nsid);
+ c.common.prp1 = cpu_to_le64(dma_addr);
+
+ return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
+int lnvm_set_responsibility(struct nvme_dev *dev, unsigned nsid,
+ dma_addr_t dma_addr)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+ c.common.opcode = lnvm_admin_set_responsibility;
+ c.common.nsid = cpu_to_le32(nsid);
+ c.common.prp1 = cpu_to_le64(dma_addr);
+
+ return nvme_submit_admin_cmd(dev, &c, NULL);
+}
+
int nvme_identify(struct nvme_dev *dev, unsigned nsid, unsigned cns,
dma_addr_t dma_addr)
{
struct nvme_command c;
-
memset(&c, 0, sizeof(c));
c.identify.opcode = nvme_admin_identify;
c.identify.nsid = cpu_to_le32(nsid);
@@ -1282,6 +1356,90 @@ static int nvme_shutdown_ctrl(struct nvme_dev *dev)
return 0;
}

+static int nvme_nvm_id(struct nvm_dev *nvm_dev, struct nvm_id *nvm_id)
+{
+ struct nvme_ns *ns = nvm_dev->driver_data;
+ struct nvme_dev *dev = ns->dev;
+ struct pci_dev *pdev = dev->pci_dev;
+ struct nvme_lnvm_id_ctrl *ctrl;
+ void *mem;
+ dma_addr_t dma_addr;
+ unsigned int ret = 0;
+
+ mem = dma_alloc_coherent(&pdev->dev, 4096, &dma_addr, GFP_KERNEL);
+ if (!mem)
+ return -ENOMEM;
+
+ ret = lnvm_identify(dev, dma_addr);
+ if (ret) {
+ ret = -EIO;
+ goto out;
+ }
+
+ ctrl = mem;
+ nvm_id->ver_id = le16_to_cpu(ctrl->ver_id);
+ nvm_id->nvm_type = ctrl->nvm_type;
+ nvm_id->nchannels = le16_to_cpu(ctrl->nchannels);
+ out:
+ dma_free_coherent(&pdev->dev, 4096, mem, dma_addr);
+ return ret;
+}
+
+
+static int nvme_nvm_id_chnl(struct nvm_dev *nvm_dev, int chnl_id,
+ struct nvm_id_chnl *ic)
+{
+ struct nvme_ns *ns = nvm_dev->driver_data;
+ struct nvme_dev *dev = ns->dev;
+ struct pci_dev *pdev = dev->pci_dev;
+ struct nvme_lnvm_id_chnl *chnl;
+ void *mem;
+ dma_addr_t dma_addr;
+ unsigned int ret = 0;
+
+ mem = dma_alloc_coherent(&pdev->dev, 4096, &dma_addr, GFP_KERNEL);
+ if (!mem)
+ return -ENOMEM;
+
+ ret = lnvm_identify_channel(dev, chnl_id, dma_addr);
+ if (ret) {
+ ret = -EIO;
+ goto out;
+ }
+
+ chnl = mem;
+ ic->queue_size = le64_to_cpu(chnl->queue_size);
+ ic->gran_read = le64_to_cpu(chnl->gran_read);
+ ic->gran_write = le64_to_cpu(chnl->gran_write);
+ ic->gran_erase = le64_to_cpu(chnl->gran_erase);
+ ic->oob_size = le64_to_cpu(chnl->oob_size);
+ ic->t_r = le32_to_cpu(chnl->t_r);
+ ic->t_sqr = le32_to_cpu(chnl->t_sqr);
+ ic->t_w = le32_to_cpu(chnl->t_w);
+ ic->t_sqw = le32_to_cpu(chnl->t_sqw);
+ ic->t_e = le32_to_cpu(chnl->t_e);
+ ic->io_sched = chnl->io_sched;
+ ic->laddr_begin = le64_to_cpu(chnl->laddr_begin);
+ ic->laddr_end = le64_to_cpu(chnl->laddr_end);
+ out:
+ dma_free_coherent(&pdev->dev, 4096, mem, dma_addr);
+ return ret;
+}
+
+static int nvme_nvm_get_features(struct nvm_dev *dev,
+ struct nvm_get_features *gf)
+{
+ gf->rsp[0] = (1 << NVM_RSP_L2P);
+ gf->rsp[0] |= (1 << NVM_RSP_P2L);
+ gf->rsp[0] |= (1 << NVM_RSP_GC);
+ return 0;
+}
+
+static int nvme_nvm_set_rsp(struct nvm_dev *dev, u8 rsp, u8 val)
+{
+ return NVM_RID_NOT_CHANGEABLE | NVM_DNR;
+}
+
static struct blk_mq_ops nvme_mq_admin_ops = {
.queue_rq = nvme_admin_queue_rq,
.map_queue = blk_mq_map_queue,
@@ -1290,6 +1448,13 @@ static struct blk_mq_ops nvme_mq_admin_ops = {
.timeout = nvme_timeout,
};

+static struct lightnvm_dev_ops nvme_nvm_dev_ops = {
+ .identify = nvme_nvm_id,
+ .identify_channel = nvme_nvm_id_chnl,
+ .get_features = nvme_nvm_get_features,
+ .set_responsibility = nvme_nvm_set_rsp,
+};
+
static struct blk_mq_ops nvme_mq_ops = {
.queue_rq = nvme_queue_rq,
.map_queue = blk_mq_map_queue,
@@ -1455,6 +1620,26 @@ void nvme_unmap_user_pages(struct nvme_dev *dev, int write,
put_page(sg_page(&iod->sg[i]));
}

+static int lnvme_submit_io(struct nvme_ns *ns, struct nvme_user_io *io)
+{
+ struct nvme_command c;
+ struct nvme_dev *dev = ns->dev;
+
+ memset(&c, 0, sizeof(c));
+ c.rw.opcode = io->opcode;
+ c.rw.flags = io->flags;
+ c.rw.nsid = cpu_to_le32(ns->ns_id);
+ c.rw.slba = cpu_to_le64(io->slba);
+ c.rw.length = cpu_to_le16(io->nblocks);
+ c.rw.control = cpu_to_le16(io->control);
+ c.rw.dsmgmt = cpu_to_le32(io->dsmgmt);
+ c.rw.reftag = cpu_to_le32(io->reftag);
+ c.rw.apptag = cpu_to_le16(io->apptag);
+ c.rw.appmask = cpu_to_le16(io->appmask);
+
+ return nvme_submit_io_cmd(dev, ns, &c, NULL);
+}
+
static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
{
struct nvme_dev *dev = ns->dev;
@@ -1481,6 +1666,8 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
iod = nvme_map_user_pages(dev, io.opcode & 1, io.addr, length);
break;
default:
+ if (ns->nvm_dev)
+ return lnvme_submit_io(ns, &io);
return -EINVAL;
}

@@ -1498,6 +1685,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
c.rw.reftag = cpu_to_le32(io.reftag);
c.rw.apptag = cpu_to_le16(io.apptag);
c.rw.appmask = cpu_to_le16(io.appmask);
+ host_lba_set(&c, io.host_lba);

if (meta_len) {
meta_iod = nvme_map_user_pages(dev, io.opcode & 1, io.metadata,
@@ -1632,6 +1820,13 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
unsigned long arg)
{
struct nvme_ns *ns = bdev->bd_disk->private_data;
+ int ret;
+
+ if (ns->nvm_dev) {
+ ret = nvm_ioctl(ns->nvm_dev, mode, cmd, arg);
+ if (ret != -ENOTTY)
+ return ret;
+ }

switch (cmd) {
case NVME_IOCTL_ID:
@@ -1655,6 +1850,13 @@ static int nvme_compat_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg)
{
struct nvme_ns *ns = bdev->bd_disk->private_data;
+ int ret;
+
+ if (ns->nvm_dev) {
+ ret = nvm_ioctl(ns->nvm_dev, mode, cmd, arg);
+ if (ret != -ENOTTY)
+ return ret;
+ }

switch (cmd) {
case SG_IO:
@@ -1756,6 +1958,7 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
struct nvme_id_ns *id, struct nvme_lba_range_type *rt)
{
struct nvme_ns *ns;
+ struct nvm_dev *nvm_dev = NULL;
struct gendisk *disk;
int node = dev_to_node(&dev->pci_dev->dev);
int lbaf;
@@ -1766,15 +1969,27 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
if (!ns)
return NULL;
+
+ if (id->nsfeat & NVME_NS_FEAT_LIGHTNVM || force_lightnvm) {
+ nvm_dev = nvm_alloc();
+ if (!nvm_dev)
+ goto out_free_ns;
+
+ nvm_dev->ops = &nvme_nvm_dev_ops;
+
+ nvm_dev->driver_data = ns;
+ nvm_dev->drv_cmd_size = dev->tagset.cmd_size - nvm_cmd_size();
+ }
+
ns->queue = blk_mq_init_queue(&dev->tagset);
if (!ns->queue)
- goto out_free_ns;
- queue_flag_set_unlocked(QUEUE_FLAG_DEFAULT, ns->queue);
+ goto out_free_nvm;
queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
queue_flag_set_unlocked(QUEUE_FLAG_SG_GAPS, ns->queue);
queue_flag_clear_unlocked(QUEUE_FLAG_IO_STAT, ns->queue);
ns->dev = dev;
+
ns->queue->queuedata = ns;

disk = alloc_disk_node(0, node);
@@ -1807,8 +2022,24 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
if (dev->oncs & NVME_CTRL_ONCS_DSM)
nvme_config_discard(ns);

+ if (id->nsfeat & NVME_NS_FEAT_LIGHTNVM || force_lightnvm) {
+ /* Limit to 4K until LightNVM supports multiple IOs */
+ blk_queue_max_hw_sectors(ns->queue, 8);
+
+ nvm_dev->q = ns->queue;
+ nvm_dev->disk = disk;
+
+ if (nvm_init(disk, nvm_dev))
+ goto out_put_disk;
+
+ ns->nvm_dev = nvm_dev;
+ }
+
return ns;
-
+ out_put_disk:
+ put_disk(disk);
+ out_free_nvm:
+ nvm_free(nvm_dev);
out_free_queue:
blk_cleanup_queue(ns->queue);
out_free_ns:
@@ -1954,6 +2185,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
ctrl = mem;
nn = le32_to_cpup(&ctrl->nn);
dev->oncs = le16_to_cpup(&ctrl->oncs);
+ dev->oacs = le16_to_cpup(&ctrl->oacs);
dev->abort_limit = ctrl->acl + 1;
dev->vwc = ctrl->vwc;
memcpy(dev->serial, ctrl->sn, sizeof(ctrl->sn));
@@ -1983,6 +2215,15 @@ static int nvme_dev_add(struct nvme_dev *dev)
dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
dev->tagset.driver_data = dev;

+ /* LightNVM is actually per ns, but as the tagset is defined with a set
+ * of operations for the whole device, it is currently either all or
+ * no lightnvm-compatible name-spaces for a given device. This should
+ * either be moved toward the nvme_queue_rq function, or allow a per-ns
+ * queue_rq function to be specified.
+ */
+ if (dev->oacs & NVME_CTRL_OACS_LIGHTNVM || force_lightnvm)
+ dev->tagset.cmd_size += nvm_cmd_size();
+
if (blk_mq_alloc_tag_set(&dev->tagset))
goto out;

@@ -2004,8 +2245,11 @@ static int nvme_dev_add(struct nvme_dev *dev)
if (ns)
list_add_tail(&ns->list, &dev->namespaces);
}
- list_for_each_entry(ns, &dev->namespaces, list)
+ list_for_each_entry(ns, &dev->namespaces, list) {
add_disk(ns->disk);
+ if (ns->nvm_dev)
+ nvm_add_sysfs(ns->nvm_dev);
+ }
res = 0;

out:
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 299e6f5..242ad31 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -20,6 +20,7 @@
#include <linux/miscdevice.h>
#include <linux/kref.h>
#include <linux/blk-mq.h>
+#include <linux/lightnvm.h>

struct nvme_bar {
__u64 cap; /* Controller Capabilities */
@@ -100,6 +101,7 @@ struct nvme_dev {
u32 max_hw_sectors;
u32 stripe_size;
u16 oncs;
+ u16 oacs;
u16 abort_limit;
u8 vwc;
u8 initialized;
@@ -120,6 +122,8 @@ struct nvme_ns {
int ms;
u64 mode_select_num_blocks;
u32 mode_select_block_len;
+
+ struct nvm_dev *nvm_dev;
};

/*
diff --git a/include/uapi/linux/nvme.h b/include/uapi/linux/nvme.h
index 29a7d86..965da53 100644
--- a/include/uapi/linux/nvme.h
+++ b/include/uapi/linux/nvme.h
@@ -85,6 +85,30 @@ struct nvme_id_ctrl {
__u8 vs[1024];
};

+struct nvme_lnvm_id_ctrl {
+ __le16 ver_id;
+ __u8 nvm_type;
+ __le16 nchannels;
+ __u8 unused[4091];
+} __attribute__((packed));
+
+struct nvme_lnvm_id_chnl {
+ __le64 queue_size;
+ __le64 gran_read;
+ __le64 gran_write;
+ __le64 gran_erase;
+ __le64 oob_size;
+ __le32 t_r;
+ __le32 t_sqr;
+ __le32 t_w;
+ __le32 t_sqw;
+ __le32 t_e;
+ __u8 io_sched;
+ __le64 laddr_begin;
+ __le64 laddr_end;
+ __u8 unused[4034];
+} __attribute__((packed));
+
enum {
NVME_CTRL_ONCS_COMPARE = 1 << 0,
NVME_CTRL_ONCS_WRITE_UNCORRECTABLE = 1 << 1,
@@ -123,7 +147,12 @@ struct nvme_id_ns {
};

enum {
+ NVME_CTRL_OACS_LIGHTNVM = 1 << 3,
+};
+
+enum {
NVME_NS_FEAT_THIN = 1 << 0,
+ NVME_NS_FEAT_LIGHTNVM = 1 << 1,
NVME_LBAF_RP_BEST = 0,
NVME_LBAF_RP_BETTER = 1,
NVME_LBAF_RP_GOOD = 2,
@@ -192,6 +221,11 @@ enum nvme_opcode {
nvme_cmd_dsm = 0x09,
};

+enum lnvme_opcode {
+ lnvme_cmd_erase_sync = 0x80,
+ lnvme_cmd_erase_async = 0x81,
+};
+
struct nvme_common_command {
__u8 opcode;
__u8 flags;
@@ -287,6 +321,15 @@ enum nvme_admin_opcode {
nvme_admin_security_recv = 0x82,
};

+enum lnvm_admin_opcode {
+ lnvm_admin_identify = 0xc0,
+ lnvm_admin_identify_channel = 0xc1,
+ lnvm_admin_get_features = 0xc2,
+ lnvm_admin_set_responsibility = 0xc3,
+ lnvm_admin_get_l2p_tbl = 0xc4,
+ lnvm_admin_get_p2l_tbl = 0xc5,
+};
+
enum {
NVME_QUEUE_PHYS_CONTIG = (1 << 0),
NVME_CQ_IRQ_ENABLED = (1 << 1),
@@ -410,6 +453,18 @@ struct nvme_format_cmd {
__u32 rsvd11[5];
};

+struct nvme_lnvm_identify {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd[2];
+ __le64 prp1;
+ __le64 prp2;
+ __le32 cns;
+ __u32 rsvd11[5];
+};
+
struct nvme_command {
union {
struct nvme_common_command common;
@@ -423,6 +478,7 @@ struct nvme_command {
struct nvme_format_cmd format;
struct nvme_dsm_cmd dsm;
struct nvme_abort_cmd abort;
+ struct nvme_lnvm_identify lnvm_identify;
};
};

@@ -487,6 +543,7 @@ struct nvme_user_io {
__u32 reftag;
__u16 apptag;
__u16 appmask;
+ __u32 host_lba;
};

struct nvme_admin_cmd {
--
1.9.1

2014-10-08 15:57:54

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH 5/5] lightnvm: null_blk integration

Allows the null_blk driver to hook into LightNVM for performance
evaluation. The number of channels exposed to LightNVM can be configured
through the lightnvm_num_channels module parameter.
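
As a usage sketch (assuming these patches are applied), the LightNVM mode can
be exercised with "modprobe null_blk queue_mode=4 lightnvm_num_channels=2",
where queue_mode=4 selects the LightNVM multi-queue mode added below and the
channel count is what null_nvm_id() reports back to the LightNVM core.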

Contributions in this patch from:

Jesper Madsen <[email protected]>

Signed-off-by: Matias Bjørling <[email protected]>
---
Documentation/block/null_blk.txt | 9 +++
drivers/block/null_blk.c | 149 +++++++++++++++++++++++++++++++++++----
2 files changed, 144 insertions(+), 14 deletions(-)

diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index b2830b4..639d378 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -14,6 +14,9 @@ The following instances are possible:
Multi-queue block-layer
- Request-based.
- Configurable submission queues per device.
+ LightNVM compatible
+ - Request-based.
+ - Same configuration as the multi-queue block layer.
No block-layer (Known as bio-based)
- Bio-based. IO requests are submitted directly to the device driver.
- Directly accepts bio data structure and returns them.
@@ -28,6 +31,7 @@ queue_mode=[0-2]: Default: 2-Multi-queue
0: Bio-based.
1: Single-queue.
2: Multi-queue.
+ 4: LightNVM device with multi-queue.

home_node=[0--nr_nodes]: Default: NUMA_NO_NODE
Selects what CPU node the data structures are allocated from.
@@ -70,3 +74,8 @@ use_per_node_hctx=[0/1]: Default: 0
parameter.
1: The multi-queue block layer is instantiated with a hardware dispatch
queue for each CPU node in the system.
+
+IV: LightNVM specific parameters
+
+lightnvm_num_channels=[x]: Default: 1
+ Number of LightNVM channels that are exposed to the LightNVM driver.
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 00d469c..2f679b1 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -7,6 +7,7 @@
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/blk-mq.h>
+#include <linux/lightnvm.h>
#include <linux/hrtimer.h>

struct nullb_cmd {
@@ -25,6 +26,7 @@ struct nullb_queue {
unsigned int queue_depth;

struct nullb_cmd *cmds;
+ struct nullb *nb;
};

struct nullb {
@@ -33,6 +35,7 @@ struct nullb {
struct request_queue *q;
struct gendisk *disk;
struct blk_mq_tag_set tag_set;
+ struct nvm_dev *nvm_dev;
struct hrtimer timer;
unsigned int queue_depth;
spinlock_t lock;
@@ -67,6 +70,7 @@ enum {
NULL_Q_BIO = 0,
NULL_Q_RQ = 1,
NULL_Q_MQ = 2,
+ NULL_Q_LIGHTNVM = 4,
};

static int submit_queues;
@@ -79,7 +83,7 @@ MODULE_PARM_DESC(home_node, "Home node for the device");

static int queue_mode = NULL_Q_MQ;
module_param(queue_mode, int, S_IRUGO);
-MODULE_PARM_DESC(queue_mode, "Block interface to use (0=bio,1=rq,2=multiqueue)");
+MODULE_PARM_DESC(queue_mode, "Block interface to use (0=bio,1=rq,2=multiqueue,4=lightnvm)");

static int gb = 250;
module_param(gb, int, S_IRUGO);
@@ -109,6 +113,10 @@ static bool use_per_node_hctx = false;
module_param(use_per_node_hctx, bool, S_IRUGO);
MODULE_PARM_DESC(use_per_node_hctx, "Use per-node allocation for hardware context queues. Default: false");

+static int lightnvm_num_channels = 1;
+module_param(lightnvm_num_channels, int, S_IRUGO);
+MODULE_PARM_DESC(lightnvm_num_channels, "Number of channels to be exposed to LightNVM. Default: 1");
+
static void put_tag(struct nullb_queue *nq, unsigned int tag)
{
clear_bit_unlock(tag, nq->tag_map);
@@ -179,6 +187,9 @@ static void end_cmd(struct nullb_cmd *cmd)
case NULL_Q_MQ:
blk_mq_end_io(cmd->rq, 0);
return;
+ case NULL_Q_LIGHTNVM:
+ nvm_end_io(cmd->nq->nb->nvm_dev, cmd->rq, 0);
+ return;
case NULL_Q_RQ:
INIT_LIST_HEAD(&cmd->rq->queuelist);
blk_end_request_all(cmd->rq, 0);
@@ -227,7 +238,7 @@ static void null_cmd_end_timer(struct nullb_cmd *cmd)

static void null_softirq_done_fn(struct request *rq)
{
- if (queue_mode == NULL_Q_MQ)
+ if (queue_mode & (NULL_Q_MQ|NULL_Q_LIGHTNVM))
end_cmd(blk_mq_rq_to_pdu(rq));
else
end_cmd(rq->special);
@@ -239,6 +250,7 @@ static inline void null_handle_cmd(struct nullb_cmd *cmd)
switch (irqmode) {
case NULL_IRQ_SOFTIRQ:
switch (queue_mode) {
+ case NULL_Q_LIGHTNVM:
case NULL_Q_MQ:
blk_mq_complete_request(cmd->rq);
break;
@@ -313,14 +325,67 @@ static void null_request_fn(struct request_queue *q)
}
}

+static int null_nvm_id(struct nvm_dev *dev, struct nvm_id *nvm_id)
+{
+ nvm_id->ver_id = 0x1;
+ nvm_id->nvm_type = NVM_NVMT_BLK;
+ nvm_id->nchannels = lightnvm_num_channels;
+ return 0;
+}
+
+static int null_nvm_id_chnl(struct nvm_dev *dev, int chnl_num,
+ struct nvm_id_chnl *ic)
+{
+ sector_t size = gb * 1024 * 1024 * 1024ULL;
+
+ sector_div(size, bs);
+ ic->queue_size = hw_queue_depth;
+ ic->gran_read = bs;
+ ic->gran_write = bs;
+ ic->gran_erase = bs * 256;
+ ic->oob_size = 0;
+ ic->t_r = ic->t_sqr = 25000; /* 25us */
+ ic->t_w = ic->t_sqw = 500000; /* 500us */
+ ic->t_e = 1500000; /* 1500us */
+ ic->io_sched = NVM_IOSCHED_CHANNEL;
+ ic->laddr_begin = 0;
+ ic->laddr_end = size / 8;
+
+ return 0;
+}
+
+static int null_nvm_get_features(struct nvm_dev *dev,
+ struct nvm_get_features *gf)
+{
+ gf->rsp[0] = (1 << NVM_RSP_L2P);
+ gf->rsp[0] |= (1 << NVM_RSP_P2L);
+ gf->rsp[0] |= (1 << NVM_RSP_GC);
+ return 0;
+}
+
+static int null_nvm_set_rsp(struct nvm_dev *dev, u8 rsp, u8 val)
+{
+ return NVM_RID_NOT_CHANGEABLE | NVM_DNR;
+}
+
static int null_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq);
+ struct nullb_queue *nq = hctx->driver_data;
+ struct nvm_dev *nvm_dev = nq->nb->nvm_dev;
+ int ret = BLK_MQ_RQ_QUEUE_OK;
+
+ if (nvm_dev) {
+ ret = nvm_queue_rq(nvm_dev, rq);
+ if (ret)
+ goto out;
+ }

cmd->rq = rq;
- cmd->nq = hctx->driver_data;
+ cmd->nq = nq;

null_handle_cmd(cmd);
+out:
return BLK_MQ_RQ_QUEUE_OK;
}

@@ -331,6 +396,7 @@ static void null_init_queue(struct nullb *nullb, struct nullb_queue *nq)

init_waitqueue_head(&nq->wait);
nq->queue_depth = nullb->queue_depth;
+ nq->nb = nullb;
}

static int null_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
@@ -346,6 +412,13 @@ static int null_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
return 0;
}

+static struct lightnvm_dev_ops null_nvm_dev_ops = {
+ .identify = null_nvm_id,
+ .identify_channel = null_nvm_id_chnl,
+ .get_features = null_nvm_get_features,
+ .set_responsibility = null_nvm_set_rsp,
+};
+
static struct blk_mq_ops null_mq_ops = {
.queue_rq = null_queue_rq,
.map_queue = blk_mq_map_queue,
@@ -359,8 +432,11 @@ static void null_del_dev(struct nullb *nullb)

del_gendisk(nullb->disk);
blk_cleanup_queue(nullb->q);
- if (queue_mode == NULL_Q_MQ)
+ if (queue_mode & (NULL_Q_MQ|NULL_Q_LIGHTNVM)) {
+ if (queue_mode == NULL_Q_LIGHTNVM)
+ nvm_remove_sysfs(nullb->nvm_dev);
blk_mq_free_tag_set(&nullb->tag_set);
+ }
put_disk(nullb->disk);
kfree(nullb);
}
@@ -374,10 +450,26 @@ static void null_release(struct gendisk *disk, fmode_t mode)
{
}

+static int null_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
+ unsigned long arg)
+{
+ struct nullb *nullb = bdev->bd_disk->private_data;
+ int ret;
+
+ if (nullb->nvm_dev) {
+ ret = nvm_ioctl(nullb->nvm_dev, mode, cmd, arg);
+ if (ret != -ENOTTY)
+ return ret;
+ }
+
+ return -ENOTTY;
+};
+
static const struct block_device_operations null_fops = {
.owner = THIS_MODULE,
.open = null_open,
.release = null_release,
+ .ioctl = null_ioctl,
};

static int setup_commands(struct nullb_queue *nq)
@@ -461,6 +553,7 @@ static int null_add_dev(void)
{
struct gendisk *disk;
struct nullb *nullb;
+ struct nvm_dev *nvm_dev = NULL;
sector_t size;
int rv;

@@ -472,14 +565,14 @@ static int null_add_dev(void)

spin_lock_init(&nullb->lock);

- if (queue_mode == NULL_Q_MQ && use_per_node_hctx)
+ if ((queue_mode & (NULL_Q_MQ|NULL_Q_LIGHTNVM)) && use_per_node_hctx)
submit_queues = nr_online_nodes;

rv = setup_queues(nullb);
if (rv)
goto out_free_nullb;

- if (queue_mode == NULL_Q_MQ) {
+ if (queue_mode & (NULL_Q_MQ|NULL_Q_LIGHTNVM)) {
nullb->tag_set.ops = &null_mq_ops;
nullb->tag_set.nr_hw_queues = submit_queues;
nullb->tag_set.queue_depth = hw_queue_depth;
@@ -497,6 +590,18 @@ static int null_add_dev(void)
rv = -ENOMEM;
goto out_cleanup_tags;
}
+
+ if (queue_mode == NULL_Q_LIGHTNVM) {
+ nvm_dev = nvm_alloc();
+ if (!nvm_dev)
+ goto out_cleanup_tags;
+
+ nvm_dev->ops = &null_nvm_dev_ops;
+ nvm_dev->driver_data = nullb;
+
+ nvm_dev->drv_cmd_size = nullb->tag_set.cmd_size;
+ nullb->tag_set.cmd_size += nvm_cmd_size();
+ }
} else if (queue_mode == NULL_Q_BIO) {
nullb->q = blk_alloc_queue_node(GFP_KERNEL, home_node);
if (!nullb->q) {
@@ -517,6 +622,7 @@ static int null_add_dev(void)
}

nullb->q->queuedata = nullb;
+
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, nullb->q);

disk = nullb->disk = alloc_disk_node(1, home_node);
@@ -525,11 +631,6 @@ static int null_add_dev(void)
goto out_cleanup_blk_queue;
}

- mutex_lock(&lock);
- list_add_tail(&nullb->list, &nullb_list);
- nullb->index = nullb_indexes++;
- mutex_unlock(&lock);
-
blk_queue_logical_block_size(nullb->q, bs);
blk_queue_physical_block_size(nullb->q, bs);

@@ -540,17 +641,37 @@ static int null_add_dev(void)
disk->flags |= GENHD_FL_EXT_DEVT;
disk->major = null_major;
disk->first_minor = nullb->index;
- disk->fops = &null_fops;
disk->private_data = nullb;
+ disk->fops = &null_fops;
disk->queue = nullb->q;
+
+ if (nvm_dev) {
+ nvm_dev->q = nullb->q;
+ nvm_dev->disk = disk;
+
+ if (nvm_init(disk, nvm_dev))
+ goto out_cleanup_nvm;
+
+ nullb->nvm_dev = nvm_dev;
+ }
+
+ mutex_lock(&lock);
+ list_add_tail(&nullb->list, &nullb_list);
+ nullb->index = nullb_indexes++;
+ mutex_unlock(&lock);
+
sprintf(disk->disk_name, "nullb%d", nullb->index);
add_disk(disk);
+ nvm_add_sysfs(nvm_dev);
return 0;

+out_cleanup_nvm:
+ put_disk(disk);
out_cleanup_blk_queue:
blk_cleanup_queue(nullb->q);
out_cleanup_tags:
- if (queue_mode == NULL_Q_MQ)
+ nvm_free(nvm_dev);
+ if (queue_mode & (NULL_Q_MQ|NULL_Q_LIGHTNVM))
blk_mq_free_tag_set(&nullb->tag_set);
out_cleanup_queues:
cleanup_queues(nullb);
@@ -570,7 +691,7 @@ static int __init null_init(void)
bs = PAGE_SIZE;
}

- if (queue_mode == NULL_Q_MQ && use_per_node_hctx) {
+ if (queue_mode & (NULL_Q_MQ|NULL_Q_LIGHTNVM) && use_per_node_hctx) {
if (submit_queues < nr_online_nodes) {
pr_warn("null_blk: submit_queues param is set to %u.",
nr_online_nodes);
--
1.9.1

2014-10-08 15:58:28

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH 2/5] block: extend rq_flag_bits

From: Jesper Madsen <[email protected]>

The rq_flag_bits are extended with REQ_NVM and REQ_NVM_MAPPED.

REQ_NVM is used to detect that a request has been submitted through LightNVM,
so this can be detected again on completion.

REQ_NVM_MAPPED is used to detect whether a request has been mapped
appropriately by LightNVM.

The latter is added temporarily to debug the LightNVM key-value target and will
be removed later.
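
As a sketch of the intended use (the actual call sites live in the LightNVM
core and are not part of this patch):

	/* on submission: mark the request as having passed through LightNVM */
	rq->cmd_flags |= REQ_NVM;

	/* once the logical address has been mapped to a physical one */
	rq->cmd_flags |= REQ_NVM_MAPPED;

	/* on completion: a request without REQ_NVM bypassed LightNVM */
	if (!(rq->cmd_flags & REQ_NVM))
		pr_debug("request completed outside LightNVM\n");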

Signed-off-by: Matias Bjørling <[email protected]>
---
include/linux/blk_types.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 66c2167..b1d2f4d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -191,6 +191,8 @@ enum rq_flag_bits {
__REQ_END, /* last of chain of requests */
__REQ_HASHED, /* on IO scheduler merge hash */
__REQ_MQ_INFLIGHT, /* track inflight for MQ */
+ __REQ_NVM, /* request is queued via lightnvm */
+ __REQ_NVM_MAPPED, /* lightnvm mapped this request */
__REQ_NR_BITS, /* stops here */
};

@@ -245,5 +247,7 @@ enum rq_flag_bits {
#define REQ_END (1ULL << __REQ_END)
#define REQ_HASHED (1ULL << __REQ_HASHED)
#define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT)
+#define REQ_NVM (1ULL << __REQ_NVM)
+#define REQ_NVM_MAPPED (1ULL << __REQ_NVM_MAPPED)

#endif /* __LINUX_BLK_TYPES_H */
--
1.9.1

2014-10-08 15:58:27

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH 1/5] NVMe: Convert to blk-mq

This converts the NVMe driver to a blk-mq request-based driver.

The NVMe driver is currently bio-based and implements queue logic within itself.
By using blk-mq, a lot of these responsibilities can be moved and simplified.

The patch is divided into the following blocks:

* Per-command data and cmdid have been moved into the struct request field.
The per-command data can be retrieved using blk_mq_rq_to_pdu(), and command id
maintenance is now handled by blk-mq through the rq->tag field (see the sketch
after this list).

* The logic for splitting bios has been moved into the blk-mq layer. The
driver instead notifies the block layer about limited gap support in SG
lists.

* blk-mq handles timeouts, which are reimplemented within nvme_timeout(). This
includes both abort handling and command cancellation.

* Assignment of nvme queues to CPUs is replaced with the blk-mq version. The
current blk-mq strategy is to assign the number of mapped queues and CPUs to
provide synergy, while the nvme driver assigns as many nvme hw queues as
possible. This can be implemented in blk-mq if needed.

* NVMe queues are merged with the hardware dispatch queue (hctx) structure of
blk-mq.

* blk-mq takes care of setup/teardown of nvme queues and guards invalid
accesses. Therefore, RCU-usage for nvme queues can be removed.

* IO tracing and accounting are handled by blk-mq and therefore removed.

* Setup and teardown of nvme queues are now handled by nvme_[init/exit]_hctx.
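
A minimal sketch of the new per-command data access pattern (using the names
introduced in this patch):

	/* the per-request PDU replaces the old cmdid_data array */
	struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);

	/* the command id is now simply the blk-mq tag */
	int tag = req->tag;

	/* stash the driver context and completion handler for this command */
	nvme_set_info(cmd, iod, req_completion);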

Contributions in this patch from:

Sam Bradshaw <[email protected]>
Jens Axboe <[email protected]>
Keith Busch <[email protected]>
Robert Nelson <[email protected]>

Acked-by: Keith Busch <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/block/nvme-core.c | 1219 ++++++++++++++++++---------------------------
drivers/block/nvme-scsi.c | 8 +-
include/linux/nvme.h | 15 +-
3 files changed, 489 insertions(+), 753 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 02351e2..337878b 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -13,9 +13,9 @@
*/

#include <linux/nvme.h>
-#include <linux/bio.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
#include <linux/cpu.h>
#include <linux/delay.h>
#include <linux/errno.h>
@@ -33,7 +33,6 @@
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/pci.h>
-#include <linux/percpu.h>
#include <linux/poison.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
@@ -42,9 +41,8 @@
#include <scsi/sg.h>
#include <asm-generic/io-64-nonatomic-lo-hi.h>

-#include <trace/events/block.h>
-
#define NVME_Q_DEPTH 1024
+#define NVME_AQ_DEPTH 64
#define SQ_SIZE(depth) (depth * sizeof(struct nvme_command))
#define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion))
#define ADMIN_TIMEOUT (admin_timeout * HZ)
@@ -76,10 +74,12 @@ static wait_queue_head_t nvme_kthread_wait;
static struct notifier_block nvme_nb;

static void nvme_reset_failed_dev(struct work_struct *ws);
+static int nvme_process_cq(struct nvme_queue *nvmeq);

struct async_cmd_info {
struct kthread_work work;
struct kthread_worker *worker;
+ struct request *req;
u32 result;
int status;
void *ctx;
@@ -90,7 +90,6 @@ struct async_cmd_info {
* commands and one for I/O commands).
*/
struct nvme_queue {
- struct rcu_head r_head;
struct device *q_dmadev;
struct nvme_dev *dev;
char irqname[24]; /* nvme4294967295-65535\0 */
@@ -99,10 +98,6 @@ struct nvme_queue {
volatile struct nvme_completion *cqes;
dma_addr_t sq_dma_addr;
dma_addr_t cq_dma_addr;
- wait_queue_head_t sq_full;
- wait_queue_t sq_cong_wait;
- struct bio_list sq_cong;
- struct list_head iod_bio;
u32 __iomem *q_db;
u16 q_depth;
u16 cq_vector;
@@ -113,9 +108,8 @@ struct nvme_queue {
u8 cq_phase;
u8 cqe_seen;
u8 q_suspended;
- cpumask_var_t cpu_mask;
struct async_cmd_info cmdinfo;
- unsigned long cmdid_data[];
+ struct blk_mq_hw_ctx *hctx;
};

/*
@@ -143,62 +137,73 @@ typedef void (*nvme_completion_fn)(struct nvme_queue *, void *,
struct nvme_cmd_info {
nvme_completion_fn fn;
void *ctx;
- unsigned long timeout;
int aborted;
+ struct nvme_queue *nvmeq;
};

-static struct nvme_cmd_info *nvme_cmd_info(struct nvme_queue *nvmeq)
+static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+ unsigned int hctx_idx)
{
- return (void *)&nvmeq->cmdid_data[BITS_TO_LONGS(nvmeq->q_depth)];
+ struct nvme_dev *dev = data;
+ struct nvme_queue *nvmeq = dev->queues[0];
+
+ WARN_ON(nvmeq->hctx);
+ nvmeq->hctx = hctx;
+ hctx->driver_data = nvmeq;
+ return 0;
}

-static unsigned nvme_queue_extra(int depth)
+static int nvme_admin_init_request(void *data, struct request *req,
+ unsigned int hctx_idx, unsigned int rq_idx,
+ unsigned int numa_node)
{
- return DIV_ROUND_UP(depth, 8) + (depth * sizeof(struct nvme_cmd_info));
+ struct nvme_dev *dev = data;
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = dev->queues[0];
+
+ WARN_ON(!nvmeq);
+ cmd->nvmeq = nvmeq;
+ return 0;
}

-/**
- * alloc_cmdid() - Allocate a Command ID
- * @nvmeq: The queue that will be used for this command
- * @ctx: A pointer that will be passed to the handler
- * @handler: The function to call on completion
- *
- * Allocate a Command ID for a queue. The data passed in will
- * be passed to the completion handler. This is implemented by using
- * the bottom two bits of the ctx pointer to store the handler ID.
- * Passing in a pointer that's not 4-byte aligned will cause a BUG.
- * We can change this if it becomes a problem.
- *
- * May be called with local interrupts disabled and the q_lock held,
- * or with interrupts enabled and no locks held.
- */
-static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
- nvme_completion_fn handler, unsigned timeout)
+static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+ unsigned int hctx_idx)
{
- int depth = nvmeq->q_depth - 1;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- int cmdid;
-
- do {
- cmdid = find_first_zero_bit(nvmeq->cmdid_data, depth);
- if (cmdid >= depth)
- return -EBUSY;
- } while (test_and_set_bit(cmdid, nvmeq->cmdid_data));
-
- info[cmdid].fn = handler;
- info[cmdid].ctx = ctx;
- info[cmdid].timeout = jiffies + timeout;
- info[cmdid].aborted = 0;
- return cmdid;
+ struct nvme_dev *dev = data;
+ struct nvme_queue *nvmeq = dev->queues[
+ (hctx_idx % dev->queue_count) + 1];
+
+ /* nvmeq queues are shared between namespaces. We assume here that
+ * blk-mq maps the tags so they match up with the nvme queue tags */
+ if (!nvmeq->hctx)
+ nvmeq->hctx = hctx;
+ else
+ WARN_ON(nvmeq->hctx->tags != hctx->tags);
+ irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
+ hctx->cpumask);
+ hctx->driver_data = nvmeq;
+ return 0;
+}
+
+static int nvme_init_request(void *data, struct request *req,
+ unsigned int hctx_idx, unsigned int rq_idx,
+ unsigned int numa_node)
+{
+ struct nvme_dev *dev = data;
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];
+
+ WARN_ON(!nvmeq);
+ cmd->nvmeq = nvmeq;
+ return 0;
}

-static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
- nvme_completion_fn handler, unsigned timeout)
+static void nvme_set_info(struct nvme_cmd_info *cmd, void *ctx,
+ nvme_completion_fn handler)
{
- int cmdid;
- wait_event_killable(nvmeq->sq_full,
- (cmdid = alloc_cmdid(nvmeq, ctx, handler, timeout)) >= 0);
- return (cmdid < 0) ? -EINTR : cmdid;
+ cmd->fn = handler;
+ cmd->ctx = ctx;
+ cmd->aborted = 0;
}

/* Special values must be less than 0x1000 */
@@ -206,17 +211,11 @@ static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
#define CMD_CTX_CANCELLED (0x30C + CMD_CTX_BASE)
#define CMD_CTX_COMPLETED (0x310 + CMD_CTX_BASE)
#define CMD_CTX_INVALID (0x314 + CMD_CTX_BASE)
-#define CMD_CTX_ABORT (0x318 + CMD_CTX_BASE)
-
static void special_completion(struct nvme_queue *nvmeq, void *ctx,
struct nvme_completion *cqe)
{
if (ctx == CMD_CTX_CANCELLED)
return;
- if (ctx == CMD_CTX_ABORT) {
- ++nvmeq->dev->abort_limit;
- return;
- }
if (ctx == CMD_CTX_COMPLETED) {
dev_warn(nvmeq->q_dmadev,
"completed id %d twice on queue %d\n",
@@ -233,6 +232,31 @@ static void special_completion(struct nvme_queue *nvmeq, void *ctx,
dev_warn(nvmeq->q_dmadev, "Unknown special completion %p\n", ctx);
}

+static void *cancel_cmd_info(struct nvme_cmd_info *cmd, nvme_completion_fn *fn)
+{
+ void *ctx;
+
+ if (fn)
+ *fn = cmd->fn;
+ ctx = cmd->ctx;
+ cmd->fn = special_completion;
+ cmd->ctx = CMD_CTX_CANCELLED;
+ return ctx;
+}
+
+static void abort_completion(struct nvme_queue *nvmeq, void *ctx,
+ struct nvme_completion *cqe)
+{
+ struct request *req = ctx;
+ u16 status = le16_to_cpup(&cqe->status) >> 1;
+ u32 result = le32_to_cpup(&cqe->result);
+
+ blk_put_request(req);
+
+ dev_warn(nvmeq->q_dmadev, "Abort status:%x result:%x", status, result);
+ ++nvmeq->dev->abort_limit;
+}
+
static void async_completion(struct nvme_queue *nvmeq, void *ctx,
struct nvme_completion *cqe)
{
@@ -240,90 +264,38 @@ static void async_completion(struct nvme_queue *nvmeq, void *ctx,
cmdinfo->result = le32_to_cpup(&cqe->result);
cmdinfo->status = le16_to_cpup(&cqe->status) >> 1;
queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
+ blk_put_request(cmdinfo->req);
+}
+
+static inline struct nvme_cmd_info *get_cmd_from_tag(struct nvme_queue *nvmeq,
+ unsigned int tag)
+{
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+ struct request *req = blk_mq_tag_to_rq(hctx->tags, tag);
+
+ return blk_mq_rq_to_pdu(req);
}

/*
* Called with local interrupts disabled and the q_lock held. May not sleep.
*/
-static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *nvme_finish_cmd(struct nvme_queue *nvmeq, int tag,
nvme_completion_fn *fn)
{
+ struct nvme_cmd_info *cmd = get_cmd_from_tag(nvmeq, tag);
void *ctx;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
-
- if (cmdid >= nvmeq->q_depth || !info[cmdid].fn) {
- if (fn)
- *fn = special_completion;
+ if (tag >= nvmeq->q_depth) {
+ *fn = special_completion;
return CMD_CTX_INVALID;
}
if (fn)
- *fn = info[cmdid].fn;
- ctx = info[cmdid].ctx;
- info[cmdid].fn = special_completion;
- info[cmdid].ctx = CMD_CTX_COMPLETED;
- clear_bit(cmdid, nvmeq->cmdid_data);
- wake_up(&nvmeq->sq_full);
+ *fn = cmd->fn;
+ ctx = cmd->ctx;
+ cmd->fn = special_completion;
+ cmd->ctx = CMD_CTX_COMPLETED;
return ctx;
}

-static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid,
- nvme_completion_fn *fn)
-{
- void *ctx;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- if (fn)
- *fn = info[cmdid].fn;
- ctx = info[cmdid].ctx;
- info[cmdid].fn = special_completion;
- info[cmdid].ctx = CMD_CTX_CANCELLED;
- return ctx;
-}
-
-static struct nvme_queue *raw_nvmeq(struct nvme_dev *dev, int qid)
-{
- return rcu_dereference_raw(dev->queues[qid]);
-}
-
-static struct nvme_queue *get_nvmeq(struct nvme_dev *dev) __acquires(RCU)
-{
- struct nvme_queue *nvmeq;
- unsigned queue_id = get_cpu_var(*dev->io_queue);
-
- rcu_read_lock();
- nvmeq = rcu_dereference(dev->queues[queue_id]);
- if (nvmeq)
- return nvmeq;
-
- rcu_read_unlock();
- put_cpu_var(*dev->io_queue);
- return NULL;
-}
-
-static void put_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
- rcu_read_unlock();
- put_cpu_var(nvmeq->dev->io_queue);
-}
-
-static struct nvme_queue *lock_nvmeq(struct nvme_dev *dev, int q_idx)
- __acquires(RCU)
-{
- struct nvme_queue *nvmeq;
-
- rcu_read_lock();
- nvmeq = rcu_dereference(dev->queues[q_idx]);
- if (nvmeq)
- return nvmeq;
-
- rcu_read_unlock();
- return NULL;
-}
-
-static void unlock_nvmeq(struct nvme_queue *nvmeq) __releases(RCU)
-{
- rcu_read_unlock();
-}
-
/**
* nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
* @nvmeq: The queue to use
@@ -380,7 +352,6 @@ nvme_alloc_iod(unsigned nseg, unsigned nbytes, gfp_t gfp)
iod->length = nbytes;
iod->nents = 0;
iod->first_dma = 0ULL;
- iod->start_time = jiffies;
}

return iod;
@@ -404,65 +375,37 @@ void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod)
kfree(iod);
}

-static void nvme_start_io_acct(struct bio *bio)
-{
- struct gendisk *disk = bio->bi_bdev->bd_disk;
- if (blk_queue_io_stat(disk->queue)) {
- const int rw = bio_data_dir(bio);
- int cpu = part_stat_lock();
- part_round_stats(cpu, &disk->part0);
- part_stat_inc(cpu, &disk->part0, ios[rw]);
- part_stat_add(cpu, &disk->part0, sectors[rw],
- bio_sectors(bio));
- part_inc_in_flight(&disk->part0, rw);
- part_stat_unlock();
- }
-}
-
-static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
-{
- struct gendisk *disk = bio->bi_bdev->bd_disk;
- if (blk_queue_io_stat(disk->queue)) {
- const int rw = bio_data_dir(bio);
- unsigned long duration = jiffies - start_time;
- int cpu = part_stat_lock();
- part_stat_add(cpu, &disk->part0, ticks[rw], duration);
- part_round_stats(cpu, &disk->part0);
- part_dec_in_flight(&disk->part0, rw);
- part_stat_unlock();
- }
-}
-
-static void bio_completion(struct nvme_queue *nvmeq, void *ctx,
+static void req_completion(struct nvme_queue *nvmeq, void *ctx,
struct nvme_completion *cqe)
{
struct nvme_iod *iod = ctx;
- struct bio *bio = iod->private;
+ struct request *req = iod->private;
+ struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+
u16 status = le16_to_cpup(&cqe->status) >> 1;
- int error = 0;

if (unlikely(status)) {
- if (!(status & NVME_SC_DNR ||
- bio->bi_rw & REQ_FAILFAST_MASK) &&
- (jiffies - iod->start_time) < IOD_TIMEOUT) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- list_add_tail(&iod->node, &nvmeq->iod_bio);
- wake_up(&nvmeq->sq_full);
+ if (!(status & NVME_SC_DNR || blk_noretry_request(req))
+ && (jiffies - req->start_time) < req->timeout) {
+ blk_mq_requeue_request(req);
+ blk_mq_kick_requeue_list(req->q);
return;
}
- error = -EIO;
- }
- if (iod->nents) {
- dma_unmap_sg(nvmeq->q_dmadev, iod->sg, iod->nents,
- bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
- nvme_end_io_acct(bio, iod->start_time);
- }
+ req->errors = -EIO;
+ } else
+ req->errors = 0;
+
+ if (cmd_rq->aborted)
+ dev_warn(&nvmeq->dev->pci_dev->dev,
+ "completing aborted command with status:%04x\n",
+ status);
+
+ if (iod->nents)
+ dma_unmap_sg(&nvmeq->dev->pci_dev->dev, iod->sg, iod->nents,
+ rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
nvme_free_iod(nvmeq->dev, iod);

- trace_block_bio_complete(bdev_get_queue(bio->bi_bdev), bio, error);
- bio_endio(bio, error);
+ blk_mq_complete_request(req);
}

/* length is in bytes. gfp flags indicates whether we may sleep. */
@@ -544,88 +487,25 @@ int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod, int total_len,
return total_len;
}

-static int nvme_split_and_submit(struct bio *bio, struct nvme_queue *nvmeq,
- int len)
-{
- struct bio *split = bio_split(bio, len >> 9, GFP_ATOMIC, NULL);
- if (!split)
- return -ENOMEM;
-
- trace_block_split(bdev_get_queue(bio->bi_bdev), bio,
- split->bi_iter.bi_sector);
- bio_chain(split, bio);
-
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- bio_list_add(&nvmeq->sq_cong, split);
- bio_list_add(&nvmeq->sq_cong, bio);
- wake_up(&nvmeq->sq_full);
-
- return 0;
-}
-
-/* NVMe scatterlists require no holes in the virtual address */
-#define BIOVEC_NOT_VIRT_MERGEABLE(vec1, vec2) ((vec2)->bv_offset || \
- (((vec1)->bv_offset + (vec1)->bv_len) % PAGE_SIZE))
-
-static int nvme_map_bio(struct nvme_queue *nvmeq, struct nvme_iod *iod,
- struct bio *bio, enum dma_data_direction dma_dir, int psegs)
-{
- struct bio_vec bvec, bvprv;
- struct bvec_iter iter;
- struct scatterlist *sg = NULL;
- int length = 0, nsegs = 0, split_len = bio->bi_iter.bi_size;
- int first = 1;
-
- if (nvmeq->dev->stripe_size)
- split_len = nvmeq->dev->stripe_size -
- ((bio->bi_iter.bi_sector << 9) &
- (nvmeq->dev->stripe_size - 1));
-
- sg_init_table(iod->sg, psegs);
- bio_for_each_segment(bvec, bio, iter) {
- if (!first && BIOVEC_PHYS_MERGEABLE(&bvprv, &bvec)) {
- sg->length += bvec.bv_len;
- } else {
- if (!first && BIOVEC_NOT_VIRT_MERGEABLE(&bvprv, &bvec))
- return nvme_split_and_submit(bio, nvmeq,
- length);
-
- sg = sg ? sg + 1 : iod->sg;
- sg_set_page(sg, bvec.bv_page,
- bvec.bv_len, bvec.bv_offset);
- nsegs++;
- }
-
- if (split_len - length < bvec.bv_len)
- return nvme_split_and_submit(bio, nvmeq, split_len);
- length += bvec.bv_len;
- bvprv = bvec;
- first = 0;
- }
- iod->nents = nsegs;
- sg_mark_end(sg);
- if (dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir) == 0)
- return -ENOMEM;
-
- BUG_ON(length != bio->bi_iter.bi_size);
- return length;
-}
-
-static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
- struct bio *bio, struct nvme_iod *iod, int cmdid)
+/*
+ * We reuse the small pool to allocate the 16-byte range here as it is not
+ * worth having a special pool for these or additional cases to handle freeing
+ * the iod.
+ */
+static void nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+ struct request *req, struct nvme_iod *iod)
{
struct nvme_dsm_range *range =
(struct nvme_dsm_range *)iod_list(iod)[0];
struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];

range->cattr = cpu_to_le32(0);
- range->nlb = cpu_to_le32(bio->bi_iter.bi_size >> ns->lba_shift);
- range->slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
+ range->nlb = cpu_to_le32(blk_rq_bytes(req) >> ns->lba_shift);
+ range->slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));

memset(cmnd, 0, sizeof(*cmnd));
cmnd->dsm.opcode = nvme_cmd_dsm;
- cmnd->dsm.command_id = cmdid;
+ cmnd->dsm.command_id = req->tag;
cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
cmnd->dsm.prp1 = cpu_to_le64(iod->first_dma);
cmnd->dsm.nr = 0;
@@ -634,11 +514,9 @@ static int nvme_submit_discard(struct nvme_queue *nvmeq, struct nvme_ns *ns,
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
writel(nvmeq->sq_tail, nvmeq->q_db);
-
- return 0;
}

-static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+static void nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
int cmdid)
{
struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
@@ -651,49 +529,34 @@ static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
writel(nvmeq->sq_tail, nvmeq->q_db);
-
- return 0;
}

-static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
+static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
+ struct nvme_ns *ns)
{
- struct bio *bio = iod->private;
- struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
+ struct request *req = iod->private;
struct nvme_command *cmnd;
- int cmdid;
- u16 control;
- u32 dsmgmt;
+ u16 control = 0;
+ u32 dsmgmt = 0;

- cmdid = alloc_cmdid(nvmeq, iod, bio_completion, NVME_IO_TIMEOUT);
- if (unlikely(cmdid < 0))
- return cmdid;
-
- if (bio->bi_rw & REQ_DISCARD)
- return nvme_submit_discard(nvmeq, ns, bio, iod, cmdid);
- if (bio->bi_rw & REQ_FLUSH)
- return nvme_submit_flush(nvmeq, ns, cmdid);
-
- control = 0;
- if (bio->bi_rw & REQ_FUA)
+ if (req->cmd_flags & REQ_FUA)
control |= NVME_RW_FUA;
- if (bio->bi_rw & (REQ_FAILFAST_DEV | REQ_RAHEAD))
+ if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
control |= NVME_RW_LR;

- dsmgmt = 0;
- if (bio->bi_rw & REQ_RAHEAD)
+ if (req->cmd_flags & REQ_RAHEAD)
dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;

cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
memset(cmnd, 0, sizeof(*cmnd));

- cmnd->rw.opcode = bio_data_dir(bio) ? nvme_cmd_write : nvme_cmd_read;
- cmnd->rw.command_id = cmdid;
+ cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
+ cmnd->rw.command_id = req->tag;
cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
- cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, bio->bi_iter.bi_sector));
- cmnd->rw.length =
- cpu_to_le16((bio->bi_iter.bi_size >> ns->lba_shift) - 1);
+ cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+ cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
cmnd->rw.control = cpu_to_le16(control);
cmnd->rw.dsmgmt = cpu_to_le32(dsmgmt);

@@ -704,45 +567,32 @@ static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod)
return 0;
}

-static int nvme_split_flush_data(struct nvme_queue *nvmeq, struct bio *bio)
-{
- struct bio *split = bio_clone(bio, GFP_ATOMIC);
- if (!split)
- return -ENOMEM;
-
- split->bi_iter.bi_size = 0;
- split->bi_phys_segments = 0;
- bio->bi_rw &= ~REQ_FLUSH;
- bio_chain(split, bio);
-
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- bio_list_add(&nvmeq->sq_cong, split);
- bio_list_add(&nvmeq->sq_cong, bio);
- wake_up_process(nvme_thread);
-
- return 0;
-}
-
-/*
- * Called with local interrupts disabled and the q_lock held. May not sleep.
- */
-static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
- struct bio *bio)
+static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
{
+ struct nvme_ns *ns = hctx->queue->queuedata;
+ struct nvme_queue *nvmeq = hctx->driver_data;
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
struct nvme_iod *iod;
- int psegs = bio_phys_segments(ns->queue, bio);
- int result;
+ enum dma_data_direction dma_dir;
+ int psegs = req->nr_phys_segments;
+ int result = BLK_MQ_RQ_QUEUE_BUSY;
+ /*
+ * Requeued IO has already been prepped
+ */
+ iod = req->special;
+ if (iod)
+ goto submit_iod;

- if ((bio->bi_rw & REQ_FLUSH) && psegs)
- return nvme_split_flush_data(nvmeq, bio);
-
- iod = nvme_alloc_iod(psegs, bio->bi_iter.bi_size, GFP_ATOMIC);
+ iod = nvme_alloc_iod(psegs, blk_rq_bytes(req), GFP_ATOMIC);
if (!iod)
- return -ENOMEM;
+ return result;

- iod->private = bio;
- if (bio->bi_rw & REQ_DISCARD) {
+ iod->private = req;
+ req->special = iod;
+
+ nvme_set_info(cmd, iod, req_completion);
+
+ if (req->cmd_flags & REQ_DISCARD) {
void *range;
/*
* We reuse the small pool to allocate the 16-byte range here
@@ -752,33 +602,53 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
range = dma_pool_alloc(nvmeq->dev->prp_small_pool,
GFP_ATOMIC,
&iod->first_dma);
- if (!range) {
- result = -ENOMEM;
- goto free_iod;
- }
+ if (!range)
+ goto finish_cmd;
iod_list(iod)[0] = (__le64 *)range;
iod->npages = 0;
} else if (psegs) {
- result = nvme_map_bio(nvmeq, iod, bio,
- bio_data_dir(bio) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
- psegs);
- if (result <= 0)
- goto free_iod;
- if (nvme_setup_prps(nvmeq->dev, iod, result, GFP_ATOMIC) !=
- result) {
- result = -ENOMEM;
- goto free_iod;
+ dma_dir = rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+
+ sg_init_table(iod->sg, psegs);
+ iod->nents = blk_rq_map_sg(req->q, req, iod->sg);
+ if (!iod->nents) {
+ result = BLK_MQ_RQ_QUEUE_ERROR;
+ goto finish_cmd;
}
- nvme_start_io_acct(bio);
+
+ if (!dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir))
+ goto finish_cmd;
+
+ if (blk_rq_bytes(req) != nvme_setup_prps(nvmeq->dev, iod,
+ blk_rq_bytes(req), GFP_ATOMIC))
+ goto finish_cmd;
+ }
+
+ submit_iod:
+ spin_lock_irq(&nvmeq->q_lock);
+ if (nvmeq->q_suspended) {
+ spin_unlock_irq(&nvmeq->q_lock);
+ goto finish_cmd;
}
- if (unlikely(nvme_submit_iod(nvmeq, iod))) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- list_add_tail(&iod->node, &nvmeq->iod_bio);
+
+ if (req->cmd_flags & REQ_DISCARD) {
+ nvme_submit_discard(nvmeq, ns, req, iod);
+ goto queued;
+ }
+
+ if (req->cmd_flags & REQ_FLUSH) {
+ nvme_submit_flush(nvmeq, ns, req->tag);
+ goto queued;
}
- return 0;

- free_iod:
+ nvme_submit_iod(nvmeq, iod, ns);
+ queued:
+ nvme_process_cq(nvmeq);
+ spin_unlock_irq(&nvmeq->q_lock);
+ return BLK_MQ_RQ_QUEUE_OK;
+
+ finish_cmd:
+ nvme_finish_cmd(nvmeq, req->tag, NULL);
nvme_free_iod(nvmeq->dev, iod);
return result;
}
@@ -801,8 +671,7 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
head = 0;
phase = !phase;
}
-
- ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
+ ctx = nvme_finish_cmd(nvmeq, cqe.command_id, &fn);
fn(nvmeq, ctx, &cqe);
}

@@ -823,29 +692,12 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
return 1;
}

-static void nvme_make_request(struct request_queue *q, struct bio *bio)
+/* Admin queue isn't initialized as a request queue. If at some point this
+ * happens anyway, make sure to notify the user */
+static int nvme_admin_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
{
- struct nvme_ns *ns = q->queuedata;
- struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
- int result = -EBUSY;
-
- if (!nvmeq) {
- bio_endio(bio, -EIO);
- return;
- }
-
- spin_lock_irq(&nvmeq->q_lock);
- if (!nvmeq->q_suspended && bio_list_empty(&nvmeq->sq_cong))
- result = nvme_submit_bio_queue(nvmeq, ns, bio);
- if (unlikely(result)) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full, &nvmeq->sq_cong_wait);
- bio_list_add(&nvmeq->sq_cong, bio);
- }
-
- nvme_process_cq(nvmeq);
- spin_unlock_irq(&nvmeq->q_lock);
- put_nvmeq(nvmeq);
+ WARN_ON_ONCE(1);
+ return BLK_MQ_RQ_QUEUE_ERROR;
}

static irqreturn_t nvme_irq(int irq, void *data)
@@ -869,10 +721,11 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
return IRQ_WAKE_THREAD;
}

-static void nvme_abort_command(struct nvme_queue *nvmeq, int cmdid)
+static void nvme_abort_cmd_info(struct nvme_queue *nvmeq, struct nvme_cmd_info *
+ cmd_info)
{
spin_lock_irq(&nvmeq->q_lock);
- cancel_cmdid(nvmeq, cmdid, NULL);
+ cancel_cmd_info(cmd_info, NULL);
spin_unlock_irq(&nvmeq->q_lock);
}

@@ -895,45 +748,31 @@ static void sync_completion(struct nvme_queue *nvmeq, void *ctx,
* Returns 0 on success. If the result is negative, it's a Linux error code;
* if the result is positive, it's an NVM Express status code
*/
-static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
- struct nvme_command *cmd,
+static int nvme_submit_sync_cmd(struct request *req, struct nvme_command *cmd,
u32 *result, unsigned timeout)
{
- int cmdid, ret;
+ int ret;
struct sync_cmd_info cmdinfo;
- struct nvme_queue *nvmeq;
-
- nvmeq = lock_nvmeq(dev, q_idx);
- if (!nvmeq)
- return -ENODEV;
+ struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = cmd_rq->nvmeq;

cmdinfo.task = current;
cmdinfo.status = -EINTR;

- cmdid = alloc_cmdid(nvmeq, &cmdinfo, sync_completion, timeout);
- if (cmdid < 0) {
- unlock_nvmeq(nvmeq);
- return cmdid;
- }
- cmd->common.command_id = cmdid;
+ cmd->common.command_id = req->tag;
+
+ nvme_set_info(cmd_rq, &cmdinfo, sync_completion);

set_current_state(TASK_KILLABLE);
ret = nvme_submit_cmd(nvmeq, cmd);
if (ret) {
- free_cmdid(nvmeq, cmdid, NULL);
- unlock_nvmeq(nvmeq);
+ nvme_finish_cmd(nvmeq, req->tag, NULL);
set_current_state(TASK_RUNNING);
- return ret;
}
- unlock_nvmeq(nvmeq);
schedule_timeout(timeout);

if (cmdinfo.status == -EINTR) {
- nvmeq = lock_nvmeq(dev, q_idx);
- if (nvmeq) {
- nvme_abort_command(nvmeq, cmdid);
- unlock_nvmeq(nvmeq);
- }
+ nvme_abort_cmd_info(nvmeq, blk_mq_rq_to_pdu(req));
return -EINTR;
}

@@ -943,59 +782,78 @@ static int nvme_submit_sync_cmd(struct nvme_dev *dev, int q_idx,
return cmdinfo.status;
}

-static int nvme_submit_async_cmd(struct nvme_queue *nvmeq,
+static int nvme_submit_admin_async_cmd(struct nvme_dev *dev,
struct nvme_command *cmd,
struct async_cmd_info *cmdinfo, unsigned timeout)
{
- int cmdid;
+ struct nvme_queue *nvmeq = dev->queues[0];
+ struct request *req;
+ struct nvme_cmd_info *cmd_rq;

- cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, async_completion, timeout);
- if (cmdid < 0)
- return cmdid;
+ req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+ if (!req)
+ return -ENOMEM;
+
+ req->timeout = timeout;
+ cmd_rq = blk_mq_rq_to_pdu(req);
+ cmdinfo->req = req;
+ nvme_set_info(cmd_rq, cmdinfo, async_completion);
cmdinfo->status = -EINTR;
- cmd->common.command_id = cmdid;
+
+ cmd->common.command_id = req->tag;
+
return nvme_submit_cmd(nvmeq, cmd);
}

+int __nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
+ u32 *result, unsigned timeout)
+{
+ int res;
+ struct request *req;
+
+ req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_KERNEL, false);
+ if (!req)
+ return -ENOMEM;
+ res = nvme_submit_sync_cmd(req, cmd, result, timeout);
+ blk_put_request(req);
+ return res;
+}
+
int nvme_submit_admin_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
u32 *result)
{
- return nvme_submit_sync_cmd(dev, 0, cmd, result, ADMIN_TIMEOUT);
+ return __nvme_submit_admin_cmd(dev, cmd, result, ADMIN_TIMEOUT);
}

-int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_command *cmd,
- u32 *result)
+int nvme_submit_io_cmd(struct nvme_dev *dev, struct nvme_ns *ns,
+ struct nvme_command *cmd, u32 *result)
{
- return nvme_submit_sync_cmd(dev, smp_processor_id() + 1, cmd, result,
- NVME_IO_TIMEOUT);
-}
+ int res;
+ struct request *req;

-static int nvme_submit_admin_cmd_async(struct nvme_dev *dev,
- struct nvme_command *cmd, struct async_cmd_info *cmdinfo)
-{
- return nvme_submit_async_cmd(raw_nvmeq(dev, 0), cmd, cmdinfo,
- ADMIN_TIMEOUT);
+ req = blk_mq_alloc_request(ns->queue, WRITE, (GFP_KERNEL|__GFP_WAIT),
+ false);
+ if (!req)
+ return -ENOMEM;
+ res = nvme_submit_sync_cmd(req, cmd, result, NVME_IO_TIMEOUT);
+ blk_put_request(req);
+ return res;
}

static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
{
- int status;
struct nvme_command c;

memset(&c, 0, sizeof(c));
c.delete_queue.opcode = opcode;
c.delete_queue.qid = cpu_to_le16(id);

- status = nvme_submit_admin_cmd(dev, &c, NULL);
- if (status)
- return -EIO;
- return 0;
+ return nvme_submit_admin_cmd(dev, &c, NULL);
}

static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
struct nvme_queue *nvmeq)
{
- int status;
struct nvme_command c;
int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;

@@ -1007,16 +865,12 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
c.create_cq.cq_flags = cpu_to_le16(flags);
c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);

- status = nvme_submit_admin_cmd(dev, &c, NULL);
- if (status)
- return -EIO;
- return 0;
+ return nvme_submit_admin_cmd(dev, &c, NULL);
}

static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
struct nvme_queue *nvmeq)
{
- int status;
struct nvme_command c;
int flags = NVME_QUEUE_PHYS_CONTIG | NVME_SQ_PRIO_MEDIUM;

@@ -1028,10 +882,7 @@ static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid,
c.create_sq.sq_flags = cpu_to_le16(flags);
c.create_sq.cqid = cpu_to_le16(qid);

- status = nvme_submit_admin_cmd(dev, &c, NULL);
- if (status)
- return -EIO;
- return 0;
+ return nvme_submit_admin_cmd(dev, &c, NULL);
}

static int adapter_delete_cq(struct nvme_dev *dev, u16 cqid)
@@ -1087,28 +938,27 @@ int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
}

/**
- * nvme_abort_cmd - Attempt aborting a command
- * @cmdid: Command id of a timed out IO
- * @queue: The queue with timed out IO
+ * nvme_abort_req - Attempt aborting a request
*
* Schedule controller reset if the command was already aborted once before and
* still hasn't been returned to the driver, or if this is the admin queue.
*/
-static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
+static void nvme_abort_req(struct request *req)
{
- int a_cmdid;
+ struct nvme_cmd_info *cmd_rq = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = cmd_rq->nvmeq;
+ struct nvme_dev *dev = nvmeq->dev;
+ struct request *abort_req;
+ struct nvme_cmd_info *abort_cmd;
struct nvme_command cmd;
- struct nvme_dev *dev = nvmeq->dev;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- struct nvme_queue *adminq;

- if (!nvmeq->qid || info[cmdid].aborted) {
+ if (!nvmeq->qid || cmd_rq->aborted) {
if (work_busy(&dev->reset_work))
return;
list_del_init(&dev->node);
dev_warn(&dev->pci_dev->dev,
- "I/O %d QID %d timeout, reset controller\n", cmdid,
- nvmeq->qid);
+ "I/O %d QID %d timeout, reset controller\n",
+ req->tag, nvmeq->qid);
dev->reset_workfn = nvme_reset_failed_dev;
queue_work(nvme_workq, &dev->reset_work);
return;
@@ -1117,89 +967,88 @@ static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
if (!dev->abort_limit)
return;

- adminq = rcu_dereference(dev->queues[0]);
- a_cmdid = alloc_cmdid(adminq, CMD_CTX_ABORT, special_completion,
- ADMIN_TIMEOUT);
- if (a_cmdid < 0)
+ abort_req = blk_mq_alloc_request(dev->admin_q, WRITE, GFP_ATOMIC,
+ false);
+ if (!abort_req)
return;

+ abort_cmd = blk_mq_rq_to_pdu(abort_req);
+ nvme_set_info(abort_cmd, abort_req, abort_completion);
+
memset(&cmd, 0, sizeof(cmd));
cmd.abort.opcode = nvme_admin_abort_cmd;
- cmd.abort.cid = cmdid;
+ cmd.abort.cid = req->tag;
cmd.abort.sqid = cpu_to_le16(nvmeq->qid);
- cmd.abort.command_id = a_cmdid;
+ cmd.abort.command_id = abort_req->tag;

--dev->abort_limit;
- info[cmdid].aborted = 1;
- info[cmdid].timeout = jiffies + ADMIN_TIMEOUT;
+ cmd_rq->aborted = 1;

- dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", cmdid,
+ dev_warn(nvmeq->q_dmadev, "Aborting I/O %d QID %d\n", req->tag,
nvmeq->qid);
- nvme_submit_cmd(adminq, &cmd);
+ if (nvme_submit_cmd(dev->queues[0], &cmd) < 0) {
+ dev_warn(nvmeq->q_dmadev,
+ "Could not abort I/O %d QID %d",
+ req->tag, nvmeq->qid);
+ }
}

-/**
- * nvme_cancel_ios - Cancel outstanding I/Os
- * @queue: The queue to cancel I/Os on
- * @timeout: True to only cancel I/Os which have timed out
- */
-static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
+static void nvme_cancel_queue_ios(void *data, unsigned long *tag_map)
{
- int depth = nvmeq->q_depth - 1;
- struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
- unsigned long now = jiffies;
- int cmdid;
+ struct nvme_queue *nvmeq = data;
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+ unsigned int tag = 0;

- for_each_set_bit(cmdid, nvmeq->cmdid_data, depth) {
+ tag = 0;
+ do {
+ struct request *req;
void *ctx;
nvme_completion_fn fn;
+ struct nvme_cmd_info *cmd;
static struct nvme_completion cqe = {
.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
};
+ int qdepth = nvmeq == nvmeq->dev->queues[0] ?
+ nvmeq->dev->admin_tagset.queue_depth :
+ nvmeq->dev->tagset.queue_depth;

- if (timeout && !time_after(now, info[cmdid].timeout))
- continue;
- if (info[cmdid].ctx == CMD_CTX_CANCELLED)
- continue;
- if (timeout && nvmeq->dev->initialized) {
- nvme_abort_cmd(cmdid, nvmeq);
+ /* zero'd bits are free tags */
+ tag = find_next_zero_bit(tag_map, qdepth, tag);
+ if (tag >= qdepth)
+ break;
+
+ req = blk_mq_tag_to_rq(hctx->tags, tag++);
+ cmd = blk_mq_rq_to_pdu(req);
+
+ if (cmd->ctx == CMD_CTX_CANCELLED)
continue;
- }
- dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", cmdid,
- nvmeq->qid);
- ctx = cancel_cmdid(nvmeq, cmdid, &fn);
+
+ dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n",
+ req->tag, nvmeq->qid);
+ ctx = cancel_cmd_info(cmd, &fn);
fn(nvmeq, ctx, &cqe);
- }
+ } while (1);
}

-static void nvme_free_queue(struct rcu_head *r)
+static enum blk_eh_timer_return nvme_timeout(struct request *req)
{
- struct nvme_queue *nvmeq = container_of(r, struct nvme_queue, r_head);
+ struct nvme_cmd_info *cmd = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = cmd->nvmeq;

- spin_lock_irq(&nvmeq->q_lock);
- while (bio_list_peek(&nvmeq->sq_cong)) {
- struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
- bio_endio(bio, -EIO);
- }
- while (!list_empty(&nvmeq->iod_bio)) {
- static struct nvme_completion cqe = {
- .status = cpu_to_le16(
- (NVME_SC_ABORT_REQ | NVME_SC_DNR) << 1),
- };
- struct nvme_iod *iod = list_first_entry(&nvmeq->iod_bio,
- struct nvme_iod,
- node);
- list_del(&iod->node);
- bio_completion(nvmeq, iod, &cqe);
- }
- spin_unlock_irq(&nvmeq->q_lock);
+ dev_warn(nvmeq->q_dmadev, "Timeout I/O %d QID %d\n", req->tag,
+ nvmeq->qid);
+ if (nvmeq->dev->initialized)
+ nvme_abort_req(req);

+ return BLK_EH_RESET_TIMER;
+}
+
+static void nvme_free_queue(struct nvme_queue *nvmeq)
+{
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
nvmeq->sq_cmds, nvmeq->sq_dma_addr);
- if (nvmeq->qid)
- free_cpumask_var(nvmeq->cpu_mask);
kfree(nvmeq);
}

@@ -1208,10 +1057,10 @@ static void nvme_free_queues(struct nvme_dev *dev, int lowest)
int i;

for (i = dev->queue_count - 1; i >= lowest; i--) {
- struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
- rcu_assign_pointer(dev->queues[i], NULL);
- call_rcu(&nvmeq->r_head, nvme_free_queue);
+ struct nvme_queue *nvmeq = dev->queues[i];
dev->queue_count--;
+ dev->queues[i] = NULL;
+ nvme_free_queue(nvmeq);
}
}

@@ -1242,15 +1091,18 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)

static void nvme_clear_queue(struct nvme_queue *nvmeq)
{
+ struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+
spin_lock_irq(&nvmeq->q_lock);
nvme_process_cq(nvmeq);
- nvme_cancel_ios(nvmeq, false);
+ if (hctx && hctx->tags)
+ blk_mq_tag_busy_iter(hctx->tags, nvme_cancel_queue_ios, nvmeq);
spin_unlock_irq(&nvmeq->q_lock);
}

static void nvme_disable_queue(struct nvme_dev *dev, int qid)
{
- struct nvme_queue *nvmeq = raw_nvmeq(dev, qid);
+ struct nvme_queue *nvmeq = dev->queues[qid];

if (!nvmeq)
return;
@@ -1270,8 +1122,7 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
int depth, int vector)
{
struct device *dmadev = &dev->pci_dev->dev;
- unsigned extra = nvme_queue_extra(depth);
- struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq) + extra, GFP_KERNEL);
+ struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq), GFP_KERNEL);
if (!nvmeq)
return NULL;

@@ -1286,9 +1137,6 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
if (!nvmeq->sq_cmds)
goto free_cqdma;

- if (qid && !zalloc_cpumask_var(&nvmeq->cpu_mask, GFP_KERNEL))
- goto free_sqdma;
-
nvmeq->q_dmadev = dmadev;
nvmeq->dev = dev;
snprintf(nvmeq->irqname, sizeof(nvmeq->irqname), "nvme%dq%d",
@@ -1296,23 +1144,16 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
spin_lock_init(&nvmeq->q_lock);
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
- init_waitqueue_head(&nvmeq->sq_full);
- init_waitqueue_entry(&nvmeq->sq_cong_wait, nvme_thread);
- bio_list_init(&nvmeq->sq_cong);
- INIT_LIST_HEAD(&nvmeq->iod_bio);
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
nvmeq->q_depth = depth;
nvmeq->cq_vector = vector;
nvmeq->qid = qid;
nvmeq->q_suspended = 1;
dev->queue_count++;
- rcu_assign_pointer(dev->queues[qid], nvmeq);
+ dev->queues[qid] = nvmeq;

return nvmeq;

- free_sqdma:
- dma_free_coherent(dmadev, SQ_SIZE(depth), (void *)nvmeq->sq_cmds,
- nvmeq->sq_dma_addr);
free_cqdma:
dma_free_coherent(dmadev, CQ_SIZE(depth), (void *)nvmeq->cqes,
nvmeq->cq_dma_addr);
@@ -1335,15 +1176,12 @@ static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
{
struct nvme_dev *dev = nvmeq->dev;
- unsigned extra = nvme_queue_extra(nvmeq->q_depth);

nvmeq->sq_tail = 0;
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
- memset(nvmeq->cmdid_data, 0, extra);
memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
- nvme_cancel_ios(nvmeq, false);
nvmeq->q_suspended = 0;
dev->online_queues++;
}
@@ -1444,6 +1282,52 @@ static int nvme_shutdown_ctrl(struct nvme_dev *dev)
return 0;
}

+static struct blk_mq_ops nvme_mq_admin_ops = {
+ .queue_rq = nvme_admin_queue_rq,
+ .map_queue = blk_mq_map_queue,
+ .init_hctx = nvme_admin_init_hctx,
+ .init_request = nvme_admin_init_request,
+ .timeout = nvme_timeout,
+};
+
+static struct blk_mq_ops nvme_mq_ops = {
+ .queue_rq = nvme_queue_rq,
+ .map_queue = blk_mq_map_queue,
+ .init_hctx = nvme_init_hctx,
+ .init_request = nvme_init_request,
+ .timeout = nvme_timeout,
+};
+
+static int nvme_alloc_admin_tags(struct nvme_dev *dev)
+{
+ if (!dev->admin_q) {
+ dev->admin_tagset.ops = &nvme_mq_admin_ops;
+ dev->admin_tagset.nr_hw_queues = 1;
+ dev->admin_tagset.queue_depth = NVME_AQ_DEPTH;
+ dev->admin_tagset.timeout = ADMIN_TIMEOUT;
+ dev->admin_tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+ dev->admin_tagset.cmd_size = sizeof(struct nvme_cmd_info);
+ dev->admin_tagset.driver_data = dev;
+
+ if (blk_mq_alloc_tag_set(&dev->admin_tagset))
+ return -ENOMEM;
+
+ dev->admin_q = blk_mq_init_queue(&dev->admin_tagset);
+ if (!dev->admin_q) {
+ blk_mq_free_tag_set(&dev->admin_tagset);
+ return -ENOMEM;
+ }
+ }
+
+ return 0;
+}
+
+static void nvme_free_admin_tags(struct nvme_dev *dev)
+{
+ if (dev->admin_q)
+ blk_mq_free_tag_set(&dev->admin_tagset);
+}
+
static int nvme_configure_admin_queue(struct nvme_dev *dev)
{
int result;
@@ -1455,9 +1339,9 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
if (result < 0)
return result;

- nvmeq = raw_nvmeq(dev, 0);
+ nvmeq = dev->queues[0];
if (!nvmeq) {
- nvmeq = nvme_alloc_queue(dev, 0, 64, 0);
+ nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH, 0);
if (!nvmeq)
return -ENOMEM;
}
@@ -1477,16 +1361,26 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)

result = nvme_enable_ctrl(dev, cap);
if (result)
- return result;
+ goto free_nvmeq;
+
+ result = nvme_alloc_admin_tags(dev);
+ if (result)
+ goto free_nvmeq;

result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
if (result)
- return result;
+ goto free_tags;

spin_lock_irq(&nvmeq->q_lock);
nvme_init_queue(nvmeq, 0);
spin_unlock_irq(&nvmeq->q_lock);
return result;
+
+ free_tags:
+ nvme_free_admin_tags(dev);
+ free_nvmeq:
+ nvme_free_queues(dev, 0);
+ return result;
}

struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
@@ -1644,7 +1538,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
if (length != (io.nblocks + 1) << ns->lba_shift)
status = -ENOMEM;
else
- status = nvme_submit_io_cmd(dev, &c, NULL);
+ status = nvme_submit_io_cmd(dev, ns, &c, NULL);

if (meta_len) {
if (status == NVME_SC_SUCCESS && !(io.opcode & 1)) {
@@ -1716,10 +1610,11 @@ static int nvme_user_admin_cmd(struct nvme_dev *dev,

timeout = cmd.timeout_ms ? msecs_to_jiffies(cmd.timeout_ms) :
ADMIN_TIMEOUT;
+
if (length != cmd.data_len)
status = -ENOMEM;
else
- status = nvme_submit_sync_cmd(dev, 0, &c, &cmd.result, timeout);
+ status = __nvme_submit_admin_cmd(dev, &c, &cmd.result, timeout);

if (cmd.data_len) {
nvme_unmap_user_pages(dev, cmd.opcode & 1, iod);
@@ -1808,41 +1703,6 @@ static const struct block_device_operations nvme_fops = {
.getgeo = nvme_getgeo,
};

-static void nvme_resubmit_iods(struct nvme_queue *nvmeq)
-{
- struct nvme_iod *iod, *next;
-
- list_for_each_entry_safe(iod, next, &nvmeq->iod_bio, node) {
- if (unlikely(nvme_submit_iod(nvmeq, iod)))
- break;
- list_del(&iod->node);
- if (bio_list_empty(&nvmeq->sq_cong) &&
- list_empty(&nvmeq->iod_bio))
- remove_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- }
-}
-
-static void nvme_resubmit_bios(struct nvme_queue *nvmeq)
-{
- while (bio_list_peek(&nvmeq->sq_cong)) {
- struct bio *bio = bio_list_pop(&nvmeq->sq_cong);
- struct nvme_ns *ns = bio->bi_bdev->bd_disk->private_data;
-
- if (bio_list_empty(&nvmeq->sq_cong) &&
- list_empty(&nvmeq->iod_bio))
- remove_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- if (nvme_submit_bio_queue(nvmeq, ns, bio)) {
- if (!waitqueue_active(&nvmeq->sq_full))
- add_wait_queue(&nvmeq->sq_full,
- &nvmeq->sq_cong_wait);
- bio_list_add_head(&nvmeq->sq_cong, bio);
- break;
- }
- }
-}
-
static int nvme_kthread(void *data)
{
struct nvme_dev *dev, *next;
@@ -1858,28 +1718,23 @@ static int nvme_kthread(void *data)
continue;
list_del_init(&dev->node);
dev_warn(&dev->pci_dev->dev,
- "Failed status, reset controller\n");
+ "Failed status: %x, reset controller\n",
+ readl(&dev->bar->csts));
dev->reset_workfn = nvme_reset_failed_dev;
queue_work(nvme_workq, &dev->reset_work);
continue;
}
- rcu_read_lock();
for (i = 0; i < dev->queue_count; i++) {
- struct nvme_queue *nvmeq =
- rcu_dereference(dev->queues[i]);
+ struct nvme_queue *nvmeq = dev->queues[i];
if (!nvmeq)
continue;
spin_lock_irq(&nvmeq->q_lock);
if (nvmeq->q_suspended)
goto unlock;
nvme_process_cq(nvmeq);
- nvme_cancel_ios(nvmeq, true);
- nvme_resubmit_bios(nvmeq);
- nvme_resubmit_iods(nvmeq);
unlock:
spin_unlock_irq(&nvmeq->q_lock);
}
- rcu_read_unlock();
}
spin_unlock(&dev_list_lock);
schedule_timeout(round_jiffies_relative(HZ));
@@ -1902,27 +1757,30 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
{
struct nvme_ns *ns;
struct gendisk *disk;
+ int node = dev_to_node(&dev->pci_dev->dev);
int lbaf;

if (rt->attributes & NVME_LBART_ATTRIB_HIDE)
return NULL;

- ns = kzalloc(sizeof(*ns), GFP_KERNEL);
+ ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
if (!ns)
return NULL;
- ns->queue = blk_alloc_queue(GFP_KERNEL);
+ ns->queue = blk_mq_init_queue(&dev->tagset);
if (!ns->queue)
goto out_free_ns;
- ns->queue->queue_flags = QUEUE_FLAG_DEFAULT;
+ queue_flag_set_unlocked(QUEUE_FLAG_DEFAULT, ns->queue);
queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
- blk_queue_make_request(ns->queue, nvme_make_request);
+ queue_flag_set_unlocked(QUEUE_FLAG_SG_GAPS, ns->queue);
+ queue_flag_clear_unlocked(QUEUE_FLAG_IO_STAT, ns->queue);
ns->dev = dev;
ns->queue->queuedata = ns;

- disk = alloc_disk(0);
+ disk = alloc_disk_node(0, node);
if (!disk)
goto out_free_queue;
+
ns->ns_id = nsid;
ns->disk = disk;
lbaf = id->flbas & 0xf;
@@ -1931,6 +1789,8 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
if (dev->max_hw_sectors)
blk_queue_max_hw_sectors(ns->queue, dev->max_hw_sectors);
+ if (dev->stripe_size)
+ blk_queue_chunk_sectors(ns->queue, dev->stripe_size >> 9);
if (dev->vwc & NVME_CTRL_VWC_PRESENT)
blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);

@@ -1956,143 +1816,19 @@ static struct nvme_ns *nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid,
return NULL;
}

-static int nvme_find_closest_node(int node)
-{
- int n, val, min_val = INT_MAX, best_node = node;
-
- for_each_online_node(n) {
- if (n == node)
- continue;
- val = node_distance(node, n);
- if (val < min_val) {
- min_val = val;
- best_node = n;
- }
- }
- return best_node;
-}
-
-static void nvme_set_queue_cpus(cpumask_t *qmask, struct nvme_queue *nvmeq,
- int count)
-{
- int cpu;
- for_each_cpu(cpu, qmask) {
- if (cpumask_weight(nvmeq->cpu_mask) >= count)
- break;
- if (!cpumask_test_and_set_cpu(cpu, nvmeq->cpu_mask))
- *per_cpu_ptr(nvmeq->dev->io_queue, cpu) = nvmeq->qid;
- }
-}
-
-static void nvme_add_cpus(cpumask_t *mask, const cpumask_t *unassigned_cpus,
- const cpumask_t *new_mask, struct nvme_queue *nvmeq, int cpus_per_queue)
-{
- int next_cpu;
- for_each_cpu(next_cpu, new_mask) {
- cpumask_or(mask, mask, get_cpu_mask(next_cpu));
- cpumask_or(mask, mask, topology_thread_cpumask(next_cpu));
- cpumask_and(mask, mask, unassigned_cpus);
- nvme_set_queue_cpus(mask, nvmeq, cpus_per_queue);
- }
-}
-
static void nvme_create_io_queues(struct nvme_dev *dev)
{
- unsigned i, max;
+ unsigned i;

- max = min(dev->max_qid, num_online_cpus());
- for (i = dev->queue_count; i <= max; i++)
+ for (i = dev->queue_count; i <= dev->max_qid; i++)
if (!nvme_alloc_queue(dev, i, dev->q_depth, i - 1))
break;

- max = min(dev->queue_count - 1, num_online_cpus());
- for (i = dev->online_queues; i <= max; i++)
- if (nvme_create_queue(raw_nvmeq(dev, i), i))
+ for (i = dev->online_queues; i <= dev->queue_count - 1; i++)
+ if (nvme_create_queue(dev->queues[i], i))
break;
}

-/*
- * If there are fewer queues than online cpus, this will try to optimally
- * assign a queue to multiple cpus by grouping cpus that are "close" together:
- * thread siblings, core, socket, closest node, then whatever else is
- * available.
- */
-static void nvme_assign_io_queues(struct nvme_dev *dev)
-{
- unsigned cpu, cpus_per_queue, queues, remainder, i;
- cpumask_var_t unassigned_cpus;
-
- nvme_create_io_queues(dev);
-
- queues = min(dev->online_queues - 1, num_online_cpus());
- if (!queues)
- return;
-
- cpus_per_queue = num_online_cpus() / queues;
- remainder = queues - (num_online_cpus() - queues * cpus_per_queue);
-
- if (!alloc_cpumask_var(&unassigned_cpus, GFP_KERNEL))
- return;
-
- cpumask_copy(unassigned_cpus, cpu_online_mask);
- cpu = cpumask_first(unassigned_cpus);
- for (i = 1; i <= queues; i++) {
- struct nvme_queue *nvmeq = lock_nvmeq(dev, i);
- cpumask_t mask;
-
- cpumask_clear(nvmeq->cpu_mask);
- if (!cpumask_weight(unassigned_cpus)) {
- unlock_nvmeq(nvmeq);
- break;
- }
-
- mask = *get_cpu_mask(cpu);
- nvme_set_queue_cpus(&mask, nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- topology_thread_cpumask(cpu),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- topology_core_cpumask(cpu),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- cpumask_of_node(cpu_to_node(cpu)),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- cpumask_of_node(
- nvme_find_closest_node(
- cpu_to_node(cpu))),
- nvmeq, cpus_per_queue);
- if (cpus_weight(mask) < cpus_per_queue)
- nvme_add_cpus(&mask, unassigned_cpus,
- unassigned_cpus,
- nvmeq, cpus_per_queue);
-
- WARN(cpumask_weight(nvmeq->cpu_mask) != cpus_per_queue,
- "nvme%d qid:%d mis-matched queue-to-cpu assignment\n",
- dev->instance, i);
-
- irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
- nvmeq->cpu_mask);
- cpumask_andnot(unassigned_cpus, unassigned_cpus,
- nvmeq->cpu_mask);
- cpu = cpumask_next(cpu, unassigned_cpus);
- if (remainder && !--remainder)
- cpus_per_queue++;
- unlock_nvmeq(nvmeq);
- }
- WARN(cpumask_weight(unassigned_cpus), "nvme%d unassigned online cpus\n",
- dev->instance);
- i = 0;
- cpumask_andnot(unassigned_cpus, cpu_possible_mask, cpu_online_mask);
- for_each_cpu(cpu, unassigned_cpus)
- *per_cpu_ptr(dev->io_queue, cpu) = (i++ % queues) + 1;
- free_cpumask_var(unassigned_cpus);
-}
-
static int set_queue_count(struct nvme_dev *dev, int count)
{
int status;
@@ -2116,33 +1852,9 @@ static size_t db_bar_size(struct nvme_dev *dev, unsigned nr_io_queues)
return 4096 + ((nr_io_queues + 1) * 8 * dev->db_stride);
}

-static void nvme_cpu_workfn(struct work_struct *work)
-{
- struct nvme_dev *dev = container_of(work, struct nvme_dev, cpu_work);
- if (dev->initialized)
- nvme_assign_io_queues(dev);
-}
-
-static int nvme_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
-{
- struct nvme_dev *dev;
-
- switch (action) {
- case CPU_ONLINE:
- case CPU_DEAD:
- spin_lock(&dev_list_lock);
- list_for_each_entry(dev, &dev_list, node)
- schedule_work(&dev->cpu_work);
- spin_unlock(&dev_list_lock);
- break;
- }
- return NOTIFY_OK;
-}
-
static int nvme_setup_io_queues(struct nvme_dev *dev)
{
- struct nvme_queue *adminq = raw_nvmeq(dev, 0);
+ struct nvme_queue *adminq = dev->queues[0];
struct pci_dev *pdev = dev->pci_dev;
int result, i, vecs, nr_io_queues, size;

@@ -2201,7 +1913,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)

/* Free previously allocated queues that are no longer usable */
nvme_free_queues(dev, nr_io_queues + 1);
- nvme_assign_io_queues(dev);
+ nvme_create_io_queues(dev);

return 0;

@@ -2250,8 +1962,29 @@ static int nvme_dev_add(struct nvme_dev *dev)
if (ctrl->mdts)
dev->max_hw_sectors = 1 << (ctrl->mdts + shift - 9);
if ((pdev->vendor == PCI_VENDOR_ID_INTEL) &&
- (pdev->device == 0x0953) && ctrl->vs[3])
+ (pdev->device == 0x0953) && ctrl->vs[3]) {
+ unsigned int max_hw_sectors;
+
dev->stripe_size = 1 << (ctrl->vs[3] + shift);
+ max_hw_sectors = dev->stripe_size >> (shift - 9);
+ if (dev->max_hw_sectors) {
+ dev->max_hw_sectors = min(max_hw_sectors,
+ dev->max_hw_sectors);
+ } else
+ dev->max_hw_sectors = max_hw_sectors;
+ }
+
+ dev->tagset.ops = &nvme_mq_ops;
+ dev->tagset.nr_hw_queues = dev->online_queues - 1;
+ dev->tagset.timeout = NVME_IO_TIMEOUT;
+ dev->tagset.numa_node = dev_to_node(&dev->pci_dev->dev);
+ dev->tagset.queue_depth = min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH);
+ dev->tagset.cmd_size = sizeof(struct nvme_cmd_info);
+ dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
+ dev->tagset.driver_data = dev;
+
+ if (blk_mq_alloc_tag_set(&dev->tagset))
+ goto out;

id_ns = mem;
for (i = 1; i <= nn; i++) {
@@ -2401,7 +2134,8 @@ static int adapter_async_del_queue(struct nvme_queue *nvmeq, u8 opcode,
c.delete_queue.qid = cpu_to_le16(nvmeq->qid);

init_kthread_work(&nvmeq->cmdinfo.work, fn);
- return nvme_submit_admin_cmd_async(nvmeq->dev, &c, &nvmeq->cmdinfo);
+ return nvme_submit_admin_async_cmd(nvmeq->dev, &c, &nvmeq->cmdinfo,
+ ADMIN_TIMEOUT);
}

static void nvme_del_cq_work_handler(struct kthread_work *work)
@@ -2464,7 +2198,7 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
atomic_set(&dq.refcount, 0);
dq.worker = &worker;
for (i = dev->queue_count - 1; i > 0; i--) {
- struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+ struct nvme_queue *nvmeq = dev->queues[i];

if (nvme_suspend_queue(nvmeq))
continue;
@@ -2506,7 +2240,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)

if (!dev->bar || (dev->bar && readl(&dev->bar->csts) == -1)) {
for (i = dev->queue_count - 1; i >= 0; i--) {
- struct nvme_queue *nvmeq = raw_nvmeq(dev, i);
+ struct nvme_queue *nvmeq = dev->queues[i];
nvme_suspend_queue(nvmeq);
nvme_clear_queue(nvmeq);
}
@@ -2518,6 +2252,12 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
nvme_dev_unmap(dev);
}

+static void nvme_dev_remove_admin(struct nvme_dev *dev)
+{
+ if (dev->admin_q && !blk_queue_dying(dev->admin_q))
+ blk_cleanup_queue(dev->admin_q);
+}
+
static void nvme_dev_remove(struct nvme_dev *dev)
{
struct nvme_ns *ns;
@@ -2599,7 +2339,7 @@ static void nvme_free_dev(struct kref *kref)
struct nvme_dev *dev = container_of(kref, struct nvme_dev, kref);

nvme_free_namespaces(dev);
- free_percpu(dev->io_queue);
+ blk_mq_free_tag_set(&dev->tagset);
kfree(dev->queues);
kfree(dev->entry);
kfree(dev);
@@ -2726,7 +2466,7 @@ static void nvme_dev_reset(struct nvme_dev *dev)
{
nvme_dev_shutdown(dev);
if (nvme_dev_resume(dev)) {
- dev_err(&dev->pci_dev->dev, "Device failed to resume\n");
+ dev_warn(&dev->pci_dev->dev, "Device failed to resume\n");
kref_get(&dev->kref);
if (IS_ERR(kthread_run(nvme_remove_dead_ctrl, dev, "nvme%d",
dev->instance))) {
@@ -2751,28 +2491,28 @@ static void nvme_reset_workfn(struct work_struct *work)

static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
- int result = -ENOMEM;
+ int node, result = -ENOMEM;
struct nvme_dev *dev;

- dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ node = dev_to_node(&pdev->dev);
+ if (node == NUMA_NO_NODE)
+ set_dev_node(&pdev->dev, 0);
+
+ dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
if (!dev)
return -ENOMEM;
- dev->entry = kcalloc(num_possible_cpus(), sizeof(*dev->entry),
- GFP_KERNEL);
+ dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
+ GFP_KERNEL, node);
if (!dev->entry)
goto free;
- dev->queues = kcalloc(num_possible_cpus() + 1, sizeof(void *),
- GFP_KERNEL);
+ dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
+ GFP_KERNEL, node);
if (!dev->queues)
goto free;
- dev->io_queue = alloc_percpu(unsigned short);
- if (!dev->io_queue)
- goto free;

INIT_LIST_HEAD(&dev->namespaces);
dev->reset_workfn = nvme_reset_failed_dev;
INIT_WORK(&dev->reset_work, nvme_reset_workfn);
- INIT_WORK(&dev->cpu_work, nvme_cpu_workfn);
dev->pci_dev = pdev;
pci_set_drvdata(pdev, dev);
result = nvme_set_instance(dev);
@@ -2810,6 +2550,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)

remove:
nvme_dev_remove(dev);
+ nvme_dev_remove_admin(dev);
nvme_free_namespaces(dev);
shutdown:
nvme_dev_shutdown(dev);
@@ -2819,7 +2560,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
release:
nvme_release_instance(dev);
free:
- free_percpu(dev->io_queue);
kfree(dev->queues);
kfree(dev->entry);
kfree(dev);
@@ -2828,12 +2568,12 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)

static void nvme_reset_notify(struct pci_dev *pdev, bool prepare)
{
- struct nvme_dev *dev = pci_get_drvdata(pdev);
+ struct nvme_dev *dev = pci_get_drvdata(pdev);

- if (prepare)
- nvme_dev_shutdown(dev);
- else
- nvme_dev_resume(dev);
+ if (prepare)
+ nvme_dev_shutdown(dev);
+ else
+ nvme_dev_resume(dev);
}

static void nvme_shutdown(struct pci_dev *pdev)
@@ -2852,12 +2592,12 @@ static void nvme_remove(struct pci_dev *pdev)

pci_set_drvdata(pdev, NULL);
flush_work(&dev->reset_work);
- flush_work(&dev->cpu_work);
misc_deregister(&dev->miscdev);
nvme_dev_remove(dev);
nvme_dev_shutdown(dev);
+ nvme_dev_remove_admin(dev);
nvme_free_queues(dev, 0);
- rcu_barrier();
+ nvme_free_admin_tags(dev);
nvme_release_instance(dev);
nvme_release_prp_pools(dev);
kref_put(&dev->kref, nvme_free_dev);
@@ -2941,18 +2681,11 @@ static int __init nvme_init(void)
else if (result > 0)
nvme_major = result;

- nvme_nb.notifier_call = &nvme_cpu_notify;
- result = register_hotcpu_notifier(&nvme_nb);
- if (result)
- goto unregister_blkdev;
-
result = pci_register_driver(&nvme_driver);
if (result)
- goto unregister_hotcpu;
+ goto unregister_blkdev;
return 0;

- unregister_hotcpu:
- unregister_hotcpu_notifier(&nvme_nb);
unregister_blkdev:
unregister_blkdev(nvme_major, "nvme");
kill_workq:
diff --git a/drivers/block/nvme-scsi.c b/drivers/block/nvme-scsi.c
index a4cd6d6..52c0356 100644
--- a/drivers/block/nvme-scsi.c
+++ b/drivers/block/nvme-scsi.c
@@ -2105,7 +2105,7 @@ static int nvme_trans_do_nvme_io(struct nvme_ns *ns, struct sg_io_hdr *hdr,

nvme_offset += unit_num_blocks;

- nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
if (nvme_sc != NVME_SC_SUCCESS) {
nvme_unmap_user_pages(dev,
(is_write) ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
@@ -2658,7 +2658,7 @@ static int nvme_trans_start_stop(struct nvme_ns *ns, struct sg_io_hdr *hdr,
c.common.opcode = nvme_cmd_flush;
c.common.nsid = cpu_to_le32(ns->ns_id);

- nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);
res = nvme_trans_status_code(hdr, nvme_sc);
if (res)
goto out;
@@ -2686,7 +2686,7 @@ static int nvme_trans_synchronize_cache(struct nvme_ns *ns,
c.common.opcode = nvme_cmd_flush;
c.common.nsid = cpu_to_le32(ns->ns_id);

- nvme_sc = nvme_submit_io_cmd(ns->dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(ns->dev, ns, &c, NULL);

res = nvme_trans_status_code(hdr, nvme_sc);
if (res)
@@ -2894,7 +2894,7 @@ static int nvme_trans_unmap(struct nvme_ns *ns, struct sg_io_hdr *hdr,
c.dsm.nr = cpu_to_le32(ndesc - 1);
c.dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);

- nvme_sc = nvme_submit_io_cmd(dev, &c, NULL);
+ nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);
res = nvme_trans_status_code(hdr, nvme_sc);

dma_free_coherent(&dev->pci_dev->dev, ndesc * sizeof(*range),
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 2bf4031..299e6f5 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -19,6 +19,7 @@
#include <linux/pci.h>
#include <linux/miscdevice.h>
#include <linux/kref.h>
+#include <linux/blk-mq.h>

struct nvme_bar {
__u64 cap; /* Controller Capabilities */
@@ -70,8 +71,10 @@ extern unsigned char nvme_io_timeout;
*/
struct nvme_dev {
struct list_head node;
- struct nvme_queue __rcu **queues;
- unsigned short __percpu *io_queue;
+ struct nvme_queue **queues;
+ struct request_queue *admin_q;
+ struct blk_mq_tag_set tagset;
+ struct blk_mq_tag_set admin_tagset;
u32 __iomem *dbs;
struct pci_dev *pci_dev;
struct dma_pool *prp_page_pool;
@@ -90,7 +93,6 @@ struct nvme_dev {
struct miscdevice miscdev;
work_func_t reset_workfn;
struct work_struct reset_work;
- struct work_struct cpu_work;
char name[12];
char serial[20];
char model[40];
@@ -132,7 +134,6 @@ struct nvme_iod {
int offset; /* Of PRP list */
int nents; /* Used in scatterlist */
int length; /* Of data, in bytes */
- unsigned long start_time;
dma_addr_t first_dma;
struct list_head node;
struct scatterlist sg[0];
@@ -150,12 +151,14 @@ static inline u64 nvme_block_nr(struct nvme_ns *ns, sector_t sector)
*/
void nvme_free_iod(struct nvme_dev *dev, struct nvme_iod *iod);

-int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int , gfp_t);
+int nvme_setup_prps(struct nvme_dev *, struct nvme_iod *, int, gfp_t);
struct nvme_iod *nvme_map_user_pages(struct nvme_dev *dev, int write,
unsigned long addr, unsigned length);
void nvme_unmap_user_pages(struct nvme_dev *dev, int write,
struct nvme_iod *iod);
-int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_command *, u32 *);
+int nvme_submit_io_cmd(struct nvme_dev *, struct nvme_ns *,
+ struct nvme_command *, u32 *);
+int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns);
int nvme_submit_admin_cmd(struct nvme_dev *, struct nvme_command *,
u32 *result);
int nvme_identify(struct nvme_dev *, unsigned nsid, unsigned cns,
--
1.9.1

2014-10-08 16:14:05

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH 4/5] lightnvm: NVMe integration

On Wed, 8 Oct 2014, Matias Bjørling wrote:
> NVMe devices are identified by the vendor specific bits:
>
> Bit 3 in OACS (device-wide). Currently made per device, as the nvme
> namespace is missing in the completion path.

The NVM Express 1.2 spec actually defined this bit for Namespace
Management, so I don't think we can use bits the spec marked as
"reserved". Maybe you can trigger off a vendor-specific Command Set
Supported bit instead.
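
For context, a minimal sketch (not part of the posted patch) of the
capability check being discussed, assuming the struct nvme_id_ctrl
layout from include/linux/nvme.h. Treating OACS bit 3 as a LightNVM
marker is exactly the vendor-specific assumption questioned above,
since NVMe 1.2 assigns that bit to Namespace Management:

#include <linux/nvme.h>

/*
 * Illustrative only: test the OACS word from the Identify Controller
 * data. The "LightNVM" meaning of bit 3 is an assumption made by the
 * patch set, not something defined by the NVMe specification.
 */
#define NVME_OACS_LIGHTNVM_BIT	(1 << 3)	/* clashes with 1.2 NS management */

static bool nvme_id_ctrl_claims_lightnvm(struct nvme_id_ctrl *ctrl)
{
	return le16_to_cpu(ctrl->oacs) & NVME_OACS_LIGHTNVM_BIT;
}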