2015-06-05 12:54:53

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v4 0/8] Support for Open-Channel SSDs

Hi,

This is an updated version based on the feedback from Christoph.

Patches 1-2 are fixes and preparation for the nvme driver. The first
allows rq->special in nvme_submit_sync_cmd to be set and used. The
second fixes a flag bug.
Patch 3 fixes capacity reporting in null_blk.
Patches 4-8 introduce LightNVM and a simple target.

Jens, the patches are on top of your for-next tree.

Development and further information on LightNVM can be found at:

https://github.com/OpenChannelSSD/linux

Changes since v3:

- Remove dependence on REQ_NVM_GC
- Refactor nvme integration to use nvme_submit_sync_cmd for
internal commands.
- Fix race condition bug on multiple threads on RRPC target.
- Rename sysfs entry under the block device from nvm to lightnvm.
The configuration is found in /sys/block/*/lightnvm/

Changes since v2:

Feedback from Paul Bolle:
- Fix license to GPLv2, documentation, compilation.
Feedback from Keith Busch:
- nvme: Move lightnvm out and into nvme-lightnvm.c.
- nvme: Set controller css on lightnvm command set.
- nvme: Remove OACS.
Feedback from Christoph Hellwig:
- lightnvm: Move out of block layer into /drivers/lightnvm/core.c
- lightnvm: refactor request->phys_sector into device drivers.
- lightnvm: refactor prep/unprep into device drivers.
- lightnvm: move nvm_dev from request_queue to gendisk.

New
- Bad block table support (From Javier).
- Update maintainers file.

Changes since v1:

- Split LightNVM into two parts: a get/put interface for flash
blocks and the respective targets that implement flash translation
layer logic.
- Updated the patches according to the LightNVM specification changes.
- Added interface to add/remove targets for a block device.

Matias Bjørling (8):
nvme: add special param for nvme_submit_sync_cmd
nvme: don't overwrite req->cmd_flags on sync cmd
null_blk: wrong capacity when bs is not 512 bytes
bio: Introduce LightNVM payload
lightnvm: Support for Open-Channel SSDs
lightnvm: RRPC target
null_blk: LightNVM support
nvme: LightNVM support

Documentation/block/null_blk.txt | 8 +
MAINTAINERS | 9 +
drivers/Kconfig | 2 +
drivers/Makefile | 2 +
drivers/block/Makefile | 2 +-
drivers/block/null_blk.c | 136 ++++-
drivers/block/nvme-core.c | 134 ++++-
drivers/block/nvme-lightnvm.c | 320 +++++++++++
drivers/block/nvme-scsi.c | 4 +-
drivers/lightnvm/Kconfig | 26 +
drivers/lightnvm/Makefile | 6 +
drivers/lightnvm/core.c | 833 +++++++++++++++++++++++++++++
drivers/lightnvm/rrpc.c | 1088 ++++++++++++++++++++++++++++++++++++++
drivers/lightnvm/rrpc.h | 221 ++++++++
include/linux/bio.h | 9 +
include/linux/blk_types.h | 4 +-
include/linux/blkdev.h | 2 +
include/linux/genhd.h | 3 +
include/linux/lightnvm.h | 379 +++++++++++++
include/linux/nvme.h | 11 +-
include/uapi/linux/nvme.h | 131 +++++
21 files changed, 3300 insertions(+), 30 deletions(-)
create mode 100644 drivers/block/nvme-lightnvm.c
create mode 100644 drivers/lightnvm/Kconfig
create mode 100644 drivers/lightnvm/Makefile
create mode 100644 drivers/lightnvm/core.c
create mode 100644 drivers/lightnvm/rrpc.c
create mode 100644 drivers/lightnvm/rrpc.h
create mode 100644 include/linux/lightnvm.h

--
2.1.4


2015-06-05 12:57:33

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v4 1/8] nvme: add special param for nvme_submit_sync_cmd

In preparation for LightNVM, which requires a hook for internal commands
to resolve their state on command completion, allow callers to pass the
hook through req->special and move the command result to req->sense.
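
For reference, here is a quick sketch (not part of this patch) of how a
caller might use the new parameter; the helper name and cookie are made
up for illustration. The cookie ends up in req->special for the lifetime
of the request, so driver hooks that later see the request can recover
the issuing context (the LightNVM core reads it back as a struct
nvm_internal_cmd * in nvm_prep_rq()), while the 32-bit completion result
now comes back through req->sense:

#include <linux/blkdev.h>
#include <linux/nvme.h>

static int example_submit_with_cookie(struct request_queue *q,
                                      struct nvme_command *cmd,
                                      void *buf, unsigned bufflen,
                                      void *cookie)
{
        u32 result;

        /* 'cookie' is stored in req->special by __nvme_submit_sync_cmd */
        return __nvme_submit_sync_cmd(q, cmd, buf, NULL, bufflen,
                                      &result, 0, cookie);
}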

Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/block/nvme-core.c | 19 ++++++++++---------
drivers/block/nvme-scsi.c | 4 ++--
include/linux/nvme.h | 2 +-
3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 6ed1356..d2955fe 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -614,7 +614,7 @@ static void req_completion(struct nvme_queue *nvmeq, void *ctx,
req->errors = 0;
if (req->cmd_type == REQ_TYPE_DRV_PRIV) {
u32 result = le32_to_cpup(&cqe->result);
- req->special = (void *)(uintptr_t)result;
+ req->sense = (void *)(uintptr_t)result;
}

if (cmd_rq->aborted)
@@ -998,7 +998,7 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
*/
int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
void *buffer, void __user *ubuffer, unsigned bufflen,
- u32 *result, unsigned timeout)
+ u32 *result, unsigned timeout, void *special)
{
bool write = cmd->common.opcode & 1;
struct bio *bio = NULL;
@@ -1019,7 +1019,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,

req->cmd = (unsigned char *)cmd;
req->cmd_len = sizeof(struct nvme_command);
- req->special = (void *)0;
+ req->special = special;

if (buffer && bufflen) {
ret = blk_rq_map_kern(q, req, buffer, bufflen, __GFP_WAIT);
@@ -1036,7 +1036,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
if (bio)
blk_rq_unmap_user(bio);
if (result)
- *result = (u32)(uintptr_t)req->special;
+ *result = (u32)(uintptr_t)req->sense;
ret = req->errors;
out:
blk_mq_free_request(req);
@@ -1046,7 +1046,8 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
void *buffer, unsigned bufflen)
{
- return __nvme_submit_sync_cmd(q, cmd, buffer, NULL, bufflen, NULL, 0);
+ return __nvme_submit_sync_cmd(q, cmd, buffer, NULL, bufflen, NULL, 0,
+ NULL);
}

static int nvme_submit_async_admin_req(struct nvme_dev *dev)
@@ -1209,7 +1210,7 @@ int nvme_get_features(struct nvme_dev *dev, unsigned fid, unsigned nsid,
c.features.fid = cpu_to_le32(fid);

return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, NULL, 0,
- result, 0);
+ result, 0, NULL);
}

int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
@@ -1224,7 +1225,7 @@ int nvme_set_features(struct nvme_dev *dev, unsigned fid, unsigned dword11,
c.features.dword11 = cpu_to_le32(dword11);

return __nvme_submit_sync_cmd(dev->admin_q, &c, NULL, NULL, 0,
- result, 0);
+ result, 0, NULL);
}

int nvme_get_log_page(struct nvme_dev *dev, struct nvme_smart_log **log)
@@ -1787,7 +1788,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
c.rw.metadata = cpu_to_le64(meta_dma);

status = __nvme_submit_sync_cmd(ns->queue, &c, NULL,
- (void __user *)io.addr, length, NULL, 0);
+ (void __user *)io.addr, length, NULL, 0, NULL);
unmap:
if (meta) {
if (status == NVME_SC_SUCCESS && !write) {
@@ -1831,7 +1832,7 @@ static int nvme_user_cmd(struct nvme_dev *dev, struct nvme_ns *ns,

status = __nvme_submit_sync_cmd(ns ? ns->queue : dev->admin_q, &c,
NULL, (void __user *)cmd.addr, cmd.data_len,
- &cmd.result, timeout);
+ &cmd.result, timeout, NULL);
if (status >= 0) {
if (put_user(cmd.result, &ucmd->result))
return -EFAULT;
diff --git a/drivers/block/nvme-scsi.c b/drivers/block/nvme-scsi.c
index 8e6223e..ad35bb7 100644
--- a/drivers/block/nvme-scsi.c
+++ b/drivers/block/nvme-scsi.c
@@ -1297,7 +1297,7 @@ static int nvme_trans_send_download_fw_cmd(struct nvme_ns *ns, struct sg_io_hdr
c.dlfw.offset = cpu_to_le32(offset/BYTES_TO_DWORDS);

nvme_sc = __nvme_submit_sync_cmd(dev->admin_q, &c, NULL,
- hdr->dxferp, tot_len, NULL, 0);
+ hdr->dxferp, tot_len, NULL, 0, NULL);
return nvme_trans_status_code(hdr, nvme_sc);
}

@@ -1704,7 +1704,7 @@ static int nvme_trans_do_nvme_io(struct nvme_ns *ns, struct sg_io_hdr *hdr,
break;
}
nvme_sc = __nvme_submit_sync_cmd(ns->queue, &c, NULL,
- next_mapping_addr, unit_len, NULL, 0);
+ next_mapping_addr, unit_len, NULL, 0, NULL);
if (nvme_sc)
break;

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 986bf8a..fce2090 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -150,7 +150,7 @@ int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
void *buf, unsigned bufflen);
int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
void *buffer, void __user *ubuffer, unsigned bufflen,
- u32 *result, unsigned timeout);
+ u32 *result, unsigned timeout, void *special);
int nvme_identify_ctrl(struct nvme_dev *dev, struct nvme_id_ctrl **id);
int nvme_identify_ns(struct nvme_dev *dev, unsigned nsid,
struct nvme_id_ns **id);
--
2.1.4

2015-06-05 12:56:56

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v4 2/8] nvme: don't overwrite req->cmd_flags on sync cmd

In __nvme_submit_sync_cmd, the request direction set at allocation time
is lost when cmd_flags is overwritten with REQ_FAILFAST_DRIVER. OR the
flag in instead, so the existing flags are preserved.

Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/block/nvme-core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index d2955fe..6e433b1 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -1010,7 +1010,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
return PTR_ERR(req);

req->cmd_type = REQ_TYPE_DRV_PRIV;
- req->cmd_flags = REQ_FAILFAST_DRIVER;
+ req->cmd_flags |= REQ_FAILFAST_DRIVER;
req->__data_len = 0;
req->__sector = (sector_t) -1;
req->bio = req->biotail = NULL;
--
2.1.4

2015-06-05 12:54:58

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v4 3/8] null_blk: wrong capacity when bs is not 512 bytes

set_capacity() expects the device capacity as a number of 512-byte
sectors. null_blk calculates the number of sectors as size / bs and
passes that to set_capacity(), which leads to the wrong capacity when
bs is not 512 bytes.
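
For example, with made-up parameters gb = 4 and bs = 4096:

  old: sectors = size / bs = 4294967296 / 4096 = 1048576  -> reported as 512 MB
  new: sectors = size >> 9 = 4294967296 >> 9   = 8388608  -> reported as 4 GB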

Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/block/null_blk.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 65cd61a..79972ab 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -576,8 +576,7 @@ static int null_add_dev(void)
blk_queue_physical_block_size(nullb->q, bs);

size = gb * 1024 * 1024 * 1024ULL;
- sector_div(size, bs);
- set_capacity(disk, size);
+ set_capacity(disk, size >> 9);

disk->flags |= GENHD_FL_EXT_DEVT | GENHD_FL_SUPPRESS_PARTITION_INFO;
disk->major = null_major;
--
2.1.4

2015-06-05 12:56:38

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v4 4/8] bio: Introduce LightNVM payload

LightNVM integrates on both sides of the block layer. The lower layer
implements mapping of logical to physical addressing, while the layer
above can string together multiple LightNVM devices and expose them as a
single block device.

Having multiple devices underneath requires a way to resolve where the
IO came from when submitted through the block layer. Extending bio with
a LightNVM payload solves this problem.
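
As an illustrative sketch only (struct example_target is hypothetical;
patch 5 wraps the payload in struct nvm_target_instance for this
purpose): a target embeds the payload in its per-instance data, points
bio->bi_nvm at it before submission, and the lower layer recovers the
owner with container_of() on the way down.

#include <linux/bio.h>
#include <linux/blkdev.h>

struct example_target {
        struct bio_nvm_payload payload; /* bio->bi_nvm points here */
        /* ... target-private state ... */
};

static void example_submit_bio(struct example_target *t, struct bio *bio)
{
        bio->bi_nvm = &t->payload;
        generic_make_request(bio);
}

/* device-driver side, e.g. in a prep hook */
static struct example_target *example_owner(struct bio *bio)
{
        return container_of(bio->bi_nvm, struct example_target, payload);
}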

Signed-off-by: Matias Bjørling <[email protected]>
---
include/linux/bio.h | 9 +++++++++
include/linux/blk_types.h | 4 +++-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index f0291cf..1fe490e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -368,6 +368,15 @@ static inline void bip_set_seed(struct bio_integrity_payload *bip,

#endif /* CONFIG_BLK_DEV_INTEGRITY */

+#if defined(CONFIG_NVM)
+
+/* bio open-channel ssd payload */
+struct bio_nvm_payload {
+ void *private;
+};
+
+#endif /* CONFIG_NVM */
+
extern void bio_trim(struct bio *bio, int offset, int size);
extern struct bio *bio_split(struct bio *bio, int sectors,
gfp_t gfp, struct bio_set *bs);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 45a6be8..41d9ba1 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -83,7 +83,9 @@ struct bio {
struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif
};
-
+#if defined(CONFIG_NVM)
+ struct bio_nvm_payload *bi_nvm; /* open-channel ssd backend */
+#endif
unsigned short bi_vcnt; /* how many bio_vec's */

/*
--
2.1.4

2015-06-05 12:55:09

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v4 5/8] lightnvm: Support for Open-Channel SSDs

Open-channel SSDs are devices that share responsibilities with the host
in order to implement and maintain features that typical SSDs keep
strictly in firmware. These include (i) the Flash Translation Layer
(FTL), (ii) bad block management, and (iii) hardware units such as the
flash controller, the interface controller, and large numbers of flash
chips. In this way, open-channel SSDs expose direct access to their
physical flash storage, while keeping a subset of the internal features
of SSDs.

LightNVM is a specification that adds support for open-channel SSDs.
It allows the host to manage data placement, garbage collection,
and parallelism. Device-specific responsibilities such as bad block
management, FTL extensions to support atomic IOs, or metadata
persistence are still handled by the device.

The implementation of LightNVM consists of two parts: core and
(multiple) targets. The core implements functionality shared across
targets, such as initialization, teardown, and statistics. The targets
implement the interface that exposes physical flash to user-space
applications. Examples of such targets include key-value stores,
object stores, and traditional block devices, which can be
application-specific.
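
For orientation, a skeletal (non-functional) target as the core sees it;
all example_* names are invented here, the hook signatures follow
include/linux/lightnvm.h below, and a real target (rrpc in the next
patch) supplies the actual FTL logic:

#include <linux/module.h>
#include <linux/err.h>
#include <linux/lightnvm.h>

static void example_make_rq(struct request_queue *q, struct bio *bio)
{
        bio_io_error(bio);              /* placeholder: no FTL logic here */
}

static int example_prep_rq(struct request *rq, struct nvm_rq_data *rqdata,
                           void *private)
{
        return NVM_PREP_ERROR;          /* placeholder */
}

static void example_unprep_rq(struct request *rq, struct nvm_rq_data *rqdata,
                              void *private)
{
}

static sector_t example_capacity(void *private)
{
        return 0;
}

static void *example_init(struct gendisk *bdisk, struct gendisk *tdisk,
                          int lun_begin, int lun_end)
{
        return ERR_PTR(-ENOSYS);        /* a real target allocates its state here */
}

static void example_exit(void *private)
{
}

static struct nvm_target_type tt_example = {
        .name           = "example",
        .version        = {1, 0, 0},
        .make_rq        = example_make_rq,
        .prep_rq        = example_prep_rq,
        .unprep_rq      = example_unprep_rq,
        .capacity       = example_capacity,
        .init           = example_init,
        .exit           = example_exit,
};

static int __init example_module_init(void)
{
        return nvm_register_target(&tt_example);
}

static void __exit example_module_exit(void)
{
        nvm_unregister_target(&tt_example);
}

module_init(example_module_init);
module_exit(example_module_exit);
MODULE_LICENSE("GPL");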

Contributions in this patch from:

Javier Gonzalez <[email protected]>
Jesper Madsen <[email protected]>

Signed-off-by: Matias Bjørling <[email protected]>
---
MAINTAINERS | 9 +
drivers/Kconfig | 2 +
drivers/Makefile | 2 +
drivers/lightnvm/Kconfig | 16 +
drivers/lightnvm/Makefile | 5 +
drivers/lightnvm/core.c | 833 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/genhd.h | 3 +
include/linux/lightnvm.h | 379 +++++++++++++++++++++
8 files changed, 1249 insertions(+)
create mode 100644 drivers/lightnvm/Kconfig
create mode 100644 drivers/lightnvm/Makefile
create mode 100644 drivers/lightnvm/core.c
create mode 100644 include/linux/lightnvm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 781e099..c4119c4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5903,6 +5903,15 @@ M: Sasha Levin <[email protected]>
S: Maintained
F: tools/lib/lockdep/

+LIGHTNVM PLATFORM SUPPORT
+M: Matias Bjorling <[email protected]>
+M: Javier Gonzalez <[email protected]>
+W: http://github/OpenChannelSSD
+S: Maintained
+F: drivers/lightnvm/
+F: include/linux/lightnvm.h
+F: include/uapi/linux/lightnvm.h
+
LINUX FOR IBM pSERIES (RS/6000)
M: Paul Mackerras <[email protected]>
W: http://www.ibm.com/linux/ltc/projects/ppc
diff --git a/drivers/Kconfig b/drivers/Kconfig
index c0cc96b..da47047 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -42,6 +42,8 @@ source "drivers/net/Kconfig"

source "drivers/isdn/Kconfig"

+source "drivers/lightnvm/Kconfig"
+
# input before char - char/joystick depends on it. As does USB.

source "drivers/input/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 46d2554..2629be2 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -165,3 +165,5 @@ obj-$(CONFIG_RAS) += ras/
obj-$(CONFIG_THUNDERBOLT) += thunderbolt/
obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/
obj-$(CONFIG_ANDROID) += android/
+
+obj-$(CONFIG_NVM) += lightnvm/
diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
new file mode 100644
index 0000000..1f8412c
--- /dev/null
+++ b/drivers/lightnvm/Kconfig
@@ -0,0 +1,16 @@
+#
+# Open-Channel SSD NVM configuration
+#
+
+menuconfig NVM
+ bool "Open-Channel SSD target support"
+ depends on BLOCK
+ help
+ Say Y here to enable support for Open-channel SSDs.
+
+ Open-Channel SSDs implement a set of extensions to SSDs that
+ expose direct access to the underlying non-volatile memory.
+
+ If you say N, all options in this submenu will be skipped and disabled;
+ only do this if you know what you are doing.
+
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
new file mode 100644
index 0000000..38185e9
--- /dev/null
+++ b/drivers/lightnvm/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for Open-Channel SSDs.
+#
+
+obj-$(CONFIG_NVM) := core.o
diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
new file mode 100644
index 0000000..3fc1d9c
--- /dev/null
+++ b/drivers/lightnvm/core.c
@@ -0,0 +1,833 @@
+/*
+ * core.c - Open-channel SSD integration core
+ *
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ */
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/sem.h>
+#include <linux/bitmap.h>
+
+#include <linux/lightnvm.h>
+
+static LIST_HEAD(_targets);
+static DECLARE_RWSEM(_lock);
+
+struct nvm_target_type *nvm_find_target_type(const char *name)
+{
+ struct nvm_target_type *tt;
+
+ list_for_each_entry(tt, &_targets, list)
+ if (!strcmp(name, tt->name))
+ return tt;
+
+ return NULL;
+}
+
+int nvm_register_target(struct nvm_target_type *tt)
+{
+ int ret = 0;
+
+ down_write(&_lock);
+ if (nvm_find_target_type(tt->name))
+ ret = -EEXIST;
+ else
+ list_add(&tt->list, &_targets);
+ up_write(&_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL(nvm_register_target);
+
+void nvm_unregister_target(struct nvm_target_type *tt)
+{
+ if (!tt)
+ return;
+
+ down_write(&_lock);
+ list_del(&tt->list);
+ up_write(&_lock);
+}
+EXPORT_SYMBOL(nvm_unregister_target);
+
+static void nvm_reset_block(struct nvm_lun *lun, struct nvm_block *block)
+{
+ spin_lock(&block->lock);
+ bitmap_zero(block->invalid_pages, lun->nr_pages_per_blk);
+ block->next_page = 0;
+ block->nr_invalid_pages = 0;
+ atomic_set(&block->data_cmnt_size, 0);
+ spin_unlock(&block->lock);
+}
+
+/* use nvm_lun_[get/put]_block to administer the blocks in use for each lun.
+ * Whenever a block is in use by an append point, we store it within the
+ * used_list. We then move it back when it's free to be used by another
+ * append point.
+ *
+ * The newly claimed block is always added to the back of used_list, as we
+ * assume that the start of the used list is the oldest block and therefore
+ * the most likely to contain invalidated pages.
+ */
+struct nvm_block *nvm_get_blk(struct nvm_lun *lun, int is_gc)
+{
+ struct nvm_block *block = NULL;
+
+ BUG_ON(!lun);
+
+ spin_lock(&lun->lock);
+
+ if (list_empty(&lun->free_list)) {
+ pr_err_ratelimited("nvm: lun %u have no free pages available",
+ lun->id);
+ spin_unlock(&lun->lock);
+ goto out;
+ }
+
+ while (!is_gc && lun->nr_free_blocks < lun->reserved_blocks) {
+ spin_unlock(&lun->lock);
+ goto out;
+ }
+
+ block = list_first_entry(&lun->free_list, struct nvm_block, list);
+ list_move_tail(&block->list, &lun->used_list);
+
+ lun->nr_free_blocks--;
+
+ spin_unlock(&lun->lock);
+
+ nvm_reset_block(lun, block);
+
+out:
+ return block;
+}
+EXPORT_SYMBOL(nvm_get_blk);
+
+/* We assume that all valid pages have already been moved when the block is
+ * added back to the free list. We add it last to allow round-robin use of all
+ * blocks, thereby providing simple (naive) wear-leveling.
+ */
+void nvm_put_blk(struct nvm_block *block)
+{
+ struct nvm_lun *lun = block->lun;
+
+ spin_lock(&lun->lock);
+
+ list_move_tail(&block->list, &lun->free_list);
+ lun->nr_free_blocks++;
+
+ spin_unlock(&lun->lock);
+}
+EXPORT_SYMBOL(nvm_put_blk);
+
+sector_t nvm_alloc_addr(struct nvm_block *block)
+{
+ sector_t addr = ADDR_EMPTY;
+
+ spin_lock(&block->lock);
+ if (block_is_full(block))
+ goto out;
+
+ addr = block_to_addr(block) + block->next_page;
+
+ block->next_page++;
+out:
+ spin_unlock(&block->lock);
+ return addr;
+}
+EXPORT_SYMBOL(nvm_alloc_addr);
+
+int nvm_internal_rw(struct nvm_dev *dev, struct nvm_internal_cmd *cmd)
+{
+ if (!dev->ops->internal_rw)
+ return 0;
+
+ return dev->ops->internal_rw(dev->q, cmd);
+}
+EXPORT_SYMBOL(nvm_internal_rw);
+
+/* Send erase command to device */
+int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *block)
+{
+ if (!dev->ops->erase_block)
+ return 0;
+
+ return dev->ops->erase_block(dev->q, block->id);
+}
+EXPORT_SYMBOL(nvm_erase_blk);
+
+static void nvm_blocks_free(struct nvm_dev *dev)
+{
+ struct nvm_lun *lun;
+ int i;
+
+ nvm_for_each_lun(dev, lun, i) {
+ if (!lun->blocks)
+ break;
+ vfree(lun->blocks);
+ }
+}
+
+static void nvm_luns_free(struct nvm_dev *dev)
+{
+ kfree(dev->luns);
+}
+
+static int nvm_luns_init(struct nvm_dev *dev)
+{
+ struct nvm_lun *lun;
+ struct nvm_id_chnl *chnl;
+ int i;
+
+ dev->luns = kcalloc(dev->nr_luns, sizeof(struct nvm_lun), GFP_KERNEL);
+ if (!dev->luns)
+ return -ENOMEM;
+
+ nvm_for_each_lun(dev, lun, i) {
+ chnl = &dev->identity.chnls[i];
+ pr_info("nvm: p %u qsize %u gr %u ge %u begin %llu end %llu\n",
+ i, chnl->queue_size, chnl->gran_read, chnl->gran_erase,
+ chnl->laddr_begin, chnl->laddr_end);
+
+ spin_lock_init(&lun->lock);
+
+ INIT_LIST_HEAD(&lun->free_list);
+ INIT_LIST_HEAD(&lun->used_list);
+ INIT_LIST_HEAD(&lun->bb_list);
+
+ lun->id = i;
+ lun->dev = dev;
+ lun->chnl = chnl;
+ lun->reserved_blocks = 2; /* for GC only */
+ lun->nr_blocks =
+ (chnl->laddr_end - chnl->laddr_begin + 1) /
+ (chnl->gran_erase / chnl->gran_read);
+ lun->nr_free_blocks = lun->nr_blocks;
+ lun->nr_pages_per_blk = chnl->gran_erase / chnl->gran_write *
+ (chnl->gran_write / dev->sector_size);
+
+ dev->total_pages += lun->nr_blocks * lun->nr_pages_per_blk;
+ dev->total_blocks += lun->nr_blocks;
+
+ if (lun->nr_pages_per_blk >
+ MAX_INVALID_PAGES_STORAGE * BITS_PER_LONG) {
+ pr_err("nvm: number of pages per block too high.");
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+static int nvm_block_bb(u32 lun_id, void *bb_bitmap, unsigned int nr_blocks,
+ void *private)
+{
+ struct nvm_dev *dev = private;
+ struct nvm_lun *lun = &dev->luns[lun_id];
+ struct nvm_block *block;
+ int i;
+
+ if (unlikely(bitmap_empty(bb_bitmap, nr_blocks)))
+ return 0;
+
+ i = -1;
+ while ((i = find_next_bit(bb_bitmap, nr_blocks, i + 1)) <
+ nr_blocks) {
+ block = &lun->blocks[i];
+ if (!block) {
+ pr_err("nvm: BB data is out of bounds!\n");
+ return -EINVAL;
+ }
+ list_move_tail(&block->list, &lun->bb_list);
+ }
+
+ return 0;
+}
+
+static int nvm_block_map(u64 slba, u64 nlb, u64 *entries, void *private)
+{
+ struct nvm_dev *dev = private;
+ sector_t max_pages = dev->total_pages * (dev->sector_size >> 9);
+ u64 elba = slba + nlb;
+ struct nvm_lun *lun;
+ struct nvm_block *blk;
+ sector_t total_pgs_per_lun = /* each lun have the same configuration */
+ dev->luns[0].nr_blocks * dev->luns[0].nr_pages_per_blk;
+ u64 i;
+ int lun_id;
+
+ if (unlikely(elba > dev->total_pages)) {
+ pr_err("nvm: L2P data from device is out of bounds!\n");
+ return -EINVAL;
+ }
+
+ for (i = 0; i < nlb; i++) {
+ u64 pba = le64_to_cpu(entries[i]);
+
+ if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+ pr_err("nvm: L2P data entry is out of bounds!\n");
+ return -EINVAL;
+ }
+
+ /* Address zero is special. The first page on a disk is
+ * protected, as it often holds internal device boot
+ * information. */
+ if (!pba)
+ continue;
+
+ /* resolve block from physical address */
+ lun_id = pba / total_pgs_per_lun;
+ lun = &dev->luns[lun_id];
+
+ /* Calculate block offset into lun */
+ pba = pba - (total_pgs_per_lun * lun_id);
+ blk = &lun->blocks[pba / lun->nr_pages_per_blk];
+
+ if (!blk->type) {
+ /* at this point, we don't know anything about the
+ * block. It's up to the FTL on top to re-establish the
+ * block state */
+ list_move_tail(&blk->list, &lun->used_list);
+ blk->type = 1;
+ lun->nr_free_blocks--;
+ }
+ }
+
+ return 0;
+}
+
+static int nvm_blocks_init(struct nvm_dev *dev)
+{
+ struct nvm_lun *lun;
+ struct nvm_block *block;
+ sector_t lun_iter, block_iter, cur_block_id = 0;
+ int ret;
+
+ nvm_for_each_lun(dev, lun, lun_iter) {
+ lun->blocks = vzalloc(sizeof(struct nvm_block) *
+ lun->nr_blocks);
+ if (!lun->blocks)
+ return -ENOMEM;
+
+ lun_for_each_block(lun, block, block_iter) {
+ spin_lock_init(&block->lock);
+ INIT_LIST_HEAD(&block->list);
+
+ block->lun = lun;
+ block->id = cur_block_id++;
+
+ /* First block is reserved for device */
+ if (unlikely(lun_iter == 0 && block_iter == 0))
+ continue;
+
+ list_add_tail(&block->list, &lun->free_list);
+ }
+
+ if (dev->ops->get_bb_tbl) {
+ ret = dev->ops->get_bb_tbl(dev->q, lun->id,
+ lun->nr_blocks, nvm_block_bb, dev);
+ if (ret)
+ pr_err("nvm: could not read BB table\n");
+ }
+ }
+
+ if (dev->ops->get_l2p_tbl) {
+ ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+ nvm_block_map, dev);
+ if (ret) {
+ pr_err("nvm: could not read L2P table.\n");
+ pr_warn("nvm: default block initialization");
+ }
+ }
+
+ return 0;
+}
+
+static void nvm_core_free(struct nvm_dev *dev)
+{
+ kfree(dev->identity.chnls);
+ kfree(dev);
+}
+
+static int nvm_core_init(struct nvm_dev *dev, int max_qdepth)
+{
+ dev->nr_luns = dev->identity.nchannels;
+ dev->sector_size = EXPOSED_PAGE_SIZE;
+ INIT_LIST_HEAD(&dev->online_targets);
+
+ return 0;
+}
+
+static void nvm_free(struct nvm_dev *dev)
+{
+ if (!dev)
+ return;
+
+ nvm_blocks_free(dev);
+ nvm_luns_free(dev);
+ nvm_core_free(dev);
+}
+
+int nvm_validate_features(struct nvm_dev *dev)
+{
+ struct nvm_get_features gf;
+ int ret;
+
+ ret = dev->ops->get_features(dev->q, &gf);
+ if (ret)
+ return ret;
+
+ /* Only default configuration is supported.
+ * I.e. L2P, No ondrive GC and drive performs ECC */
+ if (gf.rsp != 0x0 || gf.ext != 0x0)
+ return -EINVAL;
+
+ return 0;
+}
+
+int nvm_validate_responsibility(struct nvm_dev *dev)
+{
+ if (!dev->ops->set_responsibility)
+ return 0;
+
+ return dev->ops->set_responsibility(dev->q, 0);
+}
+
+int nvm_init(struct nvm_dev *dev)
+{
+ struct blk_mq_tag_set *tag_set = dev->q->tag_set;
+ int max_qdepth;
+ int ret = 0;
+
+ if (!dev->q || !dev->ops)
+ return -EINVAL;
+
+ if (dev->ops->identify(dev->q, &dev->identity)) {
+ pr_err("nvm: device could not be identified\n");
+ ret = -EINVAL;
+ goto err;
+ }
+
+ max_qdepth = tag_set->queue_depth * tag_set->nr_hw_queues;
+
+ pr_debug("nvm dev: ver %u type %u chnls %u max qdepth: %i\n",
+ dev->identity.ver_id,
+ dev->identity.nvm_type,
+ dev->identity.nchannels,
+ max_qdepth);
+
+ ret = nvm_validate_features(dev);
+ if (ret) {
+ pr_err("nvm: disk features are not supported.");
+ goto err;
+ }
+
+ ret = nvm_validate_responsibility(dev);
+ if (ret) {
+ pr_err("nvm: disk responsibilities are not supported.");
+ goto err;
+ }
+
+ ret = nvm_core_init(dev, max_qdepth);
+ if (ret) {
+ pr_err("nvm: could not initialize core structures.\n");
+ goto err;
+ }
+
+ ret = nvm_luns_init(dev);
+ if (ret) {
+ pr_err("nvm: could not initialize luns\n");
+ goto err;
+ }
+
+ if (!dev->nr_luns) {
+ pr_err("nvm: device did not expose any luns.\n");
+ goto err;
+ }
+
+ ret = nvm_blocks_init(dev);
+ if (ret) {
+ pr_err("nvm: could not initialize blocks\n");
+ goto err;
+ }
+
+ pr_info("nvm: allocating %lu physical pages (%lu KB)\n",
+ dev->total_pages, dev->total_pages * dev->sector_size / 1024);
+ pr_info("nvm: luns: %u\n", dev->nr_luns);
+ pr_info("nvm: blocks: %lu\n", dev->total_blocks);
+ pr_info("nvm: target sector size=%d\n", dev->sector_size);
+
+ return 0;
+err:
+ nvm_free(dev);
+ pr_err("nvm: failed to initialize nvm\n");
+ return ret;
+}
+
+void nvm_exit(struct nvm_dev *dev)
+{
+ nvm_free(dev);
+
+ pr_info("nvm: successfully unloaded\n");
+}
+
+static int nvm_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
+ unsigned long arg)
+{
+ return 0;
+}
+
+static int nvm_open(struct block_device *bdev, fmode_t mode)
+{
+ return 0;
+}
+
+static void nvm_release(struct gendisk *disk, fmode_t mode)
+{
+}
+
+static const struct block_device_operations nvm_fops = {
+ .owner = THIS_MODULE,
+ .ioctl = nvm_ioctl,
+ .open = nvm_open,
+ .release = nvm_release,
+};
+
+static int nvm_create_target(struct gendisk *bdisk, char *ttname, char *tname,
+ int lun_begin, int lun_end)
+{
+ struct request_queue *qqueue = bdisk->queue;
+ struct nvm_dev *qnvm = bdisk->nvm;
+ struct request_queue *tqueue;
+ struct gendisk *tdisk;
+ struct nvm_target_type *tt;
+ struct nvm_target *t;
+ void *targetdata;
+
+ tt = nvm_find_target_type(ttname);
+ if (!tt) {
+ pr_err("nvm: target type %s not found\n", ttname);
+ return -EINVAL;
+ }
+
+ down_write(&_lock);
+ list_for_each_entry(t, &qnvm->online_targets, list) {
+ if (!strcmp(tname, t->disk->disk_name)) {
+ pr_err("nvm: target name already exists.\n");
+ up_write(&_lock);
+ return -EINVAL;
+ }
+ }
+ up_write(&_lock);
+
+ t = kmalloc(sizeof(struct nvm_target), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ tqueue = blk_alloc_queue_node(GFP_KERNEL, qqueue->node);
+ if (!tqueue)
+ goto err_t;
+ blk_queue_make_request(tqueue, tt->make_rq);
+
+ tdisk = alloc_disk(0);
+ if (!tdisk)
+ goto err_queue;
+
+ sprintf(tdisk->disk_name, "%s", tname);
+ tdisk->flags = GENHD_FL_EXT_DEVT;
+ tdisk->major = 0;
+ tdisk->first_minor = 0;
+ tdisk->fops = &nvm_fops;
+ tdisk->queue = tqueue;
+
+ targetdata = tt->init(bdisk, tdisk, lun_begin, lun_end);
+ if (IS_ERR(targetdata))
+ goto err_init;
+
+ tdisk->private_data = targetdata;
+ tqueue->queuedata = targetdata;
+
+ set_capacity(tdisk, tt->capacity(targetdata));
+ add_disk(tdisk);
+
+ t->type = tt;
+ t->disk = tdisk;
+
+ down_write(&_lock);
+ list_add_tail(&t->list, &qnvm->online_targets);
+ up_write(&_lock);
+
+ return 0;
+err_init:
+ put_disk(tdisk);
+err_queue:
+ blk_cleanup_queue(tqueue);
+err_t:
+ kfree(t);
+ return -ENOMEM;
+}
+
+/* _lock must be taken */
+static void nvm_remove_target(struct nvm_target *t)
+{
+ struct nvm_target_type *tt = t->type;
+ struct gendisk *tdisk = t->disk;
+ struct request_queue *q = tdisk->queue;
+
+ del_gendisk(tdisk);
+ if (tt->exit)
+ tt->exit(tdisk->private_data);
+ blk_cleanup_queue(q);
+
+ put_disk(tdisk);
+
+ list_del(&t->list);
+ kfree(t);
+}
+
+
+static ssize_t free_blocks_show(struct device *d, struct device_attribute *attr,
+ char *page)
+{
+ struct gendisk *disk = dev_to_disk(d);
+ struct nvm_dev *dev = disk->nvm;
+
+ char *page_start = page;
+ struct nvm_lun *lun;
+ unsigned int i;
+
+ nvm_for_each_lun(dev, lun, i)
+ page += sprintf(page, "%8u\t%u\n", i, lun->nr_free_blocks);
+
+ return page - page_start;
+}
+
+DEVICE_ATTR_RO(free_blocks);
+
+static ssize_t configure_store(struct device *d, struct device_attribute *attr,
+ const char *buf, size_t cnt)
+{
+ struct gendisk *disk = dev_to_disk(d);
+ struct nvm_dev *dev = disk->nvm;
+ char name[255], ttname[255];
+ int lun_begin, lun_end, ret;
+
+ if (cnt >= 255)
+ return -EINVAL;
+
+ ret = sscanf(buf, "%s %s %u:%u", name, ttname, &lun_begin, &lun_end);
+ if (ret != 4) {
+ pr_err("nvm: configure must be in the format of \"name targetname lun_begin:lun_end\".\n");
+ return -EINVAL;
+ }
+
+ if (lun_begin > lun_end || lun_end > dev->nr_luns) {
+ pr_err("nvm: lun out of bound (%u:%u > %u)\n",
+ lun_begin, lun_end, dev->nr_luns);
+ return -EINVAL;
+ }
+
+ ret = nvm_create_target(disk, name, ttname, lun_begin, lun_end);
+ if (ret)
+ pr_err("nvm: configure disk failed\n");
+
+ return cnt;
+}
+DEVICE_ATTR_WO(configure);
+
+static ssize_t remove_store(struct device *d, struct device_attribute *attr,
+ const char *buf, size_t cnt)
+{
+ struct gendisk *disk = dev_to_disk(d);
+ struct nvm_dev *dev = disk->nvm;
+ struct nvm_target *t = NULL;
+ char tname[255];
+ int ret;
+
+ if (cnt >= 255)
+ return -EINVAL;
+
+ ret = sscanf(buf, "%s", tname);
+ if (ret != 1) {
+ pr_err("nvm: remove use the following format \"targetname\".\n");
+ return -EINVAL;
+ }
+
+ down_write(&_lock);
+ list_for_each_entry(t, &dev->online_targets, list) {
+ if (!strcmp(tname, t->disk->disk_name)) {
+ nvm_remove_target(t);
+ ret = 0;
+ break;
+ }
+ }
+ up_write(&_lock);
+
+ if (ret)
+ pr_err("nvm: target \"%s\" doesn't exist.\n", tname);
+
+ return cnt;
+}
+DEVICE_ATTR_WO(remove);
+
+static struct attribute *nvm_attrs[] = {
+ &dev_attr_free_blocks.attr,
+ &dev_attr_configure.attr,
+ &dev_attr_remove.attr,
+ NULL,
+};
+
+static struct attribute_group nvm_attribute_group = {
+ .name = "lightnvm",
+ .attrs = nvm_attrs,
+};
+
+int nvm_attach_sysfs(struct gendisk *disk)
+{
+ struct device *dev = disk_to_dev(disk);
+ int ret;
+
+ if (!disk->nvm)
+ return 0;
+
+ ret = sysfs_update_group(&dev->kobj, &nvm_attribute_group);
+ if (ret)
+ return ret;
+
+ kobject_uevent(&dev->kobj, KOBJ_CHANGE);
+
+ return 0;
+}
+EXPORT_SYMBOL(nvm_attach_sysfs);
+
+void nvm_remove_sysfs(struct gendisk *disk)
+{
+ struct device *dev = disk_to_dev(disk);
+
+ sysfs_remove_group(&dev->kobj, &nvm_attribute_group);
+}
+
+int nvm_register(struct request_queue *q, struct gendisk *disk,
+ struct nvm_dev_ops *ops)
+{
+ struct nvm_dev *nvm;
+ int ret;
+
+ if (!ops->identify || !ops->get_features)
+ return -EINVAL;
+
+ /* does not yet support multi-page IOs. */
+ blk_queue_max_hw_sectors(q, queue_logical_block_size(q) >> 9);
+
+ nvm = kzalloc(sizeof(struct nvm_dev), GFP_KERNEL);
+ if (!nvm)
+ return -ENOMEM;
+
+ nvm->q = q;
+ nvm->ops = ops;
+
+ ret = nvm_init(nvm);
+ if (ret)
+ goto err_init;
+
+ disk->nvm = nvm;
+
+ return 0;
+err_init:
+ kfree(nvm);
+ return ret;
+}
+EXPORT_SYMBOL(nvm_register);
+
+void nvm_unregister(struct gendisk *disk)
+{
+ if (!disk->nvm)
+ return;
+
+ nvm_remove_sysfs(disk);
+
+ nvm_exit(disk->nvm);
+}
+EXPORT_SYMBOL(nvm_unregister);
+
+int nvm_prep_rq(struct request *rq, struct nvm_rq_data *rqdata)
+{
+ struct nvm_target_instance *ins;
+ struct bio *bio;
+
+ if (rqdata->phys_sector)
+ return 0;
+
+ if (rq->cmd_type == REQ_TYPE_DRV_PRIV) {
+ struct nvm_internal_cmd *cmd = rq->special;
+
+ /* internal nvme request with no relation to target */
+ if (!cmd)
+ return 0;
+
+ ins = cmd->target;
+ } else {
+ bio = rq->bio;
+ if (unlikely(!bio))
+ return 0;
+
+ if (unlikely(!bio->bi_nvm)) {
+ if (bio_data_dir(bio) == WRITE) {
+ pr_warn("nvm: attempting to write without FTL.\n");
+ return NVM_PREP_ERROR;
+ }
+ return NVM_PREP_OK;
+ }
+
+ ins = container_of(bio->bi_nvm, struct nvm_target_instance,
+ payload);
+ }
+ return ins->tt->prep_rq(rq, rqdata, ins);
+}
+EXPORT_SYMBOL(nvm_prep_rq);
+
+void nvm_unprep_rq(struct request *rq, struct nvm_rq_data *rqdata)
+{
+ struct nvm_target_instance *ins;
+ struct bio *bio;
+
+ if (!rqdata->phys_sector)
+ return;
+
+ if (rq->cmd_type == REQ_TYPE_DRV_PRIV) {
+ struct nvm_internal_cmd *cmd = rq->special;
+
+ /* internal nvme request with no relation to target*/
+ if (!cmd)
+ return;
+
+ ins = cmd->target;
+ } else {
+ bio = rq->bio;
+ if (unlikely(!bio))
+ return;
+ ins = container_of(bio->bi_nvm, struct nvm_target_instance,
+ payload);
+ }
+ ins->tt->unprep_rq(rq, rqdata, ins);
+}
+EXPORT_SYMBOL(nvm_unprep_rq);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index ec274e0..7d7442e 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -199,6 +199,9 @@ struct gendisk {
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct blk_integrity *integrity;
#endif
+#ifdef CONFIG_NVM
+ struct nvm_dev *nvm;
+#endif
int node_id;
};

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
new file mode 100644
index 0000000..dd26466
--- /dev/null
+++ b/include/linux/lightnvm.h
@@ -0,0 +1,379 @@
+#ifndef NVM_H
+#define NVM_H
+
+enum {
+ NVM_PREP_OK = 0,
+ NVM_PREP_BUSY = 1,
+ NVM_PREP_REQUEUE = 2,
+ NVM_PREP_DONE = 3,
+ NVM_PREP_ERROR = 4,
+};
+
+#ifdef CONFIG_NVM
+
+#include <linux/blkdev.h>
+#include <linux/types.h>
+
+enum {
+ /* HW Responsibilities */
+ NVM_RSP_L2P = 0x00,
+ NVM_RSP_GC = 0x01,
+ NVM_RSP_ECC = 0x02,
+
+ /* Physical NVM Type */
+ NVM_NVMT_BLK = 0,
+ NVM_NVMT_BYTE = 1,
+
+ /* Internal IO Scheduling algorithm */
+ NVM_IOSCHED_CHANNEL = 0,
+ NVM_IOSCHED_CHIP = 1,
+
+ /* Status codes */
+ NVM_SUCCESS = 0,
+ NVM_RSP_NOT_CHANGEABLE = 1,
+};
+
+struct nvm_id_chnl {
+ u64 laddr_begin;
+ u64 laddr_end;
+ u32 oob_size;
+ u32 queue_size;
+ u32 gran_read;
+ u32 gran_write;
+ u32 gran_erase;
+ u32 t_r;
+ u32 t_sqr;
+ u32 t_w;
+ u32 t_sqw;
+ u32 t_e;
+ u16 chnl_parallelism;
+ u8 io_sched;
+ u8 res[133];
+};
+
+struct nvm_id {
+ u8 ver_id;
+ u8 nvm_type;
+ u16 nchannels;
+ struct nvm_id_chnl *chnls;
+};
+
+struct nvm_get_features {
+ u64 rsp;
+ u64 ext;
+};
+
+struct nvm_target {
+ struct list_head list;
+ struct nvm_target_type *type;
+ struct gendisk *disk;
+};
+
+struct nvm_internal_cmd {
+ void *target;
+ sector_t phys_lba;
+ int rw;
+ void *buffer;
+ unsigned bufflen;
+ unsigned timeout;
+};
+
+extern void nvm_unregister(struct gendisk *);
+extern int nvm_attach_sysfs(struct gendisk *disk);
+
+typedef int (nvm_l2p_update_fn)(u64, u64, u64 *, void *);
+typedef int (nvm_bb_update_fn)(u32, void *, unsigned int, void *);
+typedef int (nvm_id_fn)(struct request_queue *, struct nvm_id *);
+typedef int (nvm_get_features_fn)(struct request_queue *,
+ struct nvm_get_features *);
+typedef int (nvm_set_rsp_fn)(struct request_queue *, u64);
+typedef int (nvm_get_l2p_tbl_fn)(struct request_queue *, u64, u64,
+ nvm_l2p_update_fn *, void *);
+typedef int (nvm_op_bb_tbl_fn)(struct request_queue *, int, unsigned int,
+ nvm_bb_update_fn *, void *);
+typedef int (nvm_internal_rw_fn)(struct request_queue *,
+ struct nvm_internal_cmd *);
+typedef int (nvm_erase_blk_fn)(struct request_queue *, sector_t);
+
+struct nvm_dev_ops {
+ nvm_id_fn *identify;
+ nvm_get_features_fn *get_features;
+ nvm_set_rsp_fn *set_responsibility;
+ nvm_get_l2p_tbl_fn *get_l2p_tbl;
+ nvm_op_bb_tbl_fn *set_bb_tbl;
+ nvm_op_bb_tbl_fn *get_bb_tbl;
+
+ nvm_internal_rw_fn *internal_rw;
+ nvm_erase_blk_fn *erase_block;
+};
+
+struct nvm_blocks;
+
+/*
+ * We assume that the device exposes its channels as a linear address
+ * space. A lun therefore has a phy_addr_start and phy_addr_end that
+ * denote the start and end. This abstraction is used to let the
+ * open-channel SSD (or any other device) expose its read/write/erase
+ * interface and be administrated by the host system.
+ */
+struct nvm_lun {
+ struct nvm_dev *dev;
+
+ /* lun block lists */
+ struct list_head used_list; /* In-use blocks */
+ struct list_head free_list; /* Not used blocks i.e. released
+ * and ready for use */
+ struct list_head bb_list; /* Bad blocks. Mutually exclusive with
+ free_list and used_list */
+
+
+ struct {
+ spinlock_t lock;
+ } ____cacheline_aligned_in_smp;
+
+ struct nvm_block *blocks;
+ struct nvm_id_chnl *chnl;
+
+ int id;
+ int reserved_blocks;
+
+ unsigned int nr_blocks; /* end_block - start_block. */
+ unsigned int nr_free_blocks; /* Number of unused blocks */
+
+ int nr_pages_per_blk;
+};
+
+struct nvm_block {
+ /* Management structures */
+ struct list_head list;
+ struct nvm_lun *lun;
+
+ spinlock_t lock;
+
+#define MAX_INVALID_PAGES_STORAGE 8
+ /* Bitmap for invalid page entries */
+ unsigned long invalid_pages[MAX_INVALID_PAGES_STORAGE];
+ /* points to the next writable page within a block */
+ unsigned int next_page;
+ /* number of pages that are invalid, wrt host page size */
+ unsigned int nr_invalid_pages;
+
+ unsigned int id;
+ int type;
+ /* Persistent data structures */
+ atomic_t data_cmnt_size; /* data pages committed to stable storage */
+};
+
+struct nvm_dev {
+ struct nvm_dev_ops *ops;
+ struct request_queue *q;
+
+ struct nvm_id identity;
+
+ struct list_head online_targets;
+
+ int nr_luns;
+ struct nvm_lun *luns;
+
+ /*int nr_blks_per_lun;
+ int nr_pages_per_blk;*/
+ /* Calculated/Cached values. These do not reflect the actual usable
+ * blocks at run-time. */
+ unsigned long total_pages;
+ unsigned long total_blocks;
+
+ uint32_t sector_size;
+};
+
+struct nvm_rq_data {
+ sector_t phys_sector;
+};
+
+/* Logical to physical mapping */
+struct nvm_addr {
+ sector_t addr;
+ struct nvm_block *block;
+};
+
+/* Physical to logical mapping */
+struct nvm_rev_addr {
+ sector_t addr;
+};
+
+struct rrpc_inflight_rq {
+ struct list_head list;
+ sector_t l_start;
+ sector_t l_end;
+};
+
+struct nvm_per_rq {
+ struct rrpc_inflight_rq inflight_rq;
+ struct nvm_addr *addr;
+ unsigned int flags;
+};
+
+typedef void (nvm_tgt_make_rq)(struct request_queue *, struct bio *);
+typedef int (nvm_tgt_prep_rq)(struct request *, struct nvm_rq_data *, void *);
+typedef void (nvm_tgt_unprep_rq)(struct request *, struct nvm_rq_data *,
+ void *);
+typedef sector_t (nvm_tgt_capacity)(void *);
+typedef void *(nvm_tgt_init_fn)(struct gendisk *, struct gendisk *, int, int);
+typedef void (nvm_tgt_exit_fn)(void *);
+
+struct nvm_target_type {
+ const char *name;
+ unsigned int version[3];
+
+ /* target entry points */
+ nvm_tgt_make_rq *make_rq;
+ nvm_tgt_prep_rq *prep_rq;
+ nvm_tgt_unprep_rq *unprep_rq;
+ nvm_tgt_capacity *capacity;
+
+ /* module-specific init/teardown */
+ nvm_tgt_init_fn *init;
+ nvm_tgt_exit_fn *exit;
+
+ /* For open-channel SSD internal use */
+ struct list_head list;
+};
+
+struct nvm_target_instance {
+ struct bio_nvm_payload payload;
+ struct nvm_target_type *tt;
+};
+
+extern struct nvm_target_type *nvm_find_target_type(const char *);
+extern int nvm_register_target(struct nvm_target_type *);
+extern void nvm_unregister_target(struct nvm_target_type *);
+extern int nvm_register(struct request_queue *, struct gendisk *,
+ struct nvm_dev_ops *);
+extern void nvm_unregister(struct gendisk *);
+extern int nvm_prep_rq(struct request *, struct nvm_rq_data *);
+extern void nvm_unprep_rq(struct request *, struct nvm_rq_data *);
+extern struct nvm_block *nvm_get_blk(struct nvm_lun *, int);
+extern void nvm_put_blk(struct nvm_block *block);
+extern int nvm_internal_rw(struct nvm_dev *, struct nvm_internal_cmd *);
+extern int nvm_erase_blk(struct nvm_dev *, struct nvm_block *);
+extern sector_t nvm_alloc_addr(struct nvm_block *);
+static inline struct nvm_dev *nvm_get_dev(struct gendisk *disk)
+{
+ return disk->nvm;
+}
+
+#define nvm_for_each_lun(dev, lun, i) \
+ for ((i) = 0, lun = &(dev)->luns[0]; \
+ (i) < (dev)->nr_luns; (i)++, lun = &(dev)->luns[(i)])
+
+#define lun_for_each_block(p, b, i) \
+ for ((i) = 0, b = &(p)->blocks[0]; \
+ (i) < (p)->nr_blocks; (i)++, b = &(p)->blocks[(i)])
+
+#define block_for_each_page(b, p) \
+ for ((p)->addr = block_to_addr((b)), (p)->block = (b); \
+ (p)->addr < block_to_addr((b)) \
+ + (b)->lun->dev->nr_pages_per_blk; \
+ (p)->addr++)
+
+/* We currently assume that the lightnvm device accepts data in 512-byte
+ * chunks. This should be set to the smallest command size available for a
+ * given device.
+ */
+#define NVM_SECTOR (512)
+#define EXPOSED_PAGE_SIZE (4096)
+
+#define NR_PHY_IN_LOG (EXPOSED_PAGE_SIZE / NVM_SECTOR)
+
+#define NVM_MSG_PREFIX "nvm"
+#define ADDR_EMPTY (~0ULL)
+
+static inline int block_is_full(struct nvm_block *block)
+{
+ struct nvm_lun *lun = block->lun;
+
+ return block->next_page == lun->nr_pages_per_blk;
+}
+
+static inline sector_t block_to_addr(struct nvm_block *block)
+{
+ struct nvm_lun *lun = block->lun;
+
+ return block->id * lun->nr_pages_per_blk;
+}
+
+static inline struct nvm_lun *paddr_to_lun(struct nvm_dev *dev,
+ sector_t p_addr)
+{
+ return &dev->luns[p_addr / (dev->total_pages / dev->nr_luns)];
+}
+
+static inline void nvm_init_rq_data(struct nvm_rq_data *rqdata)
+{
+ rqdata->phys_sector = 0;
+}
+
+#else /* CONFIG_NVM */
+
+struct nvm_dev_ops;
+struct nvm_dev;
+struct nvm_lun;
+struct nvm_block;
+struct nvm_per_rq {
+};
+struct nvm_rq_data {
+};
+struct nvm_internal_cmd {
+};
+struct nvm_target_type;
+struct nvm_target_instance;
+
+static inline struct nvm_target_type *nvm_find_target_type(const char *c)
+{
+ return NULL;
+}
+static inline int nvm_register_target(struct nvm_target_type *tt)
+{
+ return -EINVAL;
+}
+static inline void nvm_unregister_target(struct nvm_target_type *tt) {}
+static inline int nvm_register(struct request_queue *q, struct gendisk *disk,
+ struct nvm_dev_ops *ops)
+{
+ return -EINVAL;
+}
+static inline void nvm_unregister(struct gendisk *disk) {}
+static inline int nvm_prep_rq(struct request *rq, struct nvm_rq_data *rqdata)
+{
+ return -EINVAL;
+}
+static inline void nvm_unprep_rq(struct request *rq, struct nvm_rq_data *rqdata)
+{
+}
+static inline struct nvm_block *nvm_get_blk(struct nvm_lun *lun, int is_gc)
+{
+ return NULL;
+}
+static inline void nvm_put_blk(struct nvm_block *blk) {}
+static inline int nvm_internal_rw(struct nvm_dev *dev,
+ const struct nvm_internal_cmd *cmd)
+{
+ return -EINVAL;
+}
+static inline int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk)
+{
+ return -EINVAL;
+}
+static inline sector_t nvm_alloc_addr(struct nvm_block *blk)
+{
+ return 0;
+}
+static inline struct nvm_dev *nvm_get_dev(struct gendisk *disk)
+{
+ return NULL;
+}
+static inline void nvm_init_rq_data(struct nvm_rq_data *rqdata) { }
+static inline int nvm_attach_sysfs(struct gendisk *dev) { return 0; }
+
+
+#endif /* CONFIG_NVM */
+#endif /* LIGHTNVM.H */
--
2.1.4

2015-06-05 12:55:36

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v4 6/8] lightnvm: RRPC target

This target implements a simple FTL strategy for Open-Channel SSDs.
It does round-robin selection across channels and luns. It uses a
simple greedy cost-based garbage collector and exposes the physical
flash as a block device.

Signed-off-by: Javier González <[email protected]>
Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/lightnvm/Kconfig | 10 +
drivers/lightnvm/Makefile | 1 +
drivers/lightnvm/rrpc.c | 1088 +++++++++++++++++++++++++++++++++++++++++++++
drivers/lightnvm/rrpc.h | 221 +++++++++
include/linux/blkdev.h | 2 +
5 files changed, 1322 insertions(+)
create mode 100644 drivers/lightnvm/rrpc.c
create mode 100644 drivers/lightnvm/rrpc.h

diff --git a/drivers/lightnvm/Kconfig b/drivers/lightnvm/Kconfig
index 1f8412c..1773891 100644
--- a/drivers/lightnvm/Kconfig
+++ b/drivers/lightnvm/Kconfig
@@ -14,3 +14,13 @@ menuconfig NVM
If you say N, all options in this submenu will be skipped and disabled;
only do this if you know what you are doing.

+if NVM
+
+config NVM_RRPC
+ tristate "Round-robin Hybrid Open-Channel SSD"
+ ---help---
+ Allows an open-channel SSD to be exposed as a block device to the
+ host. The target is implemented using a linear mapping table and
+ cost-based garbage collection. It is optimized for 4K IO sizes.
+
+endif # NVM
diff --git a/drivers/lightnvm/Makefile b/drivers/lightnvm/Makefile
index 38185e9..b2a39e2 100644
--- a/drivers/lightnvm/Makefile
+++ b/drivers/lightnvm/Makefile
@@ -3,3 +3,4 @@
#

obj-$(CONFIG_NVM) := core.o
+obj-$(CONFIG_NVM_RRPC) += rrpc.o
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
new file mode 100644
index 0000000..d3a7f16
--- /dev/null
+++ b/drivers/lightnvm/rrpc.c
@@ -0,0 +1,1088 @@
+/*
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * Implementation of a Round-robin page-based Hybrid FTL for Open-channel SSDs.
+ */
+
+#include "rrpc.h"
+
+static struct kmem_cache *_gcb_cache;
+static DECLARE_RWSEM(_lock);
+
+#define rrpc_for_each_lun(rrpc, rlun, i) \
+ for ((i) = 0, rlun = &(rrpc)->luns[0]; \
+ (i) < (rrpc)->nr_luns; (i)++, rlun = &(rrpc)->luns[(i)])
+
+static void invalidate_block_page(struct nvm_addr *p)
+{
+ struct nvm_block *block = p->block;
+ unsigned int page_offset;
+
+ if (!block)
+ return;
+
+ spin_lock(&block->lock);
+ page_offset = p->addr % block->lun->nr_pages_per_blk;
+ WARN_ON(test_and_set_bit(page_offset, block->invalid_pages));
+ block->nr_invalid_pages++;
+ spin_unlock(&block->lock);
+}
+
+static inline void __nvm_page_invalidate(struct rrpc *rrpc, struct nvm_addr *a)
+{
+ BUG_ON(!spin_is_locked(&rrpc->rev_lock));
+ if (a->addr == ADDR_EMPTY)
+ return;
+
+ invalidate_block_page(a);
+ rrpc->rev_trans_map[a->addr - rrpc->poffset].addr = ADDR_EMPTY;
+}
+
+static void rrpc_invalidate_range(struct rrpc *rrpc, sector_t slba,
+ unsigned len)
+{
+ sector_t i;
+
+ spin_lock(&rrpc->rev_lock);
+ for (i = slba; i < slba + len; i++) {
+ struct nvm_addr *gp = &rrpc->trans_map[i];
+
+ __nvm_page_invalidate(rrpc, gp);
+ gp->block = NULL;
+ }
+ spin_unlock(&rrpc->rev_lock);
+}
+
+static struct request *rrpc_inflight_laddr_acquire(struct rrpc *rrpc,
+ sector_t laddr, unsigned int pages)
+{
+ struct request *rq;
+ struct rrpc_inflight_rq *inf;
+
+ rq = blk_mq_alloc_request(rrpc->q_dev, READ, GFP_NOIO, false);
+ if (!rq)
+ return ERR_PTR(-ENOMEM);
+
+ inf = rrpc_get_inflight_rq(rq);
+ if (rrpc_lock_laddr(rrpc, laddr, pages, inf)) {
+ blk_mq_free_request(rq);
+ return NULL;
+ }
+
+ return rq;
+}
+
+static void rrpc_inflight_laddr_release(struct rrpc *rrpc, struct request *rq)
+{
+ struct rrpc_inflight_rq *inf;
+
+ inf = rrpc_get_inflight_rq(rq);
+ rrpc_unlock_laddr(rrpc, inf->l_start, inf);
+
+ blk_mq_free_request(rq);
+}
+
+static void rrpc_discard(struct rrpc *rrpc, struct bio *bio)
+{
+ sector_t slba = bio->bi_iter.bi_sector / NR_PHY_IN_LOG;
+ sector_t len = bio->bi_iter.bi_size / EXPOSED_PAGE_SIZE;
+ struct request *rq;
+
+ do {
+ rq = rrpc_inflight_laddr_acquire(rrpc, slba, len);
+ schedule();
+ } while (!rq);
+
+ if (IS_ERR(rq)) {
+ bio_io_error(bio);
+ return;
+ }
+
+ rrpc_invalidate_range(rrpc, slba, len);
+ rrpc_inflight_laddr_release(rrpc, rq);
+}
+
+/* requires lun->lock taken */
+static void rrpc_set_lun_cur(struct rrpc_lun *rlun, struct nvm_block *block)
+{
+ BUG_ON(!block);
+
+ if (rlun->cur) {
+ spin_lock(&rlun->cur->lock);
+ WARN_ON(!block_is_full(rlun->cur));
+ spin_unlock(&rlun->cur->lock);
+ }
+ rlun->cur = block;
+}
+
+static struct rrpc_lun *get_next_lun(struct rrpc *rrpc)
+{
+ int next = atomic_inc_return(&rrpc->next_lun);
+
+ return &rrpc->luns[next % rrpc->nr_luns];
+}
+
+static void rrpc_gc_kick(struct rrpc *rrpc)
+{
+ struct rrpc_lun *rlun;
+ unsigned int i;
+
+ for (i = 0; i < rrpc->nr_luns; i++) {
+ rlun = &rrpc->luns[i];
+ queue_work(rrpc->krqd_wq, &rlun->ws_gc);
+ }
+}
+
+/**
+ * rrpc_gc_timer - default gc timer function.
+ * @data: ptr to the 'rrpc' structure
+ *
+ * Description:
+ * rrpc configures a timer to kick the GC to force proactive behavior.
+ *
+ **/
+static void rrpc_gc_timer(unsigned long data)
+{
+ struct rrpc *rrpc = (struct rrpc *)data;
+
+ rrpc_gc_kick(rrpc);
+ mod_timer(&rrpc->gc_timer, jiffies + msecs_to_jiffies(10));
+}
+
+/*
+ * rrpc_move_valid_pages -- migrate live data off the block
+ * @rrpc: the 'rrpc' structure
+ * @block: the block from which to migrate live pages
+ *
+ * Description:
+ * GC algorithms may call this function to migrate remaining live
+ * pages off the block prior to erasing it. This function blocks
+ * further execution until the operation is complete.
+ */
+static int rrpc_move_valid_pages(struct rrpc *rrpc, struct nvm_block *block)
+{
+ struct nvm_dev *dev = rrpc->q_nvm;
+ struct nvm_lun *lun = block->lun;
+ struct nvm_internal_cmd cmd = {
+ .target = &rrpc->instance,
+ .timeout = 30,
+ };
+ struct nvm_rev_addr *rev;
+ struct request *rq;
+ struct page *page;
+ int slot;
+ int ret;
+ sector_t phys_addr;
+
+ if (bitmap_full(block->invalid_pages, lun->nr_pages_per_blk))
+ return 0;
+
+ page = mempool_alloc(rrpc->page_pool, GFP_NOIO);
+
+ while ((slot = find_first_zero_bit(block->invalid_pages,
+ lun->nr_pages_per_blk)) <
+ lun->nr_pages_per_blk) {
+
+ /* Lock laddr */
+ phys_addr = block_to_addr(block) + slot;
+
+try:
+ spin_lock(&rrpc->rev_lock);
+ /* Get logical address from physical to logical table */
+ rev = &rrpc->rev_trans_map[phys_addr - rrpc->poffset];
+ /* already updated by previous regular write */
+ if (rev->addr == ADDR_EMPTY) {
+ spin_unlock(&rrpc->rev_lock);
+ continue;
+ }
+
+ rq = rrpc_inflight_laddr_acquire(rrpc, rev->addr, 1);
+ if (!rq) {
+ spin_unlock(&rrpc->rev_lock);
+ schedule();
+ goto try;
+ }
+
+ spin_unlock(&rrpc->rev_lock);
+
+ cmd.phys_lba = rev->addr;
+ cmd.rw = READ;
+ cmd.buffer = page_address(page);
+ cmd.bufflen = EXPOSED_PAGE_SIZE;
+ ret = nvm_internal_rw(dev, &cmd);
+ if (ret) {
+ pr_err("nvm: gc read failure.\n");
+ goto out;
+ }
+
+ /* turn the command around and write the data back to a new
+ * address */
+ cmd.rw = WRITE;
+ ret = nvm_internal_rw(dev, &cmd);
+ if (ret) {
+ pr_err("nvm: gc write failure.\n");
+ goto out;
+ }
+
+ rrpc_inflight_laddr_release(rrpc, rq);
+ }
+
+ mempool_free(page, rrpc->page_pool);
+
+out:
+ if (!bitmap_full(block->invalid_pages, lun->nr_pages_per_blk)) {
+ pr_err("nvm: failed to garbage collect block\n");
+ return -EIO;
+ }
+
+ return 0;
+}
+
+static void rrpc_block_gc(struct work_struct *work)
+{
+ struct rrpc_block_gc *gcb = container_of(work, struct rrpc_block_gc,
+ ws_gc);
+ struct rrpc *rrpc = gcb->rrpc;
+ struct nvm_block *block = gcb->block;
+ struct nvm_dev *dev = rrpc->q_nvm;
+
+ pr_debug("nvm: block '%d' being reclaimed\n", block->id);
+
+ if (rrpc_move_valid_pages(rrpc, block))
+ goto done;
+
+ nvm_erase_blk(dev, block);
+ nvm_put_blk(block);
+done:
+ mempool_free(gcb, rrpc->gcb_pool);
+}
+
+/* the block with the highest number of invalid pages will be at the beginning
+ * of the list */
+static struct rrpc_block *rblock_max_invalid(struct rrpc_block *ra,
+ struct rrpc_block *rb)
+{
+ struct nvm_block *a = ra->parent;
+ struct nvm_block *b = rb->parent;
+
+ BUG_ON(!a || !b);
+
+ if (a->nr_invalid_pages == b->nr_invalid_pages)
+ return ra;
+
+ return (a->nr_invalid_pages < b->nr_invalid_pages) ? rb : ra;
+}
+
+/* linearly find the block with highest number of invalid pages
+ * requires lun->lock */
+static struct rrpc_block *block_prio_find_max(struct rrpc_lun *rlun)
+{
+ struct list_head *prio_list = &rlun->prio_list;
+ struct rrpc_block *rblock, *max;
+
+ BUG_ON(list_empty(prio_list));
+
+ max = list_first_entry(prio_list, struct rrpc_block, prio);
+ list_for_each_entry(rblock, prio_list, prio)
+ max = rblock_max_invalid(max, rblock);
+
+ return max;
+}
+
+static void rrpc_lun_gc(struct work_struct *work)
+{
+ struct rrpc_lun *rlun = container_of(work, struct rrpc_lun, ws_gc);
+ struct rrpc *rrpc = rlun->rrpc;
+ struct nvm_lun *lun = rlun->parent;
+ struct rrpc_block_gc *gcb;
+ unsigned int nr_blocks_need;
+
+ nr_blocks_need = lun->nr_blocks / GC_LIMIT_INVERSE;
+
+ if (nr_blocks_need < rrpc->nr_luns)
+ nr_blocks_need = rrpc->nr_luns;
+
+ spin_lock(&lun->lock);
+ while (nr_blocks_need > lun->nr_free_blocks &&
+ !list_empty(&rlun->prio_list)) {
+ struct rrpc_block *rblock = block_prio_find_max(rlun);
+ struct nvm_block *block = rblock->parent;
+
+ if (!block->nr_invalid_pages)
+ break;
+
+ list_del_init(&rblock->prio);
+
+ BUG_ON(!block_is_full(block));
+
+ pr_debug("nvm: selected block '%d' as GC victim\n",
+ block->id);
+
+ gcb = mempool_alloc(rrpc->gcb_pool, GFP_ATOMIC);
+ if (!gcb)
+ break;
+
+ gcb->rrpc = rrpc;
+ gcb->block = rblock->parent;
+ INIT_WORK(&gcb->ws_gc, rrpc_block_gc);
+
+ queue_work(rrpc->kgc_wq, &gcb->ws_gc);
+
+ nr_blocks_need--;
+ }
+ spin_unlock(&lun->lock);
+
+ /* TODO: Hint that request queue can be started again */
+}
+
+static void rrpc_gc_queue(struct work_struct *work)
+{
+ struct rrpc_block_gc *gcb = container_of(work, struct rrpc_block_gc,
+ ws_gc);
+ struct rrpc *rrpc = gcb->rrpc;
+ struct nvm_block *block = gcb->block;
+ struct nvm_lun *lun = block->lun;
+ struct rrpc_lun *rlun = &rrpc->luns[lun->id - rrpc->lun_offset];
+ struct rrpc_block *rblock =
+ &rlun->blocks[block->id % lun->nr_blocks];
+
+ spin_lock(&rlun->lock);
+ list_add_tail(&rblock->prio, &rlun->prio_list);
+ spin_unlock(&rlun->lock);
+
+ mempool_free(gcb, rrpc->gcb_pool);
+ pr_debug("nvm: block '%d' is full, allow GC (sched)\n", block->id);
+}
+
+static int rrpc_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd,
+ unsigned long arg)
+{
+ return 0;
+}
+
+static int rrpc_open(struct block_device *bdev, fmode_t mode)
+{
+ return 0;
+}
+
+static void rrpc_release(struct gendisk *disk, fmode_t mode)
+{
+}
+
+static const struct block_device_operations rrpc_fops = {
+ .owner = THIS_MODULE,
+ .ioctl = rrpc_ioctl,
+ .open = rrpc_open,
+ .release = rrpc_release,
+};
+
+static struct rrpc_lun *__rrpc_get_lun_rr(struct rrpc *rrpc, int is_gc)
+{
+ unsigned int i;
+ struct rrpc_lun *rlun, *max_free;
+
+ if (!is_gc)
+ return get_next_lun(rrpc);
+
+ /* FIXME */
+ /* during GC, we don't care about RR, instead we want to make
+ * sure that we maintain evenness between the block luns. */
+ max_free = &rrpc->luns[0];
+ /* prevent GC-ing lun from devouring pages of a lun with
+ * few free blocks. We don't take the lock as we only need an
+ * estimate. */
+ rrpc_for_each_lun(rrpc, rlun, i) {
+ if (rlun->parent->nr_free_blocks >
+ max_free->parent->nr_free_blocks)
+ max_free = rlun;
+ }
+
+ return max_free;
+}
+
+static inline void __rrpc_page_invalidate(struct rrpc *rrpc,
+ struct nvm_addr *gp)
+{
+ BUG_ON(!spin_is_locked(&rrpc->rev_lock));
+ if (gp->addr == ADDR_EMPTY)
+ return;
+
+ invalidate_block_page(gp);
+ rrpc->rev_trans_map[gp->addr - rrpc->poffset].addr = ADDR_EMPTY;
+}
+
+struct nvm_addr *nvm_update_map(struct rrpc *rrpc, sector_t l_addr,
+ struct nvm_block *p_block, sector_t p_addr, int is_gc)
+{
+ struct nvm_addr *gp;
+ struct nvm_rev_addr *rev;
+
+ BUG_ON(l_addr >= rrpc->nr_pages);
+
+ gp = &rrpc->trans_map[l_addr];
+ spin_lock(&rrpc->rev_lock);
+ if (gp->block)
+ __rrpc_page_invalidate(rrpc, gp);
+
+ gp->addr = p_addr;
+ gp->block = p_block;
+
+ rev = &rrpc->rev_trans_map[gp->addr - rrpc->poffset];
+ rev->addr = l_addr;
+ spin_unlock(&rrpc->rev_lock);
+
+ return gp;
+}
+
+/* Simple round-robin Logical to physical address translation.
+ *
+ * Retrieve the mapping using the active append point. Then update the ap for
+ * the next write to the disk.
+ *
+ * Returns nvm_addr with the physical address and block. Remember to return to
+ * rrpc->addr_cache when request is finished.
+ */
+static struct nvm_addr *rrpc_map_page(struct rrpc *rrpc, sector_t laddr,
+ int is_gc)
+{
+ struct rrpc_lun *rlun;
+ struct nvm_lun *lun;
+ struct nvm_block *p_block;
+ sector_t p_addr;
+
+ rlun = __rrpc_get_lun_rr(rrpc, is_gc);
+ lun = rlun->parent;
+
+ if (!is_gc && lun->nr_free_blocks < rrpc->nr_luns * 4)
+ return NULL;
+
+ spin_lock(&rlun->lock);
+
+ p_block = rlun->cur;
+ p_addr = nvm_alloc_addr(p_block);
+
+ if (p_addr == ADDR_EMPTY) {
+ p_block = nvm_get_blk(lun, 0);
+
+ if (!p_block) {
+ if (is_gc) {
+ p_addr = nvm_alloc_addr(rlun->gc_cur);
+ if (p_addr == ADDR_EMPTY) {
+ p_block = nvm_get_blk(lun, 1);
+ if (!p_block) {
+ pr_err("rrpc: no more blocks");
+ goto finished;
+ } else {
+ rlun->gc_cur = p_block;
+ p_addr =
+ nvm_alloc_addr(rlun->gc_cur);
+ }
+ }
+ p_block = rlun->gc_cur;
+ }
+ goto finished;
+ }
+
+ rrpc_set_lun_cur(rlun, p_block);
+ p_addr = nvm_alloc_addr(p_block);
+ }
+
+finished:
+ if (p_addr == ADDR_EMPTY)
+ goto err;
+
+ if (!p_block)
+ WARN_ON(is_gc);
+
+ spin_unlock(&rlun->lock);
+ return nvm_update_map(rrpc, laddr, p_block, p_addr, is_gc);
+err:
+ spin_unlock(&rlun->lock);
+ return NULL;
+}
+
+static void rrpc_unprep_rq(struct request *rq, struct nvm_rq_data *rqdata,
+ void *private)
+{
+ struct rrpc *rrpc = private;
+ struct nvm_per_rq *pb = get_per_rq_data(rq);
+ struct nvm_addr *p = pb->addr;
+ struct nvm_block *block = p->block;
+ struct nvm_lun *lun = block->lun;
+ struct rrpc_block_gc *gcb;
+ int cmnt_size;
+
+ rrpc_unlock_rq(rrpc, rq);
+
+ if (rq_data_dir(rq) == WRITE) {
+ cmnt_size = atomic_inc_return(&block->data_cmnt_size);
+ if (likely(cmnt_size != lun->nr_pages_per_blk))
+ return;
+
+ gcb = mempool_alloc(rrpc->gcb_pool, GFP_ATOMIC);
+ if (!gcb) {
+ pr_err("rrpc: not able to queue block for gc.");
+ return;
+ }
+
+ gcb->rrpc = rrpc;
+ gcb->block = block;
+ INIT_WORK(&gcb->ws_gc, rrpc_gc_queue);
+
+ queue_work(rrpc->kgc_wq, &gcb->ws_gc);
+ }
+}
+
+static int rrpc_read_rq(struct rrpc *rrpc, struct request *rq,
+ struct nvm_rq_data *rqdata)
+{
+ struct nvm_addr *gp;
+ struct nvm_per_rq *pb;
+ sector_t l_addr = nvm_get_laddr(rq);
+
+ if (rrpc_lock_rq(rrpc, rq))
+ return NVM_PREP_REQUEUE;
+
+ BUG_ON(!(l_addr >= 0 && l_addr < rrpc->nr_pages));
+ gp = &rrpc->trans_map[l_addr];
+
+ if (gp->block) {
+ rqdata->phys_sector = nvm_get_sector(gp->addr);
+ } else {
+ rrpc_unlock_rq(rrpc, rq);
+ blk_mq_end_request(rq, 0);
+ return NVM_PREP_DONE;
+ }
+
+ pb = get_per_rq_data(rq);
+ pb->addr = gp;
+ return NVM_PREP_OK;
+}
+
+static int rrpc_write_rq(struct rrpc *rrpc, struct request *rq,
+ struct nvm_rq_data *rqdata)
+{
+ struct nvm_per_rq *pb;
+ struct nvm_addr *p;
+ int is_gc = 0;
+ sector_t l_addr = nvm_get_laddr(rq);
+
+ if (rq->special)
+ is_gc = 1;
+
+ if (rrpc_lock_rq(rrpc, rq))
+ return NVM_PREP_REQUEUE;
+
+ p = rrpc_map_page(rrpc, l_addr, is_gc);
+ if (!p) {
+ BUG_ON(is_gc);
+ rrpc_unlock_rq(rrpc, rq);
+ rrpc_gc_kick(rrpc);
+ return NVM_PREP_REQUEUE;
+ }
+
+ rqdata->phys_sector = nvm_get_sector(p->addr);
+
+ pb = get_per_rq_data(rq);
+ pb->addr = p;
+
+ return NVM_PREP_OK;
+}
+
+static int rrpc_prep_rq(struct request *rq, struct nvm_rq_data *rqdata,
+ void *private)
+{
+ struct rrpc *rrpc = private;
+
+ if (rq_data_dir(rq) == WRITE)
+ return rrpc_write_rq(rrpc, rq, rqdata);
+
+ return rrpc_read_rq(rrpc, rq, rqdata);
+}
+
+static void rrpc_make_rq(struct request_queue *q, struct bio *bio)
+{
+ struct rrpc *rrpc = q->queuedata;
+
+ if (bio->bi_rw & REQ_DISCARD) {
+ rrpc_discard(rrpc, bio);
+ return;
+ }
+
+ bio->bi_nvm = &rrpc->instance.payload;
+ bio->bi_bdev = rrpc->q_bdev;
+
+ generic_make_request(bio);
+}
+
+static void rrpc_gc_free(struct rrpc *rrpc)
+{
+ struct rrpc_lun *rlun;
+ int i;
+
+ if (rrpc->krqd_wq)
+ destroy_workqueue(rrpc->krqd_wq);
+
+ if (rrpc->kgc_wq)
+ destroy_workqueue(rrpc->kgc_wq);
+
+ if (!rrpc->luns)
+ return;
+
+ for (i = 0; i < rrpc->nr_luns; i++) {
+ rlun = &rrpc->luns[i];
+
+ if (!rlun->blocks)
+ break;
+ vfree(rlun->blocks);
+ }
+}
+
+static int rrpc_gc_init(struct rrpc *rrpc)
+{
+ rrpc->krqd_wq = alloc_workqueue("knvm-work", WQ_MEM_RECLAIM|WQ_UNBOUND,
+ rrpc->nr_luns);
+ if (!rrpc->krqd_wq)
+ return -ENOMEM;
+
+ rrpc->kgc_wq = alloc_workqueue("knvm-gc", WQ_MEM_RECLAIM, 1);
+ if (!rrpc->kgc_wq)
+ return -ENOMEM;
+
+ setup_timer(&rrpc->gc_timer, rrpc_gc_timer, (unsigned long)rrpc);
+
+ return 0;
+}
+
+static void rrpc_map_free(struct rrpc *rrpc)
+{
+ vfree(rrpc->rev_trans_map);
+ vfree(rrpc->trans_map);
+}
+
+static int rrpc_l2p_update(u64 slba, u64 nlb, u64 *entries, void *private)
+{
+ struct rrpc *rrpc = (struct rrpc *)private;
+ struct nvm_dev *dev = rrpc->q_nvm;
+ struct nvm_addr *addr = rrpc->trans_map + slba;
+ struct nvm_rev_addr *raddr = rrpc->rev_trans_map;
+ sector_t max_pages = dev->total_pages * (dev->sector_size >> 9);
+ u64 elba = slba + nlb;
+ u64 i;
+
+ if (unlikely(elba > dev->total_pages)) {
+ pr_err("nvm: L2P data from device is out of bounds!\n");
+ return -EINVAL;
+ }
+
+ for (i = 0; i < nlb; i++) {
+ u64 pba = le64_to_cpu(entries[i]);
+ /* LNVM treats address-spaces as silos, LBA and PBA are
+ * equally large and zero-indexed. */
+ if (unlikely(pba >= max_pages && pba != U64_MAX)) {
+ pr_err("nvm: L2P data entry is out of bounds!\n");
+ return -EINVAL;
+ }
+
+ /* Address zero is special: the first page on a disk is
+ * protected, as it often holds internal device boot
+ * information. */
+ if (!pba)
+ continue;
+
+ addr[i].addr = pba;
+ raddr[pba].addr = slba + i;
+ }
+
+ return 0;
+}
+
+static int rrpc_map_init(struct rrpc *rrpc)
+{
+ struct nvm_dev *dev = rrpc->q_nvm;
+ sector_t i;
+ int ret;
+
+ rrpc->trans_map = vzalloc(sizeof(struct nvm_addr) * rrpc->nr_pages);
+ if (!rrpc->trans_map)
+ return -ENOMEM;
+
+ rrpc->rev_trans_map = vmalloc(sizeof(struct nvm_rev_addr)
+ * rrpc->nr_pages);
+ if (!rrpc->rev_trans_map)
+ return -ENOMEM;
+
+ for (i = 0; i < rrpc->nr_pages; i++) {
+ struct nvm_addr *p = &rrpc->trans_map[i];
+ struct nvm_rev_addr *r = &rrpc->rev_trans_map[i];
+
+ p->addr = ADDR_EMPTY;
+ r->addr = ADDR_EMPTY;
+ }
+
+ if (!dev->ops->get_l2p_tbl)
+ return 0;
+
+ /* Bring up the mapping table from device */
+ ret = dev->ops->get_l2p_tbl(dev->q, 0, dev->total_pages,
+ rrpc_l2p_update, rrpc);
+ if (ret) {
+ pr_err("nvm: rrpc: could not read L2P table.\n");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+
+/* Minimum pages needed within a lun */
+#define PAGE_POOL_SIZE 16
+#define ADDR_POOL_SIZE 64
+
+static int rrpc_core_init(struct rrpc *rrpc)
+{
+ int i;
+
+ down_write(&_lock);
+ if (!_gcb_cache) {
+ _gcb_cache = kmem_cache_create("nvm_gcb_cache",
+ sizeof(struct rrpc_block_gc), 0, 0, NULL);
+ if (!_gcb_cache) {
+ up_write(&_lock);
+ return -ENOMEM;
+ }
+ }
+ up_write(&_lock);
+
+ rrpc->page_pool = mempool_create_page_pool(PAGE_POOL_SIZE, 0);
+ if (!rrpc->page_pool)
+ return -ENOMEM;
+
+ rrpc->gcb_pool = mempool_create_slab_pool(rrpc->q_nvm->nr_luns,
+ _gcb_cache);
+ if (!rrpc->gcb_pool)
+ return -ENOMEM;
+
+ for (i = 0; i < NVM_INFLIGHT_PARTITIONS; i++) {
+ struct nvm_inflight *map = &rrpc->inflight_map[i];
+
+ spin_lock_init(&map->lock);
+ INIT_LIST_HEAD(&map->reqs);
+ }
+
+ return 0;
+}
+
+static void rrpc_core_free(struct rrpc *rrpc)
+{
+ if (rrpc->page_pool)
+ mempool_destroy(rrpc->page_pool);
+ if (rrpc->gcb_pool)
+ mempool_destroy(rrpc->gcb_pool);
+}
+
+static void rrpc_luns_free(struct rrpc *rrpc)
+{
+ kfree(rrpc->luns);
+}
+
+static int rrpc_luns_init(struct rrpc *rrpc, int lun_begin, int lun_end)
+{
+ struct nvm_dev *dev = rrpc->q_nvm;
+ struct nvm_block *block;
+ struct rrpc_lun *rlun;
+ int i, j;
+
+ spin_lock_init(&rrpc->rev_lock);
+
+ rrpc->luns = kcalloc(rrpc->nr_luns, sizeof(struct rrpc_lun),
+ GFP_KERNEL);
+ if (!rrpc->luns)
+ return -ENOMEM;
+
+ /* 1:1 mapping */
+ for (i = 0; i < rrpc->nr_luns; i++) {
+ struct nvm_lun *lun = &dev->luns[i + lun_begin];
+
+ rlun = &rrpc->luns[i];
+ rlun->rrpc = rrpc;
+ rlun->parent = lun;
+ rlun->nr_blocks = lun->nr_blocks;
+
+ rrpc->total_blocks += lun->nr_blocks;
+ rrpc->nr_pages += lun->nr_blocks * lun->nr_pages_per_blk;
+
+ INIT_LIST_HEAD(&rlun->prio_list);
+ INIT_WORK(&rlun->ws_gc, rrpc_lun_gc);
+ spin_lock_init(&rlun->lock);
+
+ rlun->blocks = vzalloc(sizeof(struct rrpc_block) *
+ rlun->nr_blocks);
+ if (!rlun->blocks)
+ goto err;
+
+ lun_for_each_block(lun, block, j) {
+ struct rrpc_block *rblock = &rlun->blocks[j];
+
+ rblock->parent = block;
+ INIT_LIST_HEAD(&rblock->prio);
+ }
+ }
+
+ return 0;
+err:
+ return -ENOMEM;
+}
+
+static void rrpc_free(struct rrpc *rrpc)
+{
+ rrpc_gc_free(rrpc);
+ rrpc_map_free(rrpc);
+ rrpc_core_free(rrpc);
+ rrpc_luns_free(rrpc);
+
+ kfree(rrpc);
+}
+
+static void rrpc_exit(void *private)
+{
+ struct rrpc *rrpc = private;
+
+ blkdev_put(rrpc->q_bdev, FMODE_WRITE | FMODE_READ);
+ del_timer(&rrpc->gc_timer);
+
+ flush_workqueue(rrpc->krqd_wq);
+ flush_workqueue(rrpc->kgc_wq);
+
+ rrpc_free(rrpc);
+}
+
+static sector_t rrpc_capacity(void *private)
+{
+ struct rrpc *rrpc = private;
+ struct nvm_lun *lun;
+ sector_t reserved;
+ int i, max_pages_per_blk = 0;
+
+ nvm_for_each_lun(rrpc->q_nvm, lun, i) {
+ if (lun->nr_pages_per_blk > max_pages_per_blk)
+ max_pages_per_blk = lun->nr_pages_per_blk;
+ }
+
+ /* cur, gc, and two emergency blocks for each lun */
+ reserved = rrpc->nr_luns * max_pages_per_blk * 4;
+
+ if (reserved > rrpc->nr_pages) {
+ pr_err("rrpc: not enough space available to expose storage.\n");
+ return 0;
+ }
+
+ return ((rrpc->nr_pages - reserved) / 10) * 9 * NR_PHY_IN_LOG;
+}
+
+/*
+ * Looks up the logical address in the reverse translation map and checks if
+ * it is still valid by comparing it against the forward logical-to-physical
+ * entry. Stale pages are marked in the block's invalid_pages bitmap.
+ */
+static void rrpc_block_map_update(struct rrpc *rrpc, struct nvm_block *block)
+{
+ struct nvm_lun *lun = block->lun;
+ int offset;
+ struct nvm_addr *laddr;
+ sector_t paddr, pladdr;
+
+ for (offset = 0; offset < lun->nr_pages_per_blk; offset++) {
+ paddr = block_to_addr(block) + offset;
+
+ pladdr = rrpc->rev_trans_map[paddr].addr;
+ if (pladdr == ADDR_EMPTY)
+ continue;
+
+ laddr = &rrpc->trans_map[pladdr];
+
+ if (paddr == laddr->addr) {
+ laddr->block = block;
+ } else {
+ set_bit(offset, block->invalid_pages);
+ block->nr_invalid_pages++;
+ }
+ }
+}
+
+static int rrpc_blocks_init(struct rrpc *rrpc)
+{
+ struct nvm_dev *dev = rrpc->q_nvm;
+ struct nvm_lun *lun;
+ struct nvm_block *blk;
+ sector_t lun_iter, blk_iter;
+
+ for (lun_iter = 0; lun_iter < rrpc->nr_luns; lun_iter++) {
+ lun = &dev->luns[lun_iter + rrpc->lun_offset];
+
+ lun_for_each_block(lun, blk, blk_iter)
+ rrpc_block_map_update(rrpc, blk);
+ }
+
+ return 0;
+}
+
+static int rrpc_luns_configure(struct rrpc *rrpc)
+{
+ struct rrpc_lun *rlun;
+ struct nvm_block *blk;
+ int i;
+
+ for (i = 0; i < rrpc->nr_luns; i++) {
+ rlun = &rrpc->luns[i];
+
+ blk = nvm_get_blk(rlun->parent, 0);
+ if (!blk)
+ return -EINVAL;
+
+ rrpc_set_lun_cur(rlun, blk);
+
+ /* Emergency gc block */
+ blk = nvm_get_blk(rlun->parent, 1);
+ if (!blk)
+ return -EINVAL;
+ rlun->gc_cur = blk;
+ }
+
+ return 0;
+}
+
+static struct nvm_target_type tt_rrpc;
+
+static void *rrpc_init(struct gendisk *bdisk, struct gendisk *tdisk,
+ int lun_begin, int lun_end)
+{
+ struct request_queue *bqueue = bdisk->queue;
+ struct request_queue *tqueue = tdisk->queue;
+ struct nvm_dev *dev;
+ struct block_device *bdev;
+ struct rrpc *rrpc;
+ int ret;
+
+ if (!nvm_get_dev(bdisk)) {
+ pr_err("nvm: block device not supported.\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ bdev = bdget_disk(bdisk, 0);
+ if (blkdev_get(bdev, FMODE_WRITE | FMODE_READ, NULL)) {
+ pr_err("nvm: could not access backing device\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ dev = nvm_get_dev(bdisk);
+
+ rrpc = kzalloc(sizeof(struct rrpc), GFP_KERNEL);
+ if (!rrpc) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ rrpc->q_dev = bqueue;
+ rrpc->q_nvm = bdisk->nvm;
+ rrpc->q_bdev = bdev;
+ rrpc->nr_luns = lun_end - lun_begin + 1;
+ rrpc->instance.tt = &tt_rrpc;
+
+ /* simple round-robin strategy */
+ atomic_set(&rrpc->next_lun, -1);
+
+ ret = rrpc_luns_init(rrpc, lun_begin, lun_end);
+ if (ret) {
+ pr_err("nvm: could not initialize luns\n");
+ goto err;
+ }
+
+ rrpc->poffset = rrpc->luns[0].parent->nr_blocks *
+ rrpc->luns[0].parent->nr_pages_per_blk * lun_begin;
+ rrpc->lun_offset = lun_begin;
+
+ ret = rrpc_core_init(rrpc);
+ if (ret) {
+ pr_err("nvm: rrpc: could not initialize core\n");
+ goto err;
+ }
+
+ ret = rrpc_map_init(rrpc);
+ if (ret) {
+ pr_err("nvm: rrpc: could not initialize maps\n");
+ goto err;
+ }
+
+ ret = rrpc_blocks_init(rrpc);
+ if (ret) {
+ pr_err("nvm: rrpc: could not initialize state for blocks\n");
+ goto err;
+ }
+
+ ret = rrpc_luns_configure(rrpc);
+ if (ret) {
+ pr_err("nvm: rrpc: not enough blocks available in LUNs.\n");
+ goto err;
+ }
+
+ ret = rrpc_gc_init(rrpc);
+ if (ret) {
+ pr_err("nvm: rrpc: could not initialize gc\n");
+ goto err;
+ }
+
+ /* make sure to inherit the size from the underlying device */
+ blk_queue_logical_block_size(tqueue, queue_physical_block_size(bqueue));
+ blk_queue_max_hw_sectors(tqueue, queue_max_hw_sectors(bqueue));
+
+ pr_info("nvm: rrpc initialized with %u luns and %llu pages.\n",
+ rrpc->nr_luns, (unsigned long long)rrpc->nr_pages);
+
+ mod_timer(&rrpc->gc_timer, jiffies + msecs_to_jiffies(10));
+
+ return rrpc;
+err:
+ blkdev_put(bdev, FMODE_WRITE | FMODE_READ);
+ rrpc_free(rrpc);
+ return ERR_PTR(ret);
+}
+
+/* round robin, page-based FTL, and cost-based GC */
+static struct nvm_target_type tt_rrpc = {
+ .name = "rrpc",
+
+ .make_rq = rrpc_make_rq,
+ .prep_rq = rrpc_prep_rq,
+ .unprep_rq = rrpc_unprep_rq,
+
+ .capacity = rrpc_capacity,
+
+ .init = rrpc_init,
+ .exit = rrpc_exit,
+};
+
+static int __init rrpc_module_init(void)
+{
+ return nvm_register_target(&tt_rrpc);
+}
+
+static void rrpc_module_exit(void)
+{
+ nvm_unregister_target(&tt_rrpc);
+}
+
+module_init(rrpc_module_init);
+module_exit(rrpc_module_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Round-Robin Cost-based Hybrid Layer for Open-Channel SSDs");
diff --git a/drivers/lightnvm/rrpc.h b/drivers/lightnvm/rrpc.h
new file mode 100644
index 0000000..d8b1588
--- /dev/null
+++ b/drivers/lightnvm/rrpc.h
@@ -0,0 +1,221 @@
+/*
+ * Copyright (C) 2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * Implementation of a Round-robin page-based Hybrid FTL for Open-channel SSDs.
+ */
+
+#ifndef RRPC_H_
+#define RRPC_H_
+
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+
+#include <linux/lightnvm.h>
+
+/* We partition the namespace of translation map into these pieces for tracking
+ * in-flight addresses. */
+#define NVM_INFLIGHT_PARTITIONS 1
+
+/* Run only GC if less than 1/X blocks are free */
+#define GC_LIMIT_INVERSE 10
+#define GC_TIME_SECS 100
+
+struct nvm_inflight {
+ spinlock_t lock;
+ struct list_head reqs;
+};
+
+struct rrpc_lun;
+
+struct rrpc_block {
+ struct nvm_block *parent;
+ struct list_head prio;
+};
+
+struct rrpc_lun {
+ struct rrpc *rrpc;
+ struct nvm_lun *parent;
+ struct nvm_block *cur, *gc_cur;
+ struct rrpc_block *blocks; /* Reference to block allocation */
+ struct list_head prio_list; /* Blocks that may be GC'ed */
+ struct work_struct ws_gc;
+
+ int nr_blocks;
+ spinlock_t lock;
+};
+
+struct rrpc {
+ /* instance must be kept at the top to resolve rrpc in prep/unprep */
+ struct nvm_target_instance instance;
+
+ struct nvm_dev *q_nvm;
+ struct request_queue *q_dev;
+ struct block_device *q_bdev;
+
+ int nr_luns;
+ int lun_offset;
+ sector_t poffset; /* physical page offset */
+
+ struct rrpc_lun *luns;
+
+ /* calculated values */
+ unsigned long nr_pages;
+ unsigned long total_blocks;
+
+ /* Write strategy variables. Move these into a separate structure for
+ * each strategy */
+ atomic_t next_lun; /* Whenever a page is written, this is updated
+ * to point to the next write lun */
+
+ /* Simple translation map of logical addresses to physical addresses.
+ * The logical addresses are known by the host system, while the physical
+ * addresses are used when writing to the disk block device. */
+ struct nvm_addr *trans_map;
+ /* also store a reverse map for garbage collection */
+ struct nvm_rev_addr *rev_trans_map;
+ spinlock_t rev_lock;
+
+ struct nvm_inflight inflight_map[NVM_INFLIGHT_PARTITIONS];
+
+ mempool_t *addr_pool;
+ mempool_t *page_pool;
+ mempool_t *gcb_pool;
+
+ struct timer_list gc_timer;
+ struct workqueue_struct *krqd_wq;
+ struct workqueue_struct *kgc_wq;
+
+ struct gc_blocks *gblks;
+ struct gc_luns *gluns;
+};
+
+struct rrpc_block_gc {
+ struct rrpc *rrpc;
+ struct nvm_block *block;
+ struct work_struct ws_gc;
+};
+
+static inline sector_t nvm_get_laddr(struct request *rq)
+{
+ if (rq->special) {
+ struct nvm_internal_cmd *cmd = rq->special;
+
+ return cmd->phys_lba;
+ }
+
+ return blk_rq_pos(rq) / NR_PHY_IN_LOG;
+}
+
+static inline sector_t nvm_get_sector(sector_t laddr)
+{
+ return laddr * NR_PHY_IN_LOG;
+}
+
+static inline struct nvm_per_rq *get_per_rq_data(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+
+ return blk_mq_rq_to_pdu(rq) + q->tag_set->cmd_size -
+ sizeof(struct nvm_per_rq);
+}
+
+static inline int request_intersects(struct rrpc_inflight_rq *r,
+ sector_t laddr_start, sector_t laddr_end)
+{
+ return (laddr_end >= r->l_start && laddr_end <= r->l_end) &&
+ (laddr_start >= r->l_start && laddr_start <= r->l_end);
+}
+
+static int __rrpc_lock_laddr(struct rrpc *rrpc, sector_t laddr,
+ unsigned pages, struct rrpc_inflight_rq *r)
+{
+ struct nvm_inflight *map =
+ &rrpc->inflight_map[laddr % NVM_INFLIGHT_PARTITIONS];
+ sector_t laddr_end = laddr + pages - 1;
+ struct rrpc_inflight_rq *rtmp;
+
+ spin_lock_irq(&map->lock);
+ list_for_each_entry(rtmp, &map->reqs, list) {
+ if (unlikely(request_intersects(rtmp, laddr, laddr_end))) {
+ /* existing, overlapping request, come back later */
+ spin_unlock_irq(&map->lock);
+ return 1;
+ }
+ }
+
+ r->l_start = laddr;
+ r->l_end = laddr_end;
+
+ list_add_tail(&r->list, &map->reqs);
+ spin_unlock_irq(&map->lock);
+ return 0;
+}
+
+static inline int rrpc_lock_laddr(struct rrpc *rrpc, sector_t laddr,
+ unsigned pages,
+ struct rrpc_inflight_rq *r)
+{
+ BUG_ON((laddr + pages) > rrpc->nr_pages);
+
+ return __rrpc_lock_laddr(rrpc, laddr, pages, r);
+}
+
+static inline struct rrpc_inflight_rq *rrpc_get_inflight_rq(struct request *rq)
+{
+ struct nvm_per_rq *pd = get_per_rq_data(rq);
+
+ return &pd->inflight_rq;
+}
+
+static inline int rrpc_lock_rq(struct rrpc *rrpc, struct request *rq)
+{
+ sector_t laddr = nvm_get_laddr(rq);
+ unsigned int pages = blk_rq_bytes(rq) / EXPOSED_PAGE_SIZE;
+ struct rrpc_inflight_rq *r = rrpc_get_inflight_rq(rq);
+
+ if (rq->special)
+ return 0;
+
+ return rrpc_lock_laddr(rrpc, laddr, pages, r);
+}
+
+static inline void rrpc_unlock_laddr(struct rrpc *rrpc, sector_t laddr,
+ struct rrpc_inflight_rq *r)
+{
+ struct nvm_inflight *map =
+ &rrpc->inflight_map[laddr % NVM_INFLIGHT_PARTITIONS];
+ unsigned long flags;
+
+ spin_lock_irqsave(&map->lock, flags);
+ list_del_init(&r->list);
+ spin_unlock_irqrestore(&map->lock, flags);
+}
+
+static inline void rrpc_unlock_rq(struct rrpc *rrpc, struct request *rq)
+{
+ sector_t laddr = nvm_get_laddr(rq);
+ unsigned int pages = blk_rq_bytes(rq) / EXPOSED_PAGE_SIZE;
+ struct rrpc_inflight_rq *r = rrpc_get_inflight_rq(rq);
+
+ BUG_ON((laddr + pages) > rrpc->nr_pages);
+
+ if (rq->special)
+ return;
+
+ rrpc_unlock_laddr(rrpc, laddr, r);
+}
+
+#endif /* RRPC_H_ */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0283c7e..5635940 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1471,6 +1471,8 @@ extern bool blk_integrity_merge_bio(struct request_queue *, struct request *,
static inline
struct blk_integrity *bdev_get_integrity(struct block_device *bdev)
{
+ if (unlikely(!bdev))
+ return NULL;
return bdev->bd_disk->integrity;
}

--
2.1.4

2015-06-05 12:55:33

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v4 7/8] null_blk: LightNVM support

Initial support for LightNVM. The support can be used to benchmark the
performance of targets and the core implementation.

Signed-off-by: Matias Bjørling <[email protected]>
---
Documentation/block/null_blk.txt | 8 +++
drivers/block/null_blk.c | 133 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 138 insertions(+), 3 deletions(-)

diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index 2f6c6ff..a34f50a 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -70,3 +70,11 @@ use_per_node_hctx=[0/1]: Default: 0
parameter.
1: The multi-queue block layer is instantiated with a hardware dispatch
queue for each CPU node in the system.
+
+IV: LightNVM specific parameters
+
+nvm_enable=[x]: Default: 0
+ Enable LightNVM for null block devices. Requires blk-mq to be used.
+
+nvm_num_channels=[x]: Default: 1
+ Number of LightNVM channels that are exposed to the LightNVM driver.
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 79972ab..1337541 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -8,6 +8,7 @@
#include <linux/slab.h>
#include <linux/blk-mq.h>
#include <linux/hrtimer.h>
+#include <linux/lightnvm.h>

struct nullb_cmd {
struct list_head list;
@@ -17,6 +18,7 @@ struct nullb_cmd {
struct bio *bio;
unsigned int tag;
struct nullb_queue *nq;
+ struct nvm_rq_data nvm_rqdata;
};

struct nullb_queue {
@@ -147,6 +149,14 @@ static bool use_per_node_hctx = false;
module_param(use_per_node_hctx, bool, S_IRUGO);
MODULE_PARM_DESC(use_per_node_hctx, "Use per-node allocation for hardware context queues. Default: false");

+static bool nvm_enable;
+module_param(nvm_enable, bool, S_IRUGO);
+MODULE_PARM_DESC(nvm_enable, "Enable Open-channel SSD. Default: false");
+
+static int nvm_num_channels = 1;
+module_param(nvm_num_channels, int, S_IRUGO);
+MODULE_PARM_DESC(nvm_num_channels, "Number of channels to be exposed from the Open-Channel SSD. Default: 1");
+
static void put_tag(struct nullb_queue *nq, unsigned int tag)
{
clear_bit_unlock(tag, nq->tag_map);
@@ -273,6 +283,9 @@ static void null_softirq_done_fn(struct request *rq)

static inline void null_handle_cmd(struct nullb_cmd *cmd)
{
+ if (nvm_enable)
+ nvm_unprep_rq(cmd->rq, &cmd->nvm_rqdata);
+
/* Complete IO by inline, softirq or timer */
switch (irqmode) {
case NULL_IRQ_SOFTIRQ:
@@ -351,6 +364,85 @@ static void null_request_fn(struct request_queue *q)
}
}

+#ifdef CONFIG_NVM
+
+static int null_nvm_id(struct request_queue *q, struct nvm_id *id)
+{
+ sector_t size = gb * 1024 * 1024 * 1024ULL;
+ unsigned long per_chnl_size =
+ size / bs / nvm_num_channels;
+ struct nvm_id_chnl *chnl;
+ int i;
+
+ id->ver_id = 0x1;
+ id->nvm_type = NVM_NVMT_BLK;
+ id->nchannels = nvm_num_channels;
+
+ id->chnls = kmalloc_array(id->nchannels, sizeof(struct nvm_id_chnl),
+ GFP_KERNEL);
+ if (!id->chnls)
+ return -ENOMEM;
+
+ for (i = 0; i < id->nchannels; i++) {
+ chnl = &id->chnls[i];
+ chnl->queue_size = hw_queue_depth;
+ chnl->gran_read = bs;
+ chnl->gran_write = bs;
+ chnl->gran_erase = bs * 256;
+ chnl->oob_size = 0;
+ chnl->t_r = chnl->t_sqr = 25000; /* 25us */
+ chnl->t_w = chnl->t_sqw = 500000; /* 500us */
+ chnl->t_e = 1500000; /* 1500us */
+ chnl->io_sched = NVM_IOSCHED_CHANNEL;
+ chnl->laddr_begin = per_chnl_size * i;
+ chnl->laddr_end = per_chnl_size * (i + 1) - 1;
+ }
+
+ return 0;
+}
+
+static int null_nvm_get_features(struct request_queue *q,
+ struct nvm_get_features *gf)
+{
+ gf->rsp = 0;
+ gf->ext = 0;
+
+ return 0;
+}
+
+static int null_nvm_internal_rw(struct request_queue *q,
+ struct nvm_internal_cmd *cmd)
+{
+ struct request *req;
+ int ret;
+
+ req = blk_mq_alloc_request(q, cmd->rw, GFP_KERNEL, false);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ req->cmd_type = REQ_TYPE_DRV_PRIV;
+ req->cmd_flags |= REQ_FAILFAST_DRIVER;
+ req->__data_len = 0;
+ req->__sector = (sector_t) -1;
+ req->bio = req->biotail = NULL;
+ req->timeout = 30;
+ req->special = cmd;
+
+ blk_execute_rq(req->q, NULL, req, 0);
+ ret = req->errors;
+ blk_mq_free_request(req);
+ return ret;
+}
+
+static struct nvm_dev_ops null_nvm_dev_ops = {
+ .identify = null_nvm_id,
+ .get_features = null_nvm_get_features,
+ .internal_rw = null_nvm_internal_rw,
+};
+#else
+static struct nvm_dev_ops null_nvm_dev_ops;
+#endif /* CONFIG_NVM */
+
static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
@@ -359,6 +451,22 @@ static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
cmd->rq = bd->rq;
cmd->nq = hctx->driver_data;

+ if (nvm_enable) {
+ nvm_init_rq_data(&cmd->nvm_rqdata);
+ switch (nvm_prep_rq(cmd->rq, &cmd->nvm_rqdata)) {
+ case NVM_PREP_DONE:
+ return BLK_MQ_RQ_QUEUE_OK;
+ case NVM_PREP_REQUEUE:
+ blk_mq_requeue_request(bd->rq);
+ blk_mq_kick_requeue_list(hctx->queue);
+ return BLK_MQ_RQ_QUEUE_OK;
+ case NVM_PREP_BUSY:
+ return BLK_MQ_RQ_QUEUE_BUSY;
+ case NVM_PREP_ERROR:
+ return BLK_MQ_RQ_QUEUE_ERROR;
+ }
+ }
+
blk_mq_start_request(bd->rq);

null_handle_cmd(cmd);
@@ -517,14 +625,21 @@ static int null_add_dev(void)
goto out_free_nullb;

if (queue_mode == NULL_Q_MQ) {
+ int cmd_size = sizeof(struct nullb_cmd);
+
+ if (nvm_enable)
+ cmd_size += sizeof(struct nvm_per_rq);
+
nullb->tag_set.ops = &null_mq_ops;
nullb->tag_set.nr_hw_queues = submit_queues;
nullb->tag_set.queue_depth = hw_queue_depth;
nullb->tag_set.numa_node = home_node;
- nullb->tag_set.cmd_size = sizeof(struct nullb_cmd);
- nullb->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+ nullb->tag_set.cmd_size = cmd_size;
nullb->tag_set.driver_data = nullb;

+ if (!nvm_enable)
+ nullb->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+
rv = blk_mq_alloc_tag_set(&nullb->tag_set);
if (rv)
goto out_cleanup_queues;
@@ -568,8 +683,8 @@ static int null_add_dev(void)
}

mutex_lock(&lock);
- list_add_tail(&nullb->list, &nullb_list);
nullb->index = nullb_indexes++;
+ list_add_tail(&nullb->list, &nullb_list);
mutex_unlock(&lock);

blk_queue_logical_block_size(nullb->q, bs);
@@ -578,16 +693,23 @@ static int null_add_dev(void)
size = gb * 1024 * 1024 * 1024ULL;
set_capacity(disk, size >> 9);

+ if (nvm_enable && nvm_register(nullb->q, disk, &null_nvm_dev_ops))
+ goto out_cleanup_disk;
+
disk->flags |= GENHD_FL_EXT_DEVT | GENHD_FL_SUPPRESS_PARTITION_INFO;
disk->major = null_major;
disk->first_minor = nullb->index;
disk->fops = &null_fops;
disk->private_data = nullb;
disk->queue = nullb->q;
+
sprintf(disk->disk_name, "nullb%d", nullb->index);
add_disk(disk);
+ nvm_attach_sysfs(disk);
return 0;

+out_cleanup_disk:
+ put_disk(disk);
out_cleanup_blk_queue:
blk_cleanup_queue(nullb->q);
out_cleanup_tags:
@@ -611,6 +733,11 @@ static int __init null_init(void)
bs = PAGE_SIZE;
}

+ if (nvm_enable && bs != 4096) {
+ pr_warn("null_blk: only 4K sectors are supported for Open-Channel SSDs. bs is set to 4K.\n");
+ bs = 4096;
+ }
+
if (queue_mode == NULL_Q_MQ && use_per_node_hctx) {
if (submit_queues < nr_online_nodes) {
pr_warn("null_blk: submit_queues param is set to %u.",
--
2.1.4

2015-06-05 12:55:28

by Matias Bjørling

[permalink] [raw]
Subject: [PATCH v4 8/8] nvme: LightNVM support

The first generation of Open-Channel SSDs will be based on NVMe. The
integration requires that an NVMe device exposes itself as a LightNVM
device. This is currently done by hooking into the Controller
Capabilities (CAP) register and a per-namespace bit in NSFEAT.

After detection, vendor-specific commands are used to identify the
device and enumerate supported features.
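
Condensed from the hunks below, the detection and hook-up amounts to
roughly the following (nvme_configure_admin_queue() selects the command
set, nvme_revalidate_disk() registers the namespace):

	u64 cap = readq(&dev->bar->cap);

	/* select the LightNVM command set when the controller advertises it */
	dev->ctrl_config = NVME_CAP_LIGHTNVM(cap) ?
				NVME_CC_CSS_LIGHTNVM : NVME_CC_CSS_NVM;

	/* per-namespace opt-in via the NSFEAT bit */
	if ((dev->ctrl_config & NVME_CC_CSS_LIGHTNVM) &&
	    (id->nsfeat & NVME_NS_FEAT_NVM) && ns->type != NVME_NS_NVM) {
		if (nvme_nvm_register(ns->queue, disk))
			dev_warn(dev->dev, "LightNVM init failure\n");
		ns->type = NVME_NS_NVM;
	}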

Signed-off-by: Javier González <[email protected]>
Signed-off-by: Matias Bjørling <[email protected]>
---
drivers/block/Makefile | 2 +-
drivers/block/nvme-core.c | 113 +++++++++++++--
drivers/block/nvme-lightnvm.c | 320 ++++++++++++++++++++++++++++++++++++++++++
include/linux/nvme.h | 9 ++
include/uapi/linux/nvme.h | 131 +++++++++++++++++
5 files changed, 564 insertions(+), 11 deletions(-)
create mode 100644 drivers/block/nvme-lightnvm.c

diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 9cc6c18..37f7b3b 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -45,6 +45,6 @@ obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
obj-$(CONFIG_ZRAM) += zram/

-nvme-y := nvme-core.o nvme-scsi.o
+nvme-y := nvme-core.o nvme-scsi.o nvme-lightnvm.o
skd-y := skd_main.o
swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 6e433b1..be6e67d 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -39,6 +39,7 @@
#include <linux/slab.h>
#include <linux/t10-pi.h>
#include <linux/types.h>
+#include <linux/lightnvm.h>
#include <scsi/sg.h>
#include <asm-generic/io-64-nonatomic-lo-hi.h>

@@ -134,6 +135,11 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+ BUILD_BUG_ON(sizeof(struct nvme_nvm_hb_rw) != 64);
+ BUILD_BUG_ON(sizeof(struct nvme_nvm_l2ptbl) != 64);
+ BUILD_BUG_ON(sizeof(struct nvme_nvm_bbtbl) != 64);
+ BUILD_BUG_ON(sizeof(struct nvme_nvm_set_resp) != 64);
+ BUILD_BUG_ON(sizeof(struct nvme_nvm_erase_blk) != 64);
}

typedef void (*nvme_completion_fn)(struct nvme_queue *, void *,
@@ -408,6 +414,7 @@ static inline void iod_init(struct nvme_iod *iod, unsigned nbytes,
iod->npages = -1;
iod->length = nbytes;
iod->nents = 0;
+ nvm_init_rq_data(&iod->nvm_rqdata);
}

static struct nvme_iod *
@@ -634,6 +641,8 @@ static void req_completion(struct nvme_queue *nvmeq, void *ctx,
}
nvme_free_iod(nvmeq->dev, iod);

+ nvm_unprep_rq(req, &iod->nvm_rqdata);
+
blk_mq_complete_request(req);
}

@@ -717,8 +726,8 @@ static int nvme_setup_prps(struct nvme_dev *dev, struct nvme_iod *iod,
return total_len;
}

-static void nvme_submit_priv(struct nvme_queue *nvmeq, struct request *req,
- struct nvme_iod *iod)
+static void nvme_submit_priv(struct nvme_queue *nvmeq, struct nvme_ns *ns,
+ struct request *req, struct nvme_iod *iod)
{
struct nvme_command *cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];

@@ -727,6 +736,9 @@ static void nvme_submit_priv(struct nvme_queue *nvmeq, struct request *req,
if (req->nr_phys_segments) {
cmnd->rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
cmnd->rw.prp2 = cpu_to_le64(iod->first_dma);
+
+ if (ns && ns->type == NVME_NS_NVM)
+ nvme_nvm_prep_internal_rq(req, ns, cmnd, iod);
}

if (++nvmeq->sq_tail == nvmeq->q_depth)
@@ -778,6 +790,46 @@ static void nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
writel(nvmeq->sq_tail, nvmeq->q_db);
}

+static int nvme_nvm_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
+ struct nvme_ns *ns)
+{
+#ifdef CONFIG_NVM
+ struct request *req = iod_get_private(iod);
+ struct nvme_command *cmnd;
+ u16 control = 0;
+ u32 dsmgmt = 0;
+
+ if (req->cmd_flags & REQ_FUA)
+ control |= NVME_RW_FUA;
+ if (req->cmd_flags & (REQ_FAILFAST_DEV | REQ_RAHEAD))
+ control |= NVME_RW_LR;
+
+ cmnd = &nvmeq->sq_cmds[nvmeq->sq_tail];
+ memset(cmnd, 0, sizeof(*cmnd));
+
+ cmnd->nvm_hb_rw.opcode = (rq_data_dir(req) ?
+ nvme_nvm_cmd_hb_write : nvme_nvm_cmd_hb_read);
+ cmnd->nvm_hb_rw.command_id = req->tag;
+ cmnd->nvm_hb_rw.nsid = cpu_to_le32(ns->ns_id);
+ cmnd->nvm_hb_rw.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+ cmnd->nvm_hb_rw.prp2 = cpu_to_le64(iod->first_dma);
+ cmnd->nvm_hb_rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
+ cmnd->nvm_hb_rw.length = cpu_to_le16(
+ (blk_rq_bytes(req) >> ns->lba_shift) - 1);
+ cmnd->nvm_hb_rw.control = cpu_to_le16(control);
+ cmnd->nvm_hb_rw.dsmgmt = cpu_to_le32(dsmgmt);
+ cmnd->nvm_hb_rw.phys_addr =
+ cpu_to_le64(nvme_block_nr(ns,
+ iod->nvm_rqdata.phys_sector));
+
+ if (++nvmeq->sq_tail == nvmeq->q_depth)
+ nvmeq->sq_tail = 0;
+ writel(nvmeq->sq_tail, nvmeq->q_db);
+#endif /* CONFIG_NVM */
+
+ return 0;
+}
+
static int nvme_submit_iod(struct nvme_queue *nvmeq, struct nvme_iod *iod,
struct nvme_ns *ns)
{
@@ -909,14 +961,31 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
}
}

+ if (ns && ns->type == NVME_NS_NVM) {
+ switch (nvm_prep_rq(req, &iod->nvm_rqdata)) {
+ case NVM_PREP_DONE:
+ goto done_cmd;
+ case NVM_PREP_REQUEUE:
+ blk_mq_requeue_request(req);
+ blk_mq_kick_requeue_list(hctx->queue);
+ goto done_cmd;
+ case NVM_PREP_BUSY:
+ goto retry_cmd;
+ case NVM_PREP_ERROR:
+ goto error_cmd;
+ }
+ }
+
nvme_set_info(cmd, iod, req_completion);
spin_lock_irq(&nvmeq->q_lock);
if (req->cmd_type == REQ_TYPE_DRV_PRIV)
- nvme_submit_priv(nvmeq, req, iod);
+ nvme_submit_priv(nvmeq, ns, req, iod);
else if (req->cmd_flags & REQ_DISCARD)
nvme_submit_discard(nvmeq, ns, req, iod);
else if (req->cmd_flags & REQ_FLUSH)
nvme_submit_flush(nvmeq, ns, req->tag);
+ else if (ns && ns->type == NVME_NS_NVM)
+ nvme_nvm_submit_iod(nvmeq, iod, ns);
else
nvme_submit_iod(nvmeq, iod, ns);

@@ -924,6 +993,9 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
spin_unlock_irq(&nvmeq->q_lock);
return BLK_MQ_RQ_QUEUE_OK;

+ done_cmd:
+ nvme_free_iod(nvmeq->dev, iod);
+ return BLK_MQ_RQ_QUEUE_OK;
error_cmd:
nvme_free_iod(dev, iod);
return BLK_MQ_RQ_QUEUE_ERROR;
@@ -1699,7 +1771,8 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)

dev->page_size = 1 << page_shift;

- dev->ctrl_config = NVME_CC_CSS_NVM;
+ dev->ctrl_config = NVME_CAP_LIGHTNVM(cap) ?
+ NVME_CC_CSS_LIGHTNVM : NVME_CC_CSS_NVM;
dev->ctrl_config |= (page_shift - 12) << NVME_CC_MPS_SHIFT;
dev->ctrl_config |= NVME_CC_ARB_RR | NVME_CC_SHN_NONE;
dev->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
@@ -1932,6 +2005,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
u8 lbaf, pi_type;
u16 old_ms;
unsigned short bs;
+ int ret = 0;

if (nvme_identify_ns(dev, ns->ns_id, &id)) {
dev_warn(dev->dev, "%s: Identify failure\n", __func__);
@@ -1977,8 +2051,17 @@ static int nvme_revalidate_disk(struct gendisk *disk)
if (dev->oncs & NVME_CTRL_ONCS_DSM)
nvme_config_discard(ns);

+ if ((dev->ctrl_config & NVME_CC_CSS_LIGHTNVM) &&
+ id->nsfeat & NVME_NS_FEAT_NVM && ns->type != NVME_NS_NVM) {
+ ret = nvme_nvm_register(ns->queue, disk);
+ if (ret)
+ dev_warn(dev->dev,
+ "%s: LightNVM init failure\n", __func__);
+ ns->type = NVME_NS_NVM;
+ }
+
kfree(id);
- return 0;
+ return ret;
}

static const struct block_device_operations nvme_fops = {
@@ -2058,7 +2141,6 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
ns->ns_id = nsid;
ns->disk = disk;
ns->lba_shift = 9; /* set to a default value for 512 until disk is validated */
- list_add_tail(&ns->list, &dev->namespaces);

blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
if (dev->max_hw_sectors)
@@ -2072,7 +2154,6 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
disk->first_minor = 0;
disk->fops = &nvme_fops;
disk->private_data = ns;
- disk->queue = ns->queue;
disk->driverfs_dev = dev->device;
disk->flags = GENHD_FL_EXT_DEVT;
sprintf(disk->disk_name, "nvme%dn%d", dev->instance, nsid);
@@ -2084,11 +2165,20 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
* requires it.
*/
set_capacity(disk, 0);
- nvme_revalidate_disk(ns->disk);
+ if (nvme_revalidate_disk(ns->disk))
+ goto out_put_disk;
+
+ list_add_tail(&ns->list, &dev->namespaces);
+
+ disk->queue = ns->queue;
add_disk(ns->disk);
+ nvm_attach_sysfs(ns->disk);
if (ns->ms)
revalidate_disk(ns->disk);
return;
+
+ out_put_disk:
+ put_disk(disk);
out_free_queue:
blk_cleanup_queue(ns->queue);
out_free_ns:
@@ -2217,7 +2307,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
int res;
unsigned nn, i;
struct nvme_id_ctrl *ctrl;
- int shift = NVME_CAP_MPSMIN(readq(&dev->bar->cap)) + 12;
+ u64 cap = readq(&dev->bar->cap);
+ int shift = NVME_CAP_MPSMIN(cap) + 12;

res = nvme_identify_ctrl(dev, &ctrl);
if (res) {
@@ -2255,9 +2346,11 @@ static int nvme_dev_add(struct nvme_dev *dev)
dev->tagset.queue_depth =
min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1;
dev->tagset.cmd_size = nvme_cmd_size(dev);
- dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
dev->tagset.driver_data = dev;

+ if (!NVME_CAP_LIGHTNVM(cap))
+ dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
+
if (blk_mq_alloc_tag_set(&dev->tagset))
return 0;

diff --git a/drivers/block/nvme-lightnvm.c b/drivers/block/nvme-lightnvm.c
new file mode 100644
index 0000000..1a57c1b8
--- /dev/null
+++ b/drivers/block/nvme-lightnvm.c
@@ -0,0 +1,320 @@
+/*
+ * nvme-lightnvm.c - LightNVM NVMe device
+ *
+ * Copyright (C) 2014-2015 IT University of Copenhagen
+ * Initial release: Matias Bjorling <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139,
+ * USA.
+ *
+ */
+
+#include <linux/nvme.h>
+#include <linux/bitops.h>
+#include <linux/lightnvm.h>
+
+#ifdef CONFIG_NVM
+
+static int init_chnls(struct request_queue *q, struct nvm_id *nvm_id,
+ struct nvme_nvm_id *nvme_nvm_id)
+{
+ struct nvme_nvm_id_chnl *src = nvme_nvm_id->chnls;
+ struct nvm_id_chnl *dst = nvm_id->chnls;
+ struct nvme_ns *ns = q->queuedata;
+ struct nvme_command c = {
+ .nvm_identify.opcode = nvme_nvm_admin_identify,
+ .nvm_identify.nsid = cpu_to_le32(ns->ns_id),
+ };
+ unsigned int len = nvm_id->nchannels;
+ int i, end, ret, off = 0;
+
+ while (len) {
+ end = min_t(u32, NVME_NVM_CHNLS_PR_REQ, len);
+
+ for (i = 0; i < end; i++, dst++, src++) {
+ dst->laddr_begin = le64_to_cpu(src->laddr_begin);
+ dst->laddr_end = le64_to_cpu(src->laddr_end);
+ dst->oob_size = le32_to_cpu(src->oob_size);
+ dst->queue_size = le32_to_cpu(src->queue_size);
+ dst->gran_read = le32_to_cpu(src->gran_read);
+ dst->gran_write = le32_to_cpu(src->gran_write);
+ dst->gran_erase = le32_to_cpu(src->gran_erase);
+ dst->t_r = le32_to_cpu(src->t_r);
+ dst->t_sqr = le32_to_cpu(src->t_sqr);
+ dst->t_w = le32_to_cpu(src->t_w);
+ dst->t_sqw = le32_to_cpu(src->t_sqw);
+ dst->t_e = le32_to_cpu(src->t_e);
+ dst->io_sched = src->io_sched;
+ }
+
+ len -= end;
+ if (!len)
+ break;
+
+ off += end;
+
+ c.nvm_identify.chnl_off = off;
+
+ ret = nvme_submit_sync_cmd(q, &c, nvme_nvm_id, 4096);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+static int nvme_nvm_identify(struct request_queue *q, struct nvm_id *nvm_id)
+{
+ struct nvme_ns *ns = q->queuedata;
+ struct nvme_nvm_id *nvme_nvm_id;
+ struct nvme_command c = {
+ .nvm_identify.opcode = nvme_nvm_admin_identify,
+ .nvm_identify.nsid = cpu_to_le32(ns->ns_id),
+ .nvm_identify.chnl_off = 0,
+ };
+ int ret;
+
+ nvme_nvm_id = kmalloc(4096, GFP_KERNEL);
+ if (!nvme_nvm_id)
+ return -ENOMEM;
+
+ ret = nvme_submit_sync_cmd(q, &c, nvme_nvm_id, 4096);
+ if (ret) {
+ ret = -EIO;
+ goto out;
+ }
+
+ nvm_id->ver_id = nvme_nvm_id->ver_id;
+ nvm_id->nvm_type = nvme_nvm_id->nvm_type;
+ nvm_id->nchannels = le16_to_cpu(nvme_nvm_id->nchannels);
+
+ if (!nvm_id->chnls)
+ nvm_id->chnls = kmalloc(sizeof(struct nvm_id_chnl)
+ * nvm_id->nchannels, GFP_KERNEL);
+ if (!nvm_id->chnls) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = init_chnls(q, nvm_id, nvme_nvm_id);
+out:
+ kfree(nvme_nvm_id);
+ return ret;
+}
+
+static int nvme_nvm_get_features(struct request_queue *q,
+ struct nvm_get_features *gf)
+{
+ struct nvme_ns *ns = q->queuedata;
+ struct nvme_command c = {
+ .common.opcode = nvme_nvm_admin_get_features,
+ .common.nsid = ns->ns_id,
+ };
+ int sz = sizeof(struct nvm_get_features);
+ int ret;
+ u64 *resp;
+
+ resp = kmalloc(sz, GFP_KERNEL);
+ if (!resp)
+ return -ENOMEM;
+
+ ret = nvme_submit_sync_cmd(q, &c, resp, sz);
+ if (ret)
+ goto done;
+
+ gf->rsp = le64_to_cpu(resp[0]);
+ gf->ext = le64_to_cpu(resp[1]);
+
+done:
+ kfree(resp);
+ return ret;
+}
+
+static int nvme_nvm_set_resp(struct request_queue *q, u64 resp)
+{
+ struct nvme_ns *ns = q->queuedata;
+ struct nvme_command c = {
+ .nvm_resp.opcode = nvme_nvm_admin_set_resp,
+ .nvm_resp.nsid = cpu_to_le32(ns->ns_id),
+ .nvm_resp.resp = cpu_to_le64(resp),
+ };
+
+ return nvme_submit_sync_cmd(q, &c, NULL, 0);
+}
+
+static int nvme_nvm_get_l2p_tbl(struct request_queue *q, u64 slba, u64 nlb,
+ nvm_l2p_update_fn *update_l2p, void *priv)
+{
+ struct nvme_ns *ns = q->queuedata;
+ struct nvme_dev *dev = ns->dev;
+ struct nvme_command c = {
+ .nvm_l2p.opcode = nvme_nvm_admin_get_l2p_tbl,
+ .nvm_l2p.nsid = cpu_to_le32(ns->ns_id),
+ };
+ u32 len = queue_max_hw_sectors(q) << 9;
+ u64 nlb_pr_rq = len / sizeof(u64);
+ u64 cmd_slba = slba;
+ void *entries;
+ int ret = 0;
+
+ entries = kmalloc(len, GFP_KERNEL);
+ if (!entries)
+ return -ENOMEM;
+
+ while (nlb) {
+ u64 cmd_nlb = min_t(u64, nlb_pr_rq, nlb);
+
+ c.nvm_l2p.slba = cmd_slba;
+ c.nvm_l2p.nlb = cmd_nlb;
+
+ ret = nvme_submit_sync_cmd(q, &c, entries, len);
+ if (ret) {
+ dev_err(dev->dev, "L2P table transfer failed (%d)\n",
+ ret);
+ ret = -EIO;
+ goto out;
+ }
+
+ if (update_l2p(cmd_slba, cmd_nlb, entries, priv)) {
+ ret = -EINTR;
+ goto out;
+ }
+
+ cmd_slba += cmd_nlb;
+ nlb -= cmd_nlb;
+ }
+
+out:
+ kfree(entries);
+ return ret;
+}
+
+static int nvme_nvm_set_bb_tbl(struct request_queue *q, int lunid,
+ unsigned int nr_blocks, nvm_bb_update_fn *update_bbtbl, void *priv)
+{
+ return 0;
+}
+
+static int nvme_nvm_get_bb_tbl(struct request_queue *q, int lunid,
+ unsigned int nr_blocks, nvm_bb_update_fn *update_bbtbl, void *priv)
+{
+ struct nvme_ns *ns = q->queuedata;
+ struct nvme_dev *dev = ns->dev;
+ struct nvme_command c = {
+ .nvm_get_bb.opcode = nvme_nvm_admin_get_bb_tbl,
+ .nvm_get_bb.nsid = cpu_to_le32(ns->ns_id),
+ .nvm_get_bb.lbb = cpu_to_le32(lunid),
+ };
+ void *bb_bitmap;
+ u16 bb_bitmap_size;
+ int ret = 0;
+
+ bb_bitmap_size = ((nr_blocks >> 15) + 1) * PAGE_SIZE;
+ bb_bitmap = kmalloc(bb_bitmap_size, GFP_KERNEL);
+ if (!bb_bitmap)
+ return -ENOMEM;
+
+ bitmap_zero(bb_bitmap, nr_blocks);
+
+ ret = nvme_submit_sync_cmd(q, &c, bb_bitmap, bb_bitmap_size);
+ if (ret) {
+ dev_err(dev->dev, "get bad block table failed (%d)\n", ret);
+ ret = -EIO;
+ goto out;
+ }
+
+ ret = update_bbtbl(lunid, bb_bitmap, nr_blocks, priv);
+ if (ret) {
+ ret = -EINTR;
+ goto out;
+ }
+
+out:
+ kfree(bb_bitmap);
+ return ret;
+}
+
+int nvme_nvm_prep_internal_rq(struct request *rq, struct nvme_ns *ns,
+ struct nvme_command *c, struct nvme_iod *iod)
+{
+ struct nvm_rq_data *rqdata = &iod->nvm_rqdata;
+ struct nvm_internal_cmd *cmd = rq->special;
+
+ if (!cmd)
+ return 0;
+
+ if (nvm_prep_rq(rq, rqdata))
+ dev_err(ns->dev->dev, "lightnvm: internal cmd failed\n");
+
+ c->nvm_hb_rw.length = cpu_to_le16(
+ (blk_rq_bytes(rq) >> ns->lba_shift) - 1);
+ c->nvm_hb_rw.nsid = cpu_to_le32(ns->ns_id);
+ c->nvm_hb_rw.slba = cpu_to_le64(cmd->phys_lba);
+ c->nvm_hb_rw.phys_addr =
+ cpu_to_le64(nvme_block_nr(ns, rqdata->phys_sector));
+
+ return 0;
+}
+
+static int nvme_nvm_internal_rw(struct request_queue *q,
+ struct nvm_internal_cmd *cmd)
+{
+ struct nvme_command c;
+
+ memset(&c, 0, sizeof(c));
+
+ c.nvm_hb_rw.opcode = (cmd->rw ?
+ nvme_nvm_cmd_hb_write : nvme_nvm_cmd_hb_read);
+
+ return __nvme_submit_sync_cmd(q, &c, cmd->buffer, NULL,
+ cmd->bufflen, NULL, 30, cmd);
+}
+
+static int nvme_nvm_erase_block(struct request_queue *q, sector_t block_id)
+{
+ struct nvme_ns *ns = q->queuedata;
+ struct nvme_command c = {
+ .nvm_erase.opcode = nvme_nvm_cmd_erase,
+ .nvm_erase.nsid = cpu_to_le32(ns->ns_id),
+ .nvm_erase.blk_addr = cpu_to_le64(block_id),
+ };
+
+ return nvme_submit_sync_cmd(q, &c, NULL, 0);
+}
+
+static struct nvm_dev_ops nvme_nvm_dev_ops = {
+ .identify = nvme_nvm_identify,
+ .get_features = nvme_nvm_get_features,
+ .set_responsibility = nvme_nvm_set_resp,
+ .get_l2p_tbl = nvme_nvm_get_l2p_tbl,
+ .set_bb_tbl = nvme_nvm_set_bb_tbl,
+ .get_bb_tbl = nvme_nvm_get_bb_tbl,
+ .internal_rw = nvme_nvm_internal_rw,
+ .erase_block = nvme_nvm_erase_block,
+};
+
+#else
+static struct nvm_dev_ops nvme_nvm_dev_ops;
+
+int nvme_nvm_prep_internal_rq(struct request *rq, struct nvme_ns *ns,
+ struct nvme_command *c, struct nvme_iod *iod)
+{
+ return 0;
+}
+#endif /* CONFIG_NVM */
+
+int nvme_nvm_register(struct request_queue *q, struct gendisk *disk)
+{
+ return nvm_register(q, disk, &nvme_nvm_dev_ops);
+}
+
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index fce2090..8fbc7bf 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -19,6 +19,7 @@
#include <linux/pci.h>
#include <linux/kref.h>
#include <linux/blk-mq.h>
+#include <linux/lightnvm.h>

struct nvme_bar {
__u64 cap; /* Controller Capabilities */
@@ -39,10 +40,12 @@ struct nvme_bar {
#define NVME_CAP_STRIDE(cap) (((cap) >> 32) & 0xf)
#define NVME_CAP_MPSMIN(cap) (((cap) >> 48) & 0xf)
#define NVME_CAP_MPSMAX(cap) (((cap) >> 52) & 0xf)
+#define NVME_CAP_LIGHTNVM(cap) (((cap) >> 38) & 0x1)

enum {
NVME_CC_ENABLE = 1 << 0,
NVME_CC_CSS_NVM = 0 << 4,
+ NVME_CC_CSS_LIGHTNVM = 1 << 4,
NVME_CC_MPS_SHIFT = 7,
NVME_CC_ARB_RR = 0 << 11,
NVME_CC_ARB_WRRU = 1 << 11,
@@ -120,6 +123,7 @@ struct nvme_ns {
u16 ms;
bool ext;
u8 pi_type;
+ int type;
u64 mode_select_num_blocks;
u32 mode_select_block_len;
};
@@ -137,6 +141,7 @@ struct nvme_iod {
int nents; /* Used in scatterlist */
int length; /* Of data, in bytes */
dma_addr_t first_dma;
+ struct nvm_rq_data nvm_rqdata; /* Physical sectors description of the I/O */
struct scatterlist meta_sg[1]; /* metadata requires single contiguous buffer */
struct scatterlist sg[0];
};
@@ -166,4 +171,8 @@ int nvme_sg_io(struct nvme_ns *ns, struct sg_io_hdr __user *u_hdr);
int nvme_sg_io32(struct nvme_ns *ns, unsigned long arg);
int nvme_sg_get_version_num(int __user *ip);

+int nvme_nvm_register(struct request_queue *q, struct gendisk *disk);
+int nvme_nvm_prep_internal_rq(struct request *rq, struct nvme_ns *ns,
+ struct nvme_command *c, struct nvme_iod *iod);
+
#endif /* _LINUX_NVME_H */
diff --git a/include/uapi/linux/nvme.h b/include/uapi/linux/nvme.h
index aef9a81..8adf845 100644
--- a/include/uapi/linux/nvme.h
+++ b/include/uapi/linux/nvme.h
@@ -85,6 +85,35 @@ struct nvme_id_ctrl {
__u8 vs[1024];
};

+struct nvme_nvm_id_chnl {
+ __le64 laddr_begin;
+ __le64 laddr_end;
+ __le32 oob_size;
+ __le32 queue_size;
+ __le32 gran_read;
+ __le32 gran_write;
+ __le32 gran_erase;
+ __le32 t_r;
+ __le32 t_sqr;
+ __le32 t_w;
+ __le32 t_sqw;
+ __le32 t_e;
+ __le16 chnl_parallelism;
+ __u8 io_sched;
+ __u8 reserved[133];
+} __attribute__((packed));
+
+struct nvme_nvm_id {
+ __u8 ver_id;
+ __u8 nvm_type;
+ __le16 nchannels;
+ __u8 reserved[252];
+ struct nvme_nvm_id_chnl chnls[];
+} __attribute__((packed));
+
+#define NVME_NVM_CHNLS_PR_REQ ((4096U - sizeof(struct nvme_nvm_id)) \
+ / sizeof(struct nvme_nvm_id_chnl))
+
enum {
NVME_CTRL_ONCS_COMPARE = 1 << 0,
NVME_CTRL_ONCS_WRITE_UNCORRECTABLE = 1 << 1,
@@ -130,6 +159,7 @@ struct nvme_id_ns {

enum {
NVME_NS_FEAT_THIN = 1 << 0,
+ NVME_NS_FEAT_NVM = 1 << 3,
NVME_NS_FLBAS_LBA_MASK = 0xf,
NVME_NS_FLBAS_META_EXT = 0x10,
NVME_LBAF_RP_BEST = 0,
@@ -146,6 +176,8 @@ enum {
NVME_NS_DPS_PI_TYPE1 = 1,
NVME_NS_DPS_PI_TYPE2 = 2,
NVME_NS_DPS_PI_TYPE3 = 3,
+
+ NVME_NS_NVM = 1,
};

struct nvme_smart_log {
@@ -229,6 +261,12 @@ enum nvme_opcode {
nvme_cmd_resv_report = 0x0e,
nvme_cmd_resv_acquire = 0x11,
nvme_cmd_resv_release = 0x15,
+
+ nvme_nvm_cmd_hb_write = 0x81,
+ nvme_nvm_cmd_hb_read = 0x02,
+ nvme_nvm_cmd_phys_write = 0x91,
+ nvme_nvm_cmd_phys_read = 0x92,
+ nvme_nvm_cmd_erase = 0x90,
};

struct nvme_common_command {
@@ -261,6 +299,73 @@ struct nvme_rw_command {
__le16 appmask;
};

+struct nvme_nvm_hb_rw {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd2;
+ __le64 metadata;
+ __le64 prp1;
+ __le64 prp2;
+ __le64 slba;
+ __le16 length;
+ __le16 control;
+ __le32 dsmgmt;
+ __le64 phys_addr;
+};
+
+struct nvme_nvm_l2ptbl {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __le32 cdw2[4];
+ __le64 prp1;
+ __le64 prp2;
+ __le64 slba;
+ __le32 nlb;
+ __le16 cdw14[6];
+};
+
+struct nvme_nvm_bbtbl {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd[2];
+ __le64 prp1;
+ __le64 prp2;
+ __le32 prp1_len;
+ __le32 prp2_len;
+ __le32 lbb;
+ __u32 rsvd11[3];
+};
+
+struct nvme_nvm_set_resp {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd[2];
+ __le64 prp1;
+ __le64 prp2;
+ __le64 resp;
+ __u32 rsvd11[4];
+};
+
+struct nvme_nvm_erase_blk {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd[2];
+ __le64 prp1;
+ __le64 prp2;
+ __le64 blk_addr;
+ __u32 rsvd11[4];
+};
+
enum {
NVME_RW_LR = 1 << 15,
NVME_RW_FUA = 1 << 14,
@@ -328,6 +433,13 @@ enum nvme_admin_opcode {
nvme_admin_format_nvm = 0x80,
nvme_admin_security_send = 0x81,
nvme_admin_security_recv = 0x82,
+
+ nvme_nvm_admin_identify = 0xe2,
+ nvme_nvm_admin_get_features = 0xe6,
+ nvme_nvm_admin_set_resp = 0xe5,
+ nvme_nvm_admin_get_l2p_tbl = 0xea,
+ nvme_nvm_admin_get_bb_tbl = 0xf2,
+ nvme_nvm_admin_set_bb_tbl = 0xf1,
};

enum {
@@ -457,6 +569,18 @@ struct nvme_format_cmd {
__u32 rsvd11[5];
};

+struct nvme_nvm_identify {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __u64 rsvd[2];
+ __le64 prp1;
+ __le64 prp2;
+ __le32 chnl_off;
+ __u32 rsvd11[5];
+};
+
struct nvme_command {
union {
struct nvme_common_command common;
@@ -470,6 +594,13 @@ struct nvme_command {
struct nvme_format_cmd format;
struct nvme_dsm_cmd dsm;
struct nvme_abort_cmd abort;
+ struct nvme_nvm_identify nvm_identify;
+ struct nvme_nvm_hb_rw nvm_hb_rw;
+ struct nvme_nvm_l2ptbl nvm_l2p;
+ struct nvme_nvm_bbtbl nvm_get_bb;
+ struct nvme_nvm_bbtbl nvm_set_bb;
+ struct nvme_nvm_set_resp nvm_resp;
+ struct nvme_nvm_erase_blk nvm_erase;
};
};

--
2.1.4

2015-06-05 18:17:14

by Matias Bjørling

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] bio: Introduce LightNVM payload

> -
> +#if defined(CONFIG_NVM)
> + struct bio_nvm_payload *bi_nvm; /* open-channel ssd backend */
> +#endif
> unsigned short bi_vcnt; /* how many bio_vec's */
>

Jens suggests this be implemented using a bio clone. Will do in the next
refresh.
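
For reference, a rough and untested sketch of what the clone-based remap
in rrpc_make_rq could look like (rrpc_end_io is a hypothetical completion
hook; mempool-backed allocation and error handling are left out):

	struct bio *clone;

	/* clone the bio instead of hanging a payload off the original */
	clone = bio_clone(bio, GFP_NOIO);
	if (!clone) {
		bio_io_error(bio);
		return;
	}

	clone->bi_bdev = rrpc->q_bdev;
	clone->bi_private = bio;	/* complete the original from endio */
	clone->bi_end_io = rrpc_end_io;	/* hypothetical completion hook */

	generic_make_request(clone);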

2015-06-08 14:49:04

by Stephen Bates

[permalink] [raw]
Subject: RE: [PATCH v4 0/8] Support for Open-Channel SSDs

Hi

I have tested this patchset using fio and a simple script that is available at the URL below. Note testing was performed in a QEMU environment and not on real lightnvm block IO devices.

https://github.com/OpenChannelSSD/lightnvm-hw

(see the sanity.sh script in the sanity sub-folder).

Tested-by: Stephen Bates <[email protected]>

Cheers

Stephen


2015-06-09 07:32:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 2/8] nvme: don't overwrite req->cmd_flags on sync cmd

On Fri, Jun 05, 2015 at 02:54:24PM +0200, Matias Bjørling wrote:
> In __nvme_submit_sync_cmd, the request direction is overwritten when
> the REQ_FAILFAST_DRIVER flag is set.

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

This is a fix for a last-minute regression and needs to go into the
drivers/for-4.2 tree ASAP.

2015-06-09 07:46:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 0/8] Support for Open-Channel SSDs

Hi Matias,

I've been looking over this and I really think it needs a fundamental
rearchitecture still. The design of using a separate stacking
block device and all kinds of private hooks does not look very
maintainable.

Here is my counter suggestion:

- the stacking block device goes away
- the nvm_target_type make_rq and prep_rq callbacks are combined
into one and called from the nvme/null_blk ->queue_rq method
early on to prepare the FTL state. The drivers that are LightNVM
enabled reserve a pointer to it in their per-request data, on which
the unprep_rq callback is then called during I/O completion (see
the sketch below).
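
A minimal sketch of that pairing; all names below are hypothetical and
only illustrate the prep-on-queue_rq / unprep-on-completion flow, not an
actual interface:

#include <linux/blkdev.h>

struct nvm_dev;

struct nvm_target_type {
        /* combined make_rq/prep_rq: set up FTL state before the command
         * is issued; returns a per-request cookie */
        void *(*prep_rq)(struct nvm_dev *dev, struct request *rq);
        /* completion hook, handed the cookie returned by prep_rq */
        void (*unprep_rq)(struct nvm_dev *dev, struct request *rq,
                          void *cookie);
};

/* called early from the nvme/null_blk ->queue_rq path; the driver keeps
 * the cookie in its per-request data */
static int lnvm_prep(struct nvm_dev *dev, struct nvm_target_type *tt,
                     struct request *rq, void **cookie)
{
        *cookie = tt->prep_rq(dev, rq);
        return *cookie ? 0 : -EIO;
}

/* called from the driver's completion path */
static void lnvm_unprep(struct nvm_dev *dev, struct nvm_target_type *tt,
                        struct request *rq, void *cookie)
{
        tt->unprep_rq(dev, rq, cookie);
}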

2015-06-10 18:13:07

by Matias Bjørling

[permalink] [raw]
Subject: Re: [PATCH v4 0/8] Support for Open-Channel SSDs

On 06/09/2015 09:46 AM, Christoph Hellwig wrote:
> Hi Matias,
>
> I've been looking over this and I really think it needs a fundamental
> rearchitecture still. The design of using a separate stacking
> block device and all kinds of private hooks does not look very
> maintainable.
>
> Here is my counter suggestion:
>
> - the stacking block device goes away
> - the nvm_target_type make_rq and prep_rq callbacks are combined
> into one and called from the nvme/null_blk ->queue_rq method
> early on to prepare the FTL state. The drivers that are LightNVM
> enabled reserve a pointer to it in their per request data, which
> the unprep_rq callback is called on during I/O completion.
>

I would agree with this if a common FTL were the only thing being
implemented. That is maybe where we start, but what I really want to
enable are these two use cases:

1. A get/put flash block API that user-space applications can use.
That will enable application-driven FTLs. E.g. RocksDB can be integrated
tightly with the SSD, allowing data placement and garbage collection to
be strictly controlled. Data placement will reduce the need for
over-provisioning, as data that age at the same time are placed in the
same flash block, and garbage collection can be scheduled to not
interfere with user requests. Together, it will remove I/O outliers
significantly. (A rough sketch of the interface follows after point 2.)

2. Large drive arrays with a global FTL. The stacking block device model
enables this. It allows an FTL to span multiple devices, and thus
perform data placement and garbage collection over tens to hundreds of
devices. That will greatly improve wear-leveling, since with more flash
there is a much higher probability of finding a fully inactive block.
Additionally, as the parallelism grows within the storage array, we can
slice and dice the devices using the get/put flash block API and enable
applications to get predictable performance, while using large arrays
that have a single address space.
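
For illustration, a rough sketch of what the kernel-side half of the
get/put block interface in (1) could look like -- the names and
signatures are hypothetical, not what is in this series:

#include <linux/types.h>

struct nvm_dev;
struct nvm_lun;                         /* one parallel unit (flash chip/die) */

/* hypothetical handle to one physical flash block */
struct nvm_block {
        struct nvm_lun  *lun;           /* parallel unit owning the block */
        sector_t        phys_addr;      /* first sector of the block */
};

/* claim a free flash block from a lun; returns NULL if none are free */
struct nvm_block *nvm_get_blk(struct nvm_dev *dev, struct nvm_lun *lun);

/* hand a block back once its data has been invalidated, so the core can
 * erase and reuse it */
void nvm_put_blk(struct nvm_dev *dev, struct nvm_block *blk);

An application-driven FTL (e.g. the RocksDB integration) would claim
blocks, place data that ages together in the same block, and put the
block back once that data has been dropped.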

If it is too much for now to get upstream, I can live with (2) being
removed, and then I will make the changes you proposed.

What do you think?

Thanks
-Matias

2015-06-11 10:29:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 0/8] Support for Open-Channel SSDs

On Wed, Jun 10, 2015 at 08:11:42PM +0200, Matias Bjorling wrote:
> 1. A get/put flash block API, that user-space applications can use.
> That will enable application-driven FTLs. E.g. RocksDB can be integrated
> tightly with the SSD. Allowing data placement and garbage collection to
> be strictly controlled. Data placement will reduce the need for
> over-provisioning, as data that age at the same time are placed in the
> same flash block, and garbage collection can be scheduled to not
> interfere with user requests. Together, it will remove I/O outliers
> significantly.
>
> 2. Large drive arrays with global FTL. The stacking block device model
> enables this. It allows an FTL to span multiple devices, and thus
> perform data placement and garbage collection over tens to hundreds of
> devices. That'll greatly improve wear-leveling, as there is a much
> higher probability of a fully inactive block with more flash.
> Additionally, as the parallelism grows within the storage array, we can
> slice and dice the devices using the get/put flash block API and enable
> applications to get predictable performance, while using large arrays
> that have a single address space.
>
> If it is too much for now to get upstream, I can live with (2) removed and
> then I make the changes you proposed.

In this case your driver API really isn't the Linux block API
anymore. I think the right API is a simple asynchronous submit with
callback into the driver, with the block device only provided by
the lightnvm layer.

Note that for NVMe it might still make sense to implement this using
blk-mq and a struct request, but those should be internal similar to
how NVMe implements admin commands.
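
For nvme that could look roughly like the sketch below, which allocates
an internal request the same way the sync/admin command path does.
nvme_nvm_submit_internal is a hypothetical name, command setup and error
handling are trimmed, and the blk-mq calls assume the signatures in this
tree:

#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/err.h>
#include <linux/nvme.h>

static int nvme_nvm_submit_internal(struct request_queue *q,
                                    struct nvme_command *cmd,
                                    rq_end_io_fn *done, void *priv)
{
        struct request *rq;

        /* internal request, never visible as a regular block device I/O */
        rq = blk_mq_alloc_request(q, WRITE, GFP_KERNEL, false);
        if (IS_ERR(rq))
                return PTR_ERR(rq);

        rq->cmd_type = REQ_TYPE_DRV_PRIV;
        rq->special = cmd;              /* picked up in ->queue_rq */
        rq->end_io_data = priv;

        blk_execute_rq_nowait(q, NULL, rq, 0, done);
        return 0;
}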

2015-06-13 16:17:17

by Matias Bjørling

[permalink] [raw]
Subject: Re: [PATCH v4 0/8] Support for Open-Channel SSDs

On 06/11/2015 12:29 PM, Christoph Hellwig wrote:
> On Wed, Jun 10, 2015 at 08:11:42PM +0200, Matias Bjorling wrote:
>> 1. A get/put flash block API, that user-space applications can use.
>> That will enable application-driven FTLs. E.g. RocksDB can be integrated
>> tightly with the SSD. Allowing data placement and garbage collection to
>> be strictly controlled. Data placement will reduce the need for
>> over-provisioning, as data that age at the same time are placed in the
>> same flash block, and garbage collection can be scheduled to not
>> interfere with user requests. Together, it will remove I/O outliers
>> significantly.
>>
>> 2. Large drive arrays with global FTL. The stacking block device model
>> enables this. It allows an FTL to span multiple devices, and thus
>> perform data placement and garbage collection over tens to hundreds of
>> devices. That'll greatly improve wear-leveling, as there is a much
>> higher probability of a fully inactive block with more flash.
>> Additionally, as the parallelism grows within the storage array, we can
>> slice and dice the devices using the get/put flash block API and enable
>> applications to get predictable performance, while using large arrays
>> that have a single address space.
>>
>> If it is too much for now to get upstream, I can live with (2) removed and
>> then I make the changes you proposed.
>
> In this case your driver API really isn't the Linux block API
> anymore. I think the right API is a simple asynchronous submit with
> callback into the driver, with the block device only provided by
> the lightnvm layer.

Agree. A group is working on a RocksDB prototype at the moment. When
that is done, such an interface will be polished and submitted for
review. The first patches here lay the groundwork for block I/O
FTLs and a generic flash block interface.

>
> Note that for NVMe it might still make sense to implement this using
> blk-mq and a struct request, but those should be internal similar to
> how NVMe implements admin commands.

How about handling I/O merges? In the case where a block API is exposed
with a global FTL, filesystems rely on I/O merges for improving
performance. If using internal commands, merging has to be implemented in
the lightnvm stack itself; I would rather use blk-mq and not duplicate
the effort. I've kept the stacking model so that I/Os go through the
queue I/O path and are then picked up in the device driver.



2015-06-17 13:59:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 0/8] Support for Open-Channel SSDs

On Sat, Jun 13, 2015 at 06:17:11PM +0200, Matias Bjorling wrote:
> > Note that for NVMe it might still make sense to implement this using
> > blk-mq and a struct request, but those should be internal similar to
> > how NVMe implements admin commands.
>
> How about handling I/O merges? In the case where a block API is exposed
> with a global FTL, filesystems rely on I/O merges for improving
> performance. If using internal commands, merging has to be implemented in
> the lightnvm stack itself, I rather want to use blk-mq and not duplicate
> the effort. I've kept the stacking model, so that I/Os go through the
> queue I/O path and then picked up in the device driver.

I don't think the current abuses of the block API are acceptable though.
The crazy deep merging shouldn't be too relevant for SSD-type devices
so I think you'd do better than trying to reuse the TYPE_FS level
blk-mq merging code. If you want to reuse the request
allocation/submission code that's still doable.

As a start, add a new submit_io method to the nvm_dev_ops, and add
an implementation similar to pscsi_execute_cmd in
drivers/target/target_core_pscsi.c for nvme, and a trivial no-op
for a null-nvm driver replacing the null_blk additions. This
will give you behavior very similar to your current code, while
allowing you to drop all the hacks in the block code. Note that simple
plugging will work just fine, which should be all you'll need.
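
Something along the following lines -- the nvm_dev_ops layout and the
nvm_end_io completion helper are illustrative only, not a final
interface:

#include <linux/blkdev.h>

struct nvm_rq;                          /* lightnvm-internal I/O descriptor */

struct nvm_dev_ops {
        /* ... identify, bad block table accessors, etc. ... */
        int (*submit_io)(struct request_queue *q, struct nvm_rq *rqd);
};

/* hypothetical completion helper exported by the lightnvm core */
void nvm_end_io(struct nvm_rq *rqd, int error);

/* trivial no-op backend in the spirit of a null-nvm driver: complete
 * immediately, there is no real media behind it */
static int null_nvm_submit_io(struct request_queue *q, struct nvm_rq *rqd)
{
        nvm_end_io(rqd, 0);
        return 0;
}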

2015-06-17 18:04:18

by Matias Bjørling

[permalink] [raw]
Subject: Re: [PATCH v4 0/8] Support for Open-Channel SSDs

> I don't think the current abuses of the block API are acceptable though.
> The crazy deep merging shouldn't be too relevant for SSD-type devices
> so I think you'd do better than trying to reuse the TYPE_FS level
> blk-mq merging code. If you want to reuse the request
> allocation/submission code that's still doable.
>
> As a start add a new submit_io method to the nvm_dev_ops, and add
> an implementation similar to pscsi_execute_cmd in
> drivers/target/target_core_pscsi.c for nvme, and a trivial no op
> for a null-nvm driver replacing the null-blk additions. This
> will give you very similar behavior to your current code, while
> allowing to drop all the hacks in the block code. Note that simple
> plugging will work just fine which should be all you'll need.
>

Thanks, I appreciate you taking the time to go through it. I'll respin
the patches and remove the block hacks.

2015-07-06 13:16:45

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] bio: Introduce LightNVM payload

On Fri 2015-06-05 20:17:02, Matias Bjorling wrote:
> >-
> >+#if defined(CONFIG_NVM)
> >+ struct bio_nvm_payload *bi_nvm; /* open-channel ssd backend */
> >+#endif
> > unsigned short bi_vcnt; /* how many bio_vec's */
> >
>
> Jens suggests this be implemented using a bio clone. Will do in the next
> refresh.

And I'd spell OpenChannel SSD the way you do in changelogs... for
consistency.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-07-16 12:23:52

by Matias Bjørling

[permalink] [raw]
Subject: Re: [PATCH v4 0/8] Support for Open-Channel SSDs

> As a start add a new submit_io method to the nvm_dev_ops, and add
> an implementation similar to pscsi_execute_cmd in
> drivers/target/target_core_pscsi.c for nvme, and a trivial no op
> for a null-nvm driver replacing the null-blk additions. This
> will give you very similar behavior to your current code, while
> allowing to drop all the hacks in the block code. Note that simple
> plugging will work just fine which should be all you'll need.
>

A quick question. The flow is getting into place and it is looking good.

However, the code path is still left with a per-device flash block
management core data structure in gendisk->nvm. ->nvm holds the device
configuration (number of flash chips, channels, flash page sizes, etc.),
free/used blocks on the media, and other small structures, basically
keeping track of the state of the blocks on the media.
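
Roughly, the structure hanging off gendisk->nvm has this shape (field
names below are illustrative, not necessarily the exact ones in the
series):

struct nvm_dev_ops;
struct request_queue;
struct nvm_lun;                         /* per-lun free/used/bad block lists */

struct nvm_dev {
        struct nvm_dev_ops      *ops;   /* driver hooks (identify, I/O, ...) */
        struct request_queue    *q;     /* queue of the underlying device */

        /* geometry reported by the device */
        int                     nr_chnls;       /* channels */
        int                     luns_per_chnl;  /* flash chips per channel */
        int                     blks_per_lun;
        int                     pgs_per_blk;
        int                     fpg_size;       /* flash page size in bytes */

        /* media state: tracks which blocks are free, in use or bad */
        struct nvm_lun          *luns;
};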

It is nice to have it associated with the gendisk, as it can then easily
be accessed from lightnvm code without knowing which device driver is
underneath.

If moving it outside the gendisk, one approach would be to create a
separate block device for each initialized open-channel SSD. E.g.
/dev/nvme0n1 would have its block management information exposed through
/dev/lnvm/nvme0n1_bm. For each *_bm, the private field holds a map
between the request_queue and the bm, effectively using a gendisk to act
as a link between the real device and any FTL target. This seems just as
hacky as the gendisk approach.

Any other approaches, or is gendisk good for now?

Thanks, Matias



2015-07-16 12:46:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 0/8] Support for Open-Channel SSDs

Hi Matias,

the underlying lightnvm driver (nvme or NULL) shouldn't register
a gendisk - the only gendisk you'll need is the one for the block
device that sits on top of lightnvm.
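
A minimal sketch of that layering, assuming the target on top uses a
bio-based (make_request) queue; the struct names and the
lnvm_tgt_map_and_submit helper are hypothetical:

#include <linux/blkdev.h>
#include <linux/genhd.h>
#include <linux/module.h>

struct nvm_dev;

struct lnvm_tgt {
        struct nvm_dev          *dev;           /* underlying open-channel SSD */
        sector_t                nr_sects;       /* exposed capacity */
};

/* hypothetical: map the bio through the FTL and submit it through the
 * underlying driver's submit_io hook */
void lnvm_tgt_map_and_submit(struct lnvm_tgt *tgt, struct bio *bio);

static const struct block_device_operations lnvm_tgt_fops = {
        .owner  = THIS_MODULE,
};

/* bio entry point of the target */
static void lnvm_tgt_make_rq(struct request_queue *q, struct bio *bio)
{
        struct lnvm_tgt *tgt = q->queuedata;

        lnvm_tgt_map_and_submit(tgt, bio);
}

/* the only gendisk in the stack belongs to the target on top */
static int lnvm_tgt_register(struct lnvm_tgt *tgt, const char *name, int major)
{
        struct request_queue *q;
        struct gendisk *disk;

        q = blk_alloc_queue(GFP_KERNEL);
        if (!q)
                return -ENOMEM;
        blk_queue_make_request(q, lnvm_tgt_make_rq);
        q->queuedata = tgt;

        disk = alloc_disk(1);
        if (!disk) {
                blk_cleanup_queue(q);
                return -ENOMEM;
        }

        disk->major = major;
        disk->first_minor = 0;
        disk->fops = &lnvm_tgt_fops;
        disk->queue = q;
        disk->private_data = tgt;
        snprintf(disk->disk_name, DISK_NAME_LEN, "%s", name);
        set_capacity(disk, tgt->nr_sects);
        add_disk(disk);
        return 0;
}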

2015-07-16 13:07:06

by Matias Bjørling

[permalink] [raw]
Subject: Re: [PATCH v4 0/8] Support for Open-Channel SSDs

On 07/16/2015 02:46 PM, Christoph Hellwig wrote:
> Hi Matias,
>
> the underlying lightnvm driver (nvme or NULL) shouldn't register
> a gendisk - the only gendisk you'll need is that for the block
> device that sits on top of lightnvm.
>

That could work as well. I'll refactor the nvme/null drivers to allow
that instead. Thanks