2019-05-12 15:56:04

by Akinobu Mita

Subject: [PATCH v3 0/7] nvme-pci: support device coredump

This enables collecting a snapshot of controller information via the device
coredump mechanism. The nvme device coredump is triggered when a command
timeout occurs, and can also be triggered by writing to a sysfs attribute.

After finishing the nvme device coredump, the following files are created.

- regs: NVMe controller registers (00h to 4Fh)
- sq<qid>: Submission queue
- cq<qid>: Completion queue
- telemetry-ctrl-log: Telemetry controller-initiated log (if available)
- data: Empty

The device coredump mechanism currently allows drivers to create only a
single coredump file, so this series also provides a new function that lets
drivers create several device coredump files for one crashed device.

* v3
- Merge 'add telemetry log page definisions' patch and 'add facility to
check log page attributes' patch
- Copy struct nvme_telemetry_log_page_hdr from the latest nvme-cli
- Add BUILD_BUG_ON for the size of struct nvme_telemetry_log_page_hdr
- Fix typo s/machanism/mechanism/ in commit log
- Fix max transfer size calculation for get log page
- Add function comments
- Extract 'enable to trigger device coredump by hand' patch
- Don't try to get telemetry log when admin queue is not available
- Avoid deadlock in .coredump callback

* v2
- Add Reviewed-by tag.
- Add patch to fix typo in comment
- Remove unneeded braces.
- Allocate device_entry followed by an array of devcd_file elements.
- Add telemetry log page definisions
- Add facility to check log page attributes
- Exclude the doorbell registers from register dump.
- Save controller registers in a binary format instead of a text format.
- Create an empty 'data' file in the device coredump.
- Save telemetry controller-initiated log if available
- Make coredump procedure into two phases (before resetting controller and
after resetting as soon as admin queue is available).

Akinobu Mita (7):
devcoredump: use memory_read_from_buffer
devcoredump: fix typo in comment
devcoredump: allow to create several coredump files in one device
nvme: add basic facility to get telemetry log page
nvme-pci: add device coredump infrastructure
nvme-pci: trigger device coredump on command timeout
nvme-pci: enable to trigger device coredump by hand

drivers/base/devcoredump.c | 168 +++++++++------
drivers/nvme/host/Kconfig | 1 +
drivers/nvme/host/core.c | 3 +
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/pci.c | 494 ++++++++++++++++++++++++++++++++++++++++++--
include/linux/devcoredump.h | 33 +++
include/linux/nvme.h | 17 ++
7 files changed, 644 insertions(+), 73 deletions(-)

Cc: Johannes Berg <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Minwoo Im <[email protected]>
Cc: Kenneth Heitke <[email protected]>
--
2.7.4


2019-05-12 15:56:04

by Akinobu Mita

Subject: [PATCH v3 3/7] devcoredump: allow to create several coredump files in one device

The device coredump mechanism currently allows drivers to create only a
single coredump file. If there are several binary blobs to dump, we need
to define a binary format or convert them to a text format in order to put
them into a single coredump file.

This provides a new function that allows drivers to create several device
coredump files for one crashed device.
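
For illustration, a driver with two binary blobs to expose could use the new
interface roughly as follows. This is hypothetical driver code, only to show
the calling convention; 'foo', its buffers, and the file names are made up:

static ssize_t foo_dump_read(char *buffer, loff_t offset, size_t count,
			     void *data, size_t datalen)
{
	return memory_read_from_buffer(buffer, count, &offset, data, datalen);
}

static void foo_dump_free(void *data)
{
	kfree(data);
}

static void foo_coredump(struct foo_dev *foo)
{
	struct dev_coredumpm_bulk_data files[] = {
		{
			.name = "regs",
			.data = foo->reg_snapshot,
			.datalen = foo->reg_snapshot_len,
			.read = foo_dump_read,
			.free = foo_dump_free,
		},
		{
			.name = "fw-trace",
			.data = foo->fw_trace,
			.datalen = foo->fw_trace_len,
			.read = foo_dump_read,
			.free = foo_dump_free,
		},
	};

	dev_coredumpm_bulk(foo->dev, THIS_MODULE, GFP_KERNEL,
			   files, ARRAY_SIZE(files));
}

Each entry shows up as a separate binary file in the allocated
/sys/class/devcoredump/devcdN/ directory, and the framework calls each
free() callback when the dump is discarded or times out.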

Cc: Johannes Berg <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Minwoo Im <[email protected]>
Cc: Kenneth Heitke <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
---
* v3
- No change since v2

drivers/base/devcoredump.c | 155 ++++++++++++++++++++++++++++++--------------
include/linux/devcoredump.h | 33 ++++++++++
2 files changed, 139 insertions(+), 49 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index e42d0b5..4dd6dba 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -25,16 +25,20 @@ static bool devcd_disabled;
/* if data isn't read by userspace after 5 minutes then delete it */
#define DEVCD_TIMEOUT (HZ * 60 * 5)

-struct devcd_entry {
- struct device devcd_dev;
- void *data;
- size_t datalen;
- struct module *owner;
+struct devcd_file {
+ struct bin_attribute bin_attr;
ssize_t (*read)(char *buffer, loff_t offset, size_t count,
void *data, size_t datalen);
void (*free)(void *data);
+};
+
+struct devcd_entry {
+ struct device devcd_dev;
+ struct module *owner;
struct delayed_work del_wk;
struct device *failing_dev;
+ int num_files;
+ struct devcd_file files[];
};

static struct devcd_entry *dev_to_devcd(struct device *dev)
@@ -45,8 +49,14 @@ static struct devcd_entry *dev_to_devcd(struct device *dev)
static void devcd_dev_release(struct device *dev)
{
struct devcd_entry *devcd = dev_to_devcd(dev);
+ int i;
+
+ for (i = 0; i < devcd->num_files; i++) {
+ struct devcd_file *file = &devcd->files[i];
+
+ file->free(file->bin_attr.private);
+ }

- devcd->free(devcd->data);
module_put(devcd->owner);

/*
@@ -64,9 +74,14 @@ static void devcd_dev_release(struct device *dev)
static void devcd_del(struct work_struct *wk)
{
struct devcd_entry *devcd;
+ int i;

devcd = container_of(wk, struct devcd_entry, del_wk.work);

+ for (i = 0; i < devcd->num_files; i++)
+ device_remove_bin_file(&devcd->devcd_dev,
+ &devcd->files[i].bin_attr);
+
device_del(&devcd->devcd_dev);
put_device(&devcd->devcd_dev);
}
@@ -75,10 +90,11 @@ static ssize_t devcd_data_read(struct file *filp, struct kobject *kobj,
struct bin_attribute *bin_attr,
char *buffer, loff_t offset, size_t count)
{
- struct device *dev = kobj_to_dev(kobj);
- struct devcd_entry *devcd = dev_to_devcd(dev);
+ struct devcd_file *file =
+ container_of(bin_attr, struct devcd_file, bin_attr);

- return devcd->read(buffer, offset, count, devcd->data, devcd->datalen);
+ return file->read(buffer, offset, count, bin_attr->private,
+ bin_attr->size);
}

static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj,
@@ -93,25 +109,6 @@ static ssize_t devcd_data_write(struct file *filp, struct kobject *kobj,
return count;
}

-static struct bin_attribute devcd_attr_data = {
- .attr = { .name = "data", .mode = S_IRUSR | S_IWUSR, },
- .size = 0,
- .read = devcd_data_read,
- .write = devcd_data_write,
-};
-
-static struct bin_attribute *devcd_dev_bin_attrs[] = {
- &devcd_attr_data, NULL,
-};
-
-static const struct attribute_group devcd_dev_group = {
- .bin_attrs = devcd_dev_bin_attrs,
-};
-
-static const struct attribute_group *devcd_dev_groups[] = {
- &devcd_dev_group, NULL,
-};
-
static int devcd_free(struct device *dev, void *data)
{
struct devcd_entry *devcd = dev_to_devcd(dev);
@@ -157,7 +154,6 @@ static struct class devcd_class = {
.name = "devcoredump",
.owner = THIS_MODULE,
.dev_release = devcd_dev_release,
- .dev_groups = devcd_dev_groups,
.class_groups = devcd_class_groups,
};

@@ -234,30 +230,55 @@ static ssize_t devcd_read_from_sgtable(char *buffer, loff_t offset,
offset);
}

+static struct devcd_entry *devcd_alloc(struct dev_coredumpm_bulk_data *files,
+ int num_files, gfp_t gfp)
+{
+ struct devcd_entry *devcd;
+ int i;
+
+ devcd = kzalloc(struct_size(devcd, files, num_files), gfp);
+ if (!devcd)
+ return NULL;
+
+ devcd->num_files = num_files;
+
+ for (i = 0; i < devcd->num_files; i++) {
+ struct devcd_file *file = &devcd->files[i];
+
+ sysfs_bin_attr_init(&file->bin_attr);
+ file->bin_attr.attr.name = files[i].name;
+
+ file->bin_attr.attr.mode = 0600;
+ file->bin_attr.size = files[i].datalen;
+ file->bin_attr.private = files[i].data;
+ file->bin_attr.read = devcd_data_read;
+ file->bin_attr.write = devcd_data_write;
+
+ file->read = files[i].read;
+ file->free = files[i].free;
+ }
+
+ return devcd;
+}
+
/**
- * dev_coredumpm - create device coredump with read/free methods
+ * dev_coredumpm_bulk - create a number of device coredump files
* @dev: the struct device for the crashed device
* @owner: the module that contains the read/free functions, use %THIS_MODULE
- * @data: data cookie for the @read/@free functions
- * @datalen: length of the data
* @gfp: allocation flags
- * @read: function to read from the given buffer
- * @free: function to free the given buffer
+ * @files: the configuration of device coredump files
+ * @num_files: the number of device coredump files to create
*
- * Creates a new device coredump for the given device. If a previous one hasn't
- * been read yet, the new coredump is discarded. The data lifetime is determined
- * by the device coredump framework and when it is no longer needed the @free
- * function will be called to free the data.
+ * This function allows drivers to create several device coredump files in
+ * one crashed device.
*/
-void dev_coredumpm(struct device *dev, struct module *owner,
- void *data, size_t datalen, gfp_t gfp,
- ssize_t (*read)(char *buffer, loff_t offset, size_t count,
- void *data, size_t datalen),
- void (*free)(void *data))
+void dev_coredumpm_bulk(struct device *dev, struct module *owner, gfp_t gfp,
+ struct dev_coredumpm_bulk_data *files, int num_files)
{
static atomic_t devcd_count = ATOMIC_INIT(0);
struct devcd_entry *devcd;
struct device *existing;
+ int i;

if (devcd_disabled)
goto free;
@@ -272,15 +293,11 @@ void dev_coredumpm(struct device *dev, struct module *owner,
if (!try_module_get(owner))
goto free;

- devcd = kzalloc(sizeof(*devcd), gfp);
+ devcd = devcd_alloc(files, num_files, gfp);
if (!devcd)
goto put_module;

devcd->owner = owner;
- devcd->data = data;
- devcd->datalen = datalen;
- devcd->read = read;
- devcd->free = free;
devcd->failing_dev = get_device(dev);

device_initialize(&devcd->devcd_dev);
@@ -292,6 +309,12 @@ void dev_coredumpm(struct device *dev, struct module *owner,
if (device_add(&devcd->devcd_dev))
goto put_device;

+ for (i = 0; i < devcd->num_files; i++) {
+ if (device_create_bin_file(&devcd->devcd_dev,
+ &devcd->files[i].bin_attr))
+ /* nothing - some files will be missing */;
+ }
+
if (sysfs_create_link(&devcd->devcd_dev.kobj, &dev->kobj,
"failing_device"))
/* nothing - symlink will be missing */;
@@ -309,7 +332,41 @@ void dev_coredumpm(struct device *dev, struct module *owner,
put_module:
module_put(owner);
free:
- free(data);
+ for (i = 0; i < num_files; i++)
+ files[i].free(files[i].data);
+}
+EXPORT_SYMBOL_GPL(dev_coredumpm_bulk);
+
+/**
+ * dev_coredumpm - create device coredump with read/free methods
+ * @dev: the struct device for the crashed device
+ * @owner: the module that contains the read/free functions, use %THIS_MODULE
+ * @data: data cookie for the @read/@free functions
+ * @datalen: length of the data
+ * @gfp: allocation flags
+ * @read: function to read from the given buffer
+ * @free: function to free the given buffer
+ *
+ * Creates a new device coredump for the given device. If a previous one hasn't
+ * been read yet, the new coredump is discarded. The data lifetime is determined
+ * by the device coredump framework and when it is no longer needed the @free
+ * function will be called to free the data.
+ */
+void dev_coredumpm(struct device *dev, struct module *owner,
+ void *data, size_t datalen, gfp_t gfp,
+ ssize_t (*read)(char *buffer, loff_t offset, size_t count,
+ void *data, size_t datalen),
+ void (*free)(void *data))
+{
+ struct dev_coredumpm_bulk_data bulk_data = {
+ .name = "data",
+ .data = data,
+ .datalen = datalen,
+ .read = read,
+ .free = free,
+ };
+
+ dev_coredumpm_bulk(dev, owner, gfp, &bulk_data, 1);
}
EXPORT_SYMBOL_GPL(dev_coredumpm);

diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
index 269521f..9addb6f 100644
--- a/include/linux/devcoredump.h
+++ b/include/linux/devcoredump.h
@@ -65,6 +65,26 @@ static inline void _devcd_free_sgtable(struct scatterlist *table)
kfree(delete_iter);
}

+/**
+ * struct dev_coredumpm_bulk_data - Data used for dev_coredumpm_bulk
+ *
+ * @name: coredump file name
+ * @data: data cookie for the @read/@free functions
+ * @datalen: length of the data
+ * @read: function to read from the given buffer
+ * @free: function to free the given buffer
+ *
+ * An array of this structure is passed as argument to dev_coredumpm_bulk, and
+ * used to describe each device coredump.
+ */
+struct dev_coredumpm_bulk_data {
+ char *name;
+ void *data;
+ size_t datalen;
+ ssize_t (*read)(char *buffer, loff_t offset, size_t count,
+ void *data, size_t datalen);
+ void (*free)(void *data);
+};

#ifdef CONFIG_DEV_COREDUMP
void dev_coredumpv(struct device *dev, void *data, size_t datalen,
@@ -76,6 +96,9 @@ void dev_coredumpm(struct device *dev, struct module *owner,
void *data, size_t datalen),
void (*free)(void *data));

+void dev_coredumpm_bulk(struct device *dev, struct module *owner, gfp_t gfp,
+ struct dev_coredumpm_bulk_data *files, int num_files);
+
void dev_coredumpsg(struct device *dev, struct scatterlist *table,
size_t datalen, gfp_t gfp);
#else
@@ -95,6 +118,16 @@ dev_coredumpm(struct device *dev, struct module *owner,
free(data);
}

+static inline
+void dev_coredumpm_bulk(struct device *dev, struct module *owner, gfp_t gfp,
+ struct dev_coredumpm_bulk_data *files, int num_files)
+{
+ int i;
+
+ for (i = 0; i < num_files; i++)
+ files[i].free(files[i].data);
+}
+
static inline void dev_coredumpsg(struct device *dev, struct scatterlist *table,
size_t datalen, gfp_t gfp)
{
--
2.7.4

2019-05-12 15:56:07

by Akinobu Mita

Subject: [PATCH v3 4/7] nvme: add basic facility to get telemetry log page

This adds the required definisions to get telemetry log page.
The telemetry log page structure and identifier are copied from nvme-cli.

We also need a facility to check the log page attributes in order to know
whether the controller supports the telemetry log pages and the log page
offset field for the Get Log Page command. The telemetry data area could be
larger than the maximum data transfer size, so we may need to split the read
into multiple transfers with an incremental log page offset.
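
As an illustration, a consumer of the new ctrl->lpa field would gate the
telemetry dump on both bits before issuing chunked Get Log Page commands.
This is only a sketch of the intended use (the helper name is made up;
patch 5 contains the real checks):

static bool nvme_telemetry_log_supported(struct nvme_ctrl *ctrl)
{
	/* LPA bit 3: telemetry host/controller-initiated log pages */
	if (!(ctrl->lpa & NVME_CTRL_LPA_TELEMETRY_LOG))
		return false;
	/* LPA bit 2: extended data (log page offset), needed to split a log
	 * larger than the max transfer size into several transfers */
	if (!(ctrl->lpa & NVME_CTRL_LPA_EXTENDED_DATA))
		return false;
	return true;
}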

Cc: Johannes Berg <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Minwoo Im <[email protected]>
Cc: Kenneth Heitke <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
---
* v3
- Merge 'add telemetry log page definisions' patch and 'add facility to
check log page attributes' patch
- Copy struct nvme_telemetry_log_page_hdr from the latest nvme-cli
- Add BUILD_BUG_ON for the size of struct nvme_telemetry_log_page_hdr

drivers/nvme/host/core.c | 2 ++
drivers/nvme/host/nvme.h | 1 +
include/linux/nvme.h | 17 +++++++++++++++++
3 files changed, 20 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a6644a2..0cea2a8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2585,6 +2585,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
} else
ctrl->shutdown_timeout = shutdown_timeout;

+ ctrl->lpa = id->lpa;
ctrl->npss = id->npss;
ctrl->apsta = id->apsta;
prev_apst_enabled = ctrl->apst_enabled;
@@ -3898,6 +3899,7 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
+ BUILD_BUG_ON(sizeof(struct nvme_telemetry_log_page_hdr) != 512);
BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5ee75b5..7f6f1fc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -195,6 +195,7 @@ struct nvme_ctrl {
u32 vs;
u32 sgls;
u16 kas;
+ u8 lpa;
u8 npss;
u8 apsta;
u32 oaes;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c40720c..8c0b29d 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -294,6 +294,8 @@ enum {
NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
NVME_CTRL_OACS_DBBUF_SUPP = 1 << 8,
NVME_CTRL_LPA_CMD_EFFECTS_LOG = 1 << 1,
+ NVME_CTRL_LPA_EXTENDED_DATA = 1 << 2,
+ NVME_CTRL_LPA_TELEMETRY_LOG = 1 << 3,
};

struct nvme_lbaf {
@@ -396,6 +398,20 @@ enum {
NVME_NIDT_UUID = 0x03,
};

+struct nvme_telemetry_log_page_hdr {
+ __u8 lpi; /* Log page identifier */
+ __u8 rsvd[4];
+ __u8 iee_oui[3];
+ __le16 dalb1; /* Data area 1 last block */
+ __le16 dalb2; /* Data area 2 last block */
+ __le16 dalb3; /* Data area 3 last block */
+ __u8 rsvd1[368];
+ __u8 ctrlavail; /* Controller initiated data avail?*/
+ __u8 ctrldgn; /* Controller initiated telemetry Data Gen # */
+ __u8 rsnident[128];
+ __u8 telemetry_dataarea[0];
+};
+
struct nvme_smart_log {
__u8 critical_warning;
__u8 temperature[2];
@@ -832,6 +848,7 @@ enum {
NVME_LOG_FW_SLOT = 0x03,
NVME_LOG_CHANGED_NS = 0x04,
NVME_LOG_CMD_EFFECTS = 0x05,
+ NVME_LOG_TELEMETRY_CTRL = 0x08,
NVME_LOG_ANA = 0x0c,
NVME_LOG_DISC = 0x70,
NVME_LOG_RESERVATION = 0x80,
--
2.7.4

2019-05-12 15:56:25

by Akinobu Mita

Subject: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

This provides three functions that implement the device coredump for the
nvme driver.

nvme_coredump_init() - This function is called when the driver determines
to start collecting a device coredump. It captures snapshots of the
controller registers and of the admin and IO queues.

nvme_coredump_logs() - This function is called as soon as the device has
recovered from the crash and the admin queue becomes available. If the device
coredump has already been started by nvme_coredump_init(), the telemetry
controller-initiated data is collected. Otherwise it does nothing.

nvme_coredump_complete() - This function is called when the driver
determines that there is nothing more to collect for the device coredump.
All collected coredumps are exported via the device coredump mechanism.
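
The expected calling sequence is roughly the following (the exact call sites
are wired up by the later patches in this series):

	nvme_coredump_init(dev);	/* on timeout, before the controller
					 * is disabled: snapshot registers,
					 * SQs and CQs */
	/* ... controller reset ... */
	nvme_coredump_logs(dev);	/* admin queue is back: fetch the
					 * telemetry controller-initiated
					 * log, if supported */
	nvme_coredump_complete(dev);	/* export everything via devcoredump */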

After finishing the nvme device coredump, the following files are created.

- regs: NVMe controller registers (00h to 4Fh)
- sq<qid>: Submission queue
- cq<qid>: Completion queue
- telemetry-ctrl-log: Telemetry controller-initiated log (if available)
- data: Empty

The reason for the empty 'data' file is to provide a uniform way to notify
that the device coredump is no longer needed: the dump is discarded by
writing to the 'data' file.

All existing drivers that use the device coredump provide a 'data' file, so
if the nvme device coredump didn't provide one, userspace programs would need
to know which driver provides which coredump files.

Cc: Johannes Berg <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Minwoo Im <[email protected]>
Cc: Kenneth Heitke <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
---
* v3
- Fix typo s/machanism/mechanism/ in commit log
- Fix max transfer size calculation for get log page
- Add function comments
- Extract 'enable to trigger device coredump by hand' patch

drivers/nvme/host/Kconfig | 1 +
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/pci.c | 448 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 450 insertions(+)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 0f345e2..c3a06af 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -5,6 +5,7 @@ config BLK_DEV_NVME
tristate "NVM Express block device"
depends on PCI && BLOCK
select NVME_CORE
+ select WANT_DEV_COREDUMP
---help---
The NVM Express driver is for solid state drives directly
connected to the PCI or PCI Express bus. If you know you
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0cea2a8..172551b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2462,6 +2462,7 @@ int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,

return nvme_submit_sync_cmd(ctrl->admin_q, &c, log, size);
}
+EXPORT_SYMBOL_GPL(nvme_get_log);

static int nvme_get_effects_log(struct nvme_ctrl *ctrl)
{
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3e4fb89..3eebb98 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -9,6 +9,7 @@
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
+#include <linux/devcoredump.h>
#include <linux/dmi.h>
#include <linux/init.h>
#include <linux/interrupt.h>
@@ -89,6 +90,10 @@ struct nvme_queue;
static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);

+static void __maybe_unused nvme_coredump_init(struct nvme_dev *dev);
+static void __maybe_unused nvme_coredump_logs(struct nvme_dev *dev);
+static void __maybe_unused nvme_coredump_complete(struct nvme_dev *dev);
+
/*
* Represents an NVM Express device. Each nvme_dev is a PCI function.
*/
@@ -131,6 +136,9 @@ struct nvme_dev {
dma_addr_t host_mem_descs_dma;
struct nvme_host_mem_buf_desc *host_mem_descs;
void **host_mem_desc_bufs;
+
+ struct dev_coredumpm_bulk_data *dumps;
+ int num_dumps;
};

static int io_queue_depth_set(const char *val, const struct kernel_param *kp)
@@ -2849,6 +2857,446 @@ static int nvme_resume(struct device *dev)

static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume);

+#ifdef CONFIG_DEV_COREDUMP
+
+static ssize_t nvme_coredump_read(char *buffer, loff_t offset, size_t count,
+ void *data, size_t datalen)
+{
+ return memory_read_from_buffer(buffer, count, &offset, data, datalen);
+}
+
+static void nvme_coredump_free(void *data)
+{
+ kvfree(data);
+}
+
+static int nvme_coredump_empty(struct dev_coredumpm_bulk_data *data)
+{
+ data->name = kstrdup("data", GFP_KERNEL);
+ if (!data->name)
+ return -ENOMEM;
+
+ data->data = NULL;
+ data->datalen = 0;
+ data->read = nvme_coredump_read;
+ data->free = nvme_coredump_free;
+
+ return 0;
+}
+
+static int nvme_coredump_regs(struct dev_coredumpm_bulk_data *data,
+ struct nvme_ctrl *ctrl)
+{
+ const int reg_size = 0x50; /* 00h to 4Fh */
+
+ data->name = kstrdup("regs", GFP_KERNEL);
+ if (!data->name)
+ return -ENOMEM;
+
+ data->data = kvzalloc(reg_size, GFP_KERNEL);
+ if (!data->data) {
+ kfree(data->name);
+ return -ENOMEM;
+ }
+ memcpy_fromio(data->data, to_nvme_dev(ctrl)->bar, reg_size);
+
+ data->datalen = reg_size;
+ data->read = nvme_coredump_read;
+ data->free = nvme_coredump_free;
+
+ return 0;
+}
+
+static void *kvmemdup(const void *src, size_t len, gfp_t gfp)
+{
+ void *p;
+
+ p = kvmalloc(len, gfp);
+ if (p)
+ memcpy(p, src, len);
+
+ return p;
+}
+
+static int nvme_coredump_queues(struct dev_coredumpm_bulk_data *bulk_data,
+ struct nvme_ctrl *ctrl)
+{
+ int i;
+
+ for (i = 0; i < ctrl->queue_count; i++) {
+ struct dev_coredumpm_bulk_data *data = &bulk_data[2 * i];
+ struct nvme_queue *nvmeq = &to_nvme_dev(ctrl)->queues[i];
+
+ data[0].name = kasprintf(GFP_KERNEL, "sq%d", i);
+ data[0].data = kvmemdup(nvmeq->sq_cmds,
+ SQ_SIZE(nvmeq->q_depth), GFP_KERNEL);
+ data[0].datalen = SQ_SIZE(nvmeq->q_depth);
+ data[0].read = nvme_coredump_read;
+ data[0].free = nvme_coredump_free;
+
+ data[1].name = kasprintf(GFP_KERNEL, "cq%d", i);
+ data[1].data = kvmemdup((void *)nvmeq->cqes,
+ CQ_SIZE(nvmeq->q_depth), GFP_KERNEL);
+ data[1].datalen = CQ_SIZE(nvmeq->q_depth);
+ data[1].read = nvme_coredump_read;
+ data[1].free = nvme_coredump_free;
+
+ if (!data[0].name || !data[1].name ||
+ !data[0].data || !data[1].data)
+ goto free;
+ }
+
+ return 0;
+free:
+ for (; i >= 0; i--) {
+ struct dev_coredumpm_bulk_data *data = &bulk_data[2 * i];
+
+ kfree(data[0].name);
+ kfree(data[1].name);
+ kvfree(data[0].data);
+ kvfree(data[1].data);
+ }
+
+ return -ENOMEM;
+}
+
+static struct
+dev_coredumpm_bulk_data *nvme_coredump_alloc(struct nvme_dev *dev, int n)
+{
+ struct dev_coredumpm_bulk_data *data;
+
+ data = krealloc(dev->dumps, sizeof(*data) * (dev->num_dumps + n),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!data)
+ return NULL;
+
+ dev->dumps = data;
+ data += dev->num_dumps;
+ dev->num_dumps += n;
+
+ return data;
+}
+
+static void __nvme_coredump_discard(struct nvme_dev *dev, bool free_data)
+{
+ int i;
+
+ for (i = 0; i < dev->num_dumps; i++) {
+ kfree(dev->dumps[i].name);
+ if (free_data)
+ dev->dumps[i].free(dev->dumps[i].data);
+ }
+
+ kfree(dev->dumps);
+ dev->dumps = NULL;
+ dev->num_dumps = 0;
+}
+
+static void nvme_coredump_discard(struct nvme_dev *dev)
+{
+ __nvme_coredump_discard(dev, true);
+}
+
+static void nvme_coredump_clear(struct nvme_dev *dev)
+{
+ __nvme_coredump_discard(dev, false);
+}
+
+/**
+ * nvme_coredump_init() - start collecting device coredump
+ * @dev: the struct nvme_dev for the crashed device
+ *
+ * This function is called when the driver determines to start collecting
+ * device coredump. The snapshots of the controller registers, and admin and
+ * IO queues are captured by this.
+ */
+static void nvme_coredump_init(struct nvme_dev *dev)
+{
+ struct nvme_ctrl *ctrl = &dev->ctrl;
+ struct dev_coredumpm_bulk_data *bulk_data;
+ int ret;
+
+ if (WARN_ON(dev->dumps))
+ nvme_coredump_discard(dev);
+
+ bulk_data = nvme_coredump_alloc(dev, 2 + 2 * ctrl->queue_count);
+ if (!bulk_data)
+ return;
+
+ ret = nvme_coredump_empty(&bulk_data[0]);
+ if (ret)
+ goto free_bulk_data;
+
+ ret = nvme_coredump_regs(&bulk_data[1], ctrl);
+ if (ret)
+ goto free_bulk_data;
+
+ ret = nvme_coredump_queues(&bulk_data[2], ctrl);
+ if (ret)
+ goto free_bulk_data;
+
+ return;
+
+free_bulk_data:
+ nvme_coredump_discard(dev);
+}
+
+static ssize_t nvme_coredump_read_sgtable(char *buffer, loff_t offset,
+ size_t buf_len, void *data,
+ size_t data_len)
+{
+ struct sg_table *table = data;
+
+ if (data_len <= offset)
+ return 0;
+
+ if (offset + buf_len > data_len)
+ buf_len = data_len - offset;
+
+ return sg_pcopy_to_buffer(table->sgl, sg_nents(table->sgl), buffer,
+ buf_len, offset);
+}
+
+static void nvme_coredump_free_sgtable(void *data)
+{
+ struct sg_table *table = data;
+ struct scatterlist *sg;
+ int n = sg_nents(table->sgl);
+ int i;
+
+ for_each_sg(table->sgl, sg, n, i) {
+ struct page *page = sg_page(sg);
+
+ if (page)
+ __free_page(page);
+
+ }
+
+ kfree(table);
+}
+
+static struct sg_table *nvme_coredump_alloc_sgtable(size_t bytes)
+{
+ struct sg_table *table;
+ struct scatterlist *sg;
+ int n = DIV_ROUND_UP(bytes, PAGE_SIZE);
+ int i;
+
+ table = kzalloc(sizeof(*table), GFP_KERNEL);
+ if (!table)
+ return NULL;
+
+ if (sg_alloc_table(table, n, GFP_KERNEL))
+ goto free_table;
+
+ for_each_sg(table->sgl, sg, n, i) {
+ struct page *page = alloc_page(GFP_KERNEL);
+ size_t size = min_t(int, bytes, PAGE_SIZE);
+
+ if (!page)
+ goto free_page;
+
+ sg_set_page(sg, page, size, 0);
+ bytes -= size;
+ }
+
+ return table;
+free_page:
+ for_each_sg(table->sgl, sg, n, i) {
+ struct page *page = sg_page(sg);
+
+ if (page)
+ __free_page(page);
+
+ }
+free_table:
+ kfree(table);
+
+ return NULL;
+}
+
+static int nvme_get_telemetry_log_blocks(struct nvme_ctrl *ctrl, void *buf,
+ size_t bytes, loff_t offset)
+{
+ loff_t pos = 0;
+ u32 chunk_size;
+
+ if (check_mul_overflow(ctrl->max_hw_sectors, 512u, &chunk_size))
+ chunk_size = UINT_MAX;
+
+ while (pos < bytes) {
+ size_t size = min_t(size_t, bytes - pos, chunk_size);
+ int ret;
+
+ ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_TELEMETRY_CTRL,
+ 0, buf + pos, size, offset + pos);
+ if (ret)
+ return ret;
+
+ pos += size;
+ }
+
+ return 0;
+}
+
+static int nvme_get_telemetry_log(struct nvme_ctrl *ctrl,
+ struct sg_table *table, size_t bytes)
+{
+ int n = sg_nents(table->sgl);
+ struct scatterlist *sg;
+ size_t offset = 0;
+ int i;
+
+ for_each_sg(table->sgl, sg, n, i) {
+ struct page *page = sg_page(sg);
+ size_t size = min_t(int, bytes - offset, sg->length);
+ int ret;
+
+ ret = nvme_get_telemetry_log_blocks(ctrl, page_address(page),
+ size, offset);
+ if (ret)
+ return ret;
+
+ offset += size;
+ }
+
+ return 0;
+}
+
+static int nvme_coredump_telemetry_log(struct dev_coredumpm_bulk_data *data,
+ struct nvme_ctrl *ctrl)
+{
+ struct nvme_telemetry_log_page_hdr *page_hdr;
+ struct sg_table *table;
+ int log_size;
+ int ret;
+ u8 ctrldgn;
+
+ if (!(ctrl->lpa & NVME_CTRL_LPA_TELEMETRY_LOG))
+ return -EINVAL;
+ if (!(ctrl->lpa & NVME_CTRL_LPA_EXTENDED_DATA))
+ return -EINVAL;
+
+ page_hdr = kzalloc(sizeof(*page_hdr), GFP_KERNEL);
+ if (!page_hdr)
+ return -ENOMEM;
+
+ ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_TELEMETRY_CTRL, 0,
+ page_hdr, sizeof(*page_hdr), 0);
+ if (ret)
+ goto free_page_hdr;
+
+ if (!page_hdr->ctrlavail) {
+ ret = -EINVAL;
+ goto free_page_hdr;
+ }
+
+ log_size = (le16_to_cpu(page_hdr->dalb3) + 1) * 512;
+
+ table = nvme_coredump_alloc_sgtable(log_size);
+ if (!table) {
+ ret = -ENOMEM;
+ goto free_page_hdr;
+ }
+
+ ret = nvme_get_telemetry_log(ctrl, table, log_size);
+ if (ret)
+ goto free_table;
+
+ sg_pcopy_to_buffer(table->sgl, sg_nents(table->sgl), &ctrldgn,
+ sizeof(ctrldgn),
+ offsetof(typeof(*page_hdr), ctrldgn));
+
+ ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_TELEMETRY_CTRL, 0,
+ page_hdr, sizeof(*page_hdr), 0);
+ if (ret)
+ goto free_table;
+
+ if (page_hdr->ctrldgn != ctrldgn) {
+ ret = -EINVAL;
+ goto free_table;
+ }
+
+ data->name = kstrdup("telemetry-ctrl-log", GFP_KERNEL);
+ if (!data->name) {
+ ret = -ENOMEM;
+ goto free_table;
+ }
+
+ data->data = table;
+ data->datalen = log_size;
+ data->read = nvme_coredump_read_sgtable;
+ data->free = nvme_coredump_free_sgtable;
+
+ kfree(page_hdr);
+
+ return 0;
+free_table:
+ nvme_coredump_free_sgtable(table);
+free_page_hdr:
+ kfree(page_hdr);
+
+ return ret;
+}
+
+/**
+ * nvme_coredump_logs() - get telemetry log if available
+ * @dev: the struct nvme_dev for the crashed device
+ *
+ * This function is called as soon as the device is recovered from the crash
+ * and admin queue becomes available. If the device coredump has already been
+ * started by nvme_coredump_init(), the telemetry controller-initiated data
+ * will be collected. Otherwise do nothing.
+ */
+static void nvme_coredump_logs(struct nvme_dev *dev)
+{
+ struct dev_coredumpm_bulk_data *bulk_data;
+
+ if (!dev->dumps)
+ return;
+
+ bulk_data = nvme_coredump_alloc(dev, 1);
+ if (!bulk_data)
+ return;
+
+ if (nvme_coredump_telemetry_log(bulk_data, &dev->ctrl))
+ dev->num_dumps--;
+}
+
+/**
+ * nvme_coredump_complete() - finish device coredump
+ * @dev: the struct nvme_dev for the crashed device
+ *
+ * This function is called when the driver determines that there is nothing
+ * to collect device coredump anymore. All collected coredumps are exported
+ * via device coredump mechanism.
+ */
+static void nvme_coredump_complete(struct nvme_dev *dev)
+{
+ if (!dev->dumps)
+ return;
+
+ dev_coredumpm_bulk(dev->dev, THIS_MODULE, GFP_KERNEL,
+ dev->dumps, dev->num_dumps);
+ nvme_coredump_clear(dev);
+}
+
+#else
+
+static void nvme_coredump_init(struct nvme_dev *dev)
+{
+}
+
+static void nvme_coredump_logs(struct nvme_dev *dev)
+{
+}
+
+static void nvme_coredump_complete(struct nvme_dev *dev)
+{
+}
+
+#endif /* CONFIG_DEV_COREDUMP */
+
static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
{
--
2.7.4

2019-05-12 15:56:35

by Akinobu Mita

Subject: [PATCH v3 7/7] nvme-pci: enable to trigger device coredump by hand

This provides a way to trigger the nvme device coredump by writing
anything to the /sys/devices/.../coredump attribute.

Cc: Johannes Berg <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Minwoo Im <[email protected]>
Cc: Kenneth Heitke <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
---
* v3
- Extracted from 'add device coredump infrastructure' patch
- Avoid deadlock in .coredump callback

drivers/nvme/host/pci.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6522592..fad5395 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3288,6 +3288,14 @@ static void nvme_coredump_complete(struct nvme_dev *dev)
nvme_coredump_clear(dev);
}

+static void nvme_coredump(struct device *dev)
+{
+ struct nvme_dev *ndev = dev_get_drvdata(dev);
+
+ nvme_dev_disable(ndev, false, true);
+ nvme_reset_ctrl_sync(&ndev->ctrl);
+}
+
#else

static void nvme_coredump_init(struct nvme_dev *dev)
@@ -3302,6 +3310,10 @@ static void nvme_coredump_complete(struct nvme_dev *dev)
{
}

+static void nvme_coredump(struct device *dev)
+{
+}
+
#endif /* CONFIG_DEV_COREDUMP */

static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
@@ -3409,6 +3421,7 @@ static struct pci_driver nvme_driver = {
.shutdown = nvme_shutdown,
.driver = {
.pm = &nvme_dev_pm_ops,
+ .coredump = nvme_coredump,
},
.sriov_configure = pci_sriov_configure_simple,
.err_handler = &nvme_err_handler,
--
2.7.4

2019-05-12 15:57:49

by Akinobu Mita

Subject: [PATCH v3 6/7] nvme-pci: trigger device coredump on command timeout

This enables the nvme driver to trigger a device coredump when a command
timeout occurs, which helps diagnose and debug issues.

This can be tested with fail_io_timeout fault injection.

# echo 1 > /sys/kernel/debug/fail_io_timeout/probability
# echo 1 > /sys/kernel/debug/fail_io_timeout/times
# echo 1 > /sys/block/nvme0n1/io-timeout-fail
# dd if=/dev/nvme0n1 of=/dev/null

Cc: Johannes Berg <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Minwoo Im <[email protected]>
Cc: Kenneth Heitke <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
---
* v3
- Don't try to get telemetry log when admin queue is not available

drivers/nvme/host/pci.c | 39 +++++++++++++++++++++++----------------
1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3eebb98..6522592 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -87,12 +87,12 @@ MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO.");
struct nvme_dev;
struct nvme_queue;

-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump);
static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);

-static void __maybe_unused nvme_coredump_init(struct nvme_dev *dev);
-static void __maybe_unused nvme_coredump_logs(struct nvme_dev *dev);
-static void __maybe_unused nvme_coredump_complete(struct nvme_dev *dev);
+static void nvme_coredump_init(struct nvme_dev *dev);
+static void nvme_coredump_logs(struct nvme_dev *dev);
+static void nvme_coredump_complete(struct nvme_dev *dev);

/*
* Represents an NVM Express device. Each nvme_dev is a PCI function.
@@ -1280,7 +1280,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
*/
if (nvme_should_reset(dev, csts)) {
nvme_warn_reset(dev, csts);
- nvme_dev_disable(dev, false);
+ nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);
return BLK_EH_DONE;
}
@@ -1309,7 +1309,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
dev_warn_ratelimited(dev->ctrl.device,
"I/O %d QID %d timeout, disable controller\n",
req->tag, nvmeq->qid);
- nvme_dev_disable(dev, shutdown);
+ nvme_dev_disable(dev, shutdown, true);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_DONE;
default:
@@ -1325,7 +1325,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
dev_warn(dev->ctrl.device,
"I/O %d QID %d timeout, reset controller\n",
req->tag, nvmeq->qid);
- nvme_dev_disable(dev, false);
+ nvme_dev_disable(dev, false, true);
nvme_reset_ctrl(&dev->ctrl);

nvme_req(req)->flags |= NVME_REQ_CANCELLED;
@@ -2382,7 +2382,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
}
}

-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool dump)
{
bool dead = true;
struct pci_dev *pdev = to_pci_dev(dev->dev);
@@ -2407,6 +2407,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
}

+ if (dump)
+ nvme_coredump_init(dev);
+
nvme_stop_queues(&dev->ctrl);

if (!dead && dev->ctrl.queue_count > 0) {
@@ -2477,7 +2480,7 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status);

nvme_get_ctrl(&dev->ctrl);
- nvme_dev_disable(dev, false);
+ nvme_dev_disable(dev, false, false);
nvme_kill_queues(&dev->ctrl);
if (!queue_work(nvme_wq, &dev->remove_work))
nvme_put_ctrl(&dev->ctrl);
@@ -2499,7 +2502,7 @@ static void nvme_reset_work(struct work_struct *work)
* moving on.
*/
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
- nvme_dev_disable(dev, false);
+ nvme_dev_disable(dev, false, false);

mutex_lock(&dev->shutdown_lock);
result = nvme_pci_enable(dev);
@@ -2536,6 +2539,9 @@ static void nvme_reset_work(struct work_struct *work)
if (result)
goto out;

+ nvme_coredump_logs(dev);
+ nvme_coredump_complete(dev);
+
if (dev->ctrl.oacs & NVME_CTRL_OACS_SEC_SUPP) {
if (!dev->ctrl.opal_dev)
dev->ctrl.opal_dev =
@@ -2598,6 +2604,7 @@ static void nvme_reset_work(struct work_struct *work)
out_unlock:
mutex_unlock(&dev->shutdown_lock);
out:
+ nvme_coredump_complete(dev);
nvme_remove_dead_ctrl(dev, result);
}

@@ -2788,7 +2795,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
static void nvme_reset_prepare(struct pci_dev *pdev)
{
struct nvme_dev *dev = pci_get_drvdata(pdev);
- nvme_dev_disable(dev, false);
+ nvme_dev_disable(dev, false, false);
}

static void nvme_reset_done(struct pci_dev *pdev)
@@ -2800,7 +2807,7 @@ static void nvme_reset_done(struct pci_dev *pdev)
static void nvme_shutdown(struct pci_dev *pdev)
{
struct nvme_dev *dev = pci_get_drvdata(pdev);
- nvme_dev_disable(dev, true);
+ nvme_dev_disable(dev, true, false);
}

/*
@@ -2817,14 +2824,14 @@ static void nvme_remove(struct pci_dev *pdev)

if (!pci_device_is_present(pdev)) {
nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
- nvme_dev_disable(dev, true);
+ nvme_dev_disable(dev, true, false);
nvme_dev_remove_admin(dev);
}

flush_work(&dev->ctrl.reset_work);
nvme_stop_ctrl(&dev->ctrl);
nvme_remove_namespaces(&dev->ctrl);
- nvme_dev_disable(dev, true);
+ nvme_dev_disable(dev, true, false);
nvme_release_cmb(dev);
nvme_free_host_mem(dev);
nvme_dev_remove_admin(dev);
@@ -2841,7 +2848,7 @@ static int nvme_suspend(struct device *dev)
struct pci_dev *pdev = to_pci_dev(dev);
struct nvme_dev *ndev = pci_get_drvdata(pdev);

- nvme_dev_disable(ndev, true);
+ nvme_dev_disable(ndev, true, false);
return 0;
}

@@ -3313,7 +3320,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
case pci_channel_io_frozen:
dev_warn(dev->ctrl.device,
"frozen state error detected, reset controller\n");
- nvme_dev_disable(dev, false);
+ nvme_dev_disable(dev, false, false);
return PCI_ERS_RESULT_NEED_RESET;
case pci_channel_io_perm_failure:
dev_warn(dev->ctrl.device,
--
2.7.4

2019-05-13 08:19:36

by Minwoo Im

Subject: Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

> +static int nvme_get_telemetry_log_blocks(struct nvme_ctrl *ctrl, void *buf,
> + size_t bytes, loff_t offset)
> +{
> + loff_t pos = 0;
> + u32 chunk_size;
> +
> + if (check_mul_overflow(ctrl->max_hw_sectors, 512u, &chunk_size))
> + chunk_size = UINT_MAX;
> +
> + while (pos < bytes) {
> + size_t size = min_t(size_t, bytes - pos, chunk_size);
> + int ret;
> +
> + ret = nvme_get_log(ctrl, NVME_NSID_ALL,
> NVME_LOG_TELEMETRY_CTRL,
> + 0, buf + pos, size, offset + pos);
> + if (ret)
> + return ret;
> +
> + pos += size;
> + }
> +
> + return 0;
> +}
> +
> +static int nvme_get_telemetry_log(struct nvme_ctrl *ctrl,
> + struct sg_table *table, size_t bytes)
> +{
> + int n = sg_nents(table->sgl);
> + struct scatterlist *sg;
> + size_t offset = 0;
> + int i;
> +
> + for_each_sg(table->sgl, sg, n, i) {
> + struct page *page = sg_page(sg);
> + size_t size = min_t(int, bytes - offset, sg->length);
> + int ret;
> +
> + ret = nvme_get_telemetry_log_blocks(ctrl,
> page_address(page),
> + size, offset);
> + if (ret)
> + return ret;
> +
> + offset += size;
> + }
> +
> + return 0;
> +}

Can we have those two in nvme-core module instead of being in pci module?

2019-05-13 09:32:26

by Minwoo Im

Subject: Re: [PATCH v3 6/7] nvme-pci: trigger device coredump on command timeout

> -static void __maybe_unused nvme_coredump_init(struct nvme_dev *dev);
> -static void __maybe_unused nvme_coredump_logs(struct nvme_dev *dev);
> -static void __maybe_unused nvme_coredump_complete(struct nvme_dev
> *dev);
> +static void nvme_coredump_init(struct nvme_dev *dev);
> +static void nvme_coredump_logs(struct nvme_dev *dev);
> +static void nvme_coredump_complete(struct nvme_dev *dev);

You just added those three prototypes in the previous patch. Did I miss
something here?

2019-05-13 14:22:29

by Keith Busch

Subject: Re: [PATCH v3 6/7] nvme-pci: trigger device coredump on command timeout

On Sun, May 12, 2019 at 08:54:16AM -0700, Akinobu Mita wrote:
> @@ -2536,6 +2539,9 @@ static void nvme_reset_work(struct work_struct *work)
> if (result)
> goto out;
>
> + nvme_coredump_logs(dev);

If you change nvme_coredump_logs to return an int, check it here for < 0
and abandon the reset if true.

> + nvme_coredump_complete(dev);
> +
> if (dev->ctrl.oacs & NVME_CTRL_OACS_SEC_SUPP) {
> if (!dev->ctrl.opal_dev)
> dev->ctrl.opal_dev =

2019-05-13 14:56:47

by Keith Busch

Subject: Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

On Sun, May 12, 2019 at 08:54:15AM -0700, Akinobu Mita wrote:
> +static void nvme_coredump_logs(struct nvme_dev *dev)
> +{
> + struct dev_coredumpm_bulk_data *bulk_data;
> +
> + if (!dev->dumps)
> + return;
> +
> + bulk_data = nvme_coredump_alloc(dev, 1);
> + if (!bulk_data)
> + return;
> +
> + if (nvme_coredump_telemetry_log(bulk_data, &dev->ctrl))
> + dev->num_dumps--;
> +}

You'll need this function to return the same 'int' value from
nvme_coredump_telemetry_log. A negative value here means that the
device didn't produce a response, and that's important to check from
the reset work since you'll need to abort the reset if that happens.

2019-05-13 15:58:26

by Christoph Hellwig

Subject: Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

Usage of a scatterlist here is rather bogus as we never use
it for dma mapping. Why can't you store the various pages in a
large bio_vec and then just issue that to the device in one
get log page command? (or at least a few if MDTS kicks in?)

2019-05-13 16:23:58

by Akinobu Mita

Subject: Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

On Mon, May 13, 2019 at 22:55, Keith Busch <[email protected]> wrote:
>
> On Sun, May 12, 2019 at 08:54:15AM -0700, Akinobu Mita wrote:
> > +static void nvme_coredump_logs(struct nvme_dev *dev)
> > +{
> > + struct dev_coredumpm_bulk_data *bulk_data;
> > +
> > + if (!dev->dumps)
> > + return;
> > +
> > + bulk_data = nvme_coredump_alloc(dev, 1);
> > + if (!bulk_data)
> > + return;
> > +
> > + if (nvme_coredump_telemetry_log(bulk_data, &dev->ctrl))
> > + dev->num_dumps--;
> > +}
>
> You'll need this function to return the same 'int' value from
> nvme_coredump_telemetry_log. A negative value here means that the
> device didn't produce a response, and that's important to check from
> the reset work since you'll need to abort the reset if that happens.

OK. Makes sense.
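
For v4 that would look roughly like the sketch below (untested, just the
direction): nvme_coredump_logs() propagates the error, and the reset work
bails out when the controller didn't respond.

static int nvme_coredump_logs(struct nvme_dev *dev)
{
	struct dev_coredumpm_bulk_data *bulk_data;
	int ret;

	if (!dev->dumps)
		return 0;

	bulk_data = nvme_coredump_alloc(dev, 1);
	if (!bulk_data)
		return 0;

	ret = nvme_coredump_telemetry_log(bulk_data, &dev->ctrl);
	if (ret)
		dev->num_dumps--;

	return ret;
}

and in nvme_reset_work():

	result = nvme_coredump_logs(dev);
	if (result < 0)
		goto out;
	nvme_coredump_complete(dev);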

2019-05-13 17:59:00

by Chaitanya Kulkarni

Subject: Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

On 05/13/2019 12:46 AM, Minwoo Im wrote:
>> +static int nvme_get_telemetry_log_blocks(struct nvme_ctrl *ctrl, void *buf,
>> + size_t bytes, loff_t offset)
>> +{
>> + loff_t pos = 0;
>> + u32 chunk_size;
>> +
>> + if (check_mul_overflow(ctrl->max_hw_sectors, 512u, &chunk_size))
>> + chunk_size = UINT_MAX;
>> +
>> + while (pos < bytes) {
>> + size_t size = min_t(size_t, bytes - pos, chunk_size);
>> + int ret;
>> +
>> + ret = nvme_get_log(ctrl, NVME_NSID_ALL,
>> NVME_LOG_TELEMETRY_CTRL,
>> + 0, buf + pos, size, offset + pos);
>> + if (ret)
>> + return ret;
>> +
>> + pos += size;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int nvme_get_telemetry_log(struct nvme_ctrl *ctrl,
>> + struct sg_table *table, size_t bytes)
>> +{
>> + int n = sg_nents(table->sgl);
>> + struct scatterlist *sg;
>> + size_t offset = 0;
>> + int i;
>> +
A little comment would be nice if you are using sg operations.
>> + for_each_sg(table->sgl, sg, n, i) {
>> + struct page *page = sg_page(sg);
>> + size_t size = min_t(int, bytes - offset, sg->length);
>> + int ret;
>> +
>> + ret = nvme_get_telemetry_log_blocks(ctrl,
>> page_address(page),
>> + size, offset);
>> + if (ret)
>> + return ret;
>> +
>> + offset += size;
>> + }
>> +
>> + return 0;
>> +}
>
> Can we have those two in nvme-core module instead of being in pci module?

Since they are based on the controller they should be moved next to
nvme_get_log() in the ${KERN_DIR}/drivers/nvme/host/core.c.


2019-05-13 18:35:28

by Akinobu Mita

Subject: Re: [PATCH v3 6/7] nvme-pci: trigger device coredump on command timeout

On Mon, May 13, 2019 at 16:41, Minwoo Im <[email protected]> wrote:
>
> > -static void __maybe_unused nvme_coredump_init(struct nvme_dev *dev);
> > -static void __maybe_unused nvme_coredump_logs(struct nvme_dev *dev);
> > -static void __maybe_unused nvme_coredump_complete(struct nvme_dev
> > *dev);
> > +static void nvme_coredump_init(struct nvme_dev *dev);
> > +static void nvme_coredump_logs(struct nvme_dev *dev);
> > +static void nvme_coredump_complete(struct nvme_dev *dev);
>
> You just have added those three prototypes in previous patch. Did I miss
> something here?

These __maybe_unused annotations are needed only in patch 5/7, because these
functions are still unused until patch 6/7 is applied.

2019-05-13 18:36:01

by Akinobu Mita

Subject: Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

On Mon, May 13, 2019 at 23:03, Christoph Hellwig <[email protected]> wrote:
>
> Usage of a scatterlist here is rather bogus as we never use
> it for dma mapping. Why can't you store the various pages in a
> large bio_vec and then just issue that to the device in one
> get log page command? (or at least a few if MDTS kicks in?)

OK. I'll try to use bio_vec and see how it goes.
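
Something like this, perhaps (untested sketch of the direction; the helper
name is made up), replacing nvme_coredump_alloc_sgtable() with a flat
bio_vec array:

static struct bio_vec *nvme_coredump_alloc_bvecs(size_t bytes, int *nr_pages)
{
	int n = DIV_ROUND_UP(bytes, PAGE_SIZE);
	struct bio_vec *bvecs;
	int i;

	bvecs = kcalloc(n, sizeof(*bvecs), GFP_KERNEL);
	if (!bvecs)
		return NULL;

	for (i = 0; i < n; i++) {
		struct page *page = alloc_page(GFP_KERNEL);

		if (!page)
			goto free;

		bvecs[i].bv_page = page;
		bvecs[i].bv_len = min_t(size_t, bytes, PAGE_SIZE);
		bvecs[i].bv_offset = 0;
		bytes -= bvecs[i].bv_len;
	}

	*nr_pages = n;
	return bvecs;

free:
	while (i-- > 0)
		__free_page(bvecs[i].bv_page);
	kfree(bvecs);
	return NULL;
}

The get-log and read-out loops would then walk this array instead of
iterating the sg_table.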

2019-05-13 19:15:39

by Chaitanya Kulkarni

Subject: Re: [PATCH v3 4/7] nvme: add basic facility to get telemetry log page

On 05/12/2019 08:55 AM, Akinobu Mita wrote:
> This adds the required definisions to get telemetry log page.
s/definisions/definitions/
> The telemetry log page structure and identifier are copied from nvme-cli.
>
> We also need a facility to check log page attributes in order to know
> the controller supports the telemetry log pages and log page offset field
> for the Get Log Page command. The telemetry data area could be larger
> than maximum data transfer size, so we may need to split into multiple
> transfers with incremental page offset.
>
> Cc: Johannes Berg <[email protected]>
> Cc: Keith Busch <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Sagi Grimberg <[email protected]>
> Cc: Minwoo Im <[email protected]>
> Cc: Kenneth Heitke <[email protected]>
> Signed-off-by: Akinobu Mita <[email protected]>
> ---
> * v3
> - Merge 'add telemetry log page definisions' patch and 'add facility to
> check log page attributes' patch
> - Copy struct nvme_telemetry_log_page_hdr from the latest nvme-cli
> - Add BUILD_BUG_ON for the size of struct nvme_telemetry_log_page_hdr
>
> drivers/nvme/host/core.c | 2 ++
> drivers/nvme/host/nvme.h | 1 +
> include/linux/nvme.h | 17 +++++++++++++++++
> 3 files changed, 20 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a6644a2..0cea2a8 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2585,6 +2585,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> } else
> ctrl->shutdown_timeout = shutdown_timeout;
>
> + ctrl->lpa = id->lpa;
> ctrl->npss = id->npss;
> ctrl->apsta = id->apsta;
> prev_apst_enabled = ctrl->apst_enabled;
> @@ -3898,6 +3899,7 @@ static inline void _nvme_check_size(void)
> BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) != NVME_IDENTIFY_DATA_SIZE);
> BUILD_BUG_ON(sizeof(struct nvme_id_ns) != NVME_IDENTIFY_DATA_SIZE);
> BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
> + BUILD_BUG_ON(sizeof(struct nvme_telemetry_log_page_hdr) != 512);
> BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
> BUILD_BUG_ON(sizeof(struct nvme_dbbuf) != 64);
> BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 5ee75b5..7f6f1fc 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -195,6 +195,7 @@ struct nvme_ctrl {
> u32 vs;
> u32 sgls;
> u16 kas;
> + u8 lpa;
> u8 npss;
> u8 apsta;
> u32 oaes;
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index c40720c..8c0b29d 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -294,6 +294,8 @@ enum {
> NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
> NVME_CTRL_OACS_DBBUF_SUPP = 1 << 8,
> NVME_CTRL_LPA_CMD_EFFECTS_LOG = 1 << 1,
> + NVME_CTRL_LPA_EXTENDED_DATA = 1 << 2,
> + NVME_CTRL_LPA_TELEMETRY_LOG = 1 << 3,
> };
>
> struct nvme_lbaf {
> @@ -396,6 +398,20 @@ enum {
> NVME_NIDT_UUID = 0x03,
> };
>
> +struct nvme_telemetry_log_page_hdr {
> + __u8 lpi; /* Log page identifier */
> + __u8 rsvd[4];
> + __u8 iee_oui[3];
> + __le16 dalb1; /* Data area 1 last block */
> + __le16 dalb2; /* Data area 2 last block */
> + __le16 dalb3; /* Data area 3 last block */
> + __u8 rsvd1[368];
> + __u8 ctrlavail; /* Controller initiated data avail?*/
> + __u8 ctrldgn; /* Controller initiated telemetry Data Gen # */
> + __u8 rsnident[128];
> + __u8 telemetry_dataarea[0];
> +};
> +

nit:- Thanks for adding the comments, can you please align all the above
comments like :-

+struct nvme_telemetry_log_page_hdr {
+ __u8 lpi; /* Log page identifier */
+ __u8 rsvd[4];
+ __u8 iee_oui[3];
+ __le16 dalb1; /* Data area 1 last block */
+ __le16 dalb2; /* Data area 2 last block */
+ __le16 dalb3; /* Data area 3 last block */
+ __u8 rsvd1[368];
+ __u8 ctrlavail; /* Controller initiated data avail?*/
+ __u8 ctrldgn; /* Controller initiated telemetry Data
Gen # */
+ __u8 rsnident[128];
+ __u8 telemetry_dataarea[0];
+};
+


> struct nvme_smart_log {
> __u8 critical_warning;
> __u8 temperature[2];
> @@ -832,6 +848,7 @@ enum {
> NVME_LOG_FW_SLOT = 0x03,
> NVME_LOG_CHANGED_NS = 0x04,
> NVME_LOG_CMD_EFFECTS = 0x05,
> + NVME_LOG_TELEMETRY_CTRL = 0x08,
> NVME_LOG_ANA = 0x0c,
> NVME_LOG_DISC = 0x70,
> NVME_LOG_RESERVATION = 0x80,
>

2019-05-14 14:07:40

by Akinobu Mita

Subject: Re: [PATCH v3 4/7] nvme: add basic facility to get telemetry log page

On Tue, May 14, 2019 at 0:34, Chaitanya Kulkarni <[email protected]> wrote:
>
> On 05/12/2019 08:55 AM, Akinobu Mita wrote:
> > This adds the required definisions to get telemetry log page.
> s/definisions/definitions/

OK.

> > diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> > index c40720c..8c0b29d 100644
> > --- a/include/linux/nvme.h
> > +++ b/include/linux/nvme.h
> > @@ -294,6 +294,8 @@ enum {
> > NVME_CTRL_OACS_DIRECTIVES = 1 << 5,
> > NVME_CTRL_OACS_DBBUF_SUPP = 1 << 8,
> > NVME_CTRL_LPA_CMD_EFFECTS_LOG = 1 << 1,
> > + NVME_CTRL_LPA_EXTENDED_DATA = 1 << 2,
> > + NVME_CTRL_LPA_TELEMETRY_LOG = 1 << 3,
> > };
> >
> > struct nvme_lbaf {
> > @@ -396,6 +398,20 @@ enum {
> > NVME_NIDT_UUID = 0x03,
> > };
> >
> > +struct nvme_telemetry_log_page_hdr {
> > + __u8 lpi; /* Log page identifier */
> > + __u8 rsvd[4];
> > + __u8 iee_oui[3];
> > + __le16 dalb1; /* Data area 1 last block */
> > + __le16 dalb2; /* Data area 2 last block */
> > + __le16 dalb3; /* Data area 3 last block */
> > + __u8 rsvd1[368];
> > + __u8 ctrlavail; /* Controller initiated data avail?*/
> > + __u8 ctrldgn; /* Controller initiated telemetry Data Gen # */
> > + __u8 rsnident[128];
> > + __u8 telemetry_dataarea[0];
> > +};
> > +
>
> nit:- Thanks for adding the comments, can you please align all the above
> comments like :-

OK. I'll send a patch for nvme-cli first.

> +struct nvme_telemetry_log_page_hdr {
> + __u8 lpi; /* Log page identifier */
> + __u8 rsvd[4];
> + __u8 iee_oui[3];
> + __le16 dalb1; /* Data area 1 last block */
> + __le16 dalb2; /* Data area 2 last block */
> + __le16 dalb3; /* Data area 3 last block */
> + __u8 rsvd1[368];
> + __u8 ctrlavail; /* Controller initiated data avail?*/
> + __u8 ctrldgn; /* Controller initiated telemetry Data
> Gen # */
> + __u8 rsnident[128];
> + __u8 telemetry_dataarea[0];
> +};
> +

2019-05-14 14:09:27

by Akinobu Mita

Subject: Re: [PATCH v3 5/7] nvme-pci: add device coredump infrastructure

On Tue, May 14, 2019 at 0:23, Chaitanya Kulkarni <[email protected]> wrote:
>
> On 05/13/2019 12:46 AM, Minwoo Im wrote:
> >> +static int nvme_get_telemetry_log_blocks(struct nvme_ctrl *ctrl, void *buf,
> >> + size_t bytes, loff_t offset)
> >> +{
> >> + loff_t pos = 0;
> >> + u32 chunk_size;
> >> +
> >> + if (check_mul_overflow(ctrl->max_hw_sectors, 512u, &chunk_size))
> >> + chunk_size = UINT_MAX;
> >> +
> >> + while (pos < bytes) {
> >> + size_t size = min_t(size_t, bytes - pos, chunk_size);
> >> + int ret;
> >> +
> >> + ret = nvme_get_log(ctrl, NVME_NSID_ALL,
> >> NVME_LOG_TELEMETRY_CTRL,
> >> + 0, buf + pos, size, offset + pos);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + pos += size;
> >> + }
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static int nvme_get_telemetry_log(struct nvme_ctrl *ctrl,
> >> + struct sg_table *table, size_t bytes)
> >> +{
> >> + int n = sg_nents(table->sgl);
> >> + struct scatterlist *sg;
> >> + size_t offset = 0;
> >> + int i;
> >> +
> A little comment would be nice if you are using sg operations.
> >> + for_each_sg(table->sgl, sg, n, i) {
> >> + struct page *page = sg_page(sg);
> >> + size_t size = min_t(int, bytes - offset, sg->length);
> >> + int ret;
> >> +
> >> + ret = nvme_get_telemetry_log_blocks(ctrl,
> >> page_address(page),
> >> + size, offset);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + offset += size;
> >> + }
> >> +
> >> + return 0;
> >> +}
> >
> > Can we have those two in nvme-core module instead of being in pci module?
>
> Since they are based on the controller they should be moved next to
> nvme_get_log() in the ${KERN_DIR}/drivers/nvme/host/core.c.

OK. But these functions will be changed to use bio_vec instead of sg in
the next version.