2024-02-07 18:47:05

by David Jeffery

[permalink] [raw]
Subject: [RFC PATCH 0/6] async device shutdown support

This is another attempt to implement an acceptable implementation of async
device shutdown, inspired by a previous attempt by Tanjore Suresh. For
systems with many disks, async shutdown can greatly reduce shutdown times
from having slow operations run in parallel. The older patches were rejected,
with this new implementation attempting to fix my understanding of the flaws
in the older patches.

Using similar interfaces and building off the ideas of the older patches,
this patchset creates an async shutdown implementation which follows the
basic ordering of the shutdown list, ensuring the shutdown of any children
devices complete whether synchronous or asynchronous before performing
shutdown on a parent device.

In addition to an implementation for asynchronous pci nvme shutdown, this
patchset also adds support for async shutdown of sd devices for cache flush.
As an example of the effects of the patch, one system with a large amount of
disks went from over 30 seconds to shut down to less than 5.

The specific driver changes are the roughest part of this patchset. But the
acceptability of the core async functionality is critical. Any feedback on
flaws or improvements is appreciated.

David Jeffery (6):
minimal async shutdown infrastructure
Improve ability to perform async shutdown in parallel
pci bus async shutdown support
pci nvme async shutdown support
scsi mid layer support for async command submit
sd: async cache flush on shutdown

drivers/base/base.h | 1 +
drivers/base/core.c | 149 +++++++++++++++++++++++++++++++++-
drivers/nvme/host/core.c | 26 ++++--
drivers/nvme/host/nvme.h | 2 +
drivers/nvme/host/pci.c | 53 +++++++++++-
drivers/pci/pci-driver.c | 24 +++++-
drivers/scsi/scsi_lib.c | 138 ++++++++++++++++++++++++-------
drivers/scsi/sd.c | 66 +++++++++++++--
drivers/scsi/sd.h | 2 +
include/linux/device/bus.h | 8 +-
include/linux/device/driver.h | 7 ++
include/linux/pci.h | 4 +
include/scsi/scsi_device.h | 8 ++
13 files changed, 439 insertions(+), 49 deletions(-)

--
2.43.0



2024-02-07 18:47:26

by David Jeffery

[permalink] [raw]
Subject: [RFC PATCH 3/6] pci bus async shutdown support

Add async shutdown shutdown fields and Convert pci's shutdown logic into
async shutdown calls so that individual pci device drivers can implement
async shutdown.

Signed-off-by: David Jeffery <[email protected]>
Tested-by: Laurence Oberman <[email protected]>

---
drivers/pci/pci-driver.c | 24 ++++++++++++++++++++++--
include/linux/pci.h | 4 ++++
2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 51ec9e7e784f..0ad418905115 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -502,16 +502,28 @@ static void pci_device_remove(struct device *dev)
pci_dev_put(pci_dev);
}

-static void pci_device_shutdown(struct device *dev)
+static void pci_device_async_shutdown_start(struct device *dev)
{
struct pci_dev *pci_dev = to_pci_dev(dev);
struct pci_driver *drv = pci_dev->driver;

pm_runtime_resume(dev);

- if (drv && drv->shutdown)
+ if (drv && drv->async_shutdown_start)
+ drv->async_shutdown_start(pci_dev);
+ else if (drv && drv->shutdown)
drv->shutdown(pci_dev);

+}
+
+static void pci_device_async_shutdown_end(struct device *dev)
+{
+ struct pci_dev *pci_dev = to_pci_dev(dev);
+ struct pci_driver *drv = pci_dev->driver;
+
+ if (drv && drv->async_shutdown_end)
+ drv->async_shutdown_end(pci_dev);
+
/*
* If this is a kexec reboot, turn off Bus Master bit on the
* device to tell it to not continue to do DMA. Don't touch
@@ -523,6 +535,12 @@ static void pci_device_shutdown(struct device *dev)
pci_clear_master(pci_dev);
}

+static void pci_device_shutdown(struct device *dev)
+{
+ pci_device_async_shutdown_start(dev);
+ pci_device_async_shutdown_end(dev);
+}
+
#ifdef CONFIG_PM_SLEEP

/* Auxiliary functions used for system resume */
@@ -1682,6 +1700,8 @@ struct bus_type pci_bus_type = {
.probe = pci_device_probe,
.remove = pci_device_remove,
.shutdown = pci_device_shutdown,
+ .async_shutdown_start = pci_device_async_shutdown_start,
+ .async_shutdown_end = pci_device_async_shutdown_end,
.dev_groups = pci_dev_groups,
.bus_groups = pci_bus_groups,
.drv_groups = pci_drv_groups,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index add9368e6314..6f61325c956a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -917,6 +917,8 @@ struct module;
* Useful for enabling wake-on-lan (NIC) or changing
* the power state of a device before reboot.
* e.g. drivers/net/e100.c.
+ * @async_shutdown_start: This starts the asynchronous shutdown
+ * @async_shutdown_end: This completes the started asynchronous shutdown
* @sriov_configure: Optional driver callback to allow configuration of
* number of VFs to enable via sysfs "sriov_numvfs" file.
* @sriov_set_msix_vec_count: PF Driver callback to change number of MSI-X
@@ -947,6 +949,8 @@ struct pci_driver {
int (*suspend)(struct pci_dev *dev, pm_message_t state); /* Device suspended */
int (*resume)(struct pci_dev *dev); /* Device woken up */
void (*shutdown)(struct pci_dev *dev);
+ void (*async_shutdown_start)(struct pci_dev *dev);
+ void (*async_shutdown_end)(struct pci_dev *dev);
int (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
int (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
u32 (*sriov_get_vf_total_msix)(struct pci_dev *pf);
--
2.43.0


2024-02-07 18:48:35

by David Jeffery

[permalink] [raw]
Subject: [RFC PATCH 5/6] scsi mid layer support for async command submit

Create scsi_execute_cmd_nowait to allow asynchronous scsi command submit.
Parts of the code originally in scsi_execute_cmd are shifted into helper
functions used by both scsi_execute_cmd and the new scsi_execute_cmd_nowait.

The scsi_exec_args struct is expanded to contain the fields needed for
the completion and callback for the async path.

Signed-off-by: David Jeffery <[email protected]>
Tested-by: Laurence Oberman <[email protected]>

---
drivers/scsi/scsi_lib.c | 138 +++++++++++++++++++++++++++++--------
include/scsi/scsi_device.h | 8 +++
2 files changed, 118 insertions(+), 28 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 1fb80eae9a63..fe35bc2021e3 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -185,42 +185,37 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
}

/**
- * scsi_execute_cmd - insert request and wait for the result
+ * scsi_execute_init - helper for setting up a scsi_cmnd in a request
* @sdev: scsi_device
* @cmd: scsi command
* @opf: block layer request cmd_flags
- * @buffer: data buffer
- * @bufflen: len of buffer
* @timeout: request timeout in HZ
* @retries: number of times to retry request
- * @args: Optional args. See struct definition for field descriptions
+ * @args: scsi command args
*
- * Returns the scsi_cmnd result field if a command was executed, or a negative
- * Linux error code if we didn't get that far.
+ * Returns a request if successful, or an error pointer if there was a failure.
*/
-int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
- blk_opf_t opf, void *buffer, unsigned int bufflen,
- int timeout, int retries,
- const struct scsi_exec_args *args)
+static struct request *scsi_execute_init(struct scsi_device *sdev,
+ const unsigned char *cmd,
+ blk_opf_t opf,
+ int timeout, int retries,
+ struct scsi_exec_args *args)
{
- static const struct scsi_exec_args default_args;
struct request *req;
struct scsi_cmnd *scmd;
int ret;

- if (!args)
- args = &default_args;
- else if (WARN_ON_ONCE(args->sense &&
- args->sense_len != SCSI_SENSE_BUFFERSIZE))
- return -EINVAL;
+ if (WARN_ON_ONCE(args->sense &&
+ args->sense_len != SCSI_SENSE_BUFFERSIZE))
+ return ERR_PTR(-EINVAL);

req = scsi_alloc_request(sdev->request_queue, opf, args->req_flags);
if (IS_ERR(req))
- return PTR_ERR(req);
+ return req;

- if (bufflen) {
- ret = blk_rq_map_kern(sdev->request_queue, req,
- buffer, bufflen, GFP_NOIO);
+ if (args->bufflen) {
+ ret = blk_rq_map_kern(sdev->request_queue, req, args->buffer,
+ args->bufflen, GFP_NOIO);
if (ret)
goto out;
}
@@ -232,19 +227,27 @@ int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
req->timeout = timeout;
req->rq_flags |= RQF_QUIET;

- /*
- * head injection *required* here otherwise quiesce won't work
- */
- blk_execute_rq(req, true);
+ return req;
+out:
+ blk_mq_free_request(req);
+ return ERR_PTR(ret);
+}
+
+static int scsi_execute_uninit(struct request *req,
+ struct scsi_exec_args *args)
+{
+ struct scsi_cmnd *scmd;

+ scmd = blk_mq_rq_to_pdu(req);
/*
* Some devices (USB mass-storage in particular) may transfer
* garbage data together with a residue indicating that the data
* is invalid. Prevent the garbage from being misinterpreted
* and prevent security leaks by zeroing out the excess data.
*/
- if (unlikely(scmd->resid_len > 0 && scmd->resid_len <= bufflen))
- memset(buffer + bufflen - scmd->resid_len, 0, scmd->resid_len);
+ if (unlikely(scmd->resid_len > 0 && scmd->resid_len <= args->bufflen))
+ memset(args->buffer + args->bufflen - scmd->resid_len, 0,
+ scmd->resid_len);

if (args->resid)
*args->resid = scmd->resid_len;
@@ -254,14 +257,93 @@ int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
scsi_normalize_sense(scmd->sense_buffer, scmd->sense_len,
args->sshdr);

- ret = scmd->result;
- out:
+ args->result = scmd->result;
+
+ if (args->callback)
+ args->callback(scmd, args);
+
+ return scmd->result;
+}
+
+/**
+ * scsi_execute_cmd - insert request and wait for the result
+ * @sdev: scsi_device
+ * @cmd: scsi command
+ * @opf: block layer request cmd_flags
+ * @buffer: data buffer
+ * @bufflen: len of buffer
+ * @timeout: request timeout in HZ
+ * @retries: number of times to retry request
+ * @args: Optional args. See struct definition for field descriptions
+ *
+ * Returns the scsi_cmnd result field if a command was executed, or a negative
+ * Linux error code if we didn't get that far.
+ */
+int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
+ blk_opf_t opf, void *buffer, unsigned int bufflen,
+ int timeout, int retries,
+ const struct scsi_exec_args *const_args)
+{
+ struct scsi_exec_args args;
+ int ret;
+ struct request *req;
+
+ if (!const_args)
+ memset(&args, 0, sizeof(struct scsi_exec_args));
+ else
+ args = *const_args;
+
+ args.buffer = buffer;
+ args.bufflen = bufflen;
+
+ req = scsi_execute_init(sdev, cmd, opf, timeout, retries, &args);
+
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ /*
+ * head injection *required* here otherwise quiesce won't work
+ */
+ blk_execute_rq(req, true);
+
+ ret = scsi_execute_uninit(req, &args);
+
blk_mq_free_request(req);

return ret;
}
EXPORT_SYMBOL(scsi_execute_cmd);

+
+static enum rq_end_io_ret scsi_execute_cmd_complete(struct request *req,
+ blk_status_t ret)
+{
+ struct scsi_exec_args *args = req->end_io_data;
+
+ scsi_execute_uninit(req, args);
+ return RQ_END_IO_FREE;
+}
+
+int scsi_execute_cmd_nowait(struct scsi_device *sdev, const unsigned char *cmd,
+ blk_opf_t opf, int timeout, int retries,
+ struct scsi_exec_args *args)
+{
+ struct request *req;
+
+ req = scsi_execute_init(sdev, cmd, opf, timeout, retries, args);
+
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+
+ req->end_io = scsi_execute_cmd_complete;
+ req->end_io_data = args;
+
+ blk_execute_rq_nowait(req, true);
+
+ return 0;
+}
+EXPORT_SYMBOL(scsi_execute_cmd_nowait);
+
/*
* Wake up the error handler if necessary. Avoid as follows that the error
* handler is not woken up if host in-flight requests number ==
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 5ec1e71a09de..c80c98b48bc1 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -497,6 +497,10 @@ struct scsi_exec_args {
blk_mq_req_flags_t req_flags; /* BLK_MQ_REQ flags */
int scmd_flags; /* SCMD flags */
int *resid; /* residual length */
+ void *buffer; /* data buffer */
+ unsigned int bufflen; /* buffer length */
+ int result; /* scsi layer result */
+ void (*callback)(struct scsi_cmnd *scmd, struct scsi_exec_args *args);
};

int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
@@ -504,6 +508,10 @@ int scsi_execute_cmd(struct scsi_device *sdev, const unsigned char *cmd,
int timeout, int retries,
const struct scsi_exec_args *args);

+int scsi_execute_cmd_nowait(struct scsi_device *sdev, const unsigned char *cmd,
+ blk_opf_t opf, int timeout, int retries,
+ struct scsi_exec_args *args);
+
extern void sdev_disable_disk_events(struct scsi_device *sdev);
extern void sdev_enable_disk_events(struct scsi_device *sdev);
extern int scsi_vpd_lun_id(struct scsi_device *, char *, size_t);
--
2.43.0


2024-02-07 20:36:38

by Jeremy Allison

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] async device shutdown support

> This is another attempt to implement an acceptable implementation of async
> device shutdown, inspired by a previous attempt by Tanjore Suresh. For
> systems with many disks, async shutdown can greatly reduce shutdown times
> from having slow operations run in parallel. The older patches were rejected,
> with this new implementation attempting to fix my understanding of the flaws
> in the older patches

Hi David,

It may have escaped your notice that I was shepherding a newer version
of Tanjore's original patchset through the nvme lists already. Please
look at version 5 here (I am working on version 6 currently).

https://lore.kernel.org/linux-nvme/[email protected]/

As your work is very similar (although has some of the same problems
that people already asked me to fix in earlier versions of the code)
maybe we can collaborate on getting a unified version of this work
reviewed.

Please take a look at the link above, and see if we can merge our efforts.

Thanks !

Jeremy Allison / CIQ.
Samba Team.

2024-02-07 20:55:28

by Jeremy Allison

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] async device shutdown support

> This is another attempt to implement an acceptable implementation of async
> device shutdown, inspired by a previous attempt by Tanjore Suresh. For
> systems with many disks, async shutdown can greatly reduce shutdown times
> from having slow operations run in parallel. The older patches were rejected,
> with this new implementation attempting to fix my understanding of the flaws
> in the older patches

Hi David,

It may have escaped your notice that I was shepherding a newer version
of Tanjore's original patchset through the nvme lists already. Please
look at version 5 here (I am working on version 6 currently).

https://lore.kernel.org/linux-nvme/[email protected]/

As your work is very similar (although has some of the same problems
that people already asked me to fix in earlier versions of the code)
maybe we can collaborate on getting a unified version of this work
reviewed.

Please take a look at the link above, and see if we can merge our efforts.

Thanks !

Jeremy Allison / CIQ.
Samba Team.