From: Pavel Tatashin <pasha.tatashin@oracle.com>
To: pasha.tatashin@oracle.com, steven.sistare@oracle.com,
    daniel.m.jordan@oracle.com, linux-kernel@vger.kernel.org,
    jeffrey.t.kirsher@intel.com, intel-wired-lan@lists.osuosl.org,
    netdev@vger.kernel.org, gregkh@linuxfoundation.org
Subject: [PATCH 2/2] drivers core: multi-threading device shutdown
Date: Wed, 2 May 2018 23:59:31 -0400
Message-Id: <20180503035931.22439-3-pasha.tatashin@oracle.com>
X-Mailer: git-send-email 2.17.0
In-Reply-To: <20180503035931.22439-1-pasha.tatashin@oracle.com>
References: <20180503035931.22439-1-pasha.tatashin@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

When a system is rebooted, halted, or kexeced, device_shutdown() is
called. This function shuts down every single device by calling either:

	dev->bus->shutdown(dev)
	dev->driver->shutdown(dev)

Even on a machine with only a moderate number of devices,
device_shutdown() may take multiple seconds to complete, because many
devices require specific delays to perform this operation.
Here is a sample analysis of the time it takes to call device_shutdown()
on a two-socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz machine:

	device_shutdown		2.95s
	  mlx4_shutdown		1.14s
	  megasas_shutdown	0.24s
	  ixgbe_shutdown	0.37s x 4 (four ixgbe devices on my machine)
	  the rest		0.09s

In mlx4 we spend the most time, but that is because there is a one-second
sleep:

	mlx4_shutdown
	  mlx4_unload_one
	    mlx4_free_ownership
	      msleep(1000)

With megasas we spend a quarter of a second, but sometimes longer (up to
0.5s), in this path:

	megasas_shutdown
	  megasas_flush_cache
	    megasas_issue_blocked_cmd
	      wait_event_timeout

Finally, ixgbe_shutdown() takes 0.37s for each device, but that time is
spread all over the place, with the bigger offenders being:

	ixgbe_shutdown
	  __ixgbe_shutdown
	    ixgbe_close_suspend
	      ixgbe_down
		ixgbe_init_hw_generic
		  ixgbe_reset_hw_X540
		    msleep(100);			0.104483472
		    ixgbe_get_san_mac_addr_generic	0.048414851
		    ixgbe_get_wwn_prefix_generic	0.048409893
		  ixgbe_start_hw_X540
		    ixgbe_start_hw_generic
		      ixgbe_clear_hw_cntrs_generic	0.048581502
		      ixgbe_setup_fc_generic		0.024225800

All the ixgbe_*generic functions end up calling:

	ixgbe_read_eerd_X540()
	  ixgbe_acquire_swfw_sync_X540
	    usleep_range(5000, 6000);
	  ixgbe_release_swfw_sync_X540
	    usleep_range(5000, 6000);

While these are short sleeps, they end up being called over 24 times:
24 * 0.0055s = 0.132s, adding up to 0.528s for the four devices. While we
should keep optimizing the individual device drivers, in some cases this
is simply a hardware property that forces a specific delay, and we must
wait.

So, the solution for this problem is to shut down devices in parallel.
However, we must shut down children before shutting down parents, so a
parent device must wait for its children to finish.
With this patch, on the same machine device_shutdown() takes 1.142s, and
without mlx4's one-second delay only 0.38s.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/core.c | 238 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 189 insertions(+), 49 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816eb887..f370369a303b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -25,6 +25,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <linux/kthread.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -2102,6 +2103,59 @@ const char *device_get_devnode(struct device *dev,
 	return *tmp = s;
 }
 
+/**
+ * device_children_count - device children count
+ * @parent: parent struct device.
+ *
+ * Returns the number of children for this device, or 0 if none.
+ */
+static int device_children_count(struct device *parent)
+{
+	struct klist_iter i;
+	int children = 0;
+
+	if (!parent->p)
+		return 0;
+
+	klist_iter_init(&parent->p->klist_children, &i);
+	while (next_device(&i))
+		children++;
+	klist_iter_exit(&i);
+
+	return children;
+}
+
+/**
+ * device_get_child_by_index - Return the child at the provided index.
+ * @parent: parent struct device.
+ * @index: Index of the child, where 0 is the first child in the children
+ *	   list, and so on.
+ *
+ * Returns the child, or NULL if a child with this index is not present.
+ */
+static struct device *
+device_get_child_by_index(struct device *parent, int index)
+{
+	struct klist_iter i;
+	struct device *dev = NULL, *d;
+	int child_index = 0;
+
+	if (!parent->p || index < 0)
+		return NULL;
+
+	klist_iter_init(&parent->p->klist_children, &i);
+	while ((d = next_device(&i)) != NULL) {
+		if (child_index == index) {
+			dev = d;
+			break;
+		}
+		child_index++;
+	}
+	klist_iter_exit(&i);
+
+	return dev;
+}
+
 /**
  * device_for_each_child - device child iterator.
  * @parent: parent struct device.
@@ -2765,71 +2819,157 @@ int device_move(struct device *dev, struct device *new_parent,
 }
 EXPORT_SYMBOL_GPL(device_move);
 
+/*
+ * device_shutdown_one - call ->shutdown() for the device passed as
+ * an argument.
+ */
+static void device_shutdown_one(struct device *dev)
+{
+	/* Don't allow any more runtime suspends */
+	pm_runtime_get_noresume(dev);
+	pm_runtime_barrier(dev);
+
+	if (dev->class && dev->class->shutdown_pre) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown_pre\n");
+		dev->class->shutdown_pre(dev);
+	}
+	if (dev->bus && dev->bus->shutdown) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown\n");
+		dev->bus->shutdown(dev);
+	} else if (dev->driver && dev->driver->shutdown) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown\n");
+		dev->driver->shutdown(dev);
+	}
+
+	/* Release the device lock, and decrement the reference counter */
+	device_unlock(dev);
+	put_device(dev);
+}
+
+static DECLARE_COMPLETION(device_root_tasks_complete);
+static void device_shutdown_tree(struct device *dev);
+static atomic_t device_root_tasks;
+
+/*
+ * Passed as an argument to device_shutdown_task().
+ * child_next_index	the next available child index.
+ * tasks_running	number of tasks still running. Each task decrements
+ *			it when its job is finished, and the last task
+ *			signals that the whole job is complete.
+ * complete		used to signal job completion.
+ * parent		parent device.
+ */
+struct device_shutdown_task_data {
+	atomic_t		child_next_index;
+	atomic_t		tasks_running;
+	struct completion	complete;
+	struct device		*parent;
+};
+
+static int device_shutdown_task(void *data)
+{
+	struct device_shutdown_task_data *tdata =
+		(struct device_shutdown_task_data *)data;
+	int child_idx = atomic_inc_return(&tdata->child_next_index) - 1;
+	struct device *dev = device_get_child_by_index(tdata->parent,
+						       child_idx);
+
+	if (dev)
+		device_shutdown_tree(dev);
+	if (atomic_dec_return(&tdata->tasks_running) == 0)
+		complete(&tdata->complete);
+	return 0;
+}
+
+/*
+ * Shut down the device tree rooted at dev. If dev has no children, simply
+ * shut down only this device. If dev has children, recursively shut down
+ * the children first, and only then the parent. For performance reasons
+ * the children are shut down in parallel using kernel threads.
+ */
+static void device_shutdown_tree(struct device *dev)
+{
+	int children_count = device_children_count(dev);
+
+	if (children_count) {
+		struct device_shutdown_task_data tdata;
+		int i;
+
+		init_completion(&tdata.complete);
+		atomic_set(&tdata.child_next_index, 0);
+		atomic_set(&tdata.tasks_running, children_count);
+		tdata.parent = dev;
+
+		for (i = 0; i < children_count; i++) {
+			kthread_run(device_shutdown_task,
+				    &tdata, "device_shutdown.%s",
+				    dev_name(dev));
+		}
+		wait_for_completion(&tdata.complete);
+	}
+	device_shutdown_one(dev);
+}
+
+/*
+ * On shutdown, each root device (one that does not have a parent) goes
+ * through this function.
+ */
+static int
+device_shutdown_root_task(void *data)
+{
+	struct device *dev = (struct device *)data;
+
+	device_shutdown_tree(dev);
+	if (atomic_dec_return(&device_root_tasks) == 0)
+		complete(&device_root_tasks_complete);
+	return 0;
+}
+
 /**
  * device_shutdown - call ->shutdown() on each device to shutdown.
 */
 void device_shutdown(void)
 {
-	struct device *dev, *parent;
+	struct list_head *pos, *next;
+	int root_devices = 0;
+	struct device *dev;
 
 	spin_lock(&devices_kset->list_lock);
 	/*
-	 * Walk the devices list backward, shutting down each in turn.
-	 * Beware that device unplug events may also start pulling
-	 * devices offline, even as the system is shutting down.
+	 * Prepare devices for shutdown: lock each device, and increment its
+	 * reference count. Remove child devices from the list, and count the
+	 * number of root devices.
 	 */
-	while (!list_empty(&devices_kset->list)) {
-		dev = list_entry(devices_kset->list.prev, struct device,
-				kobj.entry);
+	list_for_each_safe(pos, next, &devices_kset->list) {
+		dev = list_entry(pos, struct device, kobj.entry);
 
-		/*
-		 * hold reference count of device's parent to
-		 * prevent it from being freed because parent's
-		 * lock is to be held
-		 */
-		parent = get_device(dev->parent);
 		get_device(dev);
-		/*
-		 * Make sure the device is off the kset list, in the
-		 * event that dev->*->shutdown() doesn't remove it.
-		 */
-		list_del_init(&dev->kobj.entry);
-		spin_unlock(&devices_kset->list_lock);
-
-		/* hold lock to avoid race with probe/release */
-		if (parent)
-			device_lock(parent);
 		device_lock(dev);
-
-		/* Don't allow any more runtime suspends */
-		pm_runtime_get_noresume(dev);
-		pm_runtime_barrier(dev);
-
-		if (dev->class && dev->class->shutdown_pre) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown_pre\n");
-			dev->class->shutdown_pre(dev);
-		}
-		if (dev->bus && dev->bus->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->bus->shutdown(dev);
-		} else if (dev->driver && dev->driver->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->driver->shutdown(dev);
-		}
-
-		device_unlock(dev);
-		if (parent)
-			device_unlock(parent);
-
-		put_device(dev);
-		put_device(parent);
-
+		if (!dev->parent)
+			root_devices++;
+		else
+			list_del_init(&dev->kobj.entry);
+	}
+	atomic_set(&device_root_tasks, root_devices);
+	/*
+	 * Shut down the root devices in parallel. The children are going to
+	 * be shut down first.
+	 */
+	list_for_each_safe(pos, next, &devices_kset->list) {
+		dev = list_entry(pos, struct device, kobj.entry);
+		list_del_init(&dev->kobj.entry);
+		spin_unlock(&devices_kset->list_lock);
+		kthread_run(device_shutdown_root_task,
+			    dev, "device_root_shutdown.%s",
+			    dev_name(dev));
 		spin_lock(&devices_kset->list_lock);
 	}
 	spin_unlock(&devices_kset->list_lock);
+	wait_for_completion(&device_root_tasks_complete);
 }
 
 /*
-- 
2.17.0