Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp2751554yba; Fri, 10 May 2019 18:34:50 -0700 (PDT) X-Google-Smtp-Source: APXvYqxJTdDaYvsaffVoAFgAlMUAazgNI4Gjt3SlZ3gxeBtz24F/MQ9AhokK985gMivTZHgfzAbn X-Received: by 2002:a17:902:8c82:: with SMTP id t2mr8390323plo.256.1557538490470; Fri, 10 May 2019 18:34:50 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1557538490; cv=none; d=google.com; s=arc-20160816; b=qsOwmrQ6wZkwyge+dfGy4kzaeGpl4B8W8cv6G90jCH/V5s6kgW59Mb+RaexaM2wkrn FG0zczMn24GFXy8zM72GpRnR59L42oQR8oiFvbwD55909bTK/auaIAEKDe/UAe6FzEup AW7OGHYM6EthGywHW+cI4sCM8IPoei6BkLAZOAWrv/2ukEtDzVS8awsvp9DB42nFZMWO F9VdHPCX0xMns6qwWpU2vXQABCjG1dd2PRWWkVwJkcYTYcHff/hJwSSFMu45fpnGhdKK cr45E64/6lBEB+VmwBGJ4kHQeZLdsQlnfJhtfm3XS1oSpRsZwFGEiSVOj+tuQhQX6/hc Hcmg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=+QlRui6mMfQBesLLPURC8MRQqc0HgvwOUKvhXzitG2c=; b=HfHjSCNn/bzkqfZMQ9TsI8bGPr3SYi0/SDq05DLymz26sE8O7/lvAuOyoQtS30P87F DF4Hnzgp1bHrIGlWBxvAW95nAxubl5dtDgQmhMuum5eu8TKhJuPAKdRjG+reUp/YQeyQ j8vu/0rQ2NNlAoIgX5kMFHvFOyx6hWqzrf5HWncuiZlwbv+ooev5Lqxar43USdvkS/+3 8QwLB0JW4unSUU+VpRU5+ZVJ0x1Ey9eSvrmRRyRDUBCh6ritBlQyHW+N1wY3cVsF8e7O azJmtx1pMh5kLEss6m+7DRSpijEOAwJ6YALNxkp9NowVrp/bUZf8nbUO8T1a9yZi74Tj FUGg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id z5si9177713pgv.146.2019.05.10.18.34.33; Fri, 10 May 2019 18:34:50 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728210AbfEKBdW (ORCPT + 99 others); Fri, 10 May 2019 21:33:22 -0400 Received: from linux.microsoft.com ([13.77.154.182]:53320 "EHLO linux.microsoft.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728064AbfEKBdW (ORCPT ); Fri, 10 May 2019 21:33:22 -0400 Received: from localhost.localdomain (unknown [131.107.160.135]) by linux.microsoft.com (Postfix) with ESMTPSA id B0EE92074A30; Fri, 10 May 2019 18:33:20 -0700 (PDT) Subject: Re: [PATCH] nvme/pci: Use host managed power state for suspend To: Keith Busch , Christoph Hellwig , Sagi Grimberg , linux-nvme@lists.infradead.org Cc: linux-pm , Mario Limonciello , lkml , Kai Heng Feng , Rafael Wysocki References: <20190510212937.11661-1-keith.busch@intel.com> From: Edmund Nadolski Message-ID: <61585196-0e16-87f7-3c58-3d11425758d8@linux.microsoft.com> Date: Fri, 10 May 2019 18:33:20 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1 MIME-Version: 1.0 In-Reply-To: <20190510212937.11661-1-keith.busch@intel.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 5/10/19 2:29 PM, Keith Busch wrote: > The nvme pci driver prepares its devices for power loss during suspend > by shutting down the controllers, and the power setting is deferred to > pci driver's power management before the platform removes power. The > suspend-to-idle mode, however, does not remove power. > > NVMe devices that implement host managed power settings can achieve > lower power and better transition latencies than using generic PCI > power settings. Try to use this feature if the platform is not involved > with the suspend. If successful, restore the previous power state on > resume. > > Cc: Mario Limonciello > Cc: Kai Heng Feng > Signed-off-by: Keith Busch > --- > Disclaimer: I've tested only on emulation faking support for the feature. > > General question: different devices potentially have divergent values > for power consumption and transition latencies. Would it be useful to > allow a user tunable setting to select the desired target power state > instead of assuming the lowest one? In general I prefer fewer knobs; but for a new setting it seems advisable not to presume that one value is appropriate for everyone (as long as they can't set an invalid value or otherwise hose things). > drivers/nvme/host/core.c | 27 ++++++++++++++++++++++++ > drivers/nvme/host/nvme.h | 2 ++ > drivers/nvme/host/pci.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 82 insertions(+) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index a6644a2c3ef7..eb3640fd8838 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -1132,6 +1132,33 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword > return ret; > } > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps) > +{ > + return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL); > +} > +EXPORT_SYMBOL_GPL(nvme_set_power); > + > +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result) > +{ > + struct nvme_command c; > + union nvme_result res; > + int ret; > + > + if (!result) > + return -EINVAL; As this is only called from one place, is this check really needed? > + > + memset(&c, 0, sizeof(c)); > + c.features.opcode = nvme_admin_get_features; > + c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT); > + > + ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res, > + NULL, 0, 0, NVME_QID_ANY, 0, 0, false); > + if (ret >= 0) > + *result = le32_to_cpu(res.u32); > + return ret; > +} > +EXPORT_SYMBOL_GPL(nvme_get_power); > + > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count) > { > u32 q_count = (*count - 1) | ((*count - 1) << 16); > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 5ee75b5ff83f..eaa571ac06d2 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -459,6 +459,8 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd, > unsigned timeout, int qid, int at_head, > blk_mq_req_flags_t flags, bool poll); > int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count); > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps); > +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result); > void nvme_stop_keep_alive(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl(struct nvme_ctrl *ctrl); > int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl); > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 3e4fb891a95a..0d5d91e5b293 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -18,6 +18,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -116,6 +117,7 @@ struct nvme_dev { > u32 cmbsz; > u32 cmbloc; > struct nvme_ctrl ctrl; > + u32 last_ps; > > mempool_t *iod_mempool; > > @@ -2828,11 +2830,59 @@ static void nvme_remove(struct pci_dev *pdev) > } > > #ifdef CONFIG_PM_SLEEP > +static int nvme_deep_state(struct nvme_dev *dev) > +{ > + struct pci_dev *pdev = to_pci_dev(dev->dev); > + int ret; > + > + /* > + * Save the current power state in case a user tool set a power policy > + * for this device. We'll restore that state on resume. > + */ > + dev->last_ps = 0; > + ret = nvme_get_power(&dev->ctrl, &dev->last_ps); Should we validate (range check) the value reported by the device? > + > + /* > + * Return the error to halt suspend if the driver either couldn't > + * submit a command or didn't see a response. > + */ > + if (ret < 0) > + return ret; > + > + ret = nvme_set_power(&dev->ctrl, dev->ctrl.npss); > + if (ret < 0) > + return ret; > + > + if (!ret) { > + /* > + * A saved state prevents pci pm from generically controlling > + * the device's power. We're using protocol specific settings > + * so we don't want pci interfering. > + */ > + pci_save_state(pdev); > + } else { > + /* > + * The drive failed the low power request. Fallback to device > + * shutdown and clear npss to force a controller reset on > + * resume. The value will be rediscovered during reset. > + */ > + dev->ctrl.npss = 0; > + nvme_dev_disable(dev, true); > + } > + return 0; > +} > + > static int nvme_suspend(struct device *dev) > { > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + /* > + * Try to use nvme if the device supports host managed power settings > + * and platform firmware is not involved. > + */ > + if (!pm_suspend_via_firmware() && ndev->ctrl.npss) > + return nvme_deep_state(ndev); > nvme_dev_disable(ndev, true); > return 0; > } > @@ -2842,6 +2892,9 @@ static int nvme_resume(struct device *dev) > struct pci_dev *pdev = to_pci_dev(dev); > struct nvme_dev *ndev = pci_get_drvdata(pdev); > > + if (!pm_resume_via_firmware() && ndev->ctrl.npss) > + if (nvme_set_power(&ndev->ctrl, ndev->last_ps) == 0) > + return 0; > nvme_reset_ctrl(&ndev->ctrl); > return 0; > } >