Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp582829yba; Thu, 9 May 2019 02:43:54 -0700 (PDT) X-Google-Smtp-Source: APXvYqwnm+ZClbjqJH965MUDPXCp7QCdPqvoee6dHwUAiT0B2iNhRvRNSzhPAXrLfAZJVr8S8YoD X-Received: by 2002:a17:902:bcc6:: with SMTP id o6mr3498766pls.275.1557395034484; Thu, 09 May 2019 02:43:54 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1557395034; cv=none; d=google.com; s=arc-20160816; b=axKzYBIKUaUu9zY5i53EriteWKqqDO1vcdeLdhcms5OIdT/V6eHvjDqHiIqrH+KtZ0 5XKbfloOziBgyiHMyDwVRu/+/CAcZPiRwEWRvnjob1uyXnPnk8g7VpR5TjSrUVzR/lBE iIq1s5vkrYeEEtd6hQbD4gdnljRcc8FD4Dr/pDGuP53HVbPlmA/hFQj2Vnlf1zCiCWlb j4ilw6yupgeym5p8whnCzmcIualQEIM6ZhiyWqHZw6sgkhCwoa+/VmrIWXc0U4HlZpYi ZYdVhesiQnhjFAvmD1yejHkTNjG8MgM70c4J7XwIjIUcte0ugatmk3P6heRpL+bEjnUq ERKg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:to:references:message-id :content-transfer-encoding:cc:date:in-reply-to:from:subject :mime-version; bh=vCyQ60BE6nSCa7adNn8wGWtGspW5wwW93oaMgXkndqY=; b=AcizwSh8kZPV2Y8K5UTHnvv4pwiNa9zpkfwoKZroMUuIZJlmbuxL017FzqbEalIQC0 AJ9z0gjYur8jxUzpJlf/LWj9R0LjEIGqTtFeKIBM/D9bB4IcBaXx5dPkOgbWrffjwfGb um1Kvw6NfY2RRgyVh3XAHATBz6TNoUjofA41vV0Fyy8+kGR4IkdTRBUy7mcAxH0fvMPz eFhqGZ+riq0mg2Vos5431rBgK9Tf/lzutuYR1LjrIKDGNllbt7DUAkxVZrJidetePDp0 YB1U1u5o9XV3hLaRfDQlPQBMezJKzKOOGN3PjvaBBQPKPEFF9Qg4hvV7sKNU30oPdGfW yTtA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=canonical.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id x28si2050744pff.104.2019.05.09.02.43.38; Thu, 09 May 2019 02:43:54 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=canonical.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726546AbfEIJmk (ORCPT + 99 others); Thu, 9 May 2019 05:42:40 -0400 Received: from youngberry.canonical.com ([91.189.89.112]:38413 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725826AbfEIJmj (ORCPT ); Thu, 9 May 2019 05:42:39 -0400 Received: from mail-pf1-f199.google.com ([209.85.210.199]) by youngberry.canonical.com with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.76) (envelope-from ) id 1hOfZU-0005ub-0q for linux-kernel@vger.kernel.org; Thu, 09 May 2019 09:42:36 +0000 Received: by mail-pf1-f199.google.com with SMTP id n3so1243120pff.4 for ; Thu, 09 May 2019 02:42:35 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=vCyQ60BE6nSCa7adNn8wGWtGspW5wwW93oaMgXkndqY=; b=CaIIUrcFU6dj8gCqHgiPBtB55grJNqxZ3hT6goAHINwA42Hz0xCNK8wvZhyxg51RmO bDOsunc4MhijtKcRTaqYWFHZjR22aH2DkaHQEIewmFqhQ07YeRh0QgSbZ3ZorDx5FYlg 4+7zR8iZn+4e8w7vNSZ6ehPXYI2xvtYsMMHYaXdUqcI0FiRVsoi2iL3AdwRi037VOC2u IFku3SAEg2obRv4NbidXmY3vOvFhQi9c4kKe9AfgWDKdSRC2owVNlZkG+18Vr581SJ1B /ERmUaiciUkNtAtzNhthLzqN2p/kUPaLvg+espDiOaA+bl7cCilX6dOZWyIJ+OmKSQIp hy9w== X-Gm-Message-State: APjAAAWoCNqAHb1NM7LBZGK/CoH4yLfmvlDSU8YXDtw39o8JoSNVnDTA V2yvd1p4CvFTIbRbVWe86HPWjoMVE+bKlqFecOKO1LgkUszcFfNsdLfBGi1JvCDfd3Md/FPTS7v cqF9OG7ZaGFy7MAc5wkuGsNdyv5LyP6j8shbytMZ9iA== X-Received: by 2002:a63:10c:: with SMTP id 12mr4207300pgb.276.1557394954727; Thu, 09 May 2019 02:42:34 -0700 (PDT) X-Received: by 2002:a63:10c:: with SMTP id 12mr4207277pgb.276.1557394954398; Thu, 09 May 2019 02:42:34 -0700 (PDT) Received: from 2001-b011-380f-14b9-f0ba-4a15-3e79-97f9.dynamic-ip6.hinet.net (2001-b011-380f-14b9-f0ba-4a15-3e79-97f9.dynamic-ip6.hinet.net. [2001:b011:380f:14b9:f0ba:4a15:3e79:97f9]) by smtp.gmail.com with ESMTPSA id f87sm3062680pff.56.2019.05.09.02.42.31 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 09 May 2019 02:42:33 -0700 (PDT) Content-Type: text/plain; charset=utf-8; delsp=yes; format=flowed Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.8\)) Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle From: Kai-Heng Feng In-Reply-To: Date: Thu, 9 May 2019 17:42:30 +0800 Cc: Christoph Hellwig , Rafael Wysocki , Mario Limonciello , Keith Busch , Keith Busch , Jens Axboe , Sagi Grimberg , linux-nvme , Linux PM , LKML Content-Transfer-Encoding: 8bit Message-Id: References: <20190508185955.11406-1-kai.heng.feng@canonical.com> <20190508191624.GA8365@localhost.localdomain> <3CDA9F13-B17C-456F-8CE1-3A63C6E0DC8F@canonical.com> <20190508195159.GA1530@lst.de> <20190509061237.GA15229@lst.de> <064701C3-2BD4-4D93-891D-B7FBB5040FC4@canonical.com> To: "Rafael J. Wysocki" X-Mailer: Apple Mail (2.3445.104.8) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org at 17:07, Rafael J. Wysocki wrote: > On Thu, May 9, 2019 at 8:49 AM Kai-Heng Feng > wrote: >> Cc Rafael and linux-pm > > I would have been much more useful to CC the patch to linux-pm at > least from the outset. > >> at 14:12, Christoph Hellwig wrote: >> >>> On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com >>> wrote: >>>> You might think this would be adding runtime_suspend/runtime_resume >>>> callbacks, but those also get called actually at runtime which is not >>>> the goal here. At runtime, these types of disks should rely on APST >>>> which >>>> should calculate the appropriate latencies around the different power >>>> states. >>>> >>>> This code path is only applicable in the suspend to idle state, which >>>> /does/ >>>> call suspend/resume functions associated with dev_pm_ops. There isn't >>>> a dedicated function in there for use only in suspend to idle, which is >>>> why pm_suspend_via_s2idle() needs to get called. >>> >>> The problem is that it also gets called for others paths: >>> >>> #ifdef CONFIG_PM_SLEEP >>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ >>> .suspend = suspend_fn, \ >>> .resume = resume_fn, \ >>> .freeze = suspend_fn, \ >>> .thaw = resume_fn, \ >>> .poweroff = suspend_fn, \ >>> .restore = resume_fn, >>> #else >>> else >>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) >>> #endif >>> >>> #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \ >>> const struct dev_pm_ops name = { \ >>> SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \ >>> } >>> >>> And at least for poweroff this new code seems completely wrong, even >>> for freeze it looks rather borderline. >> >> Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so >> the old code path will be taken. >> >>> And more to the points - if these "modern MS standby" systems are >>> becoming common, which it looks they are, we need support in the PM core >>> for those instead of working around the decisions in low-level drivers. >> >> Rafael, what do you think about this? > > The difference between suspend-to-idle and suspend-to-RAM (S3) boils > down to the fact that at the end of S3 suspend all control of the > system is passed to the platform firmware. Then, the firmware can > take care of some things that may not be taken care of by drivers (it > sometimes assumes that drivers will not take care of those things even > which is generally problematic). > > For suspend-to-idle the final physical state of the system should (in > theory) be the same as the deepest possible physical idle state of it > achievable through working-state PM (combination of PM-runtime and > cpuidle, roughly speaking). However, in practice the working-state PM > may not even be able to get there, as it generally requires many > things to happen exactly at the right time in a specific order and the > probability of that in the working-state PM situation is practically > 0. Suspend-to-idle helps here by quiescing everything in an ordered > fashion which makes all of the requisite conditions more likely to be > met together. > > So yes, from an individual driver perspective, the device handling for > s2idle should be more like for PM-runtime than for S3 (s2R), but this > really shouldn't matter (and it doesn't matter for the vast majority > of drivers). > > Unfortunately, the "modern MS standby" concept makes it matter, > because "modern MS standby" causes system-wide transitions to be > "special" and it appears to expect drivers to take care of the "extra > bit" that would have been taken care of by the platform firmware in > the S3 case. [Note that in the Windows world the "modern MS standby" > systems don't support S3 ("modern MS standby" and S3 support are > mutually exclusive in Windows AFAICS) while Linux needs to support S3 > and is expected to achieve the minimum power state through s2idle > (generally, even on the same platform) at the same time.] > >> Including this patch, there are five drivers that use >> pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3. > > Well, that is not a large number relative to the total number of > drivers in Linux. That’s right, but I think we are going to see more of similar cases. > >> So I think maybe it’s time to introduce a new suspend callback for S2I? > > That would be a set of 6 new suspend and resume callbacks, mind you, > and there's quite a few of them already. And the majority of drivers > would not need to use them anyway. I think suspend_to_idle() and resume_from_idle() should be enough? What are other 4 callbacks? > > Also, please note that, possibly apart from the device power state > setting, the S2I and S2R handling really aren't that different at all. > You basically need to carry out the same preparations during suspend > and reverse them during resume in both cases. But for this case, it’s quite different to the original suspend and resume callbacks. > > That said I admit that there are cases in which device drivers need to > know that the system-wide transition under way is into s2idle and so > they should do extra stuff. If pm_suspend_via_firmware() is not > sufficient for that, then I'm open to other suggestions, but > introducing a new set of callbacks for that alone would be rather > excessive IMO. From drivers’ perspective nothing changed, as PM core can prioritize suspend_to_idle() over suspend() when it’s actually S2I. > >>>> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and >>>> freeze (hibernate), so to avoid any changes to the hibernate case it >>>> seems >>>> to me that there needs to be a new nvme_freeze() that calls into the >>>> existing >>>> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into >>>> the >>>> existing nvme_reset_ctrl for the thaw pm op. >>> >>> At least, yes. >> >> Hibernation should remain the same as stated above. > > Depending on what check is used in that code path. > pm_suspend_via_s2idle() will return "true" in the hibernation path > too, for one. You are right, I should use !pm_suspend_via_firmware() instead. > >>>>> enterprise class NVMe devices >>>>> that don't do APST and don't really do different power states at >>>>> all in many cases. >>>> >>>> Enterprise class NVMe devices that don't do APST - do they typically >>>> have a non-zero value for ndev->ctrl.npss? >>>> >>>> If not, they wouldn't enter this new codepath even if the server entered >>>> into S2I. >>> >>> No, devices that do set NPSS will have at least some power states >>> per definition, although they might not be too useful. I suspect >>> checking >>> APSTA might be safer, but if we don't want to rely on APST we should >>> check for a power state supporting the condition that the MS document >>> quoted in the original document supports. >> >> If Modern Standby or Connected Standby is not supported by servers, I >> don’t >> think the design documents mean much here. >> We probably should check if the platform firmware really supports S2I >> instead. > > S2I is expected to work regardless of the platform firmware and there > is nothing like "platform firmware support for S2I". IOW, that check > would always return "false". > > What you really need to know is if the given particular transition is S2I. Maybe a helper based on FADT flag and _DSM can do this thing? Kai-Heng