To: Alan Stern, Evan Green
Cc: Greg Kroah-Hartman, Mathias Nyman, Rajat Jain, Thomas Gleixner, Bjorn Helgaas, "Rafael J. Wysocki", Youngjin Jang, LKML, linux-usb@vger.kernel.org
References: <20220407115918.1.I8226c7fdae88329ef70957b96a39b346c69a914e@changeid>
From: Mathias Nyman
Subject: Re: [PATCH] USB: hcd-pci: Fully suspend across freeze/thaw cycle
Message-ID: <022a50ac-7866-2140-1b40-776255f3a036@linux.intel.com>
Date: Mon, 11 Apr 2022 13:43:15 +0300
X-Mailing-List: linux-kernel@vger.kernel.org

Hi

On 9.4.2022 4.58, Alan Stern wrote:
> On Fri, Apr 08, 2022 at 02:52:30PM -0700, Evan Green wrote:
>> Hi Alan,
>
> Hello.
>
>> On Fri, Apr 8, 2022 at 7:29 AM Alan Stern wrote:
>>>
>>> On Thu, Apr 07, 2022 at 11:59:55AM -0700, Evan Green wrote:
>>>> The documentation for the freeze() method says that it "should quiesce
>>>> the device so that it doesn't generate IRQs or DMA". The unspoken
>>>> consequence of not doing this is that MSIs aimed at non-boot CPUs may
>>>> get fully lost if they're sent during the period where the target CPU is
>>>> offline.
>>>>
>>>> The current callbacks for USB HCD do not fully quiesce interrupts,
>>>> specifically on XHCI. Change to use the full suspend/resume flow for
>>>> freeze/thaw to ensure interrupts are fully quiesced. This fixes issues
>>>> where USB devices fail to thaw during hibernation because XHCI misses
>>>> its interrupt and fails to recover.
>>>
>>> I don't think your interpretation is quite right. The problem doesn't lie
>>> in the HCD callbacks but rather in the root-hub callbacks.
>>>
>>> Correct me if I'm wrong about xHCI, but AFAIK the host controller doesn't
>>> issue any interrupt requests on its own behalf; it issues IRQs only on
>>> behalf of its root hubs. Given that the root hubs should be suspended
>>> (i.e., frozen) at this point, and hence not running, the only IRQs they
>>> might make would be for wakeup requests.
>>>
>>> So during freeze, wakeups should be disabled on root hubs. Currently I
>>> believe we don't do this; if a root hub was already runtime suspended when
>>> asked to go into freeze, its wakeup setting will remain unchanged. _That_

In the xHCI case, freeze will suspend the roothub and make sure all
connected devices are in the suspended U3 state, but it won't prevent
interrupts.

And yes, my understanding is also that if devices were runtime suspended
with wake enabled before freeze, then devices can initiate resume at any
time in the first stages of hibernate (freeze-thaw), causing an interrupt.

We can reduce interrupts by disabling device wake in freeze, but any port
change can still cause interrupts.

>>
>> For my issue at least, it's the opposite. Enabling runtime pm on the
>> controller significantly reduces the repro rate of the lost interrupt.
>
> That doesn't seem to make sense. If the controller is in runtime suspend at
> the start of hibernation, the pci_pm_freeze() routine will do a runtime
> resume before calling the HCD freeze function. So when the controller gets
> put into the freeze state, it is guaranteed not to be runtime suspended
> regardless of what you enable.
>
>> I think having the controller runtime suspended reduces the overall
>> number of interrupts that flow in, which is why my chances to hit an
>> interrupt in this window drop, but aren't fully eliminated.
>
> When you ran your tests, was wakeup enabled for the host controller?
>
>> I think xhci may still find reasons to generate interrupts even if all
>> of its root hub ports are suspended without wake events. For example,
>> won't Port Status Change Events still come in if a device is unplugged
>> or overcurrents in between freeze() and thaw()?

Yes, as long as the host is running, and the host is running between
freeze and thaw.

>
> I'm not an expert on xHCI or xhci-hcd. For that, we should ask the xhci-hcd
> maintainer (CC'ed). In fact, he should have been CC'ed on the original
> patch since it was meant to fix a problem involving xHCI controllers.
>
> With EHCI, for example, if a port status change event occurs while the root
> hub is suspended with wakeups disabled, no interrupt request will be
> generated because the port-specific WKOC_E, WKDSCNNT_E, and WKCNNT_E (Wake
> on Over-Current Enable, Wake on Disconnect Enable, and Wake on Connect
> Enable) bits are turned off. In effect, the port-status change events can
> occur but they aren't treated as wakeup events.

The port-specific wake flags in xHCI only affect interrupt and wake
generation for a suspended host.
In the freeze() to thaw() stage the host is running, so the flags don't
have any effect.

>
>> The spec does mention
>> that generation of this event is gated by the HCHalted flag, but at
>> least in my digging around I couldn't find a place where we halt the
>> controller through this path.
>
> Bear in mind that suspending the controller and suspending the root hub are
> two different things.
>
>> With how fragile xhci (and maybe
>> others?) are towards lost interrupts, even if it does happen to be
>> perfect now, it seems like it would be more resilient to just fully
>> suspend the controller across this transition.
>
> Suspending the controller won't fix the problem if the wakeup settings for
> the root hubs are wrong (although it may reduce the window for a race, like
> what you mentioned above). Conversely, if the wakeup settings for the root
> hubs are correct then suspending the controller shouldn't make any
> difference.
>
>> I'd also put forward the hypothesis (feel free to shoot it down!) that
>> unless there's a human-scale time penalty with this change, the
>> downsides of being more heavy handed like this across freeze/thaw are
>> minimal. There's always a thaw() right on the heels of freeze(), and
>> hibernation is such a rare and jarring transition that being able to
>> recover after the transition is more important than accomplishing the
>> transition quickly.
>
> That's true, but it ignores the underlying problem described in the
> preceding paragraphs.
>

Would it make sense to prevent xHCI interrupt generation in the host
freeze() stage, clearing the xHCI EINT bit in addition to calling
check_roothub_suspend()? Then enable it back in thaw().

Thanks
-Mathias