From: Thomas Gleixner
To: Evan Green
Cc: LKML, Rajat Jain, Linux PM, linux-pci, Bjorn Helgaas,
    Greg Kroah-Hartman, Mathias Nyman, "Rafael J. Wysocki"
Subject: Re: Lost MSIs during hibernate
Date: Tue, 05 Apr 2022 16:06:50 +0200
Message-ID: <87a6cz39qd.ffs@tglx>
X-Mailing-List: linux-kernel@vger.kernel.org

Evan!

On Mon, Apr 04 2022 at 12:11, Evan Green wrote:
> To my surprise, I'm back with another MSI problem, and hoping to get
> some advice on how to approach fixing it.

Why am I not surprised?

> What worries me is those IRQ "no longer affine" messages, as well as
> my "EVAN don't touch hw" prints, indicating that requests to change
> the MSI are being dropped. These ignored requests are coming in when
> we try to migrate all IRQs off of the non-boot CPU, and they get
> ignored because all devices are "frozen" at this point, and presumably
> not in D0.

They are disabled at that point.
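
For readers following the thread, the "Don't touch hw" prints and the
later restore can be pictured with a toy model. This is only a sketch
under assumptions, not the kernel code: ToyMsiDevice and its methods are
hypothetical names, and the model assumes that a message requested while
the device is disabled is remembered in software and written out again
at thaw_noirq time, which is what the log further down suggests. The
address/data values are taken from the 0000:00:0d.0 lines of that log
and are assumed to be hexadecimal.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Msg = Tuple[int, int]  # (MSI address, MSI data)


@dataclass
class ToyMsiDevice:
    """Hypothetical stand-in for a PCI device with a single MSI."""
    enabled: bool = True
    hw_msg: Optional[Msg] = None      # what the device registers hold
    cached_msg: Optional[Msg] = None  # what software last asked for

    def write_msi(self, msg: Msg) -> bool:
        # Always remember the request; touch "hardware" only if enabled.
        self.cached_msg = msg
        if not self.enabled:
            return False  # corresponds to a "Don't touch hw" line
        self.hw_msg = msg
        return True       # corresponds to a "Write MSI" line

    def freeze_noirq(self) -> None:
        self.enabled = False

    def thaw_noirq(self) -> None:
        # Early restore: replay the last requested message.
        self.enabled = True
        if self.cached_msg is not None:
            self.write_msi(self.cached_msg)


# Replay of the 0000:00:0d.0 sequence from the log.
dev = ToyMsiDevice()
dev.write_msi((0xFEE1E000, 0x4023))            # driver load
dev.freeze_noirq()
dropped_1 = dev.write_msi((0xFEE1E000, 0x4045))  # not written
dropped_2 = dev.write_msi((0xFEE00000, 0x4045))  # not written
stale = dev.hw_msg                             # still the old message
dev.thaw_noirq()                               # cached message lands now
print(hex(dev.hw_msg[0]), hex(dev.hw_msg[1]))
```

In this model nothing is lost: the register is stale only while the
device is disabled, and the last requested target is in place before
any resume code runs.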

> To further try and prove that theory, I wrote a script to do the
> hibernate prepare image step in a loop, but messed with XHCI's IRQ
> affinity beforehand. If I move the IRQ to core 0, so far I have never
> seen a hang. But if I move it to another core, I can usually get a
> hang in the first attempt. I also very occasionally see wifi splats
> when trying this, and those "no longer affine" prints are all the wifi
> queue IRQs. So I think a wifi packet coming in at the wrong time can
> do the same thing.
>
> I wanted to see what thoughts you might have on this. Should I try to
> make a patch that moves all IRQs to CPU 0 *before* the devices all
> freeze? Sounds a little unpleasant. Or should PCI be doing something
> different to avoid this combination of "you're not allowed to modify
> my MSIs, but I might still generate interrupts that must not be lost"?

PCI cannot do much here and moving interrupts around is papering over
the underlying problem.

  xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee1e000 4023

This sets up the interrupt when the driver is loaded.

  xhci_hcd 0000:00:14.0: EVAN Write MSI 0 fee01000 4024

Ditto.

  xhci_hcd 0000:00:0d.0: calling pci_pm_freeze+0x0/0xad @ 423, parent: pci0000:00
  xhci_hcd 0000:00:14.0: calling pci_pm_freeze+0x0/0xad @ 4644, parent: pci0000:00
  xhci_hcd 0000:00:14.0: pci_pm_freeze+0x0/0xad returned 0 after 0 usecs
  xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee1e000 4023
  xhci_hcd 0000:00:0d.0: pci_pm_freeze+0x0/0xad returned 0 after 196000 usecs

Those freeze() calls end up in xhci_suspend(), which tears down the XHCI
and ensures that no interrupts are in flight.
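
As an aside for anyone reading the address/data pairs in the EVAN lines:
they can be decoded by hand. The sketch below assumes the values are
printed in hex and use the x86 compatibility (non-remapped) MSI format,
where address bits 19:12 carry the destination APIC ID and data bits 7:0
carry the vector. decode_msi() is a hypothetical helper written for this
illustration, and the destination ID is an APIC ID, which need not equal
the Linux CPU number.

```python
def decode_msi(address: int, data: int) -> dict:
    """Split an x86 compatibility-format MSI message into its fields."""
    if (address >> 20) != 0xFEE:
        raise ValueError("not an x86 MSI address (expected 0xFEExxxxx)")
    return {
        "dest_id": (address >> 12) & 0xFF,       # address bits 19:12
        "redirection_hint": (address >> 3) & 1,  # address bit 3
        "dest_mode_logical": (address >> 2) & 1, # address bit 2
        "vector": data & 0xFF,                   # data bits 7:0
        "delivery_mode": (data >> 8) & 0x7,      # data bits 10:8
        "level_assert": (data >> 14) & 1,        # data bit 14
        "trigger_level": (data >> 15) & 1,       # data bit 15
    }


# Address/data pairs as they appear in the log (assumed hex).
samples = [
    (0xFEE1E000, 0x4023),
    (0xFEE01000, 0x4024),
    (0xFEE00000, 0x4045),
]

decoded = [decode_msi(addr, data) for addr, data in samples]
for (addr, data), fields in zip(samples, decoded):
    print(f"{addr:08x} {data:04x} -> dest_id={fields['dest_id']:#x} "
          f"vector={fields['vector']:#x}")
```

Under these assumptions an address of fee00000 decodes to destination
ID 0, while fee1e000 and fee01000 point at other destinations.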

  xhci_hcd 0000:00:0d.0: calling pci_pm_freeze_noirq+0x0/0xb2 @ 4645, parent: pci0000:00
  xhci_hcd 0000:00:0d.0: pci_pm_freeze_noirq+0x0/0xb2 returned 0 after 30 usecs
  xhci_hcd 0000:00:14.0: calling pci_pm_freeze_noirq+0x0/0xb2 @ 4644, parent: pci0000:00
  xhci_hcd 0000:00:14.0: pci_pm_freeze_noirq+0x0/0xb2 returned 0 after 3118 usecs

Now the devices are disabled and not accessible.

  xhci_hcd 0000:00:14.0: EVAN Don't touch hw 0 fee00000 4024
  xhci_hcd 0000:00:0d.0: EVAN Don't touch hw 0 fee1e000 4045
  xhci_hcd 0000:00:0d.0: EVAN Don't touch hw 0 fee00000 4045
  xhci_hcd 0000:00:14.0: calling pci_pm_thaw_noirq+0x0/0x70 @ 9, parent: pci0000:00
  xhci_hcd 0000:00:14.0: EVAN Write MSI 0 fee00000 4024

This is the early restore _before_ the XHCI resume code is called.
This interrupt is targeted at CPU0 (it's the one which could not be
written above).

  xhci_hcd 0000:00:14.0: pci_pm_thaw_noirq+0x0/0x70 returned 0 after 5272 usecs
  xhci_hcd 0000:00:0d.0: calling pci_pm_thaw_noirq+0x0/0x70 @ 1123, parent: pci0000:00
  xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee00000 4045

Ditto.

  xhci_hcd 0000:00:0d.0: pci_pm_thaw_noirq+0x0/0x70 returned 0 after 623 usecs
  xhci_hcd 0000:00:14.0: calling pci_pm_thaw+0x0/0x7c @ 3856, parent: pci0000:00
  xhci_hcd 0000:00:14.0: pci_pm_thaw+0x0/0x7c returned 0 after 0 usecs
  xhci_hcd 0000:00:0d.0: calling pci_pm_thaw+0x0/0x7c @ 4664, parent: pci0000:00
  xhci_hcd 0000:00:0d.0: pci_pm_thaw+0x0/0x7c returned 0 after 0 usecs

That means the suspend/resume logic is doing the right thing.

How the XHCI ends up being confused here is a mystery. Cc'ed a few more
folks.

Thanks,

        tglx