From: Matthew Ruffell
Date: Mon, 1 Nov 2021 17:35:04 +1300
Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio
To: Alex Williamson
Cc: linux-pci@vger.kernel.org, lkml, kvm@vger.kernel.org, nathan.langford@xcelesunifiedtechnologies.com
References: <20210914104301.48270518.alex.williamson@redhat.com> <9e8d0e9e-1d94-35e8-be1f-cf66916c24b2@canonical.com> <20210915103235.097202d2.alex.williamson@redhat.com> <2fadf33d-8487-94c2-4460-2a20fdb2ea12@canonical.com> <20211005171326.3f25a43a.alex.williamson@redhat.com> <20211012140516.6838248b.alex.williamson@redhat.com>
X-Mailing-List:
linux-kernel@vger.kernel.org

Hi Alex,

Nathan has been running a workload on the 5.14 kernel + the test patch, and has run into some interesting softlockups and hardlockups.

The first happened on a secondary server running a Windows VM, with 7 (of 10) 1080TI GPUs passed through. Full dmesg: https://paste.ubuntu.com/p/Wx5hCBBXKb/

There aren't any "irq x: nobody cared" messages, and the crashkernel gets stuck at the usual "copying IR tables from dmar" stage, which suggests an ongoing interrupt storm.

Nathan disabled the "kernel.hardlockup_panic = 1" sysctl and managed to reproduce the issue again, suggesting that we get stuck in kernel space for too long without interrupts being serviced. It starts with the NIC hitting a tx queue timeout, followed by an NMI to unwind the stack of each CPU, although the stacks don't appear to indicate where things are stuck. The server then remains softlocked, and keeps unwinding stacks every 26 seconds or so, until it eventually hard locks up. Full dmesg: https://people.canonical.com/~mruffell/sf314568/1080TI_hardlockup.txt

The next interesting thing to report: when Nathan started the same Windows VM on the primary host we have been debugging on, with the 8x 2080TI GPUs, he got a stuck VM while the host remained responsive. When Nathan reset the VM, he got 4x "irq xx: nobody cared" messages on IRQs 25, 27, 29 and 31, which at the time corresponded to the PEX 8747 upstream PCI switches. Interestingly, Nathan also observed 2x GPU audio devices sharing the same IRQ line as the upstream PCI switch, although this only occurred very briefly, and the GPU audio devices were re-assigned different IRQs shortly afterward.

Full dmesg: https://paste.ubuntu.com/p/C2V4CY3yjZ/
Output showing upstream ports belonging to those IRQs: https://paste.ubuntu.com/p/6fkSbyFNWT/
Full lspci: https://paste.ubuntu.com/p/CTX5kbjpRP/

Let us know if you would like any additional debug information.
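In case it helps anyone following along: a quick way to confirm a suspected storm on a line like IRQ 31 is to diff two /proc/interrupts snapshots taken a few seconds apart and flag lines whose counts jump abnormally. A minimal sketch of that (the snapshot text, device names, and the 100000-count threshold below are illustrative, not taken from the affected host):

```python
# Sketch: diff two /proc/interrupts snapshots to spot a storming IRQ line.
# The snapshot text here is made up for illustration; on a real host you
# would read /proc/interrupts twice with a short sleep in between.

def parse_interrupts(text):
    """Map IRQ label -> total count summed across all CPU columns."""
    counts = {}
    for line in text.strip().splitlines()[1:]:  # skip the CPUn header row
        fields = line.split()
        label = fields[0].rstrip(":")
        total = 0
        for f in fields[1:]:
            if f.isdigit():
                total += int(f)
            else:
                break  # reached the chip/handler/device name columns
        counts[label] = total
    return counts

before = """\
           CPU0       CPU1
  25:        100        120   IR-IO-APIC   25-fasteoi   pex-upstream
  31:        500        510   IR-IO-APIC   31-fasteoi   pex-upstream
"""

after = """\
           CPU0       CPU1
  25:        101        121   IR-IO-APIC   25-fasteoi   pex-upstream
  31:     400500     400510   IR-IO-APIC   31-fasteoi   pex-upstream
"""

b, a = parse_interrupts(before), parse_interrupts(after)
storming = {irq: a[irq] - b[irq] for irq in a if a[irq] - b[irq] > 100000}
print(storming)  # only IRQ 31 crosses the threshold in this sample
```

On a storming line the per-interval delta is typically orders of magnitude above every other line, so the exact threshold is not critical.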
As always, we are happy to test patches out.

Thanks,
Matthew