From: Evan Green
Date: Thu, 23 Jan 2020 16:29:34 -0800
Subject: Re: [PATCH v2] PCI/MSI: Avoid torn updates to MSI pairs
To: Thomas Gleixner
Cc: Rajat Jain, Bjorn Helgaas, linux-pci, Linux Kernel Mailing List

On Thu, Jan 23, 2020 at 2:59 PM Evan Green wrote:
>
> On Thu, Jan 23, 2020 at 12:59 PM Evan Green wrote:
> >
> > On Thu, Jan 23, 2020 at 10:17 AM Thomas Gleixner wrote:
> > >
> > > Evan,
> > >
> > > Thomas Gleixner writes:
> > > > This is not yet debugged fully and as this is happening on MSI-X I'm not
> > > > really convinced yet that your 'torn write' theory holds.
> > >
> > > can you please apply the debug patch below and run your test. When the
> > > failure happens, stop the tracer and collect the trace.
> > >
> > > Another question. Did you ever try to change the affinity of that
> > > interrupt without hotplug rapidly while the device makes traffic? If
> > > not, it would be interesting whether this leads to a failure as well.
> >
> > Thanks for the patch. Looks pretty familiar :)
> > I ran into issues where trace_printks on offlined cores seem to
> > disappear. I even made sure the cores were back online when I
> > collected the trace. So your logs might not be useful. Known issue
> > with the tracer?
> >
> > I figured I'd share my own debug chicken scratch, in case you could
> > glean anything from it. The LOG entries print out timestamps (divide
> > by 1000000) that you can match up back to earlier in the log (ie so
> > the last XHCI MSI change occurred at 74.032501, the last interrupt
> > came in at 74.032405). Forgive the mess.
> >
> > I also tried changing the affinity rapidly without CPU hotplug, but
> > didn't see the issue, at least not in the few minutes I waited
> > (normally repros easily within 1 minute). An interesting datapoint.
>
> One additional datapoint. The intel guys suggested enabling
> CONFIG_IRQ_REMAP, which does seem to eliminate the issue for me. I'm
> still hoping there's a smaller fix so I don't have to add all that in.

I did another experiment that I think lends credibility to my torn MSI
hypothesis.
I have the following change:

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 1f69b12d5bb86..0336d23f9ba9a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1798,6 +1798,7 @@ void (*machine_check_vector)(struct pt_regs *, long error_code) =
 dotraplinkage void do_mce(struct pt_regs *regs, long error_code)
 {
+printk("EVAN MACHINE CHECK HC died");
 	machine_check_vector(regs, error_code);
 }

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 23a363fd4c59c..31f683da857e3 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -315,6 +315,11 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 		msgctl |= entry->msi_attrib.multiple << 4;
 		pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);

+if (entry->msi_attrib.is_64) {
+pci_write_config_word(dev, pos + PCI_MSI_DATA_64, 0x4012);
+} else {
+pci_write_config_word(dev, pos + PCI_MSI_DATA_32, 0x4012);
+}
 		pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO, msg->address_lo);
 		if (entry->msi_attrib.is_64) {

And indeed, I get a machine check, despite the fact that MSI_DATA is
overwritten just after address is updated.

[ 79.937179] smpboot: CPU 1 is now offline
[ 80.001685] smpboot: CPU 3 is now offline
[ 80.025210] smpboot: CPU 5 is now offline
[ 80.049517] smpboot: CPU 7 is now offline
[ 80.094263] x86: Booting SMP configuration:
[ 80.099394] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 80.136233] smpboot: Booting Node 0 Processor 3 APIC 0x3
[ 80.155732] smpboot: Booting Node 0 Processor 5 APIC 0x5
[ 80.173632] smpboot: Booting Node 0 Processor 7 APIC 0x7
[ 80.297198] smpboot: CPU 1 is now offline
[ 80.331347] EVAN MACHINE CHECK HC died
[ 82.281555] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
[ 82.295775] Kernel Offset: disabled
[ 82.301740] gsmi: Log Shutdown Reason 0x02
[ 82.313942] Rebooting in 30 seconds..
[ 112.204113] ACPI MEMORY or I/O RESET_REG.

-Evan