From: Thomas Gleixner
To: Evan Green
Cc: Rajat Jain, Bjorn Helgaas, linux-pci, Linux Kernel Mailing List
Subject: Re: [PATCH v2] PCI/MSI: Avoid torn updates to MSI pairs
References: <20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid> <87y2tytv5i.fsf@nanos.tec.linutronix.de> <87eevqkpgn.fsf@nanos.tec.linutronix.de>
Date: Fri, 24 Jan 2020 01:50:36 +0100
Message-ID: <87pnf91xur.fsf@nanos.tec.linutronix.de>
X-Mailing-List: linux-kernel@vger.kernel.org

Evan
Green writes:
> On Thu, Jan 23, 2020 at 12:59 PM Evan Green wrote:
>>
>> On Thu, Jan 23, 2020 at 10:17 AM Thomas Gleixner wrote:
>> >
>> > Evan,
>> >
>> > Thomas Gleixner writes:
>> > > This is not yet debugged fully and as this is happening on MSI-X I'm not
>> > > really convinced yet that your 'torn write' theory holds.

As you pointed out that this is not on MSI-X I'm considering the torn
write theory to be more likely. :)

>> > can you please apply the debug patch below and run your test. When the
>> > failure happens, stop the tracer and collect the trace.
>> >
>> > Another question. Did you ever try to change the affinity of that
>> > interrupt without hotplug rapidly while the device makes traffic? If
>> > not, it would be interesting whether this leads to a failure as well.
>>
>> Thanks for the patch. Looks pretty familiar :)
>> I ran into issues where trace_printks on offlined cores seem to
>> disappear. I even made sure the cores were back online when I
>> collected the trace. So your logs might not be useful. Known issue
>> with the tracer?

No. I tried the patch myself to verify that it does what I want. The
only information I'm missing right now is the interrupt number to look
for.

But I'll stare at it with brain awake tomorrow morning again.

>> I also tried changing the affinity rapidly without CPU hotplug, but
>> didn't see the issue, at least not in the few minutes I waited
>> (normally repros easily within 1 minute). An interesting datapoint.

That's what I expected. The main difference is that the vector
modification happens at a point where a device is not supposed to send
an interrupt. They happen when the interrupt of the device is serviced
before the driver handler is invoked and at that point the device should
not send another one.

> One additional datapoint. The intel guys suggested enabling
> CONFIG_IRQ_REMAP, which does seem to eliminate the issue for me. I'm
> still hoping there's a smaller fix so I don't have to add all that in.

Right, I wanted to ask you that as well and forgot. With interrupt
remapping the migration happens at the remapping unit, which does not
have the horrible 'move it while servicing' requirement, and it supports
proper masking.

Thanks,

        tglx