From: Alexander Duyck
Date: Fri, 7 Jul 2017 20:37:52 -0700
Subject: Re: [PATCH v6 0/3] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
To: Casey Leedom
Cc: Ding Tianhong, ashok.raj@intel.com, helgaas@kernel.org, Michael Werner,
 Ganesh GR, asit.k.mallick@intel.com, patrick.j.cramer@intel.com,
 Suravee.Suthikulpanit@amd.com, Bob.Shaw@amd.com, l.stach@pengutronix.de,
 amira@mellanox.com, gabriele.paoloni@huawei.com, David.Laight@aculab.com,
 jeffrey.t.kirsher@intel.com, catalin.marinas@arm.com, will.deacon@arm.com,
 mark.rutland@arm.com, robin.murphy@arm.com, davem@davemloft.net,
 linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org,
 linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
References: <1498133721-21152-1-git-send-email-dingtianhong@huawei.com>

On Fri, Jul 7, 2017 at 7:04 PM, Casey Leedom wrote:
> Okay, thanks for the note Alexander. I'll have to look more closely at
> the patch on Monday and try it out on one of the targeted systems to
> verify the semantics you describe.
>
> However, that said, there is no way to tell a priori where a device will
> send TLPs. To simply assume that all TLPs will be directed towards the
> Root Complex is a big assumption. Only the device and the code
> controlling it know where the TLPs will be directed. That's why there
> are changes required in the cxgb4 driver.
> For instance, the code in drivers/net/ethernet/chelsio/cxgb4/sge.c:
> t4_sge_alloc_rxq() knows that it's allocating Free List Buffers in Host
> Memory and that the RX Queues that it's allocating in the Hardware will
> eventually send Ingress Data to those Free List Buffers. (And similarly
> for the Free List Buffer Pointer Queue with respect to DMA Reads from
> the host.) In that routine we explicitly configure the Hardware to
> use/not-use the Relaxed Ordering Attribute via the FW_IQ_CMD_FL0FETCHRO
> and FW_IQ_CMD_FL0DATARO flags. Basically we're conditionally setting
> them based on the desirability of sending Relaxed Ordering TLPs to the
> Root Complex. (And we would perform the same kind of check for an NVMe
> application ... which brings us to ...)

The general idea with this is to keep things simple. In the vast
majority of cases the assumption is that a device will be sending data
back and forth to system memory, so the most likely thing any given
device will be interacting with is the root complex. Disabling relaxed
ordering whenever the root complex says it can't support it seems like
the simplest and most direct way to avoid sending relaxed ordering
enabled TLPs to the root complex.

With that said, what we are getting into are subtleties that end up
impacting devices that perform peer-to-peer operations, and I don't
suspect peer-to-peer is really all that common. Ideally, my thought is
that if there is something in the setup that cannot support relaxed
ordering we should disable relaxed ordering and then re-enable it only
in the drivers that are aware of special peer-to-peer routes. This
should also simplify the case of a function being direct assigned: the
configuration space is passed through to the guest with the relaxed
ordering attribute disabled, and that status is visible to the guest.
If we just use the quirk we lose that visibility, and it becomes
problematic if a function is direct assigned on a system that doesn't
support relaxed ordering, as the guest has no visibility into the host's
quirks.

> And what would be the code using these patch APIs to set up a
> Peer-to-Peer NVMe-style application? In that case we'd need the Chelsio
> adapter's PCIe Capability Device Control[Relaxed Ordering Enable] set
> for the NVMe application ... and we would avoid programming the Chelsio
> Hardware to use Relaxed Ordering for TLPs directed at the Root Complex.
> Thus we would be in a position where some TLPs being emitted by the
> device to Peer devices would have Relaxed Ordering set and some
> directed at the Root Complex would not. And the only way for that to
> work is if the source device's Device Control[Relaxed Ordering Enable]
> is set ...

Right. I admit this pushes extra complexity into the driver, but what
else would you have us do? The idea here is more of a lock-out/tag-out
type setup. We go in and disable relaxed ordering on a device if it
isn't safe to send a relaxed ordering TLP to the root complex. We then
leave it up to the driver to re-enable it if the driver knows enough
about how the device works and what kind of transactions it might issue.

I'm not saying we have to leave the relaxed ordering bit disabled in the
control register. I'm saying we disable it at first, and then leave it
up to the device driver to re-enable it if it needs the functionality
for something other than root-complex traffic and it knows it can avoid
sending those frames to the root complex. Ideally such a driver would
also clean up after itself when removed, so that it leaves the device in
the same state it found it in.

Maybe we don't even really need patch 3/3 in this series. If clearing
the relaxed ordering bit in the configuration space already disables
relaxed ordering for the entire device, then that patch doesn't really
do anything, right?
The device already takes care of it for us, so we could probably just
drop that patch as it currently stands. If that is the case, maybe we
should refocus patch 3/3 on re-enabling relaxed ordering for that
NVMe-specific case. That might be beyond the scope of what Ding can
handle, though, and I am not familiar with the Chelsio hardware either,
so it might be something best left to you as a follow-up patch.

> Finally, setting aside my disagreements with the patch, we still have
> the code in the cxgb4 driver which explicitly turns on its own Device
> Control[Relaxed Ordering Enable] in cxgb4_main.c:
> enable_pcie_relaxed_ordering(). So the patch is something of a loop if
> all we're doing is testing our own Relaxed Ordering Enable state ...

I assume you are talking about what I am suggesting and not what is in
the actual patch. Basically what it comes down to is sort of a loop. You
would need to add some code to your probe and remove routines to check
the status of the bits and configure them the way you want, and on
remove you would clean things up so that the next time a driver loads it
doesn't get confused. However, you should only need this in the case
where the root complex cannot support relaxed ordering. In all other
cases you would just run with relaxed ordering enabled for everything.