From: Don Dutile
Date: Wed, 08 Aug 2012 15:33:00 -0400
To: Alex Williamson
CC: Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, dsahern@gmail.com
Subject: Re: [PATCH] pci: Account for virtual buses in pci_acs_path_enabled
Message-ID: <5022BEEC.5080702@redhat.com>
In-Reply-To: <1344439468.26297.14.camel@bling.home>

On 08/08/2012 11:24 AM, Alex Williamson wrote:
> On Tue, 2012-08-07 at 23:00 -0700, Bjorn Helgaas wrote:
>> On Tue, Aug 7, 2012 at 2:50 PM, Don Dutile wrote:
>>> On 08/06/2012 04:47 PM, Bjorn Helgaas wrote:
>>>> On Sun, Aug 5, 2012 at 11:55 PM, Alex Williamson wrote:
>>>>> On Sun, 2012-08-05 at 23:30 -0600, Bjorn Helgaas wrote:
>>>>>> On Sat, Aug 4, 2012 at 12:19 PM, Alex Williamson wrote:
>>>>>>>
>>>>>>> It's possible to have buses without an associated bridge
>>>>>>> (bus->self == NULL). SR-IOV can generate such buses. When
>>>>>>> we find these, skip to the parent bus to look for the next
>>>>>>> ACS test.
>>>>>>
>>>>>> To make sure I understand the problem here, I think you're referring
>>>>>> to the situation where an SR-IOV device can span several bus numbers,
>>>>>> e.g., the "VFs Spanning Multiple Bus Numbers" implementation note in
>>>>>> the SR-IOV 1.1 spec, sec. 2.1.2.
>>>>>>
>>>>>> It says "All PFs must be located on the Device's captured Bus Number"
>>>>>> -- I think that means every PF will be directly on a bridge's
>>>>>> secondary bus and hence will have a valid dev->bus->self pointer.
>>>>>>
>>>>>> However, VFs need not be on the same bus number. If a VF is on
>>>>>> (captured Bus Number plus 1), I think we allocate a new struct pci_bus
>>>>>> for it, but there's no P2P bridge that leads to that bus, so the
>>>>>> bus->self pointer is probably NULL.
>>>>>
>>>>> Yes, exactly. virtfn_add_bus() is where we're creating this new bus.
>>>>>
>>>>>> This makes me quite nervous, because I bet there are many places that
>>>>>> assume every non-root bus has a valid bus->self pointer -- I know I
>>>>>> certainly had that assumption.
>>>>>>
>>>>>> I looked at callers of pci_is_root_bus(), and at first glance, it seems
>>>>>> like iommu_init_device(), intel_iommu_add_device(), pci_acs_path_enabled(),
>>>>>
>>>>> These 3 are handled by this patch, plus the intel and amd iommu patches
>>>>> I sent.
>>>>>
>>>>>> pci_get_interrupt_pin(), pci_common_swizzle(),
>>>>>
>>>>> If sr-iov is the only source of these virtual buses, these are probably
>>>>> ok since VFs don't support INTx.
>>>>>
>>>>>> pci_find_upstream_pcie_bridge(), and
>>>>>
>>>>> Here the pci_is_root_bus() is after a pci_is_pcie() check, so again if
>>>>> sr-iov only (and assuming VFs properly report PCIe capability), we
>>>>> shouldn't stumble on it.
>>>>>
>>>>>> pci_bus_release_bridge_resources() all might have similar problems.
>>>>>
>>>>> This one might deserve further investigation. Thanks,
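[Aside, just so we're all picturing the same loop: the walk being patched
is roughly the one below. This is only my sketch of its shape -- not
Alex's actual patch, the function name is made up, and I've dropped the
end-of-path handling the real pci_acs_path_enabled() has. pci_acs_enabled()
is the existing per-device ACS test. The interesting bit is the
bus->self == NULL hop, which is what a VF sitting on one of these virtual
buses needs.]

#include <linux/pci.h>

/*
 * Sketch only, not the real code: walk upstream from 'start' toward the
 * root, testing ACS at each device along the way.  An SR-IOV VF can live
 * on a bus with no P2P bridge behind it (bus->self == NULL), so the walk
 * has to hop over such virtual buses to the parent bus instead of
 * giving up.
 */
static bool acs_path_enabled_sketch(struct pci_dev *start, u16 acs_flags)
{
        struct pci_dev *pdev = start;

        while (pdev) {
                struct pci_bus *bus = pdev->bus;

                if (!pci_acs_enabled(pdev, acs_flags))
                        return false;

                /* Skip bridgeless (SR-IOV virtual) buses: there is no
                 * bus->self device on them to test. */
                while (bus && !pci_is_root_bus(bus) && !bus->self)
                        bus = bus->parent;

                if (!bus || pci_is_root_bus(bus))
                        break;

                pdev = bus->self;
        }

        return true;
}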
>>>>
>>>> We can fix all these places piecemeal, but that doesn't feel like a
>>>> very satisfying solution. It makes it much harder to know that each
>>>> place is correct, and this oddity of a bus with no upstream bridge is
>>>> still lying around, waiting to bite us again later.
>>>>
>>>> What other possible ways of fixing this do we have? Could we set
>>>> bus->self (multiple buses would then point to the same bridge, and I
>>>> don't know if that would break something)? Add something like a
>>>> pci_upstream_p2p_bridge() interface that would encapsulate traversing
>>>
>>> ^^^ and this name will reduce the confusion? :)
>>
>> I don't claim that :) I just wanted to explore other possible
>> solutions. Changing every loop that searches the parent chain so it
>> knows about this SR-IOV oddity doesn't seem like the ideal solution,
>> though maybe it's the best we can do given the constraints.
>
> Yep, these are the two alternatives I thought of too: set bus->self or
> create a helper function. The former leaves present assumptions in
> place, but I don't know whether we'll break something else having two
> buses claiming to be sourced by the same device. Maintaining the NULL
> self pointer almost feels like a better representation of the hardware,
> but requires evaluating all the current users and coming up with a
> helper. That's more work, but we probably want to move away from
> everyone manually walking pci data structures anyway.
>
>>>> Since these fake VF buses don't have a bridge that points to them, I
>>>
>>> Well, they aren't fake buses, just ARI identifiers, which translate the
>>> B:D.F/8:5.3 format into simply a 16-bit ID.
>>
>> I think an SR-IOV device can consume multiple bus numbers even without
>> ARI (in fact, I think ARI reduces the number of bus numbers the
>> device requires ... e.g., a PF and 15 VFs would require two bus
>> numbers without ARI (04:00.0 - 04:00.7 and 05:00.0 - 05:00.7) but only
>> one bus number with ARI (04:00.0 - 04:01.7)). (I think "04:01.7" is
>> how Linux would represent the 8-bit function number ARI gives you.
>> You could also think of it as "04:00.0f")
>
> Yep, that's what I think is happening too. I've been testing on a
> system with ARI, so using the same device as David, PFs at 1:00.0/1 and
> VFs at 1:10.0 - 1:11.5. The sr-iov capability reports VF offset: 128,
> stride: 2. IIRC, w/o ARI this same device reports VF offset 384 (256 +
> 128), which seems to match David's lspci.
>
Correct, lack of ARI forces the use of more bus numbers, not less.
The iov.c code 'fixes' this by increasing the subordinate bus number
register to cover the range of buses the VF stride wants to use, if those
bus numbers aren't already consumed; if they are consumed, the VFs are
inaccessible/unconfigurable.
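To make that offset/stride arithmetic concrete, here's a throwaway
userspace snippet (the helper names and layout are mine, not the actual
iov.c functions) showing the routing-ID math: the n-th VF's routing ID
(counting from zero) is roughly the PF's routing ID + First VF Offset +
VF Stride * n, and whatever carries out of the low byte lands on a higher
bus number -- which is exactly how we end up with a bus that has no
bridge behind it.

#include <stdio.h>

static unsigned int vf_bus(unsigned int pf_bus, unsigned int pf_devfn,
                           unsigned int offset, unsigned int stride,
                           unsigned int id)
{
        return pf_bus + ((pf_devfn + offset + stride * id) >> 8);
}

static unsigned int vf_devfn(unsigned int pf_devfn, unsigned int offset,
                             unsigned int stride, unsigned int id)
{
        return (pf_devfn + offset + stride * id) & 0xff;
}

int main(void)
{
        unsigned int id, devfn;

        /* PF 0 of Alex's device: 01:00.0, offset 128, stride 2; seven
         * VFs per PF, going by the 1:10.0 - 1:11.5 range quoted above. */
        for (id = 0; id < 7; id++) {
                devfn = vf_devfn(0x00, 128, 2, id);
                printf("ARI    VF%u -> %02x:%02x.%u\n", id,
                       vf_bus(0x01, 0x00, 128, 2, id),
                       devfn >> 3, devfn & 0x7);
        }

        /* Same PF without ARI: offset 384 spills every VF onto bus 02. */
        for (id = 0; id < 7; id++) {
                devfn = vf_devfn(0x00, 384, 2, id);
                printf("no-ARI VF%u -> %02x:%02x.%u\n", id,
                       vf_bus(0x01, 0x00, 384, 2, id),
                       devfn >> 3, devfn & 0x7);
        }

        return 0;
}

With Alex's numbers that prints 01:10.0, 01:10.2, ... for the ARI case
and 02:10.0, 02:10.2, ... for the non-ARI case, i.e. the non-ARI layout
is the one that needs the extra, bridgeless bus.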
>>
>>> So, VF devices should be attached to the same bus->devices list as
>>> their PF.
>>
>> I don't think it works that way today, does it? In the SR-IOV spec
>> example in sec 2.1.2:
>>
>> PF 0 at 04:00.0
>> ARI Capable is set
>> First VF Offset = 1, VF Stride = 1, NumVFs = 600
>>
>> I think we have three separate bus->devices lists:
>>
>> pci_bus 04: devices list contains PF 0 and VF 0,1 through VF 0,255
>> pci_bus 05: devices list contains VF 0,256 - VF 0,511
>> pci_bus 06: devices list contains VF 0,512 - VF 0,600
>>
>>> pci_dev->bus should be the same bus ptr as the PF's pci_dev as well,
>>> since the VF uses all of that bus's resources and support functions
>>> (cfg, dma-ops, etc.).
>>> Searching the driver/pci area, support functions like AER want the
>>> bus struct that's receiving/handling the PCIe error, the associated
>>> (hw) port, etc., so another reason the VF's pci-dev bus ptr should be
>>> the same as the PF's.
>>
>> Maybe every VF *should* have the same dev->bus pointer as the PF, but
>> I don't think it does today. I think we only store the bus number in
>> the struct pci_bus, so if we *did* give all the VFs the same dev->bus
>> pointer and put all the VFs in the same bus->devices list, we'd have
>> to store the bus number elsewhere, e.g., in the struct pci_dev.
>>
>> That might make sense, but the magnitude of a change like that makes
>> my head hurt -- it would affect drivers, arch code, config accessors,
>> etc.
>
> Yep, I agree, that's a total rework of what a struct pci_bus represents.
> Thanks,
>
> Alex
>
To sum up, I believe the sw structures should reflect the hw connectivity
and association. If you buy that bridge (all pun intended), then the VFs
should get connected to the same pci-bus as the PF, because physically
they are, and the register set for that parent bridge is the one the PF
is connected to ... bus->self == NULL will cause heartburn for AERs
associated with VFs .... no dev to point to in order to do AER
handling ... :(
Likewise, additional bus structs could be generated, but they would
require ptrs in them to ensure they point to the same structures (same
dev-lists, same resource structs, etc.) and/or inherit the PF
fields/attributes.
Changes to a pci-dev associated with a bridge would also have to check
whether 'peer' (VF/SR-IOV) buses are associated with that bridge/pci-dev
and duplicate the updates -- which begs for the same bus struct to be used.

I think we have to break the identification of devices, B:D.F (VFs
specifically), and the existing 1-to-1 relationship with a hw device.
Helper functions to reformat a VF's B:D.F as PF-B:[D potentially beyond
31].F would be one way to do it.
Changing devfn in the pci-dev struct from u8 to u16 would be one option;
putting the entire B:D.F into pci-dev and making it a u32 would probably
be the cleanest (and I'm familiar with at least one OS & its PCI
developer that did just that ;-), though not for these ARI reasons), and
then macros like PCI_FUNC() and PCI_SLOT() could be changed appropriately,
along with a new PCI_BUS() macro.

The only serious gotcha I see in the B:D.F interpretation is how it is
presented in sysfs, since a D value beyond the usual 0-31 range could
cause some heartburn for user apps traversing sysfs, and how those values
get copied/transposed to other interfaces, e.g., libvirt scanning sysfs
for VFs, and how the B:D.F is parsed and passed down to qemu when a
device is assigned to a guest.
The 32-bit B:D.F in pci-dev, and new helpers to present & parse based on
that field, are looking better to me now...
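Just to put a little meat on that last idea, something along these lines
(pure strawman -- nothing like it exists in the tree today; PCI_BDF(),
PCI_DOMAIN(), PCI_BUS() and PCI_DEVFN_OF() are names I'm making up here,
with only PCI_DEVFN()/PCI_SLOT()/PCI_FUNC() being the real macros):

#include <linux/pci.h>  /* existing PCI_DEVFN()/PCI_SLOT()/PCI_FUNC() */

/*
 * Strawman: pci_dev carries one 32-bit id -- domain/segment in the high
 * 16 bits, bus in bits 15:8, and the 8-bit (ARI-style) function number
 * in the low byte, i.e. what a PCIe TLP actually carries.
 */
#define PCI_BDF(domain, bus, devfn) \
        (((u32)(domain) << 16) | (((bus) & 0xff) << 8) | ((devfn) & 0xff))

#define PCI_DOMAIN(bdf)         ((bdf) >> 16)
#define PCI_BUS(bdf)            (((bdf) >> 8) & 0xff)  /* the new macro */
#define PCI_DEVFN_OF(bdf)       ((bdf) & 0xff)

/*
 * e.g. the VF at 04:01.7 in Bjorn's ARI example above:
 *      u32 bdf = PCI_BDF(0, 0x04, PCI_DEVFN(0x01, 7));
 *      PCI_BUS(bdf) == 0x04, PCI_SLOT(PCI_DEVFN_OF(bdf)) == 0x01,
 *      PCI_FUNC(PCI_DEVFN_OF(bdf)) == 7 -- or, read the ARI way,
 *      function number PCI_DEVFN_OF(bdf) == 0x0f.
 */

The sysfs/libvirt/qemu question above then mostly becomes a question of
which of these accessors the name formatting and parsing helpers use.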
As for PCI tree traversal, endpoints shouldn't be doing it; I'll bet the
majority that do are just looking for the parent bus to get/print the
bus number. That leaves PCI core code for bus scanning, hot plug, ACS
tree traversal, etc., which should be manageable in a change like the
ones I mention in the previous paragraph.
Alex: does VFIO use pci functions to do traversal, or does it implement
its own functions to traverse pci trees?

So, as stated previously, this certainly hurts the brain cells, but not
as much as I thought it would.
Someday, I'll try to figure out how MR-IOV-based buses come & go in the
pci dev tree! :-o !

- Don