Subject: Re: [PATCH] pci-iov: Add support for unmanaged SR-IOV
To: Alexander Duyck
Cc: Alex Williamson, Bjorn Helgaas, linux-pci@vger.kernel.org,
    Alexander Duyck, virtio-dev@lists.oasis-open.org, kvm@vger.kernel.org,
    Netdev, "Daly, Dan", LKML, Maximilian Heyne, "Wang, Liang-min",
    "Rustad, Mark D", David Woodhouse, dwmw@amazon.co.uk, Ilya Lesokhin,
    Kirti Wankhede
References: <20180227190520.3273.79728.stgit@localhost.localdomain>
 <20180227144015.228d4f5c@w520.home>
 <20180228155949.00fde8ae@w520.home>
 <20180301132224.1ca06949@w520.home>
From: Don Dutile
Date: Tue, 6 Mar 2018 15:19:39 -0500

On 03/05/2018 04:41 PM, Alexander Duyck wrote:
> On Mon, Mar 5, 2018 at 12:57 PM, Don Dutile wrote:
>> On 03/01/2018 03:22 PM, Alex Williamson wrote:
>>>
>>> On Wed, 28 Feb 2018 16:36:38 -0800
>>> Alexander Duyck wrote:
>>>
>>>> On Wed, Feb 28, 2018 at 2:59 PM, Alex Williamson wrote:
>>>>>
>>>>> On Wed, 28 Feb 2018 09:49:21 -0800
>>>>> Alexander Duyck wrote:
>>>>>
>>>>>> On Tue, Feb 27, 2018 at 2:25 PM, Alexander Duyck wrote:
>>>>>>>
>>>>>>> On Tue, Feb 27, 2018 at 1:40 PM, Alex Williamson wrote:
>>>>>>>>
>>>>>>>> On Tue, 27 Feb 2018 11:06:54 -0800
>>>>>>>> Alexander Duyck wrote:
>>>>>>>>
>>>>>>>>> From: Alexander Duyck
>>>>>>>>>
>>>>>>>>> This patch is meant to add support for SR-IOV on devices when the
>>>>>>>>> VFs are not managed by the kernel. Examples of recent patches
>>>>>>>>> attempting to do this include:
>>>>>>>>
>>>>>>>> It appears to enable sriov when the _pf_ is not managed by the
>>>>>>>> kernel, but by "managed" we mean that either there is no pf driver or
>>>>>>>> the pf driver doesn't provide an sriov_configure callback,
>>>>>>>> intentionally or otherwise.
>>>>>>>>
>>>>>>>>> virtio   - https://patchwork.kernel.org/patch/10241225/
>>>>>>>>> pci-stub - https://patchwork.kernel.org/patch/10109935/
>>>>>>>>> vfio     - https://patchwork.kernel.org/patch/10103353/
>>>>>>>>> uio      - https://patchwork.kernel.org/patch/9974031/
>>>>>>>>
>>>>>>>> So is the goal to get around the issues with enabling sriov on each
>>>>>>>> of the above drivers by doing it under the covers, or are you really
>>>>>>>> just trying to enable sriov for a truly unmanaged (no pf driver)
>>>>>>>> case? For example, should a driver explicitly not wanting sriov
>>>>>>>> enabled implement a dummy sriov_configure function?
>>>>>>>>
>>>>>>>>> Since this is quickly blowing up into a multi-driver problem it is
>>>>>>>>> probably best to implement this solution in one spot.
>>>>>>>>>
>>>>>>>>> This patch is an attempt to do that. What we do with this patch is
>>>>>>>>> provide a generic call to enable SR-IOV in the case that the PF
>>>>>>>>> driver is either not present, or the PF driver doesn't support
>>>>>>>>> configuring SR-IOV.
>>>>>>>>>
>>>>>>>>> A new sysfs value called sriov_unmanaged_autoprobe has been added.
>>>>>>>>> This value is used as the drivers_autoprobe setting of the VFs when
>>>>>>>>> they are being managed by an external entity such as userspace or
>>>>>>>>> device firmware instead of being managed by the kernel.
>>>>>>>>
>>>>>>>> Documentation/ABI/testing/sysfs-bus-pci update is missing.
>>>>>>>
>>>>>>> I can make sure to update that in the next version.
>>>>>>>
>>>>>>>>> One side effect of this change is that the sriov_drivers_autoprobe
>>>>>>>>> and sriov_unmanaged_autoprobe will only apply their updates when
>>>>>>>>> SR-IOV is disabled. Attempts to update them when SR-IOV is in use
>>>>>>>>> will only update the local value and will not update
>>>>>>>>> sriov->autoprobe.
>>>>>>>>
>>>>>>>> And we expect users to understand when sriov_drivers_autoprobe
>>>>>>>> applies vs sriov_unmanaged_autoprobe, even though they're using the
>>>>>>>> same interfaces to enable sriov? Are all combinations expected to
>>>>>>>> work, ex. unmanaged sriov is enabled, a native pf driver loads, vfs
>>>>>>>> work? Not only does it seem like there's opportunity to use this
>>>>>>>> incorrectly, I think maybe it might be difficult to use correctly.
>>>>>>>>
>>>>>>>>> I based my patch set originally on the patch by Mark Rustad, but
>>>>>>>>> there isn't much left after going through and cleaning out the bits
>>>>>>>>> that were no longer needed, and after incorporating the feedback
>>>>>>>>> from David Miller.
>>>>>>>>>
>>>>>>>>> I have included the authors of the original 4 patches above in the
>>>>>>>>> Cc here. My hope is to get feedback and/or review on whether this
>>>>>>>>> works for their use cases.
>>>>>>>>>
>>>>>>>>> Cc: Mark Rustad
>>>>>>>>> Cc: Maximilian Heyne
>>>>>>>>> Cc: Liang-Min Wang
>>>>>>>>> Cc: David Woodhouse
>>>>>>>>> Signed-off-by: Alexander Duyck
>>>>>>>>> ---
>>>>>>>>>  drivers/pci/iov.c        | 27 +++++++++++++++++++-
>>>>>>>>>  drivers/pci/pci-driver.c |  2 +
>>>>>>>>>  drivers/pci/pci-sysfs.c  | 62 +++++++++++++++++++++++++++++++++++++++++-----
>>>>>>>>>  drivers/pci/pci.h        |  4 ++-
>>>>>>>>>  include/linux/pci.h      |  1 +
>>>>>>>>>  5 files changed, 86 insertions(+), 10 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>>>>>>>>> index 677924ae0350..7b8858bd8d03 100644
>>>>>>>>> --- a/drivers/pci/iov.c
>>>>>>>>> +++ b/drivers/pci/iov.c
>>>>>>>>> @@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
>>>>>>>>>  	pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device);
>>>>>>>>>  	iov->pgsz = pgsz;
>>>>>>>>>  	iov->self = dev;
>>>>>>>>> +	iov->autoprobe = true;
>>>>>>>>>  	iov->drivers_autoprobe = true;
>>>>>>>>>  	pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap);
>>>>>>>>>  	pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link);
>>>>>>>>> @@ -643,8 +644,11 @@ void pci_restore_iov_state(struct pci_dev *dev)
>>>>>>>>>   */
>>>>>>>>>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool auto_probe)
>>>>>>>>>  {
>>>>>>>>> -	if (dev->is_physfn)
>>>>>>>>> +	if (dev->is_physfn) {
>>>>>>>>>  		dev->sriov->drivers_autoprobe = auto_probe;
>>>>>>>>> +		if (!dev->sriov->num_VFs)
>>>>>>>>> +			dev->sriov->autoprobe = auto_probe;
>>>>>>>>
>>>>>>>> Why is dev->sriov->autoprobe set any time other than immediately
>>>>>>>> prior to enabling VFs?
>>>>>>>
>>>>>>> My concern here was drivers that are still floating around with the
>>>>>>> old module parameter option for enabling SR-IOV. In the unlikely
>>>>>>> event that somebody was to use such a driver I wanted to make certain
>>>>>>> that the drivers_autoprobe state was pre-populated.
>>>>>
>>>>> Good point, but maybe that just means we should be setting it in
>>>>> sriov_enable()?
>>>>
>>>> I suppose we could. I would just have to check and see if we have any
>>>> drivers lurking out there that are supporting SR-IOV without
>>>> supporting the sysfs approach. As long as that is the case we could
>>>> probably put it there.
>>>>
>>>>>>>>> +	}
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>>  /**
>>>>>>>>> @@ -703,6 +707,27 @@ void pci_disable_sriov(struct pci_dev *dev)
>>>>>>>>>  EXPORT_SYMBOL_GPL(pci_disable_sriov);
>>>>>>>>>
>>>>>>>>>  /**
>>>>>>>>> + * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV
>>>>>>>>> + * @dev: the PCI device
>>>>>>>>> + * @nr_virtfn: number of virtual functions to enable, 0 to disable
>>>>>>>>> + *
>>>>>>>>> + * Used to provide generic enable/disable SR-IOV option for devices
>>>>>>>>> + * that do not manage the VFs generated by their driver, or have no
>>>>>>>>> + * driver present.
>>>>>>>>> + */
>>>>>>>>> +int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn)
>>>>>>>>> +{
>>>>>>>>> +	int rc = 0;
>>>>>>>>> +
>>>>>>>>> +	if (!nr_virtfn)
>>>>>>>>> +		pci_disable_sriov(dev);
>>>>>>>>> +	else
>>>>>>>>> +		rc = pci_enable_sriov(dev, nr_virtfn);
>>>>>>>>> +
>>>>>>>>> +	return rc ? rc : nr_virtfn;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>>   * pci_num_vf - return number of VFs associated with a PF
>>>>>>>>>   * @dev: the PCI device
>>>>>>>>>   *
>>>>>>>>> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
>>>>>>>>> index 3bed6beda051..2cc68dff6130 100644
>>>>>>>>> --- a/drivers/pci/pci-driver.c
>>>>>>>>> +++ b/drivers/pci/pci-driver.c
>>>>>>>>> @@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev)
>>>>>>>>>  #ifdef CONFIG_PCI_IOV
>>>>>>>>>  static inline bool pci_device_can_probe(struct pci_dev *pdev)
>>>>>>>>>  {
>>>>>>>>> -	return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe);
>>>>>>>>> +	return (!pdev->is_virtfn || pdev->physfn->sriov->autoprobe);
>>>>>>>>>  }
>>>>>>>>>  #else
>>>>>>>>>  static inline bool pci_device_can_probe(struct pci_dev *pdev)
>>>>>>>>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>>>>>>>>> index eb6bee8724cc..e701b6dbc267 100644
>>>>>>>>> --- a/drivers/pci/pci-sysfs.c
>>>>>>>>> +++ b/drivers/pci/pci-sysfs.c
>>>>>>>>> @@ -605,6 +605,7 @@ static ssize_t sriov_numvfs_store(struct device *dev,
>>>>>>>>>  					  struct device_attribute *attr,
>>>>>>>>>  					  const char *buf, size_t count)
>>>>>>>>>  {
>>>>>>>>> +	int (*sriov_configure)(struct pci_dev *dev, int num_vfs);
>>>>>>>>>  	struct pci_dev *pdev = to_pci_dev(dev);
>>>>>>>>>  	int ret;
>>>>>>>>>  	u16 num_vfs;
>>>>>>>>> @@ -622,15 +623,20 @@ static ssize_t sriov_numvfs_store(struct device *dev,
>>>>>>>>>  		goto exit;
>>>>>>>>>
>>>>>>>>>  	/* is PF driver loaded w/callback */
>>>>>>>>> -	if (!pdev->driver || !pdev->driver->sriov_configure) {
>>>>>>>>> -		pci_info(pdev, "Driver doesn't support SRIOV configuration via sysfs\n");
>>>>>>>>> -		ret = -ENOENT;
>>>>>>>>> -		goto exit;
>>>>>>>>> -	}
>>>>>>>>> +	if (pdev->driver && pdev->driver->sriov_configure)
>>>>>>>>> +		sriov_configure = pdev->driver->sriov_configure;
>>>>>>>>> +	else
>>>>>>>>> +		sriov_configure = pci_sriov_configure_unmanaged;
>>>>>>>>
>>>>>>>> So an unwitting user is now able to enable vfs, independent of the
>>>>>>>> pf... the trouble being that they're probably going to expect them
>>>>>>>> to work and the more common case is that they won't. For instance,
>>>>>>>> what can you do with an igbvf when igb isn't managing the pf?
>>>>>>>
>>>>>>> Well the VFs wouldn't be able to do anything. Basically they would be
>>>>>>> sitting there with no driver loaded on them unless they are assigned
>>>>>>> to a guest, or the root user had enabled the unmanaged option. If you
>>>>>>> did load a driver on it the VF would sit there with link down unless
>>>>>>> either the PF driver is loaded or some user-space entity steps in to
>>>>>>> start managing the PF.
>>>>>>>
>>>>>>> In reality this can already happen, as last I recall igb and ixgbe
>>>>>>> were already capable of having the PF driver removed when SR-IOV was
>>>>>>> enabled and VFs were assigned. Basically the VFs just report link
>>>>>>> down and don't do anything. Reloading the PF driver would have it
>>>>>>> take over in the case of igb and ixgbe since they were designed to
>>>>>>> handle that type of scenario.
>>>>>
>>>>> I think only ixgbe behaves this way between the two; igb disables
>>>>> sriov in its remove function unless we're hung up by that silly
>>>>> pci_vfs_assigned() check, which doesn't apply to vfio assignment.
>>>>> Regardless, yes, some pf drivers do leave sriov enabled on remove;
>>>>> whether that's useful or reasonable is another question.
>>>>
>>>> Right. We can argue that another day. I was just citing the igb
>>>> behavior.
>>>>
>>>>>>>> Or what happens when vfio-pci owns the pf, sriov is enabled via the
>>>>>>>> unmanaged interface, and the pf user driver segfaults and gets
>>>>>>>> killed, causing vfio-pci to restore the pf state, including wiping
>>>>>>>> the sriov config?
>>>>>>>
>>>>>>> Wiping the config shouldn't have any effect on the allocated VF pci
>>>>>>> devices. It will cause the VFs to act as though they have fallen off
>>>>>>> of the bus though, and the guests would see a surprise remove type
>>>>>>> behavior if I am not mistaken. The MMIO and config accesses would
>>>>>>> fail until the SR-IOV configuration is restored. Really this
>>>>>>> shouldn't be a problem as long as the SR-IOV is enabled prior to
>>>>>>> loading the vfio-pci driver, as I would assume it would restore the
>>>>>>> state and re-enable SR-IOV.
>>>>>
>>>>> Nope, sriov does not appear to be part of the device state saved and
>>>>> restored around reset, so I think we just end up in an inconsistent
>>>>> state. Also, vfio-pci stores a copy of the saved state of a device
>>>>> prior to giving the user access to the device and tries to restore
>>>>> that copy when released by the user, so anything that's part of the
>>>>> saved state and modified via side-channels while the device is opened
>>>>> would still be lost.
>>>>
>>>> Well it is but it isn't. The pci_restore_state will call
>>>> pci_restore_iov_state which will just repopulate the data straight out
>>>> of the pdev->sriov structure. I think the assumption is we were
>>>> already carrying enough information that there wasn't any point in
>>>> saving the state since we could just restore it from that structure
>>>> anyway.
>>>
>>> Ah, right, I remember that now.
>>>
>>>> I've been chasing it down and can verify that much this morning after
>>>> testing. A regular hit of the sysfs reset control will not erase the
>>>> state.
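[Aside, for readers following along: the restore path being referred to
can be sketched roughly as below. This is a simplified illustration of
the behavior described above, not verbatim kernel code.]

/* Simplified sketch: pci_restore_state() -> pci_restore_iov_state()
 * repopulates the SR-IOV registers from the in-kernel pdev->sriov
 * bookkeeping rather than from saved config space. */
static void sriov_restore_state_sketch(struct pci_dev *dev)
{
        struct pci_sriov *iov = dev->sriov;

        /* rewrite the capability registers from the retained values */
        pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
        pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, iov->num_VFs);
        pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
        /* ...followed by a delay for the VFs to come back... */
}

[Which is why a reset does not permanently lose the configuration: the
source of truth lives in pdev->sriov, which survives the reset.]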
>>>
>>> Perhaps long term, but momentarily the vf disappears and that could
>>> imply data loss, no?
>>>
>>>> One issue that I think I found that may be a bug though is if I load
>>>> the vfio-pci driver on an interface and unbind it I am seeing the port
>>>> state reset. I have it chased down to the idle D3 code. It looks like
>>>> going from vfio_pci_probe to vfio_pci_remove is triggering the
>>>> equivalent of the pci_pm_reset since it is cycling me through
>>>> D0->D3->D0 without restoring state after the fact. I also verified
>>>> that setting disable_idle_d3 resolves the issue. Would you have any
>>>> complaints about me doing a save_state in the probe call, and a
>>>> restore_state in the remove? It seems like that should probably be the
>>>> easiest fix.
>>>
>>> Hmm, perhaps I had assumed that pci_set_power_state() would handle
>>> such things, but indeed I do see other drivers calling pci_save_state()
>>> prior and pci_restore_state() after. I think we'd need to audit all of
>>> vfio-pci's calls to pci_set_power_state(), not just probe and remove.
>>>
>>>>>>> In the grand scheme of things how would the situation you describe
>>>>>>> be any different than someone using the "reset" sysfs control on
>>>>>>> the PF while using SR-IOV with driver supported SR-IOV?
>>>>>
>>>>> Similar, but the pf driver has enabled sriov and can register the
>>>>> reset_done error handler to be notified to re-enable sriov or perform
>>>>> other cleanup. If the pf driver is not participating in sriov, it
>>>>> would seem exceptional to support this. Triggering reset via sysfs
>>>>> also falls more closely to shooting yourself in the foot than I think
>>>>> we want to design into driver/subsystem APIs.
>>>>
>>>> To some extent I agree with the shooting yourself in the foot. At the
>>>> same time there isn't actually all that much to re-enable SR-IOV:
>>>> restoring the PCI SR-IOV configuration space and re-enabling bus
>>>> master.
>>>
>>> Bus master is a pretty interesting example of the trivial control a pf
>>> driver can exert on the vfs though.
>>>
>>>> The Amazon guys would probably know better than I since I haven't
>>>> really worked much with one of these parts yet. Actually the virtio
>>>> that Mark pushed may behave the same way too. As far as I know in
>>>> these firmware cases the hardware itself has everything
>>>> pre-partitioned and set to re-enable as soon as the SR-IOV bits are
>>>> set. I think all they need is a few bits toggled and they are pretty
>>>> much good to go.
>>>
>>> Are they just looking for an sriov capable stub driver? With
>>> increasing vf counts, being able to use something like vfio-pci on the
>>> pf seems like all risk with statistically insignificant increase in
>>> density. On the other hand, if there's a userspace pf management
>>> driver, why not just make it trusted by adding it as a native host
>>> kernel driver? If we're talking about tainting the host kernel to
>>> enable this interaction, maybe it should just be tainted by an out of
>>> tree, possibly non-gpl host pf driver anyway. There can't really be a
>>> case where the pf would be used by an average user without some degree
>>> of privilege or cooperation, right?
>>>
>>>>>>> I suppose if you really wanted we could add a new call that you
>>>>>>> could put into the sriov_configure pointer that would just make it
>>>>>>> always return error. Then that way the configuration could be
>>>>>>> locked until the driver is unloaded.
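[Aside: the "always return error" lock is tiny, and it would also serve
as the dummy callback Alex asked about earlier for a driver that
explicitly does not want sriov enabled. A hedged sketch, with an
illustrative name:]

/* Illustrative: refuse all sysfs SR-IOV changes, locking the current
 * configuration for as long as this driver is bound. */
static int foo_sriov_configure_locked(struct pci_dev *dev, int nr_virtfn)
{
        return -EPERM;  /* writes to sriov_numvfs fail outright */
}

[Because pdev->driver->sriov_configure would then be non-NULL, the
unmanaged fallback in sriov_numvfs_store() above is never selected.]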
>>>>>>>>
>>>>>>>> I guess I don't understand how vfs can operate fully independent
>>>>>>>> of the pf and why vfio-pci wouldn't just implement a dummy
>>>>>>>> sriov_configure to avoid contending with such issues.
>>>>>>>
>>>>>>> This approach isn't without risk, but it isn't as if it is really a
>>>>>>> new risk we are introducing, and we could do further work to help
>>>>>>> smooth out issues like this. Much of this depends on the design of
>>>>>>> the device and the drivers involved. What is happening in this case
>>>>>>> is that the VFs are managed outside of the kernel itself either by
>>>>>>> some user-space entity or by a firmware entity.
>>>>>
>>>>> But we can't simply split vfs and pfs as separate resources; the vf
>>>>> is dependent on the pf and that implies a privilege hierarchy, if not
>>>>> at least cooperation. The pf driver can intentionally or
>>>>> unintentionally disconnect the vfs at any point in time, possibly
>>>>> leading to data loss or at least denial of service for the vf
>>>>> consumer. I also don't trust that there aren't numerous sriov
>>>>> implementations where the pf is privileged to the point of being able
>>>>> to snoop data from the vf. So what sort of usage models are we
>>>>> enabling for the vf? A firmware owned vf subject to these conditions
>>>>> seems mostly pointless.
>>>>
>>>> In the case of these sorts of parts the PF isn't really all that
>>>> privileged. Usually the PF is just a VF with the SR-IOV capability
>>>> hanging off of it. I suspect the only thing that might outright
>>>> control anything would be the bus master enable bit. Everything else
>>>> could be independent. The PF driver in such cases doesn't do much. It
>>>> is basically just the host for the configuration space.
>>>
>>> This is sounding more like an sriov capable stub driver. Certainly not
>>> all pfs behave the way you describe above; many are significantly more
>>> privileged, and even if not, bus master is a pretty trivial control
>>> point. So we probably need a driver that claims devices known to
>>> behave this way, or a meta driver that binds via dynamic IDs or
>>> driver_override. The proposal here sort of covertly turns anything
>>> that's not an sriov driver into that stub driver with no guarantee
>>> that the subverted driver is a willing or safe participant.
>>>
>>>>>> So I was thinking about this some more. In the case of vfio-pci
>>>>>> things are a bit different since you are essentially taking a given
>>>>>> device and handing it to a VM or userspace and it doesn't guarantee
>>>>>> a communication between the two.
>>>>>
>>>>> Doesn't guarantee communication or cooperation or even trust.
>>>>
>>>> Right, but at the same time I consider this to be a shoot yourself in
>>>> the foot type scenario. If you are going to hand off your PF while VFs
>>>> are active then you are asking for whatever repercussions you end up
>>>> getting. I've added a warning and a TAINT_USER flag to my code at this
>>>> point if you load an active PF in vfio, and I have added a function
>>>> that locks the setting so it cannot be changed once we place a PF in
>>>> the control of vfio-pci.
>>>>
>>>> The way I see it there are two scenarios. One where the PF is just a
>>>> VF with an extra bit of SR-IOV configuration space floating off of it.
>>>> The other is where we want to have SR-IOV enabled and have some third
>>>> party managing the PF. The first one doesn't really care about the
>>>> communication and would prefer that whatever owns the driver on the PF
>>>> just ignore the SR-IOV portion of the configuration space. The second
>>>> one actually cares and would want some sort of
>>>> communication/cooperation, but for now I don't see that as being the
>>>> easiest thing to do, so it might be best to just let it see a fixed
>>>> number of VFs that it just has to work with.
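[Aside: a hypothetical sketch of the warn-and-taint idea just described,
as it might look in a vfio-pci probe path; the placement and message are
illustrative, not the actual posted code:]

/* Hypothetical: warn and taint when binding to a PF with active VFs */
if (pdev->is_physfn && pci_num_vf(pdev)) {
        pci_warn(pdev, "binding to PF with %d VFs enabled; SR-IOV configuration is now locked\n",
                 pci_num_vf(pdev));
        add_taint(TAINT_USER, LOCKDEP_STILL_OK);
}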
>>>
>>> There's no such thing as a read-only sriov capability in the spec,
>>> which is a problem we ran into a while ago: vfio-pci exposes the
>>> capability, but protects it as read-only so the user cannot create
>>> devices on the host. QEMU passed this through to the guest, but that
>>> failed as soon as OVMF started supporting sriov, as it's unable to
>>> size the VF BAR resources. Now QEMU drops the sriov capability from
>>> the guest capability chain. So it's not clear how this fixed number
>>> of vfs plan works. Are we inventing our own capability for this? If
>>> the pf driver is just a userspace driver and not a VM, maybe what we
>>> do now is sufficient; otherwise there's no standard for exposing
>>> fixed vfs.
>>>
>>>>>> My thought is to look at making SR-IOV configuration static or
>>>>>> treat it as read-only when the vfio-pci driver is loaded on a given
>>>>>> interface. In addition I would set the TAINT_USER flag and add a
>>>>>> warning about loading vfio-pci on an active PF, and provide the
>>>>>> number of VFs that were allocated.
>>>>>>
>>>>>> The idea with all of this being that we would at least have a
>>>>>> partial lock on all of this so that you can allocate some number of
>>>>>> VFs, and then start partitioning things up where you could assign
>>>>>> the PF and VFs as needed to the various interfaces. Once the PF is
>>>>>> assigned to the vfio-pci driver it would be locked in terms of the
>>>>>> number of VFs that are provided so there would be no need for an
>>>>>> extra communication channel between the PF driver and the host to
>>>>>> deal with changes.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Is an sriov configuration ever static? vfio is the most prolific
>>>>> user of the pci reset interfaces, AFAIK. Even if we add support for
>>>>> restoring the sriov configuration, and even if the pf user isn't
>>>>> trying to be malicious, vf users would need to be prepared for their
>>>>> device arbitrarily dropping off the bus, mmio randomly (from their
>>>>> perspective) being disabled, and maybe other side-effects of the
>>>>> user just trying to use the pf device. Is there even any guarantee
>>>>> that a pf driver can operate identically, regardless of the state of
>>>>> sriov on the pf? It seems like the pf driver needs to be aware of
>>>>> which queues are available for itself vs the vfs in some designs. Is
>>>>> there still long term value in a solution for this if the kernel is
>>>>> tainted?
>>>>
>>>> As far as the device randomly resetting, I don't really see how that
>>>> is any different from what we already have in terms of solutions,
>>>> including stuff supported by in-driver SR-IOV. The Intel drivers are
>>>> notorious for resetting for any little thing like an MTU change. Us
>>>> resetting due to someone calling open/release on an interface
>>>> wouldn't be anything new. In addition AER can always hit us too. Yes,
>>>> we don't have the same recovery mechanisms in place, but in the case
>>>> of something such as this we probably don't need that complex of a
>>>> recovery setup.
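[Aside: the reset_done notification mentioned a few messages up is part
of the standard struct pci_error_handlers. A hedged sketch of a PF
driver using it to re-enable its VFs after a reset; the foo_* names and
the saved counter are illustrative:]

/* Illustrative: re-enable SR-IOV after a reset we were notified of */
static void foo_reset_done(struct pci_dev *pdev)
{
        struct foo_priv *priv = pci_get_drvdata(pdev);  /* hypothetical */

        if (priv->num_vfs_before_reset)
                pci_enable_sriov(pdev, priv->num_vfs_before_reset);
}

static const struct pci_error_handlers foo_err_handlers = {
        .reset_done = foo_reset_done,
};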
>>>
>>> [I think the below paragraph is specifically answering the last
>>> question above: is there still long term value in a kernel tainting
>>> solution?]
>>>
>>>> I would think there probably is. If not we wouldn't have gotten the
>>>> patches earlier that were still doing the tainting and warning and
>>>> making use of the vfio-pci driver to enable SR-IOV. I suspect the use
>>>> case for all this is to enable SR-IOV, set up the PF in vfio-pci, and
>>>> then assign VFs into and out of guests. I don't see the PF doing a
>>>> lot of moving after it is set up. The taint flag only really applies
>>>> if someone is looking for support, and quite honestly I figure for
>>>> now the USER flag is appropriate for this kind of thing since we are
>>>> deferring all the networking control to the PF which is managed by
>>>> userspace.
>>>
>>> If we're going to throw up our hands and taint the kernel, then maybe
>>> we could entertain the idea of doing that in a trivial vfio-pci
>>> sriov_configure function. But does that really meet all the use cases,
>>> and what's the advantage of that vs Amazon (I guess they're the driver
>>> here) tainting the kernel with their own driver? Mellanox has tried to
>>> enable sriov in vfio-pci in the past. [Cc Ilya]
>>>
>>>>> Minimally, it seems like the pf driver (the user driver, not
>>>>> vfio-pci) needs to be a cooperating party in this exchange. How do
>>>>> we determine that, and what QoS guarantees are they providing by
>>>>> agreeing to it?
>>>>
>>>> I think the general idea here is that the user has already tested
>>>> this out and determined for themselves that the user driver does what
>>>> they want/need. If it didn't they wouldn't be going to all these
>>>> lengths to set this up.
>>>>
>>>>> Otherwise it doesn't seem like sriov provides the sort of static,
>>>>> hard partitioning of the device that you're asking for. A vf is a
>>>>> piece of the pf and we've released ownership and control of the pf
>>>>> to a user. Thanks,
>>>>>
>>>>> Alex
>>>>
>>>> I am pretty sure that what you are describing is true of some, but
>>>> not for all. I think the Amazon solutions and the virtio solution are
>>>> doing hard partitioning of the part. I will leave it to those guys to
>>>> speak for themselves since I don't know anything about the hardware
>>>> design of those parts.
>>>
>>> I think we'd need device specific knowledge and enablement to be able
>>> to take advantage of any hardware partitioning; otherwise we need to
>>> assume the pf is privileged, as implemented in other sriov devices.
>>>
>>> I'm also trying to imagine whether there's a solution via the new
>>> vfio/mdev interface, where the mdev vendor driver would bind to the pf
>>> and effectively present itself as the mdev device. The vendor driver
>>> could provide sriov capabilities and bear the burden of ensuring that
>>> the pf is used cooperatively. The only existing mdev vendor drivers
>>> are vGPUs and rely on on-device DMA translation and isolation, such as
>>> through GTTs, but there have been some thoughts on providing IOMMU
>>> based isolation of mdev/sriov mixed devices (assuming DMA is even
>>> required for userspace management of the pf in this use case).
>>> [Cc Kirti] Thanks,
>>>
>>> Alex
>>
>> Apologies for getting to this party late.
>>
>> My 2 cents:
>>
>> I think a stub driver is needed for security reasons. Multiple
>> reasons/cases stated in the thread to (continue to) follow that model.
>> ... and wondering why pci-stub isn't enhanced to do just that, since
>> it's already used in device-assignment tasks (VM or DPDK).
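[Aside: a hedged sketch of what "enhancing pci-stub to do just that"
could amount to: wiring a generic enable/disable into its struct
pci_driver, mirroring the pci_sriov_configure_unmanaged helper from the
patch. Illustrative only, not the actual pci-stub source:]

static int pci_stub_sriov_configure(struct pci_dev *dev, int nr_virtfn)
{
        if (!nr_virtfn) {
                pci_disable_sriov(dev);
                return 0;
        }

        return pci_enable_sriov(dev, nr_virtfn) ? : nr_virtfn;
}

static struct pci_driver stub_driver = {
        .name            = "pci-stub",
        .id_table        = NULL,        /* IDs added at runtime via new_id */
        .probe           = pci_stub_probe,
        .sriov_configure = pci_stub_sriov_configure,
};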
>
> There was an attempt to update pci-stub:
> https://patchwork.kernel.org/patch/10109935/
>
> One issue with using pci-stub is it doesn't address the security
> issues needed if an IOMMU is enabled. For that reason I think vfio-pci
> might be a better choice, and that is why my v2 is based around it.

Sorry, not seeing why pci-stub has an IOMMU issue and vfio-pci does not,
at least from a security point of view. vfio-pci is far more savvy wrt
IOMMU groups, and that can be viewed as a security check, so if that's
what you mean, I'll buy that vowel. Otherwise, please shine (more) light
on yonder window ...

> For my v2 of this patch set I dropped the pure generic solution and
> instead just updated virtio and vfio-pci in order to meet the needs as
> called out in the previous patch sets.

virtio is needed? ... More stuff I need to dig into...

>> -- pf devices that have drivers that coordinate don't need pci-stub;
>> pf devices w/no drivers get assigned to pci-stub, then VF's are
>> enabled/disabled/tracked/managed....
>
> Yeah, my first attempt at this was to avoid doing what I consider to
> be somewhat ridiculous things like adding SR-IOV support to
> virtio_net. It didn't seem right to me, but in the grand scheme of
> things that ends up kind of being the way we have to go in order to
> put some boundaries on when/where SR-IOV can be enabled.

It seems to me we are bending over backwards for an unprivileged PF
driver use case that is the source of all this muck... IMO... :-p

>> mdev wrapping of VFs doesn't (re)solve device-specific
>> needs/extensions for VF migration either. PF-device-specific needs
>> should be handled by a PF driver; if no PF driver, then assume no
>> device-specific need exists, and use the non-device-specific, common
>> pci-stub VF handling.
>
> The issue here is how do you classify that there is no common PF
> driver, or is this something that would be decided by the user?

pci bus code checks *after* full bus scan & driver-matching/load if a
(PF) pci-dev has VF's (sriov cfg block) and no PF driver attached. If
some (I don't agree should exist -- see IMO above) user-space PF wants
to be used, unconfigure the vfio-pci driver from the PF first (another
simple check). Not a privileged user? ... ah, another reason _not_ to
allow a userspace PF.

> Also I don't know if this addresses the needs for the DPDK crowd to
> enable a userspace PF driver?

and why do we suddenly think that an (unprivileged) userspace PF driver
in the *host* (not a bounded/controlled guest VM) should be allowed to
control a device? ... because DPDK is 'a good citizen'? bug free?!?
Sorry, IMO, make a PF stub for VF assignment.

>> -- even w/pci-quirk hooks/workarounds if they need to be added.
>> Finally, if an ah-ha moment is reached that concludes with the need
>> for a PF driver to handle device-specific issues, the pci-stub VF
>> functions may provide the guide/framework for such hastily put
>> together PF drivers, minimizing PF driver variances for sriov
>> handling.
>
> I'm not sure I followed this part. Are you saying the pci-stub driver
> gets used as a template essentially for creating a PF driver if
> needed?

That's what I was alluding to. Help those who think a PF stub driver
for the kernel is so onerous. ... and no, I don't agree with the idea of
a DPDK PF driver being asynchronously built and used wrt a kernel
(version). IMO, that's another fault in that strategy.
>> Although a temporary disconnect is handled/emulated well in the (e)net
>> space, I also don't think it's an issue for storage, as I would expect
>> two situations to occur wrt data-loss scenarios:
>> a) a PF device has a kernel driver, and it has to handle the up/down
>>    of a VF
>> b) the PF is quiesced -- no device-specific kernel driver -- and there
>>    is no VF loss due to PF resetting b/c there is no need to reset the
>>    PF -- pci-stub providing a stable kernel config.
>> Per VF data loss is always an issue for storage interconnects (unless
>> they are net-based, like iSCSI, RDMA, etc.), which is handled at the
>> VF driver or a layer above it.
>>
>> - Don
>
> My concern is we are starting to see a third option here: a PF that
> has a kernel driver, but knows nothing of SR-IOV as it is completely
> managed by firmware. An example being the virtio_net that somehow
> knows how to do SR-IOV (https://patchwork.kernel.org/patch/10241225/).
> My understanding is that the part in question is providing a
> lightweight PF that is essentially just another VF, but has some extra
> PCIe configuration space to it allowing for SR-IOV and AER
> functionality.

So, give it a light-wt PF driver. Maybe I'd appreciate this issue better
if I understood how 'managed by firmware' is implemented.

> In addition we are also seeing customer scenarios where things like
> DPDK are running as the PF with kernel VFs. Ideally we want to contain
> that as much as possible, and my understanding is that igb_uio
> currently isn't really providing much in the way of containment.

uh hum... q.e.d.!

> - Alex

IMO, the tail (DPDK; fw?) is wagging the dog (kernel) here. Either we
can add a light-wt PF stub driver that does VF assignment for a
conforming PF design (and just keep adding vid/did's to it over time),
or that design is taken, tweaked, and a new PF 'lightweight driver' is
created to handle the hacking/workarounds that need to be done for a
PF.

'Device-assignment' (properly) assumes the PF is in the control/host
domain, and VFs are 'given' to a non-privileged, managed domain
(managed by the control/host domain). I don't see how you maintain
security if the PF isn't in the control/host domain.