Since commit abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO
access on disabled memory") VFIO now rejects guest MMIO access when the
PCI_COMMAND_MEMORY (MSE) bit is OFF. This is however not the case for
VFs (fixed in commit ebfa440ce38b ("vfio/pci: Fix SR-IOV VF handling
with MMIO blocking")). Furthermore, on s390 where we always run with at
least a bare-metal hypervisor (LPAR) PCI_COMMAND_MEMORY, unlike Device/
Vendor IDs and BARs, is not emulated when VFs are passed-through to the
OS independently.
Based upon Bjorn's most recent comment [1], I investigated the notion of
setting is_virtfn=1 for VFs passed-through to Linux and not linked to a
parent PF (referred to as a 'detached VF' in my prior post). However,
we rapidly run into issues on how to treat an is_virtfn device with no
linked PF. Further complicating the issue is when you consider the guest
kernel has a passed-through VF but has CONFIG_PCI_IOV=n as in many
locations is_virtfn checking is ifdef'd out altogether and the device is
assumed to be an independent PCI function.
The decision made by VFIO whether to require or emulate a PCI feature
(in this case PCI_COMMAND_MEMORY) is based upon the knowledge it has
about the device, including implicit expectations of what/is not
emulated below VFIO. (ex: is it safe to read vendor/id from config
space?) -- Our firmware layer attempts similar behavior by emulating
things such as vendor/id/BAR access - without these an unlinked VF would
not be usable. But what is or is not emulated by the layer below may be
different based upon which entity is providing the emulation (vfio,
LPAR, some other hypervisor)
So, the proposal here aims to fix the immediate issue of s390
pass-through VFs becoming suddenly unusable by vfio by using a dev_flags
bit to identify a VF feature that we know is hardwired to 0 for any
VF (PCI_COMMAND_MEMORY) and de-coupling the need for emulating
PCI_COMMAND_MEMORY from the is_virtfn flag. The exact scope of is_virtfn
and physfn for bare-metal vs guest scenarios and identifying what
features are / are not emulated by the lower-level hypervisors is a much
bigger discussion independent of this limited proposal.
Changes from v3:
- Propose a dev_flags model for the MSE bit
- Set the bit for typical iov linking
- Also set the bit for s390 VFs (linked and unlinked)
- Modify vfio-pci to look at the dev_flags bit instead of is_virtfn
[1]: https://marc.info/?l=linux-pci&m=159856041930022&w=2
Matthew Rosato (3):
PCI/IOV: Mark VFs as not implementing MSE bit
s390/pci: Mark all VFs as not implementing MSE bit
vfio/pci: Decouple MSE bit checks from is_virtfn
arch/s390/pci/pci_bus.c | 5 +++--
drivers/pci/iov.c | 1 +
drivers/vfio/pci/vfio_pci_config.c | 20 +++++++++++++-------
include/linux/pci.h | 2 ++
4 files changed, 19 insertions(+), 9 deletions(-)
--
1.8.3.1
While it is true that devices with is_virtfn=1 will have an
MSE that is hard-wired to 0, this is not the only case where
we see this behavior -- For example some bare-metal hypervisors
lack MSE bit emulation for devices not setting is_virtfn (s390).
Fix this by instead checking for the newly-added
PCI_DEV_FLAGS_FORCE_COMMAND_MEM flag which directly denotes the
need for MSE bit emulation in vfio.
Signed-off-by: Matthew Rosato <[email protected]>
Reviewed-by: Niklas Schnelle <[email protected]>
---
drivers/vfio/pci/vfio_pci_config.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index d98843f..47fb3c7 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -406,7 +406,8 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_device *vdev)
* PF SR-IOV capability, there's therefore no need to trigger
* faults based on the virtual value.
*/
- return pdev->is_virtfn || (cmd & PCI_COMMAND_MEMORY);
+ return (pdev->dev_flags & PCI_DEV_FLAGS_FORCE_COMMAND_MEM) ||
+ (cmd & PCI_COMMAND_MEMORY);
}
/*
@@ -520,8 +521,9 @@ static int vfio_basic_config_read(struct vfio_pci_device *vdev, int pos,
count = vfio_default_config_read(vdev, pos, count, perm, offset, val);
- /* Mask in virtual memory enable for SR-IOV devices */
- if (offset == PCI_COMMAND && vdev->pdev->is_virtfn) {
+ /* Mask in virtual memory enable */
+ if ((offset == PCI_COMMAND) &&
+ (vdev->pdev->dev_flags & PCI_DEV_FLAGS_FORCE_COMMAND_MEM)) {
u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
u32 tmp_val = le32_to_cpu(*val);
@@ -589,9 +591,11 @@ static int vfio_basic_config_write(struct vfio_pci_device *vdev, int pos,
* shows it disabled (phys_mem/io, then the device has
* undergone some kind of backdoor reset and needs to be
* restored before we allow it to enable the bars.
- * SR-IOV devices will trigger this, but we catch them later
+ * SR-IOV devices will trigger this - for mem enable let's
+ * catch this now and for io enable it will be caught later
*/
- if ((new_mem && virt_mem && !phys_mem) ||
+ if ((new_mem && virt_mem && !phys_mem &&
+ !(pdev->dev_flags & PCI_DEV_FLAGS_FORCE_COMMAND_MEM)) ||
(new_io && virt_io && !phys_io) ||
vfio_need_bar_restore(vdev))
vfio_bar_restore(vdev);
@@ -1734,9 +1738,11 @@ int vfio_config_init(struct vfio_pci_device *vdev)
vconfig[PCI_INTERRUPT_PIN]);
vconfig[PCI_INTERRUPT_PIN] = 0; /* Gratuitous for good VFs */
-
+ }
+ if (pdev->dev_flags & PCI_DEV_FLAGS_FORCE_COMMAND_MEM) {
/*
- * VFs do no implement the memory enable bit of the COMMAND
+ * VFs and devices that set PCI_DEV_FLAGS_FORCE_COMMAND_MEM
+ * do not implement the memory enable bit of the COMMAND
* register therefore we'll not have it set in our initial
* copy of config space after pci_enable_device(). For
* consistency with PFs, set the virtual enable bit here.
--
1.8.3.1
For s390 we can have VFs that are passed-through without
the associated PF. Firmware provides an emulation layer
to allow these devices to operate independently, but is
missing emulation of the MSE bit. For these as well as
linked VFs, mark a dev_flags bit that specifies these
devices do not implement the MSE bit.
Signed-off-by: Matthew Rosato <[email protected]>
Reviewed-by: Niklas Schnelle <[email protected]>
---
arch/s390/pci/pci_bus.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/s390/pci/pci_bus.c b/arch/s390/pci/pci_bus.c
index 5967f30..73789a7 100644
--- a/arch/s390/pci/pci_bus.c
+++ b/arch/s390/pci/pci_bus.c
@@ -197,9 +197,10 @@ void pcibios_bus_add_device(struct pci_dev *pdev)
* With pdev->no_vf_scan the common PCI probing code does not
* perform PF/VF linking.
*/
- if (zdev->vfn)
+ if (zdev->vfn) {
zpci_bus_setup_virtfn(zdev->zbus, pdev, zdev->vfn);
-
+ pdev->dev_flags |= PCI_DEV_FLAGS_FORCE_COMMAND_MEM;
+ }
}
static int zpci_bus_add_device(struct zpci_bus *zbus, struct zpci_dev *zdev)
--
1.8.3.1
Per the PCIe spec, VFs cannot implement the MSE bit
AKA PCI_COMMAND_MEMORY, and it must be hard-wired to 0.
Use a dev_flags bit to signify this requirement.
Signed-off-by: Matthew Rosato <[email protected]>
---
drivers/pci/iov.c | 1 +
include/linux/pci.h | 2 ++
2 files changed, 3 insertions(+)
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b37e08c..2bec77c 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -180,6 +180,7 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
virtfn->device = iov->vf_device;
virtfn->is_virtfn = 1;
virtfn->physfn = pci_dev_get(dev);
+ virtfn->dev_flags |= PCI_DEV_FLAGS_FORCE_COMMAND_MEM;
if (id == 0)
pci_read_vf_config_common(virtfn);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8355306..9316cce 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -227,6 +227,8 @@ enum pci_dev_flags {
PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
/* Don't use Relaxed Ordering for TLPs directed at this device */
PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
+ /* Device does not implement PCI_COMMAND_MEMORY (e.g. a VF) */
+ PCI_DEV_FLAGS_FORCE_COMMAND_MEM = (__force pci_dev_flags_t) (1 << 12),
};
enum pci_irq_reroute_variant {
--
1.8.3.1
On Wed, Sep 02, 2020 at 03:46:34PM -0400, Matthew Rosato wrote:
> Per the PCIe spec, VFs cannot implement the MSE bit
> AKA PCI_COMMAND_MEMORY, and it must be hard-wired to 0.
> Use a dev_flags bit to signify this requirement.
This approach seems sensible to me, but
- This is confusing because while the spec does not use "MSE" to
refer to the Command Register "Memory Space Enable" bit
(PCI_COMMAND_MEMORY), it *does* use "MSE" in the context of the
"VF MSE" bit, which is in the PF SR-IOV Capability. But of
course, you're not talking about that here. Maybe something like
this?
For VFs, the Memory Space Enable bit in the Command Register is
hard-wired to 0.
Add a dev_flags bit to signify devices where the Command
Register Memory Space Enable bit does not control the device's
response to MMIO accesses.
- "PCI_DEV_FLAGS_FORCE_COMMAND_MEM" says something about how you
plan to *use* this, but I'd rather use a term that describes the
hardware, e.g., "PCI_DEV_FLAGS_NO_COMMAND_MEMORY".
- How do we decide whether to use dev_flags vs a bitfield like
dev->is_virtfn? The latter seems simpler unless there's a reason
to use dev_flags. If there's a reason, maybe we could add a
comment at pci_dev_flags for future reference.
- Wrap the commit log to fill a 75-char line. It's arbitrary, but
that's what I use for consistency.
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> drivers/pci/iov.c | 1 +
> include/linux/pci.h | 2 ++
> 2 files changed, 3 insertions(+)
>
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index b37e08c..2bec77c 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -180,6 +180,7 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> virtfn->device = iov->vf_device;
> virtfn->is_virtfn = 1;
> virtfn->physfn = pci_dev_get(dev);
> + virtfn->dev_flags |= PCI_DEV_FLAGS_FORCE_COMMAND_MEM;
>
> if (id == 0)
> pci_read_vf_config_common(virtfn);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 8355306..9316cce 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -227,6 +227,8 @@ enum pci_dev_flags {
> PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
> /* Don't use Relaxed Ordering for TLPs directed at this device */
> PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> + /* Device does not implement PCI_COMMAND_MEMORY (e.g. a VF) */
> + PCI_DEV_FLAGS_FORCE_COMMAND_MEM = (__force pci_dev_flags_t) (1 << 12),
> };
>
> enum pci_irq_reroute_variant {
> --
> 1.8.3.1
>
On 9/3/20 12:41 PM, Bjorn Helgaas wrote:
> On Wed, Sep 02, 2020 at 03:46:34PM -0400, Matthew Rosato wrote:
>> Per the PCIe spec, VFs cannot implement the MSE bit
>> AKA PCI_COMMAND_MEMORY, and it must be hard-wired to 0.
>> Use a dev_flags bit to signify this requirement.
>
> This approach seems sensible to me, but
>
> - This is confusing because while the spec does not use "MSE" to
> refer to the Command Register "Memory Space Enable" bit
> (PCI_COMMAND_MEMORY), it *does* use "MSE" in the context of the
> "VF MSE" bit, which is in the PF SR-IOV Capability. But of
> course, you're not talking about that here. Maybe something like
> this?
>
> For VFs, the Memory Space Enable bit in the Command Register is
> hard-wired to 0.
>
> Add a dev_flags bit to signify devices where the Command
> Register Memory Space Enable bit does not control the device's
> response to MMIO accesses.
Will do. I'll change the usage of the MSE acronym in the other patches
as well.
>
> - "PCI_DEV_FLAGS_FORCE_COMMAND_MEM" says something about how you
> plan to *use* this, but I'd rather use a term that describes the
> hardware, e.g., "PCI_DEV_FLAGS_NO_COMMAND_MEMORY".
Sure, I will change.
>
> - How do we decide whether to use dev_flags vs a bitfield like
> dev->is_virtfn? The latter seems simpler unless there's a reason
> to use dev_flags. If there's a reason, maybe we could add a
> comment at pci_dev_flags for future reference.
>
Something like:
/*
* Device does not implement PCI_COMMAND_MEMORY - this is true for any
* device marked is_virtfn, but is also true for any VF passed-through
* a lower-level hypervisor where emulation of the Memory Space Enable
* bit was not provided.
*/
PCI_DEV_FLAGS_NO_COMMAND_MEMORY = (__force pci_dev_flags_t) (1 << 12),
?
> - Wrap the commit log to fill a 75-char line. It's arbitrary, but
> that's what I use for consistency.
Sure, will do. I'll roll up a new version once I have feedback from
Alex on the vfio changes.
>
>> Signed-off-by: Matthew Rosato <[email protected]>
>> ---
>> drivers/pci/iov.c | 1 +
>> include/linux/pci.h | 2 ++
>> 2 files changed, 3 insertions(+)
>>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index b37e08c..2bec77c 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -180,6 +180,7 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
>> virtfn->device = iov->vf_device;
>> virtfn->is_virtfn = 1;
>> virtfn->physfn = pci_dev_get(dev);
>> + virtfn->dev_flags |= PCI_DEV_FLAGS_FORCE_COMMAND_MEM;
>>
>> if (id == 0)
>> pci_read_vf_config_common(virtfn);
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 8355306..9316cce 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -227,6 +227,8 @@ enum pci_dev_flags {
>> PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
>> /* Don't use Relaxed Ordering for TLPs directed at this device */
>> PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
>> + /* Device does not implement PCI_COMMAND_MEMORY (e.g. a VF) */
>> + PCI_DEV_FLAGS_FORCE_COMMAND_MEM = (__force pci_dev_flags_t) (1 << 12),
>> };
>>
>> enum pci_irq_reroute_variant {
>> --
>> 1.8.3.1
>>
On Thu, Sep 03, 2020 at 01:10:02PM -0400, Matthew Rosato wrote:
> On 9/3/20 12:41 PM, Bjorn Helgaas wrote:
> > - How do we decide whether to use dev_flags vs a bitfield like
> > dev->is_virtfn? The latter seems simpler unless there's a reason
> > to use dev_flags. If there's a reason, maybe we could add a
> > comment at pci_dev_flags for future reference.
>
> Something like:
>
> /*
> * Device does not implement PCI_COMMAND_MEMORY - this is true for any
> * device marked is_virtfn, but is also true for any VF passed-through
> * a lower-level hypervisor where emulation of the Memory Space Enable
> * bit was not provided.
> */
> PCI_DEV_FLAGS_NO_COMMAND_MEMORY = (__force pci_dev_flags_t) (1 << 12),
Sorry, I wasn't clear about this. I was trying to suggest that if
there are some situations where we need to use pci_dev_flags instead
of a bitfield, it would be useful to have a generic comment to help
decide between them.
I don't know that there *is* a good reason, and unless somebody can
think of one, I'd like to get rid of pci_dev_flags completely and
convert them all to bitfields.
Given that, my preference would be to just add a new bitfield,
something like this:
struct pci_dev {
...
unsigned int no_command_memory:1; /* No PCI_COMMAND_MEMORY */
On Thu, 3 Sep 2020 13:10:02 -0400
Matthew Rosato <[email protected]> wrote:
> On 9/3/20 12:41 PM, Bjorn Helgaas wrote:
> > On Wed, Sep 02, 2020 at 03:46:34PM -0400, Matthew Rosato wrote:
> >> Per the PCIe spec, VFs cannot implement the MSE bit
> >> AKA PCI_COMMAND_MEMORY, and it must be hard-wired to 0.
> >> Use a dev_flags bit to signify this requirement.
> >
> > This approach seems sensible to me, but
> >
> > - This is confusing because while the spec does not use "MSE" to
> > refer to the Command Register "Memory Space Enable" bit
> > (PCI_COMMAND_MEMORY), it *does* use "MSE" in the context of the
> > "VF MSE" bit, which is in the PF SR-IOV Capability. But of
> > course, you're not talking about that here. Maybe something like
> > this?
> >
> > For VFs, the Memory Space Enable bit in the Command Register is
> > hard-wired to 0.
> >
> > Add a dev_flags bit to signify devices where the Command
> > Register Memory Space Enable bit does not control the device's
> > response to MMIO accesses.
>
> Will do. I'll change the usage of the MSE acronym in the other patches
> as well.
>
> >
> > - "PCI_DEV_FLAGS_FORCE_COMMAND_MEM" says something about how you
> > plan to *use* this, but I'd rather use a term that describes the
> > hardware, e.g., "PCI_DEV_FLAGS_NO_COMMAND_MEMORY".
>
> Sure, I will change.
>
> >
> > - How do we decide whether to use dev_flags vs a bitfield like
> > dev->is_virtfn? The latter seems simpler unless there's a reason
> > to use dev_flags. If there's a reason, maybe we could add a
> > comment at pci_dev_flags for future reference.
> >
>
> Something like:
>
> /*
> * Device does not implement PCI_COMMAND_MEMORY - this is true for any
> * device marked is_virtfn, but is also true for any VF passed-through
> * a lower-level hypervisor where emulation of the Memory Space Enable
> * bit was not provided.
> */
> PCI_DEV_FLAGS_NO_COMMAND_MEMORY = (__force pci_dev_flags_t) (1 << 12),
>
> ?
>
> > - Wrap the commit log to fill a 75-char line. It's arbitrary, but
> > that's what I use for consistency.
>
> Sure, will do. I'll roll up a new version once I have feedback from
> Alex on the vfio changes.
The usage of MSE threw me a bit too, as Bjorn notes that's specific to
the SR-IOV capability. I think this also uncovers a latent bug in our
calling of vfio_bar_restore(), it really doesn't do a good job of
determining whether an enable bit is implemented, regardless of whether
it's a VF or the device simply doesn't use that address space. For
example I imagine you could reproduce triggering a reset recovery on
s390 by trying to write the VF command register to 1 with setpci from a
guest (since you won't have is_virtfn to bail out of the recovery
function). I think we'll still need this dev_flag to differentiate
unimplmented and enabled versus simply unimplemented to resolve that
though, so the change looks ok to me. Thanks,
Alex
> >> Signed-off-by: Matthew Rosato <[email protected]>
> >> ---
> >> drivers/pci/iov.c | 1 +
> >> include/linux/pci.h | 2 ++
> >> 2 files changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> >> index b37e08c..2bec77c 100644
> >> --- a/drivers/pci/iov.c
> >> +++ b/drivers/pci/iov.c
> >> @@ -180,6 +180,7 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> >> virtfn->device = iov->vf_device;
> >> virtfn->is_virtfn = 1;
> >> virtfn->physfn = pci_dev_get(dev);
> >> + virtfn->dev_flags |= PCI_DEV_FLAGS_FORCE_COMMAND_MEM;
> >>
> >> if (id == 0)
> >> pci_read_vf_config_common(virtfn);
> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
> >> index 8355306..9316cce 100644
> >> --- a/include/linux/pci.h
> >> +++ b/include/linux/pci.h
> >> @@ -227,6 +227,8 @@ enum pci_dev_flags {
> >> PCI_DEV_FLAGS_NO_FLR_RESET = (__force pci_dev_flags_t) (1 << 10),
> >> /* Don't use Relaxed Ordering for TLPs directed at this device */
> >> PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 11),
> >> + /* Device does not implement PCI_COMMAND_MEMORY (e.g. a VF) */
> >> + PCI_DEV_FLAGS_FORCE_COMMAND_MEM = (__force pci_dev_flags_t) (1 << 12),
> >> };
> >>
> >> enum pci_irq_reroute_variant {
> >> --
> >> 1.8.3.1
> >>
>