2022-09-13 13:29:24

by Ajay Kaher

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor


Note: Corrected the Subject.

> On 07/09/22, 8:50 PM, "Vitaly Kuznetsov" <[email protected]> wrote:
>
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index ddb7986..1e5a8f7 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -20,6 +20,7 @@
>> #include <asm/pci_x86.h>
>> #include <asm/setup.h>
>> #include <asm/irqdomain.h>
>> +#include <asm/hypervisor.h>
>>
>> unsigned int pci_probe = PCI_PROBE_BIOS | PCI_PROBE_CONF1 | PCI_PROBE_CONF2 |
>> PCI_PROBE_MMCONF;
>> @@ -57,14 +58,58 @@ int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>> return -EINVAL;
>> }
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +static int vm_raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>> + int reg, int len, u32 *val)
>> +{
>> + if (raw_pci_ext_ops)
>> + return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>> + if (domain == 0 && reg < 256 && raw_pci_ops)
>> + return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>> + return -EINVAL;
>> +}
>> +
>> +static int vm_raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>> + int reg, int len, u32 val)
>> +{
>> + if (raw_pci_ext_ops)
>> + return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>> + if (domain == 0 && reg < 256 && raw_pci_ops)
>> + return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>> + return -EINVAL;
>> +}
>
> These look exactly like raw_pci_read()/raw_pci_write() but with inverted
> priority. We could've added a parameter but to be more flexible, I'd
> suggest we add a 'priority' field to 'struct pci_raw_ops' and make
> raw_pci_read()/raw_pci_write() check it before deciding what to use
> first. To be on the safe side, you can leave raw_pci_ops's priority
> higher than raw_pci_ext_ops's by default and only tweak it in
> arch/x86/kernel/cpu/vmware.c

Thanks Vitaly for your response.

1. We have multiple objects of struct pci_raw_ops. 2. Adding a 'priority' field to struct pci_raw_ops
doesn't seem appropriate: the decision is which object of struct pci_raw_ops has to be used,
not something within struct pci_raw_ops itself.

It's a generic solution for all hypervisors (sorry for the earlier wrong Subject), not specific to VMware.

We are still looking for feedback on whether it impacts any hypervisor.

-Ajay




2022-09-13 13:59:44

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

Ajay Kaher <[email protected]> writes:

> Note: Corrected the Subject.
>
>> On 07/09/22, 8:50 PM, "Vitaly Kuznetsov" <[email protected]> wrote:
>>
>>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>>> index ddb7986..1e5a8f7 100644
>>> --- a/arch/x86/pci/common.c
>>> +++ b/arch/x86/pci/common.c
>>> @@ -20,6 +20,7 @@
>>> #include <asm/pci_x86.h>
>>> #include <asm/setup.h>
>>> #include <asm/irqdomain.h>
>>> +#include <asm/hypervisor.h>
>>>
>>> unsigned int pci_probe = PCI_PROBE_BIOS | PCI_PROBE_CONF1 | PCI_PROBE_CONF2 |
>>> PCI_PROBE_MMCONF;
>>> @@ -57,14 +58,58 @@ int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> return -EINVAL;
>>> }
>>>
>>> +#ifdef CONFIG_HYPERVISOR_GUEST
>>> +static int vm_raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> + int reg, int len, u32 *val)
>>> +{
>>> + if (raw_pci_ext_ops)
>>> + return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>> + if (domain == 0 && reg < 256 && raw_pci_ops)
>>> + return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>> + return -EINVAL;
>>> +}
>>> +
>>> +static int vm_raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> + int reg, int len, u32 val)
>>> +{
>>> + if (raw_pci_ext_ops)
>>> + return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>> + if (domain == 0 && reg < 256 && raw_pci_ops)
>>> + return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>> + return -EINVAL;
>>> +}
>>
>> These look exactly like raw_pci_read()/raw_pci_write() but with inverted
>> priority. We could've added a parameter but to be more flexible, I'd
>> suggest we add a 'priority' field to 'struct pci_raw_ops' and make
>> raw_pci_read()/raw_pci_write() check it before deciding what to use
>> first. To be on the safe side, you can leave raw_pci_ops's priority
>> higher than raw_pci_ext_ops's by default and only tweak it in
>> arch/x86/kernel/cpu/vmware.c
>
> Thanks Vitaly for your response.
>
> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
> to be used, not something with-in struct pci_raw_ops.

I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
something like (completely untested):

diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
index 70533fdcbf02..fb8270fa6c78 100644
--- a/arch/x86/include/asm/pci_x86.h
+++ b/arch/x86/include/asm/pci_x86.h
@@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
extern bool mp_should_keep_irq(struct device *dev);

struct pci_raw_ops {
+ int rating;
int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
int reg, int len, u32 *val);
int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index ddb798603201..e9965fd11576 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
int reg, int len, u32 *val)
{
- if (domain == 0 && reg < 256 && raw_pci_ops)
+ if (domain == 0 && reg < 256 && raw_pci_ops &&
+ (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
if (raw_pci_ext_ops)
return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
@@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
int reg, int len, u32 val)
{
- if (domain == 0 && reg < 256 && raw_pci_ops)
+ if (domain == 0 && reg < 256 && raw_pci_ops &&
+ (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
if (raw_pci_ext_ops)
return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);

and then somewhere in the VMware hypervisor initialization code
(arch/x86/kernel/cpu/vmware.c) you do

raw_pci_ext_ops->rating = 100;

why wouldn't it work?

(disclaimer: completely untested; raw_pci_ops/raw_pci_ext_ops
initialization has to be checked so that 'rating' is not garbage).

>
> It's a generic solution for all hypervisor (sorry for earlier wrong
> Subject), not specific to VMware. Further looking for feedback if it's
> impacting to any hypervisor.

That's the tricky part. We can check modern hypervisor versions, but
what about all the other versions in existence? How can we know that there's
no QEMU/Hyper-V/... version out there where the MMIO path is broken? I'd
suggest we limit the change to the VMware hypervisor; other hypervisors may
use the same mechanism (like the one above) later (but the person
proposing the patch is always responsible for the research on why it is
safe to do so).

--
Vitaly

2022-09-29 05:48:28

by Ajay Kaher

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor


> On 13/09/22, 7:05 PM, "Vitaly Kuznetsov" <[email protected]> wrote:
>>
>> Thanks Vitaly for your response.
>>
>> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
>> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
>> to be used, not something with-in struct pci_raw_ops.
>
> I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
> which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
> something like (completely untested):
>
> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
> index 70533fdcbf02..fb8270fa6c78 100644
> --- a/arch/x86/include/asm/pci_x86.h
> +++ b/arch/x86/include/asm/pci_x86.h
> @@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
> extern bool mp_should_keep_irq(struct device *dev);
>
> struct pci_raw_ops {
> + int rating;
> int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
> int reg, int len, u32 *val);
> int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index ddb798603201..e9965fd11576 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
> int reg, int len, u32 *val)
> {
> - if (domain == 0 && reg < 256 && raw_pci_ops)
> + if (domain == 0 && reg < 256 && raw_pci_ops &&
> + (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
> return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
> if (raw_pci_ext_ops)
> return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
> int reg, int len, u32 val)
> {
> - if (domain == 0 && reg < 256 && raw_pci_ops)
> + if (domain == 0 && reg < 256 && raw_pci_ops &&
> + (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
> return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
> if (raw_pci_ext_ops)
> return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>
> and then somewhere in Vmware hypervisor initialization code
> (arch/x86/kernel/cpu/vmware.c) you do
>
> raw_pci_ext_ops->rating = 100;

Thanks Vitaly, for your review and helping us to improve the code.

I was working to make the changes as you suggested, but before sending v3 I would like to
discuss the following:

If we add 'rating' within struct pci_raw_ops, then we can't keep pci_mmcfg as const,
and the following change becomes necessary in arch/x86/pci/mmconfig_64.c:

-const struct pci_raw_ops pci_mmcfg = {
+struct pci_raw_ops pci_mmcfg = {
.read = pci_mmcfg_read,
.write = pci_mmcfg_write,
};

To avoid this change, is it fine to have a global bool prefer_raw_pci_ext_ops?

And raw_pci_read() will have following change:

- if (domain == 0 && reg < 256 && raw_pci_ops)
+ if (domain == 0 && reg < 256 && raw_pci_ops &&
+ if (domain == 0 && reg < 256 && raw_pci_ops &&
+ (!prefer_raw_pci_ext_ops || !raw_pci_ext_ops))

>
> why wouldn't it work?
>
> (diclaimer: completely untested, raw_pci_ops/raw_pci_ext_ops
> initialization has to be checked so 'rating' is not garbage).
>
>>
>> It's a generic solution for all hypervisor (sorry for earlier wrong
>> Subject), not specific to VMware. Further looking for feedback if it's
>> impacting to any hypervisor.
>
> That's the tricky part. We can check modern hypervisor versions, but
> what about all other versions in existence? How can we know that there's
> no QEMU/Hyper-V/... version out there where MMIO path is broken? I'd
> suggest we limit the change to Vmware hypervisor, other hypervisors may
> use the same mechanism (like the one above) later (but the person
> suggesting the patch is always responsible for the research why it is
> safe to do so).

OK, for now we will make this change specific to the VMware hypervisor.

- Ajay



2022-09-29 09:44:55

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor


On 29.09.22 07:36, Ajay Kaher wrote:
>> On 13/09/22, 7:05 PM, "Vitaly Kuznetsov" <[email protected]> wrote:
>>> Thanks Vitaly for your response.
>>>
>>> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
>>> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
>>> to be used, not something with-in struct pci_raw_ops.
>> I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
>> which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
>> something like (completely untested):
>>
>> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
>> index 70533fdcbf02..fb8270fa6c78 100644
>> --- a/arch/x86/include/asm/pci_x86.h
>> +++ b/arch/x86/include/asm/pci_x86.h
>> @@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
>> extern bool mp_should_keep_irq(struct device *dev);
>>
>> struct pci_raw_ops {
>> + int rating;
>> int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int reg, int len, u32 *val);
>> int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index ddb798603201..e9965fd11576 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int reg, int len, u32 *val)
>> {
>> - if (domain == 0 && reg < 256 && raw_pci_ops)
>> + if (domain == 0 && reg < 256 && raw_pci_ops &&
>> + (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>> return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>> if (raw_pci_ext_ops)
>> return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int reg, int len, u32 val)
>> {
>> - if (domain == 0 && reg < 256 && raw_pci_ops)
>> + if (domain == 0 && reg < 256 && raw_pci_ops &&
>> + (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>> return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>> if (raw_pci_ext_ops)
>> return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>
>> and then somewhere in Vmware hypervisor initialization code
>> (arch/x86/kernel/cpu/vmware.c) you do
>>
>> raw_pci_ext_ops->rating = 100;
> Thanks Vitaly, for your review and helping us to improve the code.
>
> I was working to make changes as you suggested, but before sending v3 would like to
> discuss on following:
>
> If we add rating with-in struct pci_raw_ops then we can't have pci_mmcfg as const,
> and following change is must in arch/x86/pci/mmconfig_64.c:
>
> -const struct pci_raw_ops pci_mmcfg = {
> +struct pci_raw_ops pci_mmcfg = {
> .read = pci_mmcfg_read,
> .write = pci_mmcfg_write,
> };
>
> So to avoid this change, is it fine to have global bool prefer_raw_pci_ext_ops?
>
> And raw_pci_read() will have following change:
>
> - if (domain == 0 && reg < 256 && raw_pci_ops)
> + if (domain == 0 && reg < 256 && raw_pci_ops &&
> + (!prefer_raw_pci_ext_ops || !raw_pci_ext_ops)
>
>> why wouldn't it work?
>>
>> (diclaimer: completely untested, raw_pci_ops/raw_pci_ext_ops
>> initialization has to be checked so 'rating' is not garbage).
>>
>>> It's a generic solution for all hypervisor (sorry for earlier wrong
>>> Subject), not specific to VMware. Further looking for feedback if it's
>>> impacting to any hypervisor.
>> That's the tricky part. We can check modern hypervisor versions, but
>> what about all other versions in existence? How can we know that there's
>> no QEMU/Hyper-V/... version out there where MMIO path is broken? I'd
>> suggest we limit the change to Vmware hypervisor, other hypervisors may
>> use the same mechanism (like the one above) later (but the person
>> suggesting the patch is always responsible for the research why it is
>> safe to do so).
> Ok, as of now we will make this change specific to VMware hypervisor.


Is there a way we can make it an ACPI property in MCFG, so that the
environment can self-describe the fact that it's safe to prefer ECAM
config space access over legacy PIO? That way we don't need to patch
guests every time a hypervisor decides that ECAM is safe to prefer.

Also, Michael (CC'ed) mentioned that according to the spec, a PCIe host
bridge with PCI_COMMAND.MEMORY=0 would stop responding through its ECAM
window. Given that most ARM systems have no PIO fallback path, we want
to make sure we never hit that condition.


Alex





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


2022-10-03 15:18:41

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

Ajay Kaher <[email protected]> writes:

>> On 13/09/22, 7:05 PM, "Vitaly Kuznetsov" <[email protected]> wrote:
>>>
>>> Thanks Vitaly for your response.
>>>
>>> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
>>> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
>>> to be used, not something with-in struct pci_raw_ops.
>>
>> I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
>> which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
>> something like (completely untested):
>>
>> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
>> index 70533fdcbf02..fb8270fa6c78 100644
>> --- a/arch/x86/include/asm/pci_x86.h
>> +++ b/arch/x86/include/asm/pci_x86.h
>> @@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
>> extern bool mp_should_keep_irq(struct device *dev);
>>
>> struct pci_raw_ops {
>> + int rating;
>> int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int reg, int len, u32 *val);
>> int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index ddb798603201..e9965fd11576 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int reg, int len, u32 *val)
>> {
>> - if (domain == 0 && reg < 256 && raw_pci_ops)
>> + if (domain == 0 && reg < 256 && raw_pci_ops &&
>> + (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>> return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>> if (raw_pci_ext_ops)
>> return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int reg, int len, u32 val)
>> {
>> - if (domain == 0 && reg < 256 && raw_pci_ops)
>> + if (domain == 0 && reg < 256 && raw_pci_ops &&
>> + (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>> return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>> if (raw_pci_ext_ops)
>> return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>
>> and then somewhere in Vmware hypervisor initialization code
>> (arch/x86/kernel/cpu/vmware.c) you do
>>
>> raw_pci_ext_ops->rating = 100;
>
> Thanks Vitaly, for your review and helping us to improve the code.
>
> I was working to make changes as you suggested, but before sending v3 would like to
> discuss on following:
>
> If we add rating with-in struct pci_raw_ops then we can't have pci_mmcfg as const,
> and following change is must in arch/x86/pci/mmconfig_64.c:
>
> -const struct pci_raw_ops pci_mmcfg = {
> +struct pci_raw_ops pci_mmcfg = {
> .read = pci_mmcfg_read,
> .write = pci_mmcfg_write,
> };
>
> So to avoid this change, is it fine to have global bool prefer_raw_pci_ext_ops?
>
> And raw_pci_read() will have following change:
>
> - if (domain == 0 && reg < 256 && raw_pci_ops)
> + if (domain == 0 && reg < 256 && raw_pci_ops &&
> + (!prefer_raw_pci_ext_ops || !raw_pci_ext_ops)
>

Not my call but rather the PCI maintainers', but IMHO dropping 'const' is
better; introducing a new global var is a 'last resort' and should be
avoided whenever possible. Alternatively, you can add a
raw_pci_ext_ops_preferred() function checking something within 'struct
hypervisor_x86', but I'm unsure whether that's better.

Also, please check Alex's question/suggestion.

...

--
Vitaly

2022-10-03 18:15:34

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <[email protected]> wrote:

> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
> better, introducing a new global var is our 'last resort' and should be
> avoided whenever possible. Alternatively, you can add a
> raw_pci_ext_ops_preferred() function checking somethin within 'struct
> hypervisor_x86' but I'm unsure if it's better.
>
> Also, please check Alex' question/suggestion.

Here is my take (and Ajay probably knows more than I do):

Looking briefly at MCFG, I do not see a clean way of using the ACPI table.
The two options are either to use a reserved field (which, who knows, might
be used one day) or some OEM ID. I am also not familiar with
PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.

Anyhow, I understand (although do not relate to) the objection to a new global
variable. How about explicitly calling this hardware bug a "bug" and using
the proper infrastructure? Calling it a bug explicitly may even push whoever
can fix it to resolve it.

IOW, how about doing something along the lines of (not tested):


-- >8 --

Subject: [PATCH] x86/PCI: Prefer MMIO over PIO on VMware hypervisor

---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/common.c | 2 ++
arch/x86/kernel/cpu/vmware.c | 2 ++
arch/x86/pci/common.c | 6 ++++--
4 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ef4775c6db01..216b6f357b6d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -460,5 +460,6 @@
#define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
#define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
#define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
+#define X86_BUG_ECAM_MMIO X86_BUG(29) /* ECAM MMIO is buggy and PIO is preferable */

#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e508f239098..c94175fa304b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1299,6 +1299,8 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
{
u64 ia32_cap = x86_read_arch_cap_msr();

+ setup_force_cpu_bug(X86_BUG_ECAM_MMIO);
+
/* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
if (!cpu_matches(cpu_vuln_whitelist, NO_ITLB_MULTIHIT) &&
!(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 02039ec3597d..8903776284a6 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -385,6 +385,8 @@ static void __init vmware_set_capabilities(void)
setup_force_cpu_cap(X86_FEATURE_VMCALL);
else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
setup_force_cpu_cap(X86_FEATURE_VMW_VMMCALL);
+
+ setup_clear_cpu_cap(X86_BUG_ECAM_MMIO);
}

static void __init vmware_platform_setup(void)
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index ddb798603201..bc81cf4c1014 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
int reg, int len, u32 *val)
{
- if (domain == 0 && reg < 256 && raw_pci_ops)
+ if (domain == 0 && reg < 256 && raw_pci_ops &&
+ (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
if (raw_pci_ext_ops)
return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
@@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
int reg, int len, u32 val)
{
- if (domain == 0 && reg < 256 && raw_pci_ops)
+ if (domain == 0 && reg < 256 && raw_pci_ops &&
+ (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
if (raw_pci_ext_ops)
return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
--
2.34.1




2022-10-03 21:46:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

On October 3, 2022 10:34:15 AM PDT, Nadav Amit <[email protected]> wrote:
>On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <[email protected]> wrote:
>
>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>> better, introducing a new global var is our 'last resort' and should be
>> avoided whenever possible. Alternatively, you can add a
>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>> hypervisor_x86' but I'm unsure if it's better.
>>
>> Also, please check Alex' question/suggestion.
>
>Here is my take (and Ajay knows probably more than me):
>
>Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>The two options are either to use a reserved field (which who knows, might
>be used one day) or some OEM ID. I am also not familiar with
>PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>
>Anyhow, I understand (although not relate) to the objection for a new global
>variable. How about explicitly calling this hardware bug a “bug” and using
>the proper infrastructure? Calling it explicitly a bug may even push whoever
>can to resolve it.
>
>IOW, how about doing something along the lines of (not tested):
>
>
>-- >8 --
>
>Subject: [PATCH] x86/PCI: Prefer MMIO over PIO on VMware hypervisor
>
>---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kernel/cpu/common.c | 2 ++
> arch/x86/kernel/cpu/vmware.c | 2 ++
> arch/x86/pci/common.c | 6 ++++--
> 4 files changed, 9 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>index ef4775c6db01..216b6f357b6d 100644
>--- a/arch/x86/include/asm/cpufeatures.h
>+++ b/arch/x86/include/asm/cpufeatures.h
>@@ -460,5 +460,6 @@
> #define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
> #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
> #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
>+#define X86_BUG_ECAM_MMIO X86_BUG(29) /* ECAM MMIO is buggy and PIO is preferable */
>
> #endif /* _ASM_X86_CPUFEATURES_H */
>diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>index 3e508f239098..c94175fa304b 100644
>--- a/arch/x86/kernel/cpu/common.c
>+++ b/arch/x86/kernel/cpu/common.c
>@@ -1299,6 +1299,8 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
> {
> u64 ia32_cap = x86_read_arch_cap_msr();
>
>+ setup_force_cpu_bug(X86_BUG_ECAM_MMIO);
>+
> /* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
> if (!cpu_matches(cpu_vuln_whitelist, NO_ITLB_MULTIHIT) &&
> !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
>diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
>index 02039ec3597d..8903776284a6 100644
>--- a/arch/x86/kernel/cpu/vmware.c
>+++ b/arch/x86/kernel/cpu/vmware.c
>@@ -385,6 +385,8 @@ static void __init vmware_set_capabilities(void)
> setup_force_cpu_cap(X86_FEATURE_VMCALL);
> else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
> setup_force_cpu_cap(X86_FEATURE_VMW_VMMCALL);
>+
>+ setup_clear_cpu_cap(X86_BUG_ECAM_MMIO);
> }
>
> static void __init vmware_platform_setup(void)
>diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>index ddb798603201..bc81cf4c1014 100644
>--- a/arch/x86/pci/common.c
>+++ b/arch/x86/pci/common.c
>@@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
> int reg, int len, u32 *val)
> {
>- if (domain == 0 && reg < 256 && raw_pci_ops)
>+ if (domain == 0 && reg < 256 && raw_pci_ops &&
>+ (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
> return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
> if (raw_pci_ext_ops)
> return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>@@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
> int reg, int len, u32 val)
> {
>- if (domain == 0 && reg < 256 && raw_pci_ops)
>+ if (domain == 0 && reg < 256 && raw_pci_ops &&
>+ (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
> return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
> if (raw_pci_ext_ops)
> return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);

Also... any reason we can't just set raw_pci_ops = raw_pci_ext_ops for the case when the latter is preferred, and dispense with the conditionals in the use path? Similarly, raw_pci_ext_ops could be pointed at error routines instead of left at NULL.

2022-10-03 21:47:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

On October 3, 2022 8:03:41 AM PDT, Vitaly Kuznetsov <[email protected]> wrote:
>Ajay Kaher <[email protected]> writes:
>
>>> On 13/09/22, 7:05 PM, "Vitaly Kuznetsov" <[email protected]> wrote:
>>>>
>>>> Thanks Vitaly for your response.
>>>>
>>>> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
>>>> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
>>>> to be used, not something with-in struct pci_raw_ops.
>>>
>>> I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
>>> which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
>>> something like (completely untested):
>>>
>>> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
>>> index 70533fdcbf02..fb8270fa6c78 100644
>>> --- a/arch/x86/include/asm/pci_x86.h
>>> +++ b/arch/x86/include/asm/pci_x86.h
>>> @@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
>>> extern bool mp_should_keep_irq(struct device *dev);
>>>
>>> struct pci_raw_ops {
>>> + int rating;
>>> int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> int reg, int len, u32 *val);
>>> int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>>> index ddb798603201..e9965fd11576 100644
>>> --- a/arch/x86/pci/common.c
>>> +++ b/arch/x86/pci/common.c
>>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> int reg, int len, u32 *val)
>>> {
>>> - if (domain == 0 && reg < 256 && raw_pci_ops)
>>> + if (domain == 0 && reg < 256 && raw_pci_ops &&
>>> + (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>>> return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>> if (raw_pci_ext_ops)
>>> return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> int reg, int len, u32 val)
>>> {
>>> - if (domain == 0 && reg < 256 && raw_pci_ops)
>>> + if (domain == 0 && reg < 256 && raw_pci_ops &&
>>> + (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>>> return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>> if (raw_pci_ext_ops)
>>> return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>>
>>> and then somewhere in Vmware hypervisor initialization code
>>> (arch/x86/kernel/cpu/vmware.c) you do
>>>
>>> raw_pci_ext_ops->rating = 100;
>>
>> Thanks Vitaly for your review and for helping us improve the code.
>>
>> I was working on the changes you suggested, but before sending v3 I would like to
>> discuss the following:
>>
>> If we add rating within struct pci_raw_ops then we can't have pci_mmcfg as const,
>> and the following change becomes necessary in arch/x86/pci/mmconfig_64.c:
>>
>> -const struct pci_raw_ops pci_mmcfg = {
>> +struct pci_raw_ops pci_mmcfg = {
>> .read = pci_mmcfg_read,
>> .write = pci_mmcfg_write,
>> };
>>
>> So to avoid this change, is it fine to have a global bool prefer_raw_pci_ext_ops?
>>
>> And raw_pci_read() would then have the following change:
>>
>> - if (domain == 0 && reg < 256 && raw_pci_ops)
>> + if (domain == 0 && reg < 256 && raw_pci_ops &&
>> + (!prefer_raw_pci_ext_ops || !raw_pci_ext_ops))
>>
>
>Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>better, introducing a new global var is our 'last resort' and should be
>avoided whenever possible. Alternatively, you can add a
>raw_pci_ext_ops_preferred() function checking something within 'struct
>hypervisor_x86' but I'm unsure if it's better.
>
>Also, please check Alex' question/suggestion.
>
>...
>

Could this be ro_after_init?
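For the archive, the rating dispatch being discussed (and the ro_after_init question) can be modeled in a small self-contained userspace sketch. Everything below except `struct pci_raw_ops` and the `rating` field is an invented stand-in, not kernel code:

```c
/* Outside the kernel, __ro_after_init is just an annotation here: the
 * objects are writable during init and conceptually sealed afterwards. */
#ifndef __ro_after_init
#define __ro_after_init
#endif

struct pci_raw_ops {
	int rating;
	int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
		    int reg, int len, unsigned int *val);
};

/* Stand-ins for the real conf1 (PIO) and mmconfig (ECAM) accessors */
static int conf1_read(unsigned int d, unsigned int b, unsigned int f,
		      int reg, int len, unsigned int *val)
{
	(void)d; (void)b; (void)f; (void)reg; (void)len;
	*val = 0x1111; /* marker: PIO path taken */
	return 0;
}

static int mmcfg_read(unsigned int d, unsigned int b, unsigned int f,
		      int reg, int len, unsigned int *val)
{
	(void)d; (void)b; (void)f; (void)reg; (void)len;
	*val = 0x2222; /* marker: ECAM path taken */
	return 0;
}

/* Writable at init so e.g. the vmware.c hook can bump the ext rating */
struct pci_raw_ops model_raw_pci_ops __ro_after_init = {
	.rating = 0, .read = conf1_read,
};
struct pci_raw_ops model_raw_pci_ext_ops __ro_after_init = {
	.rating = 0, .read = mmcfg_read,
};

/* Mirrors the proposed raw_pci_read(): PIO wins only while its rating
 * is at least as high as the ECAM one. */
int model_raw_pci_read(unsigned int domain, unsigned int bus,
		       unsigned int devfn, int reg, int len, unsigned int *val)
{
	if (domain == 0 && reg < 256 &&
	    model_raw_pci_ext_ops.rating <= model_raw_pci_ops.rating)
		return model_raw_pci_ops.read(domain, bus, devfn, reg, len, val);
	return model_raw_pci_ext_ops.read(domain, bus, devfn, reg, len, val);
}
```

With both ratings at zero the PIO path wins; after init code raises `model_raw_pci_ext_ops.rating` (as the suggested vmware.c tweak would), ECAM wins for the same request.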

2022-10-04 09:22:41

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

Nadav Amit <[email protected]> writes:

> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <[email protected]> wrote:
>
>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>> better, introducing a new global var is our 'last resort' and should be
>> avoided whenever possible. Alternatively, you can add a
>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>> hypervisor_x86' but I'm unsure if it's better.
>>
>> Also, please check Alex' question/suggestion.
>
> Here is my take (and Ajay knows probably more than me):
>
> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
> The two options are either to use a reserved field (which who knows, might
> be used one day) or some OEM ID. I am also not familiar with
> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>
> Anyhow, I understand (although not relate) to the objection for a new global
> variable. How about explicitly calling this hardware bug a “bug” and using
> the proper infrastructure? Calling it explicitly a bug may even push whoever
> can to resolve it.
>
> IOW, how about doing something along the lines of (not tested):
>

Works for me. Going forward, the intention should be to also clear the
bug on other x86 hypervisors: e.g. we test modern Hyper-V versions and
clear it if MMIO works well; we test modern QEMU/KVM setups and, if
MMIO works, introduce a feature bit somewhere and also clear the bug in
the guest when the bit is set.
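The per-hypervisor clearing described above maps naturally onto a bug-bit model. The flag name and helpers below are invented to illustrate the shape, not the kernel's actual X86_BUG infrastructure:

```c
#include <stdbool.h>

/* Invented bug bit: "ECAM config-space access may be unreliable".
 * Set by default; cleared only when the platform is known to be safe. */
enum { MODEL_BUG_PIO_ONLY_CFG = 0 };

static unsigned long model_bug_bits = 1UL << MODEL_BUG_PIO_ONLY_CFG;

static bool model_has_bug(int bit)
{
	return model_bug_bits & (1UL << bit);
}

/* What e.g. the VMware/Hyper-V/KVM init paths would call once they have
 * verified (or been told) that full ECAM access works. */
static void model_clear_bug(int bit)
{
	model_bug_bits &= ~(1UL << bit);
}

/* raw_pci_read()/raw_pci_write() would consult this instead of a global */
static bool model_prefer_mmio(void)
{
	return !model_has_bug(MODEL_BUG_PIO_ONLY_CFG);
}
```

The default is conservative (bug present, PIO preferred), so an old guest on an unknown hypervisor behaves exactly as today.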

--
Vitaly

2022-10-04 09:47:29

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

Hey Nadav,

On 03.10.22 19:34, Nadav Amit wrote:
> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <[email protected]> wrote:
>
>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>> better, introducing a new global var is our 'last resort' and should be
>> avoided whenever possible. Alternatively, you can add a
>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>> hypervisor_x86' but I'm unsure if it's better.
>>
>> Also, please check Alex' question/suggestion.
> Here is my take (and Ajay knows probably more than me):
>
> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
> The two options are either to use a reserved field (which who knows, might
> be used one day) or some OEM ID. I am also not familiar with
> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>
> Anyhow, I understand (although not relate) to the objection for a new global
> variable. How about explicitly calling this hardware bug a “bug” and using
> the proper infrastructure? Calling it explicitly a bug may even push whoever
> can to resolve it.


I am a lot more concerned with how we propagate it externally than
within Linux. If we hard code that all Linux kernels 6.2+ that are
running in VMware prefer ECAM over PIO, we lock ourselves into that
stance for better or worse, which means:

* All past and future versions of any VMware hypervisor product have to
always allow ECAM access for any PCIe config space write
* No other hypervisor benefits from any of this without upstream code change
* No real hardware platform benefits from this without upstream code change

By moving it into MCFG, we can create a path for the outside environment
to tell the OS whether it's safe to use ECAM always. This obviously
doesn't work with MCFG as it stands today, we'd have to propose an MCFG
spec change to the PCI SIG's "PCI Firmware Specification" to add the
respective field. Future VMware versions could then always expose the
flag - and if you find it broken, remove it again.

Putting all of the logic on which system potentially prefers ECAM over
PIO config space access into Linux is just a big hack that we should
avoid as much as possible.
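For reference, if the PCI Firmware Specification ever grew such a field, the guest-side check might look like the sketch below. The flags word and its bit are entirely invented; in the real MCFG allocation entry those four bytes are reserved:

```c
#include <stdint.h>
#include <stdbool.h>

/* MCFG configuration-space base address allocation entry, with the
 * reserved bytes reinterpreted as a hypothetical flags word (NOT in
 * any published spec). */
struct model_mcfg_alloc {
	uint64_t base_address;
	uint16_t segment;
	uint8_t  start_bus;
	uint8_t  end_bus;
	uint32_t flags;	/* "reserved" in the real table today */
};

#define MODEL_MCFG_ECAM_ALWAYS_OK (1u << 0) /* invented bit */

/* Firmware that knows ECAM is fully reliable would set the bit; the OS
 * falls back to today's conservative behavior when it is absent. */
static bool model_ecam_always_ok(const struct model_mcfg_alloc *a)
{
	return a->flags & MODEL_MCFG_ECAM_ALWAYS_OK;
}
```

The point of the sketch is Alex's path: the decision lives with the platform/hypervisor that ships the table, not in a per-vendor list inside Linux.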


Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


2022-10-04 19:09:19

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

On Oct 4, 2022, at 1:22 AM, Alexander Graf <[email protected]> wrote:

>
> Hey Nadav,
>
> On 03.10.22 19:34, Nadav Amit wrote:
>> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <[email protected]> wrote:
>>
>>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>>> better, introducing a new global var is our 'last resort' and should be
>>> avoided whenever possible. Alternatively, you can add a
>>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>>> hypervisor_x86' but I'm unsure if it's better.
>>>
>>> Also, please check Alex' question/suggestion.
>> Here is my take (and Ajay knows probably more than me):
>>
>> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>> The two options are either to use a reserved field (which who knows, might
>> be used one day) or some OEM ID. I am also not familiar with
>> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>>
>> Anyhow, I understand (although not relate) to the objection for a new global
>> variable. How about explicitly calling this hardware bug a “bug” and using
>> the proper infrastructure? Calling it explicitly a bug may even push whoever
>> can to resolve it.
>
>
> I am a lot more concerned with how we propagate it externally than
> within Linux. If we hard code that all Linux kernels 6.2+ that are
> running in VMware prefer ECAM over PIO, we lock ourselves into that
> stance for better or worse, which means:
>
> * All past and future versions of any VMware hypervisor product have to
> always allow ECAM access for any PCIe config space write
> * No other hypervisor benefits from any of this without upstream code change
> * No real hardware platform benefits from this without upstream code change
>
> By moving it into MCFG, we can create a path for the outside environment
> to tell the OS whether it's safe to use ECAM always. This obviously
> doesn't work with MCFG as it stands today, we'd have to propose an MCFG
> spec change to the PCI SIG's "PCI Firmware Specification" to add the
> respective field. Future VMware versions could then always expose the
> flag - and if you find it broken, remove it again.
>
> Putting all of the logic on which system potentially prefers ECAM over
> PIO config space access into Linux is just a big hack that we should
> avoid as much as possible.

Thanks Alex. You raise important points. Let me try to break down your
concerns slightly differently:

1. Enabling MMIO access should be selective, and potentially controlled by
the hypervisor. At the very least, a "chicken bit" is needed.

2. The PCI SIG would have to change its specifications to address an unclear hardware bug.

I think (1) makes sense and we can discuss different ways of addressing it.
But (2) would not happen in a reasonable timeline and seems to me an
unnecessary complication.

But before we discuss how to address the issue, perhaps we need to first
understand it better. I am not sure that I understand this MMIO bug, and so
far nobody has been able to provide exact details.

So I went to have a look. It might not be super helpful, but for the record,
here is what I collected.

First, we have commit d6ece5491ae71d ("i386/x86-64 Correct for broken MCFG
tables on K8 systems"). It tried to "try to discover all devices on bus 0
that are unreachable using MM and fallback for them." Interestingly, it
seems similar to FreeBSD code (commit 2d10570afe2b3e) that also mentions K8
and has similar detection logic in FreeBSD's pcie_cfgregopen().

Then commit a0ca9909609470 ("PCI x86: always use conf1 to access config
space below 256 bytes"). The correspondence [1] mentions some bugs: ATI
chipset, VIA chipset, Intel 3 Series Express chipset family, and some reports
on Nvidia. It turned out some devices had problems with the probing that the
previous patch did to figure out whether MMIO is broken.

All of these bugs are circa 2008, of course. And note that FreeBSD did not
take a similar path. The correspondence around the Linux patch is endless. I
admit that I did not understand whether the issues were eventually found to
be per-bus or per-device.


Back to the matter at hand. The benefit of the MCFG approach that you
propose is that it can enable native systems to use MMIO as well. However,
since the list of bugs is unclear and the problems might be device-specific,
it is not clear what information BIOSes have that Linux doesn't. In other
words, the benefit of getting it into the specifications is questionable,
and the complexity and time cost are high.

Can we agree that the feature would be enabled explicitly by the hypervisor,
and that Linux would enable it based on the hypervisor's input (through some
channel)?

Thanks,
Nadav

[1] https://lore.kernel.org/all/[email protected]/T/#u
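The hypervisor-controlled enable Nadav asks about could be as small as one capability bit in some paravirt channel. The feature word and bit below are placeholders, not a real VMware or KVM interface:

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_HV_FEAT_ECAM_OK (1u << 3) /* placeholder bit number */

/* Stand-in for whatever the channel ends up being: a hypervisor CPUID
 * leaf, a synthetic MSR, a backdoor call, ... Populated once at boot. */
static uint32_t model_hv_feature_word;

/* The guest prefers ECAM only when the hypervisor says it is safe;
 * absence of the bit keeps today's PIO-first behavior. */
static bool model_hv_allows_ecam(void)
{
	return model_hv_feature_word & MODEL_HV_FEAT_ECAM_OK;
}
```

This is the "chicken bit" shape: the hypervisor can stop advertising the bit at any time, and old guests that never read it are unaffected.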

2022-10-10 15:21:07

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

On Oct 4, 2022, at 11:48 AM, Nadav Amit <[email protected]> wrote:

> On Oct 4, 2022, at 1:22 AM, Alexander Graf <[email protected]> wrote:
>
>>
>> Hey Nadav,
>>
>> On 03.10.22 19:34, Nadav Amit wrote:
>>> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <[email protected]> wrote:
>>>
>>>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>>>> better, introducing a new global var is our 'last resort' and should be
>>>> avoided whenever possible. Alternatively, you can add a
>>>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>>>> hypervisor_x86' but I'm unsure if it's better.
>>>>
>>>> Also, please check Alex' question/suggestion.
>>> Here is my take (and Ajay knows probably more than me):
>>>
>>> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>>> The two options are either to use a reserved field (which who knows, might
>>> be used one day) or some OEM ID. I am also not familiar with
>>> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>>>
>>> Anyhow, I understand (although not relate) to the objection for a new global
>>> variable. How about explicitly calling this hardware bug a “bug” and using
>>> the proper infrastructure? Calling it explicitly a bug may even push whoever
>>> can to resolve it.
>>
>>
>> I am a lot more concerned with how we propagate it externally than
>> within Linux. If we hard code that all Linux kernels 6.2+ that are
>> running in VMware prefer ECAM over PIO, we lock ourselves into that
>> stance for better or worse, which means:
>>
>> * All past and future versions of any VMware hypervisor product have to
>> always allow ECAM access for any PCIe config space write
>> * No other hypervisor benefits from any of this without upstream code change
>> * No real hardware platform benefits from this without upstream code change
>>
>> By moving it into MCFG, we can create a path for the outside environment
>> to tell the OS whether it's safe to use ECAM always. This obviously
>> doesn't work with MCFG as it stands today, we'd have to propose an MCFG
>> spec change to the PCI SIG's "PCI Firmware Specification" to add the
>> respective field. Future VMware versions could then always expose the
>> flag - and if you find it broken, remove it again.
>>
>> Putting all of the logic on which system potentially prefers ECAM over
>> PIO config space access into Linux is just a big hack that we should
>> avoid as much as possible.
>
> Thanks Alex. You raise important points. Let me try to break down your
> concerns slightly differently:
>
> 1. Enabling MMIO access should be selective, and potentially controlled by
> the hypervisor. The very least a "chicken-bit” is needed.
>
> 2. PCI SIG would change its specifications to address unclear hardware bug.
>
> I think (1) makes sense and we can discuss different ways of addressing it.
> But (2) would not happen in a reasonable timeline and seems to me as an
> unnecessary complication.
>
> But before we discuss how to address the issue, perhaps we need to first
> understand it better. I am not sure that I understand this MMIO bug, and so
> far nobody was able to provide exact details.
>
> So I went to have a look. It might not be super helpful, but for the record,
> here is what I collected.
>
> First, we have commit d6ece5491ae71d ("i386/x86-64 Correct for broken MCFG
> tables on K8 systems”). It tried to "try to discover all devices on bus 0
> that are unreachable using MM and fallback for them.” Interestingly, it
> seems similar to FreeBSD code (commit 2d10570afe2b3e) that also mentions K8
> and has similar detection logic in FreeBSD’s pcie_cfgregopen().
>
> Then commit a0ca9909609470 ("PCI x86: always use conf1 to access config
> space below 256 bytes”). The correspondence [1] mentions some bugs: ATI
> chipset, VIA chipset, Intel 3 Series Express chipset family and some reports
> on Nvidia. It turned out some devices had problem probing - to figure out if
> MMIO is broken - the way the previous patch did.
>
> All of these bugs are circa 2008, of course. And note that FreeBSD did not
> take a similar path. The correspondence around Linux patch is endless. I
> admit that I did not understand whether eventually the issues were found to
> be per-bus or per-device.
>
>
> Back to the matter at hand. The benefit of using the MCFG approach that you
> propose is that it can enable native systems to use MMIO as well. However,
> since the list of bugs is unclear and the problems might be device-specific,
> it is not clear what information BIOSes have that Linux doesn’t. In other
> words, the benefit of getting it into the specifications is questionable,
> and the complexity+time is high.
>
> Can we agree that the feature would be enabled explicitly by the hypervisor
> and Linux would enable it based on the hypervisor input (through some
> channel?)

Alex, is it ok with you? We will enable the feature (disable the bug)
explicitly from the hypervisor, but would not rely on MCFG changes, which
even in the best case would take some time.

2022-10-10 17:13:40

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor

On Tue, Oct 04, 2022 at 06:48:11PM +0000, Nadav Amit wrote:
> On Oct 4, 2022, at 1:22 AM, Alexander Graf <[email protected]> wrote:
>
> >
> > Hey Nadav,
> >
> > On 03.10.22 19:34, Nadav Amit wrote:
> >> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <[email protected]> wrote:
> >>
> >>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
> >>> better, introducing a new global var is our 'last resort' and should be
> >>> avoided whenever possible. Alternatively, you can add a
> >>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
> >>> hypervisor_x86' but I'm unsure if it's better.
> >>>
> >>> Also, please check Alex' question/suggestion.
> >> Here is my take (and Ajay knows probably more than me):
> >>
> >> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
> >> The two options are either to use a reserved field (which who knows, might
> >> be used one day) or some OEM ID. I am also not familiar with
> >> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
> >>
> >> Anyhow, I understand (although not relate) to the objection for a new global
> >> variable. How about explicitly calling this hardware bug a “bug” and using
> >> the proper infrastructure? Calling it explicitly a bug may even push whoever
> >> can to resolve it.
> >
> >
> > I am a lot more concerned with how we propagate it externally than
> > within Linux. If we hard code that all Linux kernels 6.2+ that are
> > running in VMware prefer ECAM over PIO, we lock ourselves into that
> > stance for better or worse, which means:
> >
> > * All past and future versions of any VMware hypervisor product have to
> > always allow ECAM access for any PCIe config space write
> > * No other hypervisor benefits from any of this without upstream code change
> > * No real hardware platform benefits from this without upstream code change
> >
> > By moving it into MCFG, we can create a path for the outside environment
> > to tell the OS whether it's safe to use ECAM always. This obviously
> > doesn't work with MCFG as it stands today, we'd have to propose an MCFG
> > spec change to the PCI SIG's "PCI Firmware Specification" to add the
> > respective field. Future VMware versions could then always expose the
> > flag - and if you find it broken, remove it again.
> >
> > Putting all of the logic on which system potentially prefers ECAM over
> > PIO config space access into Linux is just a big hack that we should
> > avoid as much as possible.
>
> Thanks Alex. You raise important points. Let me try to break down your
> concerns slightly differently:
>
> 1. Enabling MMIO access should be selective, and potentially controlled by
> the hypervisor. The very least a "chicken-bit” is needed.
>
> 2. PCI SIG would change its specifications to address unclear hardware bug.
>
> I think (1) makes sense and we can discuss different ways of addressing it.
> But (2) would not happen in a reasonable timeline and seems to me as an
> unnecessary complication.
>
> But before we discuss how to address the issue, perhaps we need to first
> understand it better. I am not sure that I understand this MMIO bug, and so
> far nobody was able to provide exact details.
>
> So I went to have a look. It might not be super helpful, but for the record,
> here is what I collected.
>
> First, we have commit d6ece5491ae71d ("i386/x86-64 Correct for broken MCFG
> tables on K8 systems”). It tried to "try to discover all devices on bus 0
> that are unreachable using MM and fallback for them.” Interestingly, it
> seems similar to FreeBSD code (commit 2d10570afe2b3e) that also mentions K8
> and has similar detection logic in FreeBSD’s pcie_cfgregopen().
>
> Then commit a0ca9909609470 ("PCI x86: always use conf1 to access config
> space below 256 bytes”). The correspondence [1] mentions some bugs: ATI
> chipset, VIA chipset, Intel 3 Series Express chipset family and some reports
> on Nvidia. It turned out some devices had problem probing - to figure out if
> MMIO is broken - the way the previous patch did.

There's also a statement by Linus that MCFG might not cover all buses
in that thread. I didn't think the implications through yet ...

> All of these bugs are circa 2008, of course. And note that FreeBSD did not
> take a similar path. The correspondence around Linux patch is endless. I
> admit that I did not understand whether eventually the issues were found to
> be per-bus or per-device.
>
>
> Back to the matter at hand. The benefit of using the MCFG approach that you
> propose is that it can enable native systems to use MMIO as well. However,
> since the list of bugs is unclear and the problems might be device-specific,
> it is not clear what information BIOSes have that Linux doesn’t. In other
> words, the benefit of getting it into the specifications is questionable,
> and the complexity+time is high.
>
> Can we agree that the feature would be enabled explicitly by the hypervisor
> and Linux would enable it based on the hypervisor input (through some
> channel?)
>
> Thanks,
> Nadav
>
> [1] https://lore.kernel.org/all/[email protected]/T/#u