If MCFG area is not reserved in E820, Xen by default will defer its usage
until Dom0 registers it explicitly after ACPI parser recognizes it as
a reserved resource in DSDT. Having it reserved in E820 is not
mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
and firmware is free to keep a hole in E820 in that place. Xen doesn't know
what exactly is inside this hole since it lacks full ACPI view of the
platform, therefore it's potentially harmful to access MCFG region
without additional checks as some machines are known to provide
inconsistent information on the size of the region.
Now xen_mcfg_late() runs after acpi_init() which is too late as some basic
PCI enumeration starts exactly there. Trying to register a device prior
to MCFG reservation causes multiple problems with PCIe extended
capability initializations in Xen (e.g. SR-IOV VF BAR sizing). There are
no convenient hooks for us to subscribe to so try to register MCFG
areas earlier upon the first invocation of xen_add_device(). Keep the
existing initcall in case information of MCFG areas is updated later
in acpi_init().
Signed-off-by: Igor Druzhinin <[email protected]>
---
drivers/xen/pci.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index 7494dbe..800f415 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -29,6 +29,9 @@
#include "../pci/pci.h"
#ifdef CONFIG_PCI_MMCONFIG
#include <asm/pci_x86.h>
+
+static int xen_mcfg_late(void);
+static bool __read_mostly pci_mcfg_reserved = false;
#endif
static bool __read_mostly pci_seg_supported = true;
@@ -41,6 +44,16 @@ static int xen_add_device(struct device *dev)
struct pci_dev *physfn = pci_dev->physfn;
#endif
+#ifdef CONFIG_PCI_MMCONFIG
+ /*
+ * Try to reserve MCFG areas discovered so far early on first invocation
+ * due to this being potentially called from inside of acpi_init
+ */
+ if (!pci_mcfg_reserved) {
+ xen_mcfg_late();
+ pci_mcfg_reserved = true;
+ }
+#endif
if (pci_seg_supported) {
struct {
struct physdev_pci_device_add add;
@@ -213,7 +226,7 @@ static int __init register_xen_pci_notifier(void)
arch_initcall(register_xen_pci_notifier);
#ifdef CONFIG_PCI_MMCONFIG
-static int __init xen_mcfg_late(void)
+static int xen_mcfg_late(void)
{
struct pci_mmcfg_region *cfg;
int rc;
--
2.7.4
On 04.09.2019 02:20, Igor Druzhinin wrote:
> If MCFG area is not reserved in E820, Xen by default will defer its usage
> until Dom0 registers it explicitly after ACPI parser recognizes it as
> a reserved resource in DSDT. Having it reserved in E820 is not
> mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
> and firmware is free to keep a hole E820 in that place. Xen doesn't know
> what exactly is inside this hole since it lacks full ACPI view of the
> platform therefore it's potentially harmful to access MCFG region
> without additional checks as some machines are known to provide
> inconsistent information on the size of the region.
Irrespective of this being a good change, I've had another thought
while reading this paragraph, for a hypervisor side control: Linux
has a "memmap=" command line option allowing fine grained control
over the E820 map. We could have something similar to allow
inserting an E820_RESERVED region into a hole (it would be the
responsibility of the admin to guarantee no other conflicts, i.e.
it should generally be used only if e.g. the MCFG is indeed known
to live at the specified place, and being properly represented in
the ACPI tables). Thoughts?
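To make the idea above a bit more concrete, here is a minimal sketch of
inserting a reserved entry into a hole of a sorted, E820-like map. All
names in it (struct region, struct memmap, range_is_hole(), reserve_hole())
are hypothetical, simplified stand-ins and not Xen's or Linux's actual E820
code; it also ignores address wrap-around. The insert is refused if the
range touches any existing entry, which only covers the "no other
conflicts" part. Whether the range really holds MCFG remains the admin's
responsibility.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified E820-style entry; not a real firmware layout. */
enum region_type { REGION_RAM = 1, REGION_RESERVED = 2 };

struct region {
    uint64_t start;          /* first byte of the region */
    uint64_t size;           /* length in bytes */
    enum region_type type;
};

#define MAP_MAX 16

struct memmap {
    struct region e[MAP_MAX];
    size_t nr;               /* entries in use, sorted by start, non-overlapping */
};

/* True if [start, start + size) overlaps no existing entry, i.e. is a hole. */
bool range_is_hole(const struct memmap *m, uint64_t start, uint64_t size)
{
    for (size_t i = 0; i < m->nr; i++) {
        uint64_t e_end = m->e[i].start + m->e[i].size;

        if (start < e_end && m->e[i].start < start + size)
            return false;
    }
    return true;
}

/*
 * Insert a reserved entry covering [start, start + size), keeping the map
 * sorted. Returns false and leaves the map unchanged if the range is empty,
 * is not entirely a hole, or the map is full.
 */
bool reserve_hole(struct memmap *m, uint64_t start, uint64_t size)
{
    size_t pos = 0;

    if (size == 0 || m->nr == MAP_MAX || !range_is_hole(m, start, size))
        return false;

    /* Find the slot that keeps entries ordered by start address. */
    while (pos < m->nr && m->e[pos].start < start)
        pos++;

    /* Shift the tail up by one to make room. */
    for (size_t i = m->nr; i > pos; i--)
        m->e[i] = m->e[i - 1];

    m->e[pos] = (struct region){ start, size, REGION_RESERVED };
    m->nr++;
    return true;
}
```

With RAM below 2G and above 4G, reserving a range in the gap between them
succeeds, while a range that overlaps RAM or a previously inserted reserved
entry is rejected.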
Jan
On 04/09/2019 10:08, Jan Beulich wrote:
> On 04.09.2019 02:20, Igor Druzhinin wrote:
>> If MCFG area is not reserved in E820, Xen by default will defer its usage
>> until Dom0 registers it explicitly after ACPI parser recognizes it as
>> a reserved resource in DSDT. Having it reserved in E820 is not
>> mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
>> and firmware is free to keep a hole E820 in that place. Xen doesn't know
>> what exactly is inside this hole since it lacks full ACPI view of the
>> platform therefore it's potentially harmful to access MCFG region
>> without additional checks as some machines are known to provide
>> inconsistent information on the size of the region.
>
> Irrespective of this being a good change, I've had another thought
> while reading this paragraph, for a hypervisor side control: Linux
> has a "memmap=" command line option allowing fine grained control
> over the E820 map. We could have something similar to allow
> inserting an E820_RESERVED region into a hole (it would be the
> responsibility of the admin to guarantee no other conflicts, i.e.
> it should generally be used only if e.g. the MCFG is indeed known
> to live at the specified place, and being properly represented in
> the ACPI tables). Thoughts?
What other use cases can you think of in case we'd have this option?
Off the top of my head, it might be providing a memmap for a second Xen
after doing kexec from Xen to Xen.
What benefits do you think it might have over just accepting a hole
using "mcfg=relaxed" option from admin perspective?
Igor
On 04.09.2019 13:36, Igor Druzhinin wrote:
> On 04/09/2019 10:08, Jan Beulich wrote:
>> On 04.09.2019 02:20, Igor Druzhinin wrote:
>>> If MCFG area is not reserved in E820, Xen by default will defer its usage
>>> until Dom0 registers it explicitly after ACPI parser recognizes it as
>>> a reserved resource in DSDT. Having it reserved in E820 is not
>>> mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
>>> and firmware is free to keep a hole E820 in that place. Xen doesn't know
>>> what exactly is inside this hole since it lacks full ACPI view of the
>>> platform therefore it's potentially harmful to access MCFG region
>>> without additional checks as some machines are known to provide
>>> inconsistent information on the size of the region.
>>
>> Irrespective of this being a good change, I've had another thought
>> while reading this paragraph, for a hypervisor side control: Linux
>> has a "memmap=" command line option allowing fine grained control
>> over the E820 map. We could have something similar to allow
>> inserting an E820_RESERVED region into a hole (it would be the
>> responsibility of the admin to guarantee no other conflicts, i.e.
>> it should generally be used only if e.g. the MCFG is indeed known
>> to live at the specified place, and being properly represented in
>> the ACPI tables). Thoughts?
>
> What other use cases can you think of in case we'd have this option?
> From the top of my head, it might be providing a memmap for a second Xen
> after doing kexec from Xen to Xen.
>
> What benefits do you think it might have over just accepting a hole
> using "mcfg=relaxed" option from admin perspective?
It wouldn't be MCFG-specific, i.e. it could also be used to e.g.
convert holes to E820_RESERVED to silence VT-d's respective RMRR
warning. Plus by inserting the entry into our own E820 we'd also
propagate it to users of XENMEM_machine_memory_map.
Jan
On 9/3/19 8:20 PM, Igor Druzhinin wrote:
> If MCFG area is not reserved in E820, Xen by default will defer its usage
> until Dom0 registers it explicitly after ACPI parser recognizes it as
> a reserved resource in DSDT. Having it reserved in E820 is not
> mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
> and firmware is free to keep a hole E820 in that place. Xen doesn't know
> what exactly is inside this hole since it lacks full ACPI view of the
> platform therefore it's potentially harmful to access MCFG region
> without additional checks as some machines are known to provide
> inconsistent information on the size of the region.
>
> Now xen_mcfg_late() runs after acpi_init() which is too late as some basic
> PCI enumeration starts exactly there. Trying to register a device prior
> to MCFG reservation causes multiple problems with PCIe extended
> capability initializations in Xen (e.g. SR-IOV VF BAR sizing). There are
> no convenient hooks for us to subscribe to so try to register MCFG
> areas earlier upon the first invocation of xen_add_device().
Where is MCFG parsed? pci_arch_init()?
-boris
> Keep the
> existing initcall in case information of MCFG areas is updated later
> in acpi_init().
>
On 06/09/2019 23:30, Boris Ostrovsky wrote:
> On 9/3/19 8:20 PM, Igor Druzhinin wrote:
>> If MCFG area is not reserved in E820, Xen by default will defer its usage
>> until Dom0 registers it explicitly after ACPI parser recognizes it as
>> a reserved resource in DSDT. Having it reserved in E820 is not
>> mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
>> and firmware is free to keep a hole E820 in that place. Xen doesn't know
>> what exactly is inside this hole since it lacks full ACPI view of the
>> platform therefore it's potentially harmful to access MCFG region
>> without additional checks as some machines are known to provide
>> inconsistent information on the size of the region.
>>
>> Now xen_mcfg_late() runs after acpi_init() which is too late as some basic
>> PCI enumeration starts exactly there. Trying to register a device prior
>> to MCFG reservation causes multiple problems with PCIe extended
>> capability initializations in Xen (e.g. SR-IOV VF BAR sizing). There are
>> no convenient hooks for us to subscribe to so try to register MCFG
>> areas earlier upon the first invocation of xen_add_device().
>
>
> Where is MCFG parsed? pci_arch_init()?
It happens twice:
1) first, early, in pci_arch_init(), which is an arch_initcall - at that
point pci_mmcfg_list is freed immediately because the MCFG area is
not reserved in E820;
2) second, late, in acpi_init(), which is a subsys_initcall, right
before PCI enumeration starts - this time the ACPI tables are
checked for a reserved resource and pci_mmcfg_list is finally
populated.
The problem is that on a system that doesn't have MCFG area reserved in
E820 pci_mmcfg_list is empty before acpi_init() and our PCI hooks are
called in the same place. So MCFG is still not in use by Xen at this
point since we haven't reached our xen_mcfg_late().
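The ordering can be illustrated with a toy model. This is only a sketch
under simplifying assumptions: struct boot_state and every model_*()
function are made-up names, not kernel or Xen code, and each flag merely
stands in for the real state (a non-empty pci_mmcfg_list, Xen having been
told about MCFG). The "lazy" parameter models reporting MCFG on the first
device-add hook invocation.

```c
#include <stdbool.h>

/* Toy model of the boot ordering; all names are illustrative only. */
struct boot_state {
    bool e820_reserves_mcfg;   /* firmware marked MCFG reserved in E820 */
    bool mmcfg_list_populated; /* stands in for a non-empty pci_mmcfg_list */
    bool xen_uses_mcfg;        /* Xen has accepted the MCFG area */
    bool lazy_done;            /* first-invocation attempt already made */
    int devices_added_without_mcfg;
};

/* arch_initcall stage: the early parse keeps the list only if E820 agrees. */
void model_early_parse(struct boot_state *s)
{
    s->mmcfg_list_populated = s->e820_reserves_mcfg;
}

/* Report whatever regions are known so far (models xen_mcfg_late()). */
void model_report_mcfg(struct boot_state *s)
{
    if (s->mmcfg_list_populated)
        s->xen_uses_mcfg = true;
}

/* Models the device-add hook; 'lazy' reports MCFG once, on the first call. */
void model_add_device(struct boot_state *s, bool lazy)
{
    if (lazy && !s->lazy_done) {
        model_report_mcfg(s);
        s->lazy_done = true;
    }
    if (!s->xen_uses_mcfg)
        s->devices_added_without_mcfg++;
}

/* subsys_initcall stage: the late parse populates the list, then PCI
 * enumeration calls the device-add hook for each device. */
void model_acpi_init(struct boot_state *s, bool lazy, int ndevices)
{
    s->mmcfg_list_populated = true;
    for (int i = 0; i < ndevices; i++)
        model_add_device(s, lazy);
}

/* Run the stages in order; return how many devices were added too early. */
int model_boot(bool e820_reserves_mcfg, bool lazy, int ndevices)
{
    struct boot_state s = {
        .e820_reserves_mcfg = e820_reserves_mcfg,
        /* If E820 reserves the area, Xen does not defer its usage. */
        .xen_uses_mcfg = e820_reserves_mcfg,
    };

    model_early_parse(&s);
    model_acpi_init(&s, lazy, ndevices);
    model_report_mcfg(&s); /* the existing initcall, after acpi_init() */
    return s.devices_added_without_mcfg;
}
```

In this model, a boot without the E820 reservation and without the lazy
report registers every device before Xen uses MCFG, while enabling the lazy
report or having the E820 reservation brings that count to zero.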
Igor
On 08/09/2019 19:28, Boris Ostrovsky wrote:
> On 9/6/19 7:00 PM, Igor Druzhinin wrote:
>>
>> On 06/09/2019 23:30, Boris Ostrovsky wrote:
>>>
>>> Where is MCFG parsed? pci_arch_init()?
>> It happens twice:
>> 1) first time early one in pci_arch_init() that is arch_initcall - that
>> time pci_mmcfg_list will be freed immediately there because MCFG area is
>> not reserved in E820;
>> 2) second time late one in acpi_init() which is subsystem_initcall right
>> before where PCI enumeration starts - this time ACPI tables will be
>> checked for a reserved resource and pci_mmcfg_list will be finally
>> populated.
>>
>> The problem is that on a system that doesn't have MCFG area reserved in
>> E820 pci_mmcfg_list is empty before acpi_init() and our PCI hooks are
>> called in the same place. So MCFG is still not in use by Xen at this
>> point since we haven't reached our xen_mcfg_late().
>
>
> Would it be possible for us to parse MCFG ourselves in pci_xen_init()? I
> realize that we'd be doing this twice (or maybe even three times since
> apparently both pci_arch_init() and acpi_init() do it).
>
I don't think it makes sense:
a) it needs to be done after ACPI is initialized since we need to parse
it to figure out the exact reserved region - that's why it's currently
done in acpi_init() (see commit message for the reasons why)
b) given (a) we cannot do it ourselves before acpi_init and after is too
late as we're already past ACPI PCI enumeration
c) we'd have to do it in the same place I call xen_mcfg_late() and it'd
be code duplication of what's already done by the existing code.
Igor
On 9/8/19 5:11 PM, Igor Druzhinin wrote:
> On 08/09/2019 19:28, Boris Ostrovsky wrote:
>> On 9/6/19 7:00 PM, Igor Druzhinin wrote:
>>> On 06/09/2019 23:30, Boris Ostrovsky wrote:
>>>> Where is MCFG parsed? pci_arch_init()?
>>> It happens twice:
>>> 1) first time early one in pci_arch_init() that is arch_initcall - that
>>> time pci_mmcfg_list will be freed immediately there because MCFG area is
>>> not reserved in E820;
>>> 2) second time late one in acpi_init() which is subsystem_initcall right
>>> before where PCI enumeration starts - this time ACPI tables will be
>>> checked for a reserved resource and pci_mmcfg_list will be finally
>>> populated.
>>>
>>> The problem is that on a system that doesn't have MCFG area reserved in
>>> E820 pci_mmcfg_list is empty before acpi_init() and our PCI hooks are
>>> called in the same place. So MCFG is still not in use by Xen at this
>>> point since we haven't reached our xen_mcfg_late().
>>
>> Would it be possible for us to parse MCFG ourselves in pci_xen_init()? I
>> realize that we'd be doing this twice (or maybe even three times since
>> apparently both pci_arch_init() and acpi_init() do it).
>>
> I don't think it makes sense:
> a) it needs to be done after ACPI is initialized since we need to parse
> it to figure out the exact reserved region - that's why it's currently
> done in acpi_init() (see commit message for the reasons why)
Hmm... We should be able to parse ACPI tables by the time
pci_arch_init() is called. In fact, if you look at
pci_mmcfg_early_init() you will see that it does just that.
> b) given (a) we cannot do it ourselves before acpi_init and after is too
> late as we're already past ACPI PCI enumeration
> c) we'd have to do it in the same place I call xen_mcfg_late() and it'd
> be code duplication of what's already done by the existing code.
If we manage to parse MCFG ourselves early then maybe we won't need
xen_mcfg_late()? We can call PHYSDEVOP_pci_mmcfg_reserved right away.
-boris
On 09/09/2019 00:30, Boris Ostrovsky wrote:
> On 9/8/19 5:11 PM, Igor Druzhinin wrote:
>> On 08/09/2019 19:28, Boris Ostrovsky wrote:
>>> On 9/6/19 7:00 PM, Igor Druzhinin wrote:
>>>> On 06/09/2019 23:30, Boris Ostrovsky wrote:
>>>>> Where is MCFG parsed? pci_arch_init()?
>>>> It happens twice:
>>>> 1) first time early one in pci_arch_init() that is arch_initcall - that
>>>> time pci_mmcfg_list will be freed immediately there because MCFG area is
>>>> not reserved in E820;
>>>> 2) second time late one in acpi_init() which is subsystem_initcall right
>>>> before where PCI enumeration starts - this time ACPI tables will be
>>>> checked for a reserved resource and pci_mmcfg_list will be finally
>>>> populated.
>>>>
>>>> The problem is that on a system that doesn't have MCFG area reserved in
>>>> E820 pci_mmcfg_list is empty before acpi_init() and our PCI hooks are
>>>> called in the same place. So MCFG is still not in use by Xen at this
>>>> point since we haven't reached our xen_mcfg_late().
>>>
>>> Would it be possible for us to parse MCFG ourselves in pci_xen_init()? I
>>> realize that we'd be doing this twice (or maybe even three times since
>>> apparently both pci_arch_init() and acpi_init() do it).
>>>
>> I don't think it makes sense:
>> a) it needs to be done after ACPI is initialized since we need to parse
>> it to figure out the exact reserved region - that's why it's currently
>> done in acpi_init() (see commit message for the reasons why)
>
> Hmm... We should be able to parse ACPI tables by the time
> pci_arch_init() is called. In fact, if you look at
> pci_mmcfg_early_init() you will see that it does just that.
>
The point is not to parse MCFG after acpi_init but to parse DSDT for
reserved resource which could be done only after ACPI initialization.
>> b) given (a) we cannot do it ourselves before acpi_init and after is too
>> late as we're already past ACPI PCI enumeration
>> c) we'd have to do it in the same place I call xen_mcfg_late() and it'd
>> be code duplication of what's already done by the existing code.
>
>
> If we manage to parse MCFG ourselves early then maybe we won't need
> xen_mcfg_late()? We can call PHYSDEVOP_pci_mmcfg_reserved right away.
Again, this cannot be done until acpi_init finishes basic setup to
parse DSDT.
Igor
On 9/6/19 7:00 PM, Igor Druzhinin wrote:
>
> On 06/09/2019 23:30, Boris Ostrovsky wrote:
>> On 9/3/19 8:20 PM, Igor Druzhinin wrote:
>>> If MCFG area is not reserved in E820, Xen by default will defer its usage
>>> until Dom0 registers it explicitly after ACPI parser recognizes it as
>>> a reserved resource in DSDT. Having it reserved in E820 is not
>>> mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
>>> and firmware is free to keep a hole E820 in that place. Xen doesn't know
>>> what exactly is inside this hole since it lacks full ACPI view of the
>>> platform therefore it's potentially harmful to access MCFG region
>>> without additional checks as some machines are known to provide
>>> inconsistent information on the size of the region.
>>>
>>> Now xen_mcfg_late() runs after acpi_init() which is too late as some basic
>>> PCI enumeration starts exactly there. Trying to register a device prior
>>> to MCFG reservation causes multiple problems with PCIe extended
>>> capability initializations in Xen (e.g. SR-IOV VF BAR sizing). There are
>>> no convenient hooks for us to subscribe to so try to register MCFG
>>> areas earlier upon the first invocation of xen_add_device().
>>
>> Where is MCFG parsed? pci_arch_init()?
> It happens twice:
> 1) first time early one in pci_arch_init() that is arch_initcall - that
> time pci_mmcfg_list will be freed immediately there because MCFG area is
> not reserved in E820;
> 2) second time late one in acpi_init() which is subsystem_initcall right
> before where PCI enumeration starts - this time ACPI tables will be
> checked for a reserved resource and pci_mmcfg_list will be finally
> populated.
>
> The problem is that on a system that doesn't have MCFG area reserved in
> E820 pci_mmcfg_list is empty before acpi_init() and our PCI hooks are
> called in the same place. So MCFG is still not in use by Xen at this
> point since we haven't reached our xen_mcfg_late().
Would it be possible for us to parse MCFG ourselves in pci_xen_init()? I
realize that we'd be doing this twice (or maybe even three times since
apparently both pci_arch_init() and acpi_init() do it).
-boris
On 9/8/19 7:37 PM, Igor Druzhinin wrote:
> On 09/09/2019 00:30, Boris Ostrovsky wrote:
>> On 9/8/19 5:11 PM, Igor Druzhinin wrote:
>>> On 08/09/2019 19:28, Boris Ostrovsky wrote:
>>>> On 9/6/19 7:00 PM, Igor Druzhinin wrote:
>>>>> On 06/09/2019 23:30, Boris Ostrovsky wrote:
>>>>>> Where is MCFG parsed? pci_arch_init()?
>>>>> It happens twice:
>>>>> 1) first time early one in pci_arch_init() that is arch_initcall - that
>>>>> time pci_mmcfg_list will be freed immediately there because MCFG area is
>>>>> not reserved in E820;
>>>>> 2) second time late one in acpi_init() which is subsystem_initcall right
>>>>> before where PCI enumeration starts - this time ACPI tables will be
>>>>> checked for a reserved resource and pci_mmcfg_list will be finally
>>>>> populated.
>>>>>
>>>>> The problem is that on a system that doesn't have MCFG area reserved in
>>>>> E820 pci_mmcfg_list is empty before acpi_init() and our PCI hooks are
>>>>> called in the same place. So MCFG is still not in use by Xen at this
>>>>> point since we haven't reached our xen_mcfg_late().
>>>> Would it be possible for us to parse MCFG ourselves in pci_xen_init()? I
>>>> realize that we'd be doing this twice (or maybe even three times since
>>>> apparently both pci_arch_init() and acpi_init() do it).
>>>>
>>> I don't think it makes sense:
>>> a) it needs to be done after ACPI is initialized since we need to parse
>>> it to figure out the exact reserved region - that's why it's currently
>>> done in acpi_init() (see commit message for the reasons why)
>> Hmm... We should be able to parse ACPI tables by the time
>> pci_arch_init() is called. In fact, if you look at
>> pci_mmcfg_early_init() you will see that it does just that.
>>
> The point is not to parse MCFG after acpi_init but to parse DSDT for
> reserved resource which could be done only after ACPI initialization.
OK, I think I understand now what you are trying to do --- you are
essentially trying to account for the range inserted by
setup_mcfg_map(), right?
The other question I have is why you think it's worth keeping
xen_mcfg_late() as a late initcall. How could MCFG info be updated
between acpi_init() and late_initcalls being run? I'd think it can only
happen when a new device is hotplugged.
-boris
>
>>> b) given (a) we cannot do it ourselves before acpi_init and after is too
>>> late as we're already past ACPI PCI enumeration
>>> c) we'd have to do it in the same place I call xen_mcfg_late() and it'd
>>> be code duplication of what's already done by the existing code.
>>
>> If we manage to parse MCFG ourselves early then maybe we won't need
>> xen_mcfg_late()? We can call PHYSDEVOP_pci_mmcfg_reserved right away.
> Again, this cannot be done until acpi_init finishes basic setup to
> parse DSDT.
>
> Igor
On 09/09/2019 20:19, Boris Ostrovsky wrote:
> On 9/8/19 7:37 PM, Igor Druzhinin wrote:
>> On 09/09/2019 00:30, Boris Ostrovsky wrote:
>>> On 9/8/19 5:11 PM, Igor Druzhinin wrote:
>>>> On 08/09/2019 19:28, Boris Ostrovsky wrote:
>>>>> On 9/6/19 7:00 PM, Igor Druzhinin wrote:
>>>>>> On 06/09/2019 23:30, Boris Ostrovsky wrote:
>>>>>>> Where is MCFG parsed? pci_arch_init()?
>>>>>> It happens twice:
>>>>>> 1) first time early one in pci_arch_init() that is arch_initcall - that
>>>>>> time pci_mmcfg_list will be freed immediately there because MCFG area is
>>>>>> not reserved in E820;
>>>>>> 2) second time late one in acpi_init() which is subsystem_initcall right
>>>>>> before where PCI enumeration starts - this time ACPI tables will be
>>>>>> checked for a reserved resource and pci_mmcfg_list will be finally
>>>>>> populated.
>>>>>>
>>>>>> The problem is that on a system that doesn't have MCFG area reserved in
>>>>>> E820 pci_mmcfg_list is empty before acpi_init() and our PCI hooks are
>>>>>> called in the same place. So MCFG is still not in use by Xen at this
>>>>>> point since we haven't reached our xen_mcfg_late().
>>>>> Would it be possible for us to parse MCFG ourselves in pci_xen_init()? I
>>>>> realize that we'd be doing this twice (or maybe even three times since
>>>>> apparently both pci_arch_init() and acpi_init() do it).
>>>>>
>>>> I don't think it makes sense:
>>>> a) it needs to be done after ACPI is initialized since we need to parse
>>>> it to figure out the exact reserved region - that's why it's currently
>>>> done in acpi_init() (see commit message for the reasons why)
>>> Hmm... We should be able to parse ACPI tables by the time
>>> pci_arch_init() is called. In fact, if you look at
>>> pci_mmcfg_early_init() you will see that it does just that.
>>>
>> The point is not to parse MCFG after acpi_init but to parse DSDT for
>> reserved resource which could be done only after ACPI initialization.
>
> OK, I think I understand now what you are trying to do --- you are
> essentially trying to account for the range inserted by
> setup_mcfg_map(), right?
>
Actually, pci_mmcfg_late_init() that's called out of acpi_init() -
that's where MCFG areas are properly sized. setup_mcfg_map() is mostly
for bus hotplug where MCFG area is discovered by evaluating _CBA method;
for cold-plugged buses it just confirms that MCFG area is already
registered because it is mandated for them to be in MCFG table at boot time.
> The other question I have is why you think it's worth keeping
> xen_mcfg_late() as a late initcall. How could MCFG info be updated
> between acpi_init() and late_initcalls being run? I'd think it can only
> happen when a new device is hotplugged.
>
It was a precaution against setup_mcfg_map() calls that might add new
areas that are not in MCFG table but for some reason have _CBA method.
It's obviously a "firmware is broken" scenario so I don't have strong
feelings about keeping it here. I can remove it in v2 if you want.
Igor
On 9/9/19 5:48 PM, Igor Druzhinin wrote:
> On 09/09/2019 20:19, Boris Ostrovsky wrote:
>> On 9/8/19 7:37 PM, Igor Druzhinin wrote:
>>> On 09/09/2019 00:30, Boris Ostrovsky wrote:
>>>> On 9/8/19 5:11 PM, Igor Druzhinin wrote:
>>>>> On 08/09/2019 19:28, Boris Ostrovsky wrote:
>>>>>> On 9/6/19 7:00 PM, Igor Druzhinin wrote:
>>>>>>> On 06/09/2019 23:30, Boris Ostrovsky wrote:
>>>>>>>> Where is MCFG parsed? pci_arch_init()?
>>>>>>> It happens twice:
>>>>>>> 1) first time early one in pci_arch_init() that is arch_initcall - that
>>>>>>> time pci_mmcfg_list will be freed immediately there because MCFG area is
>>>>>>> not reserved in E820;
>>>>>>> 2) second time late one in acpi_init() which is subsystem_initcall right
>>>>>>> before where PCI enumeration starts - this time ACPI tables will be
>>>>>>> checked for a reserved resource and pci_mmcfg_list will be finally
>>>>>>> populated.
>>>>>>>
>>>>>>> The problem is that on a system that doesn't have MCFG area reserved in
>>>>>>> E820 pci_mmcfg_list is empty before acpi_init() and our PCI hooks are
>>>>>>> called in the same place. So MCFG is still not in use by Xen at this
>>>>>>> point since we haven't reached our xen_mcfg_late().
>>>>>> Would it be possible for us to parse MCFG ourselves in pci_xen_init()? I
>>>>>> realize that we'd be doing this twice (or maybe even three times since
>>>>>> apparently both pci_arch_init() and acpi_init() do it).
>>>>>>
>>>>> I don't think it makes sense:
>>>>> a) it needs to be done after ACPI is initialized since we need to parse
>>>>> it to figure out the exact reserved region - that's why it's currently
>>>>> done in acpi_init() (see commit message for the reasons why)
>>>> Hmm... We should be able to parse ACPI tables by the time
>>>> pci_arch_init() is called. In fact, if you look at
>>>> pci_mmcfg_early_init() you will see that it does just that.
>>>>
>>> The point is not to parse MCFG after acpi_init but to parse DSDT for
>>> reserved resource which could be done only after ACPI initialization.
>> OK, I think I understand now what you are trying to do --- you are
>> essentially trying to account for the range inserted by
>> setup_mcfg_map(), right?
>>
> Actually, pci_mmcfg_late_init() that's called out of acpi_init() -
> that's where MCFG areas are properly sized.
pci_mmcfg_late_init() reads the (static) MCFG, which doesn't need DSDT parsing, does it? setup_mcfg_map() OTOH does need it as it uses data from _CBA (or is it _CRS?), and I think that's why we can't parse MCFG prior to acpi_init(). So what I said above indeed won't work.
> setup_mcfg_map() is mostly
> for bus hotplug where MCFG area is discovered by evaluating _CBA method;
> for cold-plugged buses it just confirms that MCFG area is already
> registered because it is mandated for them to be in MCFG table at boot time.
>
>> The other question I have is why you think it's worth keeping
>> xen_mcfg_late() as a late initcall. How could MCFG info be updated
>> between acpi_init() and late_initcalls being run? I'd think it can only
>> happen when a new device is hotplugged.
>>
> It was a precaution against setup_mcfg_map() calls that might add new
> areas that are not in MCFG table but for some reason have _CBA method.
> It's obviously a "firmware is broken" scenario so I don't have strong
> feelings to keep it here. Will prefer to remove in v2 if you want.
Isn't setup_mcfg_map() called before the first xen_add_device() which is where you are calling xen_mcfg_late()?
-boris