http://bugzilla.kernel.org/show_bug.cgi?id=8833
We write 0xffffffff to BARs to detect BAR size, this will change BAR
base to 0xfxxxxxx depends on BAR size. In the bug, PCI MCFG base address
is 0xf4000000. One PCI device (gfx) has a 256M BAR, the detection code
will temprarily change it to 0xf0000000, so conflict with MCFG decode
range. Later memory based config space read/write address is decoded by
both MCH and gfx and cause a hang. This patch disables resource decode
in BAR size detection to avoid resource conflict.
Signed-off-by: Shaohua Li <[email protected]>
---
drivers/pci/probe.c | 6 ++++++
1 file changed, 6 insertions(+)
Index: linux/drivers/pci/probe.c
===================================================================
--- linux.orig/drivers/pci/probe.c 2007-09-12 10:44:19.000000000 +0800
+++ linux/drivers/pci/probe.c 2007-09-13 12:58:18.000000000 +0800
@@ -185,6 +185,11 @@ static void pci_read_bases(struct pci_de
unsigned int pos, reg, next;
u32 l, sz;
struct resource *res;
+ u16 command;
+
+ pci_read_config_word(dev, PCI_COMMAND, &command);
+ pci_write_config_word(dev, PCI_COMMAND,
+ command & (~(PCI_COMMAND_MEMORY|PCI_COMMAND_IO)));
for(pos=0; pos<howmany; pos = next) {
u64 l64;
@@ -283,6 +288,7 @@ static void pci_read_bases(struct pci_de
}
}
}
+ pci_write_config_word(dev, PCI_COMMAND, command);
}
void pci_read_bridge_bases(struct pci_bus *child)
On Thu, Sep 13, 2007 at 02:21:07PM +0800, Shaohua Li wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=8833
> We write 0xffffffff to BARs to detect BAR size, this will change BAR
> base to 0xfxxxxxx depends on BAR size. In the bug, PCI MCFG base address
> is 0xf4000000. One PCI device (gfx) has a 256M BAR, the detection code
> will temprarily change it to 0xf0000000, so conflict with MCFG decode
> range. Later memory based config space read/write address is decoded by
> both MCH and gfx and cause a hang. This patch disables resource decode
> in BAR size detection to avoid resource conflict.
You missed the part where we have to avoid doing this for host bridges ...
http://marc.info/?l=linux-kernel&m=118809338631160&w=2
(I believe it's now queued in Greg's tree, and possibly Andrew's tree
too)
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
On Thu, 2007-09-13 at 01:31 -0600, Matthew Wilcox wrote:
> On Thu, Sep 13, 2007 at 02:21:07PM +0800, Shaohua Li wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=8833
> > We write 0xffffffff to BARs to detect BAR size, this will change BAR
> > base to 0xfxxxxxx depends on BAR size. In the bug, PCI MCFG base address
> > is 0xf4000000. One PCI device (gfx) has a 256M BAR, the detection code
> > will temprarily change it to 0xf0000000, so conflict with MCFG decode
> > range. Later memory based config space read/write address is decoded by
> > both MCH and gfx and cause a hang. This patch disables resource decode
> > in BAR size detection to avoid resource conflict.
>
> You missed the part where we have to avoid doing this for host bridges ...
>
> http://marc.info/?l=linux-kernel&m=118809338631160&w=2
>
> (I believe it's now queued in Greg's tree, and possibly Andrew's tree
> too)
Hmm, I don't know there is already a fix for this, so waste a whole
morning :(. yes, your patch looks better.
Thanks,
Shaohua
On Thu, Sep 13, 2007 at 03:24:46PM +0800, Shaohua Li wrote:
> On Thu, 2007-09-13 at 01:31 -0600, Matthew Wilcox wrote:
> > You missed the part where we have to avoid doing this for host bridges ...
> >
> > http://marc.info/?l=linux-kernel&m=118809338631160&w=2
> >
> > (I believe it's now queued in Greg's tree, and possibly Andrew's tree
> > too)
> Hmm, I don't know there is already a fix for this, so waste a whole
> morning :(. yes, your patch looks better.
Yes, this bug is causing a lot of people to waste time. I fielded an
internal request for this patch this afternoon. I appreciate we're
post-rc6 at this point, but it does rather suck to be releasing a kernel
which freezes on boot on this class of machines.
Unfortunately if this patch does cause any machine to break, these will
be machines that worked fine up until this point, so that would be a
regression, which is worse. Life sucks.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
On Thu, Sep 13, 2007 at 01:55:36AM -0600, Matthew Wilcox wrote:
> On Thu, Sep 13, 2007 at 03:24:46PM +0800, Shaohua Li wrote:
> > On Thu, 2007-09-13 at 01:31 -0600, Matthew Wilcox wrote:
> > > You missed the part where we have to avoid doing this for host bridges ...
> > >
> > > http://marc.info/?l=linux-kernel&m=118809338631160&w=2
> > >
> > > (I believe it's now queued in Greg's tree, and possibly Andrew's tree
> > > too)
> > Hmm, I don't know there is already a fix for this, so waste a whole
> > morning :(. yes, your patch looks better.
>
> Yes, this bug is causing a lot of people to waste time. I fielded an
> internal request for this patch this afternoon. I appreciate we're
> post-rc6 at this point, but it does rather suck to be releasing a kernel
> which freezes on boot on this class of machines.
But it's not like the kernel every worked on this class of machines,
right? This is not something that was a regression from what I can see.
> Unfortunately if this patch does cause any machine to break, these will
> be machines that worked fine up until this point, so that would be a
> regression, which is worse. Life sucks.
If, after a while, you think the change should go into the -stable tree,
I have no objection.
thanks,
greg k-h
On Thu, Sep 13, 2007 at 02:53:13AM -0700, Greg KH wrote:
> On Thu, Sep 13, 2007 at 01:55:36AM -0600, Matthew Wilcox wrote:
> > Unfortunately if this patch does cause any machine to break, these will
> > be machines that worked fine up until this point, so that would be a
> > regression, which is worse. Life sucks.
>
> If, after a while, you think the change should go into the -stable tree,
> I have no objection.
I think it shouldn't - this change will almost certainly cause a regression.
There is a lot of system devices besides the host bridges that shouldn't be
disabled during BAR probe, like interrupt controllers, power management
controllers and so on.
We need a more sophisticated fix - I'm thinking of introducing "probe" field
in struct pci_dev which can be set by "early" quirk routines.
Ivan.
On Thu, Sep 13, 2007 at 03:16:37PM +0400, Ivan Kokshaysky wrote:
> On Thu, Sep 13, 2007 at 02:53:13AM -0700, Greg KH wrote:
> > On Thu, Sep 13, 2007 at 01:55:36AM -0600, Matthew Wilcox wrote:
> > > Unfortunately if this patch does cause any machine to break, these will
> > > be machines that worked fine up until this point, so that would be a
> > > regression, which is worse. Life sucks.
> >
> > If, after a while, you think the change should go into the -stable tree,
> > I have no objection.
>
> I think it shouldn't - this change will almost certainly cause a regression.
> There is a lot of system devices besides the host bridges that shouldn't be
> disabled during BAR probe, like interrupt controllers, power management
> controllers and so on.
Well, if it's going to cause a regression with machines that currently
work properly, I'll just drop it entirely. I would much rather not have
a machine work at all with Linux, than break other people's working
machines.
> We need a more sophisticated fix - I'm thinking of introducing "probe" field
> in struct pci_dev which can be set by "early" quirk routines.
That might work out well. Matthew, want to look into this for a
possible way to get your fix into the tree in a way that will not affect
others?
thanks,
greg k-h
Ivan Kokshaysky wrote:
> On Thu, Sep 13, 2007 at 02:53:13AM -0700, Greg KH wrote:
>> On Thu, Sep 13, 2007 at 01:55:36AM -0600, Matthew Wilcox wrote:
>>> Unfortunately if this patch does cause any machine to break, these will
>>> be machines that worked fine up until this point, so that would be a
>>> regression, which is worse. Life sucks.
>> If, after a while, you think the change should go into the -stable tree,
>> I have no objection.
>
> I think it shouldn't - this change will almost certainly cause a regression.
> There is a lot of system devices besides the host bridges that shouldn't be
> disabled during BAR probe, like interrupt controllers, power management
> controllers and so on.
>
> We need a more sophisticated fix - I'm thinking of introducing "probe" field
> in struct pci_dev which can be set by "early" quirk routines.
Disabling the BAR decoding does not mean disabling the device (aside
from the one group of host bridge that apparently disables CPU to RAM
access, whose designers were apparently on crack when they read the PCI
spec and which is why we don't disable the decode on host bridges).
There really is no other sane way to do it.
If we do encounter other devices that choke on having the BAR disabled
during probing then we can add additional quirk logic, but we haven't
run into anything like that yet.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
On Thu, Sep 13, 2007 at 09:32:42PM -0600, Robert Hancock wrote:
> Disabling the BAR decoding does not mean disabling the device (aside
> from the one group of host bridge that apparently disables CPU to RAM
> access, whose designers were apparently on crack when they read the PCI
> spec and which is why we don't disable the decode on host bridges).
> There really is no other sane way to do it.
Disabling IO and MEM decode effectively disables device as a PCI target.
Worse, it's often not just BARs that get disabled, but some non-standard
or hidden address ranges as well. Host bridges are just one common
example, there is a lot of other insane hardware, more than one could
expect.
BTW, the way how the internal address decode works on these latest Intel
whippets doesn't look sane either. ;-)
> If we do encounter other devices that choke on having the BAR disabled
> during probing then we can add additional quirk logic, but we haven't
> run into anything like that yet.
As Greg said, the main point is "no regression".
Folks, can you check if the patch below helps?
Ivan.
--- 2.6.23-rc6-git4/include/linux/pci.h 2007-09-12 17:22:57.000000000 +0400
+++ linux/include/linux/pci.h 2007-09-14 10:26:29.000000000 +0400
@@ -183,6 +183,7 @@ struct pci_dev {
unsigned int msi_enabled:1;
unsigned int msix_enabled:1;
unsigned int is_managed:1;
+ unsigned int disable_while_probe:1; /* Probe BARs with IO and MMIO turned off */
atomic_t enable_cnt; /* pci_enable_device has been called */
u32 saved_config_space[16]; /* config space saved at suspend time */
--- 2.6.23-rc6-git4/arch/i386/pci/fixup.c 2007-09-12 17:22:55.000000000 +0400
+++ linux/arch/i386/pci/fixup.c 2007-09-14 14:43:13.000000000 +0400
@@ -444,3 +444,21 @@ static void __devinit pci_siemens_interr
}
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_SIEMENS, 0x0015,
pci_siemens_interrupt_controller);
+
+/*
+ * Work around apparent quirk in internal address decoding on
+ * some Intel chipsets (G31, G33, Q965 etc.):
+ * when some MMIO address range of integrated peripheral overlaps
+ * mmconfig area, mmconfig accesses are blocked. This can happen
+ * when we size the BAR writing all ones to it.
+ * In practice, only integrated graphics controller has MMIO range
+ * large enough (BAR2, 256Mb) to cause a trouble, so we disable
+ * address decoders just on that function during PCI BAR probe.
+ */
+static void __devinit pci_intel_mmconfig(struct pci_dev *dev)
+{
+ if ((dev->class >> 8) == PCI_CLASS_DISPLAY_VGA &&
+ pci_domain_nr(dev->bus) == 0 && dev->bus->number == 0)
+ dev->disable_while_probe = 1;
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_ANY_ID, pci_intel_mmconfig);
--- 2.6.23-rc6-git4/drivers/pci/probe.c 2007-09-14 00:35:47.000000000 +0400
+++ linux/drivers/pci/probe.c 2007-09-14 10:27:45.000000000 +0400
@@ -185,6 +185,13 @@ static void pci_read_bases(struct pci_de
unsigned int pos, reg, next;
u32 l, sz;
struct resource *res;
+ u16 cmd, orig_cmd;
+
+ if (dev->disable_while_probe) {
+ pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
+ cmd |= orig_cmd & ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
+ pci_write_config_word(dev, PCI_COMMAND, cmd);
+ }
for(pos=0; pos<howmany; pos = next) {
u64 l64;
@@ -283,6 +290,8 @@ static void pci_read_bases(struct pci_de
}
}
}
+ if (dev->disable_while_probe)
+ pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
}
void pci_read_bridge_bases(struct pci_bus *child)
On Fri, Sep 14, 2007 at 03:14:01PM +0400, Ivan Kokshaysky wrote:
> whippets doesn't look sane either. ;-)
^^^^^^^^
It's not me, it's a spellchecker :-) I meant "chipsets" of course, sorry.
Ivan.
Ivan Kokshaysky wrote:
> On Thu, Sep 13, 2007 at 09:32:42PM -0600, Robert Hancock wrote:
>> Disabling the BAR decoding does not mean disabling the device (aside
>> from the one group of host bridge that apparently disables CPU to RAM
>> access, whose designers were apparently on crack when they read the PCI
>> spec and which is why we don't disable the decode on host bridges).
>> There really is no other sane way to do it.
>
> Disabling IO and MEM decode effectively disables device as a PCI target.
> Worse, it's often not just BARs that get disabled, but some non-standard
> or hidden address ranges as well. Host bridges are just one common
> example, there is a lot of other insane hardware, more than one could
> expect.
Do you have an example of specific hardware that exhibits this problem?
So far after a similar patch has been in -mm for months there have been
no reports of any such problems.
> BTW, the way how the internal address decode works on these latest Intel
> whippets doesn't look sane either. ;-)
>
>> If we do encounter other devices that choke on having the BAR disabled
>> during probing then we can add additional quirk logic, but we haven't
>> run into anything like that yet.
>
> As Greg said, the main point is "no regression".
>
> Folks, can you check if the patch below helps?
This isn't guaranteed to help. I don't think it is only integrated
graphics that could cause this problem, I think that an add-on PCI
Express card can do this as well. Depending on the chipset memory decode
rules, any PCI or PCI Express device with a BAR large enough could cause
issues.
>
> Ivan.
>
> --- 2.6.23-rc6-git4/include/linux/pci.h 2007-09-12 17:22:57.000000000 +0400
> +++ linux/include/linux/pci.h 2007-09-14 10:26:29.000000000 +0400
> @@ -183,6 +183,7 @@ struct pci_dev {
> unsigned int msi_enabled:1;
> unsigned int msix_enabled:1;
> unsigned int is_managed:1;
> + unsigned int disable_while_probe:1; /* Probe BARs with IO and MMIO turned off */
> atomic_t enable_cnt; /* pci_enable_device has been called */
>
> u32 saved_config_space[16]; /* config space saved at suspend time */
> --- 2.6.23-rc6-git4/arch/i386/pci/fixup.c 2007-09-12 17:22:55.000000000 +0400
> +++ linux/arch/i386/pci/fixup.c 2007-09-14 14:43:13.000000000 +0400
> @@ -444,3 +444,21 @@ static void __devinit pci_siemens_interr
> }
> DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_SIEMENS, 0x0015,
> pci_siemens_interrupt_controller);
> +
> +/*
> + * Work around apparent quirk in internal address decoding on
> + * some Intel chipsets (G31, G33, Q965 etc.):
> + * when some MMIO address range of integrated peripheral overlaps
> + * mmconfig area, mmconfig accesses are blocked. This can happen
> + * when we size the BAR writing all ones to it.
> + * In practice, only integrated graphics controller has MMIO range
> + * large enough (BAR2, 256Mb) to cause a trouble, so we disable
> + * address decoders just on that function during PCI BAR probe.
> + */
> +static void __devinit pci_intel_mmconfig(struct pci_dev *dev)
> +{
> + if ((dev->class >> 8) == PCI_CLASS_DISPLAY_VGA &&
> + pci_domain_nr(dev->bus) == 0 && dev->bus->number == 0)
> + dev->disable_while_probe = 1;
> +}
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_ANY_ID, pci_intel_mmconfig);
> --- 2.6.23-rc6-git4/drivers/pci/probe.c 2007-09-14 00:35:47.000000000 +0400
> +++ linux/drivers/pci/probe.c 2007-09-14 10:27:45.000000000 +0400
> @@ -185,6 +185,13 @@ static void pci_read_bases(struct pci_de
> unsigned int pos, reg, next;
> u32 l, sz;
> struct resource *res;
> + u16 cmd, orig_cmd;
> +
> + if (dev->disable_while_probe) {
> + pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
> + cmd |= orig_cmd & ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
> + pci_write_config_word(dev, PCI_COMMAND, cmd);
> + }
>
> for(pos=0; pos<howmany; pos = next) {
> u64 l64;
> @@ -283,6 +290,8 @@ static void pci_read_bases(struct pci_de
> }
> }
> }
> + if (dev->disable_while_probe)
> + pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
> }
>
> void pci_read_bridge_bases(struct pci_bus *child)
>
On Fri, Sep 14, 2007 at 08:30:59AM -0600, Robert Hancock wrote:
> Do you have an example of specific hardware that exhibits this problem?
Well, first two results of google search for "disable bar when sizing":
http://lkml.org/lkml/2002/12/21/95
http://lkml.org/lkml/2002/12/20/110
> So far after a similar patch has been in -mm for months there have been
> no reports of any such problems.
You cannot compare user base of -mm and release kernels. Also, note
that we're talking about maybe 0.01% of systems running Linux.
And similar patch appeared in various trees several times over the last
decade and every time it had to be reverted.
> This isn't guaranteed to help. I don't think it is only integrated
> graphics that could cause this problem, I think that an add-on PCI
> Express card can do this as well. Depending on the chipset memory decode
> rules, any PCI or PCI Express device with a BAR large enough could cause
> issues.
No, it's impossible for several reasons. Most obvious one is that a PCI-E
bridge does isolate stuff quite nicely.
Ivan.
Ivan Kokshaysky wrote:
> On Fri, Sep 14, 2007 at 08:30:59AM -0600, Robert Hancock wrote:
>> Do you have an example of specific hardware that exhibits this problem?
>
> Well, first two results of google search for "disable bar when sizing":
> http://lkml.org/lkml/2002/12/21/95
> http://lkml.org/lkml/2002/12/20/110
In the first one, Linus talks about a USB controller whose SMM code
chokes on the BAR being disabled. That explanation seems odd to me. If
it chokes on the BAR disabled, how doesn't it choke on the BAR being
moved to a different location, which is unavoidable during probing?
Also, I think we do USB handoff from the BIOS much earlier than we used
to, which likely prevents these issues.
In the second one, he mentions that "So the secondary rule to "don't
turnoff MEM or IO accesses" is "never allocate real PCI BAR resources at
the top of memory". Well, unfortunately, the second one isn't possible
to meet given that we have boards with the MMCONFIG up there, so
disabling the decode is the only thing we can do. That whole discussion
was also based on the fact that it wasn't necessary to solve problems.
This is no longer a theoretical problem. We now have real problems that
we need this in order to solve.
>
>> So far after a similar patch has been in -mm for months there have been
>> no reports of any such problems.
>
> You cannot compare user base of -mm and release kernels. Also, note
> that we're talking about maybe 0.01% of systems running Linux.
> And similar patch appeared in various trees several times over the last
> decade and every time it had to be reverted.
>
>> This isn't guaranteed to help. I don't think it is only integrated
>> graphics that could cause this problem, I think that an add-on PCI
>> Express card can do this as well. Depending on the chipset memory decode
>> rules, any PCI or PCI Express device with a BAR large enough could cause
>> issues.
>
> No, it's impossible for several reasons. Most obvious one is that a PCI-E
> bridge does isolate stuff quite nicely.
It's not impossible at all. In fact I'm quite sure (Jesse can confirm)
that in the case of the board he was using, it was an add-in graphics
card where he saw this problem.
The fact is that in the case of MMCONFIG overlap with PCI BARs, which
one takes priority is completely undefined. In the case of this Intel
chipset, clearly the PCI Express device connected to the northbridge had
higher decode priority than the MMCONFIG aperture.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
On 9/14/07, Robert Hancock <[email protected]> wrote:
> It's not impossible at all. In fact I'm quite sure (Jesse can confirm)
> that in the case of the board he was using, it was an add-in graphics
> card where he saw this problem.
>
> The fact is that in the case of MMCONFIG overlap with PCI BARs, which
> one takes priority is completely undefined. In the case of this Intel
> chipset, clearly the PCI Express device connected to the northbridge had
> higher decode priority than the MMCONFIG aperture.
can you relocate the MMCONFIG above RAM range? for example 512G...
YH
Yinghai Lu wrote:
> On 9/14/07, Robert Hancock <[email protected]> wrote:
>> It's not impossible at all. In fact I'm quite sure (Jesse can confirm)
>> that in the case of the board he was using, it was an add-in graphics
>> card where he saw this problem.
>>
>> The fact is that in the case of MMCONFIG overlap with PCI BARs, which
>> one takes priority is completely undefined. In the case of this Intel
>> chipset, clearly the PCI Express device connected to the northbridge had
>> higher decode priority than the MMCONFIG aperture.
>
> can you relocate the MMCONFIG above RAM range? for example 512G...
On some chipsets that might be possible, however I think that on the
Intel ones it's not possible to move it above 4G. And in any case this
would require chipset-specific knowledge, which we would rather not have
to need.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
On Fri, Sep 14, 2007 at 05:53:58PM -0600, Robert Hancock wrote:
> In the first one, Linus talks about a USB controller whose SMM code
> chokes on the BAR being disabled. That explanation seems odd to me. If
> it chokes on the BAR disabled, how doesn't it choke on the BAR being
> moved to a different location, which is unavoidable during probing?
Here is an article with detailed explanation of that USB issue:
http://support.microsoft.com/kb/250635
The most interesting fact there is that windows *does not* clear
PCI_COMMAND_MEMORY bits during PCI bus enumeration, so BIOS writers
have to rely on that. Yet another reason why your patch is unsafe.
> Also, I think we do USB handoff from the BIOS much earlier than we used
> to, which likely prevents these issues.
I don't think so. This can only be done in USB driver, which cannot run
before PCI device discovery.
> In the second one, he mentions that "So the secondary rule to "don't
> turnoff MEM or IO accesses" is "never allocate real PCI BAR resources at
> the top of memory". Well, unfortunately, the second one isn't possible
> to meet given that we have boards with the MMCONFIG up there, so
> disabling the decode is the only thing we can do. That whole discussion
> was also based on the fact that it wasn't necessary to solve problems.
> This is no longer a theoretical problem. We now have real problems that
> we need this in order to solve.
I have suggested a *practical* solution that disables the decode only
when it's unavoidable. If it's not sufficient for some machines, it
can be extended quite easily.
> > No, it's impossible for several reasons. Most obvious one is that a PCI-E
> > bridge does isolate stuff quite nicely.
>
> It's not impossible at all. In fact I'm quite sure (Jesse can confirm)
> that in the case of the board he was using, it was an add-in graphics
> card where he saw this problem.
Again, it's impossible with AGP or PCI-E cards, which are always separated
from the root bus by AGP or PCI-E bridge. The bridge does decode in the
first place, so when you write the BAR with all ones, you move it outside
the bridge decoding window. Address collision isn't possible in this
case in principle.
> The fact is that in the case of MMCONFIG overlap with PCI BARs, which
> one takes priority is completely undefined. In the case of this Intel
> chipset, clearly the PCI Express device connected to the northbridge had
> higher decode priority than the MMCONFIG aperture.
Another possible solution is not to use mmconfig for bar sizing at all,
if it's *that* broken. It would be more complicated though, since it
probably requires changes to all architectures.
Ivan.
Ivan Kokshaysky wrote:
> On Fri, Sep 14, 2007 at 05:53:58PM -0600, Robert Hancock wrote:
>> In the first one, Linus talks about a USB controller whose SMM code
>> chokes on the BAR being disabled. That explanation seems odd to me. If
>> it chokes on the BAR disabled, how doesn't it choke on the BAR being
>> moved to a different location, which is unavoidable during probing?
>
> Here is an article with detailed explanation of that USB issue:
>
> http://support.microsoft.com/kb/250635
>
> The most interesting fact there is that windows *does not* clear
> PCI_COMMAND_MEMORY bits during PCI bus enumeration, so BIOS writers
> have to rely on that. Yet another reason why your patch is unsafe.
Windows 98 doesn't, it says. That doesn't say anything about what newer
versions of Windows are doing (like Vista, which is the first to
actually use MMCONFIG).
>
>> Also, I think we do USB handoff from the BIOS much earlier than we used
>> to, which likely prevents these issues.
>
> I don't think so. This can only be done in USB driver, which cannot run
> before PCI device discovery.
Check the code. We have a quirk to handle this, in
drivers/usb/host/pci-quirks.c
>
>> In the second one, he mentions that "So the secondary rule to "don't
>> turnoff MEM or IO accesses" is "never allocate real PCI BAR resources at
>> the top of memory". Well, unfortunately, the second one isn't possible
>> to meet given that we have boards with the MMCONFIG up there, so
>> disabling the decode is the only thing we can do. That whole discussion
>> was also based on the fact that it wasn't necessary to solve problems.
>> This is no longer a theoretical problem. We now have real problems that
>> we need this in order to solve.
>
> I have suggested a *practical* solution that disables the decode only
> when it's unavoidable. If it's not sufficient for some machines, it
> can be extended quite easily.
>
>>> No, it's impossible for several reasons. Most obvious one is that a PCI-E
>>> bridge does isolate stuff quite nicely.
>> It's not impossible at all. In fact I'm quite sure (Jesse can confirm)
>> that in the case of the board he was using, it was an add-in graphics
>> card where he saw this problem.
>
> Again, it's impossible with AGP or PCI-E cards, which are always separated
> from the root bus by AGP or PCI-E bridge. The bridge does decode in the
> first place, so when you write the BAR with all ones, you move it outside
> the bridge decoding window. Address collision isn't possible in this
> case in principle.
And yet, something was clearly happening to cause this. Jesse, what were
the bridge decode windows (shown in lspci -v) on one of those boards
that was having issues?
>
>> The fact is that in the case of MMCONFIG overlap with PCI BARs, which
>> one takes priority is completely undefined. In the case of this Intel
>> chipset, clearly the PCI Express device connected to the northbridge had
>> higher decode priority than the MMCONFIG aperture.
>
> Another possible solution is not to use mmconfig for bar sizing at all,
> if it's *that* broken. It would be more complicated though, since it
> probably requires changes to all architectures.
It's not broken at all, it just requires that we not do silly things
like stick enabled decode regions over top of the aperture.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
On Sun, Sep 16, 2007 at 03:13:55PM +0400, Ivan Kokshaysky wrote:
> Another possible solution is not to use mmconfig for bar sizing at all,
> if it's *that* broken. It would be more complicated though, since it
> probably requires changes to all architectures.
I don't think it would be that complicated. We could just delay probing
for mmconfig until after the pci bus probes are done. No changes to
other architectures needed.
I'm really disappointed with mmconfig. Here was a great chance to get
rid of all the sucky performance problems of the past, and BIOS writers
and chipset designers have fucked up the implementation so much that
it's now the ugliest method for getting to pci config space. And it's
the *only* method of getting to extended config space.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
On Thu, 2007-09-13 at 15:16 +0400, Ivan Kokshaysky wrote:
> On Thu, Sep 13, 2007 at 02:53:13AM -0700, Greg KH wrote:
> > On Thu, Sep 13, 2007 at 01:55:36AM -0600, Matthew Wilcox wrote:
> > > Unfortunately if this patch does cause any machine to break, these will
> > > be machines that worked fine up until this point, so that would be a
> > > regression, which is worse. Life sucks.
> >
> > If, after a while, you think the change should go into the -stable tree,
> > I have no objection.
>
> I think it shouldn't - this change will almost certainly cause a regression.
> There is a lot of system devices besides the host bridges that shouldn't be
> disabled during BAR probe, like interrupt controllers, power management
> controllers and so on.
>
> We need a more sophisticated fix - I'm thinking of introducing "probe" field
> in struct pci_dev which can be set by "early" quirk routines.
Agreed. I have a similar problem on ppc where it's common to have things
like the main PIC on a PCI device. Note that another problem is (or at
least was, i haven't checked recently) the P2P bridge scanning code
that, in a similar way, can block the path to all devices below it. I
-do- have a case for example with Apple Xserve G4's where the main Apple
IO ASIC, which is a PCI device containing the PIC, the power management
controller, and various low level system control IOs is behind a pair of
P2P bridges.
One solution for us (PPC) is to enforce those devices and bridges to be
described in the OF tree, and generalize a bit the code we have for some
64 bits machines, that synthetizes the pci_dev's from the OF nodes
rather than probing. But that's not going to help other archs.
In fact, that's a problem we also have with
pci_assign_unassigned_resources() which will happily move things around
that must not be moved, especially when sitting behind P2P bridges.
So the root of the issue is much deeper than just a quirk here I
believe.
Cheers,
Ben.
On Thu, 2007-09-13 at 21:32 -0600, Robert Hancock wrote:
>
> If we do encounter other devices that choke on having the BAR
> disabled
> during probing then we can add additional quirk logic, but we haven't
> run into anything like that yet.
Well... if the device needs to be accessed to service an interrupt then
you do have a problem. For example... the PIC :-)
Problem is.. it's not practical nor really feasible generally to have
IRQs off on all CPUs during PCI probing neither... Unless we define that
the initial boot time probing is "special", and the first pass that
actually probes devices (and doesn't muck around with the sysfs
hierarchy etc...) can be run in a special context with all interrupt
servicing disabled on the PIC, though that will require some arch
support.
Ben.
Benjamin Herrenschmidt wrote:
> On Thu, 2007-09-13 at 21:32 -0600, Robert Hancock wrote:
>> If we do encounter other devices that choke on having the BAR
>> disabled
>> during probing then we can add additional quirk logic, but we haven't
>> run into anything like that yet.
>
> Well... if the device needs to be accessed to service an interrupt then
> you do have a problem. For example... the PIC :-)
>
> Problem is.. it's not practical nor really feasible generally to have
> IRQs off on all CPUs during PCI probing neither... Unless we define that
> the initial boot time probing is "special", and the first pass that
> actually probes devices (and doesn't muck around with the sysfs
> hierarchy etc...) can be run in a special context with all interrupt
> servicing disabled on the PIC, though that will require some arch
> support.
>
> Ben.
We would already have this problem, though. If it causes problems to
disable decode on the BAR because we try to access it in interrupt
context, we would already have problems because we move the thing to
0xFFFFFFFF during probing anyway..
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
On Sun, 2007-09-16 at 17:37 -0600, Robert Hancock wrote:
> We would already have this problem, though. If it causes problems to
> disable decode on the BAR because we try to access it in interrupt
> context, we would already have problems because we move the thing to
> 0xFFFFFFFF during probing anyway..
Yes, it's not a new problem, I was just pointing out the fact that there
are deeper issues already there in the same area.
Today, we mostly get lucky because we rarely take an irq during boot
time PCI probe but I've seen it happen (one easy way to screw things up
on -many- machines is to enable the DEBUG stuff in the PCI probe code
and use a serial console for example, especially if access to the device
with the serial ports in it gets temporarily cut off).
Cheers,
Ben.
On Sun, 2007-09-16 at 15:13 +0400, Ivan Kokshaysky wrote:
> On Fri, Sep 14, 2007 at 05:53:58PM -0600, Robert Hancock wrote:
> > In the first one, Linus talks about a USB controller whose SMM code
> > chokes on the BAR being disabled. That explanation seems odd to me. If
> > it chokes on the BAR disabled, how doesn't it choke on the BAR being
> > moved to a different location, which is unavoidable during probing?
>
> Here is an article with detailed explanation of that USB issue:
>
> http://support.microsoft.com/kb/250635
>
> The most interesting fact there is that windows *does not* clear
> PCI_COMMAND_MEMORY bits during PCI bus enumeration, so BIOS writers
> have to rely on that. Yet another reason why your patch is unsafe.
>
> > Also, I think we do USB handoff from the BIOS much earlier than we used
> > to, which likely prevents these issues.
>
> I don't think so. This can only be done in USB driver, which cannot run
> before PCI device discovery.
Most BIOS has an option called legacy emulation, which will emulate USB
mouse as legacy mouse. BIOS is using USB before driver is loaded.
If moving BAR work, disabling BAR should work too.
> > In the second one, he mentions that "So the secondary rule to "don't
> > turnoff MEM or IO accesses" is "never allocate real PCI BAR resources at
> > the top of memory". Well, unfortunately, the second one isn't possible
> > to meet given that we have boards with the MMCONFIG up there, so
> > disabling the decode is the only thing we can do. That whole discussion
> > was also based on the fact that it wasn't necessary to solve problems.
> > This is no longer a theoretical problem. We now have real problems that
> > we need this in order to solve.
>
> I have suggested a *practical* solution that disables the decode only
> when it's unavoidable. If it's not sufficient for some machines, it
> can be extended quite easily.
If you just want to workaroud the issue at hand, I'd take the method to
not use mmcfg at probe stage.
> > > No, it's impossible for several reasons. Most obvious one is that a PCI-E
> > > bridge does isolate stuff quite nicely.
> >
> > It's not impossible at all. In fact I'm quite sure (Jesse can confirm)
> > that in the case of the board he was using, it was an add-in graphics
> > card where he saw this problem.
>
> Again, it's impossible with AGP or PCI-E cards, which are always
> separated
> from the root bus by AGP or PCI-E bridge. The bridge does decode in
> the
> first place, so when you write the BAR with all ones, you move it
> outside
> the bridge decoding window. Address collision isn't possible in this
> case in principle.
I can confirm this is an add-in graphics card. the bfd is 00:02.0, so
it's not behind any AGP/PCI-E bridge.
Thanks,
Shaohua
On Sun, Sep 16, 2007 at 11:34:35AM -0600, Robert Hancock wrote:
> > The most interesting fact there is that windows *does not* clear
> > PCI_COMMAND_MEMORY bits during PCI bus enumeration, so BIOS writers
> > have to rely on that. Yet another reason why your patch is unsafe.
>
> Windows 98 doesn't, it says. That doesn't say anything about what newer
> versions of Windows are doing (like Vista, which is the first to
> actually use MMCONFIG).
But we're not going to drop support for hardware "designed for Windows 98",
aren't we?
> Check the code. We have a quirk to handle this, in
> drivers/usb/host/pci-quirks.c
It doesn't help. It's a "final" fixup, so it runs much later than BAR sizing.
Ivan.
On Sun, Sep 16, 2007 at 01:52:59PM -0600, Matthew Wilcox wrote:
> I don't think it would be that complicated. We could just delay probing
> for mmconfig until after the pci bus probes are done. No changes to
> other architectures needed.
Indeed. Seems like it's a way to go.
> I'm really disappointed with mmconfig. Here was a great chance to get
> rid of all the sucky performance problems of the past, and BIOS writers
> and chipset designers have fucked up the implementation so much that
> it's now the ugliest method for getting to pci config space. And it's
> the *only* method of getting to extended config space.
Same feeling...
Fortunately, it seems that we don't access the extended config space
during bus probe.
Ivan.
On Sun, Sep 16, 2007 at 10:01:52PM +0200, Benjamin Herrenschmidt wrote:
> Agreed. I have a similar problem on ppc where it's common to have things
> like the main PIC on a PCI device. Note that another problem is (or at
> least was, i haven't checked recently) the P2P bridge scanning code
> that, in a similar way, can block the path to all devices below it. I
> -do- have a case for example with Apple Xserve G4's where the main Apple
> IO ASIC, which is a PCI device containing the PIC, the power management
> controller, and various low level system control IOs is behind a pair of
> P2P bridges.
I think the P2P probing code is pretty safe now - there are read-only
accesses to the bridge config, unless you request to reassign the bus
numbers. Though it won't be safe anymore with the patch in question.
> One solution for us (PPC) is to enforce those devices and bridges to be
> described in the OF tree, and generalize a bit the code we have for some
> 64 bits machines, that synthetizes the pci_dev's from the OF nodes
> rather than probing. But that's not going to help other archs.
If you can get reliable PCI info from firmware, it should be relatively easy
to avoid at least a bar sizing. You can install an "early" fixup for
PCI_ANY_ID and fill the resource fields of the pci_dev with values obtained
from firmware. Then all we need in probe.c is just to check that the resource
is already non-zero and skip the sizing of respective BAR, if so.
> In fact, that's a problem we also have with
> pci_assign_unassigned_resources() which will happily move things around
> that must not be moved, especially when sitting behind P2P bridges.
It's not supposed to do that. Certainly, there were problems of that sort,
but hopefully they are in the past.
Ivan.
Ivan Kokshaysky wrote:
> On Sun, Sep 16, 2007 at 01:52:59PM -0600, Matthew Wilcox wrote:
>> I don't think it would be that complicated. We could just delay probing
>> for mmconfig until after the pci bus probes are done. No changes to
>> other architectures needed.
>
> Indeed. Seems like it's a way to go.
With another patch pending for 2.6.24, to validate the MMCONFIG aperture
against the ACPI motherboard resources instead of the E820 table, we
have to delay switching to MMCONFIG until the ACPI interpreter is
available (unless no other configuration mechanism is available). It
might thus be feasible to delay it further.
Nevertheless, I think this is a rather ugly workaround for the
underlying problem.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
On Mon, 2007-09-17 at 14:22 +0400, Ivan Kokshaysky wrote:
> On Sun, Sep 16, 2007 at 10:01:52PM +0200, Benjamin Herrenschmidt wrote:
> > Agreed. I have a similar problem on ppc where it's common to have things
> > like the main PIC on a PCI device. Note that another problem is (or at
> > least was, i haven't checked recently) the P2P bridge scanning code
> > that, in a similar way, can block the path to all devices below it. I
> > -do- have a case for example with Apple Xserve G4's where the main Apple
> > IO ASIC, which is a PCI device containing the PIC, the power management
> > controller, and various low level system control IOs is behind a pair of
> > P2P bridges.
>
> I think the P2P probing code is pretty safe now - there are read-only
> accesses to the bridge config, unless you request to reassign the bus
> numbers. Though it won't be safe anymore with the patch in question.
In which case I will need to NAK the patch... Note that those Xserve
G4's still have the subtle issue that they -also- reassign bus
numbers :-) But that's going away the day I finally enable domains
support for ppc32 (it's been off for now due to problems with X)
> > One solution for us (PPC) is to enforce those devices and bridges to be
> > described in the OF tree, and generalize a bit the code we have for some
> > 64 bits machines, that synthetizes the pci_dev's from the OF nodes
> > rather than probing. But that's not going to help other archs.
>
> If you can get reliable PCI info from firmware, it should be relatively easy
> to avoid at least a bar sizing. You can install an "early" fixup for
> PCI_ANY_ID and fill the resource fields of the pci_dev with values obtained
> from firmware. Then all we need in probe.c is just to check that the resource
> is already non-zero and skip the sizing of respective BAR, if so.
Right now, we have code to completely build a pci_dev from the firmware
infos. We only use it on 64 bits pseries currently though.
> > In fact, that's a problem we also have with
> > pci_assign_unassigned_resources() which will happily move things around
> > that must not be moved, especially when sitting behind P2P bridges.
>
> It's not supposed to do that. Certainly, there were problems of that sort,
> but hopefully they are in the past.
At this stage (but we are getting a bit OT), ppc has something like 3
different PCI code implementations :-) I do have some plans to fix that
by switching everybody to use pci_assign_unassigned_resources() and
friends but last time I tried, everything blew up :-) I suspect I'll
need a quirk or two in the generic code, but I'll let you know when I
get to it.
Cheers,
Ben.
On Mon, Sep 17, 2007 at 09:21:47AM +0800, Shaohua Li wrote:
> I can confirm this is an add-in graphics card. the bfd is 00:02.0, so
> it's not behind any AGP/PCI-E bridge.
AFAIKS, 00:02.0 is *integrated* controller. Can you check that "graphic
adapter priority" setting in BIOS is "PCI Express" and not "Internal VGA"?
In the latter case an add-on card might be turned completely off, so
it doesn't even show up in lspci output.
Ivan.
On Tue, Sep 18, 2007 at 06:30:23AM +1000, Benjamin Herrenschmidt wrote:
> At this stage (but we are getting a bit OT), ppc has something like 3
> different PCI code implementations :-) I do have some plans to fix that
> by switching everybody to use pci_assign_unassigned_resources() and
> friends but last time I tried, everything blew up :-) I suspect I'll
> need a quirk or two in the generic code, but I'll let you know when I
> get to it.
Ok, I'll be happy to look into that :-)
Ivan.
On Tuesday, September 18, 2007 2:53 am Ivan Kokshaysky wrote:
> On Mon, Sep 17, 2007 at 09:21:47AM +0800, Shaohua Li wrote:
> > I can confirm this is an add-in graphics card. the bfd is 00:02.0,
> > so it's not behind any AGP/PCI-E bridge.
>
> AFAIKS, 00:02.0 is *integrated* controller. Can you check that
> "graphic adapter priority" setting in BIOS is "PCI Express" and not
> "Internal VGA"? In the latter case an add-on card might be turned
> completely off, so it doesn't even show up in lspci output.
Yeah, it's integrated gfx. See the PRM at intel.com for the decode
rules.
I've been following this thread and I see a lot of fear about moving to
probing BARs as outlined in the PCI spec. I haven't seen much in the
way of hard evidence though, mostly just handwaving or red herrings
(and even one comment implying that -mm wasn't a real testbed for
patches) that don't have much to do with the actual "disable, size,
re-enable" logic. Has a conclusion been reached yet?
Keep in mind that any failures that occur due to this patch should be
easy to track down (boot hang), but we have yet to see any in real
life. Moreover, reversion is trivial, and we could move to a more
complex scheme at that time if needed, but why bother unless we're sure
we need to?
Thanks,
Jesse
Benjamin Herrenschmidt wrote:
> On Mon, 2007-09-17 at 14:22 +0400, Ivan Kokshaysky wrote:
>> On Sun, Sep 16, 2007 at 10:01:52PM +0200, Benjamin Herrenschmidt wrote:
>>> Agreed. I have a similar problem on ppc where it's common to have things
>>> like the main PIC on a PCI device. Note that another problem is (or at
>>> least was, i haven't checked recently) the P2P bridge scanning code
>>> that, in a similar way, can block the path to all devices below it. I
>>> -do- have a case for example with Apple Xserve G4's where the main Apple
>>> IO ASIC, which is a PCI device containing the PIC, the power management
>>> controller, and various low level system control IOs is behind a pair of
>>> P2P bridges.
>> I think the P2P probing code is pretty safe now - there are read-only
>> accesses to the bridge config, unless you request to reassign the bus
>> numbers. Though it won't be safe anymore with the patch in question.
>
> In which case I will need to NAK the patch... Note that those Xserve
> G4's still have the subtle issue that they -also- reassign bus
> numbers :-) But that's going away the day I finally enable domains
> support for ppc32 (it's been off for now due to problems with X)
How is this a change in behavior as far as this device is concerned? If
we are doing BAR sizing and moving the base address around, it's going
to cause problems if you try to access the device during this time
whether we disable decode or not.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
> How is this a change in behavior as far as this device is concerned? If
> we are doing BAR sizing and moving the base address around, it's going
> to cause problems if you try to access the device during this time
> whether we disable decode or not.
True. The window is smaller tho if the upper bridge decode hasn't been
disabled. I suppose I'll have to find a way to "pin" those devices and
generate the BAR info from the firmware data.
Ben.
On Thu, Sep 27, 2007 at 10:40:33AM +1000, Benjamin Herrenschmidt wrote:
>
> > How is this a change in behavior as far as this device is concerned? If
> > we are doing BAR sizing and moving the base address around, it's going
> > to cause problems if you try to access the device during this time
> > whether we disable decode or not.
>
> True. The window is smaller tho if the upper bridge decode hasn't been
> disabled. I suppose I'll have to find a way to "pin" those devices and
> generate the BAR info from the firmware data.
The upper bridge decode only gets disabled while we size the upper
bridge's BARs. Then we reenable it.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."