Hi !
There is something dodgy going on in pci_read_bases().
In addition to the fact (or related to it) that we tend to treat
res->start == 0 as meaning "unassigned" (which at this stage is mostly
incorrect, but let's say I can cope), however, there is some bit of code
that I think is just plain wrong:
reg = PCI_BASE_ADDRESS_0 + (pos << 2);
pci_read_config_dword(dev, reg, &l);
pci_write_config_dword(dev, reg, ~0);
pci_read_config_dword(dev, reg, &sz);
pci_write_config_dword(dev, reg, l);
if (!sz || sz == 0xffffffff)
continue;
if (l == 0xffffffff)
l = 0;
So here, we read the BAR value, which at this stage could either be some
assigned address or some ff's from a freshly unassigned BAR. _however_
that test against 0xffffffff looks totally bogus to me.
If we read that value, we write 0 in l. Which means that if we read some
unassigned resource that had the address space bits set to IO, we
overwrite this with a bit encoding that means Memory, and further along,
start sizing & treating this as a memory resource.
In fact, a BAR value should always contain a 0 bit. A Memory BAR must
contain a 0 at the bottom, and an IO BAR should contain a 0 in bit 1.
And we do check that case fine when reading back sz.
Thus a l value of 0xffffffff should not happen in practice, and if it
does, we should -at-least- try to get the address space bits from sz
(since in this case sz looks allright), not from l, no ? Or maybe just
skip the whole resource ?
Do we have practical cases where we see that 0xffffffff value ?
This isn't a problem I'm encountering in practice. I just noticed that
while trying to audit what it would take to fix the code to stop using
res->start = 0 as a way to mark unassigned resources (unrelated), so it
doesn't -need- to be fixed per-se, but I'm curious where that came from
in the first place.
Cheers,
Ben.
On Fri, Feb 29, 2008 at 01:37:47PM +1100, Benjamin Herrenschmidt wrote:
> Hi !
>
> There is something dodgy going on in pci_read_bases().
...
> if (l == 0xffffffff)
> l = 0;
...
> Thus a l value of 0xffffffff should not happen in practice, and if it
> does, we should -at-least- try to get the address space bits from sz
> (since in this case sz looks allright), not from l, no ? Or maybe just
> skip the whole resource ?
I agree this code looks wrong.
I used "the google" to track this down and at least got a bit
closer to when this was added: 2.3.15 it seems:
http://www.linuxhq.com/kernel/v2.3/15/drivers/pci/pci.c
--- v2.3.14/linux/drivers/pci/pci.c Thu Aug 12 11:50:14 1999
+++ linux/drivers/pci/pci.c Mon Aug 23 13:47:35 1999
It doesn't explain why but I suspect knowing the timeframe should
make the search a bit easier.
I have to confess. This is right around the time I got involved
with the linux kernel developement and specifically the parisc-linux.org
port. I was rewriting Alan Cox's first cut of Dino PCI Host-bus controller
"driver" (IRQ and PCI bus support for Dino chip).
Hrm...found an earlier reference to similar code:
http://www.srcdoc.com/linux_2.2.26/drivers_2pci_2pci_8c-source.html
...
00136 for(reg=0; reg<howmany; reg++) {
00137 pci_read_config_dword(dev, PCI_BASE_ADDRESS_0 + (reg << 2), &l);
00138 if (l == 0xffffffff)
00139 continue;
00140 dev->base_address[reg] = l;
...
This is a check to avoid mucking with 64-bit BARs.
But a bit later where pci_read_bases is called from:
00225 pci_read_bases(dev, 6);
00226 pcibios_read_config_dword(bus->number, devfn, PCI_ROM_ADDRESS, &l);
00227 dev->rom_address = (l == 0xffffffff) ? 0 : l;
00228 break;
The Expansion ROM BAR was clearly treated differently and I don't know why.
...
> Do we have practical cases where we see that 0xffffffff value ?
I can think of two cases this _might_ (but shouldn't) happen.
1) We probe the upper 32-bits of a 64-bit BAR and it already
contains 0xffffffff. This would be a bug in the probing IMHO.
2) PCI device ceases to talk to PCI Host and we get a PCI master abort.
I expect ~0 to be returned by HW in this case.
We need to skip this device and/or restart the probing
of this device (and possible others in the same PCI segment.)
hth,
grant
On Thursday, February 28, 2008 6:37 pm Benjamin Herrenschmidt wrote:
> Hi !
>
> There is something dodgy going on in pci_read_bases().
>
> In addition to the fact (or related to it) that we tend to treat
> res->start == 0 as meaning "unassigned" (which at this stage is mostly
> incorrect, but let's say I can cope),
Yeah, that sounds like trouble. I expect this may become a problem even for
PC architectures in the future (IOMMUs on multiple busses for example), so
maybe it makes sense to change the core to not assume 0 == unassigned?
> > however, there is some bit of code
> that I think is just plain wrong:
>
> reg = PCI_BASE_ADDRESS_0 + (pos << 2);
> pci_read_config_dword(dev, reg, &l);
> pci_write_config_dword(dev, reg, ~0);
> pci_read_config_dword(dev, reg, &sz);
> pci_write_config_dword(dev, reg, l);
> if (!sz || sz == 0xffffffff)
> continue;
> if (l == 0xffffffff)
> l = 0;
>
> So here, we read the BAR value, which at this stage could either be some
> assigned address or some ff's from a freshly unassigned BAR. _however_
> that test against 0xffffffff looks totally bogus to me.
>
> If we read that value, we write 0 in l. Which means that if we read some
> unassigned resource that had the address space bits set to IO, we
> overwrite this with a bit encoding that means Memory, and further along,
> start sizing & treating this as a memory resource.
But we should never hit the l == 0xffffffff case if it's a memory BAR, since
the low bit will be 0, right?
I think Grant is probably right that the only time we'd hit this is if the bus
is down and we're getting back all 1s for every read (though that's a PC
specific assumption).
> This isn't a problem I'm encountering in practice. I just noticed that
> while trying to audit what it would take to fix the code to stop using
> res->start = 0 as a way to mark unassigned resources (unrelated), so it
> doesn't -need- to be fixed per-se, but I'm curious where that came from
> in the first place.
Yeah, it seems like it could be safely removed, but this is an area where we
should probably move slowly (i.e. one, very small change to the probing code
at a time) or it may be difficult to isolate any bugs in the wild.
Jesse
On Mon, 2008-03-03 at 11:12 -0800, Jesse Barnes wrote:
> On Thursday, February 28, 2008 6:37 pm Benjamin Herrenschmidt wrote:
> > Hi !
> >
> > There is something dodgy going on in pci_read_bases().
> >
> > In addition to the fact (or related to it) that we tend to treat
> > res->start == 0 as meaning "unassigned" (which at this stage is mostly
> > incorrect, but let's say I can cope),
>
> Yeah, that sounds like trouble. I expect this may become a problem even for
> PC architectures in the future (IOMMUs on multiple busses for example), so
> maybe it makes sense to change the core to not assume 0 == unassigned?
Not only iommu's. On lots of platforms, we have the PCI MMIO space
mapped 1:N, that is, an access at MMIO location N turns into a PCI cycle
with address 0. So outbound PCI is effectively remapped, thus allowing
access to things like 0 etc... without punching holes in RAM and without
other dirty x86-like tricks.
The main question regarding 0 in a BAR is do we know of devices that
consider it as "disabled" and don't operate/decode 0 ? (That would be
gross and out of spec but who knows...)
If not, then there is no notion of "unassigned" at all... Anything is
"assigned", it's just a matter of whether the value we get at boot time
overlaps with something else or not, or is within the range of the host
bridge or not.
Unfortunately, while pretty much all architectures have a clear idea of
what is decoded by a given bridge, and what overlap with others and what
not, x86 doesn't :-)
But that can be solved, using something akin to IORESOURCE_UNSET, like
we do on powerpc, generalizing it, and having an x86-specific quirk set
it on anything that comes up with resource->start == 0.
> But we should never hit the l == 0xffffffff case if it's a memory BAR, since
> the low bit will be 0, right?
And for an IO BAR, there's another bit that should be 0, so that's
simply an impossible value for a BAR.
> I think Grant is probably right that the only time we'd hit this is if the bus
> is down and we're getting back all 1s for every read (though that's a PC
> specific assumption).
But the next two write/reads for sizing would have worked ? Unlikely...
I wonder if we can just remove that test completely :-)
> Yeah, it seems like it could be safely removed, but this is an area where we
> should probably move slowly (i.e. one, very small change to the probing code
> at a time) or it may be difficult to isolate any bugs in the wild.
True.
Cheers,
Ben.
On Monday, March 03, 2008 12:57 pm Jesse Barnes wrote:
> > Not only iommu's. On lots of platforms, we have the PCI MMIO space
> > mapped 1:N, that is, an access at MMIO location N turns into a PCI cycle
> > with address 0. So outbound PCI is effectively remapped, thus allowing
> > access to things like 0 etc... without punching holes in RAM and without
> > other dirty x86-like tricks.
>
> Right, bus address space is really a separate entity; I was just using
> IOMMUs as an example of that...
Err, wasn't really clear here. CPU->PCI space and PCI->host space are really
separate entities... IOMMUs will definitely affect the PCI->host mapping, and
CPU->PCI mappings can be arbitrarily complex as well, though as you say on
PCs it's typically just flatly mapped into CPU physical space somewhere.
Jesse
On Monday, March 03, 2008 12:42 pm Benjamin Herrenschmidt wrote:
> On Mon, 2008-03-03 at 11:12 -0800, Jesse Barnes wrote:
> > On Thursday, February 28, 2008 6:37 pm Benjamin Herrenschmidt wrote:
> > > Hi !
> > >
> > > There is something dodgy going on in pci_read_bases().
> > >
> > > In addition to the fact (or related to it) that we tend to treat
> > > res->start == 0 as meaning "unassigned" (which at this stage is mostly
> > > incorrect, but let's say I can cope),
> >
> > Yeah, that sounds like trouble. I expect this may become a problem even
> > for PC architectures in the future (IOMMUs on multiple busses for
> > example), so maybe it makes sense to change the core to not assume 0 ==
> > unassigned?
>
> Not only iommu's. On lots of platforms, we have the PCI MMIO space
> mapped 1:N, that is, an access at MMIO location N turns into a PCI cycle
> with address 0. So outbound PCI is effectively remapped, thus allowing
> access to things like 0 etc... without punching holes in RAM and without
> other dirty x86-like tricks.
Right, bus address space is really a separate entity; I was just using IOMMUs
as an example of that...
> The main question regarding 0 in a BAR is do we know of devices that
> consider it as "disabled" and don't operate/decode 0 ? (That would be
> gross and out of spec but who knows...)
I don't know of any offhand (yeah does sound gross).
> If not, then there is no notion of "unassigned" at all... Anything is
> "assigned", it's just a matter of whether the value we get at boot time
> overlaps with something else or not, or is within the range of the host
> bridge or not.
>
> Unfortunately, while pretty much all architectures have a clear idea of
> what is decoded by a given bridge, and what overlap with others and what
> not, x86 doesn't :-)
Yep, but that's just an artifact of the way PCs are typically laid out.
There's nothing preventing certain platforms from doing things differently,
either with special bridges or just funky address space layout.
> But that can be solved, using something akin to IORESOURCE_UNSET, like
> we do on powerpc, generalizing it, and having an x86-specific quirk set
> it on anything that comes up with resource->start == 0.
Yeah, that would be an improvement...
>
> > But we should never hit the l == 0xffffffff case if it's a memory BAR,
> > since the low bit will be 0, right?
>
> And for an IO BAR, there's another bit that should be 0, so that's
> simply an impossible value for a BAR.
Right.
> > I think Grant is probably right that the only time we'd hit this is if
> > the bus is down and we're getting back all 1s for every read (though
> > that's a PC specific assumption).
>
> But the next two write/reads for sizing would have worked ? Unlikely...
> I wonder if we can just remove that test completely :-)
>
> > Yeah, it seems like it could be safely removed, but this is an area where
> > we should probably move slowly (i.e. one, very small change to the
> > probing code at a time) or it may be difficult to isolate any bugs in the
> > wild.
>
> True.
I think we can just kill that test...
Jesse
> The main question regarding 0 in a BAR is do we know of devices that
> consider it as "disabled" and don't operate/decode 0 ? (That would be
> gross and out of spec but who knows...)
In the PC world it never gets tested so it might be wise to be cautious.
> But that can be solved, using something akin to IORESOURCE_UNSET, like
> we do on powerpc, generalizing it, and having an x86-specific quirk set
> it on anything that comes up with resource->start == 0.
And add pci_resource_unassigned() to wrap it nicely. That would
definitely be a good thing as we have confusion right now and it has
turned up in the real world.
Alan