2002-12-19 21:29:36

by Grant Grundler

[permalink] [raw]
Subject: PATCH 2.5.x disable BAR when sizing


Martin,
In April 2002, [email protected] sent a 2.4.x patch to disable
BARs while the BARs were being sized. I've "forward ported" this patch
to 2.5.x (appended). turukawa's excellent problem description and
original posting are here:
https://lists.linuxia64.org/archives//linux-ia64/2002-April/003302.html

David Mosberger agrees this is an "obvious fix".
We've been using this in the ia64 2.4 code stream since about August.

thanks,
grant


Index: drivers/pci/probe.c
===================================================================
RCS file: /var/cvs/linux-2.5/drivers/pci/probe.c,v
retrieving revision 1.2
diff -u -p -r1.2 probe.c
--- drivers/pci/probe.c 9 Oct 2002 20:42:57 -0000 1.2
+++ drivers/pci/probe.c 17 Dec 2002 21:53:14 -0000
@@ -48,6 +48,12 @@ static void pci_read_bases(struct pci_de
unsigned int pos, reg, next;
u32 l, sz;
struct resource *res;
+ u16 cmd;
+
+ /* Disable I/O & memory decoding while we size the BARs. */
+ pci_read_config_word(dev, PCI_COMMAND, &cmd);
+ pci_write_config_word(dev, PCI_COMMAND,
+ cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY));

for(pos=0; pos<howmany; pos = next) {
next = pos+1;
@@ -114,6 +120,8 @@ static void pci_read_bases(struct pci_de
}
res->name = dev->name;
}
+
+ pci_write_config_word(dev, PCI_COMMAND, cmd);
}

void __devinit pci_read_bridge_bases(struct pci_bus *child)


2002-12-20 02:07:44

by Alan

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Thu, 2002-12-19 at 21:37, Grant Grundler wrote:
>
> Martin,
> In April 2002, [email protected] sent a 2.4.x patch to disable
> BARs while the BARs were being sized. I've "forward ported" this patch
> to 2.5.x (appended). turukawa's excellent problem description and
> original posting are here:
> https://lists.linuxia64.org/archives//linux-ia64/2002-April/003302.html
>
> David Mosberger agrees this is an "obvious fix".
> We've been using this in the ia64 2.4 code stream since about August.

We've rejected this twice already from different people.

Nothing says your memory can't be behind the bridge and you just turned
memory access off. Whoops bang, game over.

And yes this happens on some PC class systems.

Alan

2002-12-20 02:19:30

by David Mosberger

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

>>>>> On 20 Dec 2002 02:54:28 +0000, Alan Cox <[email protected]> said:

Alan> On Thu, 2002-12-19 at 21:37, Grant Grundler wrote:
>> Martin, In April 2002, [email protected] sent a 2.4.x
>> patch to disable BARs while the BARs were being sized. I've
>> "forward ported" this patch to 2.5.x (appended). turukawa's
>> excellent problem description and original posting are here:
>> https://lists.linuxia64.org/archives//linux-ia64/2002-April/003302.html
>>
>> David Mosberger agrees this is an "obvious fix". We've been
>> using this in the ia64 2.4 code stream since about August.

Alan> We've rejected this twice already from different people.

Alan> Nothing says your memory can't be behind the bridge and you
Alan> just turned memory access off. Whoops bang, game over.

Alan> And yes this happens on some PC class systems.

And yet it's OK to remap that memory? That seems unlikely.

--david

2002-12-20 05:51:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

In article <[email protected]>,
Grant Grundler <[email protected]> wrote:
>
>In April 2002, [email protected] sent a 2.4.x patch to disable
>BARs while the BARs were being sized. I've "forward ported" this patch
>to 2.5.x (appended). turukawa's excellent problem description and
>original posting are here:
> https://lists.linuxia64.org/archives//linux-ia64/2002-April/003302.html
>
>David Mosberger agrees this is an "obvious fix".

It is NOT an "obvious fix".

It breaks stuff horribly. When you turn off the MEM bit on the
northbridge, there are northbridges that will stop forwarding RAM<->PCI.

>We've been using this in the ia64 2.4 code stream since about August.

And it's CRAP.

DO NOT DO THIS. It locks up some machines at bootup. Hard. Total bus
lockup if you have legacy USB enabled (or anything else that does DMA,
for that matter) at the same time as probing the northbridge with this.

Trust me. If you have some new silly ia64-specific bug, the fix is
_not_ to break real and existing hardware out there.

We've had this "obviously correct" patch floating around several times,
and it even made it into the kernel at least once. It was reverted
because it is WRONG.

Linus

2002-12-20 05:59:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

In article <[email protected]>,
David Mosberger <[email protected]> wrote:
> Alan> And yes this happens on some PC class systems.
>
>And yet it's OK to remap that memory? That seems unlikely.

Alan is right, however "unlikely" you think it is.

Turning off stuff in the config register not only turns off the standard
BAR's, it can turn off _everything_. Including stuff that isn't even
covered by the standard BARs that are beng probed.

It's like shutting off the power for the whole house because you want to
change a lightbulb. Sure, it's safer for the lightbulb, but if you
don't know what -else- needs power in the house, it sure as hell isn't
a good idea. Maybe you just trashed your wifes work because she happened
to be in front of the computer when you turned off the lights.

Linus

2002-12-20 09:34:32

by David Mosberger

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

>>>>> On Fri, 20 Dec 2002 05:57:23 +0000 (UTC), [email protected] (Linus Torvalds) said:

Linus> DO NOT DO THIS. It locks up some machines at
Linus> bootup. Hard. Total bus lockup if you have legacy USB enabled
Linus> (or anything else that does DMA, for that matter) at the same
Linus> time as probing the northbridge with this.

Linus> Trust me. If you have some new silly ia64-specific bug, the
Linus> fix is _not_ to break real and existing hardware out there.

Could you please stop this ia64 paranoia and instead explain to me why
it's OK to relocate a PCI device to (0x100000000-PCI_dev_size)
temporarily? That just seems horribly unsafe to me. The PCI spec
seems to say the same as it says pretty clearly that memory decoding
should be disabled during BAR-sizing. If certain bridges cause
problems, perhaps those need to be special-cased?

--david

2002-12-20 12:45:37

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Fri, Dec 20, 2002 at 01:42:33AM -0800, David Mosberger wrote:
> If certain bridges cause
> problems, perhaps those need to be special-cased?

Couple of days ago I mindlessly suggested exactly that, but I
take it back.
Not only *all* classes of bridges would be special-cased:
turning off IO and MEM in the PCI command register disables the legacy I/O
ports and memory on some VGAs. Guess what happens if someone decides
to printk in the meantime.

Ivan.

2002-12-20 16:56:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing




On Fri, 20 Dec 2002, David Mosberger wrote:
>
> Could you please stop this ia64 paranoia and instead explain to me why
> it's OK to relocate a PCI device to (0x100000000-PCI_dev_size)
> temporarily? That just seems horribly unsafe to me.

No. It's not horribly unsafe at all. It's very safe, for one simple
reason: it's how PCI probing has _always_ been done. Exactly because the
alternatives simply do not work.

I can also tell you why it does work, and why it's supposed to work: by
writing 0xffffffff to the BAR register, you basically move the BAR to high
PCI memory - even if it was enabled before. Which is fine, as long as
nobody else is in that high memory. So the secondary rule to "don't turn
off MEM or IO accesses" is "never allocate real PCI BAR resources at the
top of memory".

Think about it: if you move the BAR to high memory, you basically disable
only _that_ bar, and nothing else. You don't clobber any other associated
functions, or anything like that. It's clearly a _less_ disruptive thing
than disabling access to the whole device.

Anyway, why do you really want to add the MEM/IO disable? Do you actually
have a device that wants it, or are you just blinded by documentation
written by somebody who had no idea about what real life is all about?

> The PCI spec
> seems to say the same as it says pretty clearly that memory decoding
> should be disabled during BAR-sizing. If certain bridges cause
> problems, perhaps those need to be special-cased?

No. Do it the other way around: if there is some specific chipset that
actually _needs_ the disable, you do THAT special cased.

Because the current code works on everything that we know about, and we
_know_ that the disable doesn't work. As Ivan already pointed out, it's
not just bridges. When you disable IO/MEM accesses, you often disable
_everything_, which can break pretty much anything.

(I say "often", because it does actually depend on the chipset. Some chips
only seem to disable the BAR entries. Others disable all the "extended"
ports too, and leave only config space accessible after you've disable
IO/MEM)

The cases I've seen are northbridges that stop forwarding DMA, and USB
controllers that are still in legacy mode (because the BAR probing happens
before the USB driver has had a chance to _change_ it), where disabling IO
and/or MEM will cause the SMM code that does the legacy handling to just
lock up, since suddenly the hardware they expect to be there doesn't
respond any more.

Ivan pointed out that it also disables things like VGA legacy registers.

It will disable IDE legacy registers too, btw. I'd also expect it to
disable IDE DMA access, so if you happen to be trying to probe the BAR's
after somebody started IO on the IDE, you just made that IO fail
spectacularly, and I'd not be surprised if the IDE controller just locked
up as a result.

Let me re-iterate the "turn power off at the master switch in a house when
switching a light bulb" analogy. Yes, it's a good idea if you are nervous,
but you do that only when you _know_ who is in the house and you know what
they are doing and it's ok by them.

For example, it would be fine for a low-level driver (who has already
taken control of the device) to turn the device off. But it is NOT fine to
do it in general.

One solution in the long term may be to not even probe the BAR's at all in
generic code, and only do it in the pci_enable_dev() stuff. That way it
would literally only be done by the driver, who can hopefully make sure
that the device is ok with it.

Linus

2002-12-20 18:18:29

by David Mosberger

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

>>>>> On Fri, 20 Dec 2002 09:05:53 -0800 (PST), Linus Torvalds <[email protected]> said:

Linus> On Fri, 20 Dec 2002, David Mosberger wrote:
>> Could you please stop this ia64 paranoia and instead explain to
>> me why it's OK to relocate a PCI device to
>> (0x100000000-PCI_dev_size) temporarily? That just seems horribly
>> unsafe to me.

Linus> No. It's not horribly unsafe at all. It's very safe, for one
Linus> simple reason: it's how PCI probing has _always_ been
Linus> done. Exactly because the alternatives simply do not work.

Linus> I can also tell you why it does work, and why it's supposed
Linus> to work: by writing 0xffffffff to the BAR register, you
Linus> basically move the BAR to high PCI memory - even if it was
Linus> enabled before. Which is fine, as long as nobody else is in
Linus> that high memory. So the secondary rule to "don't turn off
Linus> MEM or IO accesses" is "never allocate real PCI BAR resources
Linus> at the top of memory".

A 32-bit PCI device can decode up to 2GB of memory. I doubt any PC
firmware is reserving the top 2GB for BAR-sizing. Without a
reasonably small a-priori limit on the address window size of device,
this method isn't safe.

Linus> Let me re-iterate the "turn power off at the master switch in
Linus> a house when switching a light bulb" analogy. Yes, it's a
Linus> good idea if you are nervous, but you do that only when you
Linus> _know_ who is in the house and you know what they are doing
Linus> and it's ok by them.

That's a poor analogy. What you're suggesting is more like replacing
the light bulb while it's powered on. Yes, it can be done, but people
do get burned and electrocuted that way.

Linus> One solution in the long term may be to not even probe the
Linus> BAR's at all in generic code, and only do it in the
Linus> pci_enable_dev() stuff. That way it would literally only be
Linus> done by the driver, who can hopefully make sure that the
Linus> device is ok with it.

Yes, I see now that the method in the PCI spec isn't really safe
either, so BAR-sizing probably shouldn't be done in generic code at
all.

--david

2002-12-20 18:42:59

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Fri, Dec 20, 2002 at 09:05:53AM -0800, Linus Torvalds wrote:
> One solution in the long term may be to not even probe the BAR's at all in
> generic code, and only do it in the pci_enable_dev() stuff. That way it
> would literally only be done by the driver, who can hopefully make sure
> that the device is ok with it.

I don't think that generic BAR probing is ever avoidable - too often
it's the only way to build a consistent resource tree. Without that
the driver cannot know whether the BAR setting is safe or there is a
conflict with something else.
Anyway, in the short term we could give the architecture ability to use its
own probing code, something like this:

In include/linux/pci.h:

#include <asm/pci.h>

+#ifndef HAVE_ARCH_PCI_BAR_PROBE
+#define pcibios_read_bases(dev, n, rom) 0
+#endif

In drivers/pci/probe.c:

static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
{
unsigned int pos, reg, next;
u32 l, sz;
struct resource *res;

+ if (pcibios_read_bases(dev, howmany, rom))
+ return;
+
for(pos=0; pos<howmany; pos = next) {
...

Ivan.

2002-12-20 19:03:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing



On Fri, 20 Dec 2002, Ivan Kokshaysky wrote:
>
> On Fri, Dec 20, 2002 at 09:05:53AM -0800, Linus Torvalds wrote:
> > One solution in the long term may be to not even probe the BAR's at all in
> > generic code, and only do it in the pci_enable_dev() stuff. That way it
> > would literally only be done by the driver, who can hopefully make sure
> > that the device is ok with it.
>
> I don't think that generic BAR probing is ever avoidable - too often
> it's the only way to build a consistent resource tree. Without that
> the driver cannot know whether the BAR setting is safe or there is a
> conflict with something else.

Well, generic BAR probing certainly isn't avoidable right now, that's for
sure. But we might make it less common, by doing it only "on demand", and
having default drivers for pci-pci bus bridges, for example (where the
default driver would do what we do now, but perhaps a way to register
specific bus drivers that know more about their specifics).

I don't think there are any real reasons to change what we do now - it
sounds like even David doesn't actually have any devices that actually
_need_ the disable as-is.

Linus

2002-12-20 19:19:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing



On Fri, 20 Dec 2002, David Mosberger wrote:
>
> Linus> One solution in the long term may be to not even probe the
> Linus> BAR's at all in generic code, and only do it in the
> Linus> pci_enable_dev() stuff. That way it would literally only be
> Linus> done by the driver, who can hopefully make sure that the
> Linus> device is ok with it.
>
> Yes, I see now that the method in the PCI spec isn't really safe
> either, so BAR-sizing probably shouldn't be done in generic code at
> all.

Note that by "long-term" I definitely mean "past 2.6", and thus if there
are some specific cases where you have trouble, those will need to just
have work-arounds for now. Maybe a PCI quirks entry or something like
that.

Linus

2002-12-20 19:42:55

by Grant Grundler

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Fri, Dec 20, 2002 at 02:54:28AM +0000, Alan Cox wrote:
> Nothing says your memory can't be behind the bridge and you just turned
> memory access off. Whoops bang, game over.
> And yes this happens on some PC class systems.

ugh. very poor design.
Can someone send URL or old mail describing such a broken system?
(or point me at previous attempts to submit this patch?)

I'm not able to narrow down the search to a reasonable number
of useful keywords.


I agree the patch I posted isn't ready.
But I have a real problem to solve on real HW.

I've tried to catchup with lkml postings via an archive:

Linux Torvalds wrote:
| Think about it: if you move the BAR to high memory, you basically disable
| only _that_ bar, and nothing else. You don't clobber any other associated
| functions, or anything like that.

That's exactly the problem on ia64 - it does.
Could this also be a problem on i386 that we just haven't noticed yet?

Original problem report says:
> In this case, the video device gets 64MB memory space from 0xFC000000 to
> 0xFFFFFFFF improperly, and it conflicts with System reserve region
> (0xFCxxxxxx - 0xFExxxxxx) for SAPIC interrupt messages.
> After that the video device reacts to an SAPIC interrupt improperly.

(https://lists.linuxia64.org/archives//linux-ia64/2002-April/003302.html)

One of my boxes with serverworks chips + Foster CPU (1.6Ghz) also
has SAPIC and /proc/iomem shows:
...
fec00000-fec0ffff : reserved
fee00000-fee00fff : reserved
fff80000-ffffffff : reserved


and later:
| Ivan pointed out that it also disables things like VGA legacy registers.

I expect disabling VGA is only a problem for debugging PCI code.
Is that the only thing that outputs for the duration we have things disabled?
Anyway, I've been there (debugging code crashes the box) and don't want
to go there again.

| It will disable IDE legacy registers too, btw. I'd also expect it to
| disable IDE DMA access, so if you happen to be trying to probe the BAR's
| after somebody started IO on the IDE, you just made that IO fail
| spectacularly, and I'd not be surprised if the IDE controller just locked
| up as a result.

We aren't clearing PCI_COMMAND_MASTER.
I would expect DMA to continue working.
But some chip sets could be broken in this way too...


| Let me re-iterate the "turn power off at the master switch in a house when
| switching a light bulb" analogy. Yes, it's a good idea if you are nervous,
| but you do that only when you _know_ who is in the house and you know what
| they are doing and it's ok by them.

The PCI subsystem does know "who is in the house and what they are doing".
We currently never probe active devices. PCI bridges complicate the
picture since the definition of "active" has to include downstream devices.
I would like to determine "active" (or a reasonable approximation)
and only disable PCI_COMMAND_IO/PCI_COMMAND_MEMORY when it's clearly
safe to do so. I'll resubmit if I can figure out an appropriate test.


Ivan Kokshaysky wrote:
| I don't think that generic BAR probing is ever avoidable - too often
| it's the only way to build a consistent resource tree. Without that
| the driver cannot know whether the BAR setting is safe or there is a
| conflict with something else.

I agree.

| Anyway, in the short term we could give the architecture ability to use its
| own probing code, something like this:

An arch specific hook might be useful.
But how about letting the arch decide if it's safe to disable a device?
ie don't replicate the rest of the code in pci_read_bases() in
the arch that needs to disable the device.
It's very similar to what Ivan proposed.

thanks,
grant

2002-12-20 20:34:07

by Richard B. Johnson

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Fri, 20 Dec 2002, Grant Grundler wrote:

> On Fri, Dec 20, 2002 at 02:54:28AM +0000, Alan Cox wrote:
> > Nothing says your memory can't be behind the bridge and you just turned
> > memory access off. Whoops bang, game over.
> > And yes this happens on some PC class systems.
>
> ugh. very poor design.
> Can someone send URL or old mail describing such a broken system?
> (or point me at previous attempts to submit this patch?)
>
> I'm not able to narrow down the search to a reasonable number
> of useful keywords.
>
>
> I agree the patch I posted isn't ready.
> But I have a real problem to solve on real HW.
>
> I've tried to catchup with lkml postings via an archive:
>
> Linux Torvalds wrote:
> | Think about it: if you move the BAR to high memory, you basically disable
> | only _that_ bar, and nothing else. You don't clobber any other associated
> | functions, or anything like that.
>
> That's exactly the problem on ia64 - it does.
> Could this also be a problem on i386 that we just haven't noticed yet?
>
> Original problem report says:
> > In this case, the video device gets 64MB memory space from 0xFC000000 to
> > 0xFFFFFFFF improperly, and it conflicts with System reserve region
> > (0xFCxxxxxx - 0xFExxxxxx) for SAPIC interrupt messages.
> > After that the video device reacts to an SAPIC interrupt improperly.
>
> (https://lists.linuxia64.org/archives//linux-ia64/2002-April/003302.html)
>
> One of my boxes with serverworks chips + Foster CPU (1.6Ghz) also
> has SAPIC and /proc/iomem shows:
> ...
> fec00000-fec0ffff : reserved
> fee00000-fee00fff : reserved
> fff80000-ffffffff : reserved
>
>
> and later:
> | Ivan pointed out that it also disables things like VGA legacy registers.
>
> I expect disabling VGA is only a problem for debugging PCI code.
> Is that the only thing that outputs for the duration we have things disabled?
> Anyway, I've been there (debugging code crashes the box) and don't want
> to go there again.
>
> | It will disable IDE legacy registers too, btw. I'd also expect it to
> | disable IDE DMA access, so if you happen to be trying to probe the BAR's
> | after somebody started IO on the IDE, you just made that IO fail
> | spectacularly, and I'd not be surprised if the IDE controller just locked
> | up as a result.
>
> We aren't clearing PCI_COMMAND_MASTER.
> I would expect DMA to continue working.
> But some chip sets could be broken in this way too...
>
>
> | Let me re-iterate the "turn power off at the master switch in a house when
> | switching a light bulb" analogy. Yes, it's a good idea if you are nervous,
> | but you do that only when you _know_ who is in the house and you know what
> | they are doing and it's ok by them.
>
> The PCI subsystem does know "who is in the house and what they are doing".
> We currently never probe active devices. PCI bridges complicate the
> picture since the definition of "active" has to include downstream devices.
> I would like to determine "active" (or a reasonable approximation)
> and only disable PCI_COMMAND_IO/PCI_COMMAND_MEMORY when it's clearly
> safe to do so. I'll resubmit if I can figure out an appropriate test.
>
>
> Ivan Kokshaysky wrote:
> | I don't think that generic BAR probing is ever avoidable - too often
> | it's the only way to build a consistent resource tree. Without that
> | the driver cannot know whether the BAR setting is safe or there is a
> | conflict with something else.
>
> I agree.
>
> | Anyway, in the short term we could give the architecture ability to use its
> | own probing code, something like this:
>
> An arch specific hook might be useful.
> But how about letting the arch decide if it's safe to disable a device?
> ie don't replicate the rest of the code in pci_read_bases() in
> the arch that needs to disable the device.
> It's very similar to what Ivan proposed.
>
> thanks,
> grant


I don't think it has anything to do with the bridge. It has
to do with the BAR for the video device. After software is
using that base-address to talk to the device, somebody else
can't come along and change it! The software talking to the
video device doesn't know it's been changed. Whatever you do
with such as device has to happen before the software uses it.

To change the base addresses for the video, one has to tell the
video driver about it, basically execute the video initialization
routines immediately after changing the base address and before
attempting to do any screen output.

On all PCI machines that I've mucked with, software can even
turn OFF device 0 (the host bridge) as long as all PCI I/O
has been suspended and as long as it gets turned back ON. All
with no ill effects at all.

Contrary to what somebody said, memory used by the CPU as RAM will
not be behind a bridge unless you decided to use some PCI device
private RAM for your pleasure. In that case, you had to set up
the mapping yourself anyway, so you certainly know about it.
I don't think anybody would (or could) write code that would
re-map screen-card memory, relocate some code to it, jump to
it, then start mucking with PCI registers. So, you are safe
from that perspective. But, you must spin-lock against all
PCI programming because an interrupt (perhaps for the network)
can happen anytime. If you've got the bridge turned OFF, the
network driver software may get rather confused when it reads
all ones from the device registers. On Intel machines, you
don't get a bus error on a read to non-existent RAM or an
inactive PCI address, you get all bits set. Some drivers may
loop forever, with the interrupts OFF, under these conditions.

Even though PCI video cards have a BIOS, you are supposed to,
and the specification requires, that the code be copied to
RAM before it is executed. If anybody is executing anything
from behind a PCI bridge, then they need to crash or at least
be shot dead. That's not allowed at all.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2002-12-20 20:32:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing



On Fri, 20 Dec 2002, Grant Grundler wrote:
>
> Can someone send URL or old mail describing such a broken system?
> (or point me at previous attempts to submit this patch?)

Th eone system I had myself was a Sony PCG-CR1. I don't have any pointers
to the discussion, it was in the late 2.3.x tree if I remember correctly.

> Linux Torvalds wrote:
> | Think about it: if you move the BAR to high memory, you basically disable
> | only _that_ bar, and nothing else. You don't clobber any other associated
> | functions, or anything like that.
>
> That's exactly the problem on ia64 - it does.
> Could this also be a problem on i386 that we just haven't noticed yet?

Unlikely. The IO-APIC on x86 is in that region, but it doesn't respond
from external sources, it's not actually on the PCI bus and only visible
from the CPU. And the CPU decodes that address internally and sends it on
the APIC bus and thus PCI devices simply do not matter for it.

> | Ivan pointed out that it also disables things like VGA legacy registers.
>
> I expect disabling VGA is only a problem for debugging PCI code.
> Is that the only thing that outputs for the duration we have things disabled?
> Anyway, I've been there (debugging code crashes the box) and don't want
> to go there again.

There's no point in you arguing about this. The problem exists and is real
on x86. The patch posted IS NOT GOING IN. That's final, and there's just
no point to arguing about it.

Alternative methods anyone?

Linus

2002-12-20 21:14:32

by Grant Grundler

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Fri, Dec 20, 2002 at 12:14:43PM -0800, Linus Torvalds wrote:
> Th eone system I had myself was a Sony PCG-CR1. I don't have any pointers
> to the discussion, it was in the late 2.3.x tree if I remember correctly.

tnx - I'll look for that.

> Unlikely. The IO-APIC on x86 is in that region, but it doesn't respond
> from external sources, it's not actually on the PCI bus and only visible
> from the CPU. And the CPU decodes that address internally and sends it on
> the APIC bus and thus PCI devices simply do not matter for it.

xAPIC != APIC
xAPIC ~= SAPIC
AFAICT, My x86 system uses an IO-xAPIC which programmed exactly like
the IO-SAPIC on IA64. i wrote the parisc IO-SAPIC support.
I haven't tried to reproduce the problem seen with IA64 on this box
and it's possible I just can't.

> There's no point in you arguing about this. The problem exists and is real
> on x86. The patch posted IS NOT GOING IN. That's final, and there's just
> no point to arguing about it.

I agree - it shouldn't.

> Alternative methods anyone?

I'll work one out with Ivan K - he usually has good ideas.

thanks,
grant

2002-12-20 21:21:36

by Grant Grundler

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Fri, Dec 20, 2002 at 03:37:49PM -0500, Richard B. Johnson wrote:
> I don't think it has anything to do with the bridge. It has
> to do with the BAR for the video device.

It seems you missed something in the conversation.
The case in question is a PCI bridge forwarding MMIO transactions
to a PCI VGA device.

grant

2002-12-21 18:32:57

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Fri, 20 Dec 2002, Linus Torvalds wrote:

> > That's exactly the problem on ia64 - it does.
> > Could this also be a problem on i386 that we just haven't noticed yet?
>
> Unlikely. The IO-APIC on x86 is in that region, but it doesn't respond
> from external sources, it's not actually on the PCI bus and only visible
> from the CPU. And the CPU decodes that address internally and sends it on
> the APIC bus and thus PCI devices simply do not matter for it.

That's not true for the I/O APIC. That's only true for the local APIC.

The I/O APIC may be wired to the PCI-ISA bridge, specifically the APICCS#
(chip select) line may be driven by that bridge when a PCI cycle targets
the area assigned to the I/O APIC. This is certainly the case for
discrete implementations, like the i82371SB/AB (PCI-ISA bridge) + i82093AA
(I/O APIC) pair, possibly for others, including later I/O APICs integrated
into chipsets. Obviously cycles from CPUs travel to the PCI-ISA bridge
across the associated PCI bus.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-12-21 21:59:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

David Mosberger <[email protected]> writes:

> >>>>> On Fri, 20 Dec 2002 09:05:53 -0800 (PST), Linus Torvalds
> <[email protected]> said:
>
>
> Linus> On Fri, 20 Dec 2002, David Mosberger wrote:
> >> Could you please stop this ia64 paranoia and instead explain to
> >> me why it's OK to relocate a PCI device to
> >> (0x100000000-PCI_dev_size) temporarily? That just seems horribly
> >> unsafe to me.
>
> Linus> No. It's not horribly unsafe at all. It's very safe, for one
> Linus> simple reason: it's how PCI probing has _always_ been
> Linus> done. Exactly because the alternatives simply do not work.
>
> Linus> I can also tell you why it does work, and why it's supposed
> Linus> to work: by writing 0xffffffff to the BAR register, you
> Linus> basically move the BAR to high PCI memory - even if it was
> Linus> enabled before. Which is fine, as long as nobody else is in
> Linus> that high memory. So the secondary rule to "don't turn off
> Linus> MEM or IO accesses" is "never allocate real PCI BAR resources
> Linus> at the top of memory".
>
> A 32-bit PCI device can decode up to 2GB of memory. I doubt any PC
> firmware is reserving the top 2GB for BAR-sizing. Without a
> reasonably small a-priori limit on the address window size of device,
> this method isn't safe.

Actually it is not quite as bad as that.
- We can reasonably assume there are no pci to pci transactions going
on, so the only accesses to a pci resource are generated by the
kernel from printk.

- If the large device is behind a pci bridge it should be shielded
from the chaos.

- If we don't call printk until we restore the old BAR value (which
is currently the case (drivers/pci/probe.c:60)) there should be no
transactions on the pci bus, that get a conflicting routing.

So it is safe because it is very, very local operation.

>
> Linus> Let me re-iterate the "turn power off at the master switch in
> Linus> a house when switching a light bulb" analogy. Yes, it's a
> Linus> good idea if you are nervous, but you do that only when you
> Linus> _know_ who is in the house and you know what they are doing
> Linus> and it's ok by them.
>
> That's a poor analogy. What you're suggesting is more like replacing
> the light bulb while it's powered on. Yes, it can be done, but people
> do get burned and electrocuted that way.

Actually we just need to be careful and not turn on the light
in the room. (I.e. keep the pci bus otherwise quiet while we are
sizing a bar).

> Linus> One solution in the long term may be to not even probe the
> Linus> BAR's at all in generic code, and only do it in the
> Linus> pci_enable_dev() stuff. That way it would literally only be
> Linus> done by the driver, who can hopefully make sure that the
> Linus> device is ok with it.
>
> Yes, I see now that the method in the PCI spec isn't really safe
> either, so BAR-sizing probably shouldn't be done in generic code at
> all.

As long as the pci bus is quiet while we are sizing a bar the current
method safe.

Eric

2002-12-21 22:35:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing


[ Ivan added to the cc, to see if he has any ideas on turning things off ]

On 21 Dec 2002, Eric W. Biederman wrote:
>
> Actually it is not quite as bad as that.
> - We can reasonably assume there are no pci to pci transactions going
> on, so the only accesses to a pci resource are generated by the
> kernel from printk.

Actually, I think it's certainly valid to not allow "printk()" to happen
around the BAR probing, at least at bootup when we control all the CPU's
tightly anyway.

And hotplug devices should be disabled at plug-in, so later BAR probing
should be pretty harmless too (and, as you point out about bridging, they
should be shielded by the hotplug bridge itself).

> - If the large device is behind a pci bridge it should be shielded
> from the chaos.
>
> - If we don't call printk until we restore the old BAR value (which
> is currently the case (drivers/pci/probe.c:60)) there should be no
> transactions on the pci bus, that get a conflicting routing.

The problem has been at least in the case I saw it that there are devices
that aren't entirely quiescent, often because we haven't even _gotten_ to
them yet, and the boot sequence left them active.

The one I saw was USB, and that's likely to be the worst case, since it's
one of very few devices that tends to "do stuff" even when inactive (ie a
USB setup walks the USB command tables in memory continuously, even if
nothing is happening). It's also one of the few classes of devices that
many PC's have SMM support for, so they are still alive even after the
BIOS has otherwise given up control.

> As long as the pci bus is quiet while we are sizing a bar the current
> method safe.

Well, the thing is, as long as the PCI bus is 100% quiet, it simply
doesn't _matter_ which method we use.

The interesting cases are all "some activity that we don't know about is
going on". That's the thing that breaks disabling the PCI device, but it's
also the thing that can break _not_ disabling the PCI device.

So if we can guarantee a quiescent PCI bus, then I could also accept the
patch that disables MEM/IO resources for BAR probing. At that point it
simply shouldn't matter any more, and then I'd happily drop my concerns
about it.

This is why I repeated my "turn the power off at the whole house" analogy,
even if David didn't like it. It's _fine_ to turn the power off if we know
things are quiet, it's just that as things stand now, we don't actually
know that.

If somebody wants to try to follow that method, I can try to dig out the
machine that I had problems on before and test things at least on that
setup to make myself happier about the fact that it really solves the
problem. The solution may be as simple as just making our current
two-phase PCI scanning be a _three_phase one:

- (new) phase 1 - scan for and turn off all devices
- phase 2 - go back and check the resources (BAR probing etc)
- phase 3 - allocate unassigned resources.

One of the problems with turning off devices is that we actually have a
hard time doing so. We can trivially turn off IO/MEM/DMA, but PCI doesn't
have a good way to turn off interrupts (which in turn can become SCI
events).

Which still makes me worry about legacy USB in particular - simply because
I wonder what happens if the USB controller raises an interrupt which
causes an SMM event, which then causes trouble because the SMM handler
will be unhappy when the device isn't there any more.

We've actually had those kinds of problems in real life, see the
quirk_piix3_usb() quirks, for example. So I'm really not trying to be
difficult here, it's just that PC BIOS issues, and SMM in _particular_
tends to be quite a horrible mess for the early boot sequence.

Linus

2002-12-22 07:07:24

by Paul Mackerras

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

Linus Torvalds writes:

> Actually, I think it's certainly valid to not allow "printk()" to happen
> around the BAR probing, at least at bootup when we control all the CPU's
> tightly anyway.

I'd like us to disable interrupts too. On powermacs, the interrupt
controller is typically inside a combo I/O ASIC which is on the PCI
bus. If we take an interrupt while the ASIC's BAR is relocated or
turned off, we will get a machine check when we try to access the
interrupt controller and the kernel will die at that point.

Paul.

2002-12-22 10:30:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing


I believe on most recent x86 boxes we can disable SCI interrupts at
the southbridge or on the CPUs local apic.

So the problem we are facing is that we have some activity while BARs
are being sized. Activity can be in the form of IRQs, DMA, IO, and
MEM accesses. And this is an important case to handle for unknown
devices. And unknown devices do not have to follow our interpretation
of the PCI spec, so we need to walk carefully and do the expected.

Question. Just how did disabling the BARs when sizing then cause
the machine to fail?

So the design should be:
- What can we do to defend ourselves and make it safe to size
a bar.
- How do we keep the window of vulnerability small if the defense
mechanisms do not work.

What I would suggest is:
- First attempt to disable IRQs, and don't do anything in the kernel.
- Second when we come to a device look to see if there is any device
specific code, for sizing or finding BARs. (For some southbridges
code to find BARs in non-standard locations would give us better
information for allocating unset BARs.)
- In the generic code disable the BARs (if it can be done safely) size
the bars and then renable the BARs to their old state.

The goal being to keep the window when uncontrolled pci accesses
can blow the system up as small as possible.

I am probably a bad one to be thinking about this problem because if a
BIOS gives me these kinds of issues, on a platform I really need to
support I port LinuxBIOS... And then their is no SMM code for me to
worry about, and all of the hardware is quiescent when the kernel is
booting...

Eric

2002-12-22 14:55:28

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Sun, 2002-12-22 at 08:15, Paul Mackerras wrote:
> Linus Torvalds writes:
>
> > Actually, I think it's certainly valid to not allow "printk()" to happen
> > around the BAR probing, at least at bootup when we control all the CPU's
> > tightly anyway.
>
> I'd like us to disable interrupts too. On powermacs, the interrupt
> controller is typically inside a combo I/O ASIC which is on the PCI
> bus. If we take an interrupt while the ASIC's BAR is relocated or
> turned off, we will get a machine check when we try to access the
> interrupt controller and the kernel will die at that point.

Actually, it's even worse, as the current probe code also turn off
address decoders of PCI<->PCI bridges when probing, and in a lot of
case, the combo ASIC containing the interrupt controller _is_ behind a
PCI<->PCI bridge ! Same problem with serial port (so printk is unsafe
for quite a while on serial consoles) etc...

The code has a comment that clearly says that we don't know why address
decoding is turned off and that should be fixed, so I suggest we either
fix it now or replace the comment with an explanation of why we need to
turn it off ;) Eventually, turning it off could be made an exception via
some quirks mecanism.

Ben.


2002-12-22 18:32:03

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Sat, Dec 21, 2002 at 02:44:23PM -0800, Linus Torvalds wrote:
> And hotplug devices should be disabled at plug-in, so later BAR probing
> should be pretty harmless too (and, as you point out about bridging, they
> should be shielded by the hotplug bridge itself).

Agreed.

> This is why I repeated my "turn the power off at the whole house" analogy,
> even if David didn't like it. It's _fine_ to turn the power off if we know
> things are quiet, it's just that as things stand now, we don't actually
> know that.

The analogy is wonderful. I'd like to extend it to mentioned ia64 system:
any attempt to replace a light bulb that consumes more than 32 Mb^H^HWatts
without turning the power off will blow up fuses in the whole house.

> - (new) phase 1 - scan for and turn off all devices
> - phase 2 - go back and check the resources (BAR probing etc)
> - phase 3 - allocate unassigned resources.

I don't think we want to turn off *all* devices in phase 1.
Probably we need another level of fixups - PCI_FIXUP_EARLY, which
will be called after phase 1, i.e. before fiddling with BARs.

However, I'm still not convinced that we should ever change anything.

On Sun, Dec 22, 2002 at 03:38:04AM -0700, Eric W. Biederman wrote:
> So the problem we are facing is that we have some activity while BARs
> are being sized. Activity can be in the form of IRQs, DMA, IO, and
> MEM accesses.

I would suggest another classification: a) bus activity generated by CPU
and b) bus activity generated by other devices.

Case 'a' is mostly under our control. The only problem is an interrupt
arriving during BAR probe, as Paul pointed out. This can be fixed:
pci_read_config_dword(dev, reg, &l);
+ local_irq_save(flags);
pci_write_config_dword(dev, reg, ~0);
pci_read_config_dword(dev, reg, &sz);
pci_write_config_dword(dev, reg, l);
+ local_irq_restore(flags);

Case 'b', two variants.
- The target of bus activity is a region located in the low bus
addresses (USB case), so there are no conflicts with probed BARs
of other devices. But we cannot safely disable devices while BARs
are being sized as we don't know for sure what device controls
that region.
- Target region is in the high bus addresses (ia64 case), and conflicts
are possible when we do the sizing. Yes, disabling the devices does help
to avoid conflicts and it happens to work with *some* configurations.
But what if the said region in its turn is defined by BAR? or just
controlled by some PCI device and therefore can be disconnected from
the bus? Too many if's.

I think it's obvious now that "disabling the BARs" is not a fix at all,
it's just a hack for that particular setup.
BTW, I think it is *not* required by PCI spec - v2.1 says nothing about it,
in v2.2 it seems to be just an implementation example.

The real fix is to shut up the device that makes a noise on the bus,
at least during bus probe.
I refuse to believe that it cannot be done with that ia64 SAPIC (must be
a hidden switch somewhere ;-)
So I'd wait for more tech details regarding that.

> The goal being to keep the window when uncontrolled pci accesses
> can blow the system up as small as possible.

The window will be large enough anyway - the PCI config space accesses are
very slow on most systems.

Ivan.

2002-12-22 19:13:20

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Sun, Dec 22, 2002 at 04:03:01PM +0100, Benjamin Herrenschmidt wrote:
> The code has a comment that clearly says that we don't know why address
> decoding is turned off and that should be fixed, so I suggest we either
> fix it now or replace the comment with an explanation of why we need to
> turn it off ;)

It certainly wants to be fixed, thanks. Looks like a thinko.

Just out of curiosity, formerly you mentioned that said ASIC cannot
be relocated in the PCI address space, why? Firmware issues or anything
else?

Ivan.

2002-12-22 19:42:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing


On Sun, 22 Dec 2002, Paul Mackerras wrote:
>
> Linus Torvalds writes:
>
> > Actually, I think it's certainly valid to not allow "printk()" to happen
> > around the BAR probing, at least at bootup when we control all the CPU's
> > tightly anyway.
>
> I'd like us to disable interrupts too. On powermacs, the interrupt
> controller is typically inside a combo I/O ASIC which is on the PCI
> bus.

I certainly wpouldn't mind disabling local interrupts around that code
sequence, but the problem with that is that I don't think it much matters,
since it wouldn't help the SMP case anyway (and in fact even doing an
old-style "cli/sti" if we still supported it would not have helped either,
since the global interrupt disable would still have allowed the other
CPU's to touch the irq controller).

And since the pci bus will be scanned in an "subsys_initcall", the other
CPU's are definitely active at that point.

In short, I don't think we can easily fix this. And again, I think that
the only good approach would be to really make sure the machine is
quescent while the PCI bus is initially scanned (and again, the later
scans behing proper hotplug bus bridges should be fine thanks to bridge
decoding limits).

Sadly, as I already mentioned, there isn't any generic PCI method of
shutting up device interrupts.

I think the only workable setup (which we effectively already do on x86,
not that we should have this problem in the first place) is to just turn
off the interrupts _at_ the interrupt controller. Which simply means that
we really shouldn't be getting any interrupts while probing the root bus.
Then you enable the interrupts as device drivers request them..

Linus

2002-12-22 19:40:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

Ivan Kokshaysky <[email protected]> writes:

> On Sat, Dec 21, 2002 at 02:44:23PM -0800, Linus Torvalds wrote:
> > And hotplug devices should be disabled at plug-in, so later BAR probing
> > should be pretty harmless too (and, as you point out about bridging, they
> > should be shielded by the hotplug bridge itself).
>
> Agreed.
>
> > This is why I repeated my "turn the power off at the whole house" analogy,
> > even if David didn't like it. It's _fine_ to turn the power off if we know
> > things are quiet, it's just that as things stand now, we don't actually
> > know that.
>
> The analogy is wonderful. I'd like to extend it to mentioned ia64 system:
> any attempt to replace a light bulb that consumes more than 32 Mb^H^HWatts
> without turning the power off will blow up fuses in the whole house.
>
> > - (new) phase 1 - scan for and turn off all devices
> > - phase 2 - go back and check the resources (BAR probing etc)
> > - phase 3 - allocate unassigned resources.
>
> I don't think we want to turn off *all* devices in phase 1.
> Probably we need another level of fixups - PCI_FIXUP_EARLY, which
> will be called after phase 1, i.e. before fiddling with BARs.
>
> However, I'm still not convinced that we should ever change anything.
>
> On Sun, Dec 22, 2002 at 03:38:04AM -0700, Eric W. Biederman wrote:
> > So the problem we are facing is that we have some activity while BARs
> > are being sized. Activity can be in the form of IRQs, DMA, IO, and
> > MEM accesses.
>
> I would suggest another classification: a) bus activity generated by CPU
> and b) bus activity generated by other devices.
>
> Case 'a' is mostly under our control. The only problem is an interrupt
> arriving during BAR probe, as Paul pointed out. This can be fixed:
> pci_read_config_dword(dev, reg, &l);
> + local_irq_save(flags);
> pci_write_config_dword(dev, reg, ~0);
> pci_read_config_dword(dev, reg, &sz);
> pci_write_config_dword(dev, reg, l);
> + local_irq_restore(flags);

Not if it is an NMI or an SCI interrupt. The latter on x86 places
the cpu in System Management Mode, and what the cpu does from that
point forward is out of our control.

Though disabling cpu controlled IRQs help if you are dealing
with any normal IRQs.

> Case 'b', two variants.
> - The target of bus activity is a region located in the low bus
> addresses (USB case), so there are no conflicts with probed BARs
> of other devices. But we cannot safely disable devices while BARs
> are being sized as we don't know for sure what device controls
> that region.
> - Target region is in the high bus addresses (ia64 case), and conflicts
> are possible when we do the sizing. Yes, disabling the devices does help
> to avoid conflicts and it happens to work with *some* configurations.
> But what if the said region in its turn is defined by BAR? or just
> controlled by some PCI device and therefore can be disconnected from
> the bus? Too many if's.
>
> I think it's obvious now that "disabling the BARs" is not a fix at all,
> it's just a hack for that particular setup.
> BTW, I think it is *not* required by PCI spec - v2.1 says nothing about it,
> in v2.2 it seems to be just an implementation example.
>
> The real fix is to shut up the device that makes a noise on the bus,
> at least during bus probe.
> I refuse to believe that it cannot be done with that ia64 SAPIC (must be
> a hidden switch somewhere ;-)
> So I'd wait for more tech details regarding that.
>
> > The goal being to keep the window when uncontrolled pci accesses
> > can blow the system up as small as possible.
>
> The window will be large enough anyway - the PCI config space accesses are
> very slow on most systems.

The window needs to be small from the PCI bus perspective, not in cpu
clocks. Write, Read, Write is only something like 9 PCI bus clocks.
If IRQs also traveled over the PCI bus we could be fairly confident of
saturating the PCI bus and not having problems. Except that the PCI
bus arbiter may not choose the CPU for back to back transactions.

Not that I would like to trust to the small window of opportunity
idea, but closing all of the other holes looks quite difficult.

Eric

2002-12-22 19:50:03

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

Linus Torvalds <[email protected]> writes:

> On Sun, 22 Dec 2002, Paul Mackerras wrote:
> >
> > Linus Torvalds writes:
> >
> > > Actually, I think it's certainly valid to not allow "printk()" to happen
> > > around the BAR probing, at least at bootup when we control all the CPU's
> > > tightly anyway.
> >
> > I'd like us to disable interrupts too. On powermacs, the interrupt
> > controller is typically inside a combo I/O ASIC which is on the PCI
> > bus.
>
> I certainly wpouldn't mind disabling local interrupts around that code
> sequence, but the problem with that is that I don't think it much matters,
> since it wouldn't help the SMP case anyway (and in fact even doing an
> old-style "cli/sti" if we still supported it would not have helped either,
> since the global interrupt disable would still have allowed the other
> CPU's to touch the irq controller).
>
> And since the pci bus will be scanned in an "subsys_initcall", the other
> CPU's are definitely active at that point.

Ouch, I hadn't thought about that one...

> In short, I don't think we can easily fix this. And again, I think that
> the only good approach would be to really make sure the machine is
> quescent while the PCI bus is initially scanned (and again, the later
> scans behing proper hotplug bus bridges should be fine thanks to bridge
> decoding limits).

For a similar reason this is why my kexec stuff does not attempt to
do generic device disables.

> Sadly, as I already mentioned, there isn't any generic PCI method of
> shutting up device interrupts.
>
> I think the only workable setup (which we effectively already do on x86,
> not that we should have this problem in the first place) is to just turn
> off the interrupts _at_ the interrupt controller. Which simply means that
> we really shouldn't be getting any interrupts while probing the root bus.
> Then you enable the interrupts as device drivers request them..

Well we need southbridge drivers on x86 if we want to prevent SCI interrupts
because those go around the i82559. Though we might catch them at the local
apic...

Eric

2002-12-22 20:57:47

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Sun, Dec 22, 2002 at 12:47:51PM -0700, Eric W. Biederman wrote:
> Not if it is an NMI or an SCI interrupt. The latter on x86 places
> the cpu in System Management Mode, and what the cpu does from that
> point forward is out of our control.
>
> Though disabling cpu controlled IRQs help if you are dealing
> with any normal IRQs.

I meant the timer interrupt in the first place. I assumed it's the
only one that does matter on this stage of the boot process.
What else could happen (in the real world terms)?

> The window needs to be small from the PCI bus perspective, not in cpu
> clocks. Write, Read, Write is only something like 9 PCI bus clocks.

No, the window is huge from the PCI bus perspective.
IIRC, PCI config read/write on x86 works like this (I may be wrong though):
i/o port write (BAR address) ~1us
i/o port read (BAR value after writing ~0) ~1us
i/o port write (BAR address) ~1us
i/o port write (saved BAR value) ~1us

It's a little bit more than 9 PCI clocks. Am I missing something?

Ivan.

2002-12-23 00:21:14

by Paul Mackerras

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

Ivan Kokshaysky writes:

> Just out of curiosity, formerly you mentioned that said ASIC cannot
> be relocated in the PCI address space, why? Firmware issues or anything
> else?

It's mainly the fact that we have already ioremapped parts of the
address space of the ASIC. For example we would have ioremapped the
interrupt controller's registers in init_IRQ(), which happens much
earlier than PCI probing. If we are using a serial console via one of
the serial ports in the ASIC, we would have ioremapped the serial
ports by this time. And so on.

We could relocate the ASIC if we could find the ioremaps and fix them
up, or if we could get to all the drivers and have them re-do their
ioremaps. There is no way to do that at the moment, though.
Typically the ASIC will have a couple of IDE interfaces, audio,
serial, i2c, interrupt controller, wireless ethernet, timer, and PMU
(power management unit) interfaces in it, so there are several drivers
involved.

In fact we don't really need to probe the BARs of the ASIC at all,
because the device tree that we get from Open Firmware tells us the
size and location of the resources it is using (along with all the
other PCI devices in the system). If we could have a
platform-specific hook so that we could provide an alternative method
for probing the BARs of certain PCI devices, that would let us avoid
the whole problem.

Paul.

2002-12-23 15:18:41

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: PATCH 2.5.x disable BAR when sizing

On Mon, Dec 23, 2002 at 11:29:15AM +1100, Paul Mackerras wrote:
> We could relocate the ASIC if we could find the ioremaps and fix them
> up, or if we could get to all the drivers and have them re-do their
> ioremaps. There is no way to do that at the moment, though.
> Typically the ASIC will have a couple of IDE interfaces, audio,
> serial, i2c, interrupt controller, wireless ethernet, timer, and PMU
> (power management unit) interfaces in it, so there are several drivers
> involved.

I see, thanks. I always had an impression that pci_init is called
way too late in the boot sequence, at least on the PCI-centric systems.
This makes all that PCI surgery really painful. Ideally, the init
order should be defined per architecture. This applies to the
generic device model as well: currently we have legacy, isa-pnp
and other stuff initialized and probed *before* PCI, though typically
these are hanging of the PCI bus. And it's a real problem on machines
with multiple PCI buses. Oh, well...

> In fact we don't really need to probe the BARs of the ASIC at all,
> because the device tree that we get from Open Firmware tells us the
> size and location of the resources it is using (along with all the
> other PCI devices in the system). If we could have a
> platform-specific hook so that we could provide an alternative method
> for probing the BARs of certain PCI devices, that would let us avoid
> the whole problem.

It can be trivially done using Linus' idea of probing in 2 phases.
The pass 1 can be made 100% non-destructive: we can initialize
device structures, read in device/vendor IDs, class codes and
build the bus trees. Basically, everything that we do currently
minus pci_read_bases(), so the resource fields are blank. Then we
can call arch- and device-specific routines, where along with other
fixups the device BARs could be probed in arch-specific way and written
down in the pci_dev structure.
In the pass 2 (generic BAR probing), we can check the resource flags
for non-zero value and skip the probe of respective BAR.
This would make the code a LOT more flexible.

Ivan.

2003-01-05 12:33:07

by Ivan Kokshaysky

[permalink] [raw]
Subject: [patch 2.5] PCI: allow alternative methods for probing the BARs

Hopefully this patch should solve most problems with probing the BARs.
The changes are quite minimal as everything still is done in one pass.
- Added another level of fixups (PCI_FIXUP_EARLY), called with only
device/vendor IDs and class code filled in, before sizing the BARs.
- pci_read_bases won't probe the BAR if the respective resource has
non-zero flags.

This allows to implement numerous alternative probing methods for
particular architecture/device/BAR. Some of possible variants:
get information from firmware and don't touch the BAR at all;
write the BAR value into respective resource->start and set
resource->end according to the chip manual without probing;
disable the device, call generic pci_read_bases(), re-enable the device;
probe with value other than 0xffffffff and so on.

Besides, this would easily solve one interesting (though unrelated) problem.
There are quite a few devices (IDE controllers, AGP aperture stuff,
some bridges) that significantly change their PCI header including BARs,
class code etc. depending on some configuration bits, so it might be useful
to program the device into desired state before probing.

Ivan.

--- 2.5.54/drivers/pci/probe.c Thu Jan 2 06:22:34 2003
+++ linux/drivers/pci/probe.c Sat Jan 4 19:02:19 2003
@@ -53,6 +53,9 @@ static void pci_read_bases(struct pci_de
next = pos+1;
res = &dev->resource[pos];
res->name = dev->dev.name;
+ /* Skip already probed resources */
+ if (res->flags)
+ continue;
reg = PCI_BASE_ADDRESS_0 + (pos << 2);
pci_read_config_dword(dev, reg, &l);
pci_write_config_dword(dev, reg, ~0);
@@ -355,10 +358,7 @@ int pci_setup_device(struct pci_dev * de
sprintf(dev->dev.name, "PCI device %04x:%04x", dev->vendor, dev->device);
INIT_LIST_HEAD(&dev->pools);

- pci_read_config_dword(dev, PCI_CLASS_REVISION, &class);
- class >>= 8; /* upper 3 bytes */
- dev->class = class;
- class >>= 8;
+ class = dev->class >> 8;

DBG("Found %02x:%02x [%04x/%04x] %06x %02x\n", dev->bus->number, dev->devfn, dev->vendor, dev->device, class, dev->hdr_type);

@@ -433,9 +433,16 @@ struct pci_dev * __devinit pci_scan_devi
dev->vendor = l & 0xffff;
dev->device = (l >> 16) & 0xffff;

+ pci_read_config_dword(dev, PCI_CLASS_REVISION, &l);
+ dev->class = l >> 8; /* upper 3 bytes */
+
/* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer)
set this higher, assuming the system even supports it. */
dev->dma_mask = 0xffffffff;
+
+ /* Early fixups, before probing the BARs */
+ pci_fixup_device(PCI_FIXUP_EARLY, dev);
+
if (pci_setup_device(dev) < 0) {
kfree(dev);
return NULL;
--- 2.5.54/include/linux/pci.h Thu Jan 2 06:21:56 2003
+++ linux/include/linux/pci.h Sun Jan 5 15:27:51 2003
@@ -809,8 +809,12 @@ struct pci_fixup {

extern struct pci_fixup pcibios_fixups[];

-#define PCI_FIXUP_HEADER 1 /* Called immediately after reading configuration header */
-#define PCI_FIXUP_FINAL 2 /* Final phase of device fixups */
+#define PCI_FIXUP_EARLY 1 /* Called immediately after reading
+ header type, device/vendor IDs
+ and class code */
+#define PCI_FIXUP_HEADER 2 /* Called after reading configuration
+ header (including BARs) */
+#define PCI_FIXUP_FINAL 3 /* Final phase of device fixups */

void pci_fixup_device(int pass, struct pci_dev *dev);

2003-01-06 00:26:13

by Anton Blanchard

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs


> Hopefully this patch should solve most problems with probing the BARs.
> The changes are quite minimal as everything still is done in one pass.
> - Added another level of fixups (PCI_FIXUP_EARLY), called with only
> device/vendor IDs and class code filled in, before sizing the BARs.
> - pci_read_bases won't probe the BAR if the respective resource has
> non-zero flags.
>
> This allows to implement numerous alternative probing methods for
> particular architecture/device/BAR. Some of possible variants:
> get information from firmware and don't touch the BAR at all;

This sounds useful on ppc64 where firmware does a good job of setting
up the BARs.

Anton

2003-01-06 04:37:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs


On Sun, 5 Jan 2003, Ivan Kokshaysky wrote:
>
> Hopefully this patch should solve most problems with probing the BARs.
> The changes are quite minimal as everything still is done in one pass.

Can you do the same with a multi-pass thing?

I really think the single-pass approach is broken, because it means that
we _cannot_ have a fixup for device that runs _before_ the fixup for the
bridge that bridges to the device.

As such, the "PCI_FIXUP_EARLY" is not _nearly_ early enough, since it's
way too late for the actual problem that started this whole thread (ie in
order to turn off a bridge, we have to make sure that everything behind
the bridge is turned off _first_).

In other words, we really should be able to do all the bus number setup
_first_. That isn't dependent ont eh BAR's or anything else. The actual
_sizing_ of the bus is clearly somethign we cannot do early, but we can
(and should) enumerate the devices first in phase #1.

Alternatively, we could even have a very limited phase #1 that only
enumerates _reachable_ devices (ie it doesn't even try to create bus
numbers, it only enumerates devices and buses that have already been set
up by the firmware, and ignores bridges that aren't set up yet). A pure
discovery phase, without any configuration at all.

Hmm?

Linus

2003-01-06 05:14:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

Linus Torvalds <[email protected]> writes:

> On Sun, 5 Jan 2003, Ivan Kokshaysky wrote:
> >
> > Hopefully this patch should solve most problems with probing the BARs.
> > The changes are quite minimal as everything still is done in one pass.
>
> Can you do the same with a multi-pass thing?
>
> I really think the single-pass approach is broken, because it means that
> we _cannot_ have a fixup for device that runs _before_ the fixup for the
> bridge that bridges to the device.
>
> As such, the "PCI_FIXUP_EARLY" is not _nearly_ early enough, since it's
> way too late for the actual problem that started this whole thread (ie in
> order to turn off a bridge, we have to make sure that everything behind
> the bridge is turned off _first_).
>
> In other words, we really should be able to do all the bus number setup
> _first_. That isn't dependent ont eh BAR's or anything else. The actual
> _sizing_ of the bus is clearly somethign we cannot do early, but we can
> (and should) enumerate the devices first in phase #1.
>
> Alternatively, we could even have a very limited phase #1 that only
> enumerates _reachable_ devices (ie it doesn't even try to create bus
> numbers, it only enumerates devices and buses that have already been set
> up by the firmware, and ignores bridges that aren't set up yet). A pure
> discovery phase, without any configuration at all.
>
> Hmm?

I have done something similar to this in LinuxBIOS, perhaps a
description of that will spark ideas.

In the enumeration phase I look at the vendor+device id and then hdr
type to assign methods. The methods I have are:
scan_bus,
read_bases,
write_bases.

Then for all children of a bus that have a scan_bus function I
recurse. This fits in very naturally with devices not necessarily
being pci devices.

When it comes time to assign pci resources I call the read_bases
method to get the size and possibly fixed location of the resource. I
play with my data structures to get a non conflicting set of resources
and then I call write_bases on each device to write the resource
assignments to the actual device.

The setup works well enough that I don't have to differentiation later
in the code between pci busses and normal pci devices.

I have not had to push beyond the standard pci devices yet, but I
don't think I have missed anything that would cause me problems.

Linus is this the kind of thing you are thinking of?

Eric

2003-01-06 10:25:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Mon, 2003-01-06 at 05:14, Linus Torvalds wrote:

> Alternatively, we could even have a very limited phase #1 that only
> enumerates _reachable_ devices (ie it doesn't even try to create bus
> numbers, it only enumerates devices and buses that have already been set
> up by the firmware, and ignores bridges that aren't set up yet). A pure
> discovery phase, without any configuration at all.

That would only work for cases where the kernel isn't responsible for
the actual bus numbering/renumbering. For example, since we do not quite
deal with PCI domains yet, I have to force pcibios_assign_all_busses()
on pmac so that my 3 root busses get re-numbered (else I get 3 trees
with conflicting bus numbers all starting at 0). Another case is embedded
which can deal with completely unassigned bus numbers.

Ben.

2003-01-06 10:26:00

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Mon, 2003-01-06 at 05:14, Linus Torvalds wrote:

> Alternatively, we could even have a very limited phase #1 that only
> enumerates _reachable_ devices (ie it doesn't even try to create bus
> numbers, it only enumerates devices and buses that have already been set
> up by the firmware, and ignores bridges that aren't set up yet). A pure
> discovery phase, without any configuration at all.

That would only work for cases where the kernel isn't responsible for
the actual bus numbering/renumbering. For example, since we do not quite
deal with PCI domains yet, I have to force pcibios_assign_all_busses()
on pmac so that my 3 root busses get re-numbered (else I get 3 trees
with conflicting bus numbers all starting at 0). Another case is embedded
which can deal with completely unassigned bus numbers.

Ben.

2003-01-06 11:25:45

by Andrew Walrond

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

My (as yet unresolved) problem seems related to this stuff.
I've applied this patch on the off-chance it might fix my problems, to
no avail. As you guys know about this stuff, perhaps you could glance
over my problem description - attached. I be much obliged, as I can't
use nptl *without* 2.5.x, and I can't use my e1000 *with* 2.5.x


2.5.53 and 2.5.54 do not seem to find everything on the pci bus on this
asus pr-dls dual p4 xeon m/b

With 2.4.20+ I see

hal3 root # cat /proc/pci
PCI devices found:
...

Bus 0, device 0, function 0:
Host bridge: PCI device 1166:0012 (ServerWorks) (rev 19).
Bus 0, device 0, function 1:
Host bridge: PCI device 1166:0012 (ServerWorks) (rev 0).
Bus 0, device 0, function 2:
Host bridge: PCI device 1166:0000 (ServerWorks) (rev 0).
Bus 0, device 2, function 0:
Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 16).
IRQ 18.
Master Capable. Latency=32. Min Gnt=8.Max Lat=56.
Non-prefetchable 32 bit memory at 0xfd800000 [0xfd800fff].
I/O at 0xd800 [0xd83f].
Non-prefetchable 32 bit memory at 0xfd000000 [0xfd01ffff].
Bus 0, device 3, function 0:
VGA compatible controller: ATI Technologies Inc Rage XL (rev 39).
IRQ 46.
Master Capable. Latency=32. Min Gnt=8.
Non-prefetchable 32 bit memory at 0xfc000000 [0xfcffffff].
I/O at 0xd400 [0xd4ff].
Non-prefetchable 32 bit memory at 0xfb800000 [0xfb800fff].
Bus 0, device 15, function 0:
ISA bridge: ServerWorks CSB5 South Bridge (rev 147).
Master Capable. Latency=32.
Bus 0, device 15, function 3:
Host bridge: ServerWorks GCLE Host Bridge (rev 0).
Bus 0, device 16, function 0:
Host bridge: PCI device 1166:0101 (ServerWorks) (rev 3).
Master Capable. Latency=64.
Bus 0, device 16, function 2:
Host bridge: PCI device 1166:0101 (ServerWorks) (rev 3).
Master Capable. Latency=64.
Bus 0, device 17, function 0:
Host bridge: PCI device 1166:0101 (ServerWorks) (rev 3).
Master Capable. Latency=64.
Bus 0, device 17, function 2:
Host bridge: PCI device 1166:0101 (ServerWorks) (rev 3).
Master Capable. Latency=64.
Bus 14, device 4, function 0:
SCSI storage controller: LSI Logic / Symbios Logic (formerly NCR)
53c1030 (rev 7).
IRQ 22.
Master Capable. Latency=32. Min Gnt=17.Max Lat=18.
I/O at 0xa000 [0xa0ff].
Non-prefetchable 64 bit memory at 0xfa000000 [0xfa00ffff].
Non-prefetchable 64 bit memory at 0xf9800000 [0xf980ffff].
Bus 14, device 4, function 1:
SCSI storage controller: LSI Logic / Symbios Logic (formerly NCR)
53c1030 (#2) (rev 7).
IRQ 23.
Master Capable. Latency=32. Min Gnt=17.Max Lat=18.
I/O at 0x9800 [0x98ff].
Non-prefetchable 64 bit memory at 0xf9000000 [0xf900ffff].
Non-prefetchable 64 bit memory at 0xf8800000 [0xf880ffff].
Bus 18, device 2, function 0:
Ethernet controller: Intel Corp. 82544GC Gigabit Ethernet
Controller (rev 2).
IRQ 19.
Master Capable. Latency=32. Min Gnt=255.
Non-prefetchable 64 bit memory at 0xf8000000 [0xf801ffff].
Non-prefetchable 64 bit memory at 0xf7800000 [0xf781ffff].
I/O at 0x9400 [0x941f].


But with 2.5.53/2.5.54 bus 14 and bus 18 are missing:

hal3 root # cat /proc/pci
PCI devices found:
Bus 0, device 0, function 0:
Host bridge: PCI device 1166:0012 (ServerWorks) (rev 19).
Bus 0, device 0, function 1:
Host bridge: PCI device 1166:0012 (ServerWorks) (rev 0).
Bus 0, device 0, function 2:
Host bridge: PCI device 1166:0000 (ServerWorks) (rev 0).
Bus 0, device 2, function 0:
Ethernet controller: Intel Corp. 82557/8/9 [Ethernet (rev 16).
IRQ 18.
Master Capable. Latency=32. Min Gnt=8.Max Lat=56.
Non-prefetchable 32 bit memory at 0xfd800000 [0xfd800fff].
I/O at 0xd800 [0xd83f].
Non-prefetchable 32 bit memory at 0xfd000000 [0xfd01ffff].
Bus 0, device 3, function 0:
VGA compatible controller: ATI Technologies Inc Rage XL (rev 39).
IRQ 46.
Master Capable. Latency=32. Min Gnt=8.
Non-prefetchable 32 bit memory at 0xfc000000 [0xfcffffff].
I/O at 0xd400 [0xd4ff].
Non-prefetchable 32 bit memory at 0xfb800000 [0xfb800fff].
Bus 0, device 15, function 0:
ISA bridge: ServerWorks CSB5 South Bridge (rev 147).
Master Capable. Latency=32.
Bus 0, device 15, function 3:
Host bridge: PCI device 1166:0225 (ServerWorks) (rev 0).
Bus 0, device 16, function 0:
Host bridge: PCI device 1166:0101 (ServerWorks) (rev 3).
Master Capable. Latency=64.
Bus 0, device 16, function 2:
Host bridge: PCI device 1166:0101 (ServerWorks) (rev 3).
Master Capable. Latency=64.
Bus 0, device 17, function 0:
Host bridge: PCI device 1166:0101 (ServerWorks) (rev 3).
Master Capable. Latency=64.
Bus 0, device 17, function 2:
Host bridge: PCI device 1166:0101 (ServerWorks) (rev 3).
Master Capable. Latency=64.


ACPI is enabled in both cases (Won't boot with pci=noacpi)

2003-01-06 15:27:52

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Sun, Jan 05, 2003 at 08:14:53PM -0800, Linus Torvalds wrote:
> Can you do the same with a multi-pass thing?
>
> I really think the single-pass approach is broken, because it means that
> we _cannot_ have a fixup for device that runs _before_ the fixup for the
> bridge that bridges to the device.
>
> As such, the "PCI_FIXUP_EARLY" is not _nearly_ early enough, since it's
> way too late for the actual problem that started this whole thread (ie in
> order to turn off a bridge, we have to make sure that everything behind
> the bridge is turned off _first_).

I totally agree, 2-phase approach is a right way to go. However, I've been
confused by that "feature freeze" thing. ;-)
This patch has zero impact on existing code - I thought it's a "feature"
(and I still believe it's ok for 2.4).
OTOH, I've actually tried to implement 2-phase probing _inside_ the
pci_do_scan_bus, and that was ugly as hell.
I believe the phase #2 must be a separate top-level function -
something like pci_probe_resources (better naming?), but this
assumes some changes to PCI code for every architecture and hotplug
drivers.

> In other words, we really should be able to do all the bus number setup
> _first_. That isn't dependent ont eh BAR's or anything else. The actual
> _sizing_ of the bus is clearly somethign we cannot do early, but we can
> (and should) enumerate the devices first in phase #1.
>
> Alternatively, we could even have a very limited phase #1 that only
> enumerates _reachable_ devices (ie it doesn't even try to create bus
> numbers, it only enumerates devices and buses that have already been set
> up by the firmware, and ignores bridges that aren't set up yet). A pure
> discovery phase, without any configuration at all.

I like the former. Even if we have to reassign the bus numbers,
we won't affect anything but forwarding the PCI configuration
cycles.

For now I'd start with following:
phase #1. pci_do_scan_bus() - build the bus/device tree, read in
dev/vendor IDs, header type and class code; call
"PCI_FIXUP_EARLY" fixups.
phase #2. pci_probe_resources(bus) - walk the bus tree again,
probe the BARs, maybe call pci_read_bridge_bases for
bridges; fill in the rest of PCI header; call "PCI_FIXUP_HEADER"
fixups.
(Also there are phases #3 and #4 - assign resources and assign unassigned
resources, but it's another story)

Then, for example on alpha, we'll have

/* Scan all of the recorded PCI controllers. */
for (next_busno = 0, hose = hose_head; hose; hose = hose->next) {
hose->first_busno = next_busno;
hose->last_busno = 0xff;
bus = pci_scan_bus(next_busno, alpha_mv.pci_ops, hose);
hose->bus = bus;
next_busno = hose->last_busno = bus->subordinate;
+ pci_probe_resources(bus);
next_busno += 1;
}

Hmm, this looks good - now we can do early bus-specific fixups
without introducing "pcibios_init_bus" thing (suggested by Grant and
myself quite a while ago).

I'll try to code up all of this and do some basic testing on alpha
and i386 in next few days.

Comments?

Ivan.

2003-01-06 19:38:08

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Mon, Jan 06, 2003 at 11:31:10AM +0100, Benjamin Herrenschmidt wrote:
> For example, since we do not quite deal with PCI domains yet,

I thought Bjorn Helgaas submitted "PCI Segment" support for ia64?
Was something else missing from that? Or was that 2.4.x only?

IIRC, lspci needs to be fixed to support this as well.

grant

2003-01-06 21:45:45

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Mon, Jan 06, 2003 at 06:33:28PM +0300, Ivan Kokshaysky wrote:
> For now I'd start with following:
> phase #1. pci_do_scan_bus() - build the bus/device tree, read in
> dev/vendor IDs, header type and class code; call
> "PCI_FIXUP_EARLY" fixups.

you meant pci_scan_bus() for #1?

(pci_do_scan_bus() is the implementation, but I thought arch code
is supposed to call pci_scan_bus().)

> phase #2. pci_probe_resources(bus) - walk the bus tree again,
> probe the BARs, maybe call pci_read_bridge_bases for
> bridges; fill in the rest of PCI header; call "PCI_FIXUP_HEADER"
> fixups.
> (Also there are phases #3 and #4 - assign resources and assign unassigned
> resources, but it's another story)
...
> Hmm, this looks good - now we can do early bus-specific fixups
> without introducing "pcibios_init_bus" thing (suggested by Grant and
> myself quite a while ago).

I agree - it looks better. But I fail to see how it fixes the problem
of knowing when we can or cannot disable a device in order to size
the BAR. I'm under the impression some i386 specific code needs to be added
to identify/fixup the broken configurations (memory disabled when
a PCI Bridge is disabled).

I'm thinking the same logic needed by PCI hotplug to allow safe removal
of a bridge device would be useful to determine if a PCI Bridge could
be safely (temporarily) disabled (I'm thinking 4-port 100BT card).
Ie a "no subordinate drivers active" and "not in use" checking.

thanks,
grant

2003-01-06 21:54:34

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Mon, Jan 06, 2003 at 11:32:29AM +0000, Andrew Walrond wrote:
> 2.5.53 and 2.5.54 do not seem to find everything on the pci bus on this
> asus pr-dls dual p4 xeon m/b
...
> But with 2.5.53/2.5.54 bus 14 and bus 18 are missing:
...
> ACPI is enabled in both cases (Won't boot with pci=noacpi)

Could this be an ACPI resource bug?
I thought ACPI reported the PCI bus controllers and it's possible
the 2.5.x code is just not finding the right resources.

In any case, I don't think this is related to the original problem.

grant

2003-01-06 23:43:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs


On Mon, 6 Jan 2003, Grant Grundler wrote:
>
> I agree - it looks better. But I fail to see how it fixes the problem
> of knowing when we can or cannot disable a device in order to size
> the BAR.

With a two-phase discovery, and a separate FIXUP_EARLY in the first phase,
we can now make fixups for devices behind bridges.

In particular, we can make the first phase disable DMA on the devices we
find, which means that we know they won't be generating PCI traffic during
the second phase - so now the second phase (which does the BAR sizing) can
do sizing and be safe in the knowledge that there should be no random PCI
activity ongoing at the same time.

That should fix at least the USB DMA problem.

> I'm under the impression some i386 specific code needs to be added
> to identify/fixup the broken configurations (memory disabled when
> a PCI Bridge is disabled).

It's fine to temporarily disable memory on the northbridge, as long as
nothign else tries to _access_ that memory at the same time. The tho-phase
discovery with the first phase doing the disables should make sure of
that.

Linus

2003-01-07 00:06:20

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Mon, Jan 06, 2003 at 03:19:09PM -0800, Linus Torvalds wrote:
> In particular, we can make the first phase disable DMA on the devices we
> find, which means that we know they won't be generating PCI traffic during
> the second phase - so now the second phase (which does the BAR sizing) can
> do sizing and be safe in the knowledge that there should be no random PCI
> activity ongoing at the same time.

Did you expect the PCI_COMMAND_MASTER disabled in the USB Controller
or something else in the controller turned off?

Turning off MASTER will also disable the controller from responding
to MMIO ...ie USB subsystem can't touch the USB controller until
it's re-enabled (no USB interrupts). That's ok if PCI will re-enable
USB controller in a later PCI setup phase. In general I expect a driver
to call pci_enable_device() but I don't know anything about USB intialization
when it's part of the "console" (HID).

BTW, I wasn't thinking of USB. I'm just trying to understand if the "fix"
is exclusively in the PCI code or will require changes to other subsystems.

> It's fine to temporarily disable memory on the northbridge, as long as
> nothign else tries to _access_ that memory at the same time.

The implemention to enforce that is what I meant with "arch specific code".

I still have no idea which bridge implementations (vendor/model)
have this problem and thus no idea what the i386 code would need
to look like. I was hoping Ivan (or someone) might know.

thanks,
grant

2003-01-07 11:44:00

by Alan

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Tue, 2003-01-07 at 00:13, Grant Grundler wrote:
> On Mon, Jan 06, 2003 at 03:19:09PM -0800, Linus Torvalds wrote:
> > In particular, we can make the first phase disable DMA on the devices we
> > find, which means that we know they won't be generating PCI traffic during
> > the second phase - so now the second phase (which does the BAR sizing) can
> > do sizing and be safe in the knowledge that there should be no random PCI
> > activity ongoing at the same time.
>
> Did you expect the PCI_COMMAND_MASTER disabled in the USB Controller
> or something else in the controller turned off?

There is another problem too. Some devices ignore the master bit disable.
VIA 8233/8235 being a fine example.

2003-01-07 17:06:36

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Mon, Jan 06, 2003 at 04:13:32PM -0800, Grant Grundler wrote:
> Turning off MASTER will also disable the controller from responding
> to MMIO ...ie USB subsystem can't touch the USB controller until
> it's re-enabled (no USB interrupts).

No, MMIO is controlled by PCI_COMMAND_MEMORY bit. We'll just turn
off DMA.

> I still have no idea which bridge implementations (vendor/model)
> have this problem and thus no idea what the i386 code would need
> to look like. I was hoping Ivan (or someone) might know.

Perhaps it will be i386-specific "PCI_FIXUP_EARLY" routine which
turns of PCI_COMMAND_MASTER for all devices with a class code of
PCI_CLASS_SERIAL_USB.

Ivan.

2003-01-07 17:00:40

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Mon, Jan 06, 2003 at 01:52:10PM -0800, Grant Grundler wrote:
> you meant pci_scan_bus() for #1?
>
> (pci_do_scan_bus() is the implementation, but I thought arch code
> is supposed to call pci_scan_bus().)

Of course. But hotplug stuff has to call pci_do_scan_bus directly.

Ivan.

2003-01-07 17:02:08

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Mon, Jan 06, 2003 at 11:45:13AM -0800, Grant Grundler wrote:
> On Mon, Jan 06, 2003 at 11:31:10AM +0100, Benjamin Herrenschmidt wrote:
> > For example, since we do not quite deal with PCI domains yet,
>
> I thought Bjorn Helgaas submitted "PCI Segment" support for ia64?
> Was something else missing from that? Or was that 2.4.x only?

I guess it was just a catch up with alpha, ppc and sparc. ;-)

> IIRC, lspci needs to be fixed to support this as well.

No, unless we change the API and start exporting the domain
number (pci_controller_num) to userspace. It would be a major pain
as it would break a lot of things, X server in the first place.
In-kernel representation of the PCI domains is a non-issue, really.

Ivan.

2003-01-07 17:22:10

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Tue, Jan 07, 2003 at 12:33:41PM +0000, Alan Cox wrote:
> > Did you expect the PCI_COMMAND_MASTER disabled in the USB Controller
> > or something else in the controller turned off?
>
> There is another problem too. Some devices ignore the master bit disable.
> VIA 8233/8235 being a fine example.

I don't think IDE DMA is active at that point. But in either case,
with the 2-pass approach we should be able to handle 100% of
problematic hardware.

Ivan.

2003-01-07 17:56:44

by Alan

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Tue, 2003-01-07 at 17:27, Ivan Kokshaysky wrote:
> I don't think IDE DMA is active at that point. But in either case,
> with the 2-pass approach we should be able to handle 100% of
> problematic hardware.

Audio side too. I don't know about the USB (which may be
active in SMM mode). I can test that tomorrow

2003-01-07 18:09:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs


On 7 Jan 2003, Alan Cox wrote:
>
> On Tue, 2003-01-07 at 00:13, Grant Grundler wrote:
> > On Mon, Jan 06, 2003 at 03:19:09PM -0800, Linus Torvalds wrote:
> > > In particular, we can make the first phase disable DMA on the devices we
> > > find, which means that we know they won't be generating PCI traffic during
> > > the second phase - so now the second phase (which does the BAR sizing) can
> > > do sizing and be safe in the knowledge that there should be no random PCI
> > > activity ongoing at the same time.
> >
> > Did you expect the PCI_COMMAND_MASTER disabled in the USB Controller
> > or something else in the controller turned off?
>
> There is another problem too. Some devices ignore the master bit disable.
> VIA 8233/8235 being a fine example.

Well, I was actually thinking of really _stopping_ the USB controller and
disable DMA that way. That's easy to do with a few trivial fixups - one
for each USB controller type (and there are only three).

Because of legacy USB handling by the SMM BIOS, USB really ends up being a
special case. There may be other special cases, of course, but the whole
point of the fixups is exactly to handle special cases.

Linus

2003-01-07 18:11:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs


On Tue, 7 Jan 2003, Ivan Kokshaysky wrote:
>
> > IIRC, lspci needs to be fixed to support this as well.
>
> No, unless we change the API and start exporting the domain
> number (pci_controller_num) to userspace.

It _is_ already exported. /sys gets this right, as far as I know (I don't
have any machines to test with, but at least it's not anything
fundamentally hard to fix in case there's something missing).

But yes, X and lspci would need to know about /sys.

Linus

2003-01-07 20:09:47

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Tue, Jan 07, 2003 at 08:05:37PM +0300, Ivan Kokshaysky wrote:
> No, unless we change the API and start exporting the domain
> number (pci_controller_num) to userspace.

BTW, please don't equate PCI controller instance number with PCI Domain.

Current parisc platforms implement one PCI address space for each SBA
(System Bus Adapter). HP ZX1 (IA64) platforms are based on parisc designs.
Each SBA can have something like 8 or 16 PCI controllers below it.

thanks,
grant

2003-01-08 14:41:52

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Tue, Jan 07, 2003 at 12:17:00PM -0800, Grant Grundler wrote:
> BTW, please don't equate PCI controller instance number with PCI Domain.

I agree, it's quite confusing. However, I don't think that the PCI spec
defines "PCI controller" or "PCI domain" terms, it's pretty much
implementation specific.
Assuming that each PCI controller can handle up to 256 bridged buses,
the unique PCI controller index and PCI bus number is all that userspace
needs to know in order to properly identify devices in the system.

> Current parisc platforms implement one PCI address space for each SBA
> (System Bus Adapter). HP ZX1 (IA64) platforms are based on parisc designs.
> Each SBA can have something like 8 or 16 PCI controllers below it.

It's somewhat similar to new alpha-ev7 systems: it can have 1 I/O
controller per CPU where each I/O controller has a 4x AGP bus and 3
PCI/PCI-X buses. However, each of these buses has its own address
space and IOMMU.

Ivan.

2003-01-08 16:50:21

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

On Tue, Jan 07, 2003 at 09:44:53AM -0800, Linus Torvalds wrote:
> Because of legacy USB handling by the SMM BIOS, USB really ends up being a
> special case. There may be other special cases, of course, but the whole
> point of the fixups is exactly to handle special cases.

Ok, the 2-pass thing is almost done, seems to work on my
alpha and i386 boxes.

Now I have in pci_read_bases():

if (dev->skip_probe)
return;

/* Disable I/O & memory decoding while we size the BARs. */
pci_read_config_word(dev, PCI_COMMAND, &cmd);
pci_write_config_word(dev, PCI_COMMAND,
cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY));

for(pos=0; pos<howmany; pos = next) {
...

The "skip_probe" (single-bit field) can be set in the phase #1.
It's intended for stuff like pmac combo I/O ASIC, which probably
shouldn't be disabled even for a short time.

If it looks ok, I'll have a complete patch tomorrow, including
updates for all architectures and hotplug drivers.

Ivan.

2003-01-08 22:27:03

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] PCI: allow alternative methods for probing the BARs

Ivan Kokshaysky wrote:
> On Tue, Jan 07, 2003 at 12:17:00PM -0800, Grant Grundler wrote:
> > BTW, please don't equate PCI controller instance number with PCI Domain.
>
> I agree, it's quite confusing. However, I don't think that the PCI spec
> defines "PCI controller" or "PCI domain" terms, it's pretty much
> implementation specific.

Oh. The definition I was using is based on which PCI devices can do
peer-to-peer transactions. ie a "PCI Domain" is defined by the
PCI MMIO address space routing.

> Assuming that each PCI controller can handle up to 256 bridged buses,
> the unique PCI controller index and PCI bus number is all that userspace
> needs to know in order to properly identify devices in the system.

yes - that makes sense for configuration space accesses.
I can see why you'd call this a "PCI Domain" as well.

The platforms I was commenting on can only generate config cycles below
a PCI "Host Bus Adapter" (aka controller).

thanks,
grant

2003-01-09 17:40:32

by Ivan Kokshaysky

[permalink] [raw]
Subject: [patch 2.5] 2-pass PCI probing, generic part

As discussed, this patch splits PCI probing into 2 phases.
1. Do the full bus enumeration, collect identification info for
all devices, call early fixup routines. At this stage we
can solve two kinds of problems:
- turn off devices generating PCI traffic, so we'll be
safe in the phase 2 (USB DMA case);
- allow alternative probing methods for devices that
cannot be safely probed by generic code (powermac I/O ASIC).
2. Sizing the BARs. Now it's possible to disable the device
being probed, which should fix ia64 case. Note that we
don't need to keep the device disabled when sizing ROM
base register, as we probe with ROM-enable bit cleared.

There is no need for changes to arch-specific code, as everything
is hidden in pci_scan_bus(). However, it's possible to use
pci_scan_bus_parented() and pci_probe_resources() directly,
because some arch-specific fixups between these two might
be useful.

Note: on powermacs, if the I/O ASIC is behind PCI-PCI bridge, the
bridge device probably should be marked as "skip_probe" as well.

Ivan.

diff -urpN 2.5.55/drivers/pci/probe.c linux/drivers/pci/probe.c
--- 2.5.55/drivers/pci/probe.c Thu Jan 9 07:04:19 2003
+++ linux/drivers/pci/probe.c Thu Jan 9 13:47:47 2003
@@ -47,8 +47,17 @@ static void pci_read_bases(struct pci_de
{
unsigned int pos, reg, next;
u32 l, sz;
+ u16 cmd;
struct resource *res;

+ if (dev->skip_probe)
+ return;
+
+ /* Disable I/O & memory decoding while we size the BARs. */
+ pci_read_config_word(dev, PCI_COMMAND, &cmd);
+ pci_write_config_word(dev, PCI_COMMAND,
+ cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY));
+
for(pos=0; pos<howmany; pos = next) {
next = pos+1;
res = &dev->resource[pos];
@@ -96,6 +105,9 @@ static void pci_read_bases(struct pci_de
#endif
}
}
+
+ pci_write_config_word(dev, PCI_COMMAND, cmd);
+
if (rom) {
dev->rom_base_reg = rom;
res = &dev->resource[PCI_ROM_RESOURCE];
@@ -342,7 +354,7 @@ static void pci_read_irq(struct pci_dev
* @dev: the device structure to fill
*
* Initialize the device structure with information about the device's
- * vendor,class,memory and IO-space addresses,IRQ lines etc.
+ * vendor, memory and IO-space addresses, IRQ lines etc.
* Called at initialisation of the PCI subsystem and by CardBus services.
* Returns 0 on success and -1 if unknown type of device (not normal, bridge
* or CardBus).
@@ -351,14 +363,9 @@ int pci_setup_device(struct pci_dev * de
{
u32 class;

- sprintf(dev->slot_name, "%02x:%02x.%d", dev->bus->number, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
- sprintf(dev->dev.name, "PCI device %04x:%04x", dev->vendor, dev->device);
INIT_LIST_HEAD(&dev->pools);

- pci_read_config_dword(dev, PCI_CLASS_REVISION, &class);
- class >>= 8; /* upper 3 bytes */
- dev->class = class;
- class >>= 8;
+ class = dev->class >> 8;

DBG("Found %02x:%02x [%04x/%04x] %06x %02x\n", dev->bus->number, dev->devfn, dev->vendor, dev->device, class, dev->hdr_type);

@@ -410,8 +417,8 @@ int pci_setup_device(struct pci_dev * de
}

/*
- * Read the config data for a PCI device, sanity-check it
- * and fill in the dev structure...
+ * Read the config data for a PCI device and fill in
+ * the dev structure...
*/
struct pci_dev * __devinit pci_scan_device(struct pci_dev *temp)
{
@@ -433,18 +440,23 @@ struct pci_dev * __devinit pci_scan_devi
dev->vendor = l & 0xffff;
dev->device = (l >> 16) & 0xffff;

+ pci_read_config_dword(dev, PCI_CLASS_REVISION, &l);
+ dev->class = l >> 8; /* upper 3 bytes */
+
/* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer)
set this higher, assuming the system even supports it. */
dev->dma_mask = 0xffffffff;
- if (pci_setup_device(dev) < 0) {
- kfree(dev);
- return NULL;
- }

pci_name_device(dev);

+ sprintf(dev->slot_name, "%02x:%02x.%d",
+ dev->bus->number, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+
+ /* Early fixups, before probing the BARs */
+ pci_fixup_device(PCI_FIXUP_EARLY, dev);
+
/* now put in global tree */
- strcpy(dev->dev.bus_id,dev->slot_name);
+ strcpy(dev->dev.bus_id, dev->slot_name);
dev->dev.dma_mask = &dev->dma_mask;

device_register(&dev->dev);
@@ -480,13 +492,12 @@ struct pci_dev * __devinit pci_scan_slot
* the per-bus list of devices and add the /proc entry.
*/
pci_insert_device (dev, bus);
-
- /* Fix up broken headers */
- pci_fixup_device(PCI_FIXUP_HEADER, dev);
}
return first_dev;
}

+/* First phase of device discovery. Build the bus tree, collect header types,
+ class codes, device and vendor IDs. Perform early fixups. */
unsigned int __devinit pci_do_scan_bus(struct pci_bus *bus)
{
unsigned int devfn, max, pass;
@@ -510,11 +521,8 @@ unsigned int __devinit pci_do_scan_bus(s
}

/*
- * After performing arch-dependent fixup of the bus, look behind
- * all PCI-to-PCI bridges on this bus.
+ * Look behind all PCI-to-PCI bridges on this bus.
*/
- DBG("Fixups for bus %02x\n", bus->number);
- pcibios_fixup_bus(bus);
for (pass=0; pass < 2; pass++)
for (ln=bus->devices.next; ln != &bus->devices; ln=ln->next) {
dev = pci_dev_b(ln);
@@ -545,6 +553,38 @@ int __devinit pci_bus_exists(const struc
return 0;
}

+/* Second phase of device discovery. Probe the BARs, fill in the rest
+ of pci_dev structure, perform device and bus fixups. */
+void __devinit pci_probe_resources(struct pci_bus *bus)
+{
+ struct list_head *ln;
+
+ if (!bus)
+ return;
+
+ for (ln = bus->devices.next; ln != &bus->devices; ln = ln->next) {
+ struct pci_dev *dev = pci_dev_b(ln);
+
+ if (pci_setup_device(dev) < 0) {
+ pci_remove_device(dev);
+ kfree(dev);
+ }
+
+ /* Fix up broken headers */
+ pci_fixup_device(PCI_FIXUP_HEADER, dev);
+ }
+
+ /*
+ * After performing arch-dependent fixup of the bus, look behind
+ * all PCI-to-PCI bridges on this bus.
+ */
+ DBG("Fixups for bus %02x\n", bus->number);
+ pcibios_fixup_bus(bus);
+
+ for (ln = bus->children.next; ln != &bus->children; ln = ln->next)
+ pci_probe_resources(pci_bus_b(ln));
+}
+
struct pci_bus * __devinit pci_alloc_primary_bus_parented(struct device *parent, int bus)
{
struct pci_bus *b;
@@ -600,4 +640,5 @@ EXPORT_SYMBOL(pci_do_scan_bus);
EXPORT_SYMBOL(pci_scan_slot);
EXPORT_SYMBOL(pci_scan_bus);
EXPORT_SYMBOL(pci_scan_bridge);
+EXPORT_SYMBOL(pci_probe_resources);
#endif
diff -urpN 2.5.55/include/linux/pci.h linux/include/linux/pci.h
--- 2.5.55/include/linux/pci.h Thu Jan 9 07:04:13 2003
+++ linux/include/linux/pci.h Thu Jan 9 13:40:32 2003
@@ -412,7 +412,8 @@ struct pci_dev {
char slot_name[8]; /* slot name */

/* These fields are used by common fixups */
- unsigned int transparent:1; /* Transparent PCI bridge */
+ unsigned int transparent:1, /* Transparent PCI bridge */
+ skip_probe:1; /* Don't probe the BARs */
};

#define pci_dev_g(n) list_entry(n, struct pci_dev, global_list)
@@ -544,11 +545,14 @@ void pcibios_fixup_pbus_ranges(struct pc

/* Generic PCI functions used internally */

+void pci_probe_resources(struct pci_bus *bus);
int pci_bus_exists(const struct list_head *list, int nr);
struct pci_bus *pci_scan_bus_parented(struct device *parent, int bus, struct pci_ops *ops, void *sysdata);
static inline struct pci_bus *pci_scan_bus(int bus, struct pci_ops *ops, void *sysdata)
{
- return pci_scan_bus_parented(NULL, bus, ops, sysdata);
+ struct pci_bus *pbus = pci_scan_bus_parented(NULL, bus, ops, sysdata);
+ pci_probe_resources(pbus);
+ return pbus;
}
struct pci_bus *pci_alloc_primary_bus_parented(struct device * parent, int bus);
static inline struct pci_bus *pci_alloc_primary_bus(int bus)
@@ -809,8 +813,11 @@ struct pci_fixup {

extern struct pci_fixup pcibios_fixups[];

-#define PCI_FIXUP_HEADER 1 /* Called immediately after reading configuration header */
-#define PCI_FIXUP_FINAL 2 /* Final phase of device fixups */
+#define PCI_FIXUP_EARLY 1 /* Called immediately after reading
+ device/vendor IDs and class code */
+#define PCI_FIXUP_HEADER 2 /* Called after reading configuration
+ header (including BARs) */
+#define PCI_FIXUP_FINAL 3 /* Final phase of device fixups */

void pci_fixup_device(int pass, struct pci_dev *dev);

2003-01-09 17:42:22

by Ivan Kokshaysky

[permalink] [raw]
Subject: [patch 2.5] 2-pass PCI probing, hotplug changes

As the hotplug drivers use low-level probing functions
(pci_do_scan_bus and pci_scan_bridge), they have to
use the phase #2 routine directly.

Ivan.

diff -urpN 2.5.55/drivers/hotplug/acpiphp_glue.c linux/drivers/hotplug/acpiphp_glue.c
--- 2.5.55/drivers/hotplug/acpiphp_glue.c Thu Jan 9 07:03:59 2003
+++ linux/drivers/hotplug/acpiphp_glue.c Thu Jan 9 15:33:29 2003
@@ -947,6 +947,7 @@ static int enable_device (struct acpiphp
pci_read_config_byte(dev, PCI_SECONDARY_BUS, &bus);
child = (struct pci_bus*) pci_add_new_bus(dev->bus, dev, bus);
pci_do_scan_bus(child);
+ pci_probe_resources(child);
}

/* associate pci_dev to our representation */
diff -urpN 2.5.55/drivers/hotplug/cpci_hotplug_pci.c linux/drivers/hotplug/cpci_hotplug_pci.c
--- 2.5.55/drivers/hotplug/cpci_hotplug_pci.c Thu Jan 9 07:04:17 2003
+++ linux/drivers/hotplug/cpci_hotplug_pci.c Thu Jan 9 15:59:12 2003
@@ -395,6 +395,7 @@ static int cpci_configure_bridge(struct
/* Scan behind bridge */
n = pci_scan_bridge(bus, dev, max, 2);
child = pci_find_bus(max + 1);
+ pci_probe_resources(child);
#ifdef CONFIG_PROC_FS
pci_proc_attach_bus(child);
#endif
diff -urpN 2.5.55/drivers/hotplug/cpqphp_pci.c linux/drivers/hotplug/cpqphp_pci.c
--- 2.5.55/drivers/hotplug/cpqphp_pci.c Thu Jan 9 07:04:25 2003
+++ linux/drivers/hotplug/cpqphp_pci.c Thu Jan 9 15:34:07 2003
@@ -284,7 +284,7 @@ int cpqhp_configure_device (struct contr
pci_read_config_byte(func->pci_dev, PCI_SECONDARY_BUS, &bus);
child = (struct pci_bus*) pci_add_new_bus(func->pci_dev->bus, (func->pci_dev), bus);
pci_do_scan_bus(child);
-
+ pci_probe_resources(child);
}

temp = func->pci_dev;
diff -urpN 2.5.55/drivers/hotplug/ibmphp_core.c linux/drivers/hotplug/ibmphp_core.c
--- 2.5.55/drivers/hotplug/ibmphp_core.c Thu Jan 9 07:04:28 2003
+++ linux/drivers/hotplug/ibmphp_core.c Thu Jan 9 15:38:25 2003
@@ -1071,6 +1071,7 @@ static int ibm_configure_device (struct
pci_read_config_byte (func->dev, PCI_SECONDARY_BUS, &bus);
child = (struct pci_bus *) pci_add_new_bus (func->dev->bus, (func->dev), bus);
pci_do_scan_bus (child);
+ pci_probe_resources (child);
}

temp = func->dev;

2003-01-09 17:45:01

by Ivan Kokshaysky

[permalink] [raw]
Subject: [patch 2.5] 2-pass PCI probing, i386 USB quirk

Untested, but it might work. Perhaps this needs to be chip specific (VIA?).

Ivan.

--- 2.5.55/arch/i386/pci/fixup.c Thu Jan 9 07:04:19 2003
+++ linux/arch/i386/pci/fixup.c Thu Jan 9 15:10:34 2003
@@ -187,6 +187,24 @@ static void __devinit pci_fixup_transpar
dev->transparent = 1;
}

+/*
+ * On certain systems with legacy USB support, the USB controller does
+ * access the system memory while we boot, so disconnecting the host
+ * bridge (along with system memory) from PCI bus during the PCI window
+ * sizing causes immediate crash.
+ * To avoid this, clear the MASTER bit before we start probing the BARs.
+ */
+static void __devinit pci_fixup_usb_dma(struct pci_dev *dev)
+{
+ u16 cmd;
+
+ if ((dev->class >> 8) == PCI_CLASS_SERIAL_USB) {
+ pci_read_config_word(dev, PCI_COMMAND, &cmd);
+ pci_write_config_word(dev, PCI_COMMAND,
+ cmd & ~PCI_COMMAND_MASTER);
+ }
+}
+
struct pci_fixup pcibios_fixups[] = {
{ PCI_FIXUP_HEADER, PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82451NX, pci_fixup_i450nx },
{ PCI_FIXUP_HEADER, PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82454GX, pci_fixup_i450gx },
@@ -205,5 +223,6 @@ struct pci_fixup pcibios_fixups[] = {
{ PCI_FIXUP_HEADER, PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_8367_0, pci_fixup_via_northbridge_bug },
{ PCI_FIXUP_HEADER, PCI_VENDOR_ID_NCR, PCI_DEVICE_ID_NCR_53C810, pci_fixup_ncr53c810 },
{ PCI_FIXUP_HEADER, PCI_VENDOR_ID_INTEL, PCI_ANY_ID, pci_fixup_transparent_bridge },
+ { PCI_FIXUP_EARLY, PCI_ANY_ID, PCI_ANY_ID, pci_fixup_usb_dma },
{ 0 }
};

2003-01-09 17:50:02

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Thu, 2003-01-09 at 18:46, Ivan Kokshaysky wrote:

> There is no need for changes to arch-specific code, as everything
> is hidden in pci_scan_bus(). However, it's possible to use
> pci_scan_bus_parented() and pci_probe_resources() directly,
> because some arch-specific fixups between these two might
> be useful.
>
> Note: on powermacs, if the I/O ASIC is behind PCI-PCI bridge, the
> bridge device probably should be marked as "skip_probe" as well.

Good.

Yes, On these, I'll skip the pci<->pci bridge in cases there is one on
the path too, this will add some nasty logic to the pmac pci code, but
that's ok as long as that crap doesn't leak out of
arch/ppc/platforms/pmac_*

Ben.

2003-01-09 18:28:41

by David Mosberger

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

>>>>> On Thu, 9 Jan 2003 20:46:26 +0300, Ivan Kokshaysky <[email protected]> said:

Ivan> As discussed, this patch splits PCI probing into 2 phases.
Ivan> 1. Do the full bus enumeration, collect identification info for
Ivan> all devices, call early fixup routines. At this stage we
Ivan> can solve two kinds of problems:
Ivan> - turn off devices generating PCI traffic, so we'll be
Ivan> safe in the phase 2 (USB DMA case);
Ivan> - allow alternative probing methods for devices that
Ivan> cannot be safely probed by generic code (powermac I/O ASIC).
Ivan> 2. Sizing the BARs. Now it's possible to disable the device
Ivan> being probed, which should fix ia64 case. Note that we
Ivan> don't need to keep the device disabled when sizing ROM
Ivan> base register, as we probe with ROM-enable bit cleared.

Sounds good (haven't tested the code yet, but it looks fine to me).

--david

2003-01-09 19:45:08

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Thu, Jan 09, 2003 at 08:46:26PM +0300, Ivan Kokshaysky wrote:
> As discussed, this patch splits PCI probing into 2 phases.

Like david, both parts of the patch look good but i haven't tested
on either parisc or ia64. I don't have time right now to work on
2.5.x issues :^(

I looked at 2.5.54 to check if everything was ok and found
another nit:
grundler <510>fgrep pci_scan_bus *.[ch]
probe.c:struct pci_bus * __devinit pci_scan_bus_parented(struct device *parent, int bus, struct pci_ops *ops, void *sysdata)
probe.c:EXPORT_SYMBOL(pci_scan_bus);

and :
grundler <514>fgrep pci_scan_bus include/linux/*h
include/linux/pci.h:struct pci_bus *pci_scan_bus_parented(struct device *parent, int bus, struct pci_ops *ops, void *sysdata);
include/linux/pci.h:static inline struct pci_bus *pci_scan_bus(int bus, struct pci_ops *ops, void *sysdata)
include/linux/pci.h: return pci_scan_bus_parented(NULL, bus, ops, sysdata);

Can the EXPORT_SYMBOL(pci_scan_bus) be removed now?


BTW, thanks to whoever introduced pci_scan_bus_parented().
It's exactly what parisc code needed (lba_pci.c and dino.c use it).

I think just go ahead with your patches and we'll fix up the arch specific
stuff to follow. I'm convinced it's the right direction.

thanks,
grant

2003-01-09 22:03:44

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Thu, Jan 09, 2003 at 06:52:15PM +0100, Benjamin Herrenschmidt wrote:
> Yes, On these, I'll skip the pci<->pci bridge in cases there is one on
> the path too, this will add some nasty logic to the pmac pci code, but
> that's ok as long as that crap doesn't leak out of
> arch/ppc/platforms/pmac_*

I don't see why it would add any extra logic to your pci code. You only
need to implement simple pmac-specific "FIXUP_EARLY" routine for I/O ASIC.
Something like this:

static void __init fixup_io_asic(struct pci_dev *asic)
{
struct pci_dev *bridge;

/* Set up asic->resource[] (using firmware info?) */
...
asic->skip_probe = 1;
/* Also, don't probe parent bridge */
bridge = asic->bus->self;
if (bridge && (bridge->class >> 8) == PCI_CLASS_BRIDGE_PCI)
bridge->skip_probe = 1;
}

And you've done with it.
Note that in most cases PCI-PCI bridges can be safely excluded from
pci_read_bases() simply because they have neither regular BARs nor
ROM BAR (even though PCI spec allows that).

Ivan.

2003-01-09 22:18:52

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Thu, 2003-01-09 at 23:16, Linus Torvalds wrote:
> On Fri, 10 Jan 2003, Ivan Kokshaysky wrote:
> >
> > Note that in most cases PCI-PCI bridges can be safely excluded from
> > pci_read_bases() simply because they have neither regular BARs nor
> > ROM BAR (even though PCI spec allows that).
>
> This might be a good approach to take regardless - don't read pci-pci
> bridge BAR (or host-bridge BAR's for that matter), simply because
>
> (a) bridges are more "interesting" than regular devices, and disabling
> part of them might be a stupid thing.
> (b) we're generally not really interested in the end result anyway

I completely agree. For example, I have already a bunch of cases where I
have to explicitely "hide" the host bridge from linux PCI layer as those
have BARs that mustn't be touched. A typical example is the 405gp
internal PCI host. It has a BAR that represents the view of system RAM
from bus mastering PCI devices. Of course, the kernel doesn't understand
that and tries to remap it away from 0 where it belongs ;)

Ben.

--
Benjamin Herrenschmidt <[email protected]>

2003-01-09 22:34:36

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Thu, Jan 09, 2003 at 11:52:31AM -0800, Grant Grundler wrote:
> Can the EXPORT_SYMBOL(pci_scan_bus) be removed now?

No, I think. Looks like it's used by ibm hotplug which can be a module.

Ivan.

2003-01-09 22:35:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part


On Fri, 10 Jan 2003, Ivan Kokshaysky wrote:
>
> Note that in most cases PCI-PCI bridges can be safely excluded from
> pci_read_bases() simply because they have neither regular BARs nor
> ROM BAR (even though PCI spec allows that).

This might be a good approach to take regardless - don't read pci-pci
bridge BAR (or host-bridge BAR's for that matter), simply because

(a) bridges are more "interesting" than regular devices, and disabling
part of them might be a stupid thing.
(b) we're generally not really interested in the end result anyway

Hmm?

Linus

2003-01-09 23:14:31

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Thu, Jan 09, 2003 at 02:16:20PM -0800, Linus Torvalds wrote:
> On Fri, 10 Jan 2003, Ivan Kokshaysky wrote:
> >
> > Note that in most cases PCI-PCI bridges can be safely excluded from
> > pci_read_bases() simply because they have neither regular BARs nor
> > ROM BAR (even though PCI spec allows that).
>
> This might be a good approach to take regardless - don't read pci-pci
> bridge BAR (or host-bridge BAR's for that matter), simply because
>
> (a) bridges are more "interesting" than regular devices, and disabling
> part of them might be a stupid thing.
> (b) we're generally not really interested in the end result anyway

PCI-PCI, PCI-ISA bridges - probably, but not host bridges. On x86 they
often have quite a few BARs, like AGP window, AGP MMIO, power management
etc., which we cannot ignore.

OTOH, with that patch we can control probing for any given class of
devices with a 4-5 lines fixups, per architecture. :-)

Ivan.

2003-01-09 23:54:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part


On Fri, 10 Jan 2003, Ivan Kokshaysky wrote:
>
> PCI-PCI, PCI-ISA bridges - probably, but not host bridges. On x86 they
> often have quite a few BARs, like AGP window, AGP MMIO, power management
> etc., which we cannot ignore.

Oh, but we _can_ ignore it.

All those things are stuff that if the kernel doesn't use them, the kernel
doesn't even need to know they are there.

Sure, if we support AGP, we need to see the aperture size etc, but then
we'd have the AGP driver just do the "pci_enable_dev()" thing to work it
out.

The only real reason to worry about BAR sizing is really to do resource
discovery in order to make sure that out bridges have sufficiently big
windows for the IO regions. Agreed?

And that should be a non-issue especially on a host bridge, since we
almost certainly don't want to reprogram the bridge windows there anyway.

So I'd like to make the _default_ be to probe the minimal possible,
_especially_ for host bridges. Then, the PCI quirks could be used to
expand on that default.

Linus

2003-01-10 01:01:57

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Thu, Jan 09, 2003 at 03:35:32PM -0800, Linus Torvalds wrote:
> The only real reason to worry about BAR sizing is really to do resource
> discovery in order to make sure that out bridges have sufficiently big
> windows for the IO regions. Agreed?

yes. And eventually to make sure regions don't overlap.

> And that should be a non-issue especially on a host bridge, since we
> almost certainly don't want to reprogram the bridge windows there anyway.

Current PARISC servers do not allocate MMIO/IO resources for all PCI devices.
Only boot devices are configured. Fortunately, default MMIO/IO address
space assigned to Host bridges seems to work - at least I've not heard
anyone complain (yet). But no one has tried PCI expansion chassis or
cards with massive (> 64MB) MMIO BARs.

Because of PCI hotplug, rumor is ia64 firmware folks want to do
the same thing in the near future.

> So I'd like to make the _default_ be to probe the minimal possible,
> _especially_ for host bridges. Then, the PCI quirks could be used to
> expand on that default.

That's reasonable.
pci arch code also has hooks to fixup host bridge resources.

thanks,
grant

2003-01-10 02:56:49

by Alan

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Thu, 2003-01-09 at 22:26, Benjamin Herrenschmidt wrote:
> I completely agree. For example, I have already a bunch of cases where I
> have to explicitely "hide" the host bridge from linux PCI layer as those
> have BARs that mustn't be touched. A typical example is the 405gp
> internal PCI host. It has a BAR that represents the view of system RAM
> from bus mastering PCI devices. Of course, the kernel doesn't understand
> that and tries to remap it away from 0 where it belongs ;)

Ditto The X-box people would like to explicitly hide a pile of apparently ghost
devices that hang the bus solid if you read them.

2003-01-10 07:51:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

[email protected] (Grant Grundler) writes:

> On Thu, Jan 09, 2003 at 03:35:32PM -0800, Linus Torvalds wrote:
> > The only real reason to worry about BAR sizing is really to do resource
> > discovery in order to make sure that out bridges have sufficiently big
> > windows for the IO regions. Agreed?
>
> yes. And eventually to make sure regions don't overlap.
>
> > And that should be a non-issue especially on a host bridge, since we
> > almost certainly don't want to reprogram the bridge windows there anyway.
>
> Current PARISC servers do not allocate MMIO/IO resources for all PCI devices.
> Only boot devices are configured. Fortunately, default MMIO/IO address
> space assigned to Host bridges seems to work - at least I've not heard
> anyone complain (yet). But no one has tried PCI expansion chassis or
> cards with massive (> 64MB) MMIO BARs.

For what it is worth these cards exist though.
Quadris cards have a 256MB bar, and dolphin cards default to having a 512MB bar.
Both are high performance I/O adapters.

> Because of PCI hotplug, rumor is ia64 firmware folks want to do
> the same thing in the near future.

If someone leaves a big enough hole for hotplug cards I guess it can work...
How you define a potential boot device, and what it saves you to not assign
it resources I don't know.

I am still recovering from putting a 256MB bar and 4GB of ram in a 4GB hole,
with minimal loss on x86, so my imagination of what can be sanely done
on a 64bit arch may be a little stunted..

Eric

2003-01-10 13:29:51

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Thu, Jan 09, 2003 at 03:35:32PM -0800, Linus Torvalds wrote:
>
> On Fri, 10 Jan 2003, Ivan Kokshaysky wrote:
> >
> > PCI-PCI, PCI-ISA bridges - probably, but not host bridges. On x86 they
> > often have quite a few BARs, like AGP window, AGP MMIO, power management
> > etc., which we cannot ignore.
>
> Oh, but we _can_ ignore it.
>
> All those things are stuff that if the kernel doesn't use them, the kernel
> doesn't even need to know they are there.

It does - even if we don't use it, this stuff occupies regions in the bus
address space. If we ignore it we'll end up with incomplete resource
tree and pci_assign_resource() won't work anymore. Especially with AGP
aperture window, which tends to be large (32-128M), we'd have very good
chances to allocate a new resource (either by hotplug drivers or just
working around BIOS allocation bugs) in the middle of this window. Ouch.

> Sure, if we support AGP, we need to see the aperture size etc, but then
> we'd have the AGP driver just do the "pci_enable_dev()" thing to work it
> out.
>
> The only real reason to worry about BAR sizing is really to do resource
> discovery in order to make sure that out bridges have sufficiently big
> windows for the IO regions. Agreed?

Yes, and as Grant already pointed out, to deal with overlapping (or just
incorrectly allocated) regions.

> And that should be a non-issue especially on a host bridge, since we
> almost certainly don't want to reprogram the bridge windows there anyway.

[Hmm, on alpha we're doing that for years. :-)]
But even if we don't reprogram the bridge, we need to know about the
ranges that it decodes as we don't want to step on them accidentally.

> So I'd like to make the _default_ be to probe the minimal possible,
> _especially_ for host bridges. Then, the PCI quirks could be used to
> expand on that default.

I agree in general. Reasonable probing policy certainly is a good thing.
"Don't probe by default" would work wonderfully for PCI and ISA bridges -
I've yet to see one with normal BARs.
I'm just afraid that with the default "don't probe northbridges" we'll end
up with a lot more quirks than with "probe by default".

In either case, all of this can be trivially implemented on the top of
that patch. I'd suggest another pci_dev bit - "force_probe", to override
default policy:

static inline int default_probing_policy(struct pci_dev *dev)
{
switch (dev->class >> 8) {
case PCI_CLASS_BRIDGE_ISA:
case PCI_CLASS_BRIDGE_EISA:
case PCI_CLASS_BRIDGE_PCI:
return 0; /* Don't probe */
case PCI_CLASS_BRIDGE_HOST:
return ???
default:
return 1; /* Do probe */
}
}

static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
{
...

if (dev->skip_probe ||
(!dev->force_probe && !default_probing_policy(dev)))
return;

/* Disable I/O & memory decoding while we size the BARs. */
pci_read_config_word(dev, PCI_COMMAND, &cmd);
...

Ivan.

2003-01-10 18:55:00

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Fri, Jan 10, 2003 at 12:56:17AM -0700, Eric W. Biederman wrote:
> For what it is worth these cards exist though.

yes.

> Quadris cards have a 256MB bar, and dolphin cards default to having a 512MB bar.
> Both are high performance I/O adapters.

I'm not familiar with "dolphin" cards.
I'm aware of "Quadrics" but I've not heard anyone try those with parisc-linux.
Quadrics cards do work on ia64 (for some definition of "work").

> If someone leaves a big enough hole for hotplug cards I guess it can work...

Or dynamically assigns windows to PCI Bus controllers as PCI devices
are brought on-line. For PCI Hotplug, the role of managing MMIO/IRQ
resources has moved to the OS since these services are needed
after the OS has taken control of the box.

> How you define a potential boot device, and what it saves you to not assign
> it resources I don't know.

You have it backwards. firmware only assigns resources to boot devices
and "console" devices. ie firmware does minimal configuration.
Why? An OS with hotplug support can do it anyway.

A "potential boot device" has firmware support which the primary boot
loader can use to load the OS or a secondary boot loader. But firmware
only needs to configure a single boot/console device that is
actually being used.


> I am still recovering from putting a 256MB bar and 4GB of ram in a 4GB hole,
> with minimal loss on x86, so my imagination of what can be sanely done
> on a 64bit arch may be a little stunted..

both ia64 and later parisc boxes from HP reserve GB's of LMMIO address space
for IO uses (LMMIO == MMIO < 4GB). AFAIK, physical memory behind that address
space gets remapped to higher "physical" addresses by the memory controller.
But making 256MB still fit in that space can still be a challenge.
One 256MB BAR isn't so bad. It's when the customer wants to have a central
server that has 2 or more such cards...64-bit BARs on 64-bit architecture
make life alot easier.

grant

2003-01-10 21:37:12

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Fri, Jan 10, 2003 at 11:00:30AM -0800, Grant Grundler wrote:
> Or dynamically assigns windows to PCI Bus controllers as PCI devices
> are brought on-line.

Eh? In general case, to make room for newly added device, you have
to shutdown the whole PCI bus starting from level 0, reassign _all_
BARs and bridge windows and then restart...
The "hotplug resource reservation" is the only viable approach, it has
been discussed numerous times.

> For PCI Hotplug, the role of managing MMIO/IRQ
> resources has moved to the OS since these services are needed
> after the OS has taken control of the box.

100% agree. :-)

> One 256MB BAR isn't so bad. It's when the customer wants to have a central
> server that has 2 or more such cards...64-bit BARs on 64-bit architecture
> make life alot easier.

I believe 1GB is a theoretical maximum for a 32-bit BAR. Everything above
that has to be 64-bit.

Ivan.

2003-01-10 23:00:09

by Grant Grundler

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Sat, Jan 11, 2003 at 12:42:39AM +0300, Ivan Kokshaysky wrote:
> Eh? In general case, to make room for newly added device, you have
> to shutdown the whole PCI bus starting from level 0, reassign _all_
> BARs and bridge windows and then restart...

That might be true for x86. I'm pretty sure it's not true for the
parisc machines I'm familiar with. One "window" register is set to
a "default" at boot time by the firmware and behaves as you describe above.
But a *second* window register in the PCI controller was intended
for dynamic MMIO assignment in case the first one was too small.
The trick is to find MMIO space in the region already being routed
by the IOC (parent of the PCI controller). My point is a second
somewhat larger MMIO region can be routed to a few additional PCI
controllers given sufficient MMIO space.

Again, I believe HP's ia64 HW has leveraged this design but I haven't
checked specs. And methods to reprogram the window registers might
require firmware to do it instead of the OS.

> The "hotplug resource reservation" is the only viable approach, it has
> been discussed numerous times.

ok - I'll to look for those when the time comes.


> I believe 1GB is a theoretical maximum for a 32-bit BAR.

2GB. But my argument is enough "bigger" BARs may not work either.

grant

2003-01-11 19:31:06

by Scott Murray

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Sat, 11 Jan 2003, Ivan Kokshaysky wrote:

> On Fri, Jan 10, 2003 at 11:00:30AM -0800, Grant Grundler wrote:
> > Or dynamically assigns windows to PCI Bus controllers as PCI devices
> > are brought on-line.
>
> Eh? In general case, to make room for newly added device, you have
> to shutdown the whole PCI bus starting from level 0, reassign _all_
> BARs and bridge windows and then restart...
> The "hotplug resource reservation" is the only viable approach, it has
> been discussed numerous times.
>
> > For PCI Hotplug, the role of managing MMIO/IRQ
> > resources has moved to the OS since these services are needed
> > after the OS has taken control of the box.
>
> 100% agree. :-)

Since the lack of resource reservation currently is keeping CompactPCI
hot insertion from working properly, I have a strong interest in getting
something in place before 2.6. I've got a completely manual (kernel
command-line parameter controlled) reservation patch[1] against 2.4 that
I could start updating, but I've always thought there must be a more
elegant way to do things than the somewhat crude fixup approach I used
in it. I'm willing to try coding up something if you or anyone else
have some ideas as to what would be an acceptable solution.

Thanks,

Scott

[1] look at the new file drivers/pci/setup-hp.c contained in:
ftp://oss.somanetworks.com/pub/linux/cpci/v2.4/linux-2.4.19-cpci-20021107.diff.gz


--
Scott Murray
SOMA Networks, Inc.
Toronto, Ontario
e-mail: [email protected]


2003-01-12 07:12:02

by Greg KH

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Sat, Jan 11, 2003 at 02:39:47PM -0500, Scott Murray wrote:
>
> Since the lack of resource reservation currently is keeping CompactPCI
> hot insertion from working properly, I have a strong interest in getting
> something in place before 2.6. I've got a completely manual (kernel
> command-line parameter controlled) reservation patch[1] against 2.4 that
> I could start updating, but I've always thought there must be a more
> elegant way to do things than the somewhat crude fixup approach I used
> in it. I'm willing to try coding up something if you or anyone else
> have some ideas as to what would be an acceptable solution.

Didn't people say that your reservation patch looked fine for 2.5?
But that was a while ago, and you never sent it... :)

thanks,

greg k-h

2003-01-13 06:20:07

by Scott Murray

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

On Sat, 11 Jan 2003, Greg KH wrote:

> On Sat, Jan 11, 2003 at 02:39:47PM -0500, Scott Murray wrote:
> >
> > Since the lack of resource reservation currently is keeping CompactPCI
> > hot insertion from working properly, I have a strong interest in getting
> > something in place before 2.6. I've got a completely manual (kernel
> > command-line parameter controlled) reservation patch[1] against 2.4 that
> > I could start updating, but I've always thought there must be a more
> > elegant way to do things than the somewhat crude fixup approach I used
> > in it. I'm willing to try coding up something if you or anyone else
> > have some ideas as to what would be an acceptable solution.
>
> Didn't people say that your reservation patch looked fine for 2.5?
> But that was a while ago, and you never sent it... :)

I went through my archived mail, and you're absolutely right, I had just
totally forgotten about that discussion. I'll try and cook up a version
of my manual resource reservation patch against current 2.5 starting
tomorrow.

Scott


--
Scott Murray
SOMA Networks, Inc.
Toronto, Ontario
e-mail: [email protected]

2003-01-14 16:34:36

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch 2.5] 2-pass PCI probing, generic part

We know for a fact that current probing code works fine on
a vast majority of systems, so _not_ disabling all types of
bridges during the probing seems to be a reasonable default
behavior (which can be overridden by PCI quirks, of course).
This should work on x86 and ia64 without any additional fixups,
ppc will need only minimal changes - skip probe for I/O ASIC
and perhaps certain host bridges, no need to care about P2P ones.

Here's updated patch vs. 2.5.58.

Ivan.

--- 2.5.58/drivers/pci/probe.c Tue Jan 14 13:38:12 2003
+++ linux/drivers/pci/probe.c Tue Jan 14 15:08:46 2003
@@ -43,12 +43,37 @@ static u32 pci_size(u32 base, unsigned l
return size-1; /* extent = size - 1 */
}

+/* By default, disable I/O & memory decoding while we size the BARs
+ for all devices except bridges. */
+static inline void set_probing_method(struct pci_dev *dev, u16 oldcmd)
+{
+ u16 cmd = oldcmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
+
+ switch (dev->probe) {
+ case PCI_PROBE_CMD_ENABLED:
+ break;
+ case PCI_PROBE_DEFAULT:
+ if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
+ break;
+ case PCI_PROBE_CMD_DISABLED:
+ pci_write_config_word(dev, PCI_COMMAND, cmd);
+ }
+}
+
static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
{
unsigned int pos, reg, next;
u32 l, sz;
+ u16 cmd;
struct resource *res;

+ if (dev->probe == PCI_PROBE_SKIP)
+ return;
+
+ pci_read_config_word(dev, PCI_COMMAND, &cmd);
+
+ set_probing_method(dev, cmd);
+
for(pos=0; pos<howmany; pos = next) {
next = pos+1;
res = &dev->resource[pos];
@@ -96,6 +121,9 @@ static void pci_read_bases(struct pci_de
#endif
}
}
+
+ pci_write_config_word(dev, PCI_COMMAND, cmd);
+
if (rom) {
dev->rom_base_reg = rom;
res = &dev->resource[PCI_ROM_RESOURCE];
@@ -338,7 +366,7 @@ static void pci_read_irq(struct pci_dev
* @dev: the device structure to fill
*
* Initialize the device structure with information about the device's
- * vendor,class,memory and IO-space addresses,IRQ lines etc.
+ * vendor, memory and IO-space addresses, IRQ lines etc.
* Called at initialisation of the PCI subsystem and by CardBus services.
* Returns 0 on success and -1 if unknown type of device (not normal, bridge
* or CardBus).
@@ -347,14 +375,9 @@ int pci_setup_device(struct pci_dev * de
{
u32 class;

- sprintf(dev->slot_name, "%02x:%02x.%d", dev->bus->number, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
- sprintf(dev->dev.name, "PCI device %04x:%04x", dev->vendor, dev->device);
INIT_LIST_HEAD(&dev->pools);

- pci_read_config_dword(dev, PCI_CLASS_REVISION, &class);
- class >>= 8; /* upper 3 bytes */
- dev->class = class;
- class >>= 8;
+ class = dev->class >> 8;

DBG("Found %02x:%02x [%04x/%04x] %06x %02x\n", dev->bus->number, dev->devfn, dev->vendor, dev->device, class, dev->hdr_type);

@@ -406,8 +429,8 @@ int pci_setup_device(struct pci_dev * de
}

/*
- * Read the config data for a PCI device, sanity-check it
- * and fill in the dev structure...
+ * Read the config data for a PCI device and fill in
+ * the dev structure...
*/
struct pci_dev * __devinit pci_scan_device(struct pci_dev *temp)
{
@@ -429,18 +452,23 @@ struct pci_dev * __devinit pci_scan_devi
dev->vendor = l & 0xffff;
dev->device = (l >> 16) & 0xffff;

+ pci_read_config_dword(dev, PCI_CLASS_REVISION, &l);
+ dev->class = l >> 8; /* upper 3 bytes */
+
/* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer)
set this higher, assuming the system even supports it. */
dev->dma_mask = 0xffffffff;
- if (pci_setup_device(dev) < 0) {
- kfree(dev);
- return NULL;
- }

pci_name_device(dev);

+ sprintf(dev->slot_name, "%02x:%02x.%d",
+ dev->bus->number, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
+
+ /* Early fixups, before probing the BARs */
+ pci_fixup_device(PCI_FIXUP_EARLY, dev);
+
/* now put in global tree */
- strcpy(dev->dev.bus_id,dev->slot_name);
+ strcpy(dev->dev.bus_id, dev->slot_name);
dev->dev.dma_mask = &dev->dma_mask;

device_register(&dev->dev);
@@ -476,13 +504,12 @@ struct pci_dev * __devinit pci_scan_slot
* the per-bus list of devices and add the /proc entry.
*/
pci_insert_device (dev, bus);
-
- /* Fix up broken headers */
- pci_fixup_device(PCI_FIXUP_HEADER, dev);
}
return first_dev;
}

+/* First phase of device discovery. Build the bus tree, collect header types,
+ class codes, device and vendor IDs. Perform early fixups. */
unsigned int __devinit pci_do_scan_bus(struct pci_bus *bus)
{
unsigned int devfn, max, pass;
@@ -506,11 +533,8 @@ unsigned int __devinit pci_do_scan_bus(s
}

/*
- * After performing arch-dependent fixup of the bus, look behind
- * all PCI-to-PCI bridges on this bus.
+ * Look behind all PCI-to-PCI bridges on this bus.
*/
- DBG("Fixups for bus %02x\n", bus->number);
- pcibios_fixup_bus(bus);
for (pass=0; pass < 2; pass++)
for (ln=bus->devices.next; ln != &bus->devices; ln=ln->next) {
dev = pci_dev_b(ln);
@@ -541,6 +565,38 @@ int __devinit pci_bus_exists(const struc
return 0;
}

+/* Second phase of device discovery. Probe the BARs, fill in the rest
+ of pci_dev structure, perform device and bus fixups. */
+void __devinit pci_probe_resources(struct pci_bus *bus)
+{
+ struct list_head *ln;
+
+ if (!bus)
+ return;
+
+ for (ln = bus->devices.next; ln != &bus->devices; ln = ln->next) {
+ struct pci_dev *dev = pci_dev_b(ln);
+
+ if (pci_setup_device(dev) < 0) {
+ pci_remove_device(dev);
+ kfree(dev);
+ }
+
+ /* Fix up broken headers */
+ pci_fixup_device(PCI_FIXUP_HEADER, dev);
+ }
+
+ /*
+ * After performing arch-dependent fixup of the bus, look behind
+ * all PCI-to-PCI bridges on this bus.
+ */
+ DBG("Fixups for bus %02x\n", bus->number);
+ pcibios_fixup_bus(bus);
+
+ for (ln = bus->children.next; ln != &bus->children; ln = ln->next)
+ pci_probe_resources(pci_bus_b(ln));
+}
+
struct pci_bus * __devinit pci_alloc_primary_bus_parented(struct device *parent, int bus)
{
struct pci_bus *b;
@@ -594,6 +650,7 @@ EXPORT_SYMBOL(pci_setup_device);
EXPORT_SYMBOL(pci_add_new_bus);
EXPORT_SYMBOL(pci_do_scan_bus);
EXPORT_SYMBOL(pci_scan_slot);
-EXPORT_SYMBOL(pci_scan_bus);
+EXPORT_SYMBOL(pci_scan_bus_parented);
EXPORT_SYMBOL(pci_scan_bridge);
+EXPORT_SYMBOL(pci_probe_resources);
#endif
--- 2.5.58/include/linux/pci.h Fri Jan 10 23:11:51 2003
+++ linux/include/linux/pci.h Tue Jan 14 14:33:56 2003
@@ -361,6 +361,12 @@ enum pci_mmap_state {

#define PCI_ANY_ID (~0)

+/* This defines methods for probing the BARs (dev->probe bits) */
+#define PCI_PROBE_DEFAULT 0
+#define PCI_PROBE_CMD_DISABLED 1
+#define PCI_PROBE_CMD_ENABLED 2
+#define PCI_PROBE_SKIP 3
+
/*
* The pci_dev structure is used to describe PCI devices.
*/
@@ -412,7 +418,9 @@ struct pci_dev {
char slot_name[8]; /* slot name */

/* These fields are used by common fixups */
- unsigned int transparent:1; /* Transparent PCI bridge */
+ unsigned int transparent:1, /* Transparent PCI bridge */
+ probe:2; /* Override default BAR probing
+ behavior */
};

#define pci_dev_g(n) list_entry(n, struct pci_dev, global_list)
@@ -544,11 +552,14 @@ void pcibios_fixup_pbus_ranges(struct pc

/* Generic PCI functions used internally */

+void pci_probe_resources(struct pci_bus *bus);
int pci_bus_exists(const struct list_head *list, int nr);
struct pci_bus *pci_scan_bus_parented(struct device *parent, int bus, struct pci_ops *ops, void *sysdata);
static inline struct pci_bus *pci_scan_bus(int bus, struct pci_ops *ops, void *sysdata)
{
- return pci_scan_bus_parented(NULL, bus, ops, sysdata);
+ struct pci_bus *pbus = pci_scan_bus_parented(NULL, bus, ops, sysdata);
+ pci_probe_resources(pbus);
+ return pbus;
}
struct pci_bus *pci_alloc_primary_bus_parented(struct device * parent, int bus);
static inline struct pci_bus *pci_alloc_primary_bus(int bus)
@@ -809,8 +820,11 @@ struct pci_fixup {

extern struct pci_fixup pcibios_fixups[];

-#define PCI_FIXUP_HEADER 1 /* Called immediately after reading configuration header */
-#define PCI_FIXUP_FINAL 2 /* Final phase of device fixups */
+#define PCI_FIXUP_EARLY 1 /* Called immediately after reading
+ device/vendor IDs and class code */
+#define PCI_FIXUP_HEADER 2 /* Called after reading configuration
+ header (including BARs) */
+#define PCI_FIXUP_FINAL 3 /* Final phase of device fixups */

void pci_fixup_device(int pass, struct pci_dev *dev);