2006-11-07 16:57:03

by Jeff Chua

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)


On 11/7/06, Aaron Durbin <[email protected]> wrote:

> Could you please post a dump of /proc/iomem for both the kernel that
> works for you and the kernel that fails to allocate the PCI resources?



1) this works ...

00000000-0009ffff : System RAM
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000cc800-000cffff : Adapter ROM
000f0000-000fffff : System ROM
00100000-df686bff : System RAM
00100000-00357d27 : Kernel code
00357d28-0042bab3 : Kernel data
df686c00-df688bff : ACPI Non-volatile Storage
df688c00-df68abff : ACPI Tables
df68ac00-dfffffff : reserved
e0000000-efffffff : 0000:00:02.0
f0000000-f3ffffff : reserved
fe700000-fe7fffff : PCI Bus #03
fe800000-fe8fffff : PCI Bus #02
fe8f0000-fe8fffff : 0000:02:00.0
fe8f0000-fe8fffff : tg3
fe900000-fe9fffff : PCI Bus #01
feabf900-feabf9ff : 0000:00:1e.2
feabfa00-feabfbff : 0000:00:1e.2
feac0000-feafffff : 0000:00:02.0
feb00000-feb7ffff : 0000:00:02.0
feb80000-febfffff : 0000:00:02.1
fed00000-fed003ff : HPET 0
fed20000-fed9ffff : reserved
fee00000-feefffff : reserved
ffa80800-ffa80bff : 0000:00:1d.7
ffa80800-ffa80bff : ehci_hcd
ffb00000-ffffffff : reserved



2) this fails ...

00000000-0009ffff : System RAM
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000cc800-000cffff : Adapter ROM
000f0000-000fffff : System ROM
00100000-df686bff : System RAM
00100000-00358927 : Kernel code
00358928-0042cab3 : Kernel data
df686c00-df688bff : ACPI Non-volatile Storage
df688c00-df68abff : ACPI Tables
df68ac00-dfffffff : reserved
e0000000-efffffff : 0000:00:02.0
f0000000-ffffffff : PCI MMCONFIG 0
fed00000-fed003ff : HPET 0



Thanks,
Jeff


2006-11-07 17:11:16

by Aaron Durbin

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)

On 11/7/06, Jeff Chua <[email protected]> wrote:
>
> On 11/7/06, Aaron Durbin <[email protected]> wrote:
>
> > Could you please post a dump of /proc/iomem for both the kernel that
> > works for you and the kernel that fails to allocate the PCI resources?
>
>
>
> 1) this works ...
>
> 00000000-0009ffff : System RAM
> 000a0000-000bffff : Video RAM area
> 000c0000-000c7fff : Video ROM
> 000cc800-000cffff : Adapter ROM
> 000f0000-000fffff : System ROM
> 00100000-df686bff : System RAM
> 00100000-00357d27 : Kernel code
> 00357d28-0042bab3 : Kernel data
> df686c00-df688bff : ACPI Non-volatile Storage
> df688c00-df68abff : ACPI Tables
> df68ac00-dfffffff : reserved
> e0000000-efffffff : 0000:00:02.0
> f0000000-f3ffffff : reserved
> fe700000-fe7fffff : PCI Bus #03
> fe800000-fe8fffff : PCI Bus #02
> fe8f0000-fe8fffff : 0000:02:00.0
> fe8f0000-fe8fffff : tg3
> fe900000-fe9fffff : PCI Bus #01
> feabf900-feabf9ff : 0000:00:1e.2
> feabfa00-feabfbff : 0000:00:1e.2
> feac0000-feafffff : 0000:00:02.0
> feb00000-feb7ffff : 0000:00:02.0
> feb80000-febfffff : 0000:00:02.1
> fed00000-fed003ff : HPET 0
> fed20000-fed9ffff : reserved
> fee00000-feefffff : reserved
> ffa80800-ffa80bff : 0000:00:1d.7
> ffa80800-ffa80bff : ehci_hcd
> ffb00000-ffffffff : reserved
>
>
>
> 2) this fails ...
>
> 00000000-0009ffff : System RAM
> 000a0000-000bffff : Video RAM area
> 000c0000-000c7fff : Video ROM
> 000cc800-000cffff : Adapter ROM
> 000f0000-000fffff : System ROM
> 00100000-df686bff : System RAM
> 00100000-00358927 : Kernel code
> 00358928-0042cab3 : Kernel data
> df686c00-df688bff : ACPI Non-volatile Storage
> df688c00-df68abff : ACPI Tables
> df68ac00-dfffffff : reserved
> e0000000-efffffff : 0000:00:02.0
> f0000000-ffffffff : PCI MMCONFIG 0
> fed00000-fed003ff : HPET 0
>

OK, Jeff, I have a patch in there that reserves the MMCONFIG space;
however, it is marked as reserved during resource insertion. For some
reason your MMCONFIG space is being reported as very large, thus
reserving the range f0000000-ffffffff. That is why your PCI devices
are bombing out on resource allocation. It looks like the MMCONFIG
region should be f0000000-f3ffffff. This range is marked as reserved
in your e820 map; however, the MMCONFIG parsing thinks it is 256MB.

This is not the right answer, but as a temporary fix you could patch
your kernel to hard-code the correct size. I am going to see if I can
get any other information from your logs and come up w/ a better
solution.

I just wanted to point you and others in the right direction.

-Aaron

2006-11-07 17:11:43

by Matthew Wilcox

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)

On Wed, Nov 08, 2006 at 12:57:03AM +0800, Jeff Chua wrote:
> 2) this fails ...
>
> e0000000-efffffff : 0000:00:02.0
> f0000000-ffffffff : PCI MMCONFIG 0
> fed00000-fed003ff : HPET 0

Heh, no kidding ...

    num_buses = pci_mmcfg_config[i].end_bus_number -
                pci_mmcfg_config[i].start_bus_number + 1;
    res->start = pci_mmcfg_config[i].base_address;
    res->end = res->start + (num_buses << 20) - 1;
    res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
    insert_resource(&iomem_resource, res);

So if we have 256 busses assigned, then we request 256MB and, well,
there's no room for anyone else. This code was added by Andi in commit
de09bddb9d6f96785be470c832b881e6d72d589f

Hopefully he'll have a good idea how to restrict it. Given your "working"
resource map, it seems like it should be limited to 16MB (and thus 16
busses).
But how to figure that out?
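
For reference, the arithmetic behind the snippet above can be checked in
isolation. This is a minimal sketch in plain C, not kernel code:
mmcfg_window_end() and buses_in_range() are hypothetical helpers invented for
illustration, and the only assumption is the 1MB of config space per bus that
"num_buses << 20" implies.

```c
#include <assert.h>
#include <stdint.h>

/* 1MB of MMCONFIG space per bus: 32 devices * 8 functions * 4KB each. */
#define MMCFG_BYTES_PER_BUS (1ull << 20)

/* Last address (inclusive) of the window implied by an MCFG entry.
 * Hypothetical helper; mirrors the res->end computation quoted above. */
static uint64_t mmcfg_window_end(uint64_t base, unsigned start_bus,
                                 unsigned end_bus)
{
    assert(end_bus >= start_bus);
    uint64_t num_buses = (uint64_t)end_bus - start_bus + 1;
    return base + num_buses * MMCFG_BYTES_PER_BUS - 1;
}

/* Number of whole buses whose config space fits in [start, end]. */
static unsigned buses_in_range(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return (unsigned)((end - start + 1) / MMCFG_BYTES_PER_BUS);
}
```

With the numbers from this thread, mmcfg_window_end(0xf0000000, 0, 255)
gives ffffffff, which matches the failing map, and by the same arithmetic
the reserved range f0000000-f3ffffff from the working map has room for 64
buses.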

2006-11-07 17:50:54

by Aaron Durbin

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)

On 11/7/06, Matthew Wilcox <[email protected]> wrote:
> On Wed, Nov 08, 2006 at 12:57:03AM +0800, Jeff Chua wrote:
> > 2) this fails ...
> >
> > e0000000-efffffff : 0000:00:02.0
> > f0000000-ffffffff : PCI MMCONFIG 0
> > fed00000-fed003ff : HPET 0
>
> Heh, no kidding ...
>
>     num_buses = pci_mmcfg_config[i].end_bus_number -
>                 pci_mmcfg_config[i].start_bus_number + 1;
>     res->start = pci_mmcfg_config[i].base_address;
>     res->end = res->start + (num_buses << 20) - 1;
>     res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
>     insert_resource(&iomem_resource, res);
>
> So if we have 256 busses assigned, then we request 256MB and, well,
> there's no room for anyone else. This code was added by Andi in commit
> de09bddb9d6f96785be470c832b881e6d72d589f
>
> Hopefully he'll have a good idea how to restrict it. Given your "working"
> resource map, it seems like it should be limited to 16MB (and thus 16
> busses).
> But how to figure that out?
>

Maybe Andi can shed some light on the reasoning for not checking e820
to see if the entire MMCONFIG region is reported as reserved in the
e820 map. I can patch up the pci_mmcfg_insert_resource to verify if
the region that is exported by ACPI is reserved in e820 and printk an
error message if it is not and skip the resource insertion.

Does that seem like a good avenue to pursue?

-Aaron
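
The check proposed above could look roughly like the following. This is a
user-space sketch under stated assumptions, not the kernel's e820 code: the
struct region type, range_fully_reserved() and sample_map are all made up
for illustration, with the sample entries copied from the "reserved" lines
of Jeff's working /proc/iomem dump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a firmware memory map entry (hypothetical type). */
enum region_type { REGION_RAM = 1, REGION_RESERVED = 2 };

struct region {
    uint64_t start;
    uint64_t end;            /* inclusive */
    enum region_type type;
};

/* Entries taken from the working /proc/iomem dump in this thread. */
static const struct region sample_map[] = {
    { 0x00100000ull, 0xdf686bffull, REGION_RAM },
    { 0xdf68ac00ull, 0xdfffffffull, REGION_RESERVED },
    { 0xf0000000ull, 0xf3ffffffull, REGION_RESERVED },
    { 0xffb00000ull, 0xffffffffull, REGION_RESERVED },
};
#define SAMPLE_MAP_LEN (sizeof(sample_map) / sizeof(sample_map[0]))

/* True if [start, end] lies entirely inside one reserved entry.
 * Simplification: adjacent reserved entries are not merged. */
static bool range_fully_reserved(const struct region *map, size_t n,
                                 uint64_t start, uint64_t end)
{
    assert(end >= start);
    for (size_t i = 0; i < n; i++) {
        if (map[i].type != REGION_RESERVED)
            continue;
        if (start >= map[i].start && end <= map[i].end)
            return true;
    }
    return false;
}
```

Against this sample map, f0000000-f3ffffff passes the check while the
256MB window f0000000-ffffffff does not, so the oversized resource would
be rejected instead of inserted.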

2006-11-07 17:56:52

by Matthew Wilcox

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)

On Tue, Nov 07, 2006 at 09:50:54AM -0800, Aaron Durbin wrote:
> Maybe Andi can shed some light on the reasoning for not checking e820
> to see if the entire MMCONFIG region is reported as reserved in the
> e820 map. I can patch up the pci_mmcfg_insert_resource to verify if
> the region that is exported by ACPI is reserved in e820 and printk an
> error message if it is not and skip the resource insertion.
>
> Does that seem like a good avenue to pursue?

Sounds much better than Eric's idea of maximum bus number currently in
use (which was also my first thought).

But rather than skipping the resource insertion, I believe you should
limit its size to the largest multiple of 1MB that will fit within the
reserved region.
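
A sketch of that clamping, assuming the end of the reserved region that
contains the MMCONFIG base is already known. mmcfg_clamp_buses() is a
hypothetical name and this is plain C for illustration, not a patch:

```c
#include <assert.h>
#include <stdint.h>

#define MB (1ull << 20)

/* Return how many buses to cover: the number claimed by the MCFG table,
 * limited to the largest multiple of 1MB that fits between base and
 * reserved_end (inclusive). Returns 0 if base is outside the region. */
static unsigned mmcfg_clamp_buses(uint64_t base, unsigned claimed_buses,
                                  uint64_t reserved_end)
{
    assert(claimed_buses >= 1 && claimed_buses <= 256);
    if (reserved_end < base)
        return 0;
    uint64_t fit = (reserved_end - base + 1) / MB;  /* whole MB only */
    return fit < claimed_buses ? (unsigned)fit : claimed_buses;
}
```

For the failing case in this thread (base f0000000, 256 buses claimed,
reserved region ending at f3ffffff) this yields 64 buses. A partial
trailing megabyte is dropped by the integer division.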

2006-11-08 07:39:44

by Andi Kleen

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)

On Tuesday 07 November 2006 18:11, Matthew Wilcox wrote:
> On Wed, Nov 08, 2006 at 12:57:03AM +0800, Jeff Chua wrote:
> > 2) this fails ...
> >
> > e0000000-efffffff : 0000:00:02.0
> > f0000000-ffffffff : PCI MMCONFIG 0
> > fed00000-fed003ff : HPET 0
>
> Heh, no kidding ...
>
>     num_buses = pci_mmcfg_config[i].end_bus_number -
>                 pci_mmcfg_config[i].start_bus_number + 1;
>     res->start = pci_mmcfg_config[i].base_address;
>     res->end = res->start + (num_buses << 20) - 1;
>     res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
>     insert_resource(&iomem_resource, res);
>
> So if we have 256 busses assigned, then we request 256MB and, well,
> there's no room for anyone else. This code was added by Andi in commit
> de09bddb9d6f96785be470c832b881e6d72d589f
>
> Hopefully he'll have a good idea how to restrict it. Given your "working"
> resource map, it seems like it should be limited to 16MB (and thus 16
> busses). But how to figure that out?


ACPI knows the number of busses.

Just need to get the information there, which is an ordering issue
(normally MCFG initialization is before this is known I think)

Len, ACPI folks, any ideas how to fix this cleanly?

-Andi

2006-11-08 12:22:37

by Matthew Wilcox

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)

On Wed, Nov 08, 2006 at 08:39:44AM +0100, Andi Kleen wrote:
> ACPI knows the number of busses.

But what if the number of busses increases later, eg by hotplugging
a card with a PCI-PCI bridge on it? Or does it know the number of
busses which can be supported by this machine's MMCONFIG region?
If so, why isn't this information reported in the MCFG table properly
instead of claiming to support 0-255?

> Just need to get the information there, which is an ordering issue
> (normally MCFG initialization is before this is known I think)
>
> Len, ACPI folks, any ideas how to fix this cleanly?
>
> -Andi

2006-11-08 15:14:01

by Andi Kleen

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)



> I can patch up the pci_mmcfg_insert_resource to verify if
> the region that is exported by ACPI is reserved in e820 and printk an
> error message if it is not and skip the resource insertion.

It probably should get its information from pci_mcfg_init()
and only reserve what is used there instead of adding duplicate
e820 checking code somewhere else.

Or perhaps only reserve when the bus is discovered?

-Andi

2006-11-08 16:05:18

by Linus Torvalds

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)



On Wed, 8 Nov 2006, Matthew Wilcox wrote:
>
> On Wed, Nov 08, 2006 at 08:39:44AM +0100, Andi Kleen wrote:
> > ACPI knows the number of busses.
>
> But what if the number of busses increases later, eg by hotplugging
> a card with a PCI-PCI bridge on it? Or does it know the number of
> busses which can be supported by this machine's MMCONFIG region?

ACPI will give the maximum number.

However, in this case, the correct thing to do (always _has_ been) is to
not use ACPI for _anything_, but just read the base and the size of the
MMCONFIG region from the hardware itself.

Anyway, I do not consider this a regression. MMCONFIG has _never_ worked
reliably. It has always been a case of "we can make it work on some
machines by making it break on others".

Linus

2006-11-08 17:38:27

by Eric W. Biederman

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)

Linus Torvalds <[email protected]> writes:

> On Wed, 8 Nov 2006, Matthew Wilcox wrote:
>>
>> On Wed, Nov 08, 2006 at 08:39:44AM +0100, Andi Kleen wrote:
>> > ACPI knows the number of busses.
>>
>> But what if the number of busses increases later, eg by hotplugging
>> a card with a PCI-PCI bridge on it? Or does it know the number of
>> busses which can be supported by this machine's MMCONFIG region?
>
> ACPI will give the maximum number.
>
> However, in this case, the correct thing to do (always _has_ been) is to
> not use ACPI for _anything_, but just read the base and the size of the
> MMCONFIG region from the hardware itself.
>
> Anyway, I do not consider this a regression. MMCONFIG has _never_ worked
> reliably. It has always been a case of "we can make it work on some
> machines by making it break on others".

The implementations I have seen, I believe have all been on bridges and
the maximum size is actually generated from the bus number below the bridge.

Eric

2006-11-08 18:52:05

by Linus Torvalds

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)



On Wed, 8 Nov 2006, Eric W. Biederman wrote:
>
> The implementations I have seen, I believe have all been on bridges and
> the maximum size is actually generated from the bus number below the
> bridge.

Hmm. It might be possible to first set up the MMCONFIG thing for the
minimum range, then read the bus numbers from the host bridge on that bus,
and then expand the mmconfig range if necessary.

Because pretty much ANYTHING is better than trusting the BIOS tables.

That said, I'd really be a _lot_ more confident about it if we were to be
able to read the values from the hardware itself some way. There's
obviously a chicken-and-egg issue on mmcfg configuration, but it's one
that the BIOS startup code also has, so I assume that there is a solution
to that somewhere.

Linus

2006-11-08 19:10:13

by Aaron Durbin

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)

On 11/8/06, Linus Torvalds <[email protected]> wrote:
>
>
> On Wed, 8 Nov 2006, Eric W. Biederman wrote:
> >
> > The implementations I have seen, I believe have all been on bridges and
> > the maximum size is actually generated from the bus number below the
> > bridge.
>
> Hmm. It might be possible to first set up the MMCONFIG thing for the
> minimum range, then read the bus numbers from the host bridge on that bus,
> and then expand the mmconfig range if necessary.
>
> Because pretty much ANYTHING is better than trusting the BIOS tables.
>
> That said, I'd really be a _lot_ more confident about it if we were to be
> able to read the values from the hardware itself some way. There's
> obviously a chicken-and-egg issue on mmcfg configuration, but it's one
> that the BIOS startup code also has, so I assume that there is a solution
> to that somewhere.
>

I agree that the original patch was stupid in relying on the MCFG table
reported in ACPI, however, like you said, without the actual knowledge of
the MCFG region being pulled out of the hardware even the e820 check is not
valid. It is close, but not entirely correct. For instance, if the MCFG
region is being reported in ACPI land as 256 buses and the e820 has a
reservation at the MCFG base address of 18MB that does not necessarily mean
the MCFG region allows for PCI config access on 18 buses. It could be that
it only allows 16 buses w/ another piece of hardware on that last 2MB.

So what is the proper scenario? One needs to know the actual upper limit of
MCFG region. Otherwise when detecting unreachable devices one could be poking
something else in the process of trying to discover these unreachable devices.
I am open to ideas and am willing to rework some of the code, but I do like
the idea of having the region being reported in the resource table.

-Aaron

2006-11-08 19:25:40

by Linus Torvalds

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)



On Wed, 8 Nov 2006, Aaron Durbin wrote:
>
> For instance, if the MCFG region is being reported in ACPI land as 256
> buses and the e820 has a reservation at the MCFG base address of 18MB
> that does not necessarily mean the MCFG region allows for PCI config
> access on 18 buses. It could be that it only allows 16 buses w/ another
> piece of hardware on that last 2MB.

Oh, I agree. You'd _hope_ that the BIOS reports that as a separate region,
and we could use that as a hint, but it's never going to be fool-proof.

It's just much much better to try to figure out what the hardware itself
thinks it is doing, rather than relying on a firmware engineer filling out
the table to match what he _thinks_ the hardware is doing (or, more
accurately, randomly scribbling values until Windows boots, at which point
it's not his problem any more, and people ship the crap).

Some misguided people used to think that we shouldn't do our own PCI
probing, but use ACPI instead. This is the same thing, except on a smaller
scale. MAYBE the scale ends up being so small that we can figure out some
reliable way without actually asking the hardware itself, but I kind of
doubt it. Especially judging by the current situation.

> So what is the proper scenario? One needs to know the actual upper limit of
> MCFG region. Otherwise when detecting unreachable devices one could be poking
> something else in the process of trying to discover these unreachable devices.
> I am open to ideas and am willing to rework some of the code, but I do like
> the idea of having the region being reported in the resource table.

Absolutely. I'd _love_ to have the region reported in the resource table.
It's just that right now it doesn't seem practical, since the downsides
are bigger than the upsides (and the upsides aren't _that_ big, since we
require the thing to be marked reserved in the e820 tables anyway, so the
resource tables do know about it, about as well as they currently can).

In the absence of a way to actually ask the hardware, we could perhaps
modify the thing so that it does request the regions in the resource
table, but _only_ if the e820 entries aren't there (ie the "config type 1
didn't even work" case).

Alternatively, we might choose to request just the known smallest region,
because that should be relatively "safer". It's better than not reporting
the regions at all, and while it's not perfect, it at least shouldn't have
huge potential downsides from getting the size totally wrong...

Hmm?

Linus

2006-11-08 19:24:54

by Eric W. Biederman

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)

Linus Torvalds <[email protected]> writes:

> On Wed, 8 Nov 2006, Eric W. Biederman wrote:
>>
>> The implementations I have seen, I believe have all been on bridges and
>> the maximum size is actually generated from the bus number below the
>> bridge.
>
> Hmm. It might be possible to first set up the MMCONFIG thing for the
> minimum range, then read the bus numbers from the host bridge on that bus,
> and then expand the mmconfig range if necessary.
>
> Because pretty much ANYTHING is better than trusting the BIOS tables.
>
> That said, I'd really be a _lot_ more confident about it if we were to be
> able to read the values from the hardware itself some way. There's
> obviously a chicken-and-egg issue on mmcfg configuration, but it's one
> that the BIOS startup code also has, so I assume that there is a solution
> to that somewhere.

cfc and cf8 still work on x86. So you can start with the old path
and then when you know mmconfig works you can upgrade.

In fact mmconfig doesn't necessarily allow access to the entire pci domain.
On AMD systems currently you will get all of the subordinate busses but
the cpus themselves will not show up in the mmconfig space.

So we should have the infrastructure to only use mmconfig for some set
of busses. If that interface is well described we can probably
bootstrap sanely, only enabling what we know exists and likewise
only reserving what we know is used.

For chipsets I know that there is quite a bit of information publicly
available. For Intel chipsets I believe those are registers they
make available in their public docs. For things like the Nvidia
chipset the knowledge should be in the publicly available LinuxBIOS
code base.

Hopefully that is enough of a pointer to get people going. I might
have enough time to write the patch but I don't have enough time to
maintain it until mmconfig becomes boring.

Eric
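
To make the two access paths concrete, here is a small sketch of how a
bus/device/function/register tuple maps to a type 1 CONFIG_ADDRESS value
(the dword written to port cf8 before reading cfc) and to an offset inside
an MMCONFIG window. The function names are hypothetical and nothing here
touches hardware; it only computes addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Value written to port 0xCF8 for a type 1 config cycle. The enable bit
 * is bit 31, and the register offset is dword-aligned and limited to the
 * first 256 bytes of config space. */
static uint32_t pci_conf1_address(unsigned bus, unsigned dev, unsigned fn,
                                  unsigned reg)
{
    assert(bus < 256 && dev < 32 && fn < 8 && reg < 256);
    return 0x80000000u | (bus << 16) | (dev << 11) | (fn << 8) |
           (reg & 0xfcu);
}

/* Address of the same register inside an MMCONFIG window starting at
 * base (assumed to describe bus 0 upward): 1MB per bus, 32KB per device,
 * 4KB per function. */
static uint64_t pci_mmcfg_address(uint64_t base, unsigned bus, unsigned dev,
                                  unsigned fn, unsigned reg)
{
    assert(bus < 256 && dev < 32 && fn < 8 && reg < 4096);
    return base + (((uint64_t)bus << 20) | (dev << 15) | (fn << 12) | reg);
}
```

For example, with the base f0000000 from this thread, register 0x10 of
0000:02:00.0 sits at f0200010 in the MMCONFIG window, which is only valid
if the window really covers bus 2. The cf8/cfc path has no such dependency
on the window size.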

2006-11-10 06:52:54

by Andi Kleen

Subject: Re: [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)


> So we should have the infrastructure to only use mmconfig for some set
> of busses. If that interface is well described we can probably
> bootstrap sanely, only enabling what we know exists and likewise
> only reserving what we know is used.

Unfortunately there is a chicken and egg problem on those few broken
systems (like some x86 Macs) where only mcfg works. Without mcfg you
won't be able to probe the bus. Ok you could trust ACPI when it says
it's there, but I'm not sure Linus would like that.

Still, only reserving when the bus is probed is probably a good idea.
In most cases we only probe a small number of busses
because ACPI tells us the number.

This basically means pci_mcfg_init() should be split up.

-Andi
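
One way to picture "reserve only when the bus is probed" is a per-bus
bookkeeping step that hands out a 1MB window the first time a bus is seen.
This is only a sketch of the idea, not a proposed split of pci_mcfg_init():
struct mmcfg_state, struct mmcfg_window and mmcfg_reserve_bus() are invented
names, and the window base is assumed to describe bus 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range covering one bus worth of config space. */
struct mmcfg_window {
    uint64_t start;
    uint64_t end;
};

/* Hypothetical bookkeeping: MMCONFIG base plus a bitmap of buses whose
 * 1MB slice has already been reserved. */
struct mmcfg_state {
    uint64_t base;
    uint8_t reserved[256 / 8];
};

/* Reserve the slice for one bus on first probe. Returns false if the
 * bus was already reserved, in which case *out is left untouched. */
static bool mmcfg_reserve_bus(struct mmcfg_state *st, unsigned bus,
                              struct mmcfg_window *out)
{
    assert(bus < 256);
    uint8_t mask = (uint8_t)(1u << (bus % 8));
    if (st->reserved[bus / 8] & mask)
        return false;
    st->reserved[bus / 8] |= mask;
    out->start = st->base + ((uint64_t)bus << 20);
    out->end = out->start + (1ull << 20) - 1;
    return true;
}
```

With base f0000000, probing bus 2 would reserve f0200000-f02fffff and
nothing else, so a table claiming buses 0-255 could no longer swallow the
whole range up to ffffffff.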