Looks like a commit that I can't find in git due to the arch merge
has broken PCI address assignment. This patch by Richard Henderson
against 2.6.23 fixes it for x86_64:
--- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c 2007-10-09 13:31:38.000000000 -0700
+++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c 2007-12-15 12:37:44.000000000 -0800
@@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
while ((gapsize >> 4) > round)
round += round;
/* Fun with two's complement */
- pci_mem_start = (gapstart + round) & -round;
+ pci_mem_start = (gapstart + round - 1) & -round;
printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
pci_mem_start, gapstart, gapsize);
Here is the original changeset, taken from the Mercurial repo. It was
merged in 2.6.14:
# HG changeset patch
# User Daniel Ritz <[email protected]>
# Date 1126304746 -0700
# Node ID 51367d6e0b839be0b425a8f67c29f625b670f126
# Parent f4852c862b04efc9f8e2c7913191f5f7d140d895
[PATCH] Update PCI IOMEM allocation start
This fixes the problem with "Averatec 6240 pcmcia_socket0: unable to
apply power", which was due to the CardBus IOMEM register region being
allocated at an address that was actually inside the RAM window that had
been reserved for video frame-buffers in an UMA setup.
The BIOS _should_ have marked that region reserved in the e820 memory
descriptor tables, but did not.
It is fixed by rounding up the default starting address of PCI memory
allocations, so that we leave a bigger gap after the final known memory
location. The amount of rounding depends on how big the unused memory
gap is that we can allocate IOMEM from.
Based on example code by Linus.
Acked-by: Greg KH <[email protected]>
Acked-by: Ivan Kokshaysky <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
committer: Linus Torvalds <[email protected]> 1126304746 -0700
--- a/arch/i386/kernel/setup.c Fri Sep 09 22:28:40 2005 +0011
+++ b/arch/i386/kernel/setup.c Fri Sep 09 22:37:26 2005 +0011
@@ -1300,7 +1300,7 @@ legacy_init_iomem_resources(struct resou
*/
static void __init register_memory(void)
{
- unsigned long gapstart, gapsize;
+ unsigned long gapstart, gapsize, round;
unsigned long long last;
int i;
@@ -1345,14 +1345,14 @@ static void __init register_memory(void)
}
/*
- * Start allocating dynamic PCI memory a bit into the gap,
- * aligned up to the nearest megabyte.
- *
- * Question: should we try to pad it up a bit (do something
- * like " + (gapsize >> 3)" in there too?). We now have the
- * technology.
+ * See how much we want to round up: start off with
+ * rounding to the next 1MB area.
*/
- pci_mem_start = (gapstart + 0xfffff) & ~0xfffff;
+ round = 0x100000;
+ while ((gapsize >> 4) > round)
+ round += round;
+ /* Fun with two's complement */
+ pci_mem_start = (gapstart + round) & -round;
printk("Allocating PCI resources starting at %08lx (gap: %08lx:%08lx)\n",
pci_mem_start, gapstart, gapsize);
--- a/arch/x86_64/kernel/e820.c Fri Sep 09 22:28:40 2005 +0011
+++ b/arch/x86_64/kernel/e820.c Fri Sep 09 22:37:26 2005 +0011
@@ -567,7 +567,7 @@ unsigned long pci_mem_start = 0xaeedbabe
*/
__init void e820_setup_gap(void)
{
- unsigned long gapstart, gapsize;
+ unsigned long gapstart, gapsize, round;
unsigned long last;
int i;
int found = 0;
@@ -604,14 +604,14 @@ __init void e820_setup_gap(void)
}
/*
- * Start allocating dynamic PCI memory a bit into the gap,
- * aligned up to the nearest megabyte.
- *
- * Question: should we try to pad it up a bit (do something
- * like " + (gapsize >> 3)" in there too?). We now have the
- * technology.
+ * See how much we want to round up: start off with
+ * rounding to the next 1MB area.
*/
- pci_mem_start = (gapstart + 0xfffff) & ~0xfffff;
+ round = 0x100000;
+ while ((gapsize >> 4) > round)
+ round += round;
+ /* Fun with two's complement */
+ pci_mem_start = (gapstart + round) & -round;
printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
pci_mem_start, gapstart, gapsize);
On Mon, 17 Dec 2007, Chuck Ebbert wrote:
>
> Looks like a commit that I can't find in git due to the arch merge
> has broken PCI address assignment. This patch by Richard Henderson
> against 2.6.23 fixes it for x86_64:
>
> --- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c 2007-10-09 13:31:38.000000000 -0700
> +++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c 2007-12-15 12:37:44.000000000 -0800
> @@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
> while ((gapsize >> 4) > round)
> round += round;
> /* Fun with two's complement */
> - pci_mem_start = (gapstart + round) & -round;
> + pci_mem_start = (gapstart + round - 1) & -round;
No, it's very much meant to be that way.
We do *not* want to have the PCI memory abut the end of memory exactly. So
it leaves a gap in between "gapstart" and the actual start of PCI memory
addressing very much on purpose.
In fact, the very commit (it's f0eca9626c6becb6fc56106b2e4287c6c784af3d in
the kernel tree) you mention actually explicitly *explains* that, although
maybe it's a bit indirect: if you start allocating PCI resources directly
after the end-of-RAM thing, you can easily end up using addresses that are
actually inside the magic stolen system RAM that is being used for UMA
video etc.
So you very much want to have a buffer in between the end-of-RAM and the
actual start of the region we try to allocate in.
So why do you want them to be close, anyway?
Linus
PS. On a different topic: if you do
git log --follow arch/x86/kernel/e820_64.c
you'd see the history past the renames in git. Or just do a "git blame -C"
which will also follow renames (and copies).
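[Editor's note: the rename-following behaviour can be seen on a toy repository; the file names below are stand-ins for the kernel paths.]

```shell
# Sketch: show that `git log --follow` sees history across a rename.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
echo 'stub' > e820.c
git add e820.c
git -c user.email=t@example -c user.name=t commit -q -m 'add e820.c'
git mv e820.c e820_64.c
git -c user.email=t@example -c user.name=t commit -q -m 'rename to e820_64.c'

# Without --follow the history stops at the rename; with it, both
# commits are visible.
git log --oneline -- e820_64.c | wc -l           # only the rename commit
git log --follow --oneline -- e820_64.c | wc -l  # both commits
```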
On 12/17/2007 07:57 PM, Linus Torvalds wrote:
>
> On Mon, 17 Dec 2007, Chuck Ebbert wrote:
>> Looks like a commit that I can't find in git due to the arch merge
>> has broken PCI address assignment. This patch by Richard Henderson
>> against 2.6.23 fixes it for x86_64:
>>
>> --- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c 2007-10-09 13:31:38.000000000 -0700
>> +++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c 2007-12-15 12:37:44.000000000 -0800
>> @@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
>> while ((gapsize >> 4) > round)
>> round += round;
>> /* Fun with two's complement */
>> - pci_mem_start = (gapstart + round) & -round;
>> + pci_mem_start = (gapstart + round - 1) & -round;
>
> No, it's very much meant to be that way.
>
> We do *not* want to have the PCI memory abut the end of memory exactly. So
> it leaves a gap in between "gapstart" and the actual start of PCI memory
> addressing very much on purpose.
>
> In fact, the very commit (it's f0eca9626c6becb6fc56106b2e4287c6c784af3d in
> the kernel tree) you mention actually explicitly *explains* that, although
> maybe it's a bit indirect: if you start allocating PCI resources directly
> after the end-of-RAM thing, you can easily end up using addresses that are
> actually inside the magic stolen system RAM that is being used for UMA
> video etc.
>
> So you very much want to have a buffer in between the end-of-RAM and the
> actual start of the region we try to allocate in.
>
> So why do you want them to be close, anyway?
>
Because otherwise some video adapters with 256MB of memory end up with their
resources allocated above 4GB, and that doesn't work very well.
https://bugzilla.redhat.com/show_bug.cgi?id=425794#c0
>
> PS. On a different topic: if you do
>
> git log --follow arch/x86/kernel/e820_64.c
>
> you'd see the history past the renames in git. Or just do a "git blame -C"
> which will also follow renames (and copies).
The history in the web interface just ends at the rename.
On Tue, 18 Dec 2007, Chuck Ebbert wrote:
> >
> > So why do you want them to be close, anyway?
>
> Because otherwise some video adapters with 256MB of memory end up with their
> resources allocated above 4GB, and that doesn't work very well.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=425794#c0
That bugzilla entry doesn't even have a dmesg output or anything like
that. I'd really like to see what the actual resource situation is.
The fact is, that patch is not safe. We very much _want_ to make the PCI
region allocator use a window that is in the *middle* of the gap, and not
close to either end of the gap, and the code literally tries to make the
default start of the PCI allocations sit about 1/16th of the actual gap
size into the gap, so that we don't hit BIOS allocations that it didn't
tell us about by mistake.
But without dmesg and lspci output to see what the actual allocations are,
there's no way to even _guess_ at whether there is a correct fix or not,
just the fix that totally misses the point of having any rounding-up at
all.
That patch might as well just do
pci_mem_start = gapstart;
and get rid of all that rounding code entirely, since the patch just
assumes that it's safe to use memory after gapstart (which is known to not
be true, and is the whole and only reason for that code in the first
place: BIOSes *invariably* get resource allocation wrong, and forget to
tell us about some resource they set up).
Now, it's entirely possible that the only reasonable end result is that we
do have to avoid rounding up that far, but I definitely want to see what
the actual resource situation is - that patch is *not* obviously correct,
and it definitely breaks the whole point of the code.
The *other* patch in the bugzilla entry seems more correct, in that yes,
we should make sure that we don't allocate resources over 4G if the
resource won't fit. That said, I think that patch is wrong too: we should
just fix pcibios_align_resource() to check that case for MEM resources (the
same way it already knows about the magic rules for IO resources).
So I'd suggest just fixing pcibios_align_resource() instead. Something
like the appended might work (and then you could perhaps quirk it to
always clear the PCI_BASE_ADDRESS_MEM_TYPE_64 thing for VGA controllers,
although I really don't think the kernel is the right place to do that,
and that would be an X server issue!).
NOTE! This patch is an independent issue of the whole "what window do we
use to allocate new resources, and how do we align it" thing.
Linus
---
arch/x86/pci/i386.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 42ba0e2..abc642b 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -70,6 +70,20 @@ pcibios_align_resource(void *data, struct resource *res,
start = (start + 0x3ff) & ~0x3ff;
res->start = start;
}
+ } else {
+ u64 max;
+ switch (res->flags & PCI_BASE_ADDRESS_MEM_MASK) {
+ case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+ max = 0xfffff;
+ break;
+ case PCI_BASE_ADDRESS_MEM_TYPE_64:
+ max = -1;
+ break;
+ default:
+ max = 0xffffffff;
+ }
+ if (res->start > max)
+ res->start = res->end;
}
}
On Tue, 18 Dec 2007, Richard Henderson wrote:
>
> I've added dmesg, /proc/iomem, and lspci -v output to that bug.
>
> Basically, we have
>
> c0000000-cfffffff : free
> ddf00000-dfefffff : PCI Bus #04
> e0000000-efffffff : pnp 00:0b
> f0000000-fedfffff : less than 256MB
Gaah.
That really is very unlucky. That 256M only goes at one point in the low
4GB, but the thing is, it fits perfectly well above it, and dammit, that
resource is explicitly a 64-bit resource for a really good reason.
However, I wonder about that
e0000000-efffffff : pnp 00:0b
thing. I actually suspect that that whole allocation is literally *meant*
for that 256MB graphics aperture, but the kernel explicitly avoids it
because it's listed in the PnP tables.
I wonder what the heck is the point of that pnp entry. Just for fun, can
you try to just disable CONFIG_PNP, and see if it all works then?
Björn Helgaas added to Cc to clarify what those pnp entries tend to mean,
and whether there is possibly some way to match up a specific pnp entry
with the PCI device that might want to use it. Because that is a nice
256MB region that really doesn't seem to make sense for anything else than
the graphics buffer - there's nothing else in your system that seems
likely (although I guess it could be for some docking port, but even then
I'd have expected one of the PCI bridges to map it!)
But apart from the question about that pnp 00:0b device, the kernel
resource allocation really does look perfectly fine, and while we could
shoe-horn it into the low 4GB in this case by just hoping that there is
nothing undocumented there (and there probably isn't), it's really
annoying considering that big graphics areas are a hell of a good reason
to use those 64-bit resources.
It's not like 256MB is even as large as they come, half-gig graphics cards
are getting to be fairly common at the high end, and X absolutely _has_ to
be able to handle a 64-bit address for those.
Also, I'm surprised it doesn't work with X already: the ChangeLog for X
says that there are "Minor fixes to the handling of 64-bit PCI BARs [..]"
in 4.6.99.18, so I'd have assumed that XFree86-4.7.0 should be able to
handle this perfectly well.
I'll add Keithp to the cc too, to see if the X issues can be clarified.
Maybe he can set us right. But maybe you just have an old X server? If so,
considering the situation, I really think the kernel has done a good job
already, and I'd be *very* nervous about making the kernel allocate new
PCI resources right after the end-of-memory thing.
I bet it would work in this case, but as mentioned, we definitely know of
cases where the BIOS did *not* document the magic memory region that was
stolen for UMA graphics, and trying to put PCI devices just after the top
of reserved memory in the e820 list causes machines to not work at all
because the address decoding will clash.
Of course, we could also make the minimum address more of a *hint*, and
make the resource allocator abut the top-of-known-memory only when it
absolutely has to, but on the other hand, in this case it really doesn't
have to, since there's just _tons_ of space for 64-bit resources. So the
correct thing really does seem to be to just use the 64-bit hw that is
there.
> That would have been an excellent comment to add to that code then,
> rather than just "rounding up to the next 1MB area", because purely
> as rounding code it is erroneous.
Patches to add comments are welcome. There are few enough people who
actually work on the PCI resource allocation code these days (I wish there
were more), and it's very rare that anybody else than me or Ivan ends up
even *looking* at it. So it's not been a big issue.
Linus
On Tue, Dec 18, 2007 at 10:21:50AM -0800, Linus Torvalds wrote:
> > https://bugzilla.redhat.com/show_bug.cgi?id=425794#c0
>
> That bugzilla entry doesn't even have a dmesg output or anything like
> that. I'd really like to see what the actual resource situation is.
I've added dmesg, /proc/iomem, and lspci -v output to that bug.
Basically, we have
c0000000-cfffffff : free
ddf00000-dfefffff : PCI Bus #04
e0000000-efffffff : pnp 00:0b
f0000000-fedfffff : less than 256MB
The annoying part is that there's no device (that I can see)
behind PCI Bus #04, so it might as well be disabled and that
entire d0000000-dfffffff area reclaimed.
> That patch might as well just do
>
> pci_mem_start = gapstart;
>
> and get rid of all that rounding code entirely, since the patch just
> assumes that it's safe to use memory after gapstart (which is known to not
> be true, and is the whole and only reason for that code in the first
> place: BIOSes *invariably* get resource allocation wrong, and forget to
> tell us about some resource they set up).
That would have been an excellent comment to add to that code then,
rather than just "rounding up to the next 1MB area", because purely
as rounding code it is erroneous.
> The *other* patch in the bugzilla entry seems more correct, in that yes,
> we should make sure that we don't allocate resources over 4G if the
> resource won't fit. That said, I think that patch is wrong too: we should
> just fix pcibios_align_resource() to check that case for MEM resources (the
> same way it already knows about the magic rules for IO resources).
I'll give that patch a try, modified a tad to still include the
force_32_bit quirk.
> So I'd suggest just fixing pcibios_align_resource() instead. Something
> like the appended might work (and then you could perhaps quirk it to
> always clear the PCI_BASE_ADDRESS_MEM_TYPE_64 thing for VGA controllers,
That won't work, because PCI_BASE_ADDRESS_MEM_TYPE_64 controls how
many bits need to be written back to the BAR. If we changed that
to PCI_BASE_ADDRESS_MEM_TYPE_32, we wouldn't clear the high 32-bits
of the BAR.
> ... and that would be an X server issue!).
Of course, fixing the X server to *handle* 64-bit BARs is the correct
solution. I've no idea how involved that is, but I have a sneaking
suspicion that it uses that damned CARD32 datatype for everything.
r~
On Tue, Dec 18, 2007 at 12:22:34PM -0800, Richard Henderson wrote:
> On Tue, Dec 18, 2007 at 10:21:50AM -0800, Linus Torvalds wrote:
> > ... and that would be an X server issue!).
>
> Of course, fixing the X server to *handle* 64-bit BARs is the correct
> solution. I've no idea how involved that is, but I have a sneaking
> suspicion that it uses that damned CARD32 datatype for everything.
Doh. Let's fix the kernel first...
Does this make any difference? (the patch is self-explanatory ;-)
Ivan.
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -208,8 +208,9 @@ pci_setup_bridge(struct pci_bus *bus)
}
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);
- /* Clear out the upper 32 bits of PREF base. */
- pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, 0);
+ /* Set up the upper 32 bits of PREF base. */
+ l = region.start >> 16 >> 16;
+ pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, l);
pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
}
On 12/18/2007 04:09 PM, Linus Torvalds wrote:
>
> I wonder what the heck is the point of that pnp entry. Just for fun, can
> you try to just disable CONFIG_PNP, and see if it all works then?
>
pnpacpi=off should work.
PnP is also trying (and failing) to reserve all physical memory.
On Wed, 19 Dec 2007, Ivan Kokshaysky wrote:
>
> Doh. Let's fix the kernel first...
>
> Does this make any difference? (the patch is self-explanatory ;-)
Heh, indeed. Good catch - that
Prefetchable memory behind bridge: 0000000000000000-000000000fffffff
on device 00:01.0 does look totally broken, and it would make more sense
if it matched what the device has (and what /proc/iomem reports).
That said, Intel bridges tend to be transparent even when they *claim*
normal decode, so I wonder whether it actually matters in this case. But
your patch looks very obviously right.
So maybe the rest of the kernel and X both already did everything right,
and it was just this stupid bridge setup thing that was broken (and
forcing the IOMEM resource to below the 4G mark just hid it).
Linus
On Tue, Dec 18, 2007 at 01:09:15PM -0800, Linus Torvalds wrote:
> However, I wonder about that
>
> e0000000-efffffff : pnp 00:0b
>
> thing. I actually suspect that that whole allocation is literally *meant*
> for that 256MB graphics aperture, but the kernel explicitly avoids it
> because it's listed in the PnP tables.
I assumed it was reserved for the pccard thing, which I don't see
listed or allocated anywhere else.
> I wonder what the heck is the point of that pnp entry. Just for fun, can
> you try to just disable CONFIG_PNP, and see if it all works then?
I'll try that, for grins.
> I'll add Keithp to the cc too, to see if the X issues can be clarified.
> Maybe he can set us right. But maybe you just have an old X server?
I've got xorg-x11-server-Xorg-1.3.0.0-36.fc8 installed. I wouldn't
have thought that was too old, since Fedora 8 just came out, but it's
not like I keep up on these things. I'll give 1.4.99 a try, as that's
what's current in Rawhide.
> Of course, we could also make the minimum address more of a *hint*, and
> make the resource allocator abut the top-of-known-memory only when it
> absolutely has to....
Another way to look at this is that the graphics BAR came in from
the BIOS allocated at c0000000, and we ignored that. Perhaps there's
a way to give weight to the BIOS settings when considering where the
PCI region is supposed to start?
On that system for which there was undeclared resources, did the BIOS
avoid that resource for the other PCI resources? I suspect it did...
r~
On Tue, 18 Dec 2007, Chuck Ebbert wrote:
> On 12/18/2007 04:09 PM, Linus Torvalds wrote:
> >
> > I wonder what the heck is the point of that pnp entry. Just for fun, can
> > you try to just disable CONFIG_PNP, and see if it all works then?
>
> pnpacpi=off should work.
>
> PnP is also trying (and failing) to reserve all physical memory.
Yeah, that really is a pretty confused-looking pnp table thing. But I have
absolutely zero idea how PnP is even supposed to work - the whole thing is
just a total hack for Windows, afaik.
The sad part is that *normally* the right thing to do about almost any
BIOS information is what we do right now: just avoid that magic address
range like the plague, because we have no clue what the heck the BIOS is
up to. But it looks like in this particular case, some of the problems
may arise exactly *because* we avoid that range.
It would be good to know what Windows does. If ACPI is found, does it
perhaps just ignore all the PnP entries these days?
Linus
On Tue, 18 Dec 2007, Richard Henderson wrote:
>
> Another way to look at this is that the graphics BAR came in from
> the BIOS allocated at c0000000, and we ignored that.
We did?
> Perhaps there's a way to give weight to the BIOS settings when
> considering where the PCI region is supposed to start?
Normally, we *always* keep the BIOS allocations, unless it explicitly
clashes with something and we have reason to believe that they cannot
work. And there's nothing it clashes with, so we definitely *should* have
kept it.
Why do you think it came pre-allocated at 0xc0000000? I'm seeing the
message
PCI: Cannot allocate resource region 1 of device 0000:01:00.0
and I can well imagine that that is it, but if it was a valid allocation,
then we really should have kept it there.
That question also brings up another issue: how come did we actually
choose address 0xc0000000 with the original patch you sent in? If we can't
find it in the parent resources, we shouldn't have accepted it even if it
had room for it!
Which brings up *another* potential fix for this thing: as mentioned,
Intel bridges often claim to be "Normal decode", but the core ones seem to
almost never actually be that, and they tend to be "Negative decode". So
what may be going on is:
- the kernel sees that BIOS allocation at 0xc0000000 (I'll take your word
for it, it doesn't actually say so without PCI debugging enabled)
- ...and notices that the PCI BAR is behind a PCI bridge that does
not claim to be able to actually bridge that resource (it's normal
decode, and the ranges it *does* decode are elsewhere!)
- so clearly that old allocation is pure crap and has to be re-done in a
range that is actually properly bridged.
but that decision bases itself on the Intel bridge not lying, and if it
turns out that the bridge at 0000:00:01.0 actually is transparent, then
the original allocation would have been ok.
That said, your bridge at 00:1e.0 is *also* transparent, and it's actually
against the PCI specs to have two transparent bridges on the same PCI bus,
so I'm a bit surprised about that. But it does bring up a new thing you
could *try*, namely this patch...
(You obviously have to replace "insert-your-device-id-here" with the proper
PCI device ID for the thing you have - your lspci output only gives the
name, not the numbers)
We seem to have a multitude of possible reasons for this insanity. It
would be interesting to hear which one(s) of the possibilities make a
difference, if any.
Linus
---
drivers/pci/quirks.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 26cc4dc..c3b52f5 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -820,6 +820,7 @@ static void __devinit quirk_transparent_bridge(struct pci_dev *dev)
{
dev->transparent = 1;
}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, insert-your-device-id-here, quirk_transparent_bridge );
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82380FB, quirk_transparent_bridge );
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_TOSHIBA, 0x605, quirk_transparent_bridge );
On Tue, Dec 18, 2007 at 04:46:09PM -0500, Chuck Ebbert wrote:
> pnpacpi=off should work.
This does result in the graphics bar being placed at e0000000,
and does result in a system lockup when X starts. So it appears
as if there's really something there.
r~
Linus Torvalds wrote:
>
> On Tue, 18 Dec 2007, Richard Henderson wrote:
>> I've added dmesg, /proc/iomem, and lspci -v output to that bug.
>>
>> Basically, we have
>>
>> c0000000-cfffffff : free
>> ddf00000-dfefffff : PCI Bus #04
>> e0000000-efffffff : pnp 00:0b
>> f0000000-fedfffff : less than 256MB
>
> Gaah.
>
> That really is very unlucky. That 256M only goes at one point in the low
> 4GB, but the thing is, it fits perfectly well above it, and dammit, that
> resource is explicitly a 64-bit resource for a really good reason.
>
> However, I wonder about that
>
> e0000000-efffffff : pnp 00:0b
>
> thing. I actually suspect that that whole allocation is literally *meant*
> for that 256MB graphics aperture, but the kernel explicitly avoids it
> because it's listed in the PnP tables.
That is probably the MMCONFIG aperture; in that case any attempt to map
the graphics BAR there will have disastrous results. (This BIOS has an
MCFG table, though it looks like this Fedora kernel has MMCONFIG
disabled, so we can't tell what it actually contains.)
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
Linus Torvalds wrote:
>
> On Tue, 18 Dec 2007, Chuck Ebbert wrote:
>
>> On 12/18/2007 04:09 PM, Linus Torvalds wrote:
>>> I wonder what the heck is the point of that pnp entry. Just for fun, can
>>> you try to just disable CONFIG_PNP, and see if it all works then?
>> pnpacpi=off should work.
>>
>> PnP is also trying (and failing) to reserve all physical memory.
>
> Yeah, that really is a pretty confused-looking pnp table thing. But I have
> absolutely zero idea how PnP is even supposed to work - the whole thing is
> just a total hack for Windows, afaik.
>
> The sad part is that *normally* the right thing to do about almost any
> BIOS information is what we do right now: just avoid that magic address
> range like the plague, because we have no clue what the heck the BIOS is
> up to. But it looks like in this particular case, some of the problems
> may arise exactly *because* we avoid that range.
>
> It would be good to know what Windows does. If ACPI is found, does it
> perhaps just ignore all the PnP entries these days?
>
> Linus
ACPI is where those PnP entries are coming from (on any modern system
anyway). They do show up in Device Manager as devices with resources
(the one that reserves all of system RAM on my machine is labeled
"System board", others like the one that reserves the MMCONFIG aperture
are "Motherboard resources" - the name is based on the PNP device ID, I
believe).
It could be that Windows is stupid enough that it will map things over
top of physical RAM if the BIOS doesn't explicitly reserve it like that.
I suspect based on some comments in Microsoft documents that Windows
uses the E820 table to figure out where the RAM is, and ACPI/PnP
information to figure out where IO mappings are, but may not really
combine those two pieces of information into one overall map like Linux
does, which would explain why it needs to reserve all physical RAM..
(As mentioned in another post, I would guess the BIOS is reserving that
memory range since it's the MMCONFIG aperture..)
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
On Tuesday 18 December 2007 02:09:15 pm Linus Torvalds wrote:
>
> On Tue, 18 Dec 2007, Richard Henderson wrote:
> >
> > I've added dmesg, /proc/iomem, and lspci -v output to that bug.
> >
> > Basically, we have
> >
> > c0000000-cfffffff : free
> > ddf00000-dfefffff : PCI Bus #04
> > e0000000-efffffff : pnp 00:0b
> > f0000000-fedfffff : less than 256MB
>
> Gaah.
>
> That really is very unlucky. That 256M only goes at one point in the low
> 4GB, but the thing is, it fits perfectly well above it, and dammit, that
> resource is explicitly a 64-bit resource for a really good reason.
>
> However, I wonder about that
>
> e0000000-efffffff : pnp 00:0b
>
> thing. I actually suspect that that whole allocation is literally *meant*
> for that 256MB graphics aperture, but the kernel explicitly avoids it
> because it's listed in the PnP tables.
>
> I wonder what the heck is the point of that pnp entry. Just for fun, can
> you try to just disable CONFIG_PNP, and see if it all works then?
00:0b must be a "motherboard" device, probably PNP0C01 or PNP0C02.
Those are catch-all devices with no real programming model associated
with them; they only describe resource usage. AFAICT, they're mostly
used to describe legacy stuff like interrupt controllers, timers, etc.
My laptop has the same range for one of its PNP0C02 devices. I'll
try to dig up a chipset spec and see what might match that range.
We used to ignore anything past the first 8 I/O port regions and 4
memory regions (PNP_MAX_PORT and PNP_MAX_MEM), but those limits have
been recently bumped a bit [1]. That will cause additional reservations
that may explain some of the issues we're seeing.
[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a7839e960675b549f06209d18283d5cee2ce9261
> Björn Helgaas added to Cc to clarify what those pnp entries tend to mean,
> and whether there is possibly some way to match up a specific pnp entry
> with the PCI device that might want to use it. Because that is a nice
> 256MB region that really doesn't seem to make sense for anything else than
> the graphics buffer - there's nothing else in your system that seems
> likely (although I guess it could be for some docking port, but even then
> I'd have expected one of the PCI bridges to map it!)
>
> But apart from the question about that pnp 00:0b device, the kernel
> resource allocation really does look perfectly fine, and while we could
> shoe-horn it into the low 4GB in this case by just hoping that there is
> nothing undocumented there (and there probably isn't), it's really
> annoying considering that big graphics areas are a hell of a good reason
> to use those 64-bit resources.
>
> It's not like 256MB is even as large as they come, half-gig graphics cards
> are getting to be fairly common at the high end, and X absolutely _has_ to
> be able to handle a 64-bit address for those.
>
> Also, I'm surprised it doesn't work with X already: the ChangeLog for X
> says that there are "Minor fixes to the handling of 64-bit PCI BARs [..]"
> in 4.6.99.18, so I'd have assumed that XFree86-4.7.0 should be able to
> handle this perfectly well.
>
> I'll add Keithp to the cc too, to see if the X issues can be clarified.
> Maybe he can set us right. But maybe you just have an old X server? If so,
> considering the situation, I really think the kernel has done a good job
> already, and I'd be *very* nervous about making the kernel allocate new
> PCI resources right after the end-of-memory thing.
>
> I bet it would work in this case, but as mentioned, we definitely know of
> cases where the BIOS did *not* document the magic memory region that was
> stolen for UMA graphics, and trying to put PCI devices just after the top
> of reserved memory in the e820 list causes machines to not work at all
> because the address decoding will clash.
>
> Of course, we could also make the minimum address more of a *hint*, and
> make the resource allocator abut the top-of-known-memory only when it
> absolutely has to, but on the other hand, in this case it really doesn't
> have to, since there's just _tons_ of space for 64-bit resources. So the
> correct thing really does seem to be to just use the 64-bit hw that is
> there.
>
> > That would have been an excellent comment to add to that code then,
> > rather than just "rounding up to the next 1MB area", because purely
> > as rounding code it is erroneous.
>
> Patches to add comments are welcome. There are few enough people who
> actually work on the PCI resource allocation code these days (I wish there
> were more), and it's very rare that anybody else than me or Ivan ends up
> even *looking* at it. So it's not been a big issue.
>
> Linus
>
Linus Torvalds wrote:
>
> On Mon, 17 Dec 2007, Chuck Ebbert wrote:
>> Looks like a commit that I can't find in git due to the arch merge
>> has broken PCI address assignment. This patch by Richard Henderson
>> against 2.6.23 fixes it for x86_64:
>>
>> --- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c 2007-10-09 13:31:38.000000000 -0700
>> +++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c 2007-12-15 12:37:44.000000000 -0800
>> @@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
>> while ((gapsize >> 4) > round)
>> round += round;
>> /* Fun with two's complement */
>> - pci_mem_start = (gapstart + round) & -round;
>> + pci_mem_start = (gapstart + round - 1) & -round;
>
> No, it's very much meant to be that way.
>
> We do *not* want to have the PCI memory abut the end of memory exactly. So
> it leaves a gap in between "gapstart" and the actual start of PCI memory
> addressing very much on purpose.
>
> In fact, the very commit (it's f0eca9626c6becb6fc56106b2e4287c6c784af3d in
> the kernel tree) you mention actually explicitly *explains* that, although
> maybe it's a bit indirect: if you start allocating PCI resources directly
> after the end-of-RAM thing, you can easily end up using addresses that are
> actually inside the magic stolen system RAM that is being used for UMA
> video etc.
>
> So you very much want to have a buffer in between the end-of-RAM and the
> actual start of the region we try to allocate in.
>
> So why do you want them to be close, anyway?
>
> Linus
>
> PS. On a different topic: if you do
>
> git log --follow arch/x86/kernel/e820_64.c
>
> you'd see the history past the renames in git. Or just do a "git blame -C"
> which will also follow renames (and copies).
That patch is from the 2.6.14 era - I don't think we even did PnP ACPI
resource reservation handling then? It could be that the BIOS was trying
to tell us that the UMA memory region is reserved using PnP ACPI
reservations, but we just ignored it.
It seems rather arbitrary in how much it leaves unused - and in this
case, likely prevents us from using the nice big open gap that the BIOS
presumably expected the graphics card to be mapped into.
I suspect this buffer space insertion is really not needed at this
point. The patch description is likely technically correct in that the
BIOS should have reserved it in E820, but (according to MS comments in a
presentation I read) Windows doesn't use E820 for anything other than
figuring out where RAM is, it uses PnP ACPI for figuring out areas it
needs to avoid. Since BIOS writers test against that behavior, there are
surely lots of systems where ignoring PnP ACPI reservations and relying
only on E820 would result in things really going blammo (like mapping
things over MMCONFIG tables for instance). So disabling it on modern
machines is really not an option. And if it's enabled, you likely
wouldn't hit the problem it tries to fix.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/
On 12/18/2007 07:11 PM, Robert Hancock wrote:
>> However, I wonder about that
>>
>> e0000000-efffffff : pnp 00:0b
>>
>> thing. I actually suspect that that whole allocation is literally
>> *meant* for that 256MB graphics aperture, but the kernel explicitly
>> avoids it because it's listed in the PnP tables.
>
> That is probably the MMCONFIG aperture, in that case any attempt to map
> the graphics BAR there will have disastrous results. (This BIOS has an
> MCFG table, though it looks like this Fedora kernel has MMCONFIG
> disabled, so we can't tell what it actually contains.)
>
You can boot with "pci=mmconf" to enable it.
On Tue, Dec 18, 2007 at 07:55:37PM -0500, Chuck Ebbert wrote:
> You can boot with "pci=mmconf" to enable it.
Heh.
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
r~
On Tue, 18 Dec 2007, Linus Torvalds wrote:
>
> That question also brings up another issue: how come did we actually
> choose address 0xc0000000 with the original patch you sent in? If we can't
> find it in the parent resources, we shouldn't have accepted it even if it
> had room for it!
That
PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
PCI: Cannot allocate resource region 1 of device 0000:01:00.0
thing is really starting to bug me.
I bet that is the real problem here, but it's not printing out enough
information about the resource to actually give us much of a clue about
what is wrong.
I suspect that it had a bridge mapping (device 0:01.0) that included the
range from 0xc0000000 to 0xcfffffff, but there was something stupid wrong
with it (eg the BIOS had allocated overlapping regions), so we disabled
it. That, in turn, then caused us to also refuse the existing 0xc0000000
mapping for the graphics card (device 01:00.0), because now there was no
valid resource for it.
But that PCI bridge resource handling happens even *before* we look at any
PnP reserved areas (because we - for really good reasons - trust the
hardware a _lot_ more than we trust any idiotic firmware tables), so I
wonder what that strange PCI bridge mapping in 00:01.0 was - it must have
been _really_ off in order to not fit in the resource tree.
Could you just make it print out what the bridge resources are when it
scans them? Something like the appended..
Linus
---
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 42ba0e2..37c4b92 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -117,11 +117,16 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list)
/* Depth-First Search on bus tree */
list_for_each_entry(bus, bus_list, node) {
if ((dev = bus->self)) {
+ printk(KERN_DEBUG "PCI: Bridge %s\n", pci_name(dev));
for (idx = PCI_BRIDGE_RESOURCES;
idx < PCI_NUM_RESOURCES; idx++) {
r = &dev->resource[idx];
if (!r->flags)
continue;
+ printk(KERN_DEBUG "PCI: Bridge resource "
+ "%08llx-%08llx (f=%lx)\n",
+ r->start, r->end, r->flags);
+
pr = pci_find_parent_resource(dev, r);
if (!r->start || !pr ||
request_resource(pr, r) < 0) {
On Tue, 2007-12-18 at 13:09 -0800, Linus Torvalds wrote:
> It's not like 256MB is even as large as they come, half-gig graphics cards
> are getting to be fairly common at the high end, and X absolutely _has_ to
> be able to handle a 64-bit address for those.
We're now using a system-dependent wrapper library 'libpciaccess' for
all of this stuff, it uses 64-bits for all PCI addresses and should make
this transparent to the X driver.
In addition, our kernel drivers are moving to support graphics cards
that have memory beyond that addressable through their aperture, so we
should be able to manage cards with even more memory, some of which is
not reachable from the CPU.
> Also, I'm surprised it doesn't work with X already: the ChangeLog for X
> says that there are "Minor fixes to the handling of 64-bit PCI BARs [..]"
> in 4.6.99.18, so I'd have assumed that XFree86-4.7.0 should be able to
> handle this perfectly well.
And that code has been replaced with an even more general library that
abstracts away all of the PCI routing issues.
> I'll add Keithp to the cc too, to see if the X issues can be clarified.
> Maybe he can set us right. But maybe you just have an old X server? If so,
> considering the situation, I really think the kernel has done a good job
> already, and I'd be *very* nervous about making the kernel allocate new
> PCI resources right after the end-of-memory thing.
Trying a libpciaccess-based X server is certainly something worth doing,
that should be 1.4 or later (thanks, git-describe).
> I bet it would work in this case, but as mentioned, we definitely know of
> cases where the BIOS did *not* document the magic memory region that was
> stolen for UMA graphics, and trying to put PCI devices just after the top
> of reserved memory in the e820 list causes machines to not work at all
> because the address decoding will clash.
There is an additional single-page BAR on 9xx chips which may end up
mapped and not documented. Our kernel driver should correctly deal with
that now though.
--
[email protected]
On Tue, 18 Dec 2007, Richard Henderson wrote:
>
> Heh.
>
> PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
> PCI: Not using MMCONFIG.
Well, that at least confirms that e0000000 is indeed the mmconfig area.
One of these days we'll trust the ACPI resource data enough that we can
use mmconfig even when it's just reserved in those PnP things, which is
apparently how BIOS writers are suggested to do it (stupidly enough, but
whatever).
Linus
On Tue, Dec 18, 2007 at 01:46:56PM -0800, Linus Torvalds wrote:
> Heh, indeed. Good catch - that
>
> Prefetchable memory behind bridge: 0000000000000000-000000000fffffff
>
> on device 00:01.0 does look totally broken, and it would make more sense
> if it matched what the device has (and what /proc/iomem reports).
Sigh. The patch was way too incomplete - I somehow missed
PCI_PREF_LIMIT_UPPER32... Corrected patch appended - I think it's
worth applying in either case since a bridge window at address 0 is
an obvious bug and this patch fixes it.
> That said, Intel bridges tend to be transparent even when they *claim*
> normal decode, so I wonder whether it actually matters in this case. But
> your patch looks very obviously right.
I don't think that transparency is the case here - I read specs for some
recent Intel chipsets and it looks like they are pretty accurate now with
a "subtractive decode" flag - over the last couple of years, at least.
There are, of course, some strange "priority decode rules", but they
can be safely ignored as far as the kernel resource management is
concerned.
> So maybe the rest of the kernel and X both already did everything right,
> and it was just this stupid bridge setup thing that was broken (and
> forcing the IOMEM resource to below the 4G mark just hid it).
I've just checked the setup-bus code and have to say that its ability to
correctly handle the 64-bit BARs is close to zero...
On the positive side, getting it right doesn't seem to be as complicated
as I thought initially - mainly because only the prefetchable memory
window of p2p bridge is 64-bit. This effectively limits >4G allocations
to prefetchable resources only.
Anyway, using PCI bus space above 4G on x86 seems to be a must these
days, and I have some spare hardware to play with. Maybe I'll be able
to get something working by mid January or so...
Ivan.
---
PCI: do respect full 64-bit address for bridge prefetch window
Prevent the prefetch window from being programmed with a bogus address
when its respective resource gets allocated above the 4G mark.
Note that we cannot yet guarantee correct resource allocations
above 4G, though it might work in some simple cases.
Signed-off-by: Ivan Kokshaysky <[email protected]>
---
drivers/pci/setup-bus.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 401e03c..643e72e 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -208,8 +208,11 @@ pci_setup_bridge(struct pci_bus *bus)
}
pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);
- /* Clear out the upper 32 bits of PREF base. */
- pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, 0);
+ /* Set up the upper 32 bits of PREF base/limit. */
+ l = region.start >> 16 >> 16;
+ pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, l);
+ l = region.end >> 16 >> 16;
+ pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, l);
pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
}
> That won't work, because PCI_BASE_ADDRESS_MEM_TYPE_64 controls how
> many bits need to be written back to the BAR. If we changed that
> to PCI_BASE_ADDRESS_MEM_TYPE_32, we wouldn't clear the high 32-bits
> of the BAR.
>
> > ... and that would be an X server issue!).
>
> Of course, fixing the X server to *handle* 64-bit BARs is the correct
> solution. I've no idea how involved that is, but I have a sneaking
> suspicion that it uses that damned CARD32 datatype for everything.
A lot more than X needs to be fixed to handle 64-bit BARs btw. There's a
whole load of places in drivers/pci/* where we just puke if we see a
value >4G being assigned.
Now, there is some hope that the new X with libpciaccess can cope with
that, and even if it is broken, it would be much easier to fix, as X in
that case is no longer trying to bypass the kernel, but instead uses
proper kernel interfaces to map device resources.
That used to be Xorg pci-rework branch, though it might have been merged
in the trunk by now.
Ben.
Another turd is pci_scan_device() which can't cope with 64-bit BARs on
32-bit platforms even when they have 64-bit resources. I'll send a fix
for that.
Ben.
On Tue, Dec 18, 2007 at 05:38:58PM -0800, Linus Torvalds wrote:
> That
>
> PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
> PCI: Cannot allocate resource region 1 of device 0000:01:00.0
>
> thing is really starting to bug me.
>
> I bet that is the real problem here, but it's not printing out enough
> information about the resource to actually give us much of a clue about
> what is wrong.
>
> I suspect that it had a bridge mapping (device 0:01.0) that included the
> range from 0xc0000000 to 0xcfffffff, but there was something stupid wrong
> with it (eg the BIOS had allocated overlapping regions), so we disabled
> it. That, in turn, then caused us to also refuse the existing 0xc0000000
> mapping for the graphics card (device 01:00.0), because now there was no
> valid resource for it.
That is exactly it. The relevant section of the debug info is
PCI: Bridge 0000:00:01.0
PCI: Bridge resource 7 00008000-00008fff (f=100)
PCI: Bridge resource 8 f7d00000-fddfffff (f=200)
PCI: Bridge resource 9 bdf00000-ddefffff (f=1201)
The bridge was assigned to a piece of the end of physical memory.
r~
On Thu, 20 Dec 2007, Richard Henderson wrote:
> On Tue, Dec 18, 2007 at 05:38:58PM -0800, Linus Torvalds wrote:
> > That
> >
> > PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
> > PCI: Cannot allocate resource region 1 of device 0000:01:00.0
> >
> > thing is really starting to bug me.
> >
> > I bet that is the real problem here, but it's not printing out enough
> > information about the resource to actually give us much of a clue about
> > what is wrong.
> >
> > I suspect that it had a bridge mapping (device 0:01.0) that included the
> > range from 0xc0000000 to 0xcfffffff, but there was something stupid wrong
> > with it (eg the BIOS had allocated overlapping regions), so we disabled
> > it. That, in turn, then caused us to also refuse the existing 0xc0000000
> > mapping for the graphics card (device 01:00.0), because now there was no
> > valid resource for it.
>
> That is exactly it. The relevant section of the debug info is
>
> PCI: Bridge 0000:00:01.0
> PCI: Bridge resource 7 00008000-00008fff (f=100)
> PCI: Bridge resource 8 f7d00000-fddfffff (f=200)
> PCI: Bridge resource 9 bdf00000-ddefffff (f=1201)
>
> The bridge was assigned to a piece of the end of physical memory.
Oh, wow. That's just really bogus. So the kernel message about
PCI: Cannot allocate resource region 9 of bridge 0000:00:01.0
was perfectly fine, and we did absolutely the right thing.
But it also says that if the graphics adaptor really had a resource mapped
at 0xc0000000 - 0xcfffffff by the BIOS, then that mapping never worked at
all, since it never had any bridge mapping it could rely on. So our
decision to unmap that one as invalid was _also_ right.
Damn. Very irritating.
You know what? I think this simple (BUT TOTALLY UNTESTED!) patch will get
your case right, and I think it is preferable to just always lowering the
"minimum" starting point.
What it does is to just take the minimum PCI address for new allocations
(which is only used for the case where we don't have an explicit starting
point for the parent bus anyway!), and just saying "we'll always align it
down to the required alignment of the allocation".
I'm not exactly 100% happy with it, but it does mean that if we need a big
area, we'll relax the suggested starting point by that amount. It's not
wonderful, but it essentially admits that the minimum for the allocations
is really just a hint, and if we need lots of space for a resource, we'll
relax the minimum point appropriately.
So in your case, it should *result* in the exact same situation that your
patch did, but at the same time, when dealing with the (more common) case
of smaller allocations, we still continue to try to avoid being too close
to the top-of-memory.
So it's not perfect, but perhaps it is a good compromise between being
careful and having to make room?
Does this work for your case?
Linus
---
drivers/pci/bus.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 9e5ea07..d48d270 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -61,7 +61,7 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
/* Ok, try it out.. */
ret = allocate_resource(r, res, size,
- r->start ? : min,
+ r->start ? : min & -align,
-1, align,
alignf, alignf_data);
if (ret == 0)
On Thu, Dec 20, 2007 at 02:24:48PM -0800, Linus Torvalds wrote:
> I'm not exactly 100% happy with it, but it does mean that if we need a big
> area, we'll relax the suggested starting point by that amount. It's not
> wonderful, but it essentially admits that the minimum for the allocations
> is really just a hint, and if we need lots of space for a resource, we'll
> relax the minimum point appropriately.
This breaks in odd cases where the amount of memory in the system
is not a nice round number. Like throwing two 128MB sticks into
a system that already has 2gb. A 512MB allocation will get placed
back at 2gb, on top of the end of ram. In order to get this kind
of thing to work, you'd have to have a hard and a soft minimum.
Even then, any random large allocation is going to ignore that
buffer that you added. It'd be better if we could still tie this
ignoring of the buffer to whether the bios placed the resource
there in the first place.
Perhaps this is one of those things that just aren't going to be
solved properly without an xserver upgrade...
r~
On Thu, 20 Dec 2007, Richard Henderson wrote:
>
> This breaks in odd cases where the amount of memory in the system
> is not a nice round number. Like throwing two 128MB sticks into
> a system that already has 2gb. A 512MB allocation will get placed
> back at 2gb, on top of the end of ram.
No, no, you misunderstand.
The kernel *always* takes known memory allocations into account. The
"minimum PCI starting allocation" value is not there to protect memory we
know about: the resource management already does that!
So if you have real memory of 2GB+128MB, and you want a 512MB allocation,
then yes, maybe the "preferred starting point" would be rounded back down
to 2GB, but the resource allocator would still take known resources into
account, and skip that address as being a conflict, and then try the next
address that suits the alignment requirements, and try to see if there's a
big enough hole at the 2.5GB mark.
So it would all work fine.
The reason we have that "min" parameter is not because of those _known_
resources, it's exactly because we have been bitten too many times by
BIOSes that lay out magic undocumented resources in memory that we simply
don't know about, because they aren't standard BAR resources, but some
other special magic stuff. Things like the special ACPI areas that the
northbridge recognizes, but aren't exposed as regular BAR's, but as just
magic registers hidden in some undocumented NB register space.
So the reason we have those PCIBIOS_MIN_IO/MEM things is not because we'd
trample on top of memory without them, it's because we might trample the
BIOS resources that it never told us about!
Quite often, that's things like stolen RAM that doesn't show up in the
e820 tables (it *should* show up as "reserved", but BIOS writers are
generally incompetent drug-addicts picked up from the streets, who just
randomly change BIOS tables until Windows boots on the machine), or the
afore-mentioned magic IO registers for some special motherboard resource.
> In order to get this kind of thing to work, you'd have to have a hard
> and a soft minimum.
We do have that "hard limit" - the resource management keeps track of all
the resources it already knows about. The "soft limit" is exactly that
PCIBIOS_MIN_MEM (which on a PC is that "pci_mem_start" variable). It's
just a hint, but it's a pretty important one, exactly because we've been
burned so many times by crap firmware and undocumented memory and MMIO
ranges.
Linus
> So in your case, it should *result* in the exact same situation that your
> patch did, but at the same time, when dealing with the (more common) case
> of smaller allocations, we still continue to try to avoid being too close
> to the top-of-memory.
>
> So it's not perfect, but perhaps it is a good compromise between being
> careful and having to make room?
>
> Does this work for your case?
I'm not totally happy with changing the generic code like that, to
possibly not enforce "min" anymore. Other archs may have very good
reasons to provide a min value here... Though at the same time, at
least on powerpc, the parent resource of the host bridge will be the
real limit, so that may not be a big issue.
Cheers,
Ben.
On Thu, 20 Dec 2007 11:46:16 +0300 Ivan Kokshaysky <[email protected]> wrote:
> PCI: do respect full 64-bit address for bridge prefetch window
>
> Prevent the prefetch window from being programmed with a bogus address
> when its respective resource gets allocated above the 4G mark.
>
> Note that we cannot yet guarantee correct resource allocations
> above 4G, though it might work in some simple cases.
>
So.. did we agree that this patch is good to go?
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -208,8 +208,11 @@ pci_setup_bridge(struct pci_bus *bus)
> }
> pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);
>
> - /* Clear out the upper 32 bits of PREF base. */
> - pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, 0);
> + /* Set up the upper 32 bits of PREF base/limit. */
> + l = region.start >> 16 >> 16;
We have the little upper_32_bits() helper for this.
> + pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, l);
> + l = region.end >> 16 >> 16;
> + pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, l);
>
> pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, bus->bridge_ctl);
> }
On Sat, 22 Dec 2007 01:12:18 -0800 Andrew Morton <[email protected]> wrote:
> On Thu, 20 Dec 2007 11:46:16 +0300 Ivan Kokshaysky <[email protected]> wrote:
>
> > PCI: do respect full 64-bit address for bridge prefetch window
> >
> > Prevent the prefetch window from being programmed with a bogus address
> > when its respective resource gets allocated above the 4G mark.
> >
> > Note that we cannot yet guarantee correct resource allocations
> > above 4G, though it might work in some simple cases.
> >
>
> So.. did we agree that this patch is good to go?
Oh, I see Greg merged a different patch.
> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -208,8 +208,11 @@ pci_setup_bridge(struct pci_bus *bus)
> > }
> > pci_write_config_dword(bridge, PCI_PREF_MEMORY_BASE, l);
> >
> > - /* Clear out the upper 32 bits of PREF base. */
> > - pci_write_config_dword(bridge, PCI_PREF_BASE_UPPER32, 0);
> > + /* Set up the upper 32 bits of PREF base/limit. */
> > + l = region.start >> 16 >> 16;
>
> We have the little upper_32_bits() helper for this.
>
Which could use this.
--- a/drivers/pci/setup-bus.c~gregkh-pci-pci-fix-bus-resource-assignment-on-32-bits-with-64b-resources-cleanup
+++ a/drivers/pci/setup-bus.c
@@ -206,10 +206,8 @@ pci_setup_bridge(struct pci_bus *bus)
if (bus->resource[2]->flags & IORESOURCE_PREFETCH) {
l = (region.start >> 16) & 0xfff0;
l |= region.end & 0xfff00000;
-#ifdef CONFIG_RESOURCES_64BIT
- bu = region.start >> 32;
- lu = region.end >> 32;
-#endif
+ bu = upper_32_bits(region.start);
+ lu = upper_32_bits(region.end);
DBG(KERN_INFO " PREFETCH window: 0x%016llx-0x%016llx\n",
(unsigned long long)region.start,
(unsigned long long)region.end);
On Mon, 17 Dec 2007 19:25:27 -0500 Chuck Ebbert <[email protected]> wrote:
> Looks like a commit that I can't find in git due to the arch merge
> has broken PCI address assignment. This patch by Richard Henderson
> against 2.6.23 fixes it for x86_64:
>
> --- linux-2.6.23.x86_64/arch/x86_64/kernel/e820.c 2007-10-09 13:31:38.000000000 -0700
> +++ linux-2.6.23.x86_64-rth/arch/x86_64/kernel/e820.c 2007-12-15 12:37:44.000000000 -0800
> @@ -718,8 +718,8 @@ __init void e820_setup_gap(void)
> while ((gapsize >> 4) > round)
> round += round;
> /* Fun with two's complement */
> - pci_mem_start = (gapstart + round) & -round;
> + pci_mem_start = (gapstart + round - 1) & -round;
>
> printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
> pci_mem_start, gapstart, gapsize);
>
>
> Here is the original changeset, taken from the Mercurial repo. It was
> merged in 2.6.14:
>
> # HG changeset patch
> # User Daniel Ritz <[email protected]>
> # Date 1126304746 -700
> # Node ID 51367d6e0b839be0b425a8f67c29f625b670f126
> # Parent f4852c862b04efc9f8e2c7913191f5f7d140d895
> [PATCH] Update PCI IOMEM allocation start
>
> This fixes the problem with "Averatec 6240 pcmcia_socket0: unable to
> apply power", which was due to the CardBus IOMEM register region being
> allocated at an address that was actually inside the RAM window that had
> been reserved for video frame-buffers in an UMA setup.
>
> The BIOS _should_ have marked that region reserved in the e820 memory
> descriptor tables, but did not.
>
> It is fixed by rounding up the default starting address of PCI memory
> allocations, so that we leave a bigger gap after the final known memory
> location. The amount of rounding depends on how big the unused memory
> gap is that we can allocate IOMEM from.
>
> Based on example code by Linus.
>
> Acked-by: Greg KH <[email protected]>
> Acked-by: Ivan Kokshaysky <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> committer: Linus Torvalds <[email protected]> 1126304746 -0700
>
>
> --- a/arch/i386/kernel/setup.c Fri Sep 09 22:28:40 2005 +0011
> +++ b/arch/i386/kernel/setup.c Fri Sep 09 22:37:26 2005 +0011
> @@ -1300,7 +1300,7 @@ legacy_init_iomem_resources(struct resou
> */
> static void __init register_memory(void)
> {
> - unsigned long gapstart, gapsize;
> + unsigned long gapstart, gapsize, round;
> unsigned long long last;
> int i;
>
> @@ -1345,14 +1345,14 @@ static void __init register_memory(void)
> }
>
> /*
> - * Start allocating dynamic PCI memory a bit into the gap,
> - * aligned up to the nearest megabyte.
> - *
> - * Question: should we try to pad it up a bit (do something
> - * like " + (gapsize >> 3)" in there too?). We now have the
> - * technology.
> + * See how much we want to round up: start off with
> + * rounding to the next 1MB area.
> */
> - pci_mem_start = (gapstart + 0xfffff) & ~0xfffff;
> + round = 0x100000;
> + while ((gapsize >> 4) > round)
> + round += round;
> + /* Fun with two's complement */
> + pci_mem_start = (gapstart + round) & -round;
>
> printk("Allocating PCI resources starting at %08lx (gap: %08lx:%08lx)\n",
> pci_mem_start, gapstart, gapsize);
> --- a/arch/x86_64/kernel/e820.c Fri Sep 09 22:28:40 2005 +0011
> +++ b/arch/x86_64/kernel/e820.c Fri Sep 09 22:37:26 2005 +0011
> @@ -567,7 +567,7 @@ unsigned long pci_mem_start = 0xaeedbabe
> */
> __init void e820_setup_gap(void)
> {
> - unsigned long gapstart, gapsize;
> + unsigned long gapstart, gapsize, round;
> unsigned long last;
> int i;
> int found = 0;
> @@ -604,14 +604,14 @@ __init void e820_setup_gap(void)
> }
>
> /*
> - * Start allocating dynamic PCI memory a bit into the gap,
> - * aligned up to the nearest megabyte.
> - *
> - * Question: should we try to pad it up a bit (do something
> - * like " + (gapsize >> 3)" in there too?). We now have the
> - * technology.
> + * See how much we want to round up: start off with
> + * rounding to the next 1MB area.
> */
> - pci_mem_start = (gapstart + 0xfffff) & ~0xfffff;
> + round = 0x100000;
> + while ((gapsize >> 4) > round)
> + round += round;
> + /* Fun with two's complement */
> + pci_mem_start = (gapstart + round) & -round;
>
> printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
> pci_mem_start, gapstart, gapsize);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/