2007-06-01 18:15:13

by Justin Piszcz

Subject: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?


Dear Customer,

Thank you for contacting Intel(R) Technical Support.

After reviewing the email history on this case, we have the following suggestions/comments:

Note before continuing: the Debian* Linux operating system is not an officially validated and tested operating system for the Intel(R) Desktop Board DG965WH (see http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2375); moreover, we confirm that "on a system that has 8 GB of system memory installed, it is not possible to use all of the installed memory due to system address space being allocated for other system critical functions." [qtd. on page 43 of the Technical Product Specification (see http://download.intel.com/design/motherbd/wh/D5600801US.pdf)]. Thus, the following suggestions are provided AS IS; we cannot guarantee that they will fix the problem:

1. Try updating the BIOS to the most current version (1687):

http://downloadcenter.intel.com/filter_results.aspx?strTypes=all&ProductID=2375&OSFullName=OS+Independent&lang=eng&strOSs=38&submit=Go%21

Note: Once the update is done, please restart the system and do the following:

Press <F9> to restore BIOS default settings. Reset any customized BIOS settings. Clear all DMI event logs, which are located in the Advanced/Event Log Configuration section of the BIOS Setup utility. Press <F10> to save the new settings and reboot the system.


2. If the problem continues, please ensure that the brand and part number of the memory modules appear in the tested memory list (see http://developer.intel.com/design/motherbd/wh/wh_mem.htm#1).


------------

Before I upgraded to 8GB, I used to upgrade my BIOS every time a new
version came out. However, once I upgraded past 1666P I noticed this
problem even with 4GB of memory. I tried to downgrade back to 1666P, but
the flash failed and corrupted the BIOS, and I had to wait 1-2 weeks for
the RMA process: 1 week to get there, 1-3 days for analysis, etc. They do
use two-day shipping on the way back, however.

Per Robert's response, this is the fix I will be using, as Intel wants me
to upgrade the BIOS and restore the defaults, which could potentially cause
another motherboard failure. I'll stick with the mem= option. I need to read
up on the E820 memory map.

Why doesn't the kernel automatically map the memory correctly and then put
a message in syslog/dmesg, such as: "Only using 7.7GB because your BIOS is
using 64MB for other purposes, re-mapping kernel into higher memory"?

Any comments?


Per Robert below:

Justin Piszcz wrote:
> That output looked nasty, attaching entries from syslog.
>
> Justin.

Here's your E820 memory map, from dmesg:

BIOS-e820: 0000000000000000 - 000000000008f000 (usable)
BIOS-e820: 000000000008f000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000cf58f000 (usable)
BIOS-e820: 00000000cf58f000 - 00000000cf59c000 (reserved)
BIOS-e820: 00000000cf59c000 - 00000000cf653000 (usable)
BIOS-e820: 00000000cf653000 - 00000000cf6a5000 (ACPI NVS)
BIOS-e820: 00000000cf6a5000 - 00000000cf6a8000 (ACPI data)
BIOS-e820: 00000000cf6a8000 - 00000000cf6ef000 (ACPI NVS)
BIOS-e820: 00000000cf6ef000 - 00000000cf6f1000 (ACPI data)
BIOS-e820: 00000000cf6f1000 - 00000000cf6f2000 (usable)
BIOS-e820: 00000000cf6f2000 - 00000000cf6ff000 (ACPI data)
BIOS-e820: 00000000cf6ff000 - 00000000cf700000 (usable)
BIOS-e820: 00000000cf700000 - 00000000d0000000 (reserved)
BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 000000022c000000 (usable)

so the usable memory ranges are:

0-572K
1MB-3317.55MB
3317.61MB-3318.32MB
3318.94MB-3318.945MB
3318.996MB-3319MB
4096MB-8896MB
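
As a cross-check, the hex boundaries in the map can be converted to MB with a few lines of Python. This is only an illustrative sketch, not anything from the kernel: the regular expression and the usable_ranges helper are made up for this example, and they assume the "BIOS-e820: start - end (type)" dmesg format shown above.

```python
import re

# e820 map copied verbatim from the dmesg output quoted above.
E820_DMESG = """\
BIOS-e820: 0000000000000000 - 000000000008f000 (usable)
BIOS-e820: 000000000008f000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000cf58f000 (usable)
BIOS-e820: 00000000cf58f000 - 00000000cf59c000 (reserved)
BIOS-e820: 00000000cf59c000 - 00000000cf653000 (usable)
BIOS-e820: 00000000cf653000 - 00000000cf6a5000 (ACPI NVS)
BIOS-e820: 00000000cf6a5000 - 00000000cf6a8000 (ACPI data)
BIOS-e820: 00000000cf6a8000 - 00000000cf6ef000 (ACPI NVS)
BIOS-e820: 00000000cf6ef000 - 00000000cf6f1000 (ACPI data)
BIOS-e820: 00000000cf6f1000 - 00000000cf6f2000 (usable)
BIOS-e820: 00000000cf6f2000 - 00000000cf6ff000 (ACPI data)
BIOS-e820: 00000000cf6ff000 - 00000000cf700000 (usable)
BIOS-e820: 00000000cf700000 - 00000000d0000000 (reserved)
BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 000000022c000000 (usable)
"""

MB = 1 << 20

# Hypothetical pattern for the dmesg line format shown above.
LINE_RE = re.compile(r"BIOS-e820:\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)\s+\((.+)\)")


def usable_ranges(dmesg_text):
    """Return (start, end) byte pairs for every e820 entry marked usable."""
    ranges = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group(3) == "usable":
            ranges.append((int(match.group(1), 16), int(match.group(2), 16)))
    return ranges


# Print each usable range with both boundaries expressed in MB.
for start, end in usable_ranges(E820_DMESG):
    print(f"{start / MB:.3f}MB-{end / MB:.3f}MB")
```

The last boundary, 0x22c000000, is the figure that matters for the rest of the discussion: it is the top of RAM that the MTRRs need to cover.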

and the MTRRs (from /proc/mtrr, from private email):

reg00: base=0x00000000 ( 0MB), size=2048MB: write-back, count=1
reg01: base=0x80000000 (2048MB), size=1024MB: write-back, count=1
reg02: base=0xc0000000 (3072MB), size= 256MB: write-back, count=1
reg03: base=0xcf800000 (3320MB), size= 8MB: uncachable, count=1
reg04: base=0xcf700000 (3319MB), size= 1MB: uncachable, count=1
reg05: base=0x100000000 (4096MB), size=4096MB: write-back, count=1
reg06: base=0x200000000 (8192MB), size= 512MB: write-back, count=1
reg07: base=0x220000000 (8704MB), size= 128MB: write-back, count=1

so the ranges mapped as cacheable are:

0-3319MB
4096-8832MB

leaving 64MB of memory at the top of RAM uncached. What do you want to
bet that something important (kernel code?) is getting loaded there..

So essentially it's a BIOS problem, it's not setting up the MTRRs
properly in order to map all of RAM as cacheable. As Andi says, complain
to Intel.
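
The same conclusion can be reproduced mechanically. The following is a rough sketch, assuming the bases and sizes from the /proc/mtrr listing above and the rule that an uncachable MTRR takes precedence wherever it overlaps a write-back one. The cacheable_ranges helper is hypothetical and is not a kernel function.

```python
# (base_mb, size_mb, type) copied from the /proc/mtrr listing above.
MTRRS = [
    (0, 2048, "write-back"),
    (2048, 1024, "write-back"),
    (3072, 256, "write-back"),
    (3320, 8, "uncachable"),
    (3319, 1, "uncachable"),
    (4096, 4096, "write-back"),
    (8192, 512, "write-back"),
    (8704, 128, "write-back"),
]

# End of the last usable e820 range (0x22c000000), in MB.
TOP_OF_RAM_MB = 8896


def cacheable_ranges(mtrrs):
    """Merge write-back ranges, then cut out uncachable ones (UC wins)."""
    wb = sorted((base, base + size) for base, size, t in mtrrs
                if t == "write-back")
    merged = []
    for start, end in wb:
        if merged and start <= merged[-1][1]:
            # Adjacent or overlapping: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    for base, size, t in mtrrs:
        if t != "uncachable":
            continue
        uc_start, uc_end = base, base + size
        cut = []
        for start, end in merged:
            if uc_end <= start or uc_start >= end:
                cut.append((start, end))  # no overlap, keep as is
                continue
            if start < uc_start:
                cut.append((start, uc_start))
            if uc_end < end:
                cut.append((uc_end, end))
        merged = cut
    return merged


ranges = cacheable_ranges(MTRRS)
for start, end in ranges:
    print(f"{start}-{end}MB")
print(f"uncached at top of RAM: {TOP_OF_RAM_MB - ranges[-1][1]}MB")
```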

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/


2007-06-01 19:10:28

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

> and the MTRRs (from /proc/mtrr, from private email):
>
> reg00: base=0x00000000 ( 0MB), size=2048MB: write-back, count=1
> reg01: base=0x80000000 (2048MB), size=1024MB: write-back, count=1
> reg02: base=0xc0000000 (3072MB), size= 256MB: write-back, count=1
> reg03: base=0xcf800000 (3320MB), size= 8MB: uncachable, count=1
> reg04: base=0xcf700000 (3319MB), size= 1MB: uncachable, count=1
> reg05: base=0x100000000 (4096MB), size=4096MB: write-back, count=1
> reg06: base=0x200000000 (8192MB), size= 512MB: write-back, count=1
> reg07: base=0x220000000 (8704MB), size= 128MB: write-back, count=1
>
> so the ranges mapped as cacheable are:
>
> 0-3319MB
> 4096-8832MB
>
> leaving 64MB of memory at the top of RAM uncached. What do you want to
> bet that something important (kernel code?) is getting loaded there..
>
> So essentially it's a BIOS problem, it's not setting up the MTRRs
> properly in order to map all of RAM as cacheable. As Andi says, complain
> to Intel.

If it's just 64M you'll end up losing, you could try the "[RFC] trim memory
not covered by MTRR WB type" patch I posted yesterday. It won't reinit the
MTRRs (maybe we should) but it will at least prevent your system from
crawling if the BIOS doesn't set them up right. That would at least let you
use most of your memory until the BIOS guys acknowledge that they have a
problem (or we get proper PAT support, which I think would make this problem
go away as well).

Jesse

2007-06-01 19:18:13

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?



On Fri, 1 Jun 2007, Jesse Barnes wrote:

>> and the MTRRs (from /proc/mtrr, from private email):
>>
>> reg00: base=0x00000000 ( 0MB), size=2048MB: write-back, count=1
>> reg01: base=0x80000000 (2048MB), size=1024MB: write-back, count=1
>> reg02: base=0xc0000000 (3072MB), size= 256MB: write-back, count=1
>> reg03: base=0xcf800000 (3320MB), size= 8MB: uncachable, count=1
>> reg04: base=0xcf700000 (3319MB), size= 1MB: uncachable, count=1
>> reg05: base=0x100000000 (4096MB), size=4096MB: write-back, count=1
>> reg06: base=0x200000000 (8192MB), size= 512MB: write-back, count=1
>> reg07: base=0x220000000 (8704MB), size= 128MB: write-back, count=1
>>
>> so the ranges mapped as cacheable are:
>>
>> 0-3319MB
>> 4096-8832MB
>>
>> leaving 64MB of memory at the top of RAM uncached. What do you want to
>> bet that something important (kernel code?) is getting loaded there..
>>
>> So essentially it's a BIOS problem, it's not setting up the MTRRs
>> properly in order to map all of RAM as cacheable. As Andi says, complain
>> to Intel.
>
> If it's just 64M you'll end up losing, you could try the "[RFC] trim memory
> not covered by MTRR WB type" patch I posted yesterday. It won't reinit the
> MTRRs (maybe we should) but it will at least prevent your system from
> crawling if the BIOS doesn't set them up right. That would at least let you
> use most of your memory until the BIOS guys acknowledge that they have a
> problem (or we get proper PAT support, which I think would make this problem
> go away as well).
>
> Jesse
>

Copying and pasting from here:

http://permalink.gmane.org/gmane.linux.kernel/537020

# patch -p1 < ../mtrr.patch
patch: **** Only garbage was found in the patch input.

Will try to find another link.

2007-06-01 19:19:22

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

> It works ok on
> my machine (correctly detects the condition, adjusts end_pfn, and keeps
> the machine fast), aside from the fact that X won't start.

But X won't start? :\

On Fri, 1 Jun 2007, Jesse Barnes wrote:

>> and the MTRRs (from /proc/mtrr, from private email):
>>
>> reg00: base=0x00000000 ( 0MB), size=2048MB: write-back, count=1
>> reg01: base=0x80000000 (2048MB), size=1024MB: write-back, count=1
>> reg02: base=0xc0000000 (3072MB), size= 256MB: write-back, count=1
>> reg03: base=0xcf800000 (3320MB), size= 8MB: uncachable, count=1
>> reg04: base=0xcf700000 (3319MB), size= 1MB: uncachable, count=1
>> reg05: base=0x100000000 (4096MB), size=4096MB: write-back, count=1
>> reg06: base=0x200000000 (8192MB), size= 512MB: write-back, count=1
>> reg07: base=0x220000000 (8704MB), size= 128MB: write-back, count=1
>>
>> so the ranges mapped as cacheable are:
>>
>> 0-3319MB
>> 4096-8832MB
>>
>> leaving 64MB of memory at the top of RAM uncached. What do you want to
>> bet that something important (kernel code?) is getting loaded there..
>>
>> So essentially it's a BIOS problem, it's not setting up the MTRRs
>> properly in order to map all of RAM as cacheable. As Andi says, complain
>> to Intel.
>
> If it's just 64M you'll end up losing, you could try the "[RFC] trim memory
> not covered by MTRR WB type" patch I posted yesterday. It won't reinit the
> MTRRs (maybe we should) but it will at least prevent your system from
> crawling if the BIOS doesn't set them up right. That would at least let you
> use most of your memory until the BIOS guys acknowledge that they have a
> problem (or we get proper PAT support, which I think would make this problem
> go away as well).
>
> Jesse
>

2007-06-01 19:21:43

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Friday, June 1, 2007 12:19:12 Justin Piszcz wrote:
> It works ok on
> my machine (correctly detects the condition, adjusts end_pfn, and keeps
> the machine fast), aside from the fact that X won't start.
>
> But X won't start? :\

Oh yeah, forgot about that. :) Somehow the patch breaks X startup, probably
by doing something bad to the MTRR API... but I don't know what yet.

Jesse

2007-06-01 20:17:52

by Andi Kleen

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

Jesse Barnes <[email protected]> writes:

> (or we get proper PAT support, which I think would make this problem
> go away as well).

No it won't. If the basic MTRRs for memory are wrong, just having PAT support
in drivers (which already exists in a limited form, just for
UC only) won't change anything.

-Andi

2007-06-01 20:19:34

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?



On Fri, 1 Jun 2007, Andi Kleen wrote:

> Jesse Barnes <[email protected]> writes:
>
>> (or we get proper PAT support, which I think would make this problem
>> go away as well).
>
> No it won't. If the basic MTRRs for memory are wrong just having PAT support
> in drivers (which already exist in a limited form already, just for
> UC only) won't change anything.
>
> -Andi
>

Basically from what I read:

1. The MCH/ICH8 hub 'requires' a minimum of 512MB to run, the board manual
states it needs at least 512MB of memory.
2. The DVT/IGP graphics uses either 128MB or 256MB, I have it set to
128MB.

How can the Linux kernel find this out/poll this information so users do
not have to know mem=XXXXM?

Justin.

2007-06-01 20:24:23

by Andi Kleen

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

> 1. The MCH/ICH8 hub 'requires' a minimum of 512MB to run, the board manual
> states it needs at least 512MB of memory.
> 2. The DVT/IGP graphics uses either 128MB or 256MB, I have it set to
> 128MB.
>
> How can the Linux kernel find this out/poll this information so users do
> not have to know mem=XXXXM?

I don't think it should. The Linux kernel is not trying to be
a BIOS replacement and should not know everything about the platforms
it runs on. We sometimes try to work around very common
bugs, but this one (involving lots of memory and special configuration)
seems to be more in the exotic range where command line options
or waiting for a BIOS update seem better options.

-Andi

2007-06-01 20:26:33

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?



On Fri, 1 Jun 2007, Andi Kleen wrote:

>> 1. The MCH/ICH8 hub 'requires' a minimum of 512MB to run, the board manual
>> states it needs at least 512MB of memory.
>> 2. The DVT/IGP graphics uses either 128MB or 256MB, I have it set to
>> 128MB.
>>
>> How can the Linux kernel find this out/poll this information so users do
>> not have to know mem=XXXXM?
>
> I don't think it should. The Linux kernel is not trying to be
> a BIOS replacement and should not know everything about the platforms
> it runs on. We sometimes try to work around very common
> bugs, but this one (involving lots of memory and special configuration)
> seems to be more in the exotic range where command line options
> or waiting for a BIOS update seem better options.
>
> -Andi
>

4GB of memory is $150 these days, and this is a very common board (*965
chipset). The bug occurred with any BIOS > 1666P and 4GB of memory, or with
version 1612P (and possibly others) and 8GB of memory.

I guess a warning in dmesg or such would be appropriate, letting the user
know of a possible work-around with the mem= option?

Justin.

2007-06-01 21:08:11

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Friday, June 1, 2007 2:14:17 Andi Kleen wrote:
> Jesse Barnes <[email protected]> writes:
> > (or we get proper PAT support, which I think would make this problem
> > go away as well).
>
> No it won't. If the basic MTRRs for memory are wrong just having PAT
> support in drivers (which already exist in a limited form already, just for
> UC only) won't change anything.

No, obviously just using PAT for drivers wouldn't help; I was thinking more of
having one PAT type be WB memory, and using it by default for most PTEs
covering normal memory. If that's not possible, then it seems sensible to
try to fix this MTRR problem in a better way, either with something like the
patch I posted earlier or a more advanced MTRR remapper that runs at early
boot. Depending on platform requirements though, that could get complicated
pretty fast...

Jesse

2007-06-01 21:19:54

by Andi Kleen

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Fri, Jun 01, 2007 at 02:07:51PM -0700, Jesse Barnes wrote:
> On Friday, June 1, 2007 2:14:17 Andi Kleen wrote:
> > Jesse Barnes <[email protected]> writes:
> > > (or we get proper PAT support, which I think would make this problem
> > > go away as well).
> >
> > No it won't. If the basic MTRRs for memory are wrong just having PAT
> > support in drivers (which already exist in a limited form already, just for
> > UC only) won't change anything.
>
> No obviously just using PAT for drivers wouldn't help, I was thinking more of
> having one PAT type be WB memory, and using it by default for most PTEs

Then the BIOS couldn't override it anymore in case it is needed somewhere.
E.g., normally we just use normal 2MB direct mappings for the hole
if there is memory beyond it and the hole doesn't need to be 2MB aligned.
Just assuming UC for all reserved pages would also be pretty drastic
and would likely result in many 2MB pages being split and using a lot more
TLB entries.

> covering normal memory. If that's not possible, then it seems sensible to

And normally the MTRRs win, don't they (if I remember the table correctly)?
So if the MTRR says UC and PAT disagrees, it might not actually help.

-Andi

2007-06-01 21:36:15

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
> On Fri, Jun 01, 2007 at 02:07:51PM -0700, Jesse Barnes wrote:
> > On Friday, June 1, 2007 2:14:17 Andi Kleen wrote:
> > > Jesse Barnes <[email protected]> writes:
> > > > (or we get proper PAT support, which I think would make this problem
> > > > go away as well).
> > >
> > > No it won't. If the basic MTRRs for memory are wrong just having PAT
> > > support in drivers (which already exist in a limited form already, just
> > > for UC only) won't change anything.
> >
> > No obviously just using PAT for drivers wouldn't help, I was thinking
> > more of having one PAT type be WB memory, and using it by default for
> > most PTEs
>
> Then the BIOS couldn't override it anymore in case it is needed somewhere.
> e.g. normally we just use normal 2MB direct mappings for the hole
> if there is memory beyond it and the hole doesn't need to be 2MB aligned.
> Just assuming UC for all reserved pages would be also pretty drastic
> and likely result in many 2MB pages being split and using a lot more
> TLB.
>
> > covering normal memory. If that's not possible, then it seems sensible
> > to
>
> And normally the MTRRs win, don't they (if I remember the table correctly)
> So if the MTRR says UC and PAT disagrees it might not actually help

I didn't check that part of the spec; that might be true. If so, then we
really need some sort of MTRR fix no matter what.

Jesse

2007-06-01 21:42:22

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
> And normally the MTRRs win, don't they (if I remember the table correctly)
> So if the MTRR says UC and PAT disagrees it might not actually help

I just checked: yes, the MTRRs win for UC types. But it sounds like the cases
we're talking about are actually situations where there's no MTRR coverage,
so the default type is used. The manual doesn't specifically call out how
memory using the default type interacts with PAT, but it may well be that it
stays uncached if the default type is uncached. Again that argues for fixing
the MTRR mapping problem in some way.
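
To make the precedence question concrete, here is a small model. Two assumptions are baked in, because the thread leaves them open: the MTRR default type on this board is taken to be UC (/proc/mtrr does not show it), and memory that falls back to the default type is assumed to combine with PAT exactly like memory covered by an explicit MTRR of that type. The helper names are hypothetical.

```python
# Variable MTRRs as (base_mb, size_mb, type), taken from the /proc/mtrr
# listing earlier in the thread (WB = write-back, UC = uncachable).
VARIABLE_MTRRS = [
    (0, 2048, "WB"),
    (2048, 1024, "WB"),
    (3072, 256, "WB"),
    (3320, 8, "UC"),
    (3319, 1, "UC"),
    (4096, 4096, "WB"),
    (8192, 512, "WB"),
    (8704, 128, "WB"),
]

# ASSUMPTION: the default type is not visible in /proc/mtrr; UC is a guess
# that matches the slowdown described in the thread.
DEFAULT_TYPE = "UC"


def mtrr_type(addr_mb):
    """Memory type the MTRRs assign to the 1MB block starting at addr_mb."""
    hits = {t for base, size, t in VARIABLE_MTRRS
            if base <= addr_mb < base + size}
    if not hits:
        return DEFAULT_TYPE  # no variable MTRR covers this address
    # Where ranges overlap, UC takes precedence over WB.
    return "UC" if "UC" in hits else "WB"


def effective_type_with_pat_wb(addr_mb):
    """Effective type for a page whose PAT entry selects WB.

    With a WB PAT entry the MTRR type carries through unchanged, so a UC
    result from the MTRRs stays UC.
    """
    return mtrr_type(addr_mb)


for addr in (5000, 3319, 8850):
    print(f"{addr}MB: {effective_type_with_pat_wb(addr)}")
```

Under those assumptions, the 64MB above 8832MB stays uncached even if every PTE asks for WB, which is the outcome suspected above.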

Thanks,
Jesse

2007-06-02 01:09:18

by Pallipadi, Venkatesh

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Fri, Jun 01, 2007 at 02:41:57PM -0700, Jesse Barnes wrote:
> On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
> > And normally the MTRRs win, don't they (if I remember the table correctly)
> > So if the MTRR says UC and PAT disagrees it might not actually help
>
> I just checked, yes the MTRRs win for UC types. But it sounds like the cases
> we're talking about are actually situations where there's no MTRR coverage,
> so the default type is used. The manual doesn't specifically call out how
> memory using the default type interacts with PAT, but it may well be that it
> stays uncached if the default type is uncached. Again that argues for fixing
> the MTRR mapping problem in some way.
>

I feel that having a silent/transparent workaround is not a good idea. With
that, chances are the BIOS bug will go unnoticed (an error message in dmesg may
not get noticed either). Probably we should just panic at boot with a
detailed message about the e820/MTRR discrepancy (which can be logged as
a bug with the BIOS provider) and suggest a temporary workaround of "mem=___".

Thanks,
Venki

2007-06-02 01:16:18

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Friday, June 1, 2007 6:05:39 Venki Pallipadi wrote:
> On Fri, Jun 01, 2007 at 02:41:57PM -0700, Jesse Barnes wrote:
> > On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
> > > And normally the MTRRs win, don't they (if I remember the table
> > > correctly) So if the MTRR says UC and PAT disagrees it might not
> > > actually help
> >
> > I just checked, yes the MTRRs win for UC types. But it sounds like the
> > cases we're talking about are actually situations where there's no MTRR
> > coverage, so the default type is used. The manual doesn't specifically
> > call out how memory using the default type interacts with PAT, but it may
> > well be that it stays uncached if the default type is uncached. Again
> > that argues for fixing the MTRR mapping problem in some way.
>
> I feel, having a silent/transparent workaround is not a good idea. With
> that chances are BIOS bug will go unnoticed (having an error message in
> dmesg may not get noticed either). Probably we should just panic at boot
> with a
> detailed message about the e820 mtrr discrepancy (which can be logged as
> a BUG to BIOS provider) and suggest a temporary workaround of "mem=___".

That might be best, short of actually fixing the MTRRs...

Jesse


2007-06-02 08:44:16

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?



On Fri, 1 Jun 2007, Jesse Barnes wrote:

> On Friday, June 1, 2007 6:05:39 Venki Pallipadi wrote:
>> On Fri, Jun 01, 2007 at 02:41:57PM -0700, Jesse Barnes wrote:
>>> On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
>>>> And normally the MTRRs win, don't they (if I remember the table
>>>> correctly) So if the MTRR says UC and PAT disagrees it might not
>>>> actually help
>>>
>>> I just checked, yes the MTRRs win for UC types. But it sounds like the
>>> cases we're talking about are actually situations where there's no MTRR
>>> coverage, so the default type is used. The manual doesn't specifically
>>> call out how memory using the default type interacts with PAT, but it may
>>> well be that it stays uncached if the default type is uncached. Again
>>> that argues for fixing the MTRR mapping problem in some way.
>>
>> I feel, having a silent/transparent workaround is not a good idea. With
>> that chances are BIOS bug will go unnoticed (having an error message in
>> dmesg may not get noticed either). Probably we should just panic at boot
>> with a
>> detailed message about the e820 mtrr discrepancy (which can be logged as
>> a BUG to BIOS provider) and suggest a temporary workaround of "mem=___".
>
> That might be best, short of actually fixing the MTRRs...
>
> Jesse
>
>

Indeed, at least it will inform the user of -what- is going on.

2007-06-02 09:22:41

by Andi Kleen

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

> I feel, having a silent/transparent workaround is not a good idea. With that

If enough RAM is chopped off, users will notice. They tend to complain
when they are missing RAM. I don't like a panic very much because for many
users it will be a show stopper (even when they are not blessed
with "quiet" boots, as some distributions do).

The message in dmesg could also be emphasized a bit with a little
ASCII art (but no <blink> tag in there).

The problem I'm more worried about is whether the system will be really
stable --- could it be that the memory controller is still
misconfigured and causes other stability issues? (We've had such
cases in the past.) Also, I'm not sure we can sanely handle the case where
the MTRRs are wrong at the hole rather than at the end of memory.

-Andi

2007-06-02 20:11:57

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?



On Sat, 2 Jun 2007, Andi Kleen wrote:

>> I feel, having a silent/transparent workaround is not a good idea. With that
>
> If enough RAM is chopped off users will notice. They tend to complain
> when they miss RAM. I don't like panic very much because for many
> users it will be a show stopper (even when they are not blessed
> with "quiet" boots like some distributions do)
>
> The message in dmesg could be also emphasized a bit with a little
> ASCII art (but no <blink> tag in there)
>
> The problem I'm more worried about is if the system will be really
> stable --- could it be that the memory controller is still
> misconfigured and cause other stability issues? (we've had such
> cases in the past). Also I'm not sure we can handle the case of
> the MTRR wrong not at the end of memory but at the hole sanely.
>
> -Andi
>

So far I have been booting with mem=8832M and have stress-tested/loaded the
memory subsystem pretty well; what other tests should I run?

It'd be nice if we could propose some sort of solution/warning for the future
so other people do not have to experience the same problems.

What are the next steps?

Justin.

2007-06-03 09:15:27

by Matt Keenan

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

Justin Piszcz wrote:
>
>
> On Sat, 2 Jun 2007, Andi Kleen wrote:
>
>>> I feel, having a silent/transparent workaround is not a good idea.
>>> With that
>>
>> If enough RAM is chopped off users will notice. They tend to complain
>> when they miss RAM. I don't like panic very much because for many
>> users it will be a show stopper (even when they are not blessed
>> with "quiet" boots like some distributions do)
>>
>> The message in dmesg could be also emphasized a bit with a little
>> ASCII art (but no <blink> tag in there)
>>
>> The problem I'm more worried about is if the system will be really
>> stable --- could it be that the memory controller is still
>> misconfigured and cause other stability issues? (we've had such
>> cases in the past). Also I'm not sure we can handle the case of
>> the MTRR wrong not at the end of memory but at the hole sanely.
>>
>> -Andi
>>
>
> So far I have been booting with mem=8832M and have run stress/loaded
> the memory subsystem pretty good; what other tests should I run?
>
> It'd be nice if we could pose some sort of solution/warning for the
> future so other people do not have to experience the same problems.
>
> What are the next steps?
>
Wouldn't it be possible for the e820/MTRR setup code to detect the problem
and suggest a mem=xxxx that would fix it (while also
complaining that the BIOS is broken)?
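
A minimal sketch of what such a check might compute, assuming the only problem is RAM left above the highest write-back MTRR (the case where the MTRRs are wrong at the hole is not handled). The function names and the message text are invented for illustration and do not come from any posted patch.

```python
def suggest_mem_mb(top_of_ram_mb, wb_ranges_mb):
    """Return a mem= value in MB, or None if no trimming is needed.

    top_of_ram_mb: end of the last usable e820 range, in MB.
    wb_ranges_mb:  list of (start, end) ranges covered by write-back MTRRs.
    """
    covered_end = max(end for _, end in wb_ranges_mb)
    if top_of_ram_mb <= covered_end:
        return None  # all RAM at the top is covered, nothing to suggest
    return covered_end


def bios_warning(top_of_ram_mb, wb_ranges_mb):
    """Build a hypothetical warning string, or None if the setup looks sane."""
    mem = suggest_mem_mb(top_of_ram_mb, wb_ranges_mb)
    if mem is None:
        return None
    lost = top_of_ram_mb - mem
    return (f"WARNING: BIOS bug: {lost}MB of RAM above {mem}MB is not covered "
            f"by a write-back MTRR; boot with mem={mem}M or update the BIOS")


# Figures from this thread: RAM ends at 8896MB, write-back coverage is
# 0-3319MB and 4096-8832MB.
print(bios_warning(8896, [(0, 3319), (4096, 8832)]))
```

With the numbers from this thread, the suggested value works out to the mem=8832M already in use.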

Matt

2007-06-04 15:40:41

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Sunday, June 3, 2007 2:15:06 Matt Keenan wrote:
> Justin Piszcz wrote:
> > On Sat, 2 Jun 2007, Andi Kleen wrote:
> >>> I feel, having a silent/transparent workaround is not a good idea.
> >>> With that
> >>
> >> If enough RAM is chopped off users will notice. They tend to complain
> >> when they miss RAM. I don't like panic very much because for many
> >> users it will be a show stopper (even when they are not blessed
> >> with "quiet" boots like some distributions do)
> >>
> >> The message in dmesg could be also emphasized a bit with a little
> >> ASCII art (but no <blink> tag in there)
> >>
> >> The problem I'm more worried about is if the system will be really
> >> stable --- could it be that the memory controller is still
> >> misconfigured and cause other stability issues? (we've had such
> >> cases in the past). Also I'm not sure we can handle the case of
> >> the MTRR wrong not at the end of memory but at the hole sanely.
> >>
> >> -Andi
> >
> > So far I have been booting with mem=8832M and have run stress/loaded
> > the memory subsystem pretty good; what other tests should I run?
> >
> > It'd be nice if we could pose some sort of solution/warning for the
> > future so other people do not have to experience the same problems.
> >
> > What are the next steps?
>
> Wouldn't it be possible for the e820/MTRR set up code detect the problem
> and suggest a mem=xxxx that would fix the problem (while also
> complaining that the BIOS is broken)?

Yes, that should be fairly easy, though as Andi points out, if there are holes
in the MTRR setup, things get a little trickier (I had an earlier patch to
deal with this, but ended up with too many early boot issues).

Maybe what Venki suggested would be best: just detect the condition and
panic, with a string telling the user to use mem=xxx (we can figure that out)
and/or upgrade their BIOS.

I'll spin a new patch to do that today.

Jesse

2007-06-04 15:48:48

by Ray Lee

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On 6/4/07, Jesse Barnes <[email protected]> wrote:
> On Sunday, June 3, 2007 2:15:06 Matt Keenan wrote:
> > Justin Piszcz wrote:
> > > On Sat, 2 Jun 2007, Andi Kleen wrote:
> > >>> I feel, having a silent/transparent workaround is not a good idea.
> > >>> With that
> > >>
> > >> If enough RAM is chopped off users will notice. They tend to complain
> > >> when they miss RAM. I don't like panic very much because for many
> > >> users it will be a show stopper (even when they are not blessed
> > >> with "quiet" boots like some distributions do)
> > >>
> > >> The message in dmesg could be also emphasized a bit with a little
> > >> ASCII art (but no <blink> tag in there)
> > >>
> > >> The problem I'm more worried about is if the system will be really
> > >> stable --- could it be that the memory controller is still
> > >> misconfigured and cause other stability issues? (we've had such
> > >> cases in the past). Also I'm not sure we can handle the case of
> > >> the MTRR wrong not at the end of memory but at the hole sanely.
> > >>
> > >> -Andi
> > >
> > > So far I have been booting with mem=8832M and have run stress/loaded
> > > the memory subsystem pretty good; what other tests should I run?
> > >
> > > It'd be nice if we could pose some sort of solution/warning for the
> > > future so other people do not have to experience the same problems.
> > >
> > > What are the next steps?
> >
> > Wouldn't it be possible for the e820/MTRR set up code detect the problem
> > and suggest a mem=xxxx that would fix the problem (while also
> > complaining that the BIOS is broken)?
>
> Yes, that should be fairly easy, though as Andi points out, if there are holes
> in the MTRR setup, things get a little trickier (I had an earlier patch to
> deal with this, but ended up with too many early boot issues).
>
> Maybe what Venki suggested would be best: just detect the condition and
> panic, with a string telling the user to use mem=xxx (we can figure that out)
> and/or upgrade their BIOS.

Ick. Systems that used to boot fine would then panic on a kernel
upgrade. That's rather rude for a condition that's merely an
optimization (using all memory), rather than one of correctness. A
panic seems entirely inappropriate.

Ray

2007-06-04 15:49:50

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?



On Mon, 4 Jun 2007, Ray Lee wrote:

> On 6/4/07, Jesse Barnes <[email protected]> wrote:
>> On Sunday, June 3, 2007 2:15:06 Matt Keenan wrote:
>> > Justin Piszcz wrote:
>> > > On Sat, 2 Jun 2007, Andi Kleen wrote:
>> > >>> I feel, having a silent/transparent workaround is not a good idea.
>> > >>> With that
>> > >>
>> > >> If enough RAM is chopped off users will notice. They tend to complain
>> > >> when they miss RAM. I don't like panic very much because for many
>> > >> users it will be a show stopper (even when they are not blessed
>> > >> with "quiet" boots like some distributions do)
>> > >>
>> > >> The message in dmesg could be also emphasized a bit with a little
>> > >> ASCII art (but no <blink> tag in there)
>> > >>
>> > >> The problem I'm more worried about is if the system will be really
>> > >> stable --- could it be that the memory controller is still
>> > >> misconfigured and cause other stability issues? (we've had such
>> > >> cases in the past). Also I'm not sure we can handle the case of
>> > >> the MTRR wrong not at the end of memory but at the hole sanely.
>> > >>
>> > >> -Andi
>> > >
>> > > So far I have been booting with mem=8832M and have run stress/loaded
>> > > the memory subsystem pretty good; what other tests should I run?
>> > >
>> > > It'd be nice if we could pose some sort of solution/warning for the
>> > > future so other people do not have to experience the same problems.
>> > >
>> > > What are the next steps?
>> >
>> > Wouldn't it be possible for the e820/MTRR set up code detect the problem
>> > and suggest a mem=xxxx that would fix the problem (while also
>> > complaining that the BIOS is broken)?
>>
>> Yes, that should be fairly easy, though as Andi points out, if there are
>> holes
>> in the MTRR setup, things get a little trickier (I had an earlier patch to
>> deal with this, but ended up with too many early boot issues).
>>
>> Maybe what Venki suggested would be best: just detect the condition and
>> panic, with a string telling the user to use mem=xxx (we can figure that
>> out)
>> and/or upgrade their BIOS.
>
> Ick. Systems that used to boot fine would then panic on a kernel
> upgrade. That's rather rude for a condition that's merely an
> optimization (using all memory), rather than one of correctness. A
> panic seems entirely inappropriate.
>
> Ray
>

While I am unsure of the 'best' solution, if they boot and it does not
panic but takes 10 minutes to boot, people are going to seriously wonder
what is going on?

Justin.

2007-06-04 15:54:31

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Monday, June 4, 2007 8:48:37 Ray Lee wrote:
> On 6/4/07, Jesse Barnes <[email protected]> wrote:
> > On Sunday, June 3, 2007 2:15:06 Matt Keenan wrote:
> > > Justin Piszcz wrote:
> > > > On Sat, 2 Jun 2007, Andi Kleen wrote:
> > > >>> I feel, having a silent/transparent workaround is not a good idea.
> > > >>> With that
> > > >>
> > > >> If enough RAM is chopped off users will notice. They tend to
> > > >> complain when they miss RAM. I don't like panic very much because
> > > >> for many users it will be a show stopper (even when they are not
> > > >> blessed with "quiet" boots like some distributions do)
> > > >>
> > > >> The message in dmesg could be also emphasized a bit with a little
> > > >> ASCII art (but no <blink> tag in there)
> > > >>
> > > >> The problem I'm more worried about is if the system will be really
> > > >> stable --- could it be that the memory controller is still
> > > >> misconfigured and cause other stability issues? (we've had such
> > > >> cases in the past). Also I'm not sure we can handle the case of
> > > >> the MTRR wrong not at the end of memory but at the hole sanely.
> > > >>
> > > >> -Andi
> > > >
> > > > So far I have been booting with mem=8832M and have run stress/loaded
> > > > the memory subsystem pretty good; what other tests should I run?
> > > >
> > > > It'd be nice if we could pose some sort of solution/warning for the
> > > > future so other people do not have to experience the same problems.
> > > >
> > > > What are the next steps?
> > >
> > > Wouldn't it be possible for the e820/MTRR set up code detect the
> > > problem and suggest a mem=xxxx that would fix the problem (while also
> > > complaining that the BIOS is broken)?
> >
> > Yes, that should be fairly easy, though as Andi points out, if there are
> > holes in the MTRR setup, things get a little trickier (I had an earlier
> > patch to deal with this, but ended up with too many early boot issues).
> >
> > Maybe what Venki suggested would be best: just detect the condition and
> > panic, with a string telling the user to use mem=xxx (we can figure that
> > out) and/or upgrade their BIOS.
>
> Ick. Systems that used to boot fine would then panic on a kernel
> upgrade. That's rather rude for a condition that's merely an
> optimization (using all memory), rather than one of correctness. A
> panic seems entirely inappropriate.

No, existing kernels would have been so slow as to be nearly unusable on
machines with this problem. Reducing the amount of available memory
automatically might work in most cases, but as Venki pointed out, people will
have to check their logs to notice that anything is wrong.

But I don't have a strong preference, maybe just a boot time message (with
suitably obnoxious ascii art) would be sufficient.

Jesse

2007-06-04 16:01:18

by Ray Lee

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On 6/4/07, Justin Piszcz <[email protected]> wrote:
> On Mon, 4 Jun 2007, Ray Lee wrote:
> > Ick. Systems that used to boot fine would then panic on a kernel
> > upgrade. That's rather rude for a condition that's merely an
> > optimization (using all memory), rather than one of correctness. A
> > panic seems entirely inappropriate.
>
> While I am unsure of the 'best' solution, if they boot and it does not
> panic but takes 10 minutes to boot, people are going to seriously wonder
> what is going on?

<goes and re-reads thread more carefully> Oh, hmm.

I think a big fat warning with asterisks in the bootup is a good
thing, but panicking when there's no need is never a good idea. If the
system takes that long to boot up, I'm certain the first thing they'll
do is to type dmesg | less to look for anomalies.

Ray

2007-06-04 18:14:31

by Eric W. Biederman

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

Jesse Barnes <[email protected]> writes:

> On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
>> And normally the MTRRs win, don't they (if I remember the table correctly)
>> So if the MTRR says UC and PAT disagrees it might not actually help
>
> I just checked, yes the MTRRs win for UC types. But it sounds like the cases
> we're talking about are actually situations where there's no MTRR coverage,
> so the default type is used. The manual doesn't specifically call out how
> memory using the default type interacts with PAT, but it may well be that it
> stays uncached if the default type is uncached. Again that argues for fixing
> the MTRR mapping problem in some way.

Last I looked PAT can only demote not promote the type of a page,
except for the specific exception of UC to WC.

Normally the default type is UC so putting a pat type of WB won't
help anything. I may have missed some subtle detail but I remember
looking into this in some detail a while ago and coming to that
conclusion.

It is the BIOS's responsibility to mark all usable memory as WB,
using the MTRRs. If it doesn't it is a BIOS bug.

Eric
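
To make the "demote but not promote" rule concrete, here is a minimal Python sketch of how an MTRR type and a PAT type might combine into an effective type. The table covers only the two MTRR rows relevant to this thread (UC and WB), and its values are written from memory of the effective-memory-type table in the Intel manual, so treat them as an assumption to check against the manual. The names `EFFECTIVE` and `effective_type` are made up for illustration.

```python
# Effective memory type for a page, indexed by MTRR type and then PAT type.
# ASSUMPTION: values recalled from the Intel manual's effective-type table;
# only the UC and WB MTRR rows are shown.
EFFECTIVE = {
    "UC": {"UC": "UC", "UC-": "UC", "WC": "WC", "WT": "UC", "WB": "UC", "WP": "UC"},
    "WB": {"UC": "UC", "UC-": "UC", "WC": "WC", "WT": "WT", "WB": "WB", "WP": "WP"},
}


def effective_type(mtrr_type, pat_type):
    """Return the effective caching type for the given MTRR/PAT pair."""
    return EFFECTIVE[mtrr_type][pat_type]


if __name__ == "__main__":
    # A WB PAT entry cannot rescue memory whose MTRR type is UC ...
    print(effective_type("UC", "WB"))
    # ... but the UC-to-WC exception does apply.
    print(effective_type("UC", "WC"))
```

Under these assumptions, memory left at a UC type by the MTRRs stays uncached whatever the page tables say, which is why the MTRR setup itself has to be right.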

2007-06-04 18:22:18

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?



On Mon, 4 Jun 2007, Eric W. Biederman wrote:

> Jesse Barnes <[email protected]> writes:
>
>> On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
>>> And normally the MTRRs win, don't they (if I remember the table correctly)
>>> So if the MTRR says UC and PAT disagrees it might not actually help
>>
>> I just checked, yes the MTRRs win for UC types. But it sounds like the cases
>> we're talking about are actually situations where there's no MTRR coverage,
>> so the default type is used. The manual doesn't specifically call out how
>> memory using the default type interacts with PAT, but it may well be that it
>> stays uncached if the default type is uncached. Again that argues for fixing
>> the MTRR mapping problem in some way.
>
> Last I looked PAT can only demote not promote the type of a page,
> except for the specific exception of UC to WC.
>
> Normally the default type is UC so putting a pat type of WB won't
> help anything. I may have missed some subtle detail but I remember
> looking into this in some detail a while ago and coming to that
> conclusion.
>
> It is the BIOS's responsibility to mark all usable memory as WB,
> using the MTRRs. If it doesn't it is a BIOS bug.
>
> Eric
>

According to Intel, it is not a BIOS bug; rather, the memory controller
hub (MCH) uses memory for various purposes, as outlined in their doc.

From their response, it sounds like the kernel needs to set up the memory
properly to deal with the MCH found in the 965 motherboards?

From their e-mail:

Note before continuing: Debian* Linux Operating System is not an
officially, validated, tested Operating System for the Intel(R) Desktop
Board DG965WH
(see http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2375);
moreover, we do confirm that "on a system that has 8 GB of system memory
installed, it is not possible to use all of the installed memory due to system
address space being allocated for other system critical functions." [qtd.
on page 43 of the Technical Product Specification (see
http://download.intel.com/design/motherbd/wh/D5600801US.pdf)]. Thus, the
following suggestions are provided AS IS; we cannot guarantee the problem
would be fixed afterwards:

Therefore, they are NOT going to fix their BIOS-- and I have already
received an e-mail from one or two people who are experiencing this
problem, I presume it will only get worse.

Justin.

2007-06-04 18:24:34

by Andi Kleen

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

> Ick. Systems that used to boot fine would then panic on a kernel
> upgrade. That's rather rude for a condition that's merely an
> optimization (using all memory), rather than one of correctness. A
> panic seems entirely inappropriate.

No, when the MTRRs are wrong they would generally not work fine.
As soon as something uses the uncached memory things go incredibly
slow, slow enough to make the machine unusable. Sometimes you're
lucky and nothing important is in there, but only sometimes.

There is also no code in there currently to automatically limit
the memory; the user has to do that with mem=...

But I also don't like the panic.

-Andi

2007-06-04 19:08:42

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?



On Mon, 4 Jun 2007, Justin Piszcz wrote:

>
>
> On Mon, 4 Jun 2007, Eric W. Biederman wrote:
>
>> Jesse Barnes <[email protected]> writes:
>>
>>> On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
>>>> And normally the MTRRs win, don't they (if I remember the table
>>>> correctly)
>>>> So if the MTRR says UC and PAT disagrees it might not actually help
>>>
>>> I just checked, yes the MTRRs win for UC types. But it sounds like the
>>> cases
>>> we're talking about are actually situations where there's no MTRR
>>> coverage,
>>> so the default type is used. The manual doesn't specifically call out how
>>> memory using the default type interacts with PAT, but it may well be that
>>> it
>>> stays uncached if the default type is uncached. Again that argues for
>>> fixing
>>> the MTRR mapping problem in some way.
>>
>> Last I looked PAT can only demote not promote the type of a page,
>> except for the specific exception of UC to WC.
>>
>> Normally the default type is UC so putting a pat type of WB won't
>> help anything. I may have missed some subtle detail but I remember
>> looking into this in some detail a while ago and coming to that
>> conclusion.
>>
>> It is the BIOS's responsibility to mark all usable memory as WB,
>> using the MTRRs. If it doesn't it is a BIOS bug.
>>
>> Eric
>>
>
> According to Intel it is not a BIOS bug but rather the memory controller
> hub (MCH) uses memory for various purposes, outlined in their doc:
>
> From their response, it sounds like the kernel needs to setup the memory
> properly to deal with the MCH found in the 965 motherboards?
>
> From their e-mail:
>
> Note before continuing: Debian* Linux Operating System is not an officially,
> validated, tested Operating System for the Intel(R) Desktop Board DG965WH
> (see http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2375);
> moreover, we do confirm that "on a system that has 8 GB of system memory
> installed, it is not possible to use all of the installed memory due to
> system address space being allocated for other system critical functions."
> [qtd. on page 43 of the Technical Product Specification (see
> http://download.intel.com/design/motherbd/wh/D5600801US.pdf)]. Thus, the
> following suggestions are provided AS IS; we cannot guarantee the problem
> would be fixed afterwards:
>
> Therefore, they are NOT going to fix their BIOS-- and I have already received
> an e-mail from one or two people who are experiencing this problem, I presume
> it will only get worse.
>
> Justin.
>

Therefore, since they will NOT commit to a BIOS fix (they claim not a BIOS
issue) what options does that leave us with?

Justin.

2007-06-04 19:13:52

by Alan

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

> Therefore, they are NOT going to fix their BIOS-- and I have already
> received an e-mail from one or two people who are experiencing this
> problem, I presume it will only get worse.

In which case we need to clip the memory used according to the MTRR
registers and tell the user xxMB of memory not available due to BIOS bugs

Alan
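
The clipping described above can be sketched in a few lines of Python. This is only an illustration, not kernel code: it assumes the MTRR ranges are already available as `(base, size, type)` tuples in bytes, it clips only at the top of memory (holes in the middle are ignored), and the function names and example layout are hypothetical rather than taken from any machine in this thread.

```python
MB = 1 << 20


def clip_to_mtrr(mtrr_ranges, ram_top):
    """Return (usable_top, lost_bytes) after clipping RAM to write-back coverage.

    mtrr_ranges: list of (base, size, type) tuples, sizes in bytes.
    ram_top: end address of RAM as reported by the firmware memory map.
    Simplification: only the highest write-back end address is considered.
    """
    wb_ends = [base + size for base, size, mtype in mtrr_ranges
               if mtype == "write-back"]
    wb_top = max(wb_ends, default=0)
    usable_top = min(ram_top, wb_top)
    return usable_top, ram_top - usable_top


def report(mtrr_ranges, ram_top):
    """Build a warning string, or return None if nothing has to be clipped."""
    usable_top, lost = clip_to_mtrr(mtrr_ranges, ram_top)
    if lost == 0:
        return None
    return ("WARNING: %dMB of memory not available due to BIOS bugs; "
            "equivalent to booting with mem=%dM" % (lost // MB, usable_top // MB))


if __name__ == "__main__":
    # Hypothetical layout: write-back coverage ends at 8704MB, RAM at 9216MB.
    ranges = [(0, 4096 * MB, "write-back"),
              (4096 * MB, 4096 * MB, "write-back"),
              (8192 * MB, 512 * MB, "write-back")]
    print(report(ranges, 9216 * MB))
```

A real implementation would also have to cope with uncached ranges overlapping the write-back ones and with gaps below the top of memory, which is the harder case raised earlier in the thread.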

2007-06-04 19:18:01

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Monday, June 4, 2007 11:22 am Justin Piszcz wrote:
> On Mon, 4 Jun 2007, Eric W. Biederman wrote:
> > Jesse Barnes <[email protected]> writes:
> >> On Friday, June 1, 2007 2:19:43 Andi Kleen wrote:
> >>> And normally the MTRRs win, don't they (if I remember the table
> >>> correctly) So if the MTRR says UC and PAT disagrees it might not
> >>> actually help
> >>
> >> I just checked, yes the MTRRs win for UC types. But it sounds
> >> like the cases we're talking about are actually situations where
> >> there's no MTRR coverage, so the default type is used. The manual
> >> doesn't specifically call out how memory using the default type
> >> interacts with PAT, but it may well be that it stays uncached if
> >> the default type is uncached. Again that argues for fixing the
> >> MTRR mapping problem in some way.
> >
> > Last I looked PAT can only demote not promote the type of a page,
> > except for the specific exception of UC to WC.
> >
> > Normally the default type is UC so putting a pat type of WB won't
> > help anything. I may have missed some subtle detail but I remember
> > looking into this in some detail a while ago and coming to that
> > conclusion.
> >
> > It is the BIOS's responsibility to mark all usable memory as WB,
> > using the MTRRs. If it doesn't it is a BIOS bug.
> >
> > Eric
>
> According to Intel it is not a BIOS bug but rather the memory
> controller hub (MCH) uses memory for various purposes, outlined in
> their doc:
>
> From their response, it sounds like the kernel needs to setup the
> memory properly to deal with the MCH found in the 965 motherboards?
>
> From their e-mail:
>
> Note before continuing: Debian* Linux Operating System is not an
> officially, validated, tested Operating System for the Intel(R)
> Desktop Board DG965WH
> (see
> http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2375);
> moreover, we do confirm that "on a system that has 8 GB of system
> memory installed, it is not possible to use all of the installed
> memory due to system address space being allocated for other system
> critical functions." [qtd. on page 43 of the Technical Product
> Specification (see
> http://download.intel.com/design/motherbd/wh/D5600801US.pdf)]. Thus,
> the following suggestions are provided AS IS; we cannot guarantee the
> problem would be fixed afterwards:
>
> Therefore, they are NOT going to fix their BIOS-- and I have already
> received an e-mail from one or two people who are experiencing this
> problem, I presume it will only get worse.

That's a separate issue from the MTRR mapping though. Regardless of the
fact that the system needs some address space in its 8GB range reserved
for I/O devices, the BIOS should properly setup the MTRRs to map all of
*available* RAM. So the person handling your bug report may have been
confused into thinking that you were describing the former problem.

Jesse

2007-06-04 19:23:34

by Andi Kleen

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

> From their e-mail:
>
> Note before continuing: Debian* Linux Operating System is not an
> officially, validated, tested Operating System for the Intel(R) Desktop
> Board DG965WH
> (see http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2375);
> moreover, we do confirm that "on a system that has 8 GB of system memory
> installed, it is not possible to use all of the installed memory due to
> system address space being allocated for other system critical functions."
> [qtd. on page 43 of the Technical Product Specification (see
> http://download.intel.com/design/motherbd/wh/D5600801US.pdf)]. Thus, the
> following suggestions are provided AS IS; we cannot guarantee the problem
> would be fixed afterwards:

They're talking about something different than your issue. If you put in
the fully possible 8GB (4x2GB) then some memory will be lost to the PCI hole
because the desktop ICH can only access 35bits (8GB) in hardware.

That can be up to 2GB in extreme cases, usually <0.5-1GB depending
on how much mapping space your hardware needs.

But if you put in less than 8GB the BIOS is supposed to remap
the memory around the PCI hole and set up the MTRRs correctly
so that the PCI hole is uncached and the memory around it is cached.

That is 100% the BIOS' responsibility and if it doesn't do that
it is buggy.

-Andi

2007-06-04 19:25:13

by Andi Kleen

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately? II

> because the desktop ICH can only access 35bits (8GB) in hardware.
Should be 33bits of course.

-Andi
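
The corrected figure is easy to check: 33 address bits give exactly 8GB. The short Python sketch below does that arithmetic and also shows, under a simplified model, why a full 8GB loses RAM to the PCI hole while a smaller configuration can be remapped around it. The function names and the 1GB hole size are illustrative assumptions.

```python
GIB = 1 << 30


def addressable_bytes(address_bits):
    """Bytes reachable with the given number of physical address bits."""
    return 1 << address_bits


def ram_lost_to_hole(installed, address_bits, hole):
    """RAM pushed beyond the addressable limit once a hole displaces it.

    Simplified model: the hole takes `hole` bytes of address space and the
    displaced RAM is remapped upward as far as the address limit allows.
    """
    return max(0, installed + hole - addressable_bytes(address_bits))


if __name__ == "__main__":
    print(addressable_bytes(33) // GIB, "GiB with 33 bits")
    print(addressable_bytes(35) // GIB, "GiB with 35 bits")
    print(ram_lost_to_hole(8 * GIB, 33, 1 * GIB) // GIB, "GiB lost with 8 GiB installed")
    print(ram_lost_to_hole(4 * GIB, 33, 1 * GIB) // GIB, "GiB lost with 4 GiB installed")
```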

2007-06-04 21:03:00

by Eric W. Biederman

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

Alan Cox <[email protected]> writes:

>> Therefore, they are NOT going to fix their BIOS-- and I have already
>> received an e-mail from one or two people who are experiencing this
>> problem, I presume it will only get worse.
>
> In which case we need to clip the memory used according to the MTRR
> registers and tell the user xxMB of memory not available due to BIOS bugs

Exactly. Given that this is a fairly easy thing to do, and that
occasionally we see systems where this happens (even if their BIOS is
later fixed), it is likely worth it for someone to write up a patch
that compares the MTRRs with available memory, and complains about and
reserves all memory that the MTRRs claim is not write-back.

Eric

2007-06-05 00:54:58

by Yinghai Lu

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On 6/4/07, Andi Kleen <[email protected]> wrote:
> > From their e-mail:
> >
> > Note before continuing: Debian* Linux Operating System is not an
> > officially, validated, tested Operating System for the Intel(R) Desktop
> > Board DG965WH
> > (see http://downloadcenter.intel.com/Product_Filter.aspx?ProductID=2375);
> > moreover, we do confirm that "on a system that has 8 GB of system memory
> > installed, it is not possible to use all of the installed memory due to
> > system address space being allocated for other system critical functions."
> > [qtd. on page 43 of the Technical Product Specification (see
> > http://download.intel.com/design/motherbd/wh/D5600801US.pdf)]. Thus, the
> > following suggestions are provided AS IS; we cannot guarantee the problem
> > would be fixed afterwards:
>
> They're talking about something different than your issue. If you put in
> the fully possible 8GB (4x2GB) then some memory will be lost to the PCI hole
> because the desktop ICH can only access 35bits (8GB) in hardware.
>
> That can be up to 2GB in extreme cases, usually <0.5-1GB depending
> on how much mapping space your hardware needs.
>
> But if you put in less than 8GB the BIOS is supposed to remap
> the memory around the PCI hole and set up the MTRRs correctly
> so that the PCI hole is uncached and the memory around it is cached.
>
> That is 100% the BIOS' responsibility and if it doesn't do that
> it is buggy.

Then the BIOS needs to disable memory remapping to keep things simple
if there is 8G of RAM installed.

YH

2007-06-05 01:00:15

by Yinghai Lu

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On 6/4/07, Eric W. Biederman <[email protected]> wrote:
>
> Exactly. Given that this is a fairly easy thing to do, and that
> occasionally we see systems where this happens (even if their BIOS is
> later fixed), it is likely worth it for someone to write up a patch
> that compares the MTRRs with available memory, and complains about and
> reserves all memory that the MTRRs claim is not write-back.
>
That is good.
Sometimes the BIOS cannot even keep the MTRRs identical across the
different CPUs in an SMP system.

Or reset the MTRRs according to the e820 table.

YH

2007-06-05 01:40:05

by Eric W. Biederman

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

"Yinghai Lu" <[email protected]> writes:

> On 6/4/07, Eric W. Biederman <[email protected]> wrote:
>>
>> Exactly. Given that this is a fairly easy thing to do, and that
>> occasionally we see systems where this happens (even if their BIOS is
>> later fixed), it is likely worth it for someone to write up a patch
>> that compares the MTRRs with available memory, and complains about and
>> reserves all memory that the MTRRs claim is not write-back.
>>
> That is good.
> Sometimes the BIOS cannot even keep the MTRRs identical across the
> different CPUs in an SMP system.
>
> Or reset the MTRRs according to the e820 table.

Resetting the MTRRs to match the e820 table is attractive,
and it would be even easier to set the MTRR default type to
write-back and just handle everything else with PAT.

However that would most likely do horrible things to any BIOS going
into SMI mode, and even a more modest scheme with reprogramming
MTRRs would likely have similar problems, where we put something
in the wrong caching mode.

So the only safe thing we can do is not use memory that is not
write-back cached. That we can positively detect and is a
conservative action so if anything will work that will.

Eric
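
The conservative approach described above, not using any RAM that is not write-back, amounts to subtracting the write-back coverage from the usable RAM ranges. Below is a minimal Python sketch of that comparison. It assumes usable RAM is given as `(start, end)` pairs and MTRRs as `(base, size, type)` tuples with types `"WB"` or `"UC"`, that UC wins where ranges overlap, and that uncovered memory takes the default type. All names and the example layout are hypothetical.

```python
GIB = 1 << 30


def non_wb_ram(e820_usable, mtrrs, default_type="UC"):
    """Return merged (start, end) ranges of usable RAM that are not write-back.

    e820_usable: list of (start, end) byte ranges reported as usable RAM.
    mtrrs: list of (base, size, type) with type "WB" or "UC".
    """
    # Every range boundary becomes a cut point, so each elementary interval
    # lies entirely inside or entirely outside every range.
    points = set()
    for start, end in e820_usable:
        points.update((start, end))
    for base, size, _ in mtrrs:
        points.update((base, base + size))
    cuts = sorted(points)

    bad = []
    for lo, hi in zip(cuts, cuts[1:]):
        if not any(s <= lo and hi <= e for s, e in e820_usable):
            continue  # not RAM, nothing to reserve
        types = {t for base, size, t in mtrrs if base <= lo and hi <= base + size}
        if "UC" in types:
            effective = "UC"  # UC takes precedence over an overlapping WB range
        elif "WB" in types:
            effective = "WB"
        else:
            effective = default_type
        if effective != "WB":
            if bad and bad[-1][1] == lo:
                bad[-1] = (bad[-1][0], hi)  # extend the previous range
            else:
                bad.append((lo, hi))
    return bad


if __name__ == "__main__":
    # Hypothetical layout: RAM below 3 GiB and from 4 GiB to 9 GiB, with
    # write-back coverage that stops at 8 GiB.
    ram = [(0, 3 * GIB), (4 * GIB, 9 * GIB)]
    mtrrs = [(0, 4 * GIB, "WB"), (3 * GIB, 1 * GIB, "UC"), (4 * GIB, 4 * GIB, "WB")]
    for start, end in non_wb_ram(ram, mtrrs):
        print("reserve %#x-%#x (%d MB)" % (start, end, (end - start) >> 20))
```

Unlike a simple clip at the top of memory, this handles a gap in the middle of RAM the same way as one at the end, at the cost of leaving a reserved hole in the memory map.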

2007-06-05 09:46:38

by Andi Kleen

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

> So the only safe thing we can do is not use memory that is not
> write-back cached. That we can positively detect and is a
> conservative action so if anything will work that will.

Jesse wrote such a patch (or rather it limited end_pfn), but it broke
the X server for so far unknown reasons.

-Andi

2007-06-05 13:19:15

by Justin Piszcz

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?



On Tue, 5 Jun 2007, Andi Kleen wrote:

>> So the only safe thing we can do is not use memory that is not
>> write-back cached. That we can positively detect and is a
>> conservative action so if anything will work that will.
>
> Jesse wrote such a patch (or rather it limited end_pfn), but it broke
> the X server for so far unknown reasons.
>
> -Andi
>

So the best solution is a patch/function that reads the E820 memory
map, subtracts the non-cached memory, and does the equivalent of an
append= line in the LILO config with the proper amount of memory?

Justin.

2007-06-05 17:21:17

by Jesse Barnes

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Tuesday, June 5, 2007 2:46 am Andi Kleen wrote:
> > So the only safe thing we can do is not use memory that is not
> > write-back cached. That we can positively detect and is a
> > conservative action so if anything will work that will.
>
> Jesse wrote such a patch (or rather it limited end_pfn), but it
> broke the X server for so far unknown reasons.

It looks like I broke the /proc/mtrr interface somehow... I'll try to
fix it tomorrow.

Jesse
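
For anyone checking their own machine, the /proc/mtrr listing is the quickest way to see which ranges the BIOS marked write-back. Here is a minimal Python sketch of a parser for its lines. The line layout it accepts is an assumption based on the format shown in the kernel's MTRR documentation (older and newer kernels order the fields slightly differently), the sample lines are illustrative rather than taken from any machine in this thread, and `parse_mtrr_line` is a made-up name.

```python
import re

# Type names as they are assumed to appear in /proc/mtrr output.
MTRR_TYPES = ("uncachable", "write-combining", "write-through",
              "write-protect", "write-back")

# Matches the register number, base address and size; the type is looked up
# separately because its position differs between kernel versions.
_LINE = re.compile(
    r"reg(\d+): base=0x([0-9a-fA-F]+) \(\s*\d+MB\), size=\s*(\d+)([KM])B"
)
_UNIT = {"K": 1 << 10, "M": 1 << 20}


def parse_mtrr_line(line):
    """Return (base, size, type) in bytes for one /proc/mtrr line, or None."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    base = int(m.group(2), 16)
    size = int(m.group(3)) * _UNIT[m.group(4)]
    mtype = next((t for t in MTRR_TYPES if t in line), "unknown")
    return base, size, mtype


if __name__ == "__main__":
    samples = [
        "reg00: base=0x00000000 (   0MB), size=2048MB: write-back, count=1",
        "reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: uncachable",
    ]
    for line in samples:
        print(parse_mtrr_line(line))
```

On a live system the same function could be applied to each line of `open("/proc/mtrr")`, and the resulting tuples compared against the RAM ranges the kernel prints at boot.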

2007-06-07 08:47:36

by Andi Kleen

Subject: Re: Intel's response Linux/MTRR/8GB Memory Support / Why doesn't the kernel realize the BIOS has problems and re-map appropriately?

On Mon, Jun 04, 2007 at 05:59:59PM -0700, Yinghai Lu wrote:
> On 6/4/07, Eric W. Biederman <[email protected]> wrote:
> >
> >Exactly. Given that this is a fairly easy thing to do, and that
> >occasionally we see systems where this happens (even if their BIOS is
> >later fixed), it is likely worth it for someone to write up a patch
> >that compares the MTRRs with available memory, and complains about and
> >reserves all memory that the MTRRs claim is not write-back.
> >
> That is good.
> Sometimes the BIOS cannot even keep the MTRRs identical across the
> different CPUs in an SMP system.

The MTRR code already fixes this case. Or at least mostly --
it doesn't do it for fixed size MTRRs.

> Or reset the MTRRs according to the e820 table.

That would likely be dangerous.

-Andi