Subject: [PATCH] Fix alignment of early reservation for EBDA

Hi Andi,

My eyes fell on the following table in the boot messages:

early res: 0 [0-fff] BIOS data page
early res: 1 [6000-7fff] SMP_TRAMPOLINE
early res: 2 [200000-374557] TEXT DATA BSS
early res: 3 [9fc00-a0bff] EBDA
early res: 4 [8000-afff] PGTABLE

The memory reserved for the EBDA overflows into the area normally
reserved for the VGA adaptor. It seems that you wanted to force
the allocation to cover whole pages, like:

early res: 3 [9f000-9ffff] EBDA

This is what this patch implements.

Is it really necessary to force the allocation to a page boundary?


Signed-off-by: Alexander van Heukelum <[email protected]>

arch/x86/kernel/head64.c | 12 +++++++-----
1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index ad24408..2c52543 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -53,7 +53,7 @@ static void __init copy_bootdata(char *real_mode_data)

static __init void reserve_ebda(void)
{
- unsigned ebda_addr, ebda_size;
+ unsigned int ebda_addr, ebda_size, ebda_end;

/*
* there is a real-mode segmented pointer pointing to the
@@ -70,12 +70,14 @@ static __init void reserve_ebda(void)
/* Round EBDA up to pages */
if (ebda_size == 0)
ebda_size = 1;
+ if (ebda_size > 64)
+ ebda_size = 64;
ebda_size <<= 10;
- ebda_size = round_up(ebda_size + (ebda_addr & ~PAGE_MASK), PAGE_SIZE);
- if (ebda_size > 64*1024)
- ebda_size = 64*1024;
+
+ ebda_end = ALIGN(ebda_addr + ebda_size, PAGE_SIZE);
+ ebda_addr = ALIGN(ebda_addr - PAGE_SIZE, PAGE_SIZE);

- reserve_early(ebda_addr, ebda_addr + ebda_size, "EBDA");
+ reserve_early(ebda_addr, ebda_end, "EBDA");
}

void __init x86_64_start_kernel(char * real_mode_data)


2008-02-24 19:27:53

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Fix alignment of early reservation for EBDA

On Sunday 24 February 2008 18:46:05 Alexander van Heukelum wrote:
> Hi Andi,
>
> My eyes fell on the following table in the boot messages:
>
> early res: 0 [0-fff] BIOS data page
> early res: 1 [6000-7fff] SMP_TRAMPOLINE
> early res: 2 [200000-374557] TEXT DATA BSS
> early res: 3 [9fc00-a0bff] EBDA
> early res: 4 [8000-afff] PGTABLE
>
> The memory reserved for the EBDA overflows into the area normally
> reserved for the VGA adaptor. It seems that you wanted to force
> the allocation to cover whole pages, like:
>
> early res: 3 [9f000-9ffff] EBDA
>
> This is what this patch implements.

Thanks.

> Is it really necessary to force the allocation to a page boundary?

In theory not, in practice it works around some problems in early
allocations where the other users assume page alignment.
At some point it needs to be cleaned up properly.

-Andi

2008-02-24 19:41:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Fix alignment of early reservation for EBDA


* Alexander van Heukelum <[email protected]> wrote:

> Hi Andi,
>
> My eyes fell on the following table in the boot messages:
>
> early res: 0 [0-fff] BIOS data page
> early res: 1 [6000-7fff] SMP_TRAMPOLINE
> early res: 2 [200000-374557] TEXT DATA BSS
> early res: 3 [9fc00-a0bff] EBDA
> early res: 4 [8000-afff] PGTABLE
>
> The memory reserved for the EBDA overflows into the area normally
> reserved for the VGA adaptor. It seems that you wanted to force the
> allocation to cover whole pages, like:

well, that's what your EBDA descriptor says - it's set to 9fc00 which is
512 bytes below the VGA range. This behavior didnt really change over
v2.6.24 (which reserved 'into' the VGA range too), it's just that in
v2.6.25 we also print out these early reservations. Can you see any
regression? There should be no harm from overlapping into the VGA range
- these "reservations" only make RAM unavailable for normal allocations.

your patch on the other hand rounds the EBDA area down which could in
theory be unsafe on other boxes (where there could be real RAM above the
EBDA area): the safest approach is to round the beginning of it down,
the end of it up (to page boundary). Your patch _should_ be OK, but in
practice it doesnt hurt to reserve a bit more around the edges than to
accidentally give a page to the OS that the BIOS might rely upon.

Ingo

2008-02-24 20:54:04

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH] Fix alignment of early reservation for EBDA

On Sun, 24 Feb 2008 20:41:21 +0100, "Ingo Molnar" <[email protected]> said:
>
> * Alexander van Heukelum <[email protected]> wrote:
>
> > Hi Andi,
> >
> > My eyes fell on the following table in the boot messages:
> >
> > early res: 0 [0-fff] BIOS data page
> > early res: 1 [6000-7fff] SMP_TRAMPOLINE
> > early res: 2 [200000-374557] TEXT DATA BSS
> > early res: 3 [9fc00-a0bff] EBDA
> > early res: 4 [8000-afff] PGTABLE
> >
> > The memory reserved for the EBDA overflows into the area normally
> > reserved for the VGA adaptor. It seems that you wanted to force the
> > allocation to cover whole pages, like:
>
> well, that's what your EBDA descriptor says - it's set to 9fc00 which is
> 512 bytes below the VGA range.

It's 1024 bytes below, but yes, the EBDA starts there. Then the first
two
bytes of the EBDA contain the value 0x0001, which means that its size is
1kb, so the BIOS is correct.

> This behavior didnt really change over
> v2.6.24 (which reserved 'into' the VGA range too), it's just that in
> v2.6.25 we also print out these early reservations.

Correct. I thought it was new code, but looking more closely, the
behaviour
has indeed not changed recently (note to self: git log -p somefile.c
does
not indicate in any way that code was moved from some other place.)

> Can you see any
> regression? There should be no harm from overlapping into the VGA range
> - these "reservations" only make RAM unavailable for normal allocations.

No regressions, it's just cosmetic.

> your patch on the other hand rounds the EBDA area down which could in
> theory be unsafe on other boxes (where there could be real RAM above the
> EBDA area): the safest approach is to round the beginning of it down,
> the end of it up (to page boundary).

Not really: ebda_addr is just a local variable. If the system needs to
find the start of the EBDA, it will just have to look at the 16 bit
value
at address 0x40E again.

> Your patch _should_ be OK, but in
> practice it doesnt hurt to reserve a bit more around the edges than to
> accidentally give a page to the OS that the BIOS might rely upon.

The patch is exactly trying to do that. The code that was there seemed
to
imply that the author wanted to allocate whole pages, in such a way that
the allocation contained the whole EBDA. I think that is what it does
after this patch.

Greetings,
Alexander
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - And now for something completely different?

2008-02-25 02:18:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] Fix alignment of early reservation for EBDA

Alexander van Heukelum wrote:
> Hi Andi,
>
> My eyes fell on the following table in the boot messages:
>
> early res: 0 [0-fff] BIOS data page
> early res: 1 [6000-7fff] SMP_TRAMPOLINE
> early res: 2 [200000-374557] TEXT DATA BSS
> early res: 3 [9fc00-a0bff] EBDA
> early res: 4 [8000-afff] PGTABLE
>
> The memory reserved for the EBDA overflows into the area normally
> reserved for the VGA adaptor. It seems that you wanted to force
> the allocation to cover whole pages, like:
>
> early res: 3 [9f000-9ffff] EBDA
>
> This is what this patch implements.
>
> Is it really necessary to force the allocation to a page boundary?
>

It is, but that rounding gets done in reserve_bootmem() anyway, so there
is no need for the arch-specific code to do it.

The 32-bit EBDA code hard-codes a size of 4K, which is probably equally
wrong; my gut feel is that the right thing to do is to reserve from the
EBDA up to the 640K mark (some BIOSes use an area like that for SMM
stuff), possibly with some sanity checking.

-hpa

2008-02-25 16:54:50

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH] Fix alignment of early reservation for EBDA

On Sun, 24 Feb 2008 18:18:16 -0800, "H. Peter Anvin" <[email protected]>
said:
> Alexander van Heukelum wrote:
> > early res: 3 [9f000-9ffff] EBDA
> >
> > Is it really necessary to force the allocation to a page boundary?
>
> It is, but that rounding gets done in reserve_bootmem() anyway, so there
> is no need for the arch-specific code to do it.
>
> The 32-bit EBDA code hard-codes a size of 4K, which is probably equally
> wrong; my gut feel is that the right thing to do is to reserve from the
> EBDA up to the 640K mark (some BIOSes use an area like that for SMM
> stuff), possibly with some sanity checking.

Then, how about reserving everything from the end of conventional memory
up to the 1Mb mark? Like this:

/*
* The BIOS places the EBDA/XBDA at the top of conventional
* memory, and usually decreases the reported amount of
* conventional memory (int 0x12) too.
*/
static __init void reserve_ebda(void)
{
unsigned int lowmem, ebda_addr;

/* end of low (conventional) memory */
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;

/* start of EBDA area */
ebda_addr = *(unsigned short *)__va(BIOS_EBDA_SEGMENT);
ebda_addr <<= 4;

/* Fixup: bios puts an EBDA in the top 64K segment */
/* of conventional memory, but does not adjust lowmem. */
if ((lowmem - ebda_addr) <= 0x10000)
lowmem = ebda_addr;

/* Fixup: bios does not report an EBDA at all. */
/* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
if ((ebda_addr == 0) && (lowmem >= 0x9f000))
lowmem = 0x9f000;

/* Paranoia: should never happen, but... */
if (lowmem >= 0x100000)
lowmem = 0xa0000;

/* reserve all memory between lowmem and the 1MB mark */
reserve_early(lowmem, 0x100000, "BIOS reserved");
}

Greetings,
Alexander
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - Choose from over 50 domains or use your own

2008-02-25 17:01:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Fix alignment of early reservation for EBDA


* Alexander van Heukelum <[email protected]> wrote:

> Then, how about reserving everything from the end of conventional
> memory up to the 1Mb mark? Like this:

looks very good to me - could you send a patch against x86.git#testing?

Ingo

Subject: [PATCH] reserve_early end-of-conventional-memory to 1MB

Explicitly reserve_early the whole address range from the end of
conventional memory as reported by the bios data area up to the
1Mb mark. Regard the info retrieved from the BIOS data area with
a bit of paranoia, though, because some biosses forget to register
the EBDA correctly.

Signed-off-by: Alexander van Heukelum <[email protected]>

--

I think this patch is against -x86.git#testing :-/.

Greetings,
Alexander

arch/x86/kernel/head64.c | 45 +++++++++++++++++++++++++++------------------
1 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 38f32e7..b684552 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -49,33 +49,42 @@ static void __init copy_bootdata(char *real_mode_data)
}
}

-#define EBDA_ADDR_POINTER 0x40E
+#define BIOS_EBDA_SEGMENT 0x40E
+#define BIOS_LOWMEM_KILOBYTES 0x413

+/*
+ * The BIOS places the EBDA/XBDA at the top of conventional
+ * memory, and usually decreases the reported amount of
+ * conventional memory (int 0x12) too.
+ */
static __init void reserve_ebda(void)
{
- unsigned ebda_addr, ebda_size;
+ unsigned int lowmem, ebda_addr;

- /*
- * there is a real-mode segmented pointer pointing to the
- * 4K EBDA area at 0x40E
- */
- ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);
+ /* end of low (conventional) memory */
+ lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
+ lowmem <<= 10;
+
+ /* start of EBDA area */
+ ebda_addr = *(unsigned short *)__va(BIOS_EBDA_SEGMENT);
ebda_addr <<= 4;

- if (!ebda_addr)
- return;
+ /* Fixup: bios puts an EBDA in the top 64K segment */
+ /* of conventional memory, but does not adjust lowmem. */
+ if ((lowmem - ebda_addr) <= 0x10000)
+ lowmem = ebda_addr;

- ebda_size = *(unsigned short *)__va(ebda_addr);
+ /* Fixup: bios does not report an EBDA at all. */
+ /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
+ if ((ebda_addr == 0) && (lowmem >= 0x9f000))
+ lowmem = 0x9f000;

- /* Round EBDA up to pages */
- if (ebda_size == 0)
- ebda_size = 1;
- ebda_size <<= 10;
- ebda_size = round_up(ebda_size + (ebda_addr & ~PAGE_MASK), PAGE_SIZE);
- if (ebda_size > 64*1024)
- ebda_size = 64*1024;
+ /* Paranoia: should never happen, but... */
+ if (lowmem >= 0x100000)
+ lowmem = 0xa0000;

- reserve_early(ebda_addr, ebda_addr + ebda_size, "EBDA");
+ /* reserve all memory between lowmem and the 1MB mark */
+ reserve_early(lowmem, 0x100000, "BIOS reserved");
}

void __init x86_64_start_kernel(char * real_mode_data)

2008-02-25 18:14:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] reserve_early end-of-conventional-memory to 1MB

Alexander van Heukelum wrote:
>
> arch/x86/kernel/head64.c | 45 +++++++++++++++++++++++++++------------------
> 1 files changed, 27 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 38f32e7..b684552 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -49,33 +49,42 @@ static void __init copy_bootdata(char *real_mode_data)
> }
> }
>
> -#define EBDA_ADDR_POINTER 0x40E
> +#define BIOS_EBDA_SEGMENT 0x40E
> +#define BIOS_LOWMEM_KILOBYTES 0x413
>

Linus has specific comments in the 32-bit code that states we do not
want to use address 0x413 for anything - if nothing else because it's
often lowered when there is boottime code involved like CD-ROM booting
or PXE.

Either way, the code should be shared between 32 and 64 bits. There is
nothing bitsize-specific about it!

-hpa

2008-02-25 19:47:27

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH] reserve_early end-of-conventional-memory to 1MB

On Mon, 25 Feb 2008 10:13:17 -0800, "H. Peter Anvin" <[email protected]>
said:
> Alexander van Heukelum wrote:
> >
> > arch/x86/kernel/head64.c | 45 +++++++++++++++++++++++++++------------------
> > 1 files changed, 27 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> > index 38f32e7..b684552 100644
> > --- a/arch/x86/kernel/head64.c
> > +++ b/arch/x86/kernel/head64.c
> > @@ -49,33 +49,42 @@ static void __init copy_bootdata(char *real_mode_data)
> > }
> > }
> >
> > -#define EBDA_ADDR_POINTER 0x40E
> > +#define BIOS_EBDA_SEGMENT 0x40E
> > +#define BIOS_LOWMEM_KILOBYTES 0x413
> >
>
> Linus has specific comments in the 32-bit code that states we do not
> want to use address 0x413 for anything - if nothing else because it's
> often lowered when there is boottime code involved like CD-ROM booting
> or PXE.

Hello hpa,

I failed to find a comment referring to 0x413 or int 0x12 in
arch/x86/kernel. I do know the comment in Documentation/i386/boot.txt,
however, suggesting that "INT 12h" _should_ be used, but that it would
be pointless for zImage and old bzImage kernels, because they _have_
to be started from 0x90000 anyway. New bzImage kernels do not have
this limitation, and smart bootloaders simply put them at much
lower addresses, like 0x40000. (I know: you know.) My point is just
that the argument not to use 0x413 in the bootloader is not valid
for this case.

That leaves the possibility that code/data is inserted at the top of
conventional memory by either the BIOS or some kind of bootloader.
My view on this is that this code/data should be preserved: it
was specifically asked from us by lowering 0x413. One just loses
a tiny bit of usable memory, and if the program booting the kernel
knows better it can unload the driver if it wants to. It's just
not a kernel problem.

> Either way, the code should be shared between 32 and 64 bits.
> There is nothing bitsize-specific about it!

Of course. That's also why I already added the old-Dell case ;).
But one problem at a time, please!

Greetings,
Alexander
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - I mean, what is it about a decent email service?

2008-02-25 21:18:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] reserve_early end-of-conventional-memory to 1MB

Alexander van Heukelum wrote:
>
> Hello hpa,
>
> I failed to find a comment referring to 0x413 or int 0x12 in
> arch/x86/kernel. I do know the comment in Documentation/i386/boot.txt,
> however, suggesting that "INT 12h" _should_ be used, but that it would
> be pointless for zImage and old bzImage kernels, because they _have_
> to be started from 0x90000 anyway. New bzImage kernels do not have
> this limitation, and smart bootloaders simply put them at much
> lower addresses, like 0x40000. (I know: you know.) My point is just
> that the argument not to use 0x413 in the bootloader is not valid
> for this case.
>

OK, let me see if I can dig up the comment I thought I found at one
point, and maybe found out the history behind it. Either that, or find
out that my memory is faulty ;)

-hpa

2008-02-26 09:31:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] reserve_early end-of-conventional-memory to 1MB


* Alexander van Heukelum <[email protected]> wrote:

> > Either way, the code should be shared between 32 and 64 bits. There
> > is nothing bitsize-specific about it!
>
> Of course. That's also why I already added the old-Dell case ;). But
> one problem at a time, please!

i've applied your patch to x86.git#testing. Feel free to send a
code-unification patch too :-)

Ingo

2008-02-27 14:23:35

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Fix alignment of early reservation for EBDA

Alexander van Heukelum wrote:
> On Sun, 24 Feb 2008 18:18:16 -0800, "H. Peter Anvin" <[email protected]>
> said:
>> Alexander van Heukelum wrote:
>>> early res: 3 [9f000-9ffff] EBDA
>>>
>>> Is it really necessary to force the allocation to a page boundary?
>> It is, but that rounding gets done in reserve_bootmem() anyway, so there
>> is no need for the arch-specific code to do it.
>>
>> The 32-bit EBDA code hard-codes a size of 4K, which is probably equally
>> wrong; my gut feel is that the right thing to do is to reserve from the
>> EBDA up to the 640K mark (some BIOSes use an area like that for SMM
>> stuff), possibly with some sanity checking.

It's not needed, the e820 maps are always correct for modern systems in
this case as far as I know.


>
> /* reserve all memory between lowmem and the 1MB mark */
> reserve_early(lowmem, 0x100000, "BIOS reserved");

The i386 kernel did this always, but I intentionally removed it from the
64bit kernel because all the modern BIOS seem to correctly report holes
in this area. Only didn't do it on i386 because there were some concerns
of very old systems not doing this correctly.

My suspicion is that modern Windows systems rely on this, that is why
BIOSes typically get it correct now.

I think it should be only undone if you have a concrete case where it
breaks not just based on someone's gut feel. Sure it's not a lot of
memory, but why waste memory unnecessarily?

-Andi

2008-02-27 14:24:55

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] reserve_early end-of-conventional-memory to 1MB

Ingo Molnar wrote:
> * Alexander van Heukelum <[email protected]> wrote:
>
>>> Either way, the code should be shared between 32 and 64 bits. There
>>> is nothing bitsize-specific about it!
>> Of course. That's also why I already added the old-Dell case ;). But
>> one problem at a time, please!
>
> i've applied your patch to x86.git#testing. Feel free to send a
> code-unification patch too :-)

So you've basically wasted a lot of memory with no concrete breakage
known ...Please reconsider.

-Andi

2008-02-27 14:36:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] reserve_early end-of-conventional-memory to 1MB II - some numbers to put it into perspective

Ingo Molnar wrote:
> * Alexander van Heukelum <[email protected]> wrote:
>
>>> Either way, the code should be shared between 32 and 64 bits. There
>>> is nothing bitsize-specific about it!
>> Of course. That's also why I already added the old-Dell case ;). But
>> one problem at a time, please!
>
> i've applied your patch to x86.git#testing. Feel free to send a
> code-unification patch too :-)

Just to give some perspective of this:

On my laptop here

BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000d2000 - 0000000000100000 (reserved)

This means it reserves only ~193KB in the 640k-1MB area

With this patch it will reserve 384KB instead. This means 191KB
are lost. While that doesn't sound too much it worth as much as
382 patches that reduce kernel code size by 512bytes or
worth 3820 patches that reduce kernel code by 100 bytes in terms
of memory consumption.

Now such kernel code size patches are always popular, but why undo that
work by throwing away perfectly good memory elsewhere?

Or also the laptop kernel does

Freeing unused kernel memory: 340k freed

This means the 193KB now lost are worth 56% of the complete
memory that is freed by __initdata/__init. Just maintaining
these annotations is a lot of work, but why do all that if we
then throw away than half as much memory as they save so easily?

-Andi

2008-02-27 16:45:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] reserve_early end-of-conventional-memory to 1MB II - some numbers to put it into perspective

Andi Kleen wrote:
>
> Just to give some perspective of this:
>
> On my laptop here
>
> BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000d2000 - 0000000000100000 (reserved)
>
> This means it reserves only ~193KB in the 640k-1MB area
>
> With this patch it will reserve 384KB instead. This means 191KB
> are lost. While that doesn't sound too much it worth as much as
> 382 patches that reduce kernel code size by 512bytes or
> worth 3820 patches that reduce kernel code by 100 bytes in terms
> of memory consumption.
>
> Now such kernel code size patches are always popular, but why undo that
> work by throwing away perfectly good memory elsewhere?
>
> Or also the laptop kernel does
>
> Freeing unused kernel memory: 340k freed
>
> This means the 193KB now lost are worth 56% of the complete
> memory that is freed by __initdata/__init. Just maintaining
> these annotations is a lot of work, but why do all that if we
> then throw away than half as much memory as they save so easily?
>

It doesn't waste any memory at all.

It arguably wastes some *address space*, but that's an entirely
different thing.

Unless you have chipset-specific drivers to enable memory in the
640-1024K memory area, there is nothing there.

-hpa

2008-02-27 20:01:46

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH] reserve_early end-of-conventional-memory to 1MB II - some numbers to put it into perspective


On Wed, 27 Feb 2008 15:38:28 +0100, "Andi Kleen" <[email protected]> said:
> Ingo Molnar wrote:
> > * Alexander van Heukelum <[email protected]> wrote:
> >
> >>> Either way, the code should be shared between 32 and 64 bits. There
> >>> is nothing bitsize-specific about it!
> >> Of course. That's also why I already added the old-Dell case ;). But
> >> one problem at a time, please!
> >
> > i've applied your patch to x86.git#testing. Feel free to send a
> > code-unification patch too :-)
>
> Just to give some perspective of this:
>
> On my laptop here
>
> BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000d2000 - 0000000000100000 (reserved)

Hi Andi,

Can you provide the complete 'raw' e820 info? This is at least
not complete, and also might be after the sanitation. Do you
mean that (1) there is usable RAM somewhere between 0xa0000
and 0xd2000? Or that (2) this second line should be RAM, not
reserved?

Case (1) would surpise me, because I expect the VGA adapter
(which I assume is there...) to occupy 0xa0000 to 0xc0000 for
the framebuffer. Also, your case would be a lot stronger if
there were a line that explicitly indicated that there was
RAM there.

Case (2) would surprise me too, because a lot of software
would expect the system BIOS to reside in at least the area
0xf0000 to 0x100000. jmp 0xf000:fff0 for a reboot, no?

Laptops do not always have the VGA bios in the standard place,
so the range 0xc0000-0xd2000 could well be unused. Still, I
doubt that there is real RAM accessible in that region.

So I think you have not correctly interpreted the e820 info.
But, if you (or anyone else, for that matter) provide(s) a raw
e820 map that shows usable RAM in the region 0xa0000-0x100000,
the I agree that the patch should be reconsidered.

> This means it reserves only ~193KB in the 640k-1MB area
>
> With this patch it will reserve 384KB instead. This means 191KB
> are lost.

The two lines you gave say that two regions are reserved. Nothing
tells what is in between those regions. If a region is not
covered by e820 at all, it is to be considered unusable, right?

> While that doesn't sound too much it worth as much as
> 382 patches that reduce kernel code size by 512bytes or
> worth 3820 patches that reduce kernel code by 100 bytes in terms
> of memory consumption.

Agreed.

> Now such kernel code size patches are always popular, but why
> undo that work by throwing away perfectly good memory elsewhere?

I don't intend to. I have never seen a machine with usable
memory in the 0xa000-0x100000 region.

> Or also the laptop kernel does
>
> Freeing unused kernel memory: 340k freed
>
> This means the 193KB now lost are worth 56% of the complete
> memory that is freed by __initdata/__init. Just maintaining
> these annotations is a lot of work, but why do all that if we
> then throw away than half as much memory as they save so easily?

Agreed, if...

Greetings,
Alexander
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - mmm... Fastmail...

Subject: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit

The 32-bit code still uses reserve_bootmem, so this is not really
a unification with the 64-bit version of the ebda reservation code,
but at least it provides the same detection logic and reserves the
same areas.

This does not crash immediately on qemu. No further testing was
done! Otherwise:

Signed-off-by: Alexander van Heukelum <[email protected]>

---

Hello Ian, Ingo,

This patch should do the same reservations as the reserve_early
patch for 64 bit. Maybe you could try if this solves the problems
you are seeing with Xen?

On the other hand, Xen should be able to use those legacy ranges
as normal memory. Why doesn't it work? Does Xen use a special
bootloader? In that case it could just add the reservation to
the e820 memory map that is provided.

Another possible approach to the whole problem is just to ammend
the e820 memory map in the boot-code. Thanks to hpa, it is now easy
to change this code. Then the kernel can just trust the e820 memory
map, and only 'sophisticated' bootloaders need to do something
similar by themselves.

Greetings,
Alexander

diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
index a1d7071..e12cc31 100644
--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -385,15 +385,47 @@ unsigned long __init find_max_low_pfn(void)
return max_low_pfn;
}

+#define BIOS_EBDA_SEGMENT 0x40E
+#define BIOS_LOWMEM_KILOBYTES 0x413
+
/*
- * workaround for Dell systems that neglect to reserve EBDA
+ * The BIOS places the EBDA/XBDA at the top of conventional
+ * memory, and usually decreases the reported amount of
+ * conventional memory (int 0x12) too. This also contains a
+ * workaround for Dell systems that neglect to reserve EBDA.
+ * The same workaround also avoids a problem with the AMD768MPX
+ * chipset: reserve a page before VGA to prevent PCI prefetch
+ * into it (errata #56). Usually the page is reserved anyways,
+ * unless you have no PS/2 mouse plugged in.
*/
static void __init reserve_ebda_region(void)
{
- unsigned int addr;
- addr = get_bios_ebda();
- if (addr)
- reserve_bootmem(addr, PAGE_SIZE, BOOTMEM_DEFAULT);
+ unsigned int lowmem, ebda_addr;
+
+ /* end of low (conventional) memory */
+ lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
+ lowmem <<= 10;
+
+ /* start of EBDA area */
+ ebda_addr = *(unsigned short *)__va(BIOS_EBDA_SEGMENT);
+ ebda_addr <<= 4;
+
+ /* Fixup: bios puts an EBDA in the top 64K segment */
+ /* of conventional memory, but does not adjust lowmem. */
+ if ((lowmem - ebda_addr) <= 0x10000)
+ lowmem = ebda_addr;
+
+ /* Fixup: bios does not report an EBDA at all. */
+ /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
+ if ((ebda_addr == 0) && (lowmem >= 0x9f000))
+ lowmem = 0x9f000;
+
+ /* Paranoia: should never happen, but... */
+ if (lowmem >= 0x100000)
+ lowmem = 0xa0000;
+
+ /* reserve all memory between lowmem and the 1MB mark */
+ reserve_bootmem(lowmem, 0x100000 - lowmem, BOOTMEM_DEFAULT);
}

#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -619,16 +651,9 @@ void __init setup_bootmem_allocator(void)
*/
reserve_bootmem(0, PAGE_SIZE, BOOTMEM_DEFAULT);

- /* reserve EBDA region, it's a 4K region */
+ /* reserve EBDA region */
reserve_ebda_region();

- /* could be an AMD 768MPX chipset. Reserve a page before VGA to prevent
- PCI prefetch into it (errata #56). Usually the page is reserved anyways,
- unless you have no PS/2 mouse plugged in. */
- if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
- boot_cpu_data.x86 == 6)
- reserve_bootmem(0xa0000 - 4096, 4096, BOOTMEM_DEFAULT);
-
#ifdef CONFIG_SMP
/*
* But first pinch a few for the stack/trampoline stuff

Subject: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Instead of using early reservations inside the kernel code,
we could use the realmode code to modify the e820 memmap.
This patch shows what that would look like. I have not looked
at the case where the BIOS does not provide an e820 memmap
yet. Probably a full solution would need to create a fake
e820 memmap in that case.

Comments?

Signed-off-by: Alexander van Heukelum <[email protected]>

diff --git a/arch/x86/boot/memory.c b/arch/x86/boot/memory.c
index e77d89f..522920a 100644
--- a/arch/x86/boot/memory.c
+++ b/arch/x86/boot/memory.c
@@ -23,6 +23,7 @@ static int detect_memory_e820(void)
int count = 0;
u32 next = 0;
u32 size, id;
+ u32 lowmem, ebda_addr;
u8 err;
struct e820entry *desc = boot_params.e820_map;

@@ -50,13 +51,51 @@ static int detect_memory_e820(void)
just return failure. */
if (id != SMAP) {
count = 0;
- break;
+ goto out;
}

count++;
desc++;
- } while (next && count < E820MAX);
-
+ } while (next && count < (E820MAX - 1));
+
+ /* Some BIOSes do not reserve the EBDA/XBDA area correctly in.
+ The e820-map. Find out where the EBDA resides by looking at
+ the BIOS data area and reserve the EBDA and the following
+ legacy adapter area explicitly. */
+#define BIOS_EBDA_SEGMENT 0x40E
+#define BIOS_LOWMEM_KILOBYTES 0x413
+
+ /* end of low (conventional) memory */
+ set_fs(0);
+ lowmem = rdfs16(BIOS_LOWMEM_KILOBYTES);
+ lowmem <<= 10;
+
+ /* start of EBDA area */
+ ebda_addr = rdfs16(BIOS_EBDA_SEGMENT);
+ ebda_addr <<= 4;
+
+ /* Fixup: bios puts an EBDA in the top 64K segment */
+ /* of conventional memory, but does not adjust lowmem. */
+ if ((lowmem - ebda_addr) <= 0x10000)
+ lowmem = ebda_addr;
+
+ /* Fixup: bios does not report an EBDA at all. */
+ /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
+ if ((ebda_addr == 0) && (lowmem >= 0x9f000))
+ lowmem = 0x9f000;
+
+ /* Paranoia: should never happen, but... */
+ if (lowmem >= 0x100000)
+ lowmem = 0xa0000;
+
+ /* reserve all memory between lowmem and the 1MB mark */
+ desc->addr = lowmem;
+ desc->size = 0x100000 - lowmem;
+ desc->type = E820_RESERVED;
+ count++;
+ desc++;
+
+out:
return boot_params.e820_entries = count;
}

2008-02-28 21:10:42

by Ian Campbell

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit


On Thu, 2008-02-28 at 14:13 +0100, Alexander van Heukelum wrote:
> The 32-bit code still uses reserve_bootmem, so this is not really
> a unification with the 64-bit version of the ebda reservation code,
> but at least it provides the same detection logic and reserves the
> same areas.
>
> This does not crash immediately on qemu. No further testing was
> done! Otherwise:

I haven't tested extensively either but it does seem to solve the
problem for Xen.

Thanks!
Ian

>
> Signed-off-by: Alexander van Heukelum <[email protected]>
>
> ---
>
> Hello Ian, Ingo,
>
> This patch should do the same reservations as the reserve_early
> patch for 64 bit. Maybe you could try if this solves the problems
> you are seeing with Xen?
>
> On the other hand, Xen should be able to use those legacy ranges
> as normal memory. Why doesn't it work? Does Xen use a special
> bootloader? In that case it could just add the reservation to
> the e820 memory map that is provided.
>
> Another possible approach to the whole problem is just to ammend
> the e820 memory map in the boot-code. Thanks to hpa, it is now easy
> to change this code. Then the kernel can just trust the e820 memory
> map, and only 'sophisticated' bootloaders need to do something
> similar by themselves.
>
> Greetings,
> Alexander
>
> diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
> index a1d7071..e12cc31 100644
> --- a/arch/x86/kernel/setup_32.c
> +++ b/arch/x86/kernel/setup_32.c
> @@ -385,15 +385,47 @@ unsigned long __init find_max_low_pfn(void)
> return max_low_pfn;
> }
>
> +#define BIOS_EBDA_SEGMENT 0x40E
> +#define BIOS_LOWMEM_KILOBYTES 0x413
> +
> /*
> - * workaround for Dell systems that neglect to reserve EBDA
> + * The BIOS places the EBDA/XBDA at the top of conventional
> + * memory, and usually decreases the reported amount of
> + * conventional memory (int 0x12) too. This also contains a
> + * workaround for Dell systems that neglect to reserve EBDA.
> + * The same workaround also avoids a problem with the AMD768MPX
> + * chipset: reserve a page before VGA to prevent PCI prefetch
> + * into it (errata #56). Usually the page is reserved anyways,
> + * unless you have no PS/2 mouse plugged in.
> */
> static void __init reserve_ebda_region(void)
> {
> - unsigned int addr;
> - addr = get_bios_ebda();
> - if (addr)
> - reserve_bootmem(addr, PAGE_SIZE, BOOTMEM_DEFAULT);
> + unsigned int lowmem, ebda_addr;
> +
> + /* end of low (conventional) memory */
> + lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
> + lowmem <<= 10;
> +
> + /* start of EBDA area */
> + ebda_addr = *(unsigned short *)__va(BIOS_EBDA_SEGMENT);
> + ebda_addr <<= 4;
> +
> + /* Fixup: bios puts an EBDA in the top 64K segment */
> + /* of conventional memory, but does not adjust lowmem. */
> + if ((lowmem - ebda_addr) <= 0x10000)
> + lowmem = ebda_addr;
> +
> + /* Fixup: bios does not report an EBDA at all. */
> + /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
> + if ((ebda_addr == 0) && (lowmem >= 0x9f000))
> + lowmem = 0x9f000;
> +
> + /* Paranoia: should never happen, but... */
> + if (lowmem >= 0x100000)
> + lowmem = 0xa0000;
> +
> + /* reserve all memory between lowmem and the 1MB mark */
> + reserve_bootmem(lowmem, 0x100000 - lowmem, BOOTMEM_DEFAULT);
> }
>
> #ifndef CONFIG_NEED_MULTIPLE_NODES
> @@ -619,16 +651,9 @@ void __init setup_bootmem_allocator(void)
> */
> reserve_bootmem(0, PAGE_SIZE, BOOTMEM_DEFAULT);
>
> - /* reserve EBDA region, it's a 4K region */
> + /* reserve EBDA region */
> reserve_ebda_region();
>
> - /* could be an AMD 768MPX chipset. Reserve a page before VGA to prevent
> - PCI prefetch into it (errata #56). Usually the page is reserved anyways,
> - unless you have no PS/2 mouse plugged in. */
> - if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> - boot_cpu_data.x86 == 6)
> - reserve_bootmem(0xa0000 - 4096, 4096, BOOTMEM_DEFAULT);
> -
> #ifdef CONFIG_SMP
> /*
> * But first pinch a few for the stack/trampoline stuff
>
--
Ian Campbell

In spite of everything, I still believe that people are good at heart.
-- Anne Frank

2008-02-28 21:12:44

by Ian Campbell

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB


On Thu, 2008-02-28 at 14:28 +0100, Alexander van Heukelum wrote:
> Instead of using early reservations inside the kernel code,
> we could use the realmode code to modify the e820 memmap.
> This patch shows what that would look like. I have not looked
> at the case where the BIOS does not provide an e820 memmap
> yet. Probably a full solution would need to create a fake
> e820 memmap in that case.

An e820 is already faked up in machine_specific_memory_setup() if one
doesn't already exist.

> Comments?

This won't work for Xen since the real-mode code never runs there. I
think it could be fixed in xen_memory_setup() though if native goes down
this route.

Ian.

>
> Signed-off-by: Alexander van Heukelum <[email protected]>
>
> diff --git a/arch/x86/boot/memory.c b/arch/x86/boot/memory.c
> index e77d89f..522920a 100644
> --- a/arch/x86/boot/memory.c
> +++ b/arch/x86/boot/memory.c
> @@ -23,6 +23,7 @@ static int detect_memory_e820(void)
> int count = 0;
> u32 next = 0;
> u32 size, id;
> + u32 lowmem, ebda_addr;
> u8 err;
> struct e820entry *desc = boot_params.e820_map;
>
> @@ -50,13 +51,51 @@ static int detect_memory_e820(void)
> just return failure. */
> if (id != SMAP) {
> count = 0;
> - break;
> + goto out;
> }
>
> count++;
> desc++;
> - } while (next && count < E820MAX);
> -
> + } while (next && count < (E820MAX - 1));
> +
> + /* Some BIOSes do not reserve the EBDA/XBDA area correctly in.
> + The e820-map. Find out where the EBDA resides by looking at
> + the BIOS data area and reserve the EBDA and the following
> + legacy adapter area explicitly. */
> +#define BIOS_EBDA_SEGMENT 0x40E
> +#define BIOS_LOWMEM_KILOBYTES 0x413
> +
> + /* end of low (conventional) memory */
> + set_fs(0);
> + lowmem = rdfs16(BIOS_LOWMEM_KILOBYTES);
> + lowmem <<= 10;
> +
> + /* start of EBDA area */
> + ebda_addr = rdfs16(BIOS_EBDA_SEGMENT);
> + ebda_addr <<= 4;
> +
> + /* Fixup: bios puts an EBDA in the top 64K segment */
> + /* of conventional memory, but does not adjust lowmem. */
> + if ((lowmem - ebda_addr) <= 0x10000)
> + lowmem = ebda_addr;
> +
> + /* Fixup: bios does not report an EBDA at all. */
> + /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
> + if ((ebda_addr == 0) && (lowmem >= 0x9f000))
> + lowmem = 0x9f000;
> +
> + /* Paranoia: should never happen, but... */
> + if (lowmem >= 0x100000)
> + lowmem = 0xa0000;
> +
> + /* reserve all memory between lowmem and the 1MB mark */
> + desc->addr = lowmem;
> + desc->size = 0x100000 - lowmem;
> + desc->type = E820_RESERVED;
> + count++;
> + desc++;
> +
> +out:
> return boot_params.e820_entries = count;
> }
>
>
>
--
Ian Campbell

Nasrudin walked into a teahouse and declaimed, "The moon is more useful
than the sun."
"Why?", he was asked.
"Because at night we need the light more."

2008-02-28 21:19:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Ian Campbell wrote:
> On Thu, 2008-02-28 at 14:28 +0100, Alexander van Heukelum wrote:
>> Instead of using early reservations inside the kernel code,
>> we could use the realmode code to modify the e820 memmap.
>> This patch shows what that would look like. I have not looked
>> at the case where the BIOS does not provide an e820 memmap
>> yet. Probably a full solution would need to create a fake
>> e820 memmap in that case.
>
> An e820 is already faked up in machine_specific_memory_setup() if one
> doesn't already exist.
>
>> Comments?
>
> This won't work for Xen since the real-mode code never runs there. I
> think it could be fixed in xen_memory_setup() though if native goes down
> this route.
>

s/could/should/.

You need to set up your memory map more sensibly; it's not just the
kernel, user space tries to access these areas too.

-hpa

2008-02-28 23:17:27

by Ian Campbell

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB


On Thu, 2008-02-28 at 13:14 -0800, H. Peter Anvin wrote:
> Ian Campbell wrote:
> > On Thu, 2008-02-28 at 14:28 +0100, Alexander van Heukelum wrote:
> >> Instead of using early reservations inside the kernel code,
> >> we could use the realmode code to modify the e820 memmap.
> >> This patch shows what that would look like. I have not looked
> >> at the case where the BIOS does not provide an e820 memmap
> >> yet. Probably a full solution would need to create a fake
> >> e820 memmap in that case.
> >
> > An e820 is already faked up in machine_specific_memory_setup() if one
> > doesn't already exist.
> >
> >> Comments?
> >
> > This won't work for Xen since the real-mode code never runs there. I
> > think it could be fixed in xen_memory_setup() though if native goes down
> > this route.
> >
>
> s/could/should/.
>
> You need to set up your memory map more sensibly; it's not just the
> kernel, user space tries to access these areas too.

Agreed, I think it's pure luck that Xen kernels have gotten away with it
in the past.

The patch below seems like the right thing to do. It certainly boots in
a domU without the DMI problem (without any of the other related patches
such as Alexander's).

However ddcprobe hangs when run -- need to investigate some more, vm86
in general works ok (i.e. my vm86 test code passes).

BTW Jeremy, the kernel doesn't use XENMEM_memory_map -- any reason other
than it not being useful at the time? These days the tools can push an
arbitrary e820 down for a guest which might be useful to support,
although nothing interesting is done with it today.

Ian.

x86/xen: Construct e820 map with a hole between 640K-1M.

Signed-off-by: Ian Campbell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/setup.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 3bad477..2341492 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -38,7 +38,8 @@ char * __init xen_memory_setup(void)
unsigned long max_pfn = xen_start_info->nr_pages;

e820.nr_map = 0;
- add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM);
+ add_memory_region(0, LOWMEMSIZE(), E820_RAM);
+ add_memory_region(HIGH_MEMORY, PFN_PHYS(max_pfn)-HIGH_MEMORY, E820_RAM);

return "Xen";
}
--
1.5.4.2


--
Ian Campbell

His mind is like a steel trap: full of mice.
-- Foghorn Leghorn

Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit

This patch adds explicit detection of the EBDA and reservation
of the rom and adapter address space 0xa0000-0x100000 to the
i386 kernels. It uses reserve_bootmem instead of reserve_early,
because reserve_early is not yet available on i386.

Before this patch, the EBDA size was hardcoded as 4Kb. Also, the
reservation of the adapter range was done by modifying the e820
map which is now not necessary any longer, and the code is removed
from copy_e820_map.

The changes in e820_64.c are only a change in the comment above
copy_e820_map, and some changes of the types of local variables
in that function such that the 32 and 64 bit versions become equal.

Signed-off-by: Alexander van Heukelum <[email protected]>

---

On Thu, Feb 28, 2008 at 09:09:56PM +0000, Ian Campbell wrote:
> On Thu, 2008-02-28 at 14:13 +0100, Alexander van Heukelum wrote:
> > The 32-bit code still uses reserve_bootmem, so this is not really
> > a unification with the 64-bit version of the ebda reservation code,
> > but at least it provides the same detection logic and reserves the
> > same areas.
> >
> > This does not crash immediately on qemu. No further testing was
> > done! Otherwise:
>
> I haven't tested extensively either but it does seem to solve the
> problem for Xen.
>
> Thanks!
> Ian

Thank you!

Ingo,

I think this is ready for -x86#testing.
It boots to a small userspace in qemu (i386).
If I should separate the cleanups, let me know.

Greetings,
Alexander

arch/x86/kernel/e820_32.c | 30 +++++++-------------------
arch/x86/kernel/e820_64.c | 15 +++++++++---
arch/x86/kernel/setup_32.c | 51 ++++++++++++++++++++++++++++++++-----------
3 files changed, 57 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kernel/e820_32.c b/arch/x86/kernel/e820_32.c
index 4e16ef4..8ea7db2 100644
--- a/arch/x86/kernel/e820_32.c
+++ b/arch/x86/kernel/e820_32.c
@@ -444,44 +444,30 @@ int __init sanitize_e820_map(struct e820entry * biosmap, char * pnr_map)
* set up memory. If we aren't, we'll fake a memory map.
*
* We check to see that the memory map contains at least 2 elements
- * before we'll use it, because the detection code in setup.S may
- * not be perfect and most every PC known to man has two memory
+ * before we'll use it, because the detection code in boot/memory.c
+ * may not be perfect and every PC known to man has two memory
* regions: one from 0 to 640k, and one from 1mb up. (The IBM
* thinkpad 560x, for example, does not cooperate with the memory
* detection code.)
*/
-int __init copy_e820_map(struct e820entry * biosmap, int nr_map)
+int __init copy_e820_map(struct e820entry *biosmap, int nr_map)
{
/* Only one memory region (or negative)? Ignore it */
if (nr_map < 2)
return -1;

do {
- unsigned long long start = biosmap->addr;
- unsigned long long size = biosmap->size;
- unsigned long long end = start + size;
- unsigned long type = biosmap->type;
+ u64 start = biosmap->addr;
+ u64 size = biosmap->size;
+ u64 end = start + size;
+ u32 type = biosmap->type;

/* Overflow in 64 bits? Ignore the memory map. */
if (start > end)
return -1;

- /*
- * Some BIOSes claim RAM in the 640k - 1M region.
- * Not right. Fix it up.
- */
- if (type == E820_RAM) {
- if (start < 0x100000ULL && end > 0xA0000ULL) {
- if (start < 0xA0000ULL)
- add_memory_region(start, 0xA0000ULL-start, type);
- if (end <= 0x100000ULL)
- continue;
- start = 0x100000ULL;
- size = end - start;
- }
- }
add_memory_region(start, size, type);
- } while (biosmap++,--nr_map);
+ } while (biosmap++, --nr_map);
return 0;
}

diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index dd68cfd..e7f0400 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -613,6 +613,13 @@ static int __init sanitize_e820_map(struct e820entry *biosmap, char *pnr_map)
* If we're lucky and live on a modern system, the setup code
* will have given us a memory map that we can use to properly
* set up memory. If we aren't, we'll fake a memory map.
+ *
+ * We check to see that the memory map contains at least 2 elements
+ * before we'll use it, because the detection code in boot/memory.c
+ * may not be perfect and every PC known to man has two memory
+ * regions: one from 0 to 640k, and one from 1mb up. (The IBM
+ * thinkpad 560x, for example, does not cooperate with the memory
+ * detection code.)
*/
static int __init copy_e820_map(struct e820entry *biosmap, int nr_map)
{
@@ -621,10 +628,10 @@ static int __init copy_e820_map(struct e820entry *biosmap, int nr_map)
return -1;

do {
- unsigned long start = biosmap->addr;
- unsigned long size = biosmap->size;
- unsigned long end = start + size;
- unsigned long type = biosmap->type;
+ u64 start = biosmap->addr;
+ u64 size = biosmap->size;
+ u64 end = start + size;
+ u32 type = biosmap->type;

/* Overflow in 64 bits? Ignore the memory map. */
if (start > end)
diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
index a1d7071..e12cc31 100644
--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -385,15 +385,47 @@ unsigned long __init find_max_low_pfn(void)
return max_low_pfn;
}

+#define BIOS_EBDA_SEGMENT 0x40E
+#define BIOS_LOWMEM_KILOBYTES 0x413
+
/*
- * workaround for Dell systems that neglect to reserve EBDA
+ * The BIOS places the EBDA/XBDA at the top of conventional
+ * memory, and usually decreases the reported amount of
+ * conventional memory (int 0x12) too. This also contains a
+ * workaround for Dell systems that neglect to reserve EBDA.
+ * The same workaround also avoids a problem with the AMD768MPX
+ * chipset: reserve a page before VGA to prevent PCI prefetch
+ * into it (errata #56). Usually the page is reserved anyways,
+ * unless you have no PS/2 mouse plugged in.
*/
static void __init reserve_ebda_region(void)
{
- unsigned int addr;
- addr = get_bios_ebda();
- if (addr)
- reserve_bootmem(addr, PAGE_SIZE, BOOTMEM_DEFAULT);
+ unsigned int lowmem, ebda_addr;
+
+ /* end of low (conventional) memory */
+ lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
+ lowmem <<= 10;
+
+ /* start of EBDA area */
+ ebda_addr = *(unsigned short *)__va(BIOS_EBDA_SEGMENT);
+ ebda_addr <<= 4;
+
+ /* Fixup: bios puts an EBDA in the top 64K segment */
+ /* of conventional memory, but does not adjust lowmem. */
+ if ((lowmem - ebda_addr) <= 0x10000)
+ lowmem = ebda_addr;
+
+ /* Fixup: bios does not report an EBDA at all. */
+ /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
+ if ((ebda_addr == 0) && (lowmem >= 0x9f000))
+ lowmem = 0x9f000;
+
+ /* Paranoia: should never happen, but... */
+ if (lowmem >= 0x100000)
+ lowmem = 0xa0000;
+
+ /* reserve all memory between lowmem and the 1MB mark */
+ reserve_bootmem(lowmem, 0x100000 - lowmem, BOOTMEM_DEFAULT);
}

#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -619,16 +651,9 @@ void __init setup_bootmem_allocator(void)
*/
reserve_bootmem(0, PAGE_SIZE, BOOTMEM_DEFAULT);

- /* reserve EBDA region, it's a 4K region */
+ /* reserve EBDA region */
reserve_ebda_region();

- /* could be an AMD 768MPX chipset. Reserve a page before VGA to prevent
- PCI prefetch into it (errata #56). Usually the page is reserved anyways,
- unless you have no PS/2 mouse plugged in. */
- if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
- boot_cpu_data.x86 == 6)
- reserve_bootmem(0xa0000 - 4096, 4096, BOOTMEM_DEFAULT);
-
#ifdef CONFIG_SMP
/*
* But first pinch a few for the stack/trampoline stuff

2008-02-29 17:15:22

by Mark McLoughlin

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit

On Fri, 2008-02-29 at 12:49 +0100, Alexander van Heukelum wrote:
> This patch adds explicit detection of the EBDA and reservation
> of the rom and adapter address space 0xa0000-0x100000 to the
> i386 kernels. It uses reserve_bootmem instead of reserve_early,
> because reserve_early is not yet available on i386.
>
> Before this patch, the EBDA size was hardcoded as 4Kb. Also, the
> reservation of the adapter range was done by modifying the e820
> map which is now not necessary any longer, and the code is removed
> from copy_e820_map.
>
> The changes in e820_64.c are only a change in the comment above
> copy_e820_map, and some changes of the types of local variables
> in that function such that the 32 and 64 bit versions become equal.
>
> Signed-off-by: Alexander van Heukelum <[email protected]>
>
> ---
>
> On Thu, Feb 28, 2008 at 09:09:56PM +0000, Ian Campbell wrote:
> > On Thu, 2008-02-28 at 14:13 +0100, Alexander van Heukelum wrote:
> > > The 32-bit code still uses reserve_bootmem, so this is not really
> > > a unification with the 64-bit version of the ebda reservation code,
> > > but at least it provides the same detection logic and reserves the
> > > same areas.
> > >
> > > This does not crash immediately on qemu. No further testing was
> > > done! Otherwise:
> >
> > I haven't tested extensively either but it does seem to solve the
> > problem for Xen.
> >
> > Thanks!
> > Ian
>
> Thank you!
>
> Ingo,
>
> I think this is ready for -x86#testing.
> It boots to a small userspace in qemu (i386).
> If I should separate the cleanups, let me know.

I haven't investigated in any detail, but with 2.6.25-rc3 and your
patch I'm seeing a Xen guest hit this BUG:

void __init smp_alloc_memory(void)
{
trampoline_base = alloc_bootmem_low_pages(PAGE_SIZE);
/*
* Has to be in very low memory so we can execute
* real-mode AP code.
*/
if (__pa(trampoline_base) >= 0x9F000)
BUG();
}

Stack looks like:

[<c137ef97>] smp_alloc_memory+0x25 <--
[<c137ef97>] smp_alloc_memory+0x25
[<c137a500>] setup_arch+0x28e
[<c13735f7>] start_kernel+0x7a
[<c1379240>] xen_start_kernel+0x300

Cheers,
Mark.

2008-02-29 18:38:57

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit

On Fri, 29 Feb 2008 17:14:07 +0000, "Mark McLoughlin"
<[email protected]> said:
> On Fri, 2008-02-29 at 12:49 +0100, Alexander van Heukelum wrote:
> > This patch adds explicit detection of the EBDA and reservation
> > of the rom and adapter address space 0xa0000-0x100000 to the
> > i386 kernels. It uses reserve_bootmem instead of reserve_early,
> > because reserve_early is not yet available on i386.
> >
> > Before this patch, the EBDA size was hardcoded as 4Kb. Also, the
> > reservation of the adapter range was done by modifying the e820
> > map which is now not necessary any longer, and the code is removed
> > from copy_e820_map.
> >
> > The changes in e820_64.c are only a change in the comment above
> > copy_e820_map, and some changes of the types of local variables
> > in that function such that the 32 and 64 bit versions become equal.
> >
> > Signed-off-by: Alexander van Heukelum <[email protected]>
> >
> > ---
> >
> > On Thu, Feb 28, 2008 at 09:09:56PM +0000, Ian Campbell wrote:
> > > On Thu, 2008-02-28 at 14:13 +0100, Alexander van Heukelum wrote:
> > > > The 32-bit code still uses reserve_bootmem, so this is not really
> > > > a unification with the 64-bit version of the ebda reservation code,
> > > > but at least it provides the same detection logic and reserves the
> > > > same areas.
> > > >
> > > > This does not crash immediately on qemu. No further testing was
> > > > done! Otherwise:
> > >
> > > I haven't tested extensively either but it does seem to solve the
> > > problem for Xen.
> > >
> > > Thanks!
> > > Ian
> >
> > Thank you!
> >
> > Ingo,
> >
> > I think this is ready for -x86#testing.
> > It boots to a small userspace in qemu (i386).
> > If I should separate the cleanups, let me know.
>
> I haven't investigated in any detail, but with 2.6.25-rc3 and your
> patch I'm seeing a Xen guest hit this BUG:
>
> void __init smp_alloc_memory(void)
> {
> trampoline_base = alloc_bootmem_low_pages(PAGE_SIZE);
> /*
> * Has to be in very low memory so we can execute
> * real-mode AP code.
> */
> if (__pa(trampoline_base) >= 0x9F000)
> BUG();
> }
>
> Stack looks like:
>
> [<c137ef97>] smp_alloc_memory+0x25 <--
> [<c137ef97>] smp_alloc_memory+0x25
> [<c137a500>] setup_arch+0x28e
> [<c13735f7>] start_kernel+0x7a
> [<c1379240>] xen_start_kernel+0x300

Ouch.

My first guess is that the BIOS data area is completely non-existent for
Xen.
Is it guaranteed that the memory is zeroed out on boot? In that case we
can
special-case it easily:

change:
/* Paranoia: should never happen, but... */
if (lowmem >= 0x100000)
lowmem = 0xa0000;

into:
/* Strange case, like Xen ;) */
if (lowmem == 0 || lowmem >= 0x100000)
lowmem = 0x9f000;

Can you test that?

Greetings,
Alexander
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - Send your email first class

2008-02-29 18:49:20

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit

Alexander van Heukelum wrote:
>
> My first guess is that the BIOS data area is completely non-existent for
> Xen.
> Is it guaranteed that the memory is zeroed out on boot? In that case we
> can
> special-case it easily:
>
> change:
> /* Paranoia: should never happen, but... */
> if (lowmem >= 0x100000)
> lowmem = 0xa0000;
>
> into:
> /* Strange case, like Xen ;) */
> if (lowmem == 0 || lowmem >= 0x100000)
> lowmem = 0x9f000;
>
> Can you test that?
>

The EBDA is optional anyway; I presume it should have a zero pointer if
it isn't present.

-hpa

2008-02-29 18:57:10

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit


On Fri, 29 Feb 2008 10:44:36 -0800, "H. Peter Anvin" <[email protected]>
said:
> Alexander van Heukelum wrote:
> >
> > My first guess is that the BIOS data area is completely non-existent for
> > Xen.
> > Is it guaranteed that the memory is zeroed out on boot? In that case we
> > can
> > special-case it easily:
> >
> > change:
> > /* Paranoia: should never happen, but... */
> > if (lowmem >= 0x100000)
> > lowmem = 0xa0000;
> >
> > into:
> > /* Strange case, like Xen ;) */
> > if (lowmem == 0 || lowmem >= 0x100000)
> > lowmem = 0x9f000;
> >
> > Can you test that?
> >
>
> The EBDA is optional anyway; I presume it should have a zero pointer if
> it isn't present.

Correct, but the value that indicates the size of the conventional
memory
area must be set. At least on any real hardware. My guess is that both
values read as 0 on Xen, which causes 0-0x100000 to be reserved. If that
is always the case the workaround is fine, I think.

Alexander

> -hpa
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - mmm... Fastmail...

2008-02-29 20:00:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB


* Ian Campbell <[email protected]> wrote:

> Agreed, I think it's pure luck that Xen kernels have gotten away with
> it in the past.
>
> The patch below seems like the right thing to do. It certainly boots
> in a domU without the DMI problem (without any of the other related
> patches such as Alexander's).
>
> However ddcprobe hangs when run -- need to investigate some more, vm86
> in general works ok (i.e. my vm86 test code passes).

looks good to me - please resend with a note to add it to x86.git once
you are happy with it. Seems like 2.6.25 material as well.

Ingo

2008-02-29 22:07:55

by Mark McLoughlin

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit

On Fri, 2008-02-29 at 19:38 +0100, Alexander van Heukelum wrote:
> On Fri, 29 Feb 2008 17:14:07 +0000, "Mark McLoughlin"
> <[email protected]> said:
>
> > I haven't investigated in any detail, but with 2.6.25-rc3 and your
> > patch I'm seeing a Xen guest hit this BUG:
> >
> > void __init smp_alloc_memory(void)
> > {
> > trampoline_base = alloc_bootmem_low_pages(PAGE_SIZE);
> > /*
> > * Has to be in very low memory so we can execute
> > * real-mode AP code.
> > */
> > if (__pa(trampoline_base) >= 0x9F000)
> > BUG();
> > }
> >
> > Stack looks like:
> >
> > [<c137ef97>] smp_alloc_memory+0x25 <--
> > [<c137ef97>] smp_alloc_memory+0x25
> > [<c137a500>] setup_arch+0x28e
> > [<c13735f7>] start_kernel+0x7a
> > [<c1379240>] xen_start_kernel+0x300
>
> Ouch.
>
> My first guess is that the BIOS data area is completely non-existent for
> Xen.
> Is it guaranteed that the memory is zeroed out on boot?

Yep, that seems to be the case

> In that case we can special-case it easily:
>
> change:
> /* Paranoia: should never happen, but... */
> if (lowmem >= 0x100000)
> lowmem = 0xa0000;
>
> into:
> /* Strange case, like Xen ;) */
> if (lowmem == 0 || lowmem >= 0x100000)
> lowmem = 0x9f000;
>
> Can you test that?

Yes, that fixes boot for me.

Thanks,
Mark.

2008-02-29 22:27:01

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit

On Fri, 29 Feb 2008 22:06:59 +0000, "Mark McLoughlin"
<[email protected]> said:
> On Fri, 2008-02-29 at 19:38 +0100, Alexander van Heukelum wrote:
> > My first guess is that the BIOS data area is completely non-existent for
> > Xen. Is it guaranteed that the memory is zeroed out on boot?
>
> Yep, that seems to be the case
>
> > In that case we can special-case it easily:
> >
> > change:
> > /* Paranoia: should never happen, but... */
> > if (lowmem >= 0x100000)
> > lowmem = 0xa0000;
> >
> > into:
> > /* Strange case, like Xen ;) */
> > if (lowmem == 0 || lowmem >= 0x100000)
> > lowmem = 0x9f000;
> >
> > Can you test that?
>
> Yes, that fixes boot for me.

Thanks for testing this.

I'ld rather not count on Xen providing zeroed memory, though. I'll
try to find a different solution. It's weekend anyhow ;).

Greetings,
Alexander
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - Or how I learned to stop worrying and
love email again

Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2

This patch adds explicit detection of the EBDA and reservation
of the rom and adapter address space 0xa0000-0x100000 to the
i386 kernels. Before this patch, the EBDA size was hardcoded
as 4Kb. Also, the reservation of the adapter range was done by
modifying the e820 map which is now not necessary any longer,
and that code is removed from copy_e820_map.

The amount of conventional memory and the start of the EBDA are
detected by reading the BIOS data area directly. Paravirtual
environments do not provide this area, so we bail out early
in that case. They will just have to set up a correct memory
map to start with.

Signed-off-by: Alexander van Heukelum <[email protected]>

---

Hi Ingo,

This is the second attempt at a i386-version of the ebda
patch. I hope that one of the Xen people will be able to
check that this does not break their setups, but I think
it will be fine after their patch to exclude the 0x9f000-
0x100000 area explicitly in their setup.

Greetings,
Alexander

diff --git a/arch/x86/kernel/e820_32.c b/arch/x86/kernel/e820_32.c
index 4e16ef4..746d683 100644
--- a/arch/x86/kernel/e820_32.c
+++ b/arch/x86/kernel/e820_32.c
@@ -450,38 +450,25 @@ int __init sanitize_e820_map(struct e820entry * biosmap, char * pnr_map)
* thinkpad 560x, for example, does not cooperate with the memory
* detection code.)
*/
-int __init copy_e820_map(struct e820entry * biosmap, int nr_map)
+int __init copy_e820_map(struct e820entry *biosmap, int nr_map)
{
/* Only one memory region (or negative)? Ignore it */
if (nr_map < 2)
return -1;

do {
- unsigned long long start = biosmap->addr;
- unsigned long long size = biosmap->size;
- unsigned long long end = start + size;
- unsigned long type = biosmap->type;
+ u64 start = biosmap->addr;
+ u64 size = biosmap->size;
+ u64 end = start + size;
+ u32 type = biosmap->type;

/* Overflow in 64 bits? Ignore the memory map. */
if (start > end)
return -1;

- /*
- * Some BIOSes claim RAM in the 640k - 1M region.
- * Not right. Fix it up.
- */
- if (type == E820_RAM) {
- if (start < 0x100000ULL && end > 0xA0000ULL) {
- if (start < 0xA0000ULL)
- add_memory_region(start, 0xA0000ULL-start, type);
- if (end <= 0x100000ULL)
- continue;
- start = 0x100000ULL;
- size = end - start;
- }
- }
add_memory_region(start, size, type);
- } while (biosmap++,--nr_map);
+ } while (biosmap++, --nr_map);
+
return 0;
}

diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
index a1d7071..20e537b 100644
--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -385,15 +385,60 @@ unsigned long __init find_max_low_pfn(void)
return max_low_pfn;
}

+#define BIOS_EBDA_SEGMENT 0x40E
+#define BIOS_LOWMEM_KILOBYTES 0x413
+
/*
- * workaround for Dell systems that neglect to reserve EBDA
+ * The BIOS places the EBDA/XBDA at the top of conventional
+ * memory, and usually decreases the reported amount of
+ * conventional memory (int 0x12) too. This also contains a
+ * workaround for Dell systems that neglect to reserve EBDA.
+ * The same workaround also avoids a problem with the AMD768MPX
+ * chipset: reserve a page before VGA to prevent PCI prefetch
+ * into it (errata #56). Usually the page is reserved anyways,
+ * unless you have no PS/2 mouse plugged in.
*/
static void __init reserve_ebda_region(void)
{
- unsigned int addr;
- addr = get_bios_ebda();
- if (addr)
- reserve_bootmem(addr, PAGE_SIZE, BOOTMEM_DEFAULT);
+ unsigned int lowmem, ebda_addr;
+
+ /* To determine the position of the EBDA and the */
+ /* end of conventional memory, we need to look at */
+ /* the BIOS data area. In a paravirtual environment */
+ /* that area is absent. We'll just have to assume */
+ /* that the paravirt case can handle memory setup */
+ /* correctly, without our help. */
+#ifdef CONFIG_PARAVIRT
+ if ((boot_params.hdr.version >= 0x207) &&
+ (boot_params.hdr.hardware_subarch != 0)) {
+ return;
+ }
+#endif
+
+ /* end of low (conventional) memory */
+ lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
+ lowmem <<= 10;
+
+ /* start of EBDA area */
+ ebda_addr = *(unsigned short *)__va(BIOS_EBDA_SEGMENT);
+ ebda_addr <<= 4;
+
+ /* Fixup: bios puts an EBDA in the top 64K segment */
+ /* of conventional memory, but does not adjust lowmem. */
+ if ((lowmem - ebda_addr) <= 0x10000)
+ lowmem = ebda_addr;
+
+ /* Fixup: bios does not report an EBDA at all. */
+ /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
+ if ((ebda_addr == 0) && (lowmem >= 0x9f000))
+ lowmem = 0x9f000;
+
+ /* Paranoia: should never happen, but... */
+ if ((lowmem == 0) || (lowmem >= 0x100000))
+ lowmem = 0x9f000;
+
+ /* reserve all memory between lowmem and the 1MB mark */
+ reserve_bootmem(lowmem, 0x100000 - lowmem, BOOTMEM_DEFAULT);
}

#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -619,16 +664,9 @@ void __init setup_bootmem_allocator(void)
*/
reserve_bootmem(0, PAGE_SIZE, BOOTMEM_DEFAULT);

- /* reserve EBDA region, it's a 4K region */
+ /* reserve EBDA region */
reserve_ebda_region();

- /* could be an AMD 768MPX chipset. Reserve a page before VGA to prevent
- PCI prefetch into it (errata #56). Usually the page is reserved anyways,
- unless you have no PS/2 mouse plugged in. */
- if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
- boot_cpu_data.x86 == 6)
- reserve_bootmem(0xa0000 - 4096, 4096, BOOTMEM_DEFAULT);
-
#ifdef CONFIG_SMP
/*
* But first pinch a few for the stack/trampoline stuff

Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 64-bit add-on

This patch is an add-on to the 64-bit ebda patch. It makes
the functions reserve_ebda_region (renamed from reserve_ebda)
and copy_e820_map equal to the 32-bit versions of the previous
patch.

Changes:

Use u64 and u32 for local variables in copy_e820_map.

The amount of conventional memory and the start of the EBDA are
detected by reading the BIOS data area directly. Paravirtual
environments do not provide this area, so we bail out early
in that case. They will just have to set up a correct memory
map to start with.

Add a safety net for zeroed out BIOS data area.

Signed-off-by: Alexander van Heukelum <[email protected]>

---

diff --git a/arch/x86/kernel/e820_32.c b/arch/x86/kernel/e820_32.c
diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
index dd68cfd..baa4223 100644
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -621,10 +621,10 @@ static int __init copy_e820_map(struct e820entry *biosmap, int nr_map)
return -1;

do {
- unsigned long start = biosmap->addr;
- unsigned long size = biosmap->size;
- unsigned long end = start + size;
- unsigned long type = biosmap->type;
+ u64 start = biosmap->addr;
+ u64 size = biosmap->size;
+ u64 end = start + size;
+ u32 type = biosmap->type;

/* Overflow in 64 bits? Ignore the memory map. */
if (start > end)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index b684552..269a6b4 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -55,12 +55,30 @@ static void __init copy_bootdata(char *real_mode_data)
/*
* The BIOS places the EBDA/XBDA at the top of conventional
* memory, and usually decreases the reported amount of
- * conventional memory (int 0x12) too.
+ * conventional memory (int 0x12) too. This also contains a
+ * workaround for Dell systems that neglect to reserve EBDA.
+ * The same workaround also avoids a problem with the AMD768MPX
+ * chipset: reserve a page before VGA to prevent PCI prefetch
+ * into it (errata #56). Usually the page is reserved anyways,
+ * unless you have no PS/2 mouse plugged in.
*/
-static __init void reserve_ebda(void)
+static void __init reserve_ebda_region(void)
{
unsigned int lowmem, ebda_addr;

+ /* To determine the position of the EBDA and the */
+ /* end of conventional memory, we need to look at */
+ /* the BIOS data area. In a paravirtual environment */
+ /* that area is absent. We'll just have to assume */
+ /* that the paravirt case can handle memory setup */
+ /* correctly, without our help. */
+#ifdef CONFIG_PARAVIRT
+ if ((boot_params.hdr.version >= 0x207) &&
+ (boot_params.hdr.hardware_subarch != 0)) {
+ return;
+ }
+#endif
+
/* end of low (conventional) memory */
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
@@ -80,8 +98,8 @@ static __init void reserve_ebda(void)
lowmem = 0x9f000;

/* Paranoia: should never happen, but... */
- if (lowmem >= 0x100000)
- lowmem = 0xa0000;
+ if ((lowmem == 0) || (lowmem >= 0x100000))
+ lowmem = 0x9f000;

/* reserve all memory between lowmem and the 1MB mark */
reserve_early(lowmem, 0x100000, "BIOS reserved");
@@ -140,7 +158,7 @@ void __init x86_64_start_kernel(char * real_mode_data)
reserve_early(ramdisk_image, ramdisk_end, "RAMDISK");
}

- reserve_ebda();
+ reserve_ebda_region();

/*
* At this point everything still needed from the boot loader
diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c

2008-03-04 11:42:40

by Mark McLoughlin

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

On Thu, 2008-02-28 at 23:16 +0000, Ian Campbell wrote:
> On Thu, 2008-02-28 at 13:14 -0800, H. Peter Anvin wrote:
> >
> > You need to set up your memory map more sensibly; it's not just the
> > kernel, user space tries to access these areas too.
>
> Agreed, I think it's pure luck that Xen kernels have gotten away with it
> in the past.
>
> The patch below seems like the right thing to do. It certainly boots in
> a domU without the DMI problem (without any of the other related patches
> such as Alexander's).

Yep, this patch on its own fixes 2.6.25-rc3 DomU boot for me here.

> x86/xen: Construct e820 map with a hole between 640K-1M.
>
> Signed-off-by: Ian Campbell <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/xen/setup.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 3bad477..2341492 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -38,7 +38,8 @@ char * __init xen_memory_setup(void)
> unsigned long max_pfn = xen_start_info->nr_pages;
>
> e820.nr_map = 0;
> - add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM);
> + add_memory_region(0, LOWMEMSIZE(), E820_RAM);
> + add_memory_region(HIGH_MEMORY, PFN_PHYS(max_pfn)-HIGH_MEMORY, E820_RAM);

Seems to me to be the right thing to do ...

Acked-by: Mark McLoughlin <[email protected]>

Cheers,
Mark.

2008-03-04 11:47:40

by Mark McLoughlin

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2

On Sat, 2008-03-01 at 17:09 +0100, Alexander van Heukelum wrote:
> This patch adds explicit detection of the EBDA and reservation
> of the rom and adapter address space 0xa0000-0x100000 to the
> i386 kernels. Before this patch, the EBDA size was hardcoded
> as 4Kb. Also, the reservation of the adapter range was done by
> modifying the e820 map which is now not necessary any longer,
> and that code is removed from copy_e820_map.
>
> The amount of conventional memory and the start of the EBDA are
> detected by reading the BIOS data area directly. Paravirtual
> environments do not provide this area, so we bail out early
> in that case. They will just have to set up a correct memory
> map to start with.
>
> Signed-off-by: Alexander van Heukelum <[email protected]>
>
> ---
>
> Hi Ingo,
>
> This is the second attempt at a i386-version of the ebda
> patch. I hope that one of the Xen people will be able to
> check that this does not break their setups, but I think
> it will be fine after their patch to exclude the 0x9f000-
> 0x100000 area explicitly in their setup.

Confirmed that with Ian's e820 map patch and your patch, Xen DomU boots
fine.

> diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
> index a1d7071..20e537b 100644
> --- a/arch/x86/kernel/setup_32.c
> +++ b/arch/x86/kernel/setup_32.c
> @@ -385,15 +385,60 @@ unsigned long __init find_max_low_pfn(void)
> return max_low_pfn;
> }
>
> +#define BIOS_EBDA_SEGMENT 0x40E
> +#define BIOS_LOWMEM_KILOBYTES 0x413
> +
> /*
> - * workaround for Dell systems that neglect to reserve EBDA
> + * The BIOS places the EBDA/XBDA at the top of conventional
> + * memory, and usually decreases the reported amount of
> + * conventional memory (int 0x12) too. This also contains a
> + * workaround for Dell systems that neglect to reserve EBDA.
> + * The same workaround also avoids a problem with the AMD768MPX
> + * chipset: reserve a page before VGA to prevent PCI prefetch
> + * into it (errata #56). Usually the page is reserved anyways,
> + * unless you have no PS/2 mouse plugged in.
> */
> static void __init reserve_ebda_region(void)
> {
> - unsigned int addr;
> - addr = get_bios_ebda();
> - if (addr)
> - reserve_bootmem(addr, PAGE_SIZE, BOOTMEM_DEFAULT);
> + unsigned int lowmem, ebda_addr;
> +
> + /* To determine the position of the EBDA and the */
> + /* end of conventional memory, we need to look at */
> + /* the BIOS data area. In a paravirtual environment */
> + /* that area is absent. We'll just have to assume */
> + /* that the paravirt case can handle memory setup */
> + /* correctly, without our help. */
> +#ifdef CONFIG_PARAVIRT
> + if ((boot_params.hdr.version >= 0x207) &&
> + (boot_params.hdr.hardware_subarch != 0)) {
> + return;
> + }
> +#endif

This is a bit magic, is it worth splitting it out as something like
is_paravirt_environment() ?

Cheers,
Mark.

2008-03-04 13:31:34

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2

On Tue, 04 Mar 2008 11:44:42 +0000, "Mark McLoughlin"
<[email protected]> said:
> On Sat, 2008-03-01 at 17:09 +0100, Alexander van Heukelum wrote:
> > This patch adds explicit detection of the EBDA and reservation
> > of the rom and adapter address space 0xa0000-0x100000 to the
> > i386 kernels. Before this patch, the EBDA size was hardcoded
> > as 4Kb. Also, the reservation of the adapter range was done by
> > modifying the e820 map which is now not necessary any longer,
> > and that code is removed from copy_e820_map.
> >
> > The amount of conventional memory and the start of the EBDA are
> > detected by reading the BIOS data area directly. Paravirtual
> > environments do not provide this area, so we bail out early
> > in that case. They will just have to set up a correct memory
> > map to start with.
> >
> > Signed-off-by: Alexander van Heukelum <[email protected]>
> >
> > ---
> >
> > Hi Ingo,
> >
> > This is the second attempt at a i386-version of the ebda
> > patch. I hope that one of the Xen people will be able to
> > check that this does not break their setups, but I think
> > it will be fine after their patch to exclude the 0x9f000-
> > 0x100000 area explicitly in their setup.
>
> Confirmed that with Ian's e820 map patch and your patch, Xen DomU boots
> fine.

Thanks for testing that!

> > diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
> > index a1d7071..20e537b 100644
> > --- a/arch/x86/kernel/setup_32.c
> > +++ b/arch/x86/kernel/setup_32.c
> > @@ -385,15 +385,60 @@ unsigned long __init find_max_low_pfn(void)
> > return max_low_pfn;
> > }
> >
> > +#define BIOS_EBDA_SEGMENT 0x40E
> > +#define BIOS_LOWMEM_KILOBYTES 0x413
> > +
> > /*
> > - * workaround for Dell systems that neglect to reserve EBDA
> > + * The BIOS places the EBDA/XBDA at the top of conventional
> > + * memory, and usually decreases the reported amount of
> > + * conventional memory (int 0x12) too. This also contains a
> > + * workaround for Dell systems that neglect to reserve EBDA.
> > + * The same workaround also avoids a problem with the AMD768MPX
> > + * chipset: reserve a page before VGA to prevent PCI prefetch
> > + * into it (errata #56). Usually the page is reserved anyways,
> > + * unless you have no PS/2 mouse plugged in.
> > */
> > static void __init reserve_ebda_region(void)
> > {
> > - unsigned int addr;
> > - addr = get_bios_ebda();
> > - if (addr)
> > - reserve_bootmem(addr, PAGE_SIZE, BOOTMEM_DEFAULT);
> > + unsigned int lowmem, ebda_addr;
> > +
> > + /* To determine the position of the EBDA and the */
> > + /* end of conventional memory, we need to look at */
> > + /* the BIOS data area. In a paravirtual environment */
> > + /* that area is absent. We'll just have to assume */
> > + /* that the paravirt case can handle memory setup */
> > + /* correctly, without our help. */
> > +#ifdef CONFIG_PARAVIRT
> > + if ((boot_params.hdr.version >= 0x207) &&
> > + (boot_params.hdr.hardware_subarch != 0)) {
> > + return;
> > + }
> > +#endif
>
> This is a bit magic, is it worth splitting it out as something like
> is_paravirt_environment() ?

Yes, I guess it would make sense, but do we really want to start
creating accessor functions to the boot_params struct? If we do,
then maybe put this as an inline function in <asm-x86/bootparam.h>?
I don't think <asm-x86/paravirt.h> is the right place for this.

Should is_paravirt_environment then always return 0 if
CONFIG_PARAVIRT is not set? Maybe a function like get_subarch
would be preferable? or has_legacy_bios_areas? I don't know.
If someone makes a reasonable suggestion, I'll code it up.

Greetings,
Alexander

> Cheers,
> Mark.
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - And now for something completely different?

2008-03-04 14:34:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB


* Mark McLoughlin <[email protected]> wrote:

> > The patch below seems like the right thing to do. It certainly boots
> > in a domU without the DMI problem (without any of the other related
> > patches such as Alexander's).
>
> Yep, this patch on its own fixes 2.6.25-rc3 DomU boot for me here.

thanks - i picked it up for v2.6.25 merging.

i'm wondering, what triggered this bug, and why didnt we have these
problems in the past?

Ingo

2008-03-04 14:49:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2


* Mark McLoughlin <[email protected]> wrote:

> > This is the second attempt at a i386-version of the ebda patch. I
> > hope that one of the Xen people will be able to check that this does
> > not break their setups, but I think it will be fine after their
> > patch to exclude the 0x9f000- 0x100000 area explicitly in their
> > setup.
>
> Confirmed that with Ian's e820 map patch and your patch, Xen DomU
> boots fine.

hm, for now i've only got the patch below queued up for v2.6.25.

Could you check whether just the patch below ontop of -rc3-ish upstream
solves the problem too? The EBDA patch would be a bit risky now - it's
queued up for v2.6.26 at the moment.

Ingo

--------------->
Subject: x86/xen: fix DomU boot problem
From: Ian Campbell <[email protected]>
Date: Thu, 28 Feb 2008 23:16:49 +0000

Construct Xen guest e820 map with a hole between 640K-1M.

It's pure luck that Xen kernels have gotten away with it in the past.

The patch below seems like the right thing to do. It certainly boots in
a domU without the DMI problem (without any of the other related patches
such as Alexander's).

Signed-off-by: Ian Campbell <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Tested-by: Mark McLoughlin <[email protected]>
Acked-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/xen/setup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-x86.q/arch/x86/xen/setup.c
===================================================================
--- linux-x86.q.orig/arch/x86/xen/setup.c
+++ linux-x86.q/arch/x86/xen/setup.c
@@ -38,7 +38,8 @@ char * __init xen_memory_setup(void)
unsigned long max_pfn = xen_start_info->nr_pages;

e820.nr_map = 0;
- add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM);
+ add_memory_region(0, LOWMEMSIZE(), E820_RAM);
+ add_memory_region(HIGH_MEMORY, PFN_PHYS(max_pfn)-HIGH_MEMORY, E820_RAM);

return "Xen";
}

2008-03-04 15:13:36

by Ian Campbell

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB


On Tue, 2008-03-04 at 15:33 +0100, Ingo Molnar wrote:
> * Mark McLoughlin <[email protected]> wrote:
>
> > > The patch below seems like the right thing to do. It certainly boots
> > > in a domU without the DMI problem (without any of the other related
> > > patches such as Alexander's).
> >
> > Yep, this patch on its own fixes 2.6.25-rc3 DomU boot for me here.
>
> thanks - i picked it up for v2.6.25 merging.

Thanks, I'd been planning to resend once I got the ddcprobe thing sorted
but I've been a bit snowed under (deadlines :-|). This patch is
independently correct though and turns the ddcprobe failure from a domU
kernel crash into a userspace hang.

ddcprobe hangs because the page at pfn 0 is comprised entirely 0xc2
bytes (the Xen memory scrub poison value -- I'm running a debug
hypervisor) which causes lrmi to try and jump to 0xc2c2*16+0xc2c2 to
implement int $0xXX. gdb won't let me actually examine that address for
some reason so I don't actually know what it's hitting to cause the
loop/hang but no good can come of jumping to that address in any case...

Sanest solution seems to me to be to explicitly zero PFN 0 -- I have a
patch I just haven't had a chance to clean it up for submission yet.

> i'm wondering, what triggered this bug, and why didnt we have these
> problems in the past?

Did someone make the boot allocator start at the bottom instead of the
top of memory or something?

Ian.

--
Ian Campbell
Current Noise: Nile - Even The Gods Must Die

There is a natural hootchy-kootchy to a goldfish.
-- Walt Disney

2008-03-04 15:17:18

by Mark McLoughlin

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2

On Tue, 2008-03-04 at 15:49 +0100, Ingo Molnar wrote:
> * Mark McLoughlin <[email protected]> wrote:
>
> > > This is the second attempt at a i386-version of the ebda patch. I
> > > hope that one of the Xen people will be able to check that this does
> > > not break their setups, but I think it will be fine after their
> > > patch to exclude the 0x9f000- 0x100000 area explicitly in their
> > > setup.
> >
> > Confirmed that with Ian's e820 map patch and your patch, Xen DomU
> > boots fine.
>
> hm, for now i've only got the patch below queued up for v2.6.25.
>
> Could you check whether just the patch below ontop of -rc3-ish upstream
> solves the problem too? The EBDA patch would be a bit risky now - it's
> queued up for v2.6.26 at the moment.

Sorry, perhaps I wasn't clear.

Ian's patch to add a hole in the e820 map fixes the problem without any
other patch.

I was just confirming to Alexander that his EBDA patch didn't break
anything further.

Thanks,
Mark.

2008-03-04 15:19:05

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Ingo Molnar wrote:
> i'm wondering, what triggered this bug, and why didnt we have these
> problems in the past?

I think it was a side effect of all the early_ioremap work, which
changed the order in which things happened at boot time.

J

2008-03-04 15:24:04

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2

Mark McLoughlin wrote:
> This is a bit magic, is it worth splitting it out as something like
> is_paravirt_environment() ?
>

Yes, is_paravirt() already exists for this purpose.

The code looking at the boot params will only work if we actually booted
via the paravirt Linux boot protocol, which Xen doesn't at present.

J

2008-03-04 15:25:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2


* Mark McLoughlin <[email protected]> wrote:

> > hm, for now i've only got the patch below queued up for v2.6.25.
> >
> > Could you check whether just the patch below ontop of -rc3-ish
> > upstream solves the problem too? The EBDA patch would be a bit risky
> > now - it's queued up for v2.6.26 at the moment.
>
> Sorry, perhaps I wasn't clear.
>
> Ian's patch to add a hole in the e820 map fixes the problem without
> any other patch.
>
> I was just confirming to Alexander that his EBDA patch didn't break
> anything further.

ah, great. You got me worried for a minute :)

Ingo

2008-03-04 15:26:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB


* Jeremy Fitzhardinge <[email protected]> wrote:

> Ingo Molnar wrote:
>> i'm wondering, what triggered this bug, and why didnt we have these
>> problems in the past?
>
> I think it was a side effect of all the early_ioremap work, which
> changed the order in which things happened at boot time.

which commits do you mean in particular?

Ingo

2008-03-04 16:07:39

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Ingo Molnar wrote:
>> I think it was a side effect of all the early_ioremap work, which
>> changed the order in which things happened at boot time.
>>
>
> which commits do you mean in particular?

Good question. It might have been cpa as well. I'd have to look
through to confirm the theory, but I think Xen got broken for other
reasons around there, so simple bisect wouldn't be enough to identify
the problem.

J

2008-03-04 16:15:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB


* Jeremy Fitzhardinge <[email protected]> wrote:

> Ingo Molnar wrote:
>>> I think it was a side effect of all the early_ioremap work, which changed
>>> the order in which things happened at boot time.
>>>
>>
>> which commits do you mean in particular?
>
> Good question. It might have been cpa as well. I'd have to look
> through to confirm the theory, but I think Xen got broken for other
> reasons around there, so simple bisect wouldn't be enough to identify
> the problem.

i'm trying to do bug forensics to figure out where we broke Xen - i
really have the feeling that there could be other side-effects of
whatever change caused this.

Ingo

2008-03-04 16:16:24

by Mark McLoughlin

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

On Tue, 2008-03-04 at 15:33 +0100, Ingo Molnar wrote:
> * Mark McLoughlin <[email protected]> wrote:
>
> > > The patch below seems like the right thing to do. It certainly boots
> > > in a domU without the DMI problem (without any of the other related
> > > patches such as Alexander's).
> >
> > Yep, this patch on its own fixes 2.6.25-rc3 DomU boot for me here.
>
> thanks - i picked it up for v2.6.25 merging.

Great.

> i'm wondering, what triggered this bug, and why didnt we have these
> problems in the past?

That's been bothering me too, but hadn't come up with anything until I
just now looked again.

I'm pretty certain that dmi_scan_machine() wasn't using a fixmap before
(in 2.6.24) since I had to fix a broken xen dom0 patch to handle this
new situation.

Looking at 2.6.24 bt_ioremap(), it would have been hitting this:

/*
* Don't remap the low PCI/ISA area, it's always mapped..
*/
if (phys_addr >= ISA_START_ADDRESS && last_addr < ISA_END_ADDRESS)
return phys_to_virt(phys_addr);

If I restore this behaviour (i.e. remove Ian's e820 hole and add the
patch below), then I see us still allocating a page table at 0xf0000 but
we don't then try and create a new writable mapping to it later, and
everything goes fine.

That's a little worrying though, surely? The mapping created by
kernel_physical_mapping_init() is writable, so shouldn't Xen barf if you
try and use a page within that mapping as a page table?

Cheers,
Mark.

>From c43a8a1badf5347a57d66556a4ba497623d620d4 Mon Sep 17 00:00:00 2001
From: Mark McLoughlin <[email protected]>
Date: Tue, 4 Mar 2008 15:32:03 +0000
Subject: [PATCH] x86: Don't remap the ISA region in early_ioremap()

Restore the previous behaviour of bt_ioremap() where
it would use the existing mapping of the ISA region
rather than using a fixmap.

Signed-off-by: Mark McLoughlin <[email protected]>
---
arch/x86/mm/ioremap.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 882328e..89fdd8f 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -425,6 +425,13 @@ void __init *early_ioremap(unsigned long phys_addr, unsigned long size)
return NULL;
}

+ /*
+ * Don't remap the low PCI/ISA area, it's always mapped..
+ */
+ if (phys_addr >= ISA_START_ADDRESS && last_addr < ISA_END_ADDRESS)
+ return phys_to_virt(phys_addr);
+
+
if (nesting >= FIX_BTMAPS_NESTING) {
WARN_ON(1);
return NULL;
@@ -471,6 +478,10 @@ void __init early_iounmap(void *addr, unsigned long size)
enum fixed_addresses idx;
unsigned int nesting;

+ virt_addr = (unsigned long)addr;
+ if (virt_addr < fix_to_virt(FIX_BTMAP_BEGIN))
+ return;
+
nesting = --early_ioremap_nested;
WARN_ON(nesting < 0);

@@ -480,11 +491,6 @@ void __init early_iounmap(void *addr, unsigned long size)
dump_stack();
}

- virt_addr = (unsigned long)addr;
- if (virt_addr < fix_to_virt(FIX_BTMAP_BEGIN)) {
- WARN_ON(1);
- return;
- }
offset = virt_addr & ~PAGE_MASK;
nrpages = PAGE_ALIGN(offset + size - 1) >> PAGE_SHIFT;

--
1.5.4.2


2008-03-04 16:23:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB


* Mark McLoughlin <[email protected]> wrote:

> > i'm wondering, what triggered this bug, and why didnt we have these
> > problems in the past?
>
> That's been bothering me too, but hadn't come up with anything until I
> just now looked again.
>
> I'm pretty certain that dmi_scan_machine() wasn't using a fixmap
> before (in 2.6.24) since I had to fix a broken xen dom0 patch to
> handle this new situation.

ah, ok, indeed Jeremy's gut feeling about early_ioremap() was right - we
didnt use fixmaps for early_ioremap() [ ex. bt_ioremap() ] in .24. We
try to slowly get rid of the linearity assumptions from arch/x86, this
was one of the steps.

Ingo

2008-03-04 16:51:28

by Alexander van Heukelum

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2


On Tue, 04 Mar 2008 07:18:48 -0800, "Jeremy Fitzhardinge"
<[email protected]> said:
> Mark McLoughlin wrote:
> > This is a bit magic, is it worth splitting it out as something like
> > is_paravirt_environment() ?
> >
>
> Yes, is_paravirt() already exists for this purpose.

Hi,

If it exists, it is well-hidden. A grep for is_paravirt on the testing
tree turns up nothing. Did you get the name right?

> The code looking at the boot params will only work if we actually booted
> via the paravirt Linux boot protocol, which Xen doesn't at present.

I chose to code it exactly this way, because it is what is used in
head_32.S to choose how to start the kernel. Or is this code not
executed at all?

[excerpt form head_32.S]
cmpw $0x207, pa(boot_params + BP_version)
jb default_entry

/* Paravirt-compatible boot parameters. Look to see what
architecture
we're booting under. */
movl pa(boot_params + BP_hardware_subarch), %eax
cmpl $num_subarch_entries, %eax
jae bad_subarch

movl pa(subarch_entries)(,%eax,4), %eax
subl $__PAGE_OFFSET, %eax
jmp *%eax
[and]
subarch_entries:
.long default_entry /* normal x86/PC */
.long lguest_entry /* lguest hypervisor */
.long xen_entry /* Xen hypervisor */
num_subarch_entries = (. - subarch_entries) / 4
[end]

If this is indeed not executed, is there a way to detect whether
we can expect the environment to behave like a normal pc in terms
of magic addresses, bios areas, isa reserved address space and so
on?

Greetings,
Alexander
--
Alexander van Heukelum
[email protected]

--
http://www.fastmail.fm - mmm... Fastmail...

2008-03-04 17:10:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2

Alexander van Heukelum wrote:
>
> If this is indeed not executed, is there a way to detect whether
> we can expect the environment to behave like a normal pc in terms
> of magic addresses, bios areas, isa reserved address space and so
> on?
>

The subarch stuff (boot_params.hdr.hardware_subarch) is indeed executed,
at least with newer PV guests.

However, as far as this kind of stuff, one really have to wonder to what
extent the PV users really care about how much they perturb the guests.
The canonical view of virtualization is that you should perturb your
guests as little as possible, but that doesn't really seem to be
considered in the observable bits of the PV universe as far as I can tell.

A *massive* issue with hooks -- including paravirt_ops -- is that they
are largely undocumented both in code and in specification, and usually
hard-code the underlying implemenentation at a specific point in time: I
have yet to see any sort of specification document for paravirt_ops, and
most of the hooks are simply empty on hardware. This means that the
burden has shifted onto the kernel maintainers to test every possible
paravirt_ops client, because it is quite literally the only way to know
when it's broken.

I'm starting to feel that the PV clients need to document their
environments and their constraints better for the benefit of the core
maintainers.

-hpa

2008-03-04 17:17:04

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB on 32-bit v2

Alexander van Heukelum wrote:
> On Tue, 04 Mar 2008 07:18:48 -0800, "Jeremy Fitzhardinge"
> <[email protected]> said:
>
>> Mark McLoughlin wrote:
>>
>>> This is a bit magic, is it worth splitting it out as something like
>>> is_paravirt_environment() ?
>>>
>>>
>> Yes, is_paravirt() already exists for this purpose.
>>
>
> Hi,
>
> If it exists, it is well-hidden. A grep for is_paravirt on the testing
> tree turns up nothing. Did you get the name right?
>

Nope. paravirt_enabled().

>> The code looking at the boot params will only work if we actually booted
>> via the paravirt Linux boot protocol, which Xen doesn't at present.
>>
>
> I chose to code it exactly this way, because it is what is used in
> head_32.S to choose how to start the kernel. Or is this code not
> executed at all?
>

No, this isn't executed when booting under Xen. Xen is a bit magic, and
has its own kernel entrypoint which the domain builder finds via
Xen-specific ELF notes on the vmlinux. We plan to move to booting via
this path at some point.

> [excerpt form head_32.S]
> cmpw $0x207, pa(boot_params + BP_version)
> jb default_entry
>
> /* Paravirt-compatible boot parameters. Look to see what
> architecture
> we're booting under. */
> movl pa(boot_params + BP_hardware_subarch), %eax
> cmpl $num_subarch_entries, %eax
> jae bad_subarch
>
> movl pa(subarch_entries)(,%eax,4), %eax
> subl $__PAGE_OFFSET, %eax
> jmp *%eax
> [and]
> subarch_entries:
> .long default_entry /* normal x86/PC */
> .long lguest_entry /* lguest hypervisor */
> .long xen_entry /* Xen hypervisor */
> num_subarch_entries = (. - subarch_entries) / 4
> [end]
>
> If this is indeed not executed, is there a way to detect whether
> we can expect the environment to behave like a normal pc in terms
> of magic addresses, bios areas, isa reserved address space and so
> on?
>

Just because something is paravirtualized and uses a non-PC
hardware_subarch doesn't mean this stuff isn't present. Even the
paravirt_enabled() test isn't accurate, because the environment may
still choose to emulate these things (or in the Xen dom0 case, it may
directly expose the real hardware).

J

2008-03-04 17:50:16

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Ian Campbell wrote:
> BTW Jeremy, the kernel doesn't use XENMEM_memory_map -- any reason other
> than it not being useful at the time? These days the tools can push an
> arbitrary e820 down for a guest which might be useful to support,
> although nothing interesting is done with it today.

No, never looked at that stuff in detail. I'll bear it in mind.

J

Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB, 32-bit, use paravirt_enabled

Jeremy Fitzhardinge pointed out that looking at the boot_params
struct to determine if the system is running in a paravirtual
environment is not reliable for the Xen case, currently. He also
points out that there already exists a function to determine if
the system is running in a paravirtual environment. So let's use
that instead. This gets rid of the preprocessor test too.

Signed-off-by: Alexander van Heukelum <[email protected]>

---

On Tue, Mar 04, 2008 at 09:11:33AM -0800, Jeremy Fitzhardinge wrote:
> Alexander van Heukelum wrote:
> >On Tue, 04 Mar 2008 07:18:48 -0800, "Jeremy Fitzhardinge"
> ><[email protected]> said:
> >>Yes, is_paravirt() already exists for this purpose.
> >
> >Hi,
> >
> >If it exists, it is well-hidden. A grep for is_paravirt on the testing
> >tree turns up nothing. Did you get the name right?
> >
>
> Nope. paravirt_enabled().

Duh, I should have been able to find that :-/. Thanks for the pointer.

> Just because something is paravirtualized and uses a non-PC
> hardware_subarch doesn't mean this stuff isn't present. Even the
> paravirt_enabled() test isn't accurate, because the environment may
> still choose to emulate these things (or in the Xen dom0 case, it may
> directly expose the real hardware).

paravirt_enabled() returns 0 if CONFIG_PARAVIRT is not set, or if a
paravirt-enabled kernel is run on bare hardware: dom0. If that is right,
then it is exactly what is needed.

Greetings,
Alexander

diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
index 20e537b..f595d7b 100644
--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -64,6 +64,7 @@
#include <setup_arch.h>
#include <bios_ebda.h>
#include <asm/cacheflush.h>
+#include <asm/processor.h>

/* This value is set up by the early boot code to point to the value
immediately after the boot time page tables. It contains a *physical*
@@ -408,12 +409,8 @@ static void __init reserve_ebda_region(void)
/* that area is absent. We'll just have to assume */
/* that the paravirt case can handle memory setup */
/* correctly, without our help. */
-#ifdef CONFIG_PARAVIRT
- if ((boot_params.hdr.version >= 0x207) &&
- (boot_params.hdr.hardware_subarch != 0)) {
+ if (paravirt_enabled())
return;
- }
-#endif

/* end of low (conventional) memory */
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);

Subject: Re: [PATCH] reserve end-of-conventional-memory to 1MB, 64-bit, use paravirt_enabled

Jeremy Fitzhardinge pointed out that looking at the boot_params
struct to determine if the system is running in a paravirtual
environment is not reliable for the Xen case, currently. He also
points out that there already exists a function to determine if
the system is running in a paravirtual environment. So let's use
that instead. This gets rid of the preprocessor test too.

Signed-off-by: Alexander van Heukelum <[email protected]>

---

Same, for 64 bit.

Greetings,
Alexander

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 269a6b4..48be76c 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -72,12 +72,8 @@ static void __init reserve_ebda_region(void)
/* that area is absent. We'll just have to assume */
/* that the paravirt case can handle memory setup */
/* correctly, without our help. */
-#ifdef CONFIG_PARAVIRT
- if ((boot_params.hdr.version >= 0x207) &&
- (boot_params.hdr.hardware_subarch != 0)) {
+ if (paravirt_enabled())
return;
- }
-#endif

/* end of low (conventional) memory */
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);

2008-03-05 16:02:53

by Eduardo Habkost

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

On Thu, Feb 28, 2008 at 11:16:49PM +0000, Ian Campbell wrote:
>
> The patch below seems like the right thing to do. It certainly boots in
> a domU without the DMI problem (without any of the other related patches
> such as Alexander's).
>
> However ddcprobe hangs when run -- need to investigate some more, vm86
> in general works ok (i.e. my vm86 test code passes).
>
> BTW Jeremy, the kernel doesn't use XENMEM_memory_map -- any reason other
> than it not being useful at the time? These days the tools can push an
> arbitrary e820 down for a guest which might be useful to support,
> although nothing interesting is done with it today.
>
> Ian.
>
> x86/xen: Construct e820 map with a hole between 640K-1M.
>
> Signed-off-by: Ian Campbell <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/xen/setup.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 3bad477..2341492 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -38,7 +38,8 @@ char * __init xen_memory_setup(void)
> unsigned long max_pfn = xen_start_info->nr_pages;
>
> e820.nr_map = 0;
> - add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM);
> + add_memory_region(0, LOWMEMSIZE(), E820_RAM);
> + add_memory_region(HIGH_MEMORY, PFN_PHYS(max_pfn)-HIGH_MEMORY, E820_RAM);

Won't this waste 300+ KB of Perfectly Good RAM? Or I understood it
incorrectly?

I am aware that it would take more work to tell all kernel code that it
shouldn't look for BIOS data on this region when running as a domU guest,
but it seems that it would be a better solution.

--
Eduardo

2008-03-05 16:12:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Eduardo Habkost wrote:
>>
>> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
>> index 3bad477..2341492 100644
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -38,7 +38,8 @@ char * __init xen_memory_setup(void)
>> unsigned long max_pfn = xen_start_info->nr_pages;
>>
>> e820.nr_map = 0;
>> - add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM);
>> + add_memory_region(0, LOWMEMSIZE(), E820_RAM);
>> + add_memory_region(HIGH_MEMORY, PFN_PHYS(max_pfn)-HIGH_MEMORY, E820_RAM);
>
> Won't this waste 300+ KB of Perfectly Good RAM? Or I understood it
> incorrectly?
>
> I am aware that it would take more work to tell all kernel code that it
> shouldn't look for BIOS data on this region when running as a domU guest,
> but it seems that it would be a better solution.
>

No, the right thing is for Xen to not try to map RAM in this area.

-hpa

2008-03-05 16:44:17

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Eduardo Habkost wrote:
> On Thu, Feb 28, 2008 at 11:16:49PM +0000, Ian Campbell wrote:
>
>> The patch below seems like the right thing to do. It certainly boots in
>> a domU without the DMI problem (without any of the other related patches
>> such as Alexander's).
>>
>> However ddcprobe hangs when run -- need to investigate some more, vm86
>> in general works ok (i.e. my vm86 test code passes).
>>
>> BTW Jeremy, the kernel doesn't use XENMEM_memory_map -- any reason other
>> than it not being useful at the time? These days the tools can push an
>> arbitrary e820 down for a guest which might be useful to support,
>> although nothing interesting is done with it today.
>>
>> Ian.
>>
>> x86/xen: Construct e820 map with a hole between 640K-1M.
>>
>> Signed-off-by: Ian Campbell <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>> Cc: Jeremy Fitzhardinge <[email protected]>
>> ---
>> arch/x86/xen/setup.c | 3 ++-
>> 1 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
>> index 3bad477..2341492 100644
>> --- a/arch/x86/xen/setup.c
>> +++ b/arch/x86/xen/setup.c
>> @@ -38,7 +38,8 @@ char * __init xen_memory_setup(void)
>> unsigned long max_pfn = xen_start_info->nr_pages;
>>
>> e820.nr_map = 0;
>> - add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM);
>> + add_memory_region(0, LOWMEMSIZE(), E820_RAM);
>> + add_memory_region(HIGH_MEMORY, PFN_PHYS(max_pfn)-HIGH_MEMORY, E820_RAM);
>>
>
> Won't this waste 300+ KB of Perfectly Good RAM? Or I understood it
> incorrectly?
>
> I am aware that it would take more work to tell all kernel code that it
> shouldn't look for BIOS data on this region when running as a domU guest,
> but it seems that it would be a better solution.

For the moment that's true, but we should be able to release those
pages. On the other hand, there's been talk of making Xen hand out
memory in physically contigious 2M chunks, which means trying to unmap
300k won't be possible or worthwhile. I suspect real hardware will just
waste this memory too (it won't remap that 320k of ram to somewhere
else; it will just be shadowed) - which is nothing compared to chipsets
which will just throw away 1G to make space for PCI...

J

2008-03-05 16:56:52

by Eduardo Habkost

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

On Wed, Mar 05, 2008 at 08:08:01AM -0800, H. Peter Anvin wrote:
> Eduardo Habkost wrote:
> >>
> >>diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> >>index 3bad477..2341492 100644
> >>--- a/arch/x86/xen/setup.c
> >>+++ b/arch/x86/xen/setup.c
> >>@@ -38,7 +38,8 @@ char * __init xen_memory_setup(void)
> >> unsigned long max_pfn = xen_start_info->nr_pages;
> >>
> >> e820.nr_map = 0;
> >>- add_memory_region(0, PFN_PHYS(max_pfn), E820_RAM);
> >>+ add_memory_region(0, LOWMEMSIZE(), E820_RAM);
> >>+ add_memory_region(HIGH_MEMORY, PFN_PHYS(max_pfn)-HIGH_MEMORY,
> >>E820_RAM);
> >
> >Won't this waste 300+ KB of Perfectly Good RAM? Or I understood it
> >incorrectly?
> >
> >I am aware that it would take more work to tell all kernel code that it
> >shouldn't look for BIOS data on this region when running as a domU guest,
> >but it seems that it would be a better solution.
> >
>
> No, the right thing is for Xen to not try to map RAM in this area.

Why? If the Xen host is telling us there is valid RAM in this area,
why can't we use it?

Existing Xen hosts use those physical address ranges for RAM. We can't
fix this on the e820 map on the guest side without making otherwise-valid
RAM unused.

--
Eduardo

2008-03-05 17:32:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Jeremy Fitzhardinge wrote:
>>
>> I am aware that it would take more work to tell all kernel code that it
>> shouldn't look for BIOS data on this region when running as a domU guest,
>> but it seems that it would be a better solution.
>
> For the moment that's true, but we should be able to release those
> pages. On the other hand, there's been talk of making Xen hand out
> memory in physically contigious 2M chunks, which means trying to unmap
> 300k won't be possible or worthwhile. I suspect real hardware will just
> waste this memory too (it won't remap that 320k of ram to somewhere
> else; it will just be shadowed) - which is nothing compared to chipsets
> which will just throw away 1G to make space for PCI...
>

Not that 384K is all that much to worry about, but you could just map
memory from 2 MB upward, and set the kernel load address to 2 MB (which
is good for performance anyway.)

There has been talk of making the default kernel load address 16 MB, to
keep the DMA zone free.

-hpa

2008-03-05 17:33:38

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Eduardo Habkost wrote:
>>>
>>> I am aware that it would take more work to tell all kernel code that it
>>> shouldn't look for BIOS data on this region when running as a domU guest,
>>> but it seems that it would be a better solution.
>>>
>> No, the right thing is for Xen to not try to map RAM in this area.
>
> Why? If the Xen host is telling us there is valid RAM in this area,
> why can't we use it?
>
> Existing Xen hosts use those physical address ranges for RAM. We can't
> fix this on the e820 map on the guest side without making otherwise-valid
> RAM unused.
>

x86 has a number of historical assumptions, both in kernel and in
userspace (consider dmidecode!) Trying to go against them is a losing
proposition, *AND* a headache for the maintainers, which will have to
consider "oh yes, and Xen does this dumb thing which goes against what
all the hardware does."

-hpa

2008-03-05 17:34:05

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

H. Peter Anvin wrote:
> x86 has a number of historical assumptions, both in kernel and in
> userspace (consider dmidecode!) Trying to go against them is a losing
> proposition, *AND* a headache for the maintainers, which will have to
> consider "oh yes, and Xen does this dumb thing which goes against what
> all the hardware does."

Ahem. "Hardware does this dumb thing that Xen avoids". kthxbye ;)

J

2008-03-05 17:43:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] use realmode code to reserve end-of-conventional-memory to 1MB

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> x86 has a number of historical assumptions, both in kernel and in
>> userspace (consider dmidecode!) Trying to go against them is a losing
>> proposition, *AND* a headache for the maintainers, which will have to
>> consider "oh yes, and Xen does this dumb thing which goes against what
>> all the hardware does."
>
> Ahem. "Hardware does this dumb thing that Xen avoids". kthxbye ;)

If hardware was trying to emulate Xen, you'd be right.

-hpa