2008-02-07 04:16:33

by Yinghai Lu

Subject: [PATCH] x86_64: fix page table size

[PATCH] x86_64: fix page table size

Command line: apic=debug acpi.debug_level=0x0000000F debug initcall_debug pci=routeirq ramdisk_size=131072 root=/dev/ram0 rw ip=dhcp console=uart8250,io,0x3f8,115200n8
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000100 - 000000000009bc00 (usable)
BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e6000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dffe0000 (usable)
BIOS-e820: 00000000dffe0000 - 00000000dffee000 (ACPI data)
BIOS-e820: 00000000dffee000 - 00000000dffff050 (ACPI NVS)
BIOS-e820: 00000000dffff050 - 00000000e0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000004020000000 (usable)
Early serial console at I/O port 0x3f8 (options '115200n8')
console [uart0] enabled
Entering add_active_range(0, 1, 155) 0 entries of 3200 used
Entering add_active_range(0, 256, 917472) 1 entries of 3200 used
Entering add_active_range(0, 1048576, 67239936) 2 entries of 3200 used
end_pfn_map = 67239936
Kernel panic - not syncing: Overlapping early reservations 8000-109fff PGTABLE to 9bc00-9dbff EBDA

Change back the logic. We DO need extra space for pmds when direct_gbpages is not enabled.

Signed-off-by: Yinghai Lu <[email protected]>

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index eb376b5..31f0e82 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -370,7 +370,7 @@ static void __init find_early_table_space(unsigned long end)

puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
tables = round_up(puds * sizeof(pud_t), PAGE_SIZE);
- if (direct_gbpages) {
+ if (!direct_gbpages) {
pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
tables += round_up(pmds * sizeof(pmd_t), PAGE_SIZE);
}
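
For anyone puzzling over the numbers: a standalone back-of-the-envelope
program (not kernel code; it only mirrors the arithmetic in the hunk above,
assuming the usual x86_64 constants of 8-byte pud_t/pmd_t entries, 1GB per
pud entry, 2MB per pmd entry and 4KB pages) applied to this box's e820 map,
which tops out at 0x4020000000:

#include <stdio.h>

#define PAGE_SIZE 4096UL
#define PMD_SIZE  (2UL << 20)   /* one pmd entry maps 2MB */
#define PUD_SIZE  (1UL << 30)   /* one pud entry maps 1GB */
#define ROUND_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

typedef unsigned long pud_t, pmd_t;     /* 8 bytes each on x86_64 */

int main(void)
{
        unsigned long end = 0x4020000000UL;     /* top of usable RAM above */
        unsigned long puds = (end + PUD_SIZE - 1) / PUD_SIZE;
        unsigned long pmds = (end + PMD_SIZE - 1) / PMD_SIZE;
        unsigned long tables = ROUND_UP(puds * sizeof(pud_t), PAGE_SIZE);

        printf("pud pages only (old condition): %#lx bytes\n", tables);
        tables += ROUND_UP(pmds * sizeof(pmd_t), PAGE_SIZE);
        printf("pud + pmd pages (fixed):        %#lx bytes\n", tables);
        /*
         * That is 0x1000 vs 0x102000 bytes, and 0x8000 + 0x102000 - 1
         * = 0x109fff matches the end of the PGTABLE span in the panic
         * above: the mapping really does consume the pmd pages, so
         * they have to be budgeted for.
         */
        return 0;
}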


2008-02-07 04:16:51

by Yinghai Lu

Subject: [PATCH] x86_64: clean up find_e820_area

[PATCH] x86_64: clean up find_e820_area

Change size to unsigned long, because callers and users all use unsigned long.
Also make bad_addr take an align parameter.

Signed-off-by: Yinghai Lu <[email protected]>


Index: linux-2.6/arch/x86/kernel/e820_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820_64.c
+++ linux-2.6/arch/x86/kernel/e820_64.c
@@ -95,7 +95,7 @@ void __init early_res_to_bootmem(void)
}

/* Check for already reserved areas */
-static inline int bad_addr(unsigned long *addrp, unsigned long size)
+static inline int bad_addr(unsigned long *addrp, unsigned long size, unsigned long align)
{
int i;
unsigned long addr = *addrp, last;
@@ -105,7 +105,7 @@ again:
for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
struct early_res *r = &early_res[i];
if (last >= r->start && addr < r->end) {
- *addrp = addr = r->end;
+ *addrp = addr = round_up(r->end, align);
changed = 1;
goto again;
}
@@ -174,26 +174,27 @@ int __init e820_all_mapped(unsigned long
* Find a free area with specified alignment in a specific range.
*/
unsigned long __init find_e820_area(unsigned long start, unsigned long end,
- unsigned size, unsigned long align)
+ unsigned long size, unsigned long align)
{
int i;
- unsigned long mask = ~(align - 1);

for (i = 0; i < e820.nr_map; i++) {
struct e820entry *ei = &e820.map[i];
- unsigned long addr = ei->addr, last;
+ unsigned long addr, last;
+ unsigned long ei_last;

if (ei->type != E820_RAM)
continue;
+ addr = round_up(ei->addr, align);
+ ei_last = ei->addr + ei->size;
if (addr < start)
- addr = start;
- if (addr > ei->addr + ei->size)
+ addr = round_up(start, align);
+ if (addr > ei_last)
continue;
- while (bad_addr(&addr, size) && addr+size <= ei->addr+ei->size)
+ while (bad_addr(&addr, size, align) && addr+size <= ei_last)
;
- addr = (addr + align - 1) & mask;
last = addr + size;
- if (last > ei->addr + ei->size)
+ if (last > ei_last)
continue;
if (last > end)
continue;
Index: linux-2.6/include/asm-x86/e820_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/e820_64.h
+++ linux-2.6/include/asm-x86/e820_64.h
@@ -15,7 +15,7 @@

#ifndef __ASSEMBLY__
extern unsigned long find_e820_area(unsigned long start, unsigned long end,
- unsigned size, unsigned long align);
+ unsigned long size, unsigned long align);
extern void add_memory_region(unsigned long start, unsigned long size,
int type);
extern void setup_memory_region(void);
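
A compact userspace rendering of the search this patch ends up with (an
illustration only: the reservation table and the numbers are made up, and
only the "bump past a conflicting reservation and re-align" behaviour
mirrors the new bad_addr()):

#include <stdio.h>

#define ROUND_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

struct res { unsigned long start, end; };

/* stand-in for early_res[]: a single already-reserved range */
static struct res reserved[] = {
        { 0x9bc00UL, 0x9dc00UL },       /* e.g. the EBDA */
};

/* nonzero if [*addrp, *addrp+size) hits a reservation; on a hit *addrp
 * is moved just past it and rounded up to 'align' */
static int bad_addr(unsigned long *addrp, unsigned long size,
                    unsigned long align)
{
        unsigned long addr = *addrp, last;
        unsigned int i;
        int changed = 0;
again:
        last = addr + size;
        for (i = 0; i < sizeof(reserved) / sizeof(reserved[0]); i++) {
                struct res *r = &reserved[i];
                if (last >= r->start && addr < r->end) {
                        *addrp = addr = ROUND_UP(r->end, align);
                        changed = 1;
                        goto again;
                }
        }
        return changed;
}

int main(void)
{
        /* try to place 0x102000 bytes, 4KB aligned, starting at 0x8000 */
        unsigned long addr = ROUND_UP(0x8000UL, 0x1000UL);
        unsigned long size = 0x102000UL, align = 0x1000UL;

        while (bad_addr(&addr, size, align))
                ;
        printf("placed at %#lx\n", addr);       /* 0x9e000, past the EBDA */
        return 0;
}

Note the old code aligned the address only after the bad_addr() loop, so the
final, aligned range was never re-checked against the reservations; doing the
round_up() inside the skip keeps the cursor aligned throughout.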

2008-02-07 07:23:55

by Andi Kleen

Subject: Re: [PATCH] x86_64: fix page table size

Yinghai Lu <[email protected]> writes:
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index eb376b5..31f0e82 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -370,7 +370,7 @@ static void __init find_early_table_space(unsigned long end)
>
> puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
> tables = round_up(puds * sizeof(pud_t), PAGE_SIZE);
> - if (direct_gbpages) {
> + if (!direct_gbpages) {


What tree did you use? The patch I submitted had

if (direct_gbpages == GBP_ON) {

here. And the actual direct gbpages patch has not even made
it to Linus' tree yet, just all the other gbpages patches.

But yes, if it really was if (direct_gbpages) { here, then your change
is of course correct.

-Andi

2008-02-07 08:10:00

by Ingo Molnar

Subject: Re: [PATCH] x86_64: fix page table size


* Yinghai Lu <[email protected]> wrote:

> [PATCH] x86_64: fix page table size

> Entering add_active_range(0, 1048576, 67239936) 2 entries of 3200 used
> end_pfn_map = 67239936
> Kernel panic - not syncing: Overlapping early reservations 8000-109fff
> PGTABLE to 9bc00-9dbff EBDA
>
> change back the logic. we DO need extra space for pmds when
> direct_gbpages is not there.

> @@ -370,7 +370,7 @@ static void __init find_early_table_space(unsigned long end)
>
> puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
> tables = round_up(puds * sizeof(pud_t), PAGE_SIZE);
> - if (direct_gbpages) {
> + if (!direct_gbpages) {

thanks Yinghai, applied!

I'm wondering why this bug didn't trigger more widely. It seems to me it
needs a serious amount of RAM to trigger - correct?

btw., it would be nice to have some "lots of RAM simulation" debugging
code which would just _fake_ a really large e820 map and would in the
end throw away the 'fake' pages later during bootup. Perhaps tell the
early allocator to never allocate into these fake areas [via a struct
e820 entry flag], but all our sizing code and the boot bitmaps, etc.
would be sized accordingly, as if we had this much RAM - and we'd
trigger these nuances. We could put this into a new "fakemem=128GB" boot
option and hence we could boot with fakemem=128GB on a 2GB box and could
at least hope to be able to boot [with some serious amount of RAM wasted
on over-sized pagetables, allocator bitmaps and mem_map[]]. Hm?

Ingo

2008-02-07 08:13:07

by Ingo Molnar

Subject: Re: [PATCH] x86_64: clean up find_e820_area


* Yinghai Lu <[email protected]> wrote:

> [PATCH] x86_64: clean up find_e820_area
>
> change size to unsigned long, becase caller and user all used unsigned
> long. also make bad_addr take align.

thanks, applied.

[ tiny nitpicking detail:

> -static inline int bad_addr(unsigned long *addrp, unsigned long size)
> +static inline int bad_addr(unsigned long *addrp, unsigned long size, unsigned long align)

please run scripts/checkpatch.pl over your patches; that would have
caught the above overlong line. ]

> @@ -105,7 +105,7 @@ again:
> for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) {
> struct early_res *r = &early_res[i];
> if (last >= r->start && addr < r->end) {
> - *addrp = addr = r->end;
> + *addrp = addr = round_up(r->end, align);

nice! This makes the reservation intent and effect much clearer.

Ingo

2008-02-07 08:36:30

by Yinghai Lu

Subject: Re: [PATCH] x86_64: fix page table size

On Feb 7, 2008 12:09 AM, Ingo Molnar <[email protected]> wrote:
>
> * Yinghai Lu <[email protected]> wrote:
>
> > [PATCH] x86_64: fix page table size
>
> > Entering add_active_range(0, 1048576, 67239936) 2 entries of 3200 used
> > end_pfn_map = 67239936
> > Kernel panic - not syncing: Overlapping early reservations 8000-109fff
> > PGTABLE to 9bc00-9dbff EBDA
> >
> > change back the logic. we DO need extra space for pmds when
> > direct_gbpages is not there.
>
> > @@ -370,7 +370,7 @@ static void __init find_early_table_space(unsigned long end)
> >
> > puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
> > tables = round_up(puds * sizeof(pud_t), PAGE_SIZE);
> > - if (direct_gbpages) {
> > + if (!direct_gbpages) {
>
> thanks Yinghai, applied!
>
> I'm wondering why this bug didnt trigger more widely. It seems to me it
> needs some serious amount of RAM to trigger this bug - correct?

Yes, it needs 256G.
With 128G it still works without the patch, because round_up(.., PAGE_SIZE)
gives some extra slack.
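
To put rough numbers on that (an illustration, not kernel code; 8 bytes per
pmd entry, 2MB mapped per entry, with the 0x8000 table start and the 0x9bc00
EBDA address taken from the panic log above):

#include <stdio.h>

int main(void)
{
        unsigned long gap = 0x9bc00UL - 0x8000UL;       /* room below the EBDA */
        unsigned long pmd_128g = (128UL << 30) / (2UL << 20) * 8;
        unsigned long pmd_256g = (256UL << 30) / (2UL << 20) * 8;

        /* prints: gap 591 KB, 128G pmds 512 KB, 256G pmds 1024 KB */
        printf("gap %lu KB, 128G pmds %lu KB, 256G pmds %lu KB\n",
               gap >> 10, pmd_128g >> 10, pmd_256g >> 10);
        return 0;
}

Only the larger configuration outgrows that gap, which would explain why
smaller boxes got away with the missing pmd term.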

>
> btw., it would be nice to have some "lots of RAM simulation" debugging
> code which would just _fake_ a really large e820 map and would in the
> end throw away the 'fake' pages later during bootup. Perhaps tell the
> early allocator to never allocate into these fake areas [via an struct
> e820 entry flag], but all our sizing code and the boot bitmaps, etc.
> would be sized accordingly, as if we had this much RAM - and we'd
> trigger these nuances. We could put this into a new "fakemem=128GB" boot
> option and hence we could boot with fakemem=128GB on a 2GB box and could
> at least hope to be able to boot [with some serious amount of RAM wasted
> on over-sized pagetables, allocator bitmaps and mem_map[]]. Hm?

Sounds interesting, but it needs some variable to prevent using non-existent pages.

YH

2008-02-07 08:49:40

by Ingo Molnar

Subject: Re: [PATCH] x86_64: fix page table size


* Yinghai Lu <[email protected]> wrote:

> > btw., it would be nice to have some "lots of RAM simulation"
> > debugging code which would just _fake_ a really large e820 map and
> > would in the end throw away the 'fake' pages later during bootup.
> > Perhaps tell the early allocator to never allocate into these fake
> > areas [via an struct e820 entry flag], but all our sizing code and
> > the boot bitmaps, etc. would be sized accordingly, as if we had this
> > much RAM - and we'd trigger these nuances. We could put this into a
> > new "fakemem=128GB" boot option and hence we could boot with
> > fakemem=128GB on a 2GB box and could at least hope to be able to
> > boot [with some serious amount of RAM wasted on over-sized
> > pagetables, allocator bitmaps and mem_map[]]. Hm?
>
> sound interesting. but need some variable to prevent using non exist
> page.

yeah, exactly. You could try using the "PG_arch_1" struct page
->flags bit - it's not utilized on x86. That way we could avoid freeing
it into the general page pool. We could check for it in
arch/x86/mm/init_64.c's online_page() function [that's where we release
most pages into the page pool].

Perhaps it might also be handy to introduce a page_is_fake_ram(addr)
method that searches the e820 maps for fake ram, to make sure we never
allocate such ranges.

Such a feature would be very useful.
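
Purely to illustrate the shape of the idea (nothing like this exists in the
tree: E820_FAKE_RAM, the struct layout and the map contents below are
made-up stand-ins, and page_is_fake_ram() is just the helper name suggested
above), a userspace mock:

#include <stdio.h>
#include <stdbool.h>

/* hypothetical extension of the e820 entry types */
enum { E820_RAM = 1, E820_RESERVED = 2, E820_FAKE_RAM = 100 };

struct e820entry { unsigned long long addr, size; int type; };

static struct e820entry map[] = {
        { 0x0000000000100000ULL,   0x7ff00000ULL, E820_RAM },      /* the real 2GB box    */
        { 0x0000000080000000ULL, 0x1f80000000ULL, E820_FAKE_RAM }, /* fakemem=128G filler */
};

/* the helper sketched above: never hand out pages from fake ranges */
static bool page_is_fake_ram(unsigned long long addr)
{
        unsigned int i;

        for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
                if (map[i].type == E820_FAKE_RAM &&
                    addr >= map[i].addr && addr < map[i].addr + map[i].size)
                        return true;
        return false;
}

int main(void)
{
        printf("0x200000 fake? %d\n", page_is_fake_ram(0x200000ULL));       /* 0 */
        printf("0x100000000 fake? %d\n", page_is_fake_ram(0x100000000ULL)); /* 1 */
        return 0;
}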

Ingo

2008-02-07 11:55:10

by Ingo Molnar

Subject: Re: [PATCH] x86_64: fix page table size


* Andi Kleen <[email protected]> wrote:

> Yinghai Lu <[email protected]> writes:
> >
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index eb376b5..31f0e82 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -370,7 +370,7 @@ static void __init find_early_table_space(unsigned long end)
> >
> > puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
> > tables = round_up(puds * sizeof(pud_t), PAGE_SIZE);
> > - if (direct_gbpages) {
> > + if (!direct_gbpages) {
>
> What tree did you use? The patch I submitted had
>
> if (direct_gbpages == GBP_ON) {

yes, the bug was introduced in your original submission of gbpages.

Ingo

2008-02-07 17:27:33

by Andi Kleen

Subject: Re: [PATCH] x86_64: fix page table size

On Thursday 07 February 2008 12:54:42 Ingo Molnar wrote:
> * Andi Kleen <[email protected]> wrote:
> > Yinghai Lu <[email protected]> writes:
> > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > > index eb376b5..31f0e82 100644
> > > --- a/arch/x86/mm/init_64.c
> > > +++ b/arch/x86/mm/init_64.c
> > > @@ -370,7 +370,7 @@ static void __init find_early_table_space(unsigned
> > > long end)
> > >
> > > puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
> > > tables = round_up(puds * sizeof(pud_t), PAGE_SIZE);
> > > - if (direct_gbpages) {
> > > + if (!direct_gbpages) {
> >
> > What tree did you use? The patch I submitted had
> >
> > if (direct_gbpages == GBP_ON) {
>
> yes, the bug was introduced in your original submission of gbpages

I see, yes. The original was OK I think, but it must have been a typo when
I switched the boolean to an enum at Thomas' request, and for some
reason the new breakage didn't show up in my testing.

I wonder why you didn't keep the enum even though Thomas
insisted on it. Since you removed it again, the safest would have been
to just keep it correct as it originally was. And it was rather pointless
to force me to make changes when you then half-rewrite the code afterwards
anyway. To be honest, that habit makes it rather unpleasant to submit
patches to you lately. At least you could have indicated that in advance
and saved everybody the trouble.

-Andi