2011-05-02 17:22:50

by Konrad Rzeszutek Wilk

Subject: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high'

In a couple of days I am thinking of asking Linus to pull this branch:

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/bug-fixes-for-rc5

Which has two fixes for a bootup regression (Linux can't boot under Xen
at all) introduced by "x86-64, mm: Put early page table high"
(git commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e). Stefano
and Yinghai have been working on patches fixing this regression
since the patch was still in x86/mm-core, before the 2.6.39 merge window opened.

But we haven't come up with an acceptable general solution yet, so this
patchset provides a workaround for the problem. Peter, Yinghai - what would be
the best forum/email/conference to hammer out a general solution for this?

Currently, there are a couple of ways of fixing this:
- use pvops hooks: http://marc.info/?i=1302607192-21355-2-git-send-email-stefano.stabellini@eu.citrix.com
- have a workaround in Xen MMU's early bootup code (which is what the
two patches sent with this email do).
- or remove the patch introducing the regression altogether.

Most important is to fix the regression, and the attached patches
achieve that. I want to remove this workaround patch when we
hammer out more appropriate semantics for the page table creation - but
that will take some time and the runway to do that in 2.6.39 is gone.

Konrad Rzeszutek Wilk (1):
xen/mmu: Add workaround "x86-64, mm: Put early page table high"

Stefano Stabellini (1):
xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top

arch/x86/xen/mmu.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 124 insertions(+), 1 deletions(-)


2011-05-02 17:22:46

by Konrad Rzeszutek Wilk

Subject: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

As a consequence of the commit:

commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
Author: Yinghai Lu <[email protected]>
Date: Fri Dec 17 16:58:28 2010 -0800

x86-64, mm: Put early page table high

the Linux kernel crashes under Xen:

mapping kernel into physical memory
Xen: setup ISA identity maps
about to get started...
(XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn b1d89 (pfn bacf7)
(XEN) mm.c:3027:d0 Error while pinning mfn b1d89
(XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
...

The reason is that at some point init_memory_mapping is going to reach
the pagetable pages area and map those pages too (mapping them as normal
memory that falls in the range of addresses passed to init_memory_mapping
as argument). Some of those pages are already pagetable pages (they are
in the range pgt_buf_start-pgt_buf_end) therefore they are going to be
mapped RO and everything is fine.
Some of these pages are not pagetable pages yet (they fall in the range
pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
are going to be mapped RW. When these pages become pagetable pages and
are hooked into the pagetable, xen will find that the guest has already
a RW mapping of them somewhere and fail the operation.
The reason Xen requires pagetables to be RO is that the hypervisor needs
to verify that the pagetables are valid before using them. The validation
operations are called "pinning" (more details in arch/x86/xen/mmu.c).
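
To make the failure mode concrete, here is a minimal user-space sketch of the
rule described above. It is not kernel or hypervisor code: the names
toy_pages, toy_map_range and toy_pin are hypothetical, and the "pin" check is
reduced to the single condition that matters here (no writable mapping of the
page may exist).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one entry per page frame of a tiny "guest". */
enum { NR_PAGES = 16 };

struct toy_page {
	bool mapped_rw;	/* the guest holds a writable mapping of this page */
};

static struct toy_page toy_pages[NR_PAGES];

/*
 * Map every page in [first, last). Pages inside [ro_start, ro_limit) are
 * mapped read-only, everything else read-write.
 */
static void toy_map_range(size_t first, size_t last,
			  size_t ro_start, size_t ro_limit)
{
	assert(first <= last && last <= NR_PAGES);
	for (size_t pfn = first; pfn < last; pfn++)
		toy_pages[pfn].mapped_rw = !(pfn >= ro_start && pfn < ro_limit);
}

/* Toy version of the hypervisor check: pinning fails on a RW mapping. */
static bool toy_pin(size_t pfn)
{
	assert(pfn < NR_PAGES);
	return !toy_pages[pfn].mapped_rw;
}
```

With pgt_buf_start = 4, pgt_buf_end = 6 and pgt_buf_top = 10, calling
toy_map_range(0, NR_PAGES, 4, 6) leaves page 6 writable, so a later
toy_pin(6) fails; using 10 as the read-only limit lets it succeed.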

In order to fix the issue we mark all the pages in the entire range
pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
is completed only the range pgt_buf_start-pgt_buf_end is reserved by
init_memory_mapping. Hence the kernel is going to crash as soon as one
of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
ranges are RO).

For this reason, this function is introduced which is called _after_
the init_memory_mapping has completed (in a perfect world we would
call this function from init_memory_mapping, but let's ignore that).

Because we are called _after_ init_memory_mapping the pgt_buf_[start,
end,top] have all changed to new values (b/c another init_memory_mapping
is called). Hence, the first time we enter this function, we save
away the pgt_buf_start value and update the pgt_buf_[end,top].

When we detect that the "old" pgt_buf_start through pgt_buf_end
PFNs have been reserved (so memblock_x86_reserve_range has been called),
we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.

And then we update those "old" pgt_buf_[end|top] with the new ones
so that we can redo this on the next pagetable.
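
The bookkeeping described in the last three paragraphs can be sketched in
user space as a small state machine. This is only an illustration under
stated assumptions: struct pgt_tracker, struct pfn_range and
pgt_tracker_step are hypothetical names, and the boolean old_reserved stands
in for the memblock lookup that the real function performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct pgt_tracker {
	uint64_t start, end, top;	/* saved ("old") buffer bounds, in PFNs */
	uint64_t last_set_rw;		/* saved 'end' of the last range made RW */
};

struct pfn_range {
	uint64_t first, last;		/* half-open: [first, last) */
};

/*
 * Returns true and fills *rw when the old end..top range should be made RW.
 * 'old_reserved' replaces the memblock query (assumption for this sketch).
 */
static bool pgt_tracker_step(struct pgt_tracker *t, uint64_t cur_start,
			     uint64_t cur_end, uint64_t cur_top,
			     bool old_reserved, struct pfn_range *rw)
{
	assert(t && rw);
	if (cur_end <= cur_start)
		return false;		/* no pagetable buffer in use */

	if (!t->start) {		/* first call: save it away */
		t->start = cur_start;
		t->end = cur_end;
		t->top = cur_top;
		return false;
	}
	if (!old_reserved) {		/* buffer still in use: track end/top */
		t->end = cur_end;
		t->top = cur_top;
		return false;
	}
	if (t->end == t->last_set_rw)	/* this buffer was already handled */
		return false;

	rw->first = t->end;		/* unused tail of the old buffer */
	rw->last = t->top;
	t->last_set_rw = t->end;
	t->start = cur_start;		/* get ready for the next pagetable */
	t->end = cur_end;
	t->top = cur_top;
	return true;
}
```

A caller would feed it the current pgt_buf_[start,end,top] values on every
invocation and, whenever it returns true, make the pages in
[rw.first, rw.last) writable again.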

Reviewed-by: Jeremy Fitzhardinge <[email protected]>
[v1: Updated with Jeremy's comments]
[v2: Added the crash output]
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/xen/mmu.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index aef7af9..1bca25f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1463,6 +1463,119 @@ static int xen_pgd_alloc(struct mm_struct *mm)
return ret;
}

+#ifdef CONFIG_X86_64
+static __initdata u64 __last_pgt_set_rw = 0;
+static __initdata u64 __pgt_buf_start = 0;
+static __initdata u64 __pgt_buf_end = 0;
+static __initdata u64 __pgt_buf_top = 0;
+/*
+ * As a consequence of the commit:
+ *
+ * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
+ * Author: Yinghai Lu <[email protected]>
+ * Date: Fri Dec 17 16:58:28 2010 -0800
+ *
+ * x86-64, mm: Put early page table high
+ *
+ * at some point init_memory_mapping is going to reach the pagetable pages
+ * area and map those pages too (mapping them as normal memory that falls
+ * in the range of addresses passed to init_memory_mapping as argument).
+ * Some of those pages are already pagetable pages (they are in the range
+ * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
+ * everything is fine.
+ * Some of these pages are not pagetable pages yet (they fall in the range
+ * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
+ * are going to be mapped RW. When these pages become pagetable pages and
+ * are hooked into the pagetable, xen will find that the guest has already
+ * a RW mapping of them somewhere and fail the operation.
+ * The reason Xen requires pagetables to be RO is that the hypervisor needs
+ * to verify that the pagetables are valid before using them. The validation
+ * operations are called "pinning".
+ *
+ * In order to fix the issue we mark all the pages in the entire range
+ * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
+ * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
+ * init_memory_mapping. Hence the kernel is going to crash as soon as one
+ * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
+ * ranges are RO).
+ *
+ * For this reason, 'mark_rw_past_pgt' is introduced which is called _after_
+ * the init_memory_mapping has completed (in a perfect world we would
+ * call this function from init_memory_mapping, but lets ignore that).
+ *
+ * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
+ * end,top] have all changed to new values (b/c init_memory_mapping
+ * is called and setting up another new page-table). Hence, the first time
+ * we enter this function, we save away the pgt_buf_start value and update
+ * the pgt_buf_[end,top].
+ *
+ * When we detect that the "old" pgt_buf_start through pgt_buf_end
+ * PFNs have been reserved (so memblock_x86_reserve_range has been called),
+ * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
+ *
+ * And then we update those "old" pgt_buf_[end|top] with the new ones
+ * so that we can redo this on the next pagetable.
+ */
+static __init void mark_rw_past_pgt(void) {
+
+	if (pgt_buf_end > pgt_buf_start) {
+		u64 addr, size;
+
+		/* Save it away. */
+		if (!__pgt_buf_start) {
+			__pgt_buf_start = pgt_buf_start;
+			__pgt_buf_end = pgt_buf_end;
+			__pgt_buf_top = pgt_buf_top;
+			return;
+		}
+		/* If we get the range that starts at __pgt_buf_end that means
+		 * the range is reserved, and that in 'init_memory_mapping'
+		 * the 'memblock_x86_reserve_range' has been called with the
+		 * outdated __pgt_buf_start, __pgt_buf_end (the "new"
+		 * pgt_buf_[start|end|top] refer now to a new pagetable.
+		 * Note: we are called _after_ the pgt_buf_[..] have been
+		 * updated.*/
+
+		addr = memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start),
+						       &size, PAGE_SIZE);
+
+		/* Still not reserved, meaning 'memblock_x86_reserve_range'
+		 * hasn't been called yet. Update the _end and _top.*/
+		if (addr == PFN_PHYS(__pgt_buf_start)) {
+			__pgt_buf_end = pgt_buf_end;
+			__pgt_buf_top = pgt_buf_top;
+			return;
+		}
+
+		/* OK, the area is reserved, meaning it is time for us to
+		 * set RW for the old end->top PFNs. */
+
+		/* ..unless we had already done this. */
+		if (__pgt_buf_end == __last_pgt_set_rw)
+			return;
+
+		addr = PFN_PHYS(__pgt_buf_end);
+
+		/* set as RW the rest */
+		printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
+			PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top));
+
+		while (addr < PFN_PHYS(__pgt_buf_top)) {
+			make_lowmem_page_readwrite(__va(addr));
+			addr += PAGE_SIZE;
+		}
+		/* And update everything so that we are ready for the next
+		 * pagetable (the one created for regions past 4GB) */
+		__last_pgt_set_rw = __pgt_buf_end;
+		__pgt_buf_start = pgt_buf_start;
+		__pgt_buf_end = pgt_buf_end;
+		__pgt_buf_top = pgt_buf_top;
+	}
+	return;
+}
+#else
+static __init void mark_rw_past_pgt(void) { }
+#endif
static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
#ifdef CONFIG_X86_64
@@ -1489,6 +1602,14 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
 	unsigned long pfn = pte_pfn(pte);
 
 	/*
+	 * A bit of optimization. We do not need to call the workaround
+	 * when xen_set_pte_init is called with a PTE with 0 as PFN.
+	 * That is b/c the pagetable at that point are just being populated
+	 * with empty values and we can save some cycles by not calling
+	 * the 'memblock' code.*/
+	if (pfn)
+		mark_rw_past_pgt();
+	/*
 	 * If the new pfn is within the range of the newly allocated
 	 * kernel pagetable, and it isn't being mapped into an
 	 * early_ioremap fixmap slot as a freshly allocated page, make sure
@@ -1997,6 +2118,8 @@ __init void xen_ident_map_ISA(void)

 static __init void xen_post_allocator_init(void)
 {
+	mark_rw_past_pgt();
+
 #ifdef CONFIG_XEN_DEBUG
 	pv_mmu_ops.make_pte = PV_CALLEE_SAVE(xen_make_pte_debug);
 #endif
--
1.7.1

2011-05-02 17:22:44

by Konrad Rzeszutek Wilk

Subject: [PATCH 2/2] xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top

From: Stefano Stabellini <[email protected]>

mask_rw_pte currently treats a pfn as a pagetable page if it
falls in the range pgt_buf_start - pgt_buf_end, but that is incorrect
because pgt_buf_end is a moving target: pgt_buf_top is the real
boundary.
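
As an illustration only, the condition before and after this change can be
modelled in user space as below. The function names wrprotect_check,
wrprotect_old and wrprotect_new are hypothetical, and the boolean
early_ioremap stands in for is_early_ioremap_ptep(ptep).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Shared shape of the check; 'limit' is the upper bound being compared. */
static bool wrprotect_check(bool early_ioremap, uint64_t pfn,
			    uint64_t start, uint64_t end, uint64_t limit)
{
	assert(start <= end && end <= limit);
	if (!early_ioremap)
		return pfn >= start && pfn < limit;
	/* early_ioremap slot: only the freshly allocated page stays RW */
	return pfn != end - 1;
}

/* Before the change: the upper bound is pgt_buf_end. */
static bool wrprotect_old(bool early_ioremap, uint64_t pfn,
			  uint64_t start, uint64_t end, uint64_t top)
{
	(void)top;	/* the old check never looked at pgt_buf_top */
	return wrprotect_check(early_ioremap, pfn, start, end, end);
}

/* After the change: the upper bound is pgt_buf_top. */
static bool wrprotect_new(bool early_ioremap, uint64_t pfn,
			  uint64_t start, uint64_t end, uint64_t top)
{
	return wrprotect_check(early_ioremap, pfn, start, end, top);
}
```

For start = 4, end = 6, top = 10, a normal mapping of pfn 7 is left writable
by wrprotect_old but write-protected by wrprotect_new, which is the case this
patch is after.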

Signed-off-by: Stefano Stabellini <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/xen/mmu.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 1bca25f..55c965b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1616,7 +1616,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
 	 * it is RO.
 	 */
 	if (((!is_early_ioremap_ptep(ptep) &&
-	      pfn >= pgt_buf_start && pfn < pgt_buf_end)) ||
+	      pfn >= pgt_buf_start && pfn < pgt_buf_top)) ||
 	    (is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - 1)))
 		pte = pte_wrprotect(pte);

--
1.7.1

2011-05-02 17:31:34

by H. Peter Anvin

Subject: Re: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high'

On 05/02/2011 10:22 AM, Konrad Rzeszutek Wilk wrote:
>
> But we haven't come up with an acceptable general solution yet, so this
> patchset provides a workaround for the problem. Peter, Yinghai - what would be
> the best forum/email/conference to hammer out a general solution for this?
>
> Currently, there are couple of ways of fixing this:
> - use pvops hooks: http://marc.info/?i=1302607192-21355-2-git-send-email-stefano.stabellini@eu.citrix.com
> - have a workaround in Xen MMU's early bootup code (which is what these
> two patches to this email have).
> - or remove the patch introducing the regression altogether.
>
> Foremost important is to fix the regression, and attached patches
> achieve that. I want to remove this workaround patch when we
> hammer out more appropriate semantics for the page table creation - but
> that will take some time and the runway to do that in 2.6.39 is gone.
>

My inclination would be to apply your workaround -- are there any
adverse effects to doing that?

-hpa

[Sorry if I have missed any emails recently... apparently my email was
significantly on the fritz over the last few days.]

2011-05-02 18:08:16

by Konrad Rzeszutek Wilk

Subject: Re: [Xen-devel] Re: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high'

On Mon, May 02, 2011 at 10:31:21AM -0700, H. Peter Anvin wrote:
> On 05/02/2011 10:22 AM, Konrad Rzeszutek Wilk wrote:
> >
> > But we haven't come up with an acceptable general solution yet, so this
> > patchset provides a workaround for the problem. Peter, Yinghai - what would be
> > the best forum/email/conference to hammer out a general solution for this?
> >
> > Currently, there are couple of ways of fixing this:
> > - use pvops hooks: http://marc.info/?i=1302607192-21355-2-git-send-email-stefano.stabellini@eu.citrix.com
> > - have a workaround in Xen MMU's early bootup code (which is what these
> > two patches to this email have).
> > - or remove the patch introducing the regression altogether.
> >
> > Foremost important is to fix the regression, and attached patches
> > achieve that. I want to remove this workaround patch when we
> > hammer out more appropriate semantics for the page table creation - but
> > that will take some time and the runway to do that in 2.6.39 is gone.
> >
>
> My inclination would be to apply your workaround -- are there any
> adverse effects to doing that?

There is a bootup slowdown (not noticeable). That is because we call the
'memblock_x86_find_in_range_size' function on every PTE table creation (only during
bootup of course).

Testing-wise, on the machines on which the regression occurred, with these two
patches the regression disappears - so that is a good sign. I am testing
it on some more today to assure myself I am not missing anything.

>
> -hpa
>
> [Sorry if I have missed any emails recently... apparently my email was
> significantly on the fritz over the last few days.]

Yikes - I hate when that happens. I was wondering why you went so silent
on some of the emails.

2011-05-02 18:33:26

by H. Peter Anvin

Subject: Re: [Xen-devel] Re: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high'

On 05/02/2011 11:08 AM, Konrad Rzeszutek Wilk wrote:
>>
>> My inclination would be to apply your workaround -- are there any
>> adverse effects to doing that?
>
> There is a bootup slowdown (not noticeable). That is because we call the
> 'memblock_find_range' function on every PTE table creation (only during
> bootup of course).
>
> Testing wise, on the machines on which the regression occurred, with these two
> patches the regression disappears - so that is a good sign. I am testing
> it on some more today to assure myself I am not missing anything.
>

OK, sounds like a plan then. I like it because it doesn't affect the
native kernel.

-hpa

2011-05-02 19:34:52

by Konrad Rzeszutek Wilk

Subject: Re: [Xen-devel] Re: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high'

On Mon, May 02, 2011 at 11:33:16AM -0700, H. Peter Anvin wrote:
> On 05/02/2011 11:08 AM, Konrad Rzeszutek Wilk wrote:
> >>
> >> My inclination would be to apply your workaround -- are there any
> >> adverse effects to doing that?
> >
> > There is a bootup slowdown (not noticeable). That is because we call the
> > 'memblock_find_range' function on every PTE table creation (only during
> > bootup of course).
> >
> > Testing wise, on the machines on which the regression occurred, with these two
> > patches the regression disappears - so that is a good sign. I am testing
> > it on some more today to assure myself I am not missing anything.
> >
>
> OK, sounds like a plan then. I like it because it doesn't affect the
> native kernel.

<laughs> I figured :-)

Moving forward I really want to get rid of this wart.

Not sure if it is possible, but I was thinking it would be nice to get
you, Yinghai, Jeremy, Stefano all in one place (phone or web-conference thing)
to sketch out some ideas and hammer something out.

What days would work best?

2011-05-02 19:59:48

by H. Peter Anvin

Subject: Re: [Xen-devel] Re: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high'

On 05/02/2011 12:34 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, May 02, 2011 at 11:33:16AM -0700, H. Peter Anvin wrote:
>> On 05/02/2011 11:08 AM, Konrad Rzeszutek Wilk wrote:
>>>>
>>>> My inclination would be to apply your workaround -- are there any
>>>> adverse effects to doing that?
>>>
>>> There is a bootup slowdown (not noticeable). That is because we call the
>>> 'memblock_find_range' function on every PTE table creation (only during
>>> bootup of course).
>>>
>>> Testing wise, on the machines on which the regression occurred, with these two
>>> patches the regression disappears - so that is a good sign. I am testing
>>> it on some more today to assure myself I am not missing anything.
>>>
>>
>> OK, sounds like a plan then. I like it because it doesn't affect the
>> native kernel.
>
> <laughs> I figured :-)
>
> Moving forward I really want to get rid of this wart.
>
> Not sure if it is possible, but I was thinking it would be nice to get
> you, Yinghai, Jeremy, Stefano all in one place (phone or web-conference thing)
> to sketch out some ideas and hammer something out.
>
> What days would work best?

First things first... are you pushing the workaround (you can add my
Acked-by:) or should I?

Second... I don't know what physical locations and/or time zones every
one is in, which is probably the first thing.

Third, I will be travelling a lot over the next two weeks, plus we are
getting close to merge window time, so it might be hard to schedule
before the merge window. Are you looking at trying to push something
for .40? If so, you probably should have an implementation in mind
already. If not, I suggest we aim for after the merge window is over or
at least quieted down.

-hpa

2011-05-02 20:07:42

by Konrad Rzeszutek Wilk

Subject: Re: [Xen-devel] Re: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high'

On Mon, May 02, 2011 at 12:59:08PM -0700, H. Peter Anvin wrote:
> On 05/02/2011 12:34 PM, Konrad Rzeszutek Wilk wrote:
> > On Mon, May 02, 2011 at 11:33:16AM -0700, H. Peter Anvin wrote:
> >> On 05/02/2011 11:08 AM, Konrad Rzeszutek Wilk wrote:
> >>>>
> >>>> My inclination would be to apply your workaround -- are there any
> >>>> adverse effects to doing that?
> >>>
> >>> There is a bootup slowdown (not noticeable). That is because we call the
> >>> 'memblock_find_range' function on every PTE table creation (only during
> >>> bootup of course).
> >>>
> >>> Testing wise, on the machines on which the regression occurred, with these two
> >>> patches the regression disappears - so that is a good sign. I am testing
> >>> it on some more today to assure myself I am not missing anything.
> >>>
> >>
> >> OK, sounds like a plan then. I like it because it doesn't affect the
> >> native kernel.
> >
> > <laughs> I figured :-)
> >
> > Moving forward I really want to get rid of this wart.
> >
> > Not sure if it is possible, but I was thinking it would be nice to get
> > you, Yinghai, Jeremy, Stefano all in one place (phone or web-conference thing)
> > to sketch out some ideas and hammer something out.
> >
> > What days would work best?
>
> First things first... are you pushing the workaround (you can add my
> Acked-by:) or should I?

I will do it tomorrow and stick your Acked-by on both patches.
>
> Second... I don't know what physical locations and/or time zones every
> one is in, which is probably the first thing.

EST for me, Jeremy is PST (California), Stefano is in UK, so that is UTC+0, and
no idea where Yinghai is.

But people are on vacation right now, or will be until next week, plus
> Third, I will be travelling a lot over the next two weeks, plus we are

.. you are traveling.
> getting close to merge window time, so it might be hard to schedule
> before the merge window. Are you looking at trying to push something
> for .40? If so, you probably should have an implementation in mind
> already. If not, I suggest we aim for after the merge window is over or
> at least quieted down.

.. and I don't have any implementation in mind yet. Was hoping we could
brainstorm something together.

OK, let me bug you guys when the merge window is over.

>
> -hpa

2011-05-02 20:16:29

by Yinghai Lu

Subject: Re: [Xen-devel] Re: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high'

On Mon, May 2, 2011 at 1:07 PM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Mon, May 02, 2011 at 12:59:08PM -0700, H. Peter Anvin wrote:
>> >> OK, sounds like a plan then. I like it because it doesn't affect the
>> >> native kernel.

Xen should set RAM for page-table to RO after init_memory_mapping.

>> >
>> > <laughs> I figured :-)
>> >

...

> EST for me, Jeremy is PST (California), Stefano is in UK, so that is UTC+0, and
> no idea where Yinghai is.

California

Thanks

Yinghai

2011-05-03 00:55:47

by Daniel Kiper

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Mon, May 02, 2011 at 01:22:21PM -0400, Konrad Rzeszutek Wilk wrote:
> As a consequence of the commit:
>
> commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> Author: Yinghai Lu <[email protected]>
> Date: Fri Dec 17 16:58:28 2010 -0800
>
> x86-64, mm: Put early page table high
>
> it causes the Linux kernel to crash under Xen:
>
> mapping kernel into physical memory
> Xen: setup ISA identity maps
> about to get started...
> (XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn b1d89 (pfn bacf7)
> (XEN) mm.c:3027:d0 Error while pinning mfn b1d89
> (XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
> (XEN) domain_crash_sync called from entry.S
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> ...

I was hit by this bug when I was working on memory hotplug.
After some investigation I identified the above-mentioned patch
as the culprit, and later I discovered that you are working on that
issue. I have tested your patch and discovered some issues with it.
First of all, it has compilation issues on gcc version 4.1.2 20061115
(prerelease) (Debian 4.1.1-21). Details below.

Additionally, I think that your patch does not work as you expected.
I found that git commit 24bdb0b62cc82120924762ae6bc85afc8c3f2b26
(xen: do not create the extra e820 region at an addr lower than 4G)
does this work (to some extent). When that commit is removed, domU
crashes with the following error:

(early) Linux version 2.6.39-rc5-x86_64.xenU.all.r0+ (root@dev-00) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #5 SMP Tue May 3 01:43:26 CEST 2011
(early) Command line: root=/dev/xvda debug earlyprintk=xen noapic nolapic console=hvc0
(early) ACPI in unprivileged domain disabled
(early) released 0 pages of unused memory
(early) Set 0 page(s) to 1-1 mapping.
(early) BIOS-provided physical RAM map:
(early) Xen: 0000000000000000 - 00000000000a0000 (usable)
(early) Xen: 00000000000a0000 - 0000000000100000 (reserved)
(early) Xen: 0000000000100000 - 0000000026000000 (usable)
(early) bootconsole [xenboot0] enabled
(early) NX (Execute Disable) protection: active
(early) DMI not present or invalid.
(early) e820 update range: 0000000000000000 - 0000000000010000 (early) (usable)(early) ==> (early) (reserved)(early)
(early) e820 remove range: 00000000000a0000 - 0000000000100000 (early) (usable)(early)
(early) No AGP bridge found
(early) last_pfn = 0x26000 max_arch_pfn = 0x400000000
(early) initial memory mapped : 0 - 01693000
(early) Base memory trampoline at [ffff88000009e000] 9e000 size 8192
(early) init_memory_mapping: 0000000000000000-0000000026000000
(early) 0000000000 - 0026000000 page 4k
(early) kernel direct mapping tables up to 26000000 @ 256ce000-25800000
(early) BUG: unable to handle kernel (early) NULL pointer dereference(early) at (null)
(early) IP:(early) [<ffffffff814a33f2>] find_range_array+0x4e/0x57
(early) PGD 0 (early)
(early) Oops: 0003 [#1] (early) SMP (early)
(early) last sysfs file:
(early) CPU 0 (early)
(early) Modules linked in:(early)
(early)
(early) Pid: 0, comm: swapper Not tainted 2.6.39-rc5-x86_64.xenU.all.r0+ #5(early) (early)
(early) RIP: e030:[<ffffffff814a33f2>] (early) [<ffffffff814a33f2>] find_range_array+0x4e/0x57
(early) RSP: e02b:ffffffff81427e58 EFLAGS: 00010046
(early) RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 00000000000000e0
(early) RDX: ffff8800257fff20 RSI: 00000000257fff20 RDI: ffff8800257fff20
(early) RBP: ffffffff81427e68 R08: 0000000000000005 R09: 0000000000000050
(early) R10: 0000000000000005 R11: 0000000025800000 R12: ffffffff814bd000
(early) R13: 0000000000000000 R14: 000000000000000e R15: 0000000000001000
(early) FS: 0000000000000000(0000) GS:ffffffff8147f000(0000) knlGS:0000000000000000
(early) CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
(early) CR2: 0000000000000000 CR3: 0000000001441000 CR4: 0000000000002660
(early) DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
(early) DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
(early) Process swapper (pid: 0, threadinfo ffffffff81426000, task ffffffff81449020)
(early) Stack:
(early) ffffffff81427e88(early) 0000000000000000(early) ffffffff81427ea8(early) ffffffff814a343d(early)
(early) 0000000000000100(early) 0000000026000000(early) ffffffff814bd000(early) ffffffffffffffff(early)
(early) 0000000000000000(early) 0000000000000000(early) ffffffff81427eb8(early) ffffffff814a3577(early)
(early) Call Trace:
(early) [<ffffffff814a343d>] __memblock_x86_memory_in_range+0x42/0x171
(early) [<ffffffff814a3577>] memblock_x86_memory_in_range+0xb/0xd
(early) [<ffffffff81497348>] memblock_find_dma_reserve+0x15/0x3b
(early) [<ffffffff81496b50>] setup_arch+0x721/0x7e5
(early) [<ffffffff810065fd>] ? __raw_callee_save_xen_irq_disable+0x11/0x1e
(early) [<ffffffff81493955>] start_kernel+0x8a/0x2db
(early) [<ffffffff81493299>] x86_64_start_reservations+0x84/0x88
(early) [<ffffffff814947d9>] xen_start_kernel+0x3e1/0x3e8
(early) Code: (early) 66 (early) 00 (early) 00 (early) 48 (early) 85 (early) c0 (early) 75 (early) 0c (early) 48 (early) c7 (early) c7 (early) 86 (early) 80 (early) 3a (early) 81 (early) e8 (early) 55 (early) 41 (early) b9 (early) ff (early) 48 (early) bf (early) 00 (early) 00 (early) 00 (early) 00 (early) 00 (early) 88 (early) ff (early) ff (early) 48 (early) 89 (early) d9 (early) 48 (early) 8d (early) 14 (early) 38 (early) 31 (early) c0 (early) fc (early) 48 (early) 89 (early) d7 (early) <f3> (early) aa (early) 48 (early) 89 (early) d0 (early) 5f (early) 5b (early) c9 (early) c3 (early) 55 (early) 48 (early) 89 (early) e5 (early) 41 (early) 57 (early) 49 (early) 89 (early) f7 (early) 49 (early) c1 (early) ef (early)
(early) RIP (early) [<ffffffff814a33f2>] find_range_array+0x4e/0x57
(early) RSP <ffffffff81427e58>
(early) CR2: 0000000000000000
(early) ---[ end trace 4eaa2a86a8e2da22 ]---
(early) Kernel panic - not syncing: Attempted to kill the idle task!
(early) Pid: 0, comm: swapper Tainted: G D 2.6.39-rc5-x86_64.xenU.all.r0+ #5
(early) Call Trace:
(early) [<ffffffff810375ed>] panic+0xbd/0x1c7
(early) [<ffffffff810383c7>] ? printk+0x67/0x69
(early) [<ffffffff811fd7b0>] ? account+0xe1/0xf0
(early) [<ffffffff8103a641>] do_exit+0xb4/0x676
(early) [<ffffffff81037b53>] ? spin_unlock_irqrestore+0x9/0xb
(early) [<ffffffff81038c24>] ? kmsg_dump+0x4a/0xd9
(early) [<ffffffff8100d11d>] oops_end+0xc1/0xc9
(early) [<ffffffff810252b1>] no_context+0x1f5/0x204
(early) [<ffffffff81025448>] __bad_area_nosemaphore+0x188/0x1ab
(early) [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
(early) [<ffffffff810254e1>] bad_area_nosemaphore+0xe/0x10
(early) [<ffffffff81025991>] do_page_fault+0x18c/0x337
(early) [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
(early) [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
(early) [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
(early) [<ffffffff812eebd5>] page_fault+0x25/0x30
(early) [<ffffffff814a33f2>] ? find_range_array+0x4e/0x57
(early) [<ffffffff814a33ca>] ? find_range_array+0x26/0x57
(early) [<ffffffff814a343d>] __memblock_x86_memory_in_range+0x42/0x171
(early) [<ffffffff814a3577>] memblock_x86_memory_in_range+0xb/0xd
(early) [<ffffffff81497348>] memblock_find_dma_reserve+0x15/0x3b
(early) [<ffffffff81496b50>] setup_arch+0x721/0x7e5
(early) [<ffffffff810065fd>] ? __raw_callee_save_xen_irq_disable+0x11/0x1e
(early) [<ffffffff81493955>] start_kernel+0x8a/0x2db
(early) [<ffffffff81493299>] x86_64_start_reservations+0x84/0x88
(early) [<ffffffff814947d9>] xen_start_kernel+0x3e1/0x3e8

I think that (Stefano, please confirm or not) this patch was prepared
as a workaround for similar issues. However, I do not like this patch
because on systems with a small amount of memory it leaves a huge (to some
extent) hole between max_low_pfn and 4G. Additionally, it affects
memory hotplug a bit because it allocates memory starting from the current
max_mfn. It also breaks memory hotplug on i386 (maybe also other
things, however, I could not confirm that). If it stays for some
reason it should be amended in the following way:

#ifdef CONFIG_X86_32
	xen_extra_mem_start = mem_end;
#else
	xen_extra_mem_start = max((1ULL << 32), mem_end);
#endif
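
To show what the proposed amendment would compute, here is a small
user-space sketch. It is an assumption-laden illustration, not the kernel
code: extra_mem_start is a hypothetical helper, and its is_32bit parameter
stands in for the CONFIG_X86_32 compile-time switch.

```c
#include <assert.h>
#include <stdint.h>

#define FOUR_GIB (1ULL << 32)

/*
 * Where the extra memory region would start for a given end of RAM.
 * 'is_32bit' replaces the #ifdef CONFIG_X86_32 (hypothetical parameter).
 */
static uint64_t extra_mem_start(uint64_t mem_end, int is_32bit)
{
	assert(mem_end > 0);
	if (is_32bit)
		return mem_end;		/* i386: start right after RAM */
	/* x86-64: never start below the 4 GiB boundary */
	return mem_end > FOUR_GIB ? mem_end : FOUR_GIB;
}
```

For the domU in the log above (RAM ends at 0x26000000), the 64-bit case
yields 4 GiB while the 32-bit case yields 0x26000000.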

Regarding the comment for this patch, it should be mentioned that without this
patch e820_end_of_low_ram_pfn() is not broken. It is simply not called.

Last but not least, I found that memory sizes up to and including exactly 1 GiB, and
exactly 2 GiB and 3 GiB (maybe higher, i.e. 4 GiB, 5 GiB, ...; I was not able to test
them because I do not have sufficient memory) are magic. It means that if memory
is set to one of those sizes everything works fine (without 4b239f458c229de044d6905c2b0f9fe16ed9e01e
and 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 applied). It means that domU
should be tested with sizes which are neither a power of two nor a multiple of one.

> The reason is that at some point init_memory_mapping is going to reach
> the pagetable pages area and map those pages too (mapping them as normal
> memory that falls in the range of addresses passed to init_memory_mapping
> as argument). Some of those pages are already pagetable pages (they are
> in the range pgt_buf_start-pgt_buf_end) therefore they are going to be
> mapped RO and everything is fine.
> Some of these pages are not pagetable pages yet (they fall in the range
> pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> are going to be mapped RW. When these pages become pagetable pages and
> are hooked into the pagetable, xen will find that the guest has already
> a RW mapping of them somewhere and fail the operation.
> The reason Xen requires pagetables to be RO is that the hypervisor needs
> to verify that the pagetables are valid before using them. The validation
> operations are called "pinning" (more details in arch/x86/xen/mmu.c).
>
> In order to fix the issue we mark all the pages in the entire range
> pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> init_memory_mapping. Hence the kernel is going to crash as soon as one
> of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> ranges are RO).
>
> For this reason, this function is introduced which is called _after_
> the init_memory_mapping has completed (in a perfect world we would
> call this function from init_memory_mapping, but lets ignore that).
>
> Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> end,top] have all changed to new values (b/c another init_memory_mapping
> is called). Hence, the first time we enter this function, we save
> away the pgt_buf_start value and update the pgt_buf_[end,top].
>
> When we detect that the "old" pgt_buf_start through pgt_buf_end
> PFNs have been reserved (so memblock_x86_reserve_range has been called),
> we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
>
> And then we update those "old" pgt_buf_[end|top] with the new ones
> so that we can redo this on the next pagetable.
>
> Reviewed-by: Jeremy Fitzhardinge <[email protected]>
> [v1: Updated with Jeremy's comments]
> [v2: Added the crash output]
> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> ---
> arch/x86/xen/mmu.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 123 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index aef7af9..1bca25f 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1463,6 +1463,119 @@ static int xen_pgd_alloc(struct mm_struct *mm)
> return ret;
> }
>
> +#ifdef CONFIG_X86_64
> +static __initdata u64 __last_pgt_set_rw = 0;
> +static __initdata u64 __pgt_buf_start = 0;
> +static __initdata u64 __pgt_buf_end = 0;
> +static __initdata u64 __pgt_buf_top = 0;

Please look into include/linux/init.h for proper
usage of __init macros. It should be changed to

static u64 __last_pgt_set_rw __initdata = 0;
...
...

Additionally,

static const struct pv_mmu_ops xen_mmu_ops __initdata = {

should be changed to:

static const struct pv_mmu_ops xen_mmu_ops __initconst = {

It is not in your patch, however, it conflicts
with your definitions.
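
As a rough illustration of the placement described in include/linux/init.h
(annotation between the variable name and the equal sign, and a separate
const variant for read-only data), here is a user-space sketch. The
demo_initdata/demo_initconst macros, the section names and all demo_*
identifiers are made up; they only stand in for __initdata/__initconst and
assume GCC or Clang on an ELF target. The reason for keeping the two apart
is that GCC can report a "section type conflict" when const and non-const
objects are placed in the same named section.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel macros; the section names are invented so
 * that this builds as an ordinary user-space program. */
#define demo_initdata  __attribute__((__section__("demo_init_data")))
#define demo_initconst __attribute__((__section__("demo_init_rodata")))

struct demo_ops {
	int (*probe)(void);
};

int demo_probe(void)
{
	return 1;
}

/* Writable init data: annotation sits between the name and the '='. */
static uint64_t demo_last_pgt_set_rw demo_initdata = 0;

/* Read-only init data uses the const variant and its own section. */
static const struct demo_ops demo_mmu_ops demo_initconst = {
	.probe = demo_probe,
};

/* Modify the writable object to show it really is writable. */
uint64_t demo_bump(void)
{
	return ++demo_last_pgt_set_rw;
}

/* Read through the const object. */
int demo_call_probe(void)
{
	return demo_mmu_ops.probe();
}
```

Moving demo_mmu_ops into demo_init_data next to the writable variable is
the kind of mix that the __initconst change above is meant to avoid.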

> +/*
> + * As a consequence of the commit:
> + *
> + * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> + * Author: Yinghai Lu <[email protected]>
> + * Date: Fri Dec 17 16:58:28 2010 -0800
> + *
> + * x86-64, mm: Put early page table high
> + *
> + * at some point init_memory_mapping is going to reach the pagetable pages
> + * area and map those pages too (mapping them as normal memory that falls
> + * in the range of addresses passed to init_memory_mapping as argument).
> + * Some of those pages are already pagetable pages (they are in the range
> + * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
> + * everything is fine.
> + * Some of these pages are not pagetable pages yet (they fall in the range
> + * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> + * are going to be mapped RW. When these pages become pagetable pages and
> + * are hooked into the pagetable, xen will find that the guest has already
> + * a RW mapping of them somewhere and fail the operation.
> + * The reason Xen requires pagetables to be RO is that the hypervisor needs
> + * to verify that the pagetables are valid before using them. The validation
> + * operations are called "pinning".
> + *
> + * In order to fix the issue we mark all the pages in the entire range
> + * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> + * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> + * init_memory_mapping. Hence the kernel is going to crash as soon as one
> + * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> + * ranges are RO).
> + *
> + * For this reason, 'mark_rw_past_pgt' is introduced which is called _after_
> + * the init_memory_mapping has completed (in a perfect world we would
> + * call this function from init_memory_mapping, but lets ignore that).
> + *
> + * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> + * end,top] have all changed to new values (b/c init_memory_mapping
> + * is called and setting up another new page-table). Hence, the first time
> + * we enter this function, we save away the pgt_buf_start value and update
> + * the pgt_buf_[end,top].
> + *
> + * When we detect that the "old" pgt_buf_start through pgt_buf_end
> + * PFNs have been reserved (so memblock_x86_reserve_range has been called),
> + * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
> + *
> + * And then we update those "old" pgt_buf_[end|top] with the new ones
> + * so that we can redo this on the next pagetable.
> + */
> +static __init void mark_rw_past_pgt(void) {

Please look into include/linux/init.h. I found many more similar
mistakes in the current Xen code. I will prepare a relevant patch
shortly.

> + if (pgt_buf_end > pgt_buf_start) {
> + u64 addr, size;
> +
> + /* Save it away. */
> + if (!__pgt_buf_start) {
> + __pgt_buf_start = pgt_buf_start;
> + __pgt_buf_end = pgt_buf_end;
> + __pgt_buf_top = pgt_buf_top;
> + return;
> + }
> + /* If we get the range that starts at __pgt_buf_end that means
> + * the range is reserved, and that in 'init_memory_mapping'
> + * the 'memblock_x86_reserve_range' has been called with the
> + * outdated __pgt_buf_start, __pgt_buf_end (the "new"
> + * pgt_buf_[start|end|top] refer now to a new pagetable.
> + * Note: we are called _after_ the pgt_buf_[..] have been
> + * updated.*/
> +
> + addr = memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start),
> + &size, PAGE_SIZE);
> +
> + /* Still not reserved, meaning 'memblock_x86_reserve_range'
> + * hasn't been called yet. Update the _end and _top.*/
> + if (addr == PFN_PHYS(__pgt_buf_start)) {
> + __pgt_buf_end = pgt_buf_end;
> + __pgt_buf_top = pgt_buf_top;
> + return;
> + }
> +
> + /* OK, the area is reserved, meaning it is time for us to
> + * set RW for the old end->top PFNs. */
> +
> + /* ..unless we had already done this. */
> + if (__pgt_buf_end == __last_pgt_set_rw)
> + return;
> +
> + addr = PFN_PHYS(__pgt_buf_end);
> +
> + /* set as RW the rest */
> + printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
> + PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top));
> +
> + while (addr < PFN_PHYS(__pgt_buf_top)) {
> + make_lowmem_page_readwrite(__va(addr));
> + addr += PAGE_SIZE;
> + }
> + /* And update everything so that we are ready for the next
> + * pagetable (the one created for regions past 4GB) */
> + __last_pgt_set_rw = __pgt_buf_end;
> + __pgt_buf_start = pgt_buf_start;
> + __pgt_buf_end = pgt_buf_end;
> + __pgt_buf_top = pgt_buf_top;
> + }
> + return;

I think that this return is superfluous.

> +}
> +#else
> +static __init void mark_rw_past_pgt(void) { }

Ditto.

> +#endif
> static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
> {
> #ifdef CONFIG_X86_64
> @@ -1489,6 +1602,14 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
> unsigned long pfn = pte_pfn(pte);
>
> /*
> + * A bit of optimization. We do not need to call the workaround
> + * when xen_set_pte_init is called with a PTE with 0 as PFN.
> + * That is b/c the pagetable at that point are just being populated
> + * with empty values and we can save some cycles by not calling
> + * the 'memblock' code.*/
> + if (pfn)
> + mark_rw_past_pgt();
> + /*
> * If the new pfn is within the range of the newly allocated
> * kernel pagetable, and it isn't being mapped into an
> * early_ioremap fixmap slot as a freshly allocated page, make sure
> @@ -1997,6 +2118,8 @@ __init void xen_ident_map_ISA(void)
>
> static __init void xen_post_allocator_init(void)

Ditto.

Daniel

2011-05-03 13:19:14

by Stefano Stabellini

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Mon, 2 May 2011, Konrad Rzeszutek Wilk wrote:
> As a consequence of the commit:
>
> commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> Author: Yinghai Lu <[email protected]>
> Date: Fri Dec 17 16:58:28 2010 -0800
>
> x86-64, mm: Put early page table high
>
> it causes the Linux kernel to crash under Xen:
>
> mapping kernel into physical memory
> Xen: setup ISA identity maps
> about to get started...
> (XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn b1d89 (pfn bacf7)
> (XEN) mm.c:3027:d0 Error while pinning mfn b1d89
> (XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
> (XEN) domain_crash_sync called from entry.S
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> ...
>
> The reason is that at some point init_memory_mapping is going to reach
> the pagetable pages area and map those pages too (mapping them as normal
> memory that falls in the range of addresses passed to init_memory_mapping
> as argument). Some of those pages are already pagetable pages (they are
> in the range pgt_buf_start-pgt_buf_end) therefore they are going to be
> mapped RO and everything is fine.
> Some of these pages are not pagetable pages yet (they fall in the range
> pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> are going to be mapped RW. When these pages become pagetable pages and
> are hooked into the pagetable, xen will find that the guest has already
> a RW mapping of them somewhere and fail the operation.
> The reason Xen requires pagetables to be RO is that the hypervisor needs
> to verify that the pagetables are valid before using them. The validation
> operations are called "pinning" (more details in arch/x86/xen/mmu.c).
>
> In order to fix the issue we mark all the pages in the entire range
> pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> init_memory_mapping. Hence the kernel is going to crash as soon as one
> of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> ranges are RO).
>
> For this reason, this function is introduced which is called _after_
> the init_memory_mapping has completed (in a perfect world we would
> call this function from init_memory_mapping, but lets ignore that).
>
> Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> end,top] have all changed to new values (b/c another init_memory_mapping
> is called). Hence, the first time we enter this function, we save
> away the pgt_buf_start value and update the pgt_buf_[end,top].
>
> When we detect that the "old" pgt_buf_start through pgt_buf_end
> PFNs have been reserved (so memblock_x86_reserve_range has been called),
> we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
>
> And then we update those "old" pgt_buf_[end|top] with the new ones
> so that we can redo this on the next pagetable.
>
> Reviewed-by: Jeremy Fitzhardinge <[email protected]>
> [v1: Updated with Jeremy's comments]
> [v2: Added the crash output]
> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> ---
> arch/x86/xen/mmu.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 123 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index aef7af9..1bca25f 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1463,6 +1463,119 @@ static int xen_pgd_alloc(struct mm_struct *mm)
> return ret;
> }
>
> +#ifdef CONFIG_X86_64
> +static __initdata u64 __last_pgt_set_rw = 0;
> +static __initdata u64 __pgt_buf_start = 0;
> +static __initdata u64 __pgt_buf_end = 0;
> +static __initdata u64 __pgt_buf_top = 0;
> +/*
> + * As a consequence of the commit:
> + *
> + * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> + * Author: Yinghai Lu <[email protected]>
> + * Date: Fri Dec 17 16:58:28 2010 -0800
> + *
> + * x86-64, mm: Put early page table high
> + *
> + * at some point init_memory_mapping is going to reach the pagetable pages
> + * area and map those pages too (mapping them as normal memory that falls
> + * in the range of addresses passed to init_memory_mapping as argument).
> + * Some of those pages are already pagetable pages (they are in the range
> + * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
> + * everything is fine.
> + * Some of these pages are not pagetable pages yet (they fall in the range
> + * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> + * are going to be mapped RW. When these pages become pagetable pages and
> + * are hooked into the pagetable, xen will find that the guest has already
> + * a RW mapping of them somewhere and fail the operation.
> + * The reason Xen requires pagetables to be RO is that the hypervisor needs
> + * to verify that the pagetables are valid before using them. The validation
> + * operations are called "pinning".
> + *
> + * In order to fix the issue we mark all the pages in the entire range
> + * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> + * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> + * init_memory_mapping. Hence the kernel is going to crash as soon as one
> + * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> + * ranges are RO).
> + *
> + * For this reason, 'mark_rw_past_pgt' is introduced which is called _after_
> + * the init_memory_mapping has completed (in a perfect world we would
> + * call this function from init_memory_mapping, but lets ignore that).
> + *
> + * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> + * end,top] have all changed to new values (b/c init_memory_mapping
> + * is called and setting up another new page-table). Hence, the first time
> + * we enter this function, we save away the pgt_buf_start value and update
> + * the pgt_buf_[end,top].
> + *
> + * When we detect that the "old" pgt_buf_start through pgt_buf_end
> + * PFNs have been reserved (so memblock_x86_reserve_range has been called),
> + * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
> + *
> + * And then we update those "old" pgt_buf_[end|top] with the new ones
> + * so that we can redo this on the next pagetable.
> + */
> +static __init void mark_rw_past_pgt(void) {
> +
> + if (pgt_buf_end > pgt_buf_start) {
> + u64 addr, size;
> +
> + /* Save it away. */
> + if (!__pgt_buf_start) {
> + __pgt_buf_start = pgt_buf_start;
> + __pgt_buf_end = pgt_buf_end;
> + __pgt_buf_top = pgt_buf_top;
> + return;
> + }
> + /* If we get the range that starts at __pgt_buf_end that means
> + * the range is reserved, and that in 'init_memory_mapping'
> + * the 'memblock_x86_reserve_range' has been called with the
> + * outdated __pgt_buf_start, __pgt_buf_end (the "new"
> + * pgt_buf_[start|end|top] refer now to a new pagetable.
> + * Note: we are called _after_ the pgt_buf_[..] have been
> + * updated.*/
> +
> + addr = memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start),
> + &size, PAGE_SIZE);
> +
> + /* Still not reserved, meaning 'memblock_x86_reserve_range'
> + * hasn't been called yet. Update the _end and _top.*/
> + if (addr == PFN_PHYS(__pgt_buf_start)) {
> + __pgt_buf_end = pgt_buf_end;
> + __pgt_buf_top = pgt_buf_top;
> + return;
> + }
> +
> + /* OK, the area is reserved, meaning it is time for us to
> + * set RW for the old end->top PFNs. */
> +
> + /* ..unless we had already done this. */
> + if (__pgt_buf_end == __last_pgt_set_rw)
> + return;
> +
> + addr = PFN_PHYS(__pgt_buf_end);
> +
> + /* set as RW the rest */
> + printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
> + PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top));
> +
> + while (addr < PFN_PHYS(__pgt_buf_top)) {
> + make_lowmem_page_readwrite(__va(addr));
> + addr += PAGE_SIZE;
> + }
> + /* And update everything so that we are ready for the next
> + * pagetable (the one created for regions past 4GB) */
> + __last_pgt_set_rw = __pgt_buf_end;
> + __pgt_buf_start = pgt_buf_start;
> + __pgt_buf_end = pgt_buf_end;
> + __pgt_buf_top = pgt_buf_top;
> + }
> + return;
> +}
> +#else
> +static __init void mark_rw_past_pgt(void) { }
> +#endif
> static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
> {
> #ifdef CONFIG_X86_64
> @@ -1489,6 +1602,14 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
> unsigned long pfn = pte_pfn(pte);
>
> /*
> + * A bit of optimization. We do not need to call the workaround
> + * when xen_set_pte_init is called with a PTE with 0 as PFN.
> + * That is b/c the pagetable at that point are just being populated
> + * with empty values and we can save some cycles by not calling
> + * the 'memblock' code.*/
> + if (pfn)
> + mark_rw_past_pgt();
> + /*
> * If the new pfn is within the range of the newly allocated
> * kernel pagetable, and it isn't being mapped into an
> * early_ioremap fixmap slot as a freshly allocated page, make sure
> @@ -1997,6 +2118,8 @@ __init void xen_ident_map_ISA(void)
>
> static __init void xen_post_allocator_init(void)
> {
> + mark_rw_past_pgt();
> +
> #ifdef CONFIG_XEN_DEBUG
> pv_mmu_ops.make_pte = PV_CALLEE_SAVE(xen_make_pte_debug);
> #endif



Unless I am missing something, there is no guarantee that somebody else
won't use memory in the pgt_buf_end-pgt_buf_top range while the range is
still RO, before mark_rw_past_pgt() is called again. If so, this code
works by coincidence; that is the reason why I didn't try to reuse the
pagetable_setup_done or the pagetable_setup_start hooks.

In any case this code looks very ugly and fragile. Do we really want to
add a workaround as bad as this one rather than reverting the original
commit? I think it creates a bad precedent.
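
To make the concern about the RO window concrete, here is a toy user-space
model (not kernel code, and not how Xen tracks permissions). The page array,
struct demo_pgt_buf and every demo_* name are made up for illustration; only
the three boundaries mirror pgt_buf_start, pgt_buf_end and pgt_buf_top.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NPAGES 16

/* true means the page is mapped read-write in this toy model. */
static bool demo_rw[DEMO_NPAGES];

/* Hypothetical stand-ins for pgt_buf_start / pgt_buf_end / pgt_buf_top. */
struct demo_pgt_buf {
	size_t start, end, top;
};

/* Start with every page writable. */
void demo_reset(void)
{
	for (size_t pfn = 0; pfn < DEMO_NPAGES; pfn++)
		demo_rw[pfn] = true;
}

/* While the page table is built, everything in [start, top) is
 * mapped RO, whether or not it ends up holding a page table. */
void demo_map_ro(const struct demo_pgt_buf *b)
{
	for (size_t pfn = b->start; pfn < b->top; pfn++)
		demo_rw[pfn] = false;
}

/* Later, the unused tail [end, top) is made RW again; [start, end)
 * holds real page tables and stays RO. */
void demo_mark_rw_past_pgt(const struct demo_pgt_buf *b)
{
	for (size_t pfn = b->end; pfn < b->top; pfn++)
		demo_rw[pfn] = true;
}

/* A write by some other early user only succeeds on RW pages. */
bool demo_write_ok(size_t pfn)
{
	return pfn < DEMO_NPAGES && demo_rw[pfn];
}
```

With start = 4, end = 7, top = 12, a write to page 8 fails after
demo_map_ro() and only succeeds once demo_mark_rw_past_pgt() has run; any
user of that page in between hits the window described above.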

2011-05-03 15:13:18

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Tue, May 03, 2011 at 02:55:27AM +0200, Daniel Kiper wrote:
> On Mon, May 02, 2011 at 01:22:21PM -0400, Konrad Rzeszutek Wilk wrote:
> > As a consequence of the commit:
> >
> > commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > Author: Yinghai Lu <[email protected]>
> > Date: Fri Dec 17 16:58:28 2010 -0800
> >
> > x86-64, mm: Put early page table high
> >
> > it causes the Linux kernel to crash under Xen:
> >
> > mapping kernel into physical memory
> > Xen: setup ISA identity maps
> > about to get started...
> > (XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn b1d89 (pfn bacf7)
> > (XEN) mm.c:3027:d0 Error while pinning mfn b1d89
> > (XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
> > (XEN) domain_crash_sync called from entry.S
> > (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> > ...
>
> I was hit by this bug when I was working on memory hotplug.
> After some investigation I found the above-mentioned patch
> to be the culprit, and later I discovered that you are working on that
> issue. I have tested your patch and discovered some issues with it.
> First of all, it has compilation issues on gcc version 4.1.2 20061115
> (prerelease) (Debian 4.1.1-21). Details below.
>
> Additionally, I think that your patch does not work as you expected.
> I found that git commit 24bdb0b62cc82120924762ae6bc85afc8c3f2b26
> (xen: do not create the extra e820 region at an addr lower than 4G)
> does this work (to some extent). When this patch is removed, domU
> crashes with the following error:

Which one is "this patch"? 24bdb0b62cc82120924762ae6bc85afc8c3f2b26?

>
> (early) Linux version 2.6.39-rc5-x86_64.xenU.all.r0+ (root@dev-00) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #5 SMP Tue May 3 01:43:26 CEST 2011
> (early) Command line: root=/dev/xvda debug earlyprintk=xen noapic nolapic console=hvc0
> (early) ACPI in unprivileged domain disabled
> (early) released 0 pages of unused memory
> (early) Set 0 page(s) to 1-1 mapping.
> (early) BIOS-provided physical RAM map:
> (early) Xen: 0000000000000000 - 00000000000a0000 (usable)
> (early) Xen: 00000000000a0000 - 0000000000100000 (reserved)
> (early) Xen: 0000000000100000 - 0000000026000000 (usable)
> (early) bootconsole [xenboot0] enabled
> (early) NX (Execute Disable) protection: active
> (early) DMI not present or invalid.
> (early) e820 update range: 0000000000000000 - 0000000000010000 (early) (usable)(early) ==> (early) (reserved)(early)
> (early) e820 remove range: 00000000000a0000 - 0000000000100000 (early) (usable)(early)
> (early) No AGP bridge found
> (early) last_pfn = 0x26000 max_arch_pfn = 0x400000000
> (early) initial memory mapped : 0 - 01693000
> (early) Base memory trampoline at [ffff88000009e000] 9e000 size 8192
> (early) init_memory_mapping: 0000000000000000-0000000026000000
> (early) 0000000000 - 0026000000 page 4k
> (early) kernel direct mapping tables up to 26000000 @ 256ce000-25800000
> (early) BUG: unable to handle kernel (early) NULL pointer dereference(early) at (null)
> (early) IP:(early) [<ffffffff814a33f2>] find_range_array+0x4e/0x57
> (early) PGD 0 (early)
> (early) Oops: 0003 [#1] (early) SMP (early)
> (early) last sysfs file:
> (early) CPU 0 (early)
> (early) Modules linked in:(early)
> (early)
> (early) Pid: 0, comm: swapper Not tainted 2.6.39-rc5-x86_64.xenU.all.r0+ #5(early) (early)
> (early) RIP: e030:[<ffffffff814a33f2>] (early) [<ffffffff814a33f2>] find_range_array+0x4e/0x57
> (early) RSP: e02b:ffffffff81427e58 EFLAGS: 00010046
> (early) RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 00000000000000e0
> (early) RDX: ffff8800257fff20 RSI: 00000000257fff20 RDI: ffff8800257fff20
> (early) RBP: ffffffff81427e68 R08: 0000000000000005 R09: 0000000000000050
> (early) R10: 0000000000000005 R11: 0000000025800000 R12: ffffffff814bd000
> (early) R13: 0000000000000000 R14: 000000000000000e R15: 0000000000001000
> (early) FS: 0000000000000000(0000) GS:ffffffff8147f000(0000) knlGS:0000000000000000
> (early) CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> (early) CR2: 0000000000000000 CR3: 0000000001441000 CR4: 0000000000002660
> (early) DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> (early) DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> (early) Process swapper (pid: 0, threadinfo ffffffff81426000, task ffffffff81449020)
> (early) Stack:
> (early) ffffffff81427e88(early) 0000000000000000(early) ffffffff81427ea8(early) ffffffff814a343d(early)
> (early) 0000000000000100(early) 0000000026000000(early) ffffffff814bd000(early) ffffffffffffffff(early)
> (early) 0000000000000000(early) 0000000000000000(early) ffffffff81427eb8(early) ffffffff814a3577(early)
> (early) Call Trace:
> (early) [<ffffffff814a343d>] __memblock_x86_memory_in_range+0x42/0x171
> (early) [<ffffffff814a3577>] memblock_x86_memory_in_range+0xb/0xd
> (early) [<ffffffff81497348>] memblock_find_dma_reserve+0x15/0x3b
> (early) [<ffffffff81496b50>] setup_arch+0x721/0x7e5
> (early) [<ffffffff810065fd>] ? __raw_callee_save_xen_irq_disable+0x11/0x1e
> (early) [<ffffffff81493955>] start_kernel+0x8a/0x2db
> (early) [<ffffffff81493299>] x86_64_start_reservations+0x84/0x88
> (early) [<ffffffff814947d9>] xen_start_kernel+0x3e1/0x3e8
> (early) Code: (early) 66 (early) 00 (early) 00 (early) 48 (early) 85 (early) c0 (early) 75 (early) 0c (early) 48 (early) c7 (early) c7 (early) 86 (early) 80 (early) 3a (early) 81 (early) e8 (early) 55 (early) 41 (early) b9 (early) ff (early) 48 (early) bf (early) 00 (early) 00 (early) 00 (early) 00 (early) 00 (early) 88 (early) ff (early) ff (early) 48 (early) 89 (early) d9 (early) 48 (early) 8d (early) 14 (early) 38 (early) 31 (early) c0 (early) fc (early) 48 (early) 89 (early) d7 (early) <f3> (early) aa (early) 48 (early) 89 (early) d0 (early) 5f (early) 5b (early) c9 (early) c3 (early) 55 (early) 48 (early) 89 (early) e5 (early) 41 (early) 57 (early) 49 (early) 89 (early) f7 (early) 49 (early) c1 (early) ef (early)
> (early) RIP (early) [<ffffffff814a33f2>] find_range_array+0x4e/0x57
> (early) RSP <ffffffff81427e58>
> (early) CR2: 0000000000000000
> (early) ---[ end trace 4eaa2a86a8e2da22 ]---
> (early) Kernel panic - not syncing: Attempted to kill the idle task!
> (early) Pid: 0, comm: swapper Tainted: G D 2.6.39-rc5-x86_64.xenU.all.r0+ #5
> (early) Call Trace:
> (early) [<ffffffff810375ed>] panic+0xbd/0x1c7
> (early) [<ffffffff810383c7>] ? printk+0x67/0x69
> (early) [<ffffffff811fd7b0>] ? account+0xe1/0xf0
> (early) [<ffffffff8103a641>] do_exit+0xb4/0x676
> (early) [<ffffffff81037b53>] ? spin_unlock_irqrestore+0x9/0xb
> (early) [<ffffffff81038c24>] ? kmsg_dump+0x4a/0xd9
> (early) [<ffffffff8100d11d>] oops_end+0xc1/0xc9
> (early) [<ffffffff810252b1>] no_context+0x1f5/0x204
> (early) [<ffffffff81025448>] __bad_area_nosemaphore+0x188/0x1ab
> (early) [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
> (early) [<ffffffff810254e1>] bad_area_nosemaphore+0xe/0x10
> (early) [<ffffffff81025991>] do_page_fault+0x18c/0x337
> (early) [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
> (early) [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
> (early) [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
> (early) [<ffffffff812eebd5>] page_fault+0x25/0x30
> (early) [<ffffffff814a33f2>] ? find_range_array+0x4e/0x57
> (early) [<ffffffff814a33ca>] ? find_range_array+0x26/0x57
> (early) [<ffffffff814a343d>] __memblock_x86_memory_in_range+0x42/0x171
> (early) [<ffffffff814a3577>] memblock_x86_memory_in_range+0xb/0xd
> (early) [<ffffffff81497348>] memblock_find_dma_reserve+0x15/0x3b
> (early) [<ffffffff81496b50>] setup_arch+0x721/0x7e5
> (early) [<ffffffff810065fd>] ? __raw_callee_save_xen_irq_disable+0x11/0x1e
> (early) [<ffffffff81493955>] start_kernel+0x8a/0x2db
> (early) [<ffffffff81493299>] x86_64_start_reservations+0x84/0x88
> (early) [<ffffffff814947d9>] xen_start_kernel+0x3e1/0x3e8
>
> I think that (Stefano please confirm or not) this patch was prepared
> as a workaround for similar issues. However, I do not like this patch
> because on systems with a small amount of memory it leaves a huge (to some
> extent) hole between max_low_pfn and 4G. Additionally, it affects
> memory hotplug a bit because it allocates memory starting from current
> max_mfn. It also breaks memory hotplug on i386 (and maybe other
> things as well; however, I could not confirm that). If it stays for some
> reason, it should be amended in the following way:
>
> #ifdef CONFIG_X86_32
> xen_extra_mem_start = mem_end;
> #else
> xen_extra_mem_start = max((1ULL << 32), mem_end);
> #endif
>
> Regarding the comment for this patch, it should be mentioned that without this
> patch e820_end_of_low_ram_pfn() is not broken. It is simply not called.
>
> Last but not least, I found that memory sizes below and including exactly 1 GiB,
> as well as exactly 2 GiB and 3 GiB (maybe higher, i.e. 4 GiB, 5 GiB, ...; I was
> not able to test them because I do not have sufficient memory), are magic. It means
> that if memory is set to those sizes everything works fine (without 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> and 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 applied). It means that domU
> should be tested with sizes which are neither a power of two nor a multiple of such a value.

Hmm, I thought I did test 1500M.

> > +#ifdef CONFIG_X86_64
> > +static __initdata u64 __last_pgt_set_rw = 0;
> > +static __initdata u64 __pgt_buf_start = 0;
> > +static __initdata u64 __pgt_buf_end = 0;
> > +static __initdata u64 __pgt_buf_top = 0;
>
> Please look into include/linux/init.h for proper
> usage of __init macros. It should be changed to
>
> static u64 __last_pgt_set_rw __initdata = 0;
> ...
> ...
>
> Additionally,
>
> static const struct pv_mmu_ops xen_mmu_ops __initdata = {
>
> should be changed to:
>
> static const struct pv_mmu_ops xen_mmu_ops __initconst = {
>
> It is not in your patch, however, it conflicts
> with your definitions.

OK. I am not that worried about the changes this patch brings. I hope
to have it removed in the 2.6.40-ish time frame.
>
> > +/*
> > + * As a consequence of the commit:
> > + *
> > + * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > + * Author: Yinghai Lu <[email protected]>
> > + * Date: Fri Dec 17 16:58:28 2010 -0800
> > + *
> > + * x86-64, mm: Put early page table high
> > + *
> > + * at some point init_memory_mapping is going to reach the pagetable pages
> > + * area and map those pages too (mapping them as normal memory that falls
> > + * in the range of addresses passed to init_memory_mapping as argument).
> > + * Some of those pages are already pagetable pages (they are in the range
> > + * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
> > + * everything is fine.
> > + * Some of these pages are not pagetable pages yet (they fall in the range
> > + * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> > + * are going to be mapped RW. When these pages become pagetable pages and
> > + * are hooked into the pagetable, xen will find that the guest has already
> > + * a RW mapping of them somewhere and fail the operation.
> > + * The reason Xen requires pagetables to be RO is that the hypervisor needs
> > + * to verify that the pagetables are valid before using them. The validation
> > + * operations are called "pinning".
> > + *
> > + * In order to fix the issue we mark all the pages in the entire range
> > + * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> > + * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> > + * init_memory_mapping. Hence the kernel is going to crash as soon as one
> > + * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> > + * ranges are RO).
> > + *
> > + * For this reason, 'mark_rw_past_pgt' is introduced which is called _after_
> > + * the init_memory_mapping has completed (in a perfect world we would
> > + * call this function from init_memory_mapping, but lets ignore that).
> > + *
> > + * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> > + * end,top] have all changed to new values (b/c init_memory_mapping
> > + * is called and setting up another new page-table). Hence, the first time
> > + * we enter this function, we save away the pgt_buf_start value and update
> > + * the pgt_buf_[end,top].
> > + *
> > + * When we detect that the "old" pgt_buf_start through pgt_buf_end
> > + * PFNs have been reserved (so memblock_x86_reserve_range has been called),
> > + * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
> > + *
> > + * And then we update those "old" pgt_buf_[end|top] with the new ones
> > + * so that we can redo this on the next pagetable.
> > + */
> > +static __init void mark_rw_past_pgt(void) {
>
> Please look into include/linux/init.h. I found many more similar
> mistakes in the current Xen code. I will prepare a relevant patch
> shortly.

Excellent. Looking forward to them.
..
> > + return;
>
> I think that this return is superfluous.

<nods>
>
> > +}
> > +#else
> > +static __init void mark_rw_past_pgt(void) { }
>
> Ditto.

OK. I am not that worried; I think we will get rid of this in 2.6.40 anyhow
(or so I hope).
>
> > +#endif
> > static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
> > {
> > #ifdef CONFIG_X86_64
> > @@ -1489,6 +1602,14 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
> > unsigned long pfn = pte_pfn(pte);
> >
> > /*
> > + * A bit of optimization. We do not need to call the workaround
> > + * when xen_set_pte_init is called with a PTE with 0 as PFN.
> > + * That is b/c the pagetable at that point are just being populated
> > + * with empty values and we can save some cycles by not calling
> > + * the 'memblock' code.*/
> > + if (pfn)
> > + mark_rw_past_pgt();
> > + /*
> > * If the new pfn is within the range of the newly allocated
> > * kernel pagetable, and it isn't being mapped into an
> > * early_ioremap fixmap slot as a freshly allocated page, make sure
> > @@ -1997,6 +2118,8 @@ __init void xen_ident_map_ISA(void)
> >
> > static __init void xen_post_allocator_init(void)
>
> Ditto.

<nods> That looks like a candidate for another patch.
>
> Daniel

2011-05-03 15:28:07

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Tue, May 03, 2011 at 02:20:25PM +0100, Stefano Stabellini wrote:
> On Mon, 2 May 2011, Konrad Rzeszutek Wilk wrote:
> > As a consequence of the commit:
> >
> > commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > Author: Yinghai Lu <[email protected]>
> > Date: Fri Dec 17 16:58:28 2010 -0800
> >
> > x86-64, mm: Put early page table high
> >
> > it causes the Linux kernel to crash under Xen:
> >
> > mapping kernel into physical memory
> > Xen: setup ISA identity maps
> > about to get started...
> > (XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn b1d89 (pfn bacf7)
> > (XEN) mm.c:3027:d0 Error while pinning mfn b1d89
> > (XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
> > (XEN) domain_crash_sync called from entry.S
> > (XEN) Domain 0 (vcpu#0) crashed on cpu#0:

.. snip..
>
>
> Unless I am missing something there is no guarantee that somebody else
> won't use memory in the pgt_buf_end-pgt_buf_top range when the range is
> still RO before mark_rw_past_pgt() is called again. If so this code
> works by coincidence, that is the reason why I didn't try to reuse the
> pagetable_setup_done or the pagetable_setup_start hooks.

It looks like nobody touches those pages during the sequence of events between
the creation of the initial pagetable and the point where we get to the post-allocator
(also, one of them - the 0-4GB pagetable - has already been made RW). But if you do find it
dying/crashing, please say so, so that we can revert it and use the generic workaround
or revert Yinghai's patch.

> In any case this code looks very ugly and fragile, do we really want to
> add a workaround as bad as this one rather than reverting the original
> commit? I think it creates a bad precedent.

This is the second one. We had the swapper_pg_dir/initial_page_table dance in x86_32
where we mark it RO, then RW (after a cr3 load) then RO and then back to RW.
(Details escape me, but it was some form of that)

But I wonder how many workarounds the generic code has because of us?

I think we need to set up some form of meeting with the x86 maintainers
to figure out some better way of handling this.

2011-05-03 19:52:15

by Daniel Kiper

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Tue, May 03, 2011 at 11:12:06AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, May 03, 2011 at 02:55:27AM +0200, Daniel Kiper wrote:
> > On Mon, May 02, 2011 at 01:22:21PM -0400, Konrad Rzeszutek Wilk wrote:
> > > As a consequence of the commit:
> > >
> > > commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > > Author: Yinghai Lu <[email protected]>
> > > Date: Fri Dec 17 16:58:28 2010 -0800
> > >
> > > x86-64, mm: Put early page table high
> > >
> > > it causes the Linux kernel to crash under Xen:
> > >
> > > mapping kernel into physical memory
> > > Xen: setup ISA identity maps
> > > about to get started...
> > > (XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn b1d89 (pfn bacf7)
> > > (XEN) mm.c:3027:d0 Error while pinning mfn b1d89
> > > (XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
> > > (XEN) domain_crash_sync called from entry.S
> > > (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> > > ...
> >
> > I was hit by this bug when I was working on memory hotplug.
> > After some investigation I identified the above-mentioned patch
> > as the culprit, and later I discovered that you are working on that
> > issue. I have tested your patch and discovered some issues with it.
> > First of all, it has compilation issues on gcc version 4.1.2 20061115
> > (prerelease) (Debian 4.1.1-21). Details below.
> >
> > Additionally, I think that your patch does not work as you expected.
> > I found that git commit 24bdb0b62cc82120924762ae6bc85afc8c3f2b26
> > (xen: do not create the extra e820 region at an addr lower than 4G)
> > does this work (to some extent). When this patch is removed, domU
> > crashes with the following error:
>
> Which is "this patch" ? The 24bdb0b62cc82120924762ae6bc85afc8c3f2b26?

Yep.

[...]

> > I think that (Stefano please confirm or not) this patch was prepared
> > as workaround for similar issues. However, I do not like this patch
> > because on systems with small amount of memory it leaves huge (to some
> > extent) hole between max_low_pfn and 4G. Additionally, it affects
> > memory hotplug a bit because it allocates memory starting from current
> > max_mfn. It also breaks memory hotplug on i386 (maybe also other
> > things, however, I could not confirm that). If it stays for some
> > reason it should be amended in the following way:
> >
> > #ifdef CONFIG_X86_32
> > xen_extra_mem_start = mem_end;
> > #else
> > xen_extra_mem_start = max((1ULL << 32), mem_end);
> > #endif
> >
> > Regarding the comment for this patch, it should be mentioned that without this
> > patch e820_end_of_low_ram_pfn() is not broken. It is simply not called.
> >
> > Last but not least. I found that memory sizes below and including exactly 1 GiB and
> > exactly 2 GiB, 3 GiB (maybe higher, i.e. 4 GiB, 5 GiB, ...; I was not able to test
> > them because I do not have sufficient memory) are magic. It means that if memory
> > is set to those sizes everything works fine (without 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > and 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 applied). It means that domU
> > should be tested with sizes which are neither a power of two nor a multiple of that.
>
> Hmm, I thought I did test 1500M.

It does not work on my machine (24bdb0b62cc82120924762ae6bc85afc8c3f2b26
removed and 4b239f458c229de044d6905c2b0f9fe16ed9e01e applied).

Daniel

2011-05-04 18:59:22

by Daniel Kiper

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Tue, May 03, 2011 at 09:51:41PM +0200, Daniel Kiper wrote:
> On Tue, May 03, 2011 at 11:12:06AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Tue, May 03, 2011 at 02:55:27AM +0200, Daniel Kiper wrote:
> > > On Mon, May 02, 2011 at 01:22:21PM -0400, Konrad Rzeszutek Wilk wrote:

[...]

> > > I think that (Stefano please confirm or not) this patch was prepared
> > > as workaround for similar issues. However, I do not like this patch
> > > because on systems with small amount of memory it leaves huge (to some
> > > extent) hole between max_low_pfn and 4G. Additionally, it affects
> > > memory hotplug a bit because it allocates memory starting from current
> > > max_mfn. It also breaks memory hotplug on i386 (maybe also other
> > > things, however, I could not confirm that). If it stays for some
> > > reason it should be amended in the following way:
> > >
> > > #ifdef CONFIG_X86_32
> > > xen_extra_mem_start = mem_end;
> > > #else
> > > xen_extra_mem_start = max((1ULL << 32), mem_end);
> > > #endif
> > >
> > > Regarding the comment for this patch, it should be mentioned that without this
> > > patch e820_end_of_low_ram_pfn() is not broken. It is simply not called.
> > >
> > > Last but not least. I found that memory sizes below and including exactly 1 GiB and
> > > exactly 2 GiB, 3 GiB (maybe higher, i.e. 4 GiB, 5 GiB, ...; I was not able to test
> > > them because I do not have sufficient memory) are magic. It means that if memory
> > > is set to those sizes everything works fine (without 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > > and 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 applied). It means that domU
> > > should be tested with sizes which are neither a power of two nor a multiple of that.
> >
> > Hmm, I thought I did test 1500M.
>
> It does not work on my machine (24bdb0b62cc82120924762ae6bc85afc8c3f2b26
> removed and 4b239f458c229de044d6905c2b0f9fe16ed9e01e applied).

It does not work on my machine (x86_64) with Linux Kernel Ver. 2.6.39-rc6 without
git commit 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
e820 region at an addr lower than 4G). As I said earlier, the bug introduced by git
commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e (x86-64, mm: Put early page table
high) is probably hidden (repaired/worked around ???) by git commit
24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
e820 region at an addr lower than 4G).

Konrad, Stefano could you confirm that ??? If it is true
how could I help you in removing this bug ???

Daniel

2011-05-04 19:34:14

by Konrad Rzeszutek Wilk

Subject: Re: [Xen-devel] Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Wed, May 04, 2011 at 08:59:03PM +0200, Daniel Kiper wrote:
> On Tue, May 03, 2011 at 09:51:41PM +0200, Daniel Kiper wrote:
> > On Tue, May 03, 2011 at 11:12:06AM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Tue, May 03, 2011 at 02:55:27AM +0200, Daniel Kiper wrote:
> > > > On Mon, May 02, 2011 at 01:22:21PM -0400, Konrad Rzeszutek Wilk wrote:
>
> [...]
>
> > > > I think that (Stefano please confirm or not) this patch was prepared
> > > > as workaround for similar issues. However, I do not like this patch

It was actually to fix SandyBridge boxes. Their last E820 reserved
region was around fed40000 and then the RAM region started at
100000000. Which meant that we misinterpreted the gap (starting at fed40 mfn)
as the start of RAM.

> > > > because on systems with small amount of memory it leaves huge (to some
> > > > extent) hole between max_low_pfn and 4G. Additionally, it affects
> > > > memory hotplug a bit because it allocates memory starting from current
> > > > max_mfn. It also breaks memory hotplug on i386 (maybe also other
> > > > things, however, I could not confirm that). If it stays for some
> > > > reason it should be amended in the following way:
> > > >
> > > > #ifdef CONFIG_X86_32
> > > > xen_extra_mem_start = mem_end;
> > > > #else
> > > > xen_extra_mem_start = max((1ULL << 32), mem_end);
> > > > #endif
> > > >
> > > > Regarding the comment for this patch, it should be mentioned that without this
> > > > patch e820_end_of_low_ram_pfn() is not broken. It is simply not called.

Hmm. What is max_pfn set to?
Can you send the full dmesg of your guest?

> > > >
> > > > Last but not least. I found that memory sizes below and including exactly 1 GiB and
> > > > exactly 2 GiB, 3 GiB (maybe higher, i.e. 4 GiB, 5 GiB, ...; I was not able to test
> > > > them because I do not have sufficient memory) are magic. It means that if memory
> > > > is set to those sizes everything works fine (without 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > > > and 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 applied). It means that domU
> > > > should be tested with sizes which are neither a power of two nor a multiple of that.
> > >
> > > Hmm, I thought I did test 1500M.
> >
> > It does not work on my machine (24bdb0b62cc82120924762ae6bc85afc8c3f2b26
> > removed and 4b239f458c229de044d6905c2b0f9fe16ed9e01e applied).
>
> It does not work on my machine (x86_64) with Linux Kernel Ver. 2.6.39-rc6 without
> git commit 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
> e820 region at an addr lower than 4G). As I said earlier, the bug introduced by git
> commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e (x86-64, mm: Put early page table
> high) is probably hidden (repaired/worked around ???) by git commit
> 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
> e820 region at an addr lower than 4G).

There are a couple of things that have been going to fix "x86-64, mm: Put
early page table high" and also .. "cleanup highmem" (something) - which
has been plaguing us since 2.6.32 (and was the one you hit long time ago).

Anyhow, the setting of xen_extra_mem_start to 4GB or higher should
be reworked. Not sure yet how.
>
> Konrad, Stefano could you confirm that ??? If it is true
> how could I help you in removing this bug ???
>
> Daniel

2011-05-05 11:53:16

by Daniel Kiper

Subject: Re: [Xen-devel] Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Wed, May 04, 2011 at 03:33:53PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, May 04, 2011 at 08:59:03PM +0200, Daniel Kiper wrote:
> > On Tue, May 03, 2011 at 09:51:41PM +0200, Daniel Kiper wrote:
> > > On Tue, May 03, 2011 at 11:12:06AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > On Tue, May 03, 2011 at 02:55:27AM +0200, Daniel Kiper wrote:
> > > > > On Mon, May 02, 2011 at 01:22:21PM -0400, Konrad Rzeszutek Wilk wrote:
> >
> > [...]
> >
> > > > > I think that (Stefano please confirm or not) this patch was prepared
> > > > > as workaround for similar issues. However, I do not like this patch
>
> It was actually to fix SandyBridge boxes. Their last E820 reserved
> region was around fed40000 and then the RAM region started at
> 100000000. Which meant that we misinterpreted the gap (starting at fed40 mfn)
> as the start of RAM.

Thanks.

> > > > > because on systems with small amount of memory it leaves huge (to some
> > > > > extent) hole between max_low_pfn and 4G. Additionally, it affects
> > > > > memory hotplug a bit because it allocates memory starting from current
> > > > > max_mfn. It also breaks memory hotplug on i386 (maybe also other
> > > > > things, however, I could not confirm that). If it stays for some
> > > > > reason it should be amended in the following way:
> > > > >
> > > > > #ifdef CONFIG_X86_32
> > > > > xen_extra_mem_start = mem_end;
> > > > > #else
> > > > > xen_extra_mem_start = max((1ULL << 32), mem_end);
> > > > > #endif
> > > > >
> > > > > Regarding the comment for this patch, it should be mentioned that without this
> > > > > patch e820_end_of_low_ram_pfn() is not broken. It is simply not called.
>
> Hmm. What is max_pfn set to?
> Can you send the full dmesg of your guest?

See the attachments. Both dmesgs are from plain 2.6.39-rc6.
Each guest was allocated 2 GiB of memory.

> > > > > Last but not least. I found that memory sizes below and including exactly 1 GiB and
> > > > > exactly 2 GiB, 3 GiB (maybe higher, i.e. 4 GiB, 5 GiB, ...; I was not able to test
> > > > > them because I do not have sufficient memory) are magic. It means that if memory
> > > > > is set to those sizes everything works fine (without 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > > > > and 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 applied). It means that domU
> > > > > should be tested with sizes which are neither a power of two nor a multiple of that.
> > > >
> > > > Hmm, I thought I did test 1500M.
> > >
> > > It does not work on my machine (24bdb0b62cc82120924762ae6bc85afc8c3f2b26
> > > removed and 4b239f458c229de044d6905c2b0f9fe16ed9e01e applied).
> >
> > It does not work on my machine (x86_64) with Linux Kernel Ver. 2.6.39-rc6 without
> > git commit 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
> > e820 region at an addr lower than 4G). As I said earlier, the bug introduced by git
> > commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e (x86-64, mm: Put early page table
> > high) is probably hidden (repaired/worked around ???) by git commit
> > 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
> > e820 region at an addr lower than 4G).
>
> There are a couple of things that have been going to fix "x86-64, mm: Put
> early page table high" and also .. "cleanup highmem" (something) - which
> has been plaguing us since 2.6.32 (and was the one you hit long time ago).
>
> Anyhow, the setting of xen_extra_mem_start to 4GB or higher should
> be reworked. Not sure yet how.

OK. As I can see, it is a __VERY__ difficult problem. I will wait for
a proper solution. However, if I can help you in any way,
please drop me a line.

Daniel


Attachments:
(No filename) (3.55 kB)
dmesg.i386 (7.82 kB)
dmesg.x86_64 (7.90 kB)

2011-05-05 12:54:06

by Stefano Stabellini

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Wed, 4 May 2011, Daniel Kiper wrote:
> On Tue, May 03, 2011 at 09:51:41PM +0200, Daniel Kiper wrote:
> > On Tue, May 03, 2011 at 11:12:06AM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Tue, May 03, 2011 at 02:55:27AM +0200, Daniel Kiper wrote:
> > > > On Mon, May 02, 2011 at 01:22:21PM -0400, Konrad Rzeszutek Wilk wrote:
>
> [...]
>
> > > > I think that (Stefano please confirm or not) this patch was prepared
> > > > as workaround for similar issues. However, I do not like this patch
> > > > because on systems with small amount of memory it leaves huge (to some
> > > > extent) hole between max_low_pfn and 4G. Additionally, it affects
> > > > memory hotplug a bit because it allocates memory starting from current
> > > > max_mfn. It also breaks memory hotplug on i386 (maybe also other
> > > > things, however, I could not confirm that). If it stays for some
> > > > reason it should be amended in the following way:
> > > >
> > > > #ifdef CONFIG_X86_32
> > > > xen_extra_mem_start = mem_end;
> > > > #else
> > > > xen_extra_mem_start = max((1ULL << 32), mem_end);
> > > > #endif
> > > >
> > > > Regarding the comment for this patch, it should be mentioned that without this
> > > > patch e820_end_of_low_ram_pfn() is not broken. It is simply not called.
> > > >
> > > > Last but not least. I found that memory sizes below and including exactly 1 GiB and
> > > > exactly 2 GiB, 3 GiB (maybe higher, i.e. 4 GiB, 5 GiB, ...; I was not able to test
> > > > them because I do not have sufficient memory) are magic. It means that if memory
> > > > is set to those sizes everything works fine (without 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > > > and 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 applied). It means that domU
> > > > should be tested with sizes which are neither a power of two nor a multiple of that.
> > >
> > > Hmm, I thought I did test 1500M.
> >
> > It does not work on my machine (24bdb0b62cc82120924762ae6bc85afc8c3f2b26
> > removed and 4b239f458c229de044d6905c2b0f9fe16ed9e01e applied).
>
> It does not work on my machine (x86_64) with Linux Kernel Ver. 2.6.39-rc6 without
> git commit 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
> e820 region at an addr lower than 4G). As I said earlier, the bug introduced by git
> commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e (x86-64, mm: Put early page table
> high) is probably hidden (repaired/worked around ???) by git commit
> 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
> e820 region at an addr lower than 4G).
>
> Konrad, Stefano could you confirm that ??? If it is true
> how could I help you in removing this bug ???

The reason why "xen: do not create the extra e820 region at an addr
lower than 4G" is needed is the following:

Starting the extra memory region at an address lower than 4G is
dangerous: if the extra memory region spans the 4G boundary, then
e820_end_of_low_ram_pfn() gets confused and reports the end of low RAM
at 0x100000000, so init_memory_mapping() sets up a 1:1 mapping for
0-0x100000000, including the whole 3G-4G range.
Of course this is wrong because, among other things, it causes the kernel
to remap the IOAPIC MMIO registers, which should go through the fixmap
instead.
I don't think "x86-64, mm: Put early page table high" is related to the
bug that this commit is trying to solve.
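
For illustration, the failure mode described above can be modelled in a few
lines of standalone C. This is only a sketch under stated assumptions:
`struct e820_entry_model`, `end_pfn_below()` and the example maps are
hypothetical names and values, not the kernel's code, and it assumes the
extra region is reported as usable RAM. It shows how a limit-bounded scan
reports the limit itself as the end of low RAM as soon as a RAM region
straddles that limit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_MODEL 12
#define E820_RAM_MODEL   1

/* Hypothetical, simplified stand-in for one e820 map entry. */
struct e820_entry_model {
    uint64_t addr;   /* start address in bytes */
    uint64_t size;   /* length in bytes */
    uint32_t type;   /* E820_RAM_MODEL for usable RAM */
};

/*
 * Return the highest RAM page frame number below limit_pfn.
 * If a RAM entry starts below the limit and ends above it, the
 * result is clamped to limit_pfn: everything up to the limit is
 * then treated as RAM by the caller.
 */
uint64_t end_pfn_below(const struct e820_entry_model *map, size_t n,
                       uint64_t limit_pfn)
{
    uint64_t last_pfn = 0;

    assert(n == 0 || map != NULL);

    for (size_t i = 0; i < n; i++) {
        uint64_t start_pfn, end_pfn;

        if (map[i].type != E820_RAM_MODEL)
            continue;

        start_pfn = map[i].addr >> PAGE_SHIFT_MODEL;
        end_pfn = (map[i].addr + map[i].size) >> PAGE_SHIFT_MODEL;

        if (start_pfn >= limit_pfn)
            continue;            /* entirely above the limit: ignored */
        if (end_pfn > limit_pfn) {
            last_pfn = limit_pfn; /* straddles the limit: clamp and stop */
            break;
        }
        if (end_pfn > last_pfn)
            last_pfn = end_pfn;
    }
    return last_pfn;
}
```

With 2 GiB of RAM at 0-0x80000000, an extra region placed directly after it
and crossing 4G makes `end_pfn_below(map, 2, 1ULL << 20)` return 0x100000
(the 4G boundary), whereas starting the extra region at 4G leaves the result
at 0x80000.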

Could you please explain in detail what problems this patch creates?

In any case you are right about the fact that the change is not needed
on X86_32 so if it has any bad side effects on X86_32 we can always do
the following, like you suggested:

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 90bac0a..721f576 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -227,7 +227,11 @@ char * __init xen_memory_setup(void)

memcpy(map_raw, map, sizeof(map));
e820.nr_map = 0;
+#ifdef CONFIG_X86_32
+ xen_extra_mem_start = mem_end;
+#else
xen_extra_mem_start = max((1ULL << 32), mem_end);
+#endif
for (i = 0; i < memmap.nr_entries; i++) {
unsigned long long end;
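
The placement rule in the diff above can be sketched as a standalone C
helper. `pick_extra_mem_start()` and its `is_32bit` flag are hypothetical
stand-ins for the `#ifdef CONFIG_X86_32` split, and the page-alignment check
is an assumption of this sketch; it only illustrates where the extra region
would start for a given `mem_end`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_MODEL 4096ULL

/*
 * Hypothetical helper mirroring the assignment in the diff: on 32-bit
 * the extra region starts right after the guest's memory, on 64-bit it
 * is pushed up to at least 4G.
 */
uint64_t pick_extra_mem_start(uint64_t mem_end, bool is_32bit)
{
    const uint64_t four_gib = 1ULL << 32;

    /* Assumption of this sketch: mem_end is page-aligned. */
    assert(mem_end % PAGE_SIZE_MODEL == 0);

    if (is_32bit)
        return mem_end;
    return mem_end > four_gib ? mem_end : four_gib;
}
```

For a 2 GiB guest (`mem_end` = 0x80000000) the 64-bit branch yields
0x100000000, which leaves the hole between the end of RAM and 4G that Daniel
objects to, while the 32-bit branch yields 0x80000000.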

2011-05-05 14:01:56

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Thu, May 05, 2011 at 01:55:22PM +0100, Stefano Stabellini wrote:
> On Wed, 4 May 2011, Daniel Kiper wrote:
> > On Tue, May 03, 2011 at 09:51:41PM +0200, Daniel Kiper wrote:
> > > On Tue, May 03, 2011 at 11:12:06AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > On Tue, May 03, 2011 at 02:55:27AM +0200, Daniel Kiper wrote:
> > > > > On Mon, May 02, 2011 at 01:22:21PM -0400, Konrad Rzeszutek Wilk wrote:
> >
> > [...]
> >
> > > > > I think that (Stefano please confirm or not) this patch was prepared
> > > > > as workaround for similar issues. However, I do not like this patch
> > > > > because on systems with small amount of memory it leaves huge (to some
> > > > > extent) hole between max_low_pfn and 4G. Additionally, it affects
> > > > > memory hotplug a bit because it allocates memory starting from current
> > > > > max_mfn. It also breaks memory hotplug on i386 (maybe also other
> > > > > things, however, I could not confirm that). If it stays for some
> > > > > reason it should be amended in the following way:
> > > > >
> > > > > #ifdef CONFIG_X86_32
> > > > > xen_extra_mem_start = mem_end;
> > > > > #else
> > > > > xen_extra_mem_start = max((1ULL << 32), mem_end);
> > > > > #endif
> > > > >
> > > > > Regarding the comment for this patch, it should be mentioned that without this
> > > > > patch e820_end_of_low_ram_pfn() is not broken. It is simply not called.
> > > > >
> > > > > Last but not least. I found that memory sizes below and including exactly 1 GiB and
> > > > > exactly 2 GiB, 3 GiB (maybe higher, i.e. 4 GiB, 5 GiB, ...; I was not able to test
> > > > > them because I do not have sufficient memory) are magic. It means that if memory
> > > > > is set to those sizes everything works fine (without 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > > > > and 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 applied). It means that domU
> > > > > should be tested with sizes which are neither a power of two nor a multiple of that.
> > > >
> > > > Hmm, I thought I did test 1500M.
> > >
> > > It does not work on my machine (24bdb0b62cc82120924762ae6bc85afc8c3f2b26
> > > removed and 4b239f458c229de044d6905c2b0f9fe16ed9e01e applied).
> >
> > It does not work on my machine (x86_64) with Linux Kernel Ver. 2.6.39-rc6 without
> > git commit 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
> > e820 region at an addr lower than 4G). As I said earlier, the bug introduced by git
> > commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e (x86-64, mm: Put early page table
> > high) is probably hidden (repaired/worked around ???) by git commit
> > 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 (xen: do not create the extra
> > e820 region at an addr lower than 4G).
> >
> > Konrad, Stefano could you confirm that ??? If it is true
> > how could I help you in removing this bug ???
>
> The reason why "xen: do not create the extra e820 region at an addr
> lower than 4G" is needed is the following:
>
> Starting the extra memory region at an address lower than 4G is
> dangerous: if the extra memory region spans the 4G boundary, then
> e820_end_of_low_ram_pfn() gets confused and reports the end of low RAM
> at 0x100000000, so init_memory_mapping() sets up a 1:1 mapping for
> 0-0x100000000, including the whole 3G-4G range.
> Of course this is wrong because, among other things, it causes the kernel
> to remap the IOAPIC MMIO registers, which should go through the fixmap
> instead.
> I don't think "x86-64, mm: Put early page table high" is related to the
> bug that this commit is trying to solve.
>
> Could you please explain in detail what problems this patch creates?
>
> In any case you are right about the fact that the change is not needed
> on X86_32 so if it has any bad side effects on X86_32 we can always do

Why not? Can't you boot a 32-bit Dom0 on those machines?

> the following, like you suggested:
>
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 90bac0a..721f576 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -227,7 +227,11 @@ char * __init xen_memory_setup(void)
>
> memcpy(map_raw, map, sizeof(map));
> e820.nr_map = 0;
> +#ifdef CONFIG_X86_32
> + xen_extra_mem_start = mem_end;
> +#else
> xen_extra_mem_start = max((1ULL << 32), mem_end);
> +#endif
> for (i = 0; i < memmap.nr_entries; i++) {
> unsigned long long end;

2011-05-05 14:05:05

by Stefano Stabellini

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Thu, 5 May 2011, Konrad Rzeszutek Wilk wrote:
> > In any case you are right about the fact that the change is not needed
> > on X86_32 so if it has any bad side effects on X86_32 we can always do
>
> Why not? Can't you boot a 32-bit Dom0 on those machines?
>

Because e820_end_of_low_ram_pfn() is not called on 32-bit,
find_low_pfn_range() is called instead and works differently.
AFAICT find_low_pfn_range() is not affected by the issue described in
the previous email.

2011-05-05 16:29:20

by Konrad Rzeszutek Wilk

Subject: Re: [Xen-devel] Re: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high' - does not work.

> >> OK, sounds like a plan then. I like it because it doesn't affect the
> >> native kernel.
.. snip..

> First things first... are you pushing the workaround (you can add my
> Acked-by:) or should I?

Grrr.. While it works on my machines, it does not work on some of the AMD Opteron
CPU machines. I asked Stefan Bader from Canonical to run a simple bootup test with
2.6.39-rc6 (Linus has already picked up the workaround), and if it crashed, to try
the fix that Stefano came up with.

And sure enough - it crashed and Stefano's fix worked. I have Stefano's
patch in stable/bug-fixes-for-x86 and I am OK with reverting the fix I came up with.

Here is the email correspondence:

On 05/05/2011 04:56 PM, Konrad Rzeszutek Wilk wrote:
> On Thu, May 05, 2011 at 04:20:47PM +0200, Stefan Bader wrote:
>> On 05/04/2011 05:12 PM, Konrad Rzeszutek Wilk wrote:
>>> Hey Stefan,
>>>
>>> I have you as the only contact from Canonical who does Xen. Not sure
>>> if that is correct - if not, please forward me to the right person.
>>>
>>> When the v2.6.39 merge window opened, a patch from Yinghai ("Put early
>>> page table high") made x86_64 Linux kernels unbootable under Xen.
>>> A hack got added in v2.6.39-rc6 (git commit a38647837a411f7df79623128421eef2118b5884)
>>> which takes care of it. But it is a hack, and the testing we did
>>> does include a lot of hardware and config options - but not everything
>>> so we might have missed some edge cases where this particular fix
>>> won't work right.
>>>
>>> Hence I was wondering when you guys get to test this kernel (v2.6.39-rc6
>>> or later) under Xen if you could be mindful of this particular fix - and
>>> see if does the job or if it is insufficient. If it is insufficient
>>> there is a backup patch in 'stable/bug-fixes-for-x86'
>>> (git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git)
>>> that is more surgical in its fixing (but the x86 maintainers
>>> don't like it).
>>>
>>> Thanks,
>>>
>>> Konrad
>>
>> Hi Konrad,
>>
>> I just did the tests with the -rc6 kernel. While a smaller memory setup (615M,
>> one vcpu) boots successfully, I see an early crash with a bigger setup (7680M, 2
>
> ok..
>> vcpus) which I do not get with a 2.6.38 based kernel (the crash log is attached
>> but I did not do any real analysis).
>
> <nods>
>>
>> My test system is CentOS 5.6 based with Xen 3.4.3 (from gitco) on an AMD Opteron
>> CPU. As said, I have not really looked into the dump as I wanted to let you know
>> early. I will try the backup patch next and see how that goes.
>
> Great! Thank you.
>>
>> -Stefan
>
>> [ 0.000000]
>> <6>[ 0.000000] Initializing cgroup subsys cpuset
>> <6>[ 0.000000] Initializing cgroup subsys cpu
>> <5>[ 0.000000] Linux version 2.6.39-0-virtual (root@tangerine) (gcc version 4.6.1 20110428 (prerelease) (Ubuntu 4.6.0-6ubuntu1) ) #6~smb1 SMP Thu May 5 09:28:20 UTC 2011 (Ubuntu 2.6.39-0.6~smb1-virtual 2.6.39-rc5)
>> <6>[ 0.000000] Command line: root=LABEL=uec-rootfs ro console=hvc0
>> <6>[ 0.000000] KERNEL supported cpus:
>> <6>[ 0.000000] Intel GenuineIntel
>> <6>[ 0.000000] AMD AuthenticAMD
>> <6>[ 0.000000] Centaur CentaurHauls
>> <6>[ 0.000000] ACPI in unprivileged domain disabled
>> <6>[ 0.000000] released 0 pages of unused memory
>> <6>[ 0.000000] Set 0 page(s) to 1-1 mapping.
>> <6>[ 0.000000] BIOS-provided physical RAM map:
>> <6>[ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
>> <6>[ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
>> <6>[ 0.000000] Xen: 0000000000100000 - 00000001e0800000 (usable)
>> <6>[ 0.000000] NX (Execute Disable) protection: active
>> <6>[ 0.000000] DMI not present or invalid.
>> <7>[ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
>> <7>[ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
>> <6>[ 0.000000] No AGP bridge found
>> <6>[ 0.000000] last_pfn = 0x1e0800 max_arch_pfn = 0x400000000
>> <6>[ 0.000000] last_pfn = 0x100000 max_arch_pfn = 0x400000000
>> <7>[ 0.000000] initial memory mapped : 0 - 02c3a000
>> <7>[ 0.000000] Base memory trampoline at [ffff88000009b000] 9b000 size 20480
>> <6>[ 0.000000] init_memory_mapping: 0000000000000000-0000000100000000
>> <7>[ 0.000000] 0000000000 - 0100000000 page 4k
>> <7>[ 0.000000] kernel direct mapping tables up to 100000000 @ ff7fb000-100000000
>> <6>[ 0.000000] init_memory_mapping: 0000000100000000-00000001e0800000
>> <7>[ 0.000000] 0100000000 - 01e0800000 page 4k
>> <7>[ 0.000000] kernel direct mapping tables up to 1e0800000 @ 1df0f3000-1e0000000
>> <7>[ 0.000000] xen: setting RW the range fffdc000 - 100000000
>> <6>[ 0.000000] RAMDISK: 0203b000 - 02c3a000
>> <6>[ 0.000000] No NUMA configuration found
>> <6>[ 0.000000] Faking a node at 0000000000000000-00000001e0800000
>> <7>[ 0.000000] NUMA: Using 63 for the hash shift.
>> <6>[ 0.000000] Initmem setup node 0 0000000000000000-00000001e0800000
>> <6>[ 0.000000] NODE_DATA [00000001dfffb000 - 00000001dfffffff]
>> <1>[ 0.000000] BUG: unable to handle kernel NULL pointer dereference at (null)
>> <1>[ 0.000000] IP: [<ffffffff81cf6a75>] setup_node_bootmem+0x18a/0x1ea
>> <4>[ 0.000000] PGD 0
>> <0>[ 0.000000] Oops: 0003 [#1] SMP
>> <0>[ 0.000000] last sysfs file:
>> <4>[ 0.000000] CPU 0
>> <4>[ 0.000000] Modules linked in:
>> <4>[ 0.000000]
>> <4>[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.39-0-virtual #6~smb1
>> <4>[ 0.000000] RIP: e030:[<ffffffff81cf6a75>] [<ffffffff81cf6a75>] setup_node_bootmem+0x18a/0x1ea
>> <4>[ 0.000000] RSP: e02b:ffffffff81c01e38 EFLAGS: 00010046
>> <4>[ 0.000000] RAX: 0000000000000000 RBX: 00000001e0800000 RCX: 0000000000001040
>> <4>[ 0.000000] RDX: 0000000000004100 RSI: 0000000000000000 RDI: ffff8801dfffb000
>> <4>[ 0.000000] RBP: ffffffff81c01e58 R08: 0000000000000020 R09: 0000000000000000
>> <4>[ 0.000000] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
>> <4>[ 0.000000] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000bfe400
>> <4>[ 0.000000] FS: 0000000000000000(0000) GS:ffffffff81cca000(0000) knlGS:0000000000000000
>> <4>[ 0.000000] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>> <4>[ 0.000000] CR2: 0000000000000000 CR3: 0000000001c03000 CR4: 0000000000000660
>> <4>[ 0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> <4>[ 0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> <4>[ 0.000000] Process swapper (pid: 0, threadinfo ffffffff81c00000, task ffffffff81c0b020)
>> <0>[ 0.000000] Stack:
>> <4>[ 0.000000] 0000000000000040 0000000000000001 0000000000000000 ffffffffffffffff
>> <4>[ 0.000000] ffffffff81c01e88 ffffffff81cf6c25 0000000000000000 0000000000000000
>> <4>[ 0.000000] ffffffff81cf687f 0000000000000000 ffffffff81c01ea8 ffffffff81cf6e45
>> <0>[ 0.000000] Call Trace:
>> <4>[ 0.000000] [<ffffffff81cf6c25>] numa_register_memblks.constprop.3+0x150/0x181
>> <4>[ 0.000000] [<ffffffff81cf687f>] ? numa_add_memblk+0x7c/0x7c
>> <4>[ 0.000000] [<ffffffff81cf6e45>] numa_init.part.2+0x1c/0x7c
>> <4>[ 0.000000] [<ffffffff81cf687f>] ? numa_add_memblk+0x7c/0x7c
>> <4>[ 0.000000] [<ffffffff81cf6f67>] numa_init+0x6c/0x70
>> <4>[ 0.000000] [<ffffffff81cf7057>] initmem_init+0x39/0x3b
>> <4>[ 0.000000] [<ffffffff81ce5865>] setup_arch+0x64e/0x769
>> <4>[ 0.000000] [<ffffffff815e43c1>] ? printk+0x51/0x53
>> <4>[ 0.000000] [<ffffffff81cdf92b>] start_kernel+0xd4/0x3f3
>> <4>[ 0.000000] [<ffffffff81cdf388>] x86_64_start_reservations+0x132/0x136
>> <4>[ 0.000000] [<ffffffff81ce2ed4>] xen_start_kernel+0x588/0x58f
>> <0>[ 0.000000] Code: 41 00 00 48 8b 3c c5 a0 24 cc 81 31 c0 40 f6 c7 01 74 05 aa 66 ba ff 40 40 f6 c7 02 74 05 66 ab 83 ea 02 89 d1 c1 e9 02 f6 c2 02 <f3> ab 74 02 66 ab 80 e2 01 74 01 aa 49 63 c4 48 c1 eb 0c 44 89
>> <1>[ 0.000000] RIP [<ffffffff81cf6a75>] setup_node_bootmem+0x18a/0x1ea
>> <4>[ 0.000000] RSP <ffffffff81c01e38>
>> <0>[ 0.000000] CR2: 0000000000000000
>> <4>[ 0.000000] ---[ end trace a7919e7f17c0a725 ]---
>> <0>[ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
>> <4>[ 0.000000] Pid: 0, comm: swapper Tainted: G D 2.6.39-0-virtual #6~smb1
>> <4>[ 0.000000] Call Trace:
>> <4>[ 0.000000] [<ffffffff815e426d>] panic+0x91/0x194
>> <4>[ 0.000000] [<ffffffff81063ffb>] do_exit+0x40b/0x440
>> <4>[ 0.000000] [<ffffffff815fb990>] oops_end+0xb0/0xf0
>> <4>[ 0.000000] [<ffffffff815e332b>] no_context+0x145/0x152
>> <4>[ 0.000000] [<ffffffff815e34c6>] __bad_area_nosemaphore+0x18e/0x1b1
>> <4>[ 0.000000] [<ffffffff815e34fc>] bad_area_nosemaphore+0x13/0x15
>> <4>[ 0.000000] [<ffffffff815fe29d>] do_page_fault+0x43d/0x530
>> <4>[ 0.000000] [<ffffffff815fa70e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
>> <4>[ 0.000000] [<ffffffff8100721f>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
>> <4>[ 0.000000] [<ffffffff81060161>] ? vprintk+0x231/0x4b0
>> <4>[ 0.000000] [<ffffffff815facd5>] page_fault+0x25/0x30
>> <4>[ 0.000000] [<ffffffff81cf6a75>] ? setup_node_bootmem+0x18a/0x1ea
>> <4>[ 0.000000] [<ffffffff81cf6a16>] ? setup_node_bootmem+0x12b/0x1ea
>> <4>[ 0.000000] [<ffffffff81cf6c25>] numa_register_memblks.constprop.3+0x150/0x181
>> <4>[ 0.000000] [<ffffffff81cf687f>] ? numa_add_memblk+0x7c/0x7c
>> <4>[ 0.000000] [<ffffffff81cf6e45>] numa_init.part.2+0x1c/0x7c
>> <4>[ 0.000000] [<ffffffff81cf687f>] ? numa_add_memblk+0x7c/0x7c
>> <4>[ 0.000000] [<ffffffff81cf6f67>] numa_init+0x6c/0x70
>> <4>[ 0.000000] [<ffffffff81cf7057>] initmem_init+0x39/0x3b
>> <4>[ 0.000000] [<ffffffff81ce5865>] setup_arch+0x64e/0x769
>> <4>[ 0.000000] [<ffffffff815e43c1>] ? printk+0x51/0x53
>> <4>[ 0.000000] [<ffffffff81cdf92b>] start_kernel+0xd4/0x3f3
>> <4>[ 0.000000] [<ffffffff81cdf388>] x86_64_start_reservations+0x132/0x136
>> <4>[ 0.000000] [<ffffffff81ce2ed4>] xen_start_kernel+0x588/0x58f
>

This seems to work better. I reverted the work-around patch and applied the two
patches at the tip of your branch. Both configurations boot now. To get a bit
more coverage, I would also put this second version into some EC2 instances
(they are usually Intel-based). But at least locally the second approach seems
better (even with the x86 maintainers' dislike ;)).

-Stefan
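
One way to compare the crash log quoted above with the attached dmesg.log is to check, for each "setting RW the range" line, whether the printed range lies inside the page-table range printed just before it. The sketch below does only that. The helper name check_rw_ranges and the two sample strings are illustrative; the log lines are copied from this thread with the timestamps dropped. It is a reading of the printed numbers, not a statement about the root cause.

```python
import re

# Patterns for the two dmesg lines of interest (hex values without a 0x prefix).
TABLES = re.compile(
    r"kernel direct mapping tables up to [0-9a-f]+ @ ([0-9a-f]+)-([0-9a-f]+)")
RW = re.compile(r"xen: setting RW the range ([0-9a-f]+) - ([0-9a-f]+)")


def check_rw_ranges(log_text):
    """Pair each 'setting RW' line with the preceding 'tables' line.

    Returns a list of (tables_range, rw_range, inside) tuples, where
    inside is True when the RW range lies within the tables range.
    """
    results = []
    tables = None
    for line in log_text.splitlines():
        m = TABLES.search(line)
        if m:
            tables = (int(m.group(1), 16), int(m.group(2), 16))
            continue
        m = RW.search(line)
        if m and tables is not None:
            rw = (int(m.group(1), 16), int(m.group(2), 16))
            inside = tables[0] <= rw[0] and rw[1] <= tables[1]
            results.append((tables, rw, inside))
            tables = None
    return results


# Lines from the quoted crash log (second mapping only).
crash_log = """\
kernel direct mapping tables up to 1e0800000 @ 1df0f3000-1e0000000
xen: setting RW the range fffdc000 - 100000000
"""

# Lines from the attached dmesg.log of the kernel that boots.
boot_log = """\
kernel direct mapping tables up to 100000000 @ ff7fb000-100000000
xen: setting RW the range fffdc000 - 100000000
kernel direct mapping tables up to 1e0800000 @ 1df0f3000-1e0000000
xen: setting RW the range 1df7fb000 - 1e0000000
"""

for name, text in (("crash", crash_log), ("boot", boot_log)):
    for tables, rw, inside in check_rw_ranges(text):
        print(f"{name}: tables {tables[0]:#x}-{tables[1]:#x} "
              f"rw {rw[0]:#x}-{rw[1]:#x} inside={inside}")
```

In the crash log the range printed after the second mapping falls outside that mapping's table range, while in the booting log both ranges fall inside their respective table ranges.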


Attachment: dmesg.log

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 2.6.39-0-virtual (root@tangerine) (gcc version 4.6.1 20110428 (prerelease) (Ubuntu 4.6.0-6ubuntu1) ) #6~smb2 SMP Thu May 5 14:44:59 UTC 2011 (Ubuntu 2.6.39-0.6~smb2-virtual 2.6.39-rc5)
[ 0.000000] Command line: root=LABEL=uec-rootfs ro console=hvc0
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Centaur CentaurHauls
[ 0.000000] ACPI in unprivileged domain disabled
[ 0.000000] released 0 pages of unused memory
[ 0.000000] Set 0 page(s) to 1-1 mapping.
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[ 0.000000] Xen: 0000000000100000 - 00000001e0800000 (usable)
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] DMI not present or invalid.
[ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
[ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
[ 0.000000] No AGP bridge found
[ 0.000000] last_pfn = 0x1e0800 max_arch_pfn = 0x400000000
[ 0.000000] last_pfn = 0x100000 max_arch_pfn = 0x400000000
[ 0.000000] initial memory mapped : 0 - 02c3a000
[ 0.000000] Base memory trampoline at [ffff88000009b000] 9b000 size 20480
[ 0.000000] init_memory_mapping: 0000000000000000-0000000100000000
[ 0.000000] 0000000000 - 0100000000 page 4k
[ 0.000000] kernel direct mapping tables up to 100000000 @ ff7fb000-100000000
[ 0.000000] xen: setting RW the range fffdc000 - 100000000
[ 0.000000] init_memory_mapping: 0000000100000000-00000001e0800000
[ 0.000000] 0100000000 - 01e0800000 page 4k
[ 0.000000] kernel direct mapping tables up to 1e0800000 @ 1df0f3000-1e0000000
[ 0.000000] xen: setting RW the range 1df7fb000 - 1e0000000
[ 0.000000] RAMDISK: 0203b000 - 02c3a000
[ 0.000000] No NUMA configuration found
[ 0.000000] Faking a node at 0000000000000000-00000001e0800000
[ 0.000000] NUMA: Using 63 for the hash shift.
[ 0.000000] Initmem setup node 0 0000000000000000-00000001e0800000
[ 0.000000] NODE_DATA [00000001dfffb000 - 00000001dfffffff]
[ 0.000000] Zone PFN ranges:
[ 0.000000] DMA 0x00000010 -> 0x00001000
[ 0.000000] DMA32 0x00001000 -> 0x00100000
[ 0.000000] Normal 0x00100000 -> 0x001e0800
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[2] active PFN ranges
[ 0.000000] 0: 0x00000010 -> 0x000000a0
[ 0.000000] 0: 0x00000100 -> 0x001e0800
[ 0.000000] On node 0 totalpages: 1968016
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 5 pages reserved
[ 0.000000] DMA zone: 3923 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 1030200 pages, LIFO batch:31
[ 0.000000] Normal zone: 12572 pages used for memmap
[ 0.000000] Normal zone: 906980 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] SMP: Allowing 2 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched to apic NOOP
[ 0.000000] nr_irqs_gsi: 16
[ 0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
[ 0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
[ 0.000000] PCI: Unassigned devices with 32bit resource registers may break!
[ 0.000000] Allocating PCI resources starting at 1e0900000 (gap: 1e0900000:400000)
[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 3.4.3 (preserve-AD)
[ 0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:2 nr_node_ids:1
[ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff8801dff74000 s84608 r8192 d21888 u114688
[ 0.000000] pcpu-alloc: s84608 r8192 d21888 u114688 alloc=28*4096
[ 0.000000] pcpu-alloc: [0] 0 [0] 1
[ 0.000000] Built 1 zonelists in Node order, mobility grouping on. Total pages: 1941103
[ 0.000000] Policy zone: Normal
[ 0.000000] Kernel command line: root=LABEL=uec-rootfs ro console=hvc0
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] Checking aperture...
[ 0.000000] No AGP bridge found
[ 0.000000] Calgary: detecting Calgary via BIOS EBDA area
[ 0.000000] Calgary: Unable to locate Rio Grande table in EBDA - bailing!
[ 0.000000] Memory: 7629212k/7872512k available (6169k kernel code, 448k absent, 242852k reserved, 6921k data, 912k init)
[ 0.000000] SLUB: Genslabs=15, HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
[ 0.000000] RCU-based detection of stalled CPUs is disabled.
[ 0.000000] NR_IRQS:4352 nr_irqs:288 16
[ 0.000000] Console: colour dummy device 80x25
[ 0.000000] console [tty0] enabled
[ 0.000000] console [hvc0] enabled
[ 0.000000] allocated 63963136 bytes of page_cgroup
[ 0.000000] please try 'cgroup_disable=memory' option if you don't want memory cgroups
[ 0.000000] Xen: using vcpuop timer interface
[ 0.000000] installing Xen timer for CPU 0
[ 0.000000] Detected 2000.138 MHz processor.
[ 0.010000] Calibrating delay loop (skipped), value calculated using timer frequency.. 4000.27 BogoMIPS (lpj=20001380)
[ 0.010000] pid_max: default: 32768 minimum: 301
[ 0.010000] Security Framework initialized
[ 0.010000] AppArmor: AppArmor initialized
[ 0.010000] Yama: becoming mindful.
[ 0.010000] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
[ 0.010000] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
[ 0.010000] Mount-cache hash table entries: 256
[ 0.010000] Initializing cgroup subsys ns
[ 0.010000] ns_cgroup deprecated: consider using the 'clone_children' flag without the ns_cgroup.
[ 0.010000] Initializing cgroup subsys cpuacct
[ 0.010000] Initializing cgroup subsys memory
[ 0.010000] Initializing cgroup subsys devices
[ 0.010000] Initializing cgroup subsys freezer
[ 0.010000] Initializing cgroup subsys net_cls
[ 0.010000] Initializing cgroup subsys blkio
[ 0.010000] Initializing cgroup subsys perf_event
[ 0.010000] tseg: 00dff00000
[ 0.010000] CPU: Physical Processor ID: 0
[ 0.010000] CPU: Processor Core ID: 0
[ 0.010000] SMP alternatives: switching to UP code
[ 0.010000] ftrace: allocating 25924 entries in 102 pages
[ 0.010144] cpu 0 spinlock event irq 17
[ 0.010188] Performance Events:
[ 0.010195] no APIC, boot with the "lapic" boot parameter to force-enable it.
[ 0.010204] no hardware sampling interrupt available.
[ 0.010241] Broken PMU hardware detected, using software events only.
[ 0.010952] installing Xen timer for CPU 1
[ 0.010978] cpu 1 spinlock event irq 23
[ 0.011022] SMP alternatives: switching to SMP code
[ 0.020145] Brought up 2 CPUs
[ 0.020248] devtmpfs: initialized
[ 0.021630] Grant table initialized
[ 0.021630] print_constraints: dummy:
[ 0.043023] Time: 165:165:165 Date: 165/165/65
[ 0.043085] NET: Registered protocol family 16
[ 0.043283] Trying to unpack rootfs image as initramfs...
[ 0.050186] Extended Config Space enabled on 0 nodes
[ 0.050794] PCI: setting up Xen PCI frontend stub
[ 0.050794] PCI: pci_cache_line_size set to 64 bytes
[ 0.052494] bio: create slab <bio-0> at 0
[ 0.052633] ACPI: Interpreter disabled.
[ 0.053004] xen/balloon: Initialising balloon driver.
[ 0.053004] last_pfn = 0x1e0800 max_arch_pfn = 0x400000000
[ 0.054255] xen-balloon: Initialising balloon driver.
[ 0.054385] vgaarb: loaded
[ 0.054385] SCSI subsystem initialized
[ 0.054792] libata version 3.00 loaded.
[ 0.054792] usbcore: registered new interface driver usbfs
[ 0.054792] usbcore: registered new interface driver hub
[ 0.054932] usbcore: registered new device driver usb
[ 0.055054] PCI: System does not support PCI
[ 0.055054] PCI: System does not support PCI
[ 0.055144] NetLabel: Initializing
[ 0.055144] NetLabel: domain hash size = 128
[ 0.055144] NetLabel: protocols = UNLABELED CIPSOv4
[ 0.055144] NetLabel: unlabeled traffic allowed by default
[ 0.055144] Switching to clocksource xen
[ 0.055446] Switched to NOHz mode on CPU #0
[ 0.055443] Switched to NOHz mode on CPU #1
[ 0.057737] AppArmor: AppArmor Filesystem Enabled
[ 0.057771] pnp: PnP ACPI: disabled
[ 0.059315] NET: Registered protocol family 2
[ 0.059780] Freeing initrd memory: 12284k freed
[ 0.059878] IP route cache hash table entries: 262144 (order: 9, 2097152 bytes)
[ 0.062730] TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
[ 0.066505] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[ 0.066992] TCP: Hash tables configured (established 524288 bind 65536)
[ 0.067003] TCP reno registered
[ 0.067043] UDP hash table entries: 4096 (order: 5, 131072 bytes)
[ 0.067149] UDP-Lite hash table entries: 4096 (order: 5, 131072 bytes)
[ 0.067347] NET: Registered protocol family 1
[ 0.070009] PCI: CLS 0 bytes, default 64
[ 0.070009] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 0.070009] Placing 64MB software IO TLB between ffff8800fb7fb000 - ffff8800ff7fb000
[ 0.070009] software IO TLB at phys 0xfb7fb000 - 0xff7fb000
[ 0.072259] platform rtc_cmos: registered platform RTC device (no PNP device found)
[ 0.072597] audit: initializing netlink socket (disabled)
[ 0.072627] type=2000 audit(1304608433.099:1): initialized
[ 0.096031] HugeTLB registered 2 MB page size, pre-allocated 0 pages
[ 0.098001] VFS: Disk quotas dquot_6.5.2
[ 0.098076] Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[ 0.098825] fuse init (API version 7.16)
[ 0.098933] msgmni has been set to 14924
[ 0.099356] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
[ 0.099423] io scheduler noop registered
[ 0.099431] io scheduler deadline registered (default)
[ 0.099487] io scheduler cfq registered
[ 0.099580] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[ 0.099699] pciehp: PCI Express Hot Plug Controller Driver version: 0.4
[ 0.100155] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
[ 0.440229] Linux agpgart interface v0.103
[ 0.441329] brd: module loaded
[ 0.441840] loop: module loaded
[ 0.444335] blkfront device/vbd/2049 num-ring-pages 1 nr_ents 32.
[ 0.448802] blkfront device/vbd/2050 num-ring-pages 1 nr_ents 32.
[ 0.449898] blkfront: xvde1: barriers disabled
[ 0.450534] Fixed MDIO Bus: probed
[ 0.450575] PPP generic driver version 2.4.2
[ 0.450642] Initialising Xen virtual ethernet driver.
[ 0.453787] blkfront: xvde2: barriers disabled
[ 0.453890] tun: Universal TUN/TAP device driver, 1.6
[ 0.453902] tun: (C) 1999-2004 Max Krasnyansky <[email protected]>
[ 0.454012] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[ 0.454033] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[ 0.454047] uhci_hcd: USB Universal Host Controller Interface driver
[ 0.454104] i8042: PNP: No PS/2 controller found. Probing ports directly.
[ 0.454935] i8042: No controller found
[ 0.455000] mousedev: PS/2 mouse device common for all mice
[ 0.455045] Setting capacity to 24576000
[ 0.455063] xvde2: detected capacity change from 0 to 12582912000
[ 0.495214] rtc_cmos rtc_cmos: rtc core: registered rtc_cmos as rtc0
[ 0.495280] rtc_cmos: probe of rtc_cmos failed with error -38
[ 0.495448] device-mapper: uevent: version 1.0.3
[ 0.495539] device-mapper: ioctl: 4.20.0-ioctl (2011-02-02) initialised: [email protected]
[ 0.495724] device-mapper: multipath: version 1.3.0 loaded
[ 0.495734] device-mapper: multipath round-robin: version 1.0.0 loaded
[ 0.495813] cpuidle: using governor ladder
[ 0.495820] cpuidle: using governor menu
[ 0.495825] EFI Variables Facility v0.08 2004-May-17
[ 0.496143] TCP cubic registered
[ 0.496282] NET: Registered protocol family 10
[ 0.497112] NET: Registered protocol family 17
[ 0.497137] Registering the dns_resolver key type
[ 0.497188] powernow-k8: Found 1 AMD Opteron(tm) Processor 6128 (2 cpu cores) (version 2.20.00)
[ 0.497216] [Firmware Bug]: powernow-k8: No compatible ACPI _PSS objects found.
[ 0.497218] [Firmware Bug]: powernow-k8: Try again with latest BIOS.
[ 0.497312] PM: Hibernation image not present or could not be loaded.
[ 0.497327] registered taskstats version 1
[ 0.497356] XENBUS: Device with no driver: device/console/0
[ 0.497376] Magic number: 1:252:3141
[ 0.497426] /home/smb/oneiric-amd64/ubuntu-2.6/drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
[ 0.497437] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 0.497444] EDD information not available.
[ 0.498007] Freeing unused kernel memory: 912k freed
[ 0.498343] Write protecting the kernel read-only data: 12288k
[ 0.506235] Freeing unused kernel memory: 2004k freed
[ 0.507671] Freeing unused kernel memory: 1380k freed
[ 0.560935] udev[59]: starting version 167
[ 0.692128] EXT4-fs (xvde1): mounted filesystem with ordered data mode. Opts: (null)
[ 0.908740] EXT4-fs (xvde1): re-mounted. Opts: (null)
[ 0.938669] udev[206]: starting version 167
[ 0.968617] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[ 1.186280] type=1400 audit(1304608434.208:2): apparmor="STATUS" operation="profile_load" name="/sbin/dhclient" pid=294 comm="apparmor_parser"
[ 1.191109] type=1400 audit(1304608434.218:3): apparmor="STATUS" operation="profile_load" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=294 comm="apparmor_parser"
[ 1.191399] type=1400 audit(1304608434.218:4): apparmor="STATUS" operation="profile_load" name="/usr/lib/connman/scripts/dhclient-script" pid=294 comm="apparmor_parser"
[ 1.535007] type=1400 audit(1304608434.558:5): apparmor="STATUS" operation="profile_replace" name="/sbin/dhclient" pid=401 comm="apparmor_parser"
[ 1.538107] type=1400 audit(1304608434.558:6): apparmor="STATUS" operation="profile_replace" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=401 comm="apparmor_parser"
[ 1.538405] type=1400 audit(1304608434.558:7): apparmor="STATUS" operation="profile_replace" name="/usr/lib/connman/scripts/dhclient-script" pid=401 comm="apparmor_parser"
[ 1.547190] type=1400 audit(1304608434.568:8): apparmor="STATUS" operation="profile_load" name="/usr/sbin/tcpdump" pid=402 comm="apparmor_parser"
[ 12.210059] eth0: no IPv6 routers present

2011-05-05 18:45:27

by Konrad Rzeszutek Wilk

Subject: Re: [Xen-devel] Re: [PATCH] Two patches fixing regression introduced by 'x86-64, mm: Put early page table high' - does not work.

On Thu, May 05, 2011 at 12:28:49PM -0400, Konrad Rzeszutek Wilk wrote:
> > >> OK, sounds like a plan then. I like it because it doesn't affect the
> > >> native kernel.
> .. snip..
>
> > First things first... are you pushing the workaround (you can add my
> > Acked-by:) or should I?
>
> Grrr.. While it works on my machines, it does not work on some of the AMD Opteron
> CPU machines. I asked Stefan Bader from Canonical to run a simple bootup test with
> 2.6.39-rc6 (Linus picked it already), and if it crashed, to use the
> one that Stefano came up with.
>
> And sure enough - it crashed and Stefano's fix worked. I have Stefano's
> patch in stable/bug-fixes-for-x86 and I am OK reverting the fix I came up with.

To make it easier, I made a branch called

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/bug-fixes-for-rc6

that has all the right magic sauce in it:

Konrad Rzeszutek Wilk (1):
Revert "xen/mmu: Add workaround "x86-64, mm: Put early page table high""

Sedat Dilek (1):
x86/mm: Fix section mismatch derived from native_pagetable_reserve()

Stefano Stabellini (1):
x86,xen: introduce x86_init.mapping.pagetable_reserve

2011-05-09 07:07:22

by Daniel Kiper

Subject: Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"

On Thu, May 05, 2011 at 03:06:18PM +0100, Stefano Stabellini wrote:
> On Thu, 5 May 2011, Konrad Rzeszutek Wilk wrote:
> > > In any case you are right about the fact that the change is not needed
> > > on X86_32 so if it has any bad side effects on X86_32 we can always do
> >
> > Why not? Can't you boot a 32-bit Dom0 on those machines?
>
> Because e820_end_of_low_ram_pfn() is not called on 32-bit,
> find_low_pfn_range() is called instead and works differently.
> AFAICT find_low_pfn_range() is not affected by the issue described in
> the previous email.

Yes, you are right. Additionally, I tested your patch with my amendments earlier
and it works on i386. I am going to prepare the relevant patch today or tomorrow.
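
For readers following the 64-bit side of Stefano's point, here is a rough user-space model of how an "end of low RAM" PFN can be derived from the e820 map in the dmesg.log attached earlier in this thread. It is a sketch under assumptions, not kernel source: the function end_pfn and the E820 table are illustrative, the clamping logic is a simplified recollection of what e820_end_of_ram_pfn() and e820_end_of_low_ram_pfn() do, and the 32-bit find_low_pfn_range() path is not modelled at all.

```python
PAGE_SHIFT = 12                              # 4 KiB pages
MAX_ARCH_PFN = 0x400000000                   # "max_arch_pfn" as printed in the log
LOW_RAM_LIMIT_PFN = 1 << (32 - PAGE_SHIFT)   # 4 GiB expressed in pages

# (start, end, type) taken from the "Xen:" RAM map lines in the log, with the
# first 64 KiB marked reserved as the "e820 update range" line reports.
E820 = [
    (0x0000000000000000, 0x0000000000010000, "reserved"),
    (0x0000000000010000, 0x00000000000a0000, "usable"),
    (0x00000000000a0000, 0x0000000000100000, "reserved"),
    (0x0000000000100000, 0x00000001e0800000, "usable"),
]


def end_pfn(e820, limit_pfn):
    """Highest usable PFN boundary, clamped to limit_pfn (assumed logic)."""
    last_pfn = 0
    for start, end, kind in e820:
        if kind != "usable":
            continue
        start_pfn = start >> PAGE_SHIFT
        region_end_pfn = end >> PAGE_SHIFT
        if start_pfn >= limit_pfn:
            continue                 # region lies entirely above the limit
        if region_end_pfn > limit_pfn:
            last_pfn = limit_pfn     # region straddles the limit: clamp
            break
        last_pfn = max(last_pfn, region_end_pfn)
    return min(last_pfn, MAX_ARCH_PFN)


max_pfn = end_pfn(E820, MAX_ARCH_PFN)           # end of all RAM
max_low_pfn = end_pfn(E820, LOW_RAM_LIMIT_PFN)  # end of RAM below 4 GiB

print(f"last_pfn = {max_pfn:#x}")      # 0x1e0800, as in the first last_pfn line
print(f"last_pfn = {max_low_pfn:#x}")  # 0x100000, as in the second last_pfn line
```

With this map the two results reproduce the two consecutive "last_pfn" lines in the log (0x1e0800 and 0x100000), which is the 64-bit behaviour that, per Stefano, the 32-bit path does not share.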

Daniel