2008-02-12 23:54:31

by Jody Belka

Subject: 2.6.25-rc1 xen pvops regression

Hi all,

I thought I'd try out 2.6.25-rc1 as a xen 32-bit pae domU the other day.
Unfortunately, I didn't get very far very fast, as the domain just crashed
immediately upon booting, without any direct feedback (I did have messages
on the xen message buffer, which helped). This even with earlyprintk turned on.

After a long, arduous journey, I managed to track this down to the following:

----------
commit 551889a6e2a24a9c06fd453ea03b57b7746ffdc0

x86: construct 32-bit boot time page tables in native format.

Specifically the boot time page tables in a CONFIG_X86_PAE=y enabled
kernel are in PAE format.

early_ioremap is updated to use the standard page table accessors.

Clear any mappings beyond max_low_pfn from the boot page tables in
native_pagetable_setup_start because the initial mappings can extend
beyond the range of physical memory and into the vmalloc area.

Derived from patches by Eric Biederman and H. Peter Anvin.

[ [email protected]: PAE swapper_pg_dir needs to be page-sized fix ]
----------

However, to make life more interesting, just reverting this isn't quite
enough to get us to the promised land. If we try, we find that although
we do now start booting, we crash again a short way into the process.

In a different manner though. Specifically, in early_ioremap_clear.
Reverting the above commit /except/ for the changes to arch/x86/mm/ioremap.c
gets everything working again.

Well, except that we can't shutdown/reboot properly, but I've sent a patch
for that in another email.


I'm afraid I've no idea what needs to be done to get the change to work
with xen, but I'm willing to try out any patches people come up with.
Please cc me on any replies, as I'm not subscribed, thanks.


J
--
Jody Belka
knew (at) pimb (dot) org


2008-02-13 12:01:17

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

Jody Belka wrote:
> Hi all,
>
> I thought I'd try out 2.6.25-rc1 as a xen 32-bit pae domU the other day.
> Unfortunately, I didn't get very far very fast, as the domain just crashed
> immediately upon booting, without any direct feedback (I did have messages
> on the xen message buffer, which helped). This even with earlyprintk turned on.
>
> After a long, arduous journey, I managed to track this down to the following:
>
> ----------
> commit 551889a6e2a24a9c06fd453ea03b57b7746ffdc0
>
> x86: construct 32-bit boot time page tables in native format.
>
> Specifically the boot time page tables in a CONFIG_X86_PAE=y enabled
> kernel are in PAE format.
>
> early_ioremap is updated to use the standard page table accessors.
>
> Clear any mappings beyond max_low_pfn from the boot page tables in
> native_pagetable_setup_start because the initial mappings can extend
> beyond the range of physical memory and into the vmalloc area.
>
> Derived from patches by Eric Biederman and H. Peter Anvin.
>
> [ [email protected]: PAE swapper_pg_dir needs to be page-sized fix ]
> ----------
>
> However, to make life more interesting, just reverting this isn't quite
> enough to get us to the promised land. If we try, we find that although
> we do now start booting, we crash again a short way into the process.
>
> In a different manner though. Specifically, in early_ioremap_clear.
> Reverting the above commit /except/ for the changes to arch/x86/mm/ioremap.c
> gets everything working again.
>
> Well, except that we can't shutdown/reboot properly, but I've sent a patch
> for that in another email.
>
>
> I'm afraid I've no idea what needs to be done to get the change to work
> with xen, but I'm willing to try out any patches people come up with.
> Please cc me on any replies, as I'm not subscribed, thanks.

Hi,

Although I'm on vacation, I happened to download a recent copy of
x86.git and found that it crashes early. Here's a couple of patches to
apply; I don't know if they apply to current git, but I hope it helps.

J


Attachments:
x86-fix-early_ioremap.patch (981.00 B)
xen-unpin-init_pt.patch (799.00 B)

2008-02-13 12:13:51

by Jody Belka

Subject: Re: 2.6.25-rc1 xen pvops regression


On Wed, Feb 13, 2008 at 10:59:33PM +1100, Jeremy Fitzhardinge wrote:
> Hi,
>
> Although I'm on vacation, I happened to download a recent copy of
> x86.git and found that it crashes early. Here's a couple of patches to
> apply; I don't know if they apply to current git, but I hope it helps.

> Subject: x86/early_ioremap: don't assume we're using swapper_pg_dir
> Subject: xen: unpin initial Xen pagetable once we're finished with it

Applying both of those to current git resulted in booting working again :)
Just leaves the halt/reboot issue that I posted a patch for in another thread.

Tested-by: Jody Belka <[email protected]>


J
--
Jody Belka
knew (at) pimb (dot) org

2008-02-13 12:24:17

by Ingo Molnar

Subject: Re: 2.6.25-rc1 xen pvops regression


* Jeremy Fitzhardinge <[email protected]> wrote:

> Although I'm on vacation, I happened to download a recent copy of
> x86.git and found that it crashes early. Here's a couple of patches
> to apply; I don't know if they apply to current git, but I hope it
> helps.

thanks Jeremy, applied.

Ingo

2008-02-14 02:31:10

by Joel Becker

Subject: Re: 2.6.25-rc1 xen pvops regression

On Wed, Feb 13, 2008 at 10:59:33PM +1100, Jeremy Fitzhardinge wrote:
>> I thought I'd try out 2.6.25-rc1 as a xen 32-bit pae domU the other day.
>> Unfortunately, I didn't get very far very fast, as the domain just crashed
>> immediately upon booting, without any direct feedback (I did have messages
>> on the xen message buffer, which helped). This even with earlyprintk turned on.
>>
>> After a long, arduous journey, I managed to track this down to the following:
>>
>> ----------
>> commit 551889a6e2a24a9c06fd453ea03b57b7746ffdc0

I'm seeing the same problem, with no messages at all from xen
other than "domain crashed, restart disabled" in xend.log. I got a
different commit in my bisect, 0947b2f31ca1ea1211d3cde2dbd8fcec579ef395
(i386 boot: replace boot_io remap with enhanced bt_ioremap - enhance
bt_ioremap). I started from yesterday's
96b5a46e2a72dc1829370c87053e0cd558d58bc0 (WMI: initialize
wmi_blocks.list even if ACPI is disabled) and a known good
9b73e76f3cf63379dcf45fcd4f112f5812418d0a (Merge
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6).

> Although I'm on vacation, I happened to download a recent copy of
> x86.git and found that it crashes early. Here's a couple of patches to
> apply; I don't know if they apply to current git, but I hope it helps.

> Subject: x86/early_ioremap: don't assume we're using swapper_pg_dir
> Subject: xen: unpin initial Xen pagetable once we're finished with it

After my bisect was done, I re-pulled from Linus and discovered
these patches. Searching for these emails, they certainly sound like my
problem. But the kernel does not boot, commit
10270d4838bdc493781f5a1cf2e90e9c34c9142f (acpi: fix
acpi_os_read_pci_configuration() misuse of raw_pci_read()). Still no
output from Xen - pygrub selects the kernel, and then the domain just
dies back to the dom0 shell.
Attached are my latest .config and my bisect log.

Joel

--

"The real reason GNU ls is 8-bit-clean is so that they can
start using ISO-8859-1 option characters."
- Christopher Davis ([email protected])

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127


Attachments:
(No filename) (2.11 kB)
.config (49.73 kB)
bisect.log (2.48 kB)

2008-02-14 07:53:18

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

Joel Becker wrote:
> On Wed, Feb 13, 2008 at 10:59:33PM +1100, Jeremy Fitzhardinge wrote:
>
>>> I thought I'd try out 2.6.25-rc1 as a xen 32-bit pae domU the other day.
>>> Unfortunately, I didn't get very far very fast, as the domain just crashed
>>> immediately upon booting, without any direct feedback (I did have messages
>>> on the xen message buffer, which helped). This even with earlyprintk turned on.
>>>
>>> After a long, arduous journey, I managed to track this down to the following:
>>>
>>> ----------
>>> commit 551889a6e2a24a9c06fd453ea03b57b7746ffdc0
>>>
>
> I'm seeing the same problem, with no messages at all from xen
> other than "domain crashed, restart disabled" in xend.log. I got a
> different commit in my bisect, 0947b2f31ca1ea1211d3cde2dbd8fcec579ef395
> (i386 boot: replace boot_io remap with enhanced bt_ioremap - enhance
> bt_ioremap). I started from yesterday's
> 96b5a46e2a72dc1829370c87053e0cd558d58bc0 (WMI: initialize
> wmi_blocks.list even if ACPI is disabled) and a known good
> 9b73e76f3cf63379dcf45fcd4f112f5812418d0a (Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6).
>
>
>> Although I'm on vacation, I happened to download a recent copy of
>> x86.git and found that it crashes early. Here's a couple of patches to
>> apply; I don't know if they apply to current git, but I hope it helps.
>>
>
>
>> Subject: x86/early_ioremap: don't assume we're using swapper_pg_dir
>> Subject: xen: unpin initial Xen pagetable once we're finished with it
>>
>
> After my bisect was done, I re-pulled from Linus and discovered
> these patches. Searching for these emails, they certainly sound like my
> problem. But the kernel does not boot, commit
> 10270d4838bdc493781f5a1cf2e90e9c34c9142f (acpi: fix
> acpi_os_read_pci_configuration() misuse of raw_pci_read()). Still no
> output from Xen - pygrub selects the kernel, and then the domain just
> dies back to the dom0 shell.
> Attached are my latest .config and my bisect log.
>

Is the domain ending up in the crashed state? Do you get a register
dump with xm dmesg? That would be very useful in determining what went
wrong. You may need to compile Xen with debug=y in Config.mk.

J

2008-02-15 20:25:27

by Joel Becker

Subject: Re: 2.6.25-rc1 xen pvops regression

On Thu, Feb 14, 2008 at 06:50:52PM +1100, Jeremy Fitzhardinge wrote:
>> I'm seeing the same problem, with no messages at all from xen
>> other than "domain crashed, restart disabled" in xend.log. I got a
>> different commit in my bisect, 0947b2f31ca1ea1211d3cde2dbd8fcec579ef395
>> (i386 boot: replace boot_io remap with enhanced bt_ioremap - enhance
>> bt_ioremap). I started from yesterday's
>> 96b5a46e2a72dc1829370c87053e0cd558d58bc0 (WMI: initialize
>> wmi_blocks.list even if ACPI is disabled) and a known good
>> 9b73e76f3cf63379dcf45fcd4f112f5812418d0a (Merge
>> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6).
>
> Is the domain ending up in the crashed state? Do you get a register
> dump with xm dmesg? That would be very useful in determining what went
> wrong. You may need to compile Xen with debug=y in Config.mk.

I didn't know xm dmesg existed :-) Regarding debug=y, I'm using
a prepackaged dom0 set. Here's what I find in xm dmesg:

Joel

(XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
(XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
(XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
(XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
(XEN) mm.c:3331:d109 ptwr_emulate: could not get_page_from_l1e()
(XEN) Unhandled page fault in domain 109 on VCPU 0 (ec=0003)
(XEN) Pagetable walk from 00000000c01687f0:
(XEN) L4[0x000] = 00000003a2933027 00000000000006cc
(XEN) L3[0x003] = 000000039afea027 0000000000000005
(XEN) L2[0x000] = 000000039bfb7067 0000000000001048
(XEN) L1[0x168] = 00000003a2e97061 0000000000000168
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 109 (vcpu#0) crashed on cpu#2:
(XEN) ----[ Xen-3.1.3-rc3 x86_64 debug=n Not tainted ]----
(XEN) CPU: 2
(XEN) RIP: e019:[<00000000c04040bd>]
(XEN) RFLAGS: 0000000000000282 CONTEXT: guest
(XEN) rax: 00000000c01687f0 rbx: 00000000f52fe000 rcx: 00000000a2f0f063
(XEN) rdx: 0000000000000003 rsi: 0000000000000000 rdi: 00000000a2f0f063
(XEN) rbp: 0000000000000003 rsp: 00000000c06daefc r8: 0000000000000000
(XEN) r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000000
(XEN) r12: 0000000000000000 r13: 0000000000000000 r14: 0000000000000000
(XEN) r15: 0000000000000000 cr0: 000000008005003b cr4: 00000000000006f0
(XEN) cr3: 000000039afe2000 cr2: 00000000c01687f0
(XEN) ds: e021 es: e021 fs: e021 gs: e021 ss: e021 cs: e019
(XEN) Guest stack trace from esp=c06daefc:
(XEN) 00000003 c04040bd 0001e019 00010082 c01687f0 f52fe000 80000000 a2f0f063
(XEN) 000f0000 c01687f0 c0417ab6 a2f0f063 00000003 000000f0 f52fe000 00000010
(XEN) 000004ff 000f0000 000004ff c06eee29 00000063 80000000 000f0000 c06dafc8
(XEN) c06ca400 0069a000 00000000 c06f9d82 00000000 f57fd000 f57fd000 f57fd000
(XEN) c0005d60 c06dafc8 c06ca400 0069a000 00000000 c06e70d4 c06d7680 c1043000
(XEN) 007c9000 00000000 0069a000 c1043000 c06dafe4 00000000 00000000 c06e05fd
(XEN) c0602000 c06529f8 dfbffdff c0701940 c1043000 c06dafe4 00000000 00000000
(XEN) c06e5f54 00000000 178bc1f1 00000001 03020800 00020f12 f5800000 00000000
(XEN) c1043000 c06e05c3 c06e05cb c06e05d3 c040288f c040299a c0403270 c06e6141
(XEN) c0403652 c0403709 c0403831 c0403a56 c0403a6d c0403a86 c0403b67 c04042c9
(XEN) c0404492 c04045da c040467a c06e622f c06e6238 c04052c5 c04052f7 c0405325
(XEN) c0405410 c06e6272 c06e62ee c06e6341 c05f23f3 c05f25d0 c04060eb c04063ed
(XEN) c04064cf c040652b c0406719 c0406877 c0406fcc c0407046 c0407120 c04075cb
(XEN) c04075e9 c040762b c040765f c040772c c0407755 c0407804 c0408c32 c0408cd3
(XEN) c05fa9ad c05faa20 c05faabe c05fad01 c06e6843 c06e6853 c0409b23 c0409c6d
(XEN) c06e6921 c06e692a c040a4f1 c040b5a2 c040b5b0 c040b627 c040b635 c06e7ff4
(XEN) c06e8007 c040b8bf c040b8ee c06e8340 c06e8349 c040bcb0 c040bd26 c040c9a4
(XEN) c040cca2 c040ccaa c040cd11 c040cd47 c040cd50 c040d026 c040d02e c040d278
(XEN) c040d280 c040d297 c040d29f c040d2c4 c040ddad c040de24 c040de33 c040de67
(XEN) c040dea1 c040deaa c040deb3 c05f2a94 c05f2bfb c05f2c4a c05f2c65 c05f2ddc
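
The index values in the pagetable walk above can be cross-checked against the faulting address. This is a minimal sketch, assuming the usual 32-bit PAE split (2-bit L3 index, 9-bit L2 index, 9-bit L1 index, 12-bit page offset); the helper names are made up for illustration and are not kernel or Xen identifiers.

```c
#include <stdint.h>

/* Hypothetical helpers: split a 32-bit PAE virtual address into the
 * indexes used at each level of the page-table walk. */
static unsigned pae_l3_index(uint32_t va)    { return va >> 30; }           /* bits 31-30 */
static unsigned pae_l2_index(uint32_t va)    { return (va >> 21) & 0x1ff; } /* bits 29-21 */
static unsigned pae_l1_index(uint32_t va)    { return (va >> 12) & 0x1ff; } /* bits 20-12 */
static unsigned pae_page_offset(uint32_t va) { return va & 0xfff; }         /* bits 11-0  */
```

For 0xc01687f0 these give L3 index 0x3, L2 index 0x0 and L1 index 0x168, matching the L3[0x003], L2[0x000] and L1[0x168] lines in the walk.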

--

Life's Little Instruction Book #24

"Drink champagne for no reason at all."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2008-02-16 02:46:19

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

Joel Becker wrote:
> On Thu, Feb 14, 2008 at 06:50:52PM +1100, Jeremy Fitzhardinge wrote:
>
>>> I'm seeing the same problem, with no messages at all from xen
>>> other than "domain crashed, restart disabled" in xend.log. I got a
>>> different commit in my bisect, 0947b2f31ca1ea1211d3cde2dbd8fcec579ef395
>>> (i386 boot: replace boot_io remap with enhanced bt_ioremap - enhance
>>> bt_ioremap). I started from yesterday's
>>> 96b5a46e2a72dc1829370c87053e0cd558d58bc0 (WMI: initialize
>>> wmi_blocks.list even if ACPI is disabled) and a known good
>>> 9b73e76f3cf63379dcf45fcd4f112f5812418d0a (Merge
>>> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6).
>>>
>> Is the domain ending up in the crashed state? Do you get a register
>> dump with xm dmesg? That would be very useful in determining what went
>> wrong. You may need to compile Xen with debug=y in Config.mk.
>>
>
> I didn't know xm dmesg existed :-) Regarding debug=y, I'm using
> a prepackaged dom0 set. Here's what I find in xm dmesg:
>
> Joel
>
> (XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
> (XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
> (XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
> (XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
> (XEN) mm.c:3331:d109 ptwr_emulate: could not get_page_from_l1e()
>

Hm, I have a suspicion about what this might be. I haven't tried
reproducing it yet though.

> (XEN) Unhandled page fault in domain 109 on VCPU 0 (ec=0003)
> (XEN) Pagetable walk from 00000000c01687f0:
> (XEN) L4[0x000] = 00000003a2933027 00000000000006cc
> (XEN) L3[0x003] = 000000039afea027 0000000000000005
> (XEN) L2[0x000] = 000000039bfb7067 0000000000001048
> (XEN) L1[0x168] = 00000003a2e97061 0000000000000168
> (XEN) domain_crash_sync called from entry.S
> (XEN) Domain 109 (vcpu#0) crashed on cpu#2:
> (XEN) ----[ Xen-3.1.3-rc3 x86_64 debug=n Not tainted ]----
> (XEN) CPU: 2
> (XEN) RIP: e019:[<00000000c04040bd>]
>

What does this EIP correspond to in your kernel? Also:

c01687f0 c0417ab6 c040288f c040299a c0403270

(as guesses of potential callers to try and work out a stack trace).

Thanks,
J

2008-02-16 08:57:55

by Joel Becker

Subject: Re: 2.6.25-rc1 xen pvops regression

On Sat, Feb 16, 2008 at 01:44:26PM +1100, Jeremy Fitzhardinge wrote:
> Joel Becker wrote:
>
>> (XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
>> (XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
>> (XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
>> (XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
>> (XEN) mm.c:3331:d109 ptwr_emulate: could not get_page_from_l1e()
>>
>
> Hm, I have a suspicion about what this might be. I haven't tried
> reproducing it yet though.
>
>> (XEN) Unhandled page fault in domain 109 on VCPU 0 (ec=0003)
>> (XEN) Pagetable walk from 00000000c01687f0:
>> (XEN) L4[0x000] = 00000003a2933027 00000000000006cc
>> (XEN) L3[0x003] = 000000039afea027 0000000000000005
>> (XEN) L2[0x000] = 000000039bfb7067 0000000000001048
>> (XEN) L1[0x168] = 00000003a2e97061 0000000000000168
>> (XEN) domain_crash_sync called from entry.S
>> (XEN) Domain 109 (vcpu#0) crashed on cpu#2:
>> (XEN) ----[ Xen-3.1.3-rc3 x86_64 debug=n Not tainted ]----
>> (XEN) CPU: 2
>> (XEN) RIP: e019:[<00000000c04040bd>]
>>
>
> What does this EIP correspond to in your kernel? Also:
>
> c01687f0 c0417ab6 c040288f c040299a c0403270
>
> (as guesses of potential callers to try and work out a stack trace).

ksymoops is no help at all, but I got these from objdump of
vmlinux:

c04040bd xen_set_pte
c0417ab6 set_pte_present
c040288f set_bit
c040299a __raw_spin_unlock
c0403270 __set_64bit

Joel

--

Dort wo man Bücher verbrennt, verbrennt man am Ende auch Menschen.
(Wherever they burn books, they will also end up burning people.)
- Heinrich Heine

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2008-02-16 11:48:26

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

Joel Becker wrote:
> On Sat, Feb 16, 2008 at 01:44:26PM +1100, Jeremy Fitzhardinge wrote:
>
>> Joel Becker wrote:
>>
>>
>>> (XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
>>> (XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
>>> (XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
>>> (XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
>>> (XEN) mm.c:3331:d109 ptwr_emulate: could not get_page_from_l1e()
>>>
>>>
>> Hm, I have a suspicion about what this might be. I haven't tried
>> reproducing it yet though.
>>
>>
>>> (XEN) Unhandled page fault in domain 109 on VCPU 0 (ec=0003)
>>> (XEN) Pagetable walk from 00000000c01687f0:
>>> (XEN) L4[0x000] = 00000003a2933027 00000000000006cc
>>> (XEN) L3[0x003] = 000000039afea027 0000000000000005
>>> (XEN) L2[0x000] = 000000039bfb7067 0000000000001048
>>> (XEN) L1[0x168] = 00000003a2e97061 0000000000000168
>>> (XEN) domain_crash_sync called from entry.S
>>> (XEN) Domain 109 (vcpu#0) crashed on cpu#2:
>>> (XEN) ----[ Xen-3.1.3-rc3 x86_64 debug=n Not tainted ]----
>>> (XEN) CPU: 2
>>> (XEN) RIP: e019:[<00000000c04040bd>]
>>>
>>>
>> What does this EIP correspond to in your kernel? Also:
>>
>> c01687f0 c0417ab6 c040288f c040299a c0403270
>>
>> (as guesses of potential callers to try and work out a stack trace).
>>
>
> ksymoops is no help at all, but I got these from objdump of
> vmlinux:
>
> c04040bd xen_set_pte
> c0417ab6 set_pte_present
> c040288f set_bit
> c040299a __raw_spin_unlock
> c0403270 __set_64bit

(My usual technique is to use "gdb vmlinux" and "x/i 0x...." to do the
lookup.)

Unfortunately that doesn't narrow down what the kernel was actually
trying to do at the time. Clearly a set_pte; looks like someone is
trying to create a writable mapping of an existing pte page.
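
The L1 entry 00000003a2f0f063 from the Xen messages can be split into a frame number and flag bits to see why it looks like a writable mapping. This is a minimal sketch, assuming the standard x86 PAE page-table entry layout (bit 0 present, bit 1 writable, bit 5 accessed, bit 6 dirty, frame number starting at bit 12); the macro and function names are invented for illustration and are not kernel or Xen identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard x86 PTE flag bits (names here are hypothetical). */
#define SK_PTE_PRESENT   (1ULL << 0)
#define SK_PTE_RW        (1ULL << 1)
#define SK_PTE_ACCESSED  (1ULL << 5)
#define SK_PTE_DIRTY     (1ULL << 6)
/* Bits 12..51 hold the frame number in a 64-bit PAE entry. */
#define SK_PTE_FRAME_MASK 0x000ffffffffff000ULL

/* Frame number (an mfn, in a Xen PV guest) that the entry points at. */
static uint64_t pte_frame(uint64_t pte)
{
    return (pte & SK_PTE_FRAME_MASK) >> 12;
}

/* Low 12 bits of the entry: the flag bits. */
static uint64_t pte_flags(uint64_t pte)
{
    return pte & 0xfffULL;
}

/* True if the entry asks for a present, writable mapping. */
static bool pte_is_writable_mapping(uint64_t pte)
{
    return (pte & SK_PTE_PRESENT) && (pte & SK_PTE_RW);
}
```

With the value from the log, pte_frame() gives 0x3a2f0f, the mfn Xen complains about, and the flags 0x063 are present, writable, accessed and dirty. That fits the reading above: a request for a writable mapping of a frame that Xen has typed as something else. The L1 entry 00000003a2e97061 from the pagetable walk, by contrast, has the writable bit clear.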

Does "console=hvc0 earlyprintk=xen" on the kernel command line give any
clue about how far it gets before crashing?

J

2008-02-17 06:31:28

by Joel Becker

Subject: Re: 2.6.25-rc1 xen pvops regression

On Sat, Feb 16, 2008 at 10:46:23PM +1100, Jeremy Fitzhardinge wrote:
> Joel Becker wrote:
>> ksymoops is no help at all, but I got these from objdump of
>> vmlinux:
>>
>> c04040bd xen_set_pte
>> c0417ab6 set_pte_present
>> c040288f set_bit
>> c040299a __raw_spin_unlock
>> c0403270 __set_64bit
>
> (My usual technique is to use "gdb vmlinux" and "x/i 0x...." to do the
> lookup.)

Thanks for the tip!

> Unfortunately that doesn't narrow down what the kernel was actually
> trying to do at the time. Clearly a set_pte; looks like someone is
> trying to create a writable mapping of an existing pte page.
>
> Does "console=hvc0 earlyprintk=xen" on the kernel command line give any
> clue about how far it gets before crashing?

Console is already hvc0, but earlyprintk gets us:

--8<-----------------------------------------------------------------
Reserving virtual address space above 0xf57fe000
Linux version 2.6.25-rc2-bisectme ([email protected]) (gcc
version 4.1.2 20070626 (Red Hat 4.1.2-14)) #21 SMP Fri Feb 15 16:28:35
PST 2008
ACPI in unprivileged domain disabled
BIOS-provided physical RAM map:
Xen: 0000000000000000 - 0000000078000000 (usable)
console [xenboot0] enabled
1192MB HIGHMEM available.
727MB LOWMEM available.
Started domain ca-test58
Scan SMP from c0000000 for 1024 bytes.
Scan SMP from c009fc00 for 1024 bytes.
Scan SMP from c00f0000 for 65536 bytes.
NX (Execute Disable) protection: active
Zone PFN ranges:
DMA 0 -> 4096
Normal 4096 -> 186366
HighMem 186366 -> 491520
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
0: 0 -> 491520
-->8-----------------------------------------------------------------

That's it.

Joel

--

"The nearest approach to immortality on Earth is a government
bureau."
- James F. Byrnes

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2008-02-17 06:40:38

by Joel Becker

Subject: Re: 2.6.25-rc1 xen pvops regression

On Sat, Feb 16, 2008 at 10:46:23PM +1100, Jeremy Fitzhardinge wrote:
>>> What does this EIP correspond to in your kernel? Also:
>>>
>>> c01687f0 c0417ab6 c040288f c040299a c0403270
>>>
>>> (as guesses of potential callers to try and work out a stack trace).
>>>
>>
> (My usual technique is to use "gdb vmlinux" and "x/i 0x...." to do the
> lookup.)

Ok, my objdump didn't do a good job. With gdb I get:

0xc04040bd <xen_set_pte_at+185>: mov %edi,(%eax)
0xc01687f0: Cannot access memory at address 0xc01687f0
0xc0417ab6 <__set_fixmap+326>: pop %ecx
0xc040288f <xen_alloc_ptpage+41>: lock btsl $0x8,(%eax)
0xc040299a <xen_load_idt+74>: lock incb 0xc0766148
0xc0403270 <xen_set_pte_atomic+26>: lock cmpxchg8b (%edi)

Joel

--

Life's Little Instruction Book #139

"Never deprive someone of hope; it might be all they have."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2008-02-17 12:11:27

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

Joel Becker wrote:
>
>> Unfortunately that doesn't narrow down what the kernel was actually
>> trying to do at the time. Clearly a set_pte; looks like someone is
>> trying to create a writable mapping of an existing pte page.
>>
>> Does "console=hvc0 earlyprintk=xen" on the kernel command line give any
>> clue about how far it gets before crashing?
>>
>
>

I built a kernel using your .config here, but I can't reproduce the
problem. It makes it all the way to trying to start init (failed at
that point because I didn't create an initrd with the xvd module to
mount /).

> Console is already hvc0, but earlyprintk gets us:
>
> --8<-----------------------------------------------------------------
> Reserving virtual address space above 0xf57fe000
> Linux version 2.6.25-rc2-bisectme ([email protected]) (gcc
> version 4.1.2 20070626 (Red Hat 4.1.2-14)) #21 SMP Fri Feb 15 16:28:35
> PST 2008
> ACPI in unprivileged domain disabled
> BIOS-provided physical RAM map:
> Xen: 0000000000000000 - 0000000078000000 (usable)
> console [xenboot0] enabled
> 1192MB HIGHMEM available.
> 727MB LOWMEM available.
> Started domain ca-test58
> Scan SMP from c0000000 for 1024 bytes.
> Scan SMP from c009fc00 for 1024 bytes.
> Scan SMP from c00f0000 for 65536 bytes.
> NX (Execute Disable) protection: active
> Zone PFN ranges:
> DMA 0 -> 4096
> Normal 4096 -> 186366
> HighMem 186366 -> 491520
> Movable zone start PFN for each node
> early_node_map[1] active PFN ranges
> 0: 0 -> 491520
> -->8-----------------------------------------------------------------
>
> That's it.
>

I get:

Entering add_active_range(0, 0, 16384) 0 entries of 256 used
Zone PFN ranges:
DMA 0 -> 4096
Normal 4096 -> 16384
HighMem 16384 -> 16384
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
0: 0 -> 16384
On node 0 totalpages: 16384
DMA zone: 32 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 4064 pages, LIFO batch:0
Normal zone: 96 pages used for memmap
Normal zone: 12192 pages, LIFO batch:1
HighMem zone: 0 pages used for memmap
Movable zone: 0 pages used for memmap
...


What happens if you give the domain less memory?
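
For comparing the two boot logs, the zone PFN ranges convert to sizes as in the sketch below. It assumes 4 KiB pages; the function name is hypothetical and only serves the arithmetic.

```c
#include <stdint.h>

#define SK_PAGE_SHIFT 12 /* 4 KiB pages (assumption) */

/* Size in whole MiB of the PFN range [start_pfn, end_pfn). */
static uint64_t pfn_range_to_mib(uint64_t start_pfn, uint64_t end_pfn)
{
    return ((end_pfn - start_pfn) << SK_PAGE_SHIFT) >> 20;
}
```

The failing domain's ranges 0 -> 186366 and 186366 -> 491520 come out as 727 and 1192 MiB, matching the LOWMEM and HIGHMEM lines in its log, for 1920 MiB in total. The working domain's 0 -> 16384 is 64 MiB with an empty HighMem zone, so the two setups differ considerably in memory size.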

J

2008-02-17 18:49:44

by Ian Campbell

Subject: Re: 2.6.25-rc1 xen pvops regression


On Fri, 2008-02-15 at 12:23 -0800, Joel Becker wrote:
> On Thu, Feb 14, 2008 at 06:50:52PM +1100, Jeremy Fitzhardinge wrote:
> >> I'm seeing the same problem, with no messages at all from xen
> >> other than "domain crashed, restart disabled" in xend.log. I got a
> >> different commit in my bisect, 0947b2f31ca1ea1211d3cde2dbd8fcec579ef395
> >> (i386 boot: replace boot_io remap with enhanced bt_ioremap - enhance
> >> bt_ioremap). I started from yesterday's
> >> 96b5a46e2a72dc1829370c87053e0cd558d58bc0 (WMI: initialize
> >> wmi_blocks.list even if ACPI is disabled) and a known good
> >> 9b73e76f3cf63379dcf45fcd4f112f5812418d0a (Merge
> >> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6).
> >
> > Is the domain ending up in the crashed state? Do you get a register
> > dump with xm dmesg? That would be very useful in determining what went
> > wrong. You may need to compile Xen with debug=y in Config.mk.
>
> I didn't know xm dmesg existed :-) Regarding debug=y, I'm using
> a prepackaged dom0 set. Here's what I find in xm dmesg:
>
> Joel
>
> (XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
> (XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
> (XEN) mm.c:1825:d109 Bad type (saw 0000000028000001 != exp 00000000e0000000) for mfn 3a2f0f (pfn f0)
> (XEN) mm.c:649:d109 Error getting mfn 3a2f0f (pfn f0) from L1 entry 00000003a2f0f063 for dom109
> (XEN) mm.c:3331:d109 ptwr_emulate: could not get_page_from_l1e()

I've been seeing similar attempts to map 0xf0 but so far I was the only
one (although that made no sense to me). Does the patch below help at
all? The problem seems to be that the kernel is trying to map pages at
0xf0000 but these are not reserved in the guest E820 map so they could
contain a page table page etc.
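
The idea of testing whether an address range is entirely reserved in an E820-style map can be sketched in user-space C as follows. This is an illustration only, not the kernel's implementation: the struct, enum and function names are hypothetical, and the map is assumed to be sorted by address with non-overlapping entries.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region types, mirroring the usual "usable" and
 * "reserved" E820 categories. */
enum region_type { REGION_RAM = 1, REGION_RESERVED = 2 };

struct region {
    uint64_t addr;          /* start of the region */
    uint64_t size;          /* length in bytes */
    enum region_type type;
};

/* Returns true if every byte of [start, end) lies in entries of the
 * given type. Assumes map[] is sorted by addr and non-overlapping. */
static bool range_all_of_type(const struct region *map, size_t n,
                              uint64_t start, uint64_t end,
                              enum region_type type)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t r_end = map[i].addr + map[i].size;

        if (map[i].type != type)
            continue;
        /* Skip entries that do not overlap the remaining range. */
        if (map[i].addr >= end || r_end <= start)
            continue;
        /* An entry covering the current start lets us advance past it. */
        if (map[i].addr <= start)
            start = r_end;
        if (start >= end)
            return true;
    }
    return false; /* some part of the range was not covered */
}
```

With a single usable entry from 0 to 0x78000000, as in the guest RAM map earlier in the thread, a query for 0xF0000 to 0x100000 as reserved returns false, which is the case where mapping that area as firmware data is unsafe.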

A useful tip for getting a backtrace out of a crashed Xen guest is to
set "on_crash=preserve" in your domain config. Then once the crash has
happened you can use "/usr/lib/xen/bin/xenctx -s System.map <domid>"
where System.map is the guest kernel System.map.

Ian.

---

x86/xen: Do not scan for DMI unless the DMI region is reserved by e820.

Signed-off-by: Ian Campbell <[email protected]>
diff --git a/drivers/firmware/dmi_scan.c b/drivers/firmware/dmi_scan.c
index 9008ed5..79525f5 100644
--- a/drivers/firmware/dmi_scan.c
+++ b/drivers/firmware/dmi_scan.c
@@ -7,6 +7,7 @@
#include <linux/bootmem.h>
#include <linux/slab.h>
#include <asm/dmi.h>
+#include <asm/e820.h>

static char dmi_empty_string[] = " ";

@@ -336,6 +337,10 @@ void __init dmi_scan_machine(void)
}
}
else {
+
+ if (!e820_all_mapped(0xF0000, 0xF0000+0x10000, E820_RESERVED))
+ goto out;
+
/*
* no iounmap() for that ioremap(); it would be a no-op, but
* it's so early in setup that sucker gets confused into doing

--
Ian Campbell

The profession of book writing makes horse racing seem like a solid,
stable business.
-- John Steinbeck
[Horse racing *is* a stable business ...]


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2008-02-18 10:42:33

by Joel Becker

Subject: Re: 2.6.25-rc1 xen pvops regression

On Sun, Feb 17, 2008 at 06:49:21PM +0000, Ian Campbell wrote:
> I've been seeing similar attempts to map 0xf0 but so far I was the only
> one (although that made no sense to me). Does the patch below help at
> all? The problem seems to be that the kernel is trying to map pages at
> 0xf0000 but these are not reserved in the guest E820 map so they could
> contain a page table page etc.
>
> A useful tip for getting a backtrace out of a crashed Xen guest is to
> set "on_crash=preserve" in your domain config. Then once the crash has
> happened you can use "/usr/lib/xen/bin/xenctx -s System.map <domid>"
> where System.map is the guest kernel System.map.

That didn't work for me - it gave me "can't trace dom0" for
whatever reason. But...

> x86/xen: Do not scan for DMI unless the DMI region is reserved by e820.
>
> Signed-off-by: Ian Campbell <[email protected]>
> diff --git a/drivers/firmware/dmi_scan.c b/drivers/firmware/dmi_scan.c
> index 9008ed5..79525f5 100644
> --- a/drivers/firmware/dmi_scan.c
> +++ b/drivers/firmware/dmi_scan.c
> @@ -7,6 +7,7 @@
> #include <linux/bootmem.h>
> #include <linux/slab.h>
> #include <asm/dmi.h>
> +#include <asm/e820.h>
>
> static char dmi_empty_string[] = " ";
>
> @@ -336,6 +337,10 @@ void __init dmi_scan_machine(void)
> }
> }
> else {
> +
> + if (!e820_all_mapped(0xF0000, 0xF0000+0x10000, E820_RESERVED))
> + goto out;
> +
> /*
> * no iounmap() for that ioremap(); it would be a no-op, but
> * it's so early in setup that sucker gets confused into doing

This fixed it. I'm now booting successfully. Thank you! If
you have any further patches you'd like to test, I'd be happy to do so.

Joel

--

"And yet I find,
And yet I find repeating in my head.
If I can't be my own,
I'd feel better dead."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2008-02-19 21:51:35

by Ian Campbell

Subject: Re: 2.6.25-rc1 xen pvops regression


On Mon, 2008-02-18 at 02:40 -0800, Joel Becker wrote:
> On Sun, Feb 17, 2008 at 06:49:21PM +0000, Ian Campbell wrote:
> > I've been seeing similar attempts to map 0xf0 but so far I was the only
> > one (although that made no sense to me). Does the patch below help at
> > all? The problem seems to be that the kernel is trying to map pages at
> > 0xf0000 but these are not reserved in the guest E820 map so they could
> > contain a page table page etc.
> >
> > A useful tip for getting a backtrace out of a crashed Xen guest is to
> > set "on_crash=preserve" in your domain config. Then once the crash has
> > happened you can use "/usr/lib/xen/bin/xenctx -s System.map <domid>"
> > where System.map is the guest kernel System.map.
>
> That didn't work for me - it gave me "can't trace dom0" for
> whatever reason. But...

Strange. You gave it the domid of the guest not dom0 I assume. Might be
an older buggy version but it works ok in newer versions:

# /usr/lib/xen/bin/xenctx -s /boot/System.map-2.6.18.8-x86_32p-xenU 2
eip: c01013a7 hypercall_page+0x3a7 flags: 00001246 i z p
esp: c0361f64
eax: 00000000 ebx: deadbeef ecx: deadbeef edx: 00000001
esi: 00000000 edi: 00000000 ebp: c0361f84
cs: 00000061 ds: 0000007b fs: 00000000 gs: 00000000

Stack:
c01089eb 02d8e984 5cec4d7d 0001f3ea 00000000 000008f1 ffffffff 00000000
c0361f8c c010436c c0361fa8 c0103263 c03880c0 c03880a0 000008f1 c038a464
c184f3c4 c0361fbc c0102415 c0102060 00000000 00000a00 c0361ff8 c036687f
c02ed19c c038e120 c031f000 00000022 c03662a0 c184d000 000023c4 00000000

Code:
cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 1d 00 00 00 cd 82 <c3> cc cc cc cc cc cc cc cc cc cc

Call Trace:
[<c01013a7>] hypercall_page+0x3a7 <--
[<c01089eb>] raw_safe_halt+0x9b
[<c010436c>] xen_idle+0x2c
[<c0103263>] cpu_idle+0x83
[<c0102415>] rest_init+0x35
[<c0102060>] _stext+0x60
[<c036687f>] start_kernel+0x36f
[<c03662a0>] parse_early_param+0x70

Ian.

--
Ian Campbell

94% of the women in America are beautiful and the rest hang out around
here.

2008-02-19 21:59:59

by Ian Campbell

Subject: Re: 2.6.25-rc1 xen pvops regression


On Mon, 2008-02-18 at 02:40 -0800, Joel Becker wrote:
> On Sun, Feb 17, 2008 at 06:49:21PM +0000, Ian Campbell wrote:

> > x86/xen: Do not scan for DMI unless the DMI region is reserved by e820.

> This fixed it. I'm now booting successfully. Thank you!

Excellent. Jeremy, are you happy for this to go in?

From 23e4ec12b95064320f83fca1cc1ad5c7b2eb3386 Mon Sep 17 00:00:00 2001
From: Ian Campbell <[email protected]>
Date: Tue, 19 Feb 2008 21:57:45 +0000
Subject: [PATCH] x86/xen: Do not scan for DMI unless the DMI region is reserved by e820.

Under Xen the memory at 0xf0000 is regular RAM and so can potentially contain a
page table and hence cannot be mapped. The e820 map given to the guest reflects
this.

Signed-off-by: Ian Campbell <[email protected]>
---
drivers/firmware/dmi_scan.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/firmware/dmi_scan.c b/drivers/firmware/dmi_scan.c
index 653265a..7d29403 100644
--- a/drivers/firmware/dmi_scan.c
+++ b/drivers/firmware/dmi_scan.c
@@ -7,6 +7,7 @@
#include <linux/bootmem.h>
#include <linux/slab.h>
#include <asm/dmi.h>
+#include <asm/e820.h>

static char dmi_empty_string[] = " ";

@@ -371,6 +372,9 @@ void __init dmi_scan_machine(void)
}
}
else {
+ if (!e820_all_mapped(0xF0000, 0xF0000+0x10000, E820_RESERVED))
+ goto out;
+
/*
* no iounmap() for that ioremap(); it would be a no-op, but
* it's so early in setup that sucker gets confused into doing
--
1.5.4.2

--
Ian Campbell

After the game the king and the pawn go in the same box.
-- Italian proverb

2008-02-20 07:48:27

by H. Peter Anvin

Subject: Re: 2.6.25-rc1 xen pvops regression

Ian Campbell wrote:
> On Mon, 2008-02-18 at 02:40 -0800, Joel Becker wrote:
>> On Sun, Feb 17, 2008 at 06:49:21PM +0000, Ian Campbell wrote:
>
>>> x86/xen: Do not scan for DMI unless the DMI region is reserved by e820.
>
>> This fixed it. I'm now booting successfully. Thank you!
>
> Excellent. Jeremy, are you happy for this to go in?
>

NAK!

It's pretty standard for 0xf0000...0x100000 to be marked RESERVED in
E820 on real hardware (including the system I'm typing on right now.)
It is so marked to indicate that hardware cannot be mapped into that
space. However, you can't rely on this fact -- heck, you can't rely on
E820 even existing on a real machine. I have specimens of real-life
machines that go both ways.

This patch WILL break real hardware.

What's particularly damning is that it's titled "x86/xen: Do not scan
for DMI unless the DMI region is reserved by e820." whereas in fact it
changes (breaks) generic code.
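
To make the objection concrete, here is a minimal userspace sketch of the check the patch adds. It is not the kernel code: the struct, the function names (all_mapped, dmi_scan_allowed) and all three memory maps are hypothetical and made up for illustration, and the coverage test is only an approximation of what a check like e820_all_mapped() has to do.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for the kernel's e820 types. */
enum { E820_RAM = 1, E820_RESERVED = 2 };

struct e820_entry {
    unsigned long long addr;
    unsigned long long size;
    unsigned type;
};

#define N_ENTRIES(a) ((int)(sizeof(a) / sizeof((a)[0])))

/* Illustrative map: a flat Xen PV guest map, everything is plain RAM. */
const struct e820_entry xen_flat_map[] = {
    { 0x0, 0x40000000ULL, E820_RAM },
};

/* Illustrative map: a BIOS that marks 0xf0000..0x100000 as reserved. */
const struct e820_entry bios_reserving_map[] = {
    { 0x0,      0x9FC00,       E820_RAM },
    { 0x9FC00,  0x400,         E820_RESERVED },
    { 0xF0000,  0x10000,       E820_RESERVED },
    { 0x100000, 0x3FF00000ULL, E820_RAM },
};

/* Illustrative map: a BIOS that simply does not list 0xf0000..0x100000. */
const struct e820_entry bios_unlisted_map[] = {
    { 0x0,      0x9FC00,       E820_RAM },
    { 0x100000, 0x3FF00000ULL, E820_RAM },
};

/*
 * Return 1 if [start, end) is completely covered by entries of the given
 * type. Assumes the entries are sorted by address.
 */
int all_mapped(const struct e820_entry *map, int n,
               unsigned long long start, unsigned long long end,
               unsigned type)
{
    assert(start < end);
    for (int i = 0; i < n; i++) {
        const struct e820_entry *e = &map[i];

        if (e->type != type)
            continue;
        /* Skip entries that do not overlap the remaining range. */
        if (e->addr >= end || e->addr + e->size <= start)
            continue;
        /* Entry covers the front of the range: advance past it. */
        if (e->addr <= start)
            start = e->addr + e->size;
        if (start >= end)
            return 1;
    }
    return 0;
}

/* Same condition as in the patch: 1 means the DMI scan would run. */
int dmi_scan_allowed(const struct e820_entry *map, int n)
{
    return all_mapped(map, n, 0xF0000, 0xF0000 + 0x10000, E820_RESERVED);
}
```

With these made-up maps the scan is skipped for the flat Xen map (the intended effect) and still runs on the machine that reserves the region, but it is also skipped on the machine that leaves the region unlisted, which is the breakage on real hardware.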

-hpa

>>From 23e4ec12b95064320f83fca1cc1ad5c7b2eb3386 Mon Sep 17 00:00:00 2001
> From: Ian Campbell <[email protected]>
> Date: Tue, 19 Feb 2008 21:57:45 +0000
> Subject: [PATCH] x86/xen: Do not scan for DMI unless the DMI region is reserved by e820.
>
> Under Xen the memory at 0xf0000 is regular RAM and so can potentially contain a
> page table and hence cannot be mapped. The e820 map given to guest reflects
> this.
>
> Signed-off-by: Ian Campbell <[email protected]>
> ---
> drivers/firmware/dmi_scan.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/firmware/dmi_scan.c b/drivers/firmware/dmi_scan.c
> index 653265a..7d29403 100644
> --- a/drivers/firmware/dmi_scan.c
> +++ b/drivers/firmware/dmi_scan.c
> @@ -7,6 +7,7 @@
> #include <linux/bootmem.h>
> #include <linux/slab.h>
> #include <asm/dmi.h>
> +#include <asm/e820.h>
>
> static char dmi_empty_string[] = " ";
>
> @@ -371,6 +372,9 @@ void __init dmi_scan_machine(void)
> }
> }
> else {
> + if (!e820_all_mapped(0xF0000, 0xF0000+0x10000, E820_RESERVED))
> + goto out;
> +
> /*
> * no iounmap() for that ioremap(); it would be a no-op, but
> * it's so early in setup that sucker gets confused into doing

2008-02-20 08:52:21

by Ian Campbell

Subject: Re: 2.6.25-rc1 xen pvops regression


On Tue, 2008-02-19 at 23:43 -0800, H. Peter Anvin wrote:
> Ian Campbell wrote:
> > On Mon, 2008-02-18 at 02:40 -0800, Joel Becker wrote:
> >> On Sun, Feb 17, 2008 at 06:49:21PM +0000, Ian Campbell wrote:
> >
> >>> x86/xen: Do not scan for DMI unless the DMI region is reserved by e820.
> >
> >> This fixed it. I'm now booting successfully. Thank you!
> >
> > Excellent. Jeremy, are you happy for this to go in?
> >
>
> NAK!
>
> It's pretty standard for 0xf0000...0x100000 to be marked RESERVED in
> E820 on real hardware (including the system I'm typing on right now.)
> It is so marked to indicate that hardware cannot be mapped into that
> space. However, you can't rely on this fact -- heck, you can't rely on
> E820 even existing on a real machine. I have specimens of real-life
> machines that go both ways.
>
> This patch WILL break real hardware.
>
> What's particularly damning is that it's titled "x86/xen: Do not scan
> for DMI unless the DMI region is reserved by e820." whereas in fact it
> changes (breaks) generic code.

Sorry, I was trying to indicate that it was a generic change which was
motivated by Xen support, but you're right it did look like I was saying
it was a Xen only change.

As far as the actual change goes I was assuming that any machine that
has DMI/SMBIOS would easily be new enough to have an E820 which could be
expected to reserve this region. Looks like I was mistaken about how
long E820 had been around and/or how reliably it is used to reserve the
tables.

Anyway, will have to think of another solution.

Ian.

--
Ian Campbell

Just because the message may never be received does not mean it is
not worth sending.

2008-02-20 21:47:29

by Joel Becker

Subject: Re: 2.6.25-rc1 xen pvops regression

On Wed, Feb 20, 2008 at 08:51:50AM +0000, Ian Campbell wrote:
>
> On Tue, 2008-02-19 at 23:43 -0800, H. Peter Anvin wrote:
> > NAK!
>
> As far as the actual change goes I was assuming that any machine that
> has DMI/SMBIOS would easily be new enough to have an E820 which could be
> expected to reserve this region. Looks like I was mistaken about how
> long E820 had been around and/or how reliably it is used to reserve the
> tables.
>
> Anyway, will have to think of another solution.

What changed to make this not work in the first place? New dmi
code?

Joel

--

"The real reason GNU ls is 8-bit-clean is so that they can
start using ISO-8859-1 option characters."
- Christopher Davis ([email protected])

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2008-02-20 22:00:47

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

Ian Campbell wrote:
> On Tue, 2008-02-19 at 23:43 -0800, H. Peter Anvin wrote:
>
>> Ian Campbell wrote:
>>
>>> On Mon, 2008-02-18 at 02:40 -0800, Joel Becker wrote:
>>>
>>>> On Sun, Feb 17, 2008 at 06:49:21PM +0000, Ian Campbell wrote:
>>>>
>>>>> x86/xen: Do not scan for DMI unless the DMI region is reserved by e820.
>>>>>
>>>> This fixed it. I'm now booting successfully. Thank you!
>>>>
>>> Excellent. Jeremy, are you happy for this to go in?
>>>

I had no problem with it, but Peter's objection seems substantial enough.

> As far as the actual change goes I was assuming that any machine that
> has DMI/SMBIOS would easily be new enough to have an E820 which could be
> expected to reserve this region. Looks like I was mistaken about how
> long E820 had been around and/or how reliably it is used to reserve the
> tables.
>
> Anyway, will have to think of another solution.
>

Well, the way we've handled this kind of thing elsewhere is to just
reserve that pseudophys address space in earlish Xen init code and fill
it with not-DMI things (zero, I guess). It's a bit of a waste of
memory, but maybe we can recover it once DMI has given up and gone
away. This also makes it easy to insert faked-up DMI info if that turns
out to be useful.


J

2008-02-20 22:30:33

by Ian Campbell

Subject: Re: 2.6.25-rc1 xen pvops regression


On Wed, 2008-02-20 at 13:58 -0800, Jeremy Fitzhardinge wrote:
> Ian Campbell wrote:
> > On Tue, 2008-02-19 at 23:43 -0800, H. Peter Anvin wrote:
> >
> >> Ian Campbell wrote:
> >>
> >>> On Mon, 2008-02-18 at 02:40 -0800, Joel Becker wrote:
> >>>
> >>>> On Sun, Feb 17, 2008 at 06:49:21PM +0000, Ian Campbell wrote:
> >>>>
> >>>>> x86/xen: Do not scan for DMI unless the DMI region is reserved by e820.
> >>>>>
> >>>> This fixed it. I'm now booting successfully. Thank you!
> >>>>
> >>> Excellent. Jeremy, are you happy for this to go in?
> >>>
>
> I had no problem with it, but Peter's objection seems substantial enough.

Definitely.

> > As far as the actual change goes I was assuming that any machine that
> > has DMI/SMBIOS would easily be new enough to have an E820 which could be
> > expected to reserve this region. Looks like I was mistaken about how
> > long E820 had been around and/or how reliably it is used to reserve the
> > tables.
> >
> > Anyway, will have to think of another solution.
> >
>
> Well, the way we've handled this kind of thing elsewhere is to just
> reserve that pseudophys address space in earlish Xen init code and fill
> it with not-DMI things (zero, I guess). It's a bit of a waste of
> memory, but maybe we can recover it once DMI has given up and gone
> away. This also makes it easy to insert faked-up DMI info if that turns
> out to be useful.

I'll see if I can track down where the page is getting used and have a
go at getting in there first. It must be pretty early to be allocated
already when dmi_scan_machine gets called.

It's possible that the domain builder might have already allocated a PT
at this address. I haven't checked but I think currently the domain
builder always puts PT pages after the kernel so hopefully it's only a
theoretical problem.

Another option I was thinking of was a command line option to disable
DMI, which (maybe) isn't terribly useful in itself but it introduces an
associated variable to frob with. That's similar to how the TSC was
handled in the past (well, the opposite since TSC was forced on).

Ian.
--
Ian Campbell

Universe, n.:
The problem.

2008-02-20 22:30:53

by Ian Campbell

Subject: Re: 2.6.25-rc1 xen pvops regression


On Wed, 2008-02-20 at 13:42 -0800, Joel Becker wrote:
> What changed to make this not work in the first place? New dmi
> code?

I don't think so -- this code is present even in the 2.6.18-xen.hg tree
(where it's gated with is_initial_domain() which isn't suitable for the
upstream tree).

I don't really know why it should start happening now. It may be that
some of the recentish changes to e.g. early_ioremap or early pagetable
setup etc. have tweaked the layout of the early pagetables around so
this problem bites us more readily. Or it may just be a combination of
kernel image size, RAM size and perhaps a dose of bad luck.

Ian.
--
Ian Campbell

... at least I thought I was dancing, 'til somebody stepped on my hand.
-- J. B. White

2008-02-21 21:18:27

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

Ian Campbell wrote:
> I'll see if I can track down where the page is getting used and have a
> go at getting in there first. It must be pretty early to be allocated
> already when dmi_scan_machine gets called.
>
> It's possible that the domain builder might have already allocated a PT
> at this address. I haven't checked but I think currently the domain
> builder always puts PT pages after the kernel so hopefully it's only a
> theoretical problem.
>

Yes, it does. And presumably the early pagetable builder is guaranteed
to avoid special memory like the DMI space. But the bug definitely
seems to be a result of the DMI code trying to make a RW mapping of a
pagetable page, so something is amiss there.

Ooh, sleazy hack idea: make DMI always map RO, so even if it does get a
pagetable it causes no complaint... A bit awkward, since there doesn't
seem to be an RO form of early_ioremap.

> Another option I was thinking of was a command line option to disable
> DMI, which (maybe) isn't terribly useful in itself but it introduces an
> associated variable to frob with. That's similar to how the TSC was
> handled in the past (well, the opposite since TSC was forced on).
>

Yep, that would work too.

Still curious about why a pagetable page is ending up in that range
though. Seems like it shouldn't be possible, since we shouldn't be
allowed to allocate from those pages, at least until the DMI probe has
happened... Unless the early allocator is only excluded from e820
reserved pages, which would cause a problem on systems which don't
reserve the DMI space... HPA?

J

2008-02-21 21:27:31

by H. Peter Anvin

Subject: Re: 2.6.25-rc1 xen pvops regression

Jeremy Fitzhardinge wrote:
> Ian Campbell wrote:
>> I'll see if I can track down where the page is getting used and have a
>> go at getting in there first. It must be pretty early to be allocated
>> already when dmi_scan_machine gets called.
>>
>> It's possible that the domain builder might have already allocated a PT
>> at this address. I haven't checked but I think currently the domain
>> builder always puts PT pages after the kernel so hopefully it's only a
>> theoretical problem.
>>
>
> Yes, it does. And presumably the early pagetable builder is guaranteed
> to avoid special memory like the DMI space. But the bug definitely
> seems to be a result of the DMI code trying to make a RW mapping of a
> pagetable page, so something is amiss there.
>
> Ooh, sleazy hack idea: make DMI always map RO, so even if it does get a
> pagetable it causes no complaint... A bit awkward, since there doesn't
> seem to be an RO form of early_ioremap.
>
>> Another option I was thinking of was a command line option to disable
>> DMI, which (maybe) isn't terribly useful in itself but it introduces an
>> associated variable to frob with. That's similar to how the TSC was
>> handled in the past (well, the opposite since TSC was forced on).
>>
>
> Yep, that would work too.
>
> Still curious about why a pagetable page is ending up in that range
> though. Seems like it shouldn't be possible, since we shouldn't be
> allowed to allocate from those pages, at least until the DMI probe has
> happened... Unless the early allocator is only excluded from e820
> reserved pages, which would cause a problem on systems which don't
> reserve the DMI space... HPA?
>

I thought the problem was a Xen-provided pagetable from before Linux
started?

-hpa

2008-02-21 21:40:27

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

H. Peter Anvin wrote:
>> Still curious about why a pagetable page is ending up in that range
>> though. Seems like it shouldn't be possible, since we shouldn't be
>> allowed to allocate from those pages, at least until the DMI probe
>> has happened... Unless the early allocator is only excluded from
>> e820 reserved pages, which would cause a problem on systems which
>> don't reserve the DMI space... HPA?
>>
>
> I thought the problem was a Xen-provided pagetable from before Linux
> started?

Hm, I don't think so. The domain-builder pagetable is put after the
kernel, so it shouldn't be under 1M.

J

2008-02-21 22:01:47

by H. Peter Anvin

Subject: Re: 2.6.25-rc1 xen pvops regression

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>>> Still curious about why a pagetable page is ending up in that range
>>> though. Seems like it shouldn't be possible, since we shouldn't be
>>> allowed to allocate from those pages, at least until the DMI probe
>>> has happened... Unless the early allocator is only excluded from
>>> e820 reserved pages, which would cause a problem on systems which
>>> don't reserve the DMI space... HPA?
>>>
>>
>> I thought the problem was a Xen-provided pagetable from before Linux
>> started?
>
> Hm, I don't think so. The domain-builder pagetable is put after the
> kernel, so it shouldn't be under 1M.
>

That's weird, then. If we put a page table at 0xf0000, it would crash
hard on real hardware (there is ROM there); as I mentioned, I do have
hardware which both reserves and doesn't reserve this region.

-hpa

2008-02-21 22:13:33

by Ian Campbell

Subject: Re: 2.6.25-rc1 xen pvops regression


On Thu, 2008-02-21 at 13:37 -0800, Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
> >> Still curious about why a pagetable page is ending up in that range
> >> though. Seems like it shouldn't be possible, since we shouldn't be
> >> allowed to allocate from those pages, at least until the DMI probe
> >> has happened... Unless the early allocator is only excluded from
> >> e820 reserved pages, which would cause a problem on systems which
> >> don't reserve the DMI space... HPA?
> >>
> >
> > I thought the problem was a Xen-provided pagetable from before Linux
> > started?
>
> Hm, I don't think so. The domain-builder pagetable is put after the
> kernel, so it shouldn't be under 1M.

I can confirm that it is Linux which is allocating it. The call path:
# xm create -c debian-x86_32p-1
Using config file "/etc/xen/debian-x86_32p-1".
Started domain debian-1
xen_alloc_pt_init PFN f0
Pid: 0, comm: swapper Not tainted 2.6.25-rc2 #68
[<c02ecb6b>] xen_alloc_pt_init+0x4b/0x60
[<c02f5e2b>] one_page_table_init+0x8b/0xf0
[<c02f63df>] paging_init+0x3bf/0x520
[<c02ee444>] setup_arch+0x2a4/0x410
[<c02e9a64>] start_kernel+0x64/0x380
[<c02efd7f>] cpu_detect+0x6f/0xf0
[<c02ed1a1>] xen_start_kernel+0x2f1/0x310
=======================
Entering add_active_range(0, 0, 262144) 0 entries of 256 used
Zone PFN ranges:
DMA 0 -> 4096

Ian.
>
> J
>
--
Ian Campbell

Why do they call a fast a fast, when it goes so slow?

2008-02-21 22:35:12

by Eric W. Biederman

Subject: Re: 2.6.25-rc1 xen pvops regression

"H. Peter Anvin" <[email protected]> writes:

> Jeremy Fitzhardinge wrote:
>> Ian Campbell wrote:
>>> I'll see if I can track down where the page is getting used and have a
>>> go at getting in there first. It must be pretty early to be allocated
>>> already when dmi_scan_machine gets called.
>>>
>>> It's possible that the domain builder might have already allocated a PT
>>> at this address. I haven't checked but I think currently the domain
>>> builder always puts PT pages after the kernel so hopefully it's only a
>>> theoretical problem.
>>>
>>
>> Yes, it does. And presumably the early pagetable builder is guaranteed to
>> avoid special memory like the DMI space. But the bug definitely seems to be a
>> result of the DMI code trying to make a RW mapping of a pagetable page, so
>> something is amiss there.
>>
>> Ooh, sleazy hack idea: make DMI always map RO, so even if it does get a
>> pagetable it causes no complaint... A bit awkward, since there doesn't seem
>> to be an RO form of early_ioremap.
>>
>>> Another option I was thinking of was a command line option to disable
>>> DMI, which (maybe) isn't terribly useful in itself but it introduces an
>>> associated variable to frob with. That's similar to how the TSC was
>>> handled in the past (well, the opposite since TSC was forced on).
>>>
>>
>> Yep, that would work too.
>>
>> Still curious about why a pagetable page is ending up in that range though.
>> Seems like it shouldn't be possible, since we shouldn't be allowed to allocate
>> from those pages, at least until the DMI probe has happened... Unless the
>> early allocator is only excluded from e820 reserved pages, which would cause a
>> problem on systems which don't reserve the DMI space... HPA?
>>
>
> I thought the problem was a Xen-provided pagetable from before Linux started?

The immediate symptom was that we have a page table at the address we
are doing the DMI probe. Xen does not allow page tables to be mapped
read-write so early_ioremap gets into trouble.

We have a mystery:
- Why did the Xen domain builder or the linux kernel use 0xf0000 - 0x100000
for a page table?

It should be possible to instrument the early linux page allocation
and see which pages Linux is using to see who is doing weird
things.

We have possible solutions.
- Add a read-only flag to early_ioremap for use by our table scans.
- Don't do a DMI scan in the case of Xen.
- Fix the Xen domain builder.

My inclination is that we disable the fruitless DMI scan in the case
of Xen, or we get the Xen domain builder fixed.

If Xen is going to increasingly look like a normal X86 BIOS we should
let the DMI scan run and put the burden on Xen to keep things
looking like a normal x86 machine. If Xen is not going to look more
like a normal x86 machine we can say oh, that is nice, it's Xen so
don't bother doing things that will cause problems.

In this case can we confirm that the domain builder is using those
early 64k as pages for a page table, and then educate it that not
allowing OS access to those pages is a little silly.

All of that said. For DMI tables and other early tables we should not
be writing to them anyway so learning to use read-only maps may be
the right solution. If the reason Xen was complaining was that
we were accessing an area that was not page tables but instead
should only be mapped read-only I would have a lot of sympathy
with that. As mapping areas that are architecturally ROMs read-write
is silly.

So guys can you please finish the root cause and really see why there
is a page table page in the 64K ROM BIOS area?

Eric

2008-02-21 22:43:28

by H. Peter Anvin

Subject: Re: 2.6.25-rc1 xen pvops regression

Ian Campbell wrote:
> On Thu, 2008-02-21 at 13:37 -0800, Jeremy Fitzhardinge wrote:
>> H. Peter Anvin wrote:
>>>> Still curious about why a pagetable page is ending up in that range
>>>> though. Seems like it shouldn't be possible, since we shouldn't be
>>>> allowed to allocate from those pages, at least until the DMI probe
>>>> has happened... Unless the early allocator is only excluded from
>>>> e820 reserved pages, which would cause a problem on systems which
>>>> don't reserve the DMI space... HPA?
>>>>
>>> I thought the problem was a Xen-provided pagetable from before Linux
>>> started?
>> Hm, I don't think so. The domain-builder pagetable is put after the
>> kernel, so it shouldn't be under 1M.
>
> I can confirm that it is Linux which is allocating it. The call path:
> # xm create -c debian-x86_32p-1
> Using config file "/etc/xen/debian-x86_32p-1".
> Started domain debian-1
> xen_alloc_pt_init PFN f0
> Pid: 0, comm: swapper Not tainted 2.6.25-rc2 #68
> [<c02ecb6b>] xen_alloc_pt_init+0x4b/0x60
> [<c02f5e2b>] one_page_table_init+0x8b/0xf0
> [<c02f63df>] paging_init+0x3bf/0x520
> [<c02ee444>] setup_arch+0x2a4/0x410
> [<c02e9a64>] start_kernel+0x64/0x380
> [<c02efd7f>] cpu_detect+0x6f/0xf0
> [<c02ed1a1>] xen_start_kernel+0x2f1/0x310
> =======================
> Entering add_active_range(0, 0, 262144) 0 entries of 256 used
> Zone PFN ranges:
> DMA 0 -> 4096
>

What is the e820 information you feed the kernel? We should only ever
allocate page tables out of available RAM, not any other type of memory
(reserved or not).

-hpa

2008-02-21 22:53:19

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

H. Peter Anvin wrote:
> Ian Campbell wrote:
>> On Thu, 2008-02-21 at 13:37 -0800, Jeremy Fitzhardinge wrote:
>>> H. Peter Anvin wrote:
>>>>> Still curious about why a pagetable page is ending up in that
>>>>> range though. Seems like it shouldn't be possible, since we
>>>>> shouldn't be allowed to allocate from those pages, at least until
>>>>> the DMI probe has happened... Unless the early allocator is only
>>>>> excluded from e820 reserved pages, which would cause a problem on
>>>>> systems which don't reserve the DMI space... HPA?
>>>>>
>>>> I thought the problem was a Xen-provided pagetable from before
>>>> Linux started?
>>> Hm, I don't think so. The domain-builder pagetable is put after the
>>> kernel, so it shouldn't be under 1M.
>>
>> I can confirm that it is Linux which is allocating it. The call path:
>> # xm create -c debian-x86_32p-1
>> Using config file "/etc/xen/debian-x86_32p-1".
>> Started domain debian-1
>> xen_alloc_pt_init PFN f0
>> Pid: 0, comm: swapper Not tainted 2.6.25-rc2 #68
>> [<c02ecb6b>] xen_alloc_pt_init+0x4b/0x60
>> [<c02f5e2b>] one_page_table_init+0x8b/0xf0
>> [<c02f63df>] paging_init+0x3bf/0x520
>> [<c02ee444>] setup_arch+0x2a4/0x410
>> [<c02e9a64>] start_kernel+0x64/0x380
>> [<c02efd7f>] cpu_detect+0x6f/0xf0
>> [<c02ed1a1>] xen_start_kernel+0x2f1/0x310
>> =======================
>> Entering add_active_range(0, 0, 262144) 0 entries of 256 used
>> Zone PFN ranges:
>> DMA 0 -> 4096
>>
>
> What is the e820 information you feed the kernel? We should only ever
> allocate page tables out of available RAM, not any other type of
> memory (reserved or not).

The kernel gets a flat memory map; all memory is just plain RAM. The
problem is that we're allocating a normal page and turning it into a
pagetable - so far so good. Then the DMI code is randomly mapping that
same page RW so it can scan it for DMI signatures, which Xen is preventing.

There are two immediate fixes:

1. Only scan for DMI if the memory is reserved (rejected, because HPA
says some machines don't reserve the DMI space). Alternatively,
don't bother scanning if booting under Xen.
2. Make DMI map the memory RO so that Xen doesn't complain (which is
sensible because DMI is ROM anyway).

But as far as I can tell, this shouldn't be happening anyway, and could
happen on real hardware which doesn't reserve the DMI space. It
probably doesn't because initial pagetables on real hardware use large
pages, and therefore allocate less memory for pagetables and
don't end up hitting the 0xf0000 region. But that area
should be excluded from the allocation pool.

J

2008-02-21 23:01:22

by Joel Becker

Subject: Re: 2.6.25-rc1 xen pvops regression

On Thu, Feb 21, 2008 at 10:12:36PM +0000, Ian Campbell wrote:
>
> On Thu, 2008-02-21 at 13:37 -0800, Jeremy Fitzhardinge wrote:
> > H. Peter Anvin wrote:
> > >> Still curious about why a pagetable page is ending up in that range
> > >> though. Seems like it shouldn't be possible, since we shouldn't be
> > >> allowed to allocate from those pages, at least until the DMI probe
> > >> has happened... Unless the early allocator is only excluded from
> > >> e820 reserved pages, which would cause a problem on systems which
> > >> don't reserve the DMI space... HPA?
> > >>
> > >
> > > I thought the problem was a Xen-provided pagetable from before Linux
> > > started?
> >
> > Hm, I don't think so. The domain-builder pagetable is put after the
> > kernel, so it shouldn't be under 1M.
>
> I can confirm that it is Linux which is allocating it. The call path:

Also, I have older kernels (2.6.24-rc era) that run Just Fine.
I haven't changed my dom0 at all.

Joel

--

"There is no more evil thing on earth than race prejudice, none at
all. I write deliberately -- it is the worst single thing in life
now. It justifies and holds together more baseness, cruelty and
abomination than any other sort of error in the world."
- H. G. Wells

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2008-02-21 23:17:24

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

Eric W. Biederman wrote:
>> I thought the problem was a Xen-provided pagetable from before Linux started?
>>
>
> The immediate symptom was that we have a page table at the address we
> are doing the DMI probe. Xen does not allow pages tables to be mapped
> read-write so early_ioremap get into trouble.
>

Correct.

> We have a mystery:
> - Why did the Xen domain builder or the linux kernel use 0xf0000 - 0x10000
> for a page table.
>

It didn't. Ian confirmed the allocation comes from the kernel's
pagetable setup. The domain builder always puts pagetables after the
kernel, so over 1M.

> It should be possible to instrument the early linux page allocation
> and see what page pages linux is using to see who is doing weird
> things.
>

Linux. Which suggests a latent bug.

> We have possible solutions.
> - Add a read-only flag to early_ioremap for use by our table scans.
> - Don't do a DMI scan in the case of Xen.
> - Fix the Xen domain builder.
>

1 or 2. 3 is not a problem.

> All of that said. For DMI tables other early tables we should not
> be writing to them anyway so learning to use read-only maps may be
> the right solution. If the reason Xen was complaining was that
> we were accessing an area that was not page tables but instead
> should only be mapped read-only I would have a lot of sympathy
> with that. As mapping areas that are architecturally ROMs read-write
> is silly.
>

For PV guests, Xen itself isn't really interested in any of the platform
bios/hardware (not even acknowledging its existence, let alone emulating
anything). Definitely no magic memory ranges or scanning for
signatures. The in-kernel Xen support has made some concessions to that
sort of stuff, by making sure various memory ranges are valid to be
scanned, even if they don't contain anything meaningful. We could do
something like that for DMI if that turns out to be the best answer.

> So guys can you please finish the root cause and really see why there
> is a page table page at in 64K ROM BIOS area?
>

It seems to me that those pages are being handed out as heap pages by
the early allocator. In the Xen case this is OK because there's nothing
magic about them. But if real hardware doesn't reserve these pages in
the E820 map, then they could end up being used as regular memory by
mistake, which is an issue.

J

2008-02-21 23:18:31

by H. Peter Anvin

Subject: Re: 2.6.25-rc1 xen pvops regression

Jeremy Fitzhardinge wrote:
>>
>> What is the e820 information you feed the kernel? We should only ever
>> allocate page tables out of available RAM, not any other type of
>> memory (reserved or not).
>
> The kernel gets a flat memory map; all memory is just plain RAM. The
> problem is that we're allocating a normal page and turning it into a
> pagetable - so far so good. Then the DMI code is randomly mapping that
> same page RW so it can scan it for DMI signatures, which Xen is preventing.
>
> There are two immediate fixes:
>
> 1. Only scan for DMI if the memory is reserved (rejected, because HPA
> says some machines don't reserve the DMI space). Alternatively,
> don't bother scanning if booting under Xen.
> 2. Make DMI map the memory RO so that Xen doesn't complain (which is
> sensible because DMI is ROM anyway).
>
> But as far as I can tell, this shouldn't be happening anyway, and could
> happen on real hardware which doesn't reserve the DMI space. It
> probably doesn't because initial pagetables on real hardware use large
> pages, and therefore allocate less memory for pagetable memory and
> therefore doesn't end up hitting the 0xf0000 region. But that area
> should be excluded from the allocation pool.
>

Which it is on real hardware, because although it's not *reserved* (type
2), it is certainly not made available as *normal memory* (type 1). If
Xen maps this as type 1 then I definitely see the problem.

We can exclude type 1 memory from DMI scan, certainly.
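
To make that check concrete, here is a minimal user-space sketch in C. It
is not the kernel code: struct map_entry, range_all_of_type() and
should_skip_dmi_scan() are hypothetical names made up for illustration.
The type numbers (1 for available RAM, 2 for reserved) are the ones
quoted above, and 0xF0000-0xFFFFF is the window the DMI scan looks at.
The sketch assumes addr + size does not overflow.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Type codes as used in this thread; 0 stands for "not listed at all". */
enum { MAP_HOLE = 0, MAP_RAM = 1, MAP_RESERVED = 2 };

/* Hypothetical stand-in for one e820 entry. */
struct map_entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * Return true if every byte of [start, end) is covered by entries of the
 * given type. Entries may be unsorted and may overlap each other.
 */
bool range_all_of_type(const struct map_entry *map, size_t n,
                       uint64_t start, uint64_t end, uint32_t type)
{
    bool progress = true;

    /* Keep advancing 'start' until the range is consumed or we get stuck. */
    while (start < end && progress) {
        progress = false;
        for (size_t i = 0; i < n; i++) {
            uint64_t lo = map[i].addr;
            uint64_t hi = map[i].addr + map[i].size;

            if (map[i].type != type || map[i].size == 0)
                continue;
            /* This entry covers the current start: move past its end. */
            if (lo <= start && hi > start) {
                start = hi;
                progress = true;
            }
        }
    }
    return start >= end;
}

/* Skip the legacy DMI scan only if the whole window is plain RAM. */
bool should_skip_dmi_scan(const struct map_entry *map, size_t n)
{
    return range_all_of_type(map, n, 0xF0000, 0xF0000 + 0x10000, MAP_RAM);
}
```

With a flat "everything is RAM" map the helper says to skip the scan; with
a map that leaves 0xF0000 unlisted or marks it reserved, it does not.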

However, Xen may want to consider why it provides memory below the 1 MB
point at all, and certainly whether it's wise to provide RAM in the
640-1024 KB legacy region -- although you could argue that "it *should*
work", odds are pretty good you'll have nasty surprises on a regular basis.

-hpa

2008-02-21 23:32:28

by H. Peter Anvin

Subject: Re: 2.6.25-rc1 xen pvops regression

Jeremy Fitzhardinge wrote:
>
> It seems to me that those pages are being handed out as heap pages by
> the early allocator. In the Xen case this is OK because there's nothing
> magic about them. But if real hardware doesn't reserve these pages in
> the E820 map, then they could end up being used as regular memory by
> mistake, which is an issue.
>

No, they couldn't.

On real hardware they'll be memory types 0 or 2, depending on whether or
not they're marked reserved.

Available RAM is type 1.

-hpa

2008-02-21 23:54:47

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>>
>> It seems to me that those pages are being handed out as heap pages by
>> the early allocator. In the Xen case this is OK because there's
>> nothing magic about them. But if real hardware doesn't reserve these
>> pages in the E820 map, then they could end up being used as regular
>> memory by mistake, which is an issue.
>>
>
> No, they couldn't.
>
> On real hardware they'll be memory types 0 or 2, depending on whether
> or not they're marked reserved.
>
> Available RAM is type 1.

OK. Well, perhaps Ian's patch could be amended to test to see if the
e820 map marks the ISA ROM region as normal RAM, and skip it if so?

J

2008-02-22 00:02:40

by H. Peter Anvin

Subject: Re: 2.6.25-rc1 xen pvops regression

Jeremy Fitzhardinge wrote:
>>
>> Available RAM is type 1.
>
> OK. Well, perhaps Ian's patch could be amended to test to see if the
> e820 map marks the ISA ROM region as normal RAM, and skip it if so?
>

That would work, at least for this particular case. I expect you'll
have a neverending list of similar issues.

-hpa

2008-02-22 07:25:34

by Ian Campbell

Subject: Re: 2.6.25-rc1 xen pvops regression


On Thu, 2008-02-21 at 14:58 -0800, H. Peter Anvin wrote:
>
> Which it is on real hardware, because although it's not *reserved*
> (type 2), it is certainly not made available as *normal memory* (type
> 1). If Xen maps this as type 1 then I definitely see the problem.
>
> We can exclude type 1 memory from DMI scan, certainly.

I'd been meaning to ask this. So the machines you have which don't
describe 0xf0000 as reserved also don't describe it as RAM? (I guess
it's either a hole in the table or one of the other e820 types).

So it sounds like it would be acceptable to simply invert the test in my
original patch as below? (actually reverting to my original-original
patch which I never sent out because checking for reserved sounded more
correct at the time, which was dumb of me because I was well aware of
the other possible types, I must have been having one of those days).

Ian.

From 13bdb4ee9d80b83a81c3dbefa52464e511d1b4df Mon Sep 17 00:00:00 2001
From: Ian Campbell <[email protected]>
Date: Fri, 22 Feb 2008 07:17:14 +0000
Subject: [PATCH] x86: Do not scan for DMI if the DMI region is marked as RAM by e820.

Under Xen the memory at 0xf0000 is regular RAM and so can potentially contain a
page table and hence cannot be mapped. The e820 map given to the guest reflects
this.

Signed-off-by: Ian Campbell <[email protected]>
---
drivers/firmware/dmi_scan.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/firmware/dmi_scan.c b/drivers/firmware/dmi_scan.c
index 653265a..f8fde74 100644
--- a/drivers/firmware/dmi_scan.c
+++ b/drivers/firmware/dmi_scan.c
@@ -7,6 +7,7 @@
#include <linux/bootmem.h>
#include <linux/slab.h>
#include <asm/dmi.h>
+#include <asm/e820.h>

static char dmi_empty_string[] = " ";

@@ -371,6 +372,9 @@ void __init dmi_scan_machine(void)
 		}
 	}
 	else {
+		if (e820_all_mapped(0xF0000, 0xF0000+0x10000, E820_RAM))
+			goto out;
+
 		/*
 		 * no iounmap() for that ioremap(); it would be a no-op, but
 		 * it's so early in setup that sucker gets confused into doing
--
1.5.4.2



--
Ian Campbell

Stupidity, like virtue, is its own reward.

2008-02-22 09:40:23

by Alan

Subject: Re: 2.6.25-rc1 xen pvops regression

> I'd been meaning to ask this. So the machines you have which don't
> describe 0xf0000 as reserved also don't describe it as RAM? (I guess
> it's either a hole in the table or one of the other e820 types).

Making 0xf0000 bus addresses RAM is probably a bad idea anyway. Most OS's
treat the E820 map with paranoia because we do see real PCs which
variously claim that the BIOS ROM space is RAM, ACPI, Reserved or just
forget to mention it. At least for a non PV guest it should be mapped as
R/O.

Likewise you get E820 maps with zero size entries, repeated entries ...

2008-02-22 09:54:18

by Andi Kleen

Subject: Re: 2.6.25-rc1 xen pvops regression

Alan Cox wrote:
>> I'd been meaning to ask this. So the machines you have which don't
>> describe 0xf0000 as reserved also don't describe it as RAM? (I guess
>> it's either a hole in the table or one of the other e820 types).
>
> Making 0xf0000 bus addresses RAM is probably a bad idea anyway. Most OS's
> treat the E820 map with paranoia because we do see real PCs which
> variously claim that the BIOS ROM space is RAM, ACPI, Reserved or just
> forget to mention it.

Actually I switched 64bit over to trust e820 completely and not
reserve 640k-1MB explicitly some time ago
and AFAIK there haven't been any reports that it causes problems.

So presumably trusting e820 is ok on modern systems (2003+)

-Andi

2008-02-22 10:11:39

by Alan

Subject: Re: 2.6.25-rc1 xen pvops regression

> Actually I switched 64bit over to trust e820 completely and not
> reserve 640k-1MB explicitly some time ago
> and AFAIK there haven't been any reports that it causes problems.
>
> So presumably trusting e820 is ok on modern systems (2003+)

Apparently so - at least 64bit capable ones. We do still sort the entries
to remove zero length records and other surprises.
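
As a rough illustration of that kind of clean-up, here is a user-space
sketch in C. It is not the kernel's sanitizing code, and struct map_entry
and tidy_map() are hypothetical names. It only drops zero-length records,
sorts by address and removes exact repeats; resolving overlapping entries
of different types is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one e820 entry. */
struct map_entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Order by address, then size, then type, so exact repeats end up adjacent. */
static int cmp_entry(const void *a, const void *b)
{
    const struct map_entry *x = a, *y = b;

    if (x->addr != y->addr)
        return x->addr < y->addr ? -1 : 1;
    if (x->size != y->size)
        return x->size < y->size ? -1 : 1;
    if (x->type != y->type)
        return x->type < y->type ? -1 : 1;
    return 0;
}

/* Tidy the map in place and return the new number of entries. */
size_t tidy_map(struct map_entry *map, size_t n)
{
    size_t kept = 0, out = 0;

    /* Pass 1: drop zero-length records. */
    for (size_t i = 0; i < n; i++)
        if (map[i].size != 0)
            map[kept++] = map[i];

    /* Pass 2: sort what is left. */
    qsort(map, kept, sizeof(map[0]), cmp_entry);

    /* Pass 3: drop entries identical to their predecessor. */
    for (size_t i = 0; i < kept; i++) {
        if (out > 0 && cmp_entry(&map[out - 1], &map[i]) == 0)
            continue;
        map[out++] = map[i];
    }
    return out;
}
```

Calling tidy_map() on a table that contains a zero-size record and one
repeated entry returns a shorter, address-ordered table.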

Alan

2008-02-22 10:14:27

by Andi Kleen

Subject: Re: 2.6.25-rc1 xen pvops regression

Alan Cox wrote:
>> Actually I switched 64bit over to trust e820 completely and not
>> reserve 640k-1MB explicitly some time ago
>> and AFAIK there haven't been any reports that it causes problems.
>>
>> So presumably trusting e820 is ok on modern systems (2003+)
>
> Apparently so - at least 64bit capable ones.

They should all use the same BIOS code bases, except perhaps
some embedded weirdnesses.

> We do still sort the entries
> to remove zero length records and other surprises.

That code could actually be dropped. And the sorting too.
It's all not needed I think.

AFAIK none of the e820 access code cares about any of
that.

64bit only has it because I copied it originally and never
bothered to remove it.

-Andi

2008-02-22 16:32:44

by H. Peter Anvin

Subject: Re: 2.6.25-rc1 xen pvops regression

Andi Kleen wrote:
> Alan Cox wrote:
>>> Actually I switched 64bit over to trust e820 completely and not
>>> reserve 640k-1MB explicitly some time ago
>>> and AFAIK there haven't been any reports that it causes problems.
>>>
>>> So presumably trusting e820 is ok on modern systems (2003+)
>> Apparently so - at least 64bit capable ones.
>
> They should all use the same BIOS code bases, except perhaps
> some embedded weirdnesses.

Well, that, plus you still have to deal with a lot older stuff.

>> We do still sort the entries
>> to remove zero length records and other surprises.
>
> That code could actually be dropped. And the sorting too.
> It's all not needed I think.

When I dealt with this for another project, I found that the e820 data
format is suboptimal. It's better to treat it as a sorted list of
(address, type) tuples (where type can be zero); the data from e820 can
be fed into such a data structure and it cleans it up nicely.
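
A minimal sketch of such a structure, in user-space C, might look like the
following. All names (struct point, struct memmap, memmap_set() and so on)
are hypothetical, the fixed MAX_POINTS bound is arbitrary, and the rule
that a later range overrides an earlier one is an assumption made for the
example rather than something stated above. Each tuple means "from this
address up to the next tuple's address, memory has this type", with type 0
meaning nothing is there.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_POINTS 64   /* arbitrary bound for this sketch */

/* One boundary: memory has 'type' from 'addr' up to the next point. */
struct point {
    uint64_t addr;
    uint32_t type;
};

struct memmap {
    struct point pt[MAX_POINTS];   /* sorted by addr, pt[0].addr == 0 */
    size_t n;
};

/* Start with a single point: everything from address 0 is type 0. */
void memmap_init(struct memmap *m)
{
    m->pt[0].addr = 0;
    m->pt[0].type = 0;
    m->n = 1;
}

/* Type in effect at 'addr': that of the last point at or below it. */
uint32_t memmap_type_at(const struct memmap *m, uint64_t addr)
{
    uint32_t type = 0;

    for (size_t i = 0; i < m->n && m->pt[i].addr <= addr; i++)
        type = m->pt[i].type;
    return type;
}

/* Set [start, end) to 'type'. Returns 0 on success, -1 if the map is full. */
int memmap_set(struct memmap *m, uint64_t start, uint64_t end, uint32_t type)
{
    struct point out[MAX_POINTS + 2];
    size_t n = 0, k = 0;
    uint32_t type_after;

    if (start >= end)
        return 0;                       /* zero-length entries just vanish */

    type_after = memmap_type_at(m, end);   /* what resumes at 'end' */

    for (size_t i = 0; i < m->n; i++)   /* points before the new range */
        if (m->pt[i].addr < start)
            out[n++] = m->pt[i];
    out[n++] = (struct point){ start, type };
    out[n++] = (struct point){ end, type_after };
    for (size_t i = 0; i < m->n; i++)   /* points after the new range */
        if (m->pt[i].addr > end)
            out[n++] = m->pt[i];

    for (size_t i = 0; i < n; i++)      /* merge neighbours of equal type */
        if (k == 0 || out[k - 1].type != out[i].type)
            out[k++] = out[i];

    if (k > MAX_POINTS)
        return -1;
    memcpy(m->pt, out, k * sizeof(out[0]));
    m->n = k;
    return 0;
}
```

Feeding each e820 entry through memmap_set(&m, addr, addr + size, type)
leaves a sorted list in which zero-size and repeated entries have no
effect, and holes show up explicitly as type 0.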

-hpa

2008-02-22 19:25:51

by Pavel Machek

Subject: Re: 2.6.25-rc1 xen pvops regression

On Fri 2008-02-22 11:15:59, Andi Kleen wrote:
> Alan Cox wrote:
> >> Actually I switched 64bit over to trust e820 completely and not
> >> reserve 640k-1MB explicitly some time ago
> >> and AFAIK there haven't been any reports that it causes problems.
> >>
> >> So presumably trusting e820 is ok on modern systems (2003+)
> >
> > Apparently so - at least 64bit capable ones.
>
> They should all use the same BIOS code bases, except perhaps

At least the Kohjinsha subnotebook has a very 'interesting' BIOS. Very new,
but Geode -> not 64bit.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-02-26 17:08:19

by Mark McLoughlin

Subject: Re: 2.6.25-rc1 xen pvops regression

On Fri, 2008-02-22 at 07:25 +0000, Ian Campbell wrote:
> On Thu, 2008-02-21 at 14:58 -0800, H. Peter Anvin wrote:
> >
> > Which it is on real hardware, because although it's not *reserved*
> > (type 2), it is certainly not made available as *normal memory* (type
> > 1). If Xen maps this as type 1 then I definitely see the problem.
> >
> > We can exclude type 1 memory from DMI scan, certainly.
>
> I'd been meaning to ask this. So the machines you have which don't
> describe 0xf0000 as reserved also don't describe it as RAM? (I guess
> it's either a hole in the table or one of the other e820 types).
> Ian.

...

> From 13bdb4ee9d80b83a81c3dbefa52464e511d1b4df Mon Sep 17 00:00:00 2001
> From: Ian Campbell <[email protected]>
> Date: Fri, 22 Feb 2008 07:17:14 +0000
> Subject: [PATCH] x86: Do not scan for DMI if the DMI region is marked as RAM by e820.
>
> Under Xen the memory at 0xf0000 is regular RAM and so can potentially contain a
> page table and hence cannot be mapped. The e820 map given to the guest reflects
> this.
>
> Signed-off-by: Ian Campbell <[email protected]>

...

> @@ -371,6 +372,9 @@ void __init dmi_scan_machine(void)
> }
> }
> else {
> + if (e820_all_mapped(0xF0000, 0xF0000+0x10000, E820_RAM))
> + goto out;

One issue with using the e820 map for this is that a Xen Dom0 will also
have this region marked as RAM in the e820 map, but will set up a fixmap
for it, allowing dmi_scan_machine() to map the region.

Cheers,
Mark.

2008-02-26 20:09:12

by Jeremy Fitzhardinge

Subject: Re: 2.6.25-rc1 xen pvops regression

Mark McLoughlin wrote:
>> @@ -371,6 +372,9 @@ void __init dmi_scan_machine(void)
>> }
>> }
>> else {
>> + if (e820_all_mapped(0xF0000, 0xF0000+0x10000, E820_RAM))
>> + goto out;
>>
>
> One issue with using the e820 map for this is that a Xen Dom0 will also
> have this region marked as RAM in the e820 map, but will set up a fixmap
> for it, allowing dmi_scan_machine() to map the region.
>

Would it be easier to just fake up a mapping so that window points to
the real dmi area, and mark E820 accordingly?

J