2008-02-29 20:43:33

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

The patch "x86 - Enhance DEBUG_RODATA support - alternatives" enables
the kernel for writing by clearing X86_CR0_WP allow privileged writes.
This won't work in a paravirt environment for two reasons:

1. the kernel may not be running in ring 0, so writes will still be
prevented
2. the hypervisor prevents X86_CR0_WP from being cleared anyway (it
GPFs the cr0 update)

This crashes on Xen, and it would probably break VMI too.

The only safe way to allow writes is to change the page permissions
(either on the page itself, or create a temporary writable alias for
that page). Perhaps something you could do it with kmap_atomic.

J


2008-02-29 21:00:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

Jeremy Fitzhardinge wrote:
> The patch "x86 - Enhance DEBUG_RODATA support - alternatives" enables
> the kernel for writing by clearing X86_CR0_WP allow privileged writes.
> This won't work in a paravirt environment for two reasons:
>
> 1. the kernel may not be running in ring 0, so writes will still be
> prevented
> 2. the hypervisor prevents X86_CR0_WP from being cleared anyway (it
> GPFs the cr0 update)
>
> This crashes on Xen, and it would probably break VMI too.
>
> The only safe way to allow writes is to change the page permissions
> (either on the page itself, or create a temporary writable alias for
> that page). Perhaps something you could do it with kmap_atomic.
>

A properly implemented hypervisor should arguably emulate this.

Doesn't really mean the patch is worth the pain.

-hpa

2008-02-29 21:13:42

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>> The patch "x86 - Enhance DEBUG_RODATA support - alternatives" enables
>> the kernel for writing by clearing X86_CR0_WP allow privileged
>> writes. This won't work in a paravirt environment for two reasons:
>>
>> 1. the kernel may not be running in ring 0, so writes will still be
>> prevented
>> 2. the hypervisor prevents X86_CR0_WP from being cleared anyway (it
>> GPFs the cr0 update)
>>
>> This crashes on Xen, and it would probably break VMI too.

(lguest too, of course)

>> The only safe way to allow writes is to change the page permissions
>> (either on the page itself, or create a temporary writable alias for
>> that page). Perhaps something you could do it with kmap_atomic.
>>
>
> A properly implemented hypervisor should arguably emulate this.
>
> Doesn't really mean the patch is worth the pain.

No, it would be irritating to implement.

Seems to me that doing the update in a temporary kmap_atomic mapping
would be a more straightforward way to go, anyway. How would you
implement this on a processor without something like X86_CR0_WP?

J

2008-02-29 21:24:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"


* Jeremy Fitzhardinge <[email protected]> wrote:

> The patch "x86 - Enhance DEBUG_RODATA support - alternatives" enables
> the kernel for writing by clearing X86_CR0_WP allow privileged writes.
> This won't work in a paravirt environment for two reasons:

ok, have reverted it and have pushed out a new x86.git.

Ingo

2008-03-03 17:38:56

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

* Jeremy Fitzhardinge ([email protected]) wrote:
> H. Peter Anvin wrote:
>> Jeremy Fitzhardinge wrote:
>>> The patch "x86 - Enhance DEBUG_RODATA support - alternatives" enables the
>>> kernel for writing by clearing X86_CR0_WP allow privileged writes. This
>>> won't work in a paravirt environment for two reasons:
>>>
>>> 1. the kernel may not be running in ring 0, so writes will still be
>>> prevented
>>> 2. the hypervisor prevents X86_CR0_WP from being cleared anyway (it
>>> GPFs the cr0 update)
>>>
>>> This crashes on Xen, and it would probably break VMI too.
>
> (lguest too, of course)
>
>>> The only safe way to allow writes is to change the page permissions
>>> (either on the page itself, or create a temporary writable alias for that
>>> page). Perhaps something you could do it with kmap_atomic.
>>>
>>
>> A properly implemented hypervisor should arguably emulate this.
>>
>> Doesn't really mean the patch is worth the pain.
>
> No, it would be irritating to implement.
>
> Seems to me that doing the update in a temporary kmap_atomic mapping would
> be a more straightforward way to go, anyway. How would you implement this
> on a processor without something like X86_CR0_WP?
>
> J

I think kmap_atomic is only implemented on x86_32 and only deals with
highmem pages. It will simply return the original page address without
changing the protection for other pages, which is not what we want.
Would ioremap() be a good alternative ?

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-03-03 17:41:21

by Andi Kleen

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

> Would ioremap() be a good alternative ?

One of the early attempts at patching ro data used ioremap. Unfortunately
it was a little buggy and was thus reverted, but that were only detail
problems, nothing fundamental of the approach.

-Andi

2008-03-03 18:03:16

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

Mathieu Desnoyers wrote:
> I think kmap_atomic is only implemented on x86_32 and only deals with
> highmem pages. It will simply return the original page address without
> changing the protection for other pages, which is not what we want.
> Would ioremap() be a good alternative ?
>

Perhaps, though that's uncached by default. You could reserve a fixmap
slot and use set_fixmap to create the mapping. Or use vmap, which may
make dealing with instructions crossing page boundaries a little easier.

J

2008-03-03 18:03:50

by Andi Kleen

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

On Monday 03 March 2008 18:58:03 Jeremy Fitzhardinge wrote:

> Perhaps, though that's uncached by default.

ioremap_cached()

-Andi

2008-03-03 18:11:42

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

Andi Kleen wrote:
> On Monday 03 March 2008 18:58:03 Jeremy Fitzhardinge wrote:
>
>
>> Perhaps, though that's uncached by default.
>>
>
> ioremap_cached()

Sure. But given that from the perspective of this problem ioremap* is
just a wrapper for vmap, we may as well use it directly and avoid
getting tangled up in any current or future io-related stuff that
ioremap may want to do.

J

2008-03-03 20:53:46

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

* Jeremy Fitzhardinge ([email protected]) wrote:
> Andi Kleen wrote:
>> On Monday 03 March 2008 18:58:03 Jeremy Fitzhardinge wrote:
>>
>>
>>> Perhaps, though that's uncached by default.
>>>
>>
>> ioremap_cached()
>
> Sure. But given that from the perspective of this problem ioremap* is just
> a wrapper for vmap, we may as well use it directly and avoid getting
> tangled up in any current or future io-related stuff that ioremap may want
> to do.
>
> J

Would you have a quick hint on why I get a page fault with the following
implementation ? There is probably a fundamental detail I missed.


void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
{
char *vaddr;
struct page *pages[1];

BUG_ON(len > sizeof(long));
BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
- ((long)addr & ~(sizeof(long) - 1)));
if (kernel_text_address((unsigned long)addr))
pages[0] = virt_to_page(addr);
else
vaddr = addr;
vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
memcpy(&vaddr[(unsigned long)addr & PAGE_MASK], opcode, len);
if (kernel_text_address((unsigned long)addr))
vunmap(vaddr);
sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
return addr;
}


[ 0.149856] SMP alternatives: switching to UP code
[ 0.152009] BUG: unable to handle kernel paging request at b8902000
[ 0.152009] IP: [<c03acfa1>] text_poke+0x85/0xb4
[ 0.152009] *pde = 00000000
[ 0.152009] Oops: 0002 [#1] SMP
[ 0.152009] LTT NESTING LEVEL : 0 <0>
[ 0.152009] Modules linked in:
[ 0.152009]
[ 0.152009] Pid: 0, comm: swapper Not tainted (2.6.25-rc3-testssmp #744)
[ 0.152009] EIP: 0060:[<c03acfa1>] EFLAGS: 00010002 CPU: 0
[ 0.152009] EIP is at text_poke+0x85/0xb4
[ 0.152009] EAX: f8800000 EBX: 00000001 ECX: 00000001 EDX: f8800000
[ 0.152009] ESI: c04a3fab EDI: b8902000 EBP: c0102114 ESP: c04a3f88
[ 0.152009] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 0.152009] Process swapper (pid: 0, ti=c04a2000 task=c04703a0 task.ti=c04a2)
[ 0.152009] Stack: 00000163 f8800000 c1002040 c04a400c c04a7cac c0100000 c03
[ 0.152009] 90411244 00000206 c04e072c c17fb72c 0000672c c04aaefe c03
[ 0.152009] c04ab76a c17f5000 c04a896a 00000092 c04a80d7 00000008 000
[ 0.152009] Call Trace:
[ 0.152009] [<c03b0139>] _etext+0x0/0xf7ec7
[ 0.152009] [<c0107814>] alternatives_smp_unlock+0x57/0x59
[ 0.152009] [<c04aaefe>] alternative_instructions+0x152/0x157
[ 0.152009] [<c03b0139>] _etext+0x0/0xf7ec7
[ 0.152009] [<c04ab76a>] check_bugs+0x131/0x14e
[ 0.152009] [<c04a896a>] start_kernel+0x2d1/0x327
[ 0.152009] [<c04a80d7>] unknown_bootoption+0x0/0x1e3
[ 0.152009] =======================
[ 0.152009] Code: b9 04 00 00 00 ba 01 00 00 00 e8 a6 87 db ff 89 44 24 04 8
[ 0.152009] EIP: [<c03acfa1>] text_poke+0x85/0xb4 SS:ESP 0068:c04a3f88
[ 0.152009] ---[ end trace ca143223eefdc828 ]---
[ 0.152009] Kernel panic - not syncing: Attempted to kill the idle task!


Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-03-03 22:21:18

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

Mathieu Desnoyers wrote:
> * Jeremy Fitzhardinge ([email protected]) wrote:
>
>> Andi Kleen wrote:
>>
>>> On Monday 03 March 2008 18:58:03 Jeremy Fitzhardinge wrote:
>>>
>>>
>>>
>>>> Perhaps, though that's uncached by default.
>>>>
>>>>
>>> ioremap_cached()
>>>
>> Sure. But given that from the perspective of this problem ioremap* is just
>> a wrapper for vmap, we may as well use it directly and avoid getting
>> tangled up in any current or future io-related stuff that ioremap may want
>> to do.
>>
>> J
>>
>
> Would you have a quick hint on why I get a page fault with the following
> implementation ? There is probably a fundamental detail I missed.
>
>
> void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> {
> char *vaddr;
> struct page *pages[1];
>
> BUG_ON(len > sizeof(long));
> BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
> - ((long)addr & ~(sizeof(long) - 1)));
> if (kernel_text_address((unsigned long)addr))
> pages[0] = virt_to_page(addr);
> else
> vaddr = addr;
>

What's this for? You just overwrite vaddr with the vmap on the next line.

> vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
>
Do you need to handle the case of an instruction spanning a page boundary?

> memcpy(&vaddr[(unsigned long)addr & PAGE_MASK], opcode, len);
> if (kernel_text_address((unsigned long)addr))
> vunmap(vaddr);
> sync_core();
> /* Could also do a CLFLUSH here to speed up CPU recovery; but
> that causes hangs on some VIA CPUs. */
> return addr;
> }
>
>
> [ 0.149856] SMP alternatives: switching to UP code
> [ 0.152009] BUG: unable to handle kernel paging request at b8902000
>
That's a userspace address, unless you've got 2G:2G, and the error code
says there's nothing mapped there.

> [ 0.152009] IP: [<c03acfa1>] text_poke+0x85/0xb4
> [ 0.152009] *pde = 00000000
> [ 0.152009] Oops: 0002 [#1] SMP
> [ 0.152009] LTT NESTING LEVEL : 0 <0>
> [ 0.152009] Modules linked in:
> [ 0.152009]
> [ 0.152009] Pid: 0, comm: swapper Not tainted (2.6.25-rc3-testssmp #744)
> [ 0.152009] EIP: 0060:[<c03acfa1>] EFLAGS: 00010002 CPU: 0
> [ 0.152009] EIP is at text_poke+0x85/0xb4
> [ 0.152009] EAX: f8800000 EBX: 00000001 ECX: 00000001 EDX: f8800000
> [ 0.152009] ESI: c04a3fab EDI: b8902000 EBP: c0102114 ESP: c04a3f88
> [ 0.152009] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> [ 0.152009] Process swapper (pid: 0, ti=c04a2000 task=c04703a0 task.ti=c04a2)
> [ 0.152009] Stack: 00000163 f8800000 c1002040 c04a400c c04a7cac c0100000 c03
> [ 0.152009] 90411244 00000206 c04e072c c17fb72c 0000672c c04aaefe c03
> [ 0.152009] c04ab76a c17f5000 c04a896a 00000092 c04a80d7 00000008 000
> [ 0.152009] Call Trace:
> [ 0.152009] [<c03b0139>] _etext+0x0/0xf7ec7
> [ 0.152009] [<c0107814>] alternatives_smp_unlock+0x57/0x59
> [ 0.152009] [<c04aaefe>] alternative_instructions+0x152/0x157
> [ 0.152009] [<c03b0139>] _etext+0x0/0xf7ec7
> [ 0.152009] [<c04ab76a>] check_bugs+0x131/0x14e
> [ 0.152009] [<c04a896a>] start_kernel+0x2d1/0x327
> [ 0.152009] [<c04a80d7>] unknown_bootoption+0x0/0x1e3
> [ 0.152009] =======================
> [ 0.152009] Code: b9 04 00 00 00 ba 01 00 00 00 e8 a6 87 db ff 89 44 24 04 8
> [ 0.152009] EIP: [<c03acfa1>] text_poke+0x85/0xb4 SS:ESP 0068:c04a3f88
> [ 0.152009] ---[ end trace ca143223eefdc828 ]---
> [ 0.152009] Kernel panic - not syncing: Attempted to kill the idle task!
>

J

2008-03-04 04:12:42

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: bad paravirt/Xen interaction in "x86 - Enhance DEBUG_RODATA support - alternatives"

* Jeremy Fitzhardinge ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> * Jeremy Fitzhardinge ([email protected]) wrote:
>>
>>> Andi Kleen wrote:
>>>
>>>> On Monday 03 March 2008 18:58:03 Jeremy Fitzhardinge wrote:
>>>>
>>>>
>>>>> Perhaps, though that's uncached by default.
>>>>>
>>>> ioremap_cached()
>>>>
>>> Sure. But given that from the perspective of this problem ioremap* is
>>> just a wrapper for vmap, we may as well use it directly and avoid getting
>>> tangled up in any current or future io-related stuff that ioremap may
>>> want to do.
>>>
>>> J
>>>
>>
>> Would you have a quick hint on why I get a page fault with the following
>> implementation ? There is probably a fundamental detail I missed.
>>
>>
>> void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
>> {
>> char *vaddr;
>> struct page *pages[1];
>>
>> BUG_ON(len > sizeof(long));
>> BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
>> - ((long)addr & ~(sizeof(long) - 1)));
>> if (kernel_text_address((unsigned long)addr))
>> pages[0] = virt_to_page(addr);
>> else
>> vaddr = addr;
>>
>

Hi Jeremy,

(Connecting brain...)

> What's this for? You just overwrite vaddr with the vmap on the next line.
>

Ah, it's supposed to be

if (kernel_text_address((unsigned long)addr)) {
pages[0] = virt_to_page(addr);
vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
} else
memcpy(addr, opcode, len);



>> vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
>>
> Do you need to handle the case of an instruction spanning a page boundary?
>

Not really :

1-byte changes :

- kprobes is a 1-byte breakpoint
- smp alternatives changes an f0 prefix into a 1-byte nop when a single
CPU is active (the opposite is done when bringing up a second CPU).

Aligned on word size :
- immediate values are aligned on word size

At boot time only, before kernel pages are marked RO :
- alternatives and paravirt are applied at boot time only

But yes, I guess it's cheap and allows supporting users which does not
have their instructions aligned. I'll change it.

>> memcpy(&vaddr[(unsigned long)addr & PAGE_MASK], opcode, len);
>> if (kernel_text_address((unsigned long)addr))
>> vunmap(vaddr);
>> sync_core();
>> /* Could also do a CLFLUSH here to speed up CPU recovery; but
>> that causes hangs on some VIA CPUs. */
>> return addr;
>> }
>>
>>
>> [ 0.149856] SMP alternatives: switching to UP code
>> [ 0.152009] BUG: unable to handle kernel paging request at
>> b8902000
> That's a userspace address, unless you've got 2G:2G, and the error code
> says there's nothing mapped there.
>

<banging head on the wall> & PAGE_MASK -> & ~PAGE_MASK :)

Thanks!

Mathieu

>> [ 0.152009] IP: [<c03acfa1>] text_poke+0x85/0xb4
>> [ 0.152009] *pde = 00000000
>> [ 0.152009] Oops: 0002 [#1] SMP
>> [ 0.152009] LTT NESTING LEVEL : 0 <0>
>> [ 0.152009] Modules linked in:
>> [ 0.152009] [ 0.152009] Pid: 0, comm: swapper Not tainted
>> (2.6.25-rc3-testssmp #744)
>> [ 0.152009] EIP: 0060:[<c03acfa1>] EFLAGS: 00010002 CPU: 0
>> [ 0.152009] EIP is at text_poke+0x85/0xb4
>> [ 0.152009] EAX: f8800000 EBX: 00000001 ECX: 00000001 EDX: f8800000
>> [ 0.152009] ESI: c04a3fab EDI: b8902000 EBP: c0102114 ESP: c04a3f88
>> [ 0.152009] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>> [ 0.152009] Process swapper (pid: 0, ti=c04a2000 task=c04703a0
>> task.ti=c04a2)
>> [ 0.152009] Stack: 00000163 f8800000 c1002040 c04a400c c04a7cac
>> c0100000 c03 [ 0.152009] 90411244 00000206 c04e072c c17fb72c
>> 0000672c c04aaefe c03 [ 0.152009] c04ab76a c17f5000 c04a896a
>> 00000092 c04a80d7 00000008 000 [ 0.152009] Call Trace:
>> [ 0.152009] [<c03b0139>]
>> _etext+0x0/0xf7ec7 [ 0.152009]
>> [<c0107814>] alternatives_smp_unlock+0x57/0x59 [
>> 0.152009] [<c04aaefe>] alternative_instructions+0x152/0x157
>> [ 0.152009] [<c03b0139>] _etext+0x0/0xf7ec7
>> [ 0.152009] [<c04ab76a>] check_bugs+0x131/0x14e
>> [ 0.152009] [<c04a896a>] start_kernel+0x2d1/0x327
>> [ 0.152009] [<c04a80d7>] unknown_bootoption+0x0/0x1e3
>> [ 0.152009] =======================
>> [ 0.152009] Code: b9 04 00 00 00 ba 01 00 00
>> 00 e8 a6 87 db ff 89 44 24 04 8 [ 0.152009] EIP: [<c03acfa1>]
>> text_poke+0x85/0xb4 SS:ESP 0068:c04a3f88 [ 0.152009] ---[ end
>> trace ca143223eefdc828 ]--- [ 0.152009]
>> Kernel panic - not syncing: Attempted to kill the idle task!
>
> J

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68