Recent i686 2.6.27 kernels with a certain amount of memory (between 736 and
855MB) have a problem booting under a hypervisor that supports batched mprotect
(this includes the RHEL-5 Xen hypervisor as well as any 3.3 or later Xen
hypervisor). The problem is that xen_ptep_modify_prot_commit() uses
virt_to_machine to calculate which pfn to update. However, this only works
for pages that are in the p2m list, and the pages coming from
change_pte_range() in mm/mprotect.c are kmap_atomic pages. Because of this, we
can run into a situation where the lookup in the p2m table returns
INVALID_MFN, which we then pass to the hypervisor, which (correctly) denies
the request because it refers to a totally bogus pfn.
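For illustration, a rough sketch of the failing lookup (paraphrasing the
2.6.27-era asm/xen/page.h helpers, not the exact source; sketch_virt_to_mfn
is a hypothetical name):

static unsigned long sketch_virt_to_mfn(void *vaddr)
{
	/* virt_to_machine() boils down to __pa() plus a p2m lookup.
	 * __pa() is only meaningful for direct-mapped (lowmem)
	 * addresses; for a kmap_atomic address it yields a bogus pfn,
	 * so the p2m lookup hands back INVALID_MFN and the hypervisor
	 * refuses the resulting request. */
	unsigned long pfn = __pa(vaddr) >> PAGE_SHIFT;

	return pfn_to_mfn(pfn);		/* pfn->mfn (p2m) table lookup */
}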
The right thing to do is to use arbitrary_virt_to_machine, so that we can be
sure we are modifying the right pfn. This unfortunately introduces a
performance penalty because of a full page-table walk, but we can avoid that
penalty for pages in the p2m list by checking whether virt_addr_valid() is
true and, if so, just doing the lookup in the p2m table.
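A minimal sketch of the idea, assuming the 2.6.27 pvops helpers (the
attached patch is the authoritative version):

xmaddr_t arbitrary_virt_to_machine(void *vaddr)
{
	unsigned long address = (unsigned long)vaddr;
	unsigned int level;
	pte_t *pte;
	unsigned offset;

	/* Fast path: direct-mapped addresses can go straight through
	 * the p2m table, skipping the page-table walk. */
	if (virt_addr_valid(vaddr))
		return virt_to_machine(vaddr);

	/* Slow path: kmap_atomic (and other non-linear) mappings need
	 * a real page-table walk to find the underlying mfn. */
	pte = lookup_address(address, &level);
	BUG_ON(pte == NULL);
	offset = address & ~PAGE_MASK;
	return XMADDR(((u64)pte_mfn(*pte) << PAGE_SHIFT) + offset);
}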
The attached patch implements this, and allows my 2.6.27 i686-based guest with
768MB of memory to boot on a RHEL-5 hypervisor again. Thanks to Jeremy for the
suggestions about how to fix this particular issue.
Signed-off-by: Chris Lalancette <[email protected]>
>>> Chris Lalancette <[email protected]> 15.10.08 13:03 >>>
>The right thing to do is to use arbitrary_virt_to_machine, so that we can be
>sure we are modifying the right pfn. This unfortunately introduces a
>performance penalty because of a full page-table walk, but we can avoid that
>penalty for pages in the p2m list by checking whether virt_addr_valid() is
>true and, if so, just doing the lookup in the p2m table.
Could you explain how virt_addr_valid() can validly be used here? Looking
at its implementation
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
for a kaddr in kmap space (i.e. above high_memory), __pa() would return a
bogus physical address, and hence pfn_valid() on the resulting pfn is
meaningless.
I'd instead simply compare the address in question against high_memory,
and perhaps do so in ptep_modify_prot_commit() rather than in
arbitrary_virt_to_machine(), under an #ifdef CONFIG_HIGHPTE. But
performance-wise, CONFIG_HIGHPTE sucks under Xen anyway, so you'd
better not turn this on in the first place. We may want/need to provide
a means to disable this at run time so the same kernel when run natively
could still make use of it, but without impacting performance under Xen.
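Roughly something like this minimal, untested sketch (ptep_to_machine()
is a hypothetical helper):

static xmaddr_t ptep_to_machine(pte_t *ptep)
{
#ifdef CONFIG_HIGHPTE
	/* kmap space sits above high_memory, so anything up there
	 * cannot be resolved through the p2m list alone. */
	if ((void *)ptep >= high_memory)
		return arbitrary_virt_to_machine(ptep);
#endif
	return virt_to_machine(ptep);
}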
Jan
Jan Beulich wrote:
>>>> Chris Lalancette <[email protected]> 15.10.08 13:03 >>>
>>>>
>> The right thing to do is to use arbitrary_virt_to_machine, so that we can be
>> sure we are modifying the right pfn. This unfortunately introduces a
>> performance penalty because of a full page-table walk, but we can avoid that
>> penalty for pages in the p2m list by checking whether virt_addr_valid() is
>> true and, if so, just doing the lookup in the p2m table.
>>
>
> Could you explain how virt_addr_valid() can validly be used here? Looking
> at its implementation
>
> #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
>
> for a kaddr in kmap space (i.e. above high_memory), __pa() would return a
> bogus physical address, and hence pfn_valid() on the resulting pfn is
> meaningless.
>
virt_addr_valid() is supposed to be usable in this circumstance. The
comment says "virt_to_page(kaddr) returns a valid pointer if and only if
virt_addr_valid(kaddr) returns true", which implies that
virt_addr_valid() returns a meaningful result on all addresses - and if
not, it should be fixed.
> I'd instead simply compare the address in question against high_memory,
> and perhaps do so in ptep_modify_prot_commit() rather than in
> arbitrary_virt_to_machine(), under an #ifdef CONFIG_HIGHPTE.
I suppose, but I don't think there's much cost in making it generally
robust.
> But
> performance-wise, CONFIG_HIGHPTE sucks under Xen anyway, so you'd
> better not turn this on in the first place. We may want/need to provide
> a means to disable this at run time so the same kernel when run natively
> could still make use of it, but without impacting performance under Xen.
>
That's a secondary issue. What's the source of the performance hit?
Just all the extra kmap_atomic operations?
J
>>> Jeremy Fitzhardinge <[email protected]> 15.10.08 18:23 >>>
>> But
>> performance-wise, CONFIG_HIGHPTE sucks under Xen anyway, so you'd
>> better not turn this on in the first place. We may want/need to provide
>> a means to disable this at run time so the same kernel when run natively
>> could still make use of it, but without impacting performance under Xen.
>>
>
>That's a secondary issue. What's the source of the performance hit?
>Just all the extra kmap_atomic operations?
Yes, afaict.
>>> Jeremy Fitzhardinge <[email protected]> 15.10.08 18:23 >>>
>virt_addr_valid() is supposed to be usable in this circumstance. The
>comment says "virt_to_page(kaddr) returns a valid pointer if and only if
>virt_addr_valid(kaddr) returns true", which implies that
>virt_addr_valid() returns a meaningful result on all addresses - and if
>not, it should be fixed.
Where did you find this comment? I had no luck grep-ing for it...
In any case, if that's the expectation, then on i386 virt_addr_valid()
must be implemented as something like
#define virt_addr_valid(kaddr) ((kaddr) >= PAGE_OFFSET && (kaddr) < high_memory && pfn_valid(__pa(kaddr) >> PAGE_SHIFT))
x86-64 would need something similar, except that high_memory obviously
must be replaced (or that part could perhaps be left out altogether), and
the un-mapped addresses above the kernel mapping would need to be
filtered out.
Btw., if you look at other architectures, you'll see that most of them use
the same (as you say broken) construct.
Otoh, if that cited statement really holds, then virt_addr_valid() isn't
really expected to do what its name implies: In particular, there are
valid address ranges in kernel space which it wouldn't be permitted to
return true on without significantly complicating the virt_to_page()
implementation (e.g. x86-64's vmalloc and modules areas).
As to the original issue - as long as domains are given enough memory
(i.e. the range of valid pfn-s is large enough), with the current state of
virt_addr_valid() the patch presented ought not to help at all:
This code
static inline void _check_virt(const char *msg, void *v)
{
	unsigned long pfn = __pa(v) >> PAGE_SHIFT;

	printk("%s: %p %08lx %d:%d\n", msg, v, pfn, pfn_valid(pfn),
	       virt_addr_valid(v));
}

static int __init check_virt(void)
{
	_check_virt("null", NULL);
	_check_virt("half", (void *)LONG_MAX);
	_check_virt("hm-p", high_memory - PAGE_SIZE);
	_check_virt("hm-1", high_memory - 1);
	_check_virt("hm", high_memory);
	_check_virt("hm+1", high_memory + 1);
	_check_virt("hm+p", high_memory + PAGE_SIZE);
	_check_virt("km", (void *)__fix_to_virt(FIX_KMAP_BEGIN));
	_check_virt("hv", (void *)HYPERVISOR_VIRT_START);
	return 0;
}
late_initcall(check_virt);
yields a positive indication from virt_addr_valid() on all tested addresses:
<4>null: 00000000 00040000 1:1
<4>half: 7fffffff 000bffff 1:1
<4>hm-p: ed7ff000 0002d7ff 1:1
<4>hm-1: ed7fffff 0002d7ff 1:1
<4>hm: ed800000 0002d800 1:1
<4>hm+1: ed800001 0002d800 1:1
<4>hm+p: ed801000 0002d801 1:1
<4>km: f56fa000 000356fa 1:1
<4>hv: f5800000 00035800 1:1
Jan
Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <[email protected]> 15.10.08 18:23 >>>
>>>>
>> virt_addr_valid() is supposed to be usable in this circumstance. The
>> comment says "virt_to_page(kaddr) returns a valid pointer if and only if
>> virt_addr_valid(kaddr) returns true", which implies that
>> virt_addr_valid() returns a meaningful result on all addresses - and if
>> not, it should be fixed.
>>
>
> Where did you find this comment? I had no luck grep-ing for it...
>
It's in tip.git, which has quite a few changes in this area.
http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=blob;f=include/asm-x86/page.h;h=d4f1d5791fc186f29a9a60d4fe182d80f05038e4;hb=HEAD
http://git.kernel.org/?p=linux/kernel/git/x86/linux-2.6-tip.git;a=blob;f=arch/x86/mm/ioremap.c;h=ae71e11eb3e5e4ddeceadc9128d3afea564f27e0;hb=HEAD
> In any case, if that's the expectation, then on i386 virt_addr_valid()
> must be implemented as something like
>
> #define virt_addr_valid(kaddr) ((kaddr) >= PAGE_OFFSET && (kaddr) < high_memory && pfn_valid(__pa(kaddr) >> PAGE_SHIFT))
>
> x86-64 would need something similar, except that high_memory obviously
> must be replaced (or that part could perhaps be left out altogether), and
> the un-mapped addresses above the kernel mapping would need to be
> filtered out.
>
> Btw., if you look at other architectures, you'll see that most of them use
> the same (as you say broken) construct.
>
> Otoh, if that cited statement really holds, then virt_addr_valid() isn't
> really expected to do what its name implies: In particular, there are
> valid address ranges in kernel space which it wouldn't be permitted to
> return true on without significantly complicating the virt_to_page()
> implementation (e.g. x86-64's vmalloc and modules areas).
>
The current x86-64 implementation is:
bool __virt_addr_valid(unsigned long x)
{
	if (x >= __START_KERNEL_map) {
		x -= __START_KERNEL_map;
		if (x >= KERNEL_IMAGE_SIZE)
			return false;
		x += phys_base;
	} else {
		if (x < PAGE_OFFSET)
			return false;
		x -= PAGE_OFFSET;
		if (system_state == SYSTEM_BOOTING ?
		    x > MAXMEM : !phys_addr_valid(x)) {
			return false;
		}
	}
	return pfn_valid(x >> PAGE_SHIFT);
}
and 32-bit is similar (but simpler, since it doesn't need to worry about a separate kernel mapping).
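For completeness, the 32-bit version looks roughly like the following
(paraphrased; see the ioremap.c link above for the authoritative code):

bool __virt_addr_valid(unsigned long x)
{
	if (x < PAGE_OFFSET)
		return false;
	if (system_state != SYSTEM_BOOTING && is_vmalloc_addr((void *)x))
		return false;
	return pfn_valid((x - PAGE_OFFSET) >> PAGE_SHIFT);
}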
>
> yields a positive indication from virt_addr_valid() on all tested addresses:
>
> <4>null: 00000000 00040000 1:1
> <4>half: 7fffffff 000bffff 1:1
> <4>hm-p: ed7ff000 0002d7ff 1:1
> <4>hm-1: ed7fffff 0002d7ff 1:1
> <4>hm: ed800000 0002d800 1:1
> <4>hm+1: ed800001 0002d800 1:1
> <4>hm+p: ed801000 0002d801 1:1
> <4>km: f56fa000 000356fa 1:1
> <4>hv: f5800000 00035800 1:1
>
It would be interesting to try that with tip.git's version of
__virt_addr_valid(). In the Xen case, all we need is a guarantee that
virt_addr_valid() returns true iff __pa(addr) returns a proper result,
so that we can use the resulting pfn as an index into the pfn->mfn table. I
believe this is what the current implementation does.
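In other words, a minimal sketch of the contract we rely on
(safe_vaddr_to_mfn() is a hypothetical helper):

static unsigned long safe_vaddr_to_mfn(void *vaddr)
{
	unsigned long pfn;

	BUG_ON(!virt_addr_valid(vaddr));	/* the guarantee above */
	pfn = __pa(vaddr) >> PAGE_SHIFT;	/* now a proper pfn */
	return pfn_to_mfn(pfn);			/* safe p2m index */
}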
J
>>> Jeremy Fitzhardinge <[email protected]> 16.10.08 18:10 >>>
>The current x86-64 implementation is:
>
>bool __virt_addr_valid(unsigned long x)
>{
>	if (x >= __START_KERNEL_map) {
>		x -= __START_KERNEL_map;
>		if (x >= KERNEL_IMAGE_SIZE)
>			return false;
This, imo, is still broken (i.e. the name of the function still isn't matched
by the implementation): KERNEL_IMAGE_SIZE is a constant and doesn't
account for the fact that only the real kernel image can be relied upon
to be mapped.
>and 32-bit is similar (but simpler, since it doesn't need to worry about a separate kernel mapping).
This continues to be broken, but not as badly as it used to be - while it
now covers user space and the vmalloc area (I'm unclear why this is
excluded only after booting completed, though), hypervisor space
continues to not be considered here.
But as mentioned before - excluding the vmalloc area seems bogus wrt
the name of the function, but I take it the confusion here is intended.
Jan
Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <[email protected]> 16.10.08 18:10 >>>
>>>>
>> The current x86-64 implementation is:
>>
>> bool __virt_addr_valid(unsigned long x)
>> {
>>	if (x >= __START_KERNEL_map) {
>>		x -= __START_KERNEL_map;
>>		if (x >= KERNEL_IMAGE_SIZE)
>>			return false;
>>
>
> This, imo, is still broken (i.e. the name of the function still isn't matched
> by the implementation): KERNEL_IMAGE_SIZE is a constant and doesn't
> account for the fact that only the real kernel image can be relied upon
> to be mapped.
>
Perhaps, but I don't think it matters too much. Unless you have a tiny
amount of physical memory, locations in the kernel mapping beyond the
actual kernel will still resolve to proper locations in the linear map.
>> and 32-bit is similar (but simpler, since it doesn't need to worry about a separate kernel mapping).
>>
>
> This continues to be broken, but not as badly as it used to be - while it
> now covers user space and the vmalloc area (I'm unclear why this is
> excluded only after booting completed, though), hypervisor space
> continues to not be considered here.
>
> But as mentioned before - excluding the vmalloc area seems bogus wrt
> the name of the function, but I take it the confusion here is intended.
>
I think a strictly correct name for the function would be
can_i_use___pa_on_this_address(vaddr). It isn't
is_this_really_an_addressable_location(vaddr).
J
>>> Jeremy Fitzhardinge <[email protected]> 17.10.08 17:19 >>>
>Jan Beulich wrote:
>>>>> Jeremy Fitzhardinge <[email protected]> 16.10.08 18:10 >>>
>>>>>
>>> The current x86-64 implementation is:
>>>
>>> bool __virt_addr_valid(unsigned long x)
>>> {
>>>	if (x >= __START_KERNEL_map) {
>>>		x -= __START_KERNEL_map;
>>>		if (x >= KERNEL_IMAGE_SIZE)
>>>			return false;
>>>
>>
>> This, imo, is still broken (i.e. the name of the function still isn't matched
>> by the implementation): KERNEL_IMAGE_SIZE is a constant and doesn't
>> account for the fact that only the real kernel image can be relied upon
>> to be mapped.
>>
>
>Perhaps, but I don't think it matters too much. Unless you have a tiny
>amount of physical memory, locations in the kernel mapping beyond the
>actual kernel will still resolve to proper locations in the linear map.
Is e.g. 256Mb tiny? KERNEL_IMAGE_SIZE these days is 512Mb... When it was
40Mb (up until a few releases ago), this indeed wouldn't matter.
Jan
Jan Beulich wrote:
> Is e.g. 256Mb tiny? KERNEL_IMAGE_SIZE these days is 512Mb... When it was
> 40Mb (up until a few releases ago), this indeed wouldn't matter.
>
Well, the real point is that anyone doing a test on such high kernel
addresses is almost certainly buggy anyway. I guess a more precise
statement is that it returns well-defined results for any va the calling
code could reasonably be using, not for any random bit pattern.
But I think we're getting into the weeds here.
J