2012-11-29 00:54:10

by Dave Hansen

Subject: [RFC][PATCH 1/2] create slow_virt_to_phys()


This is necessary because __pa() does not work on some kinds of
memory, like vmalloc() or the alloc_remap() areas on 32-bit
NUMA systems. We have some functions to do conversions _like_
this in the vmalloc() code (like vmalloc_to_page()), but they
only handle 4k pages. We potentially need to handle all of the
page sizes that the kernel linear mapping uses (4k, 2M, 1G).

In practice, on 32-bit NUMA systems, the percpu areas get stuck
in the alloc_remap() area. Any __pa() call on them will break
and basically return garbage.
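
To see why (a simplified sketch, not the exact kernel macro): on
32-bit x86, __pa() boils down to a constant offset subtraction,
which is only meaningful for addresses inside the kernel linear
mapping:

    /*
     * Simplified sketch of the assumption behind __pa(); the real
     * implementation lives in arch/x86/include/asm/page.h, and
     * "fake_pa" is a made-up name for illustration.
     */
    #define fake_pa(x)  ((unsigned long)(x) - PAGE_OFFSET)

vmalloc(), kmap(), and alloc_remap() memory is mapped through the
page tables instead, so subtracting PAGE_OFFSET from those
addresses produces a meaningless "physical" address.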

This patch introduces a new function slow_virt_to_phys(), which
walks the kernel page tables on x86 and should do precisely
the same logical thing as __pa(), but actually work on a wider
range of memory. It should work on the normal linear mapping,
vmalloc(), kmap(), etc...
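
A usage sketch (hypothetical caller; the point is that it is a
drop-in for __pa() in places where __pa() is unsafe):

    /* addr may be a linear-map, vmalloc(), kmap(), or percpu pointer */
    phys_addr_t phys = slow_virt_to_phys(addr);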


Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h |  1 +
linux-2.6.git-dave/arch/x86/mm/pageattr.c               | 47 ++++++++++++++++
2 files changed, 48 insertions(+)

diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys arch/x86/include/asm/pgtable_types.h
--- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys 2012-11-29 00:26:58.583830776 +0000
+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h 2012-11-29 00:26:58.595830876 +0000
@@ -332,6 +332,7 @@ static inline void update_page_count(int
* as a pte too.
*/
extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern phys_addr_t slow_virt_to_phys(void *__address);

#endif /* !__ASSEMBLY__ */

diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys arch/x86/mm/pageattr.c
--- linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys 2012-11-29 00:26:58.591830844 +0000
+++ linux-2.6.git-dave/arch/x86/mm/pageattr.c 2012-11-29 00:26:58.595830876 +0000
@@ -364,6 +364,53 @@ pte_t *lookup_address(unsigned long addr
EXPORT_SYMBOL_GPL(lookup_address);

/*
+ * This is necessary because __pa() does not work on some
+ * kinds of memory, like vmalloc() or the alloc_remap()
+ * areas on 32-bit NUMA systems. The percpu areas can
+ * end up in this kind of memory, for instance.
+ *
+ * This could be optimized, but it is only intended to be
+ * used at initialization time, and keeping it
+ * unoptimized should increase the testing coverage for
+ * the more obscure platforms.
+ */
+phys_addr_t slow_virt_to_phys(void *__virt_addr)
+{
+ unsigned long virt_addr = (unsigned long)__virt_addr;
+ phys_addr_t phys_addr;
+ unsigned long offset;
+ unsigned int level = -1;
+ unsigned long psize = 0;
+ unsigned long pmask = 0;
+ pte_t *pte;
+
+ pte = lookup_address(virt_addr, &level);
+ BUG_ON(!pte);
+ switch (level) {
+ case PG_LEVEL_4K:
+ psize = PAGE_SIZE;
+ pmask = PAGE_MASK;
+ break;
+ case PG_LEVEL_2M:
+ psize = PMD_PAGE_SIZE;
+ pmask = PMD_PAGE_MASK;
+ break;
+#ifdef CONFIG_X86_64
+ case PG_LEVEL_1G:
+ psize = PUD_PAGE_SIZE;
+ pmask = PUD_PAGE_MASK;
+ break;
+#endif
+ default:
+ BUG();
+ }
+ offset = virt_addr & ~pmask;
+ phys_addr = pte_pfn(*pte) << PAGE_SHIFT;
+ return (phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(slow_virt_to_phys);
+
+/*
* Set the new pmd in all the pgds we know about:
*/
static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
_
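
As a worked example of the masking in slow_virt_to_phys() above
(the virtual address below is made up, and 2M PMD pages assume
PAE or 64-bit):

    /*
     * Hypothetical address mapped by a 2M kernel page:
     *
     *   virt_addr = 0xf75a3000
     *   pmask     = PMD_PAGE_MASK       = 0xffe00000 (32-bit PAE)
     *   offset    = virt_addr & ~pmask  = 0x001a3000
     *   result    = (pte_pfn(*pte) << PAGE_SHIFT) | offset
     */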


2012-11-29 01:00:57

by Dave Hansen

Subject: [RFC][PATCH 2/2] fix kvm's use of __pa() on percpu areas


In short, it is illegal to call __pa() on an address holding
a percpu variable. The times when this actually matters are
pretty obscure (certain 32-bit NUMA systems), but it _does_
happen. It is important to keep KVM guests working on these
systems because the real hardware is getting harder and
harder to find.

This bug first manifested for me as a plain hang at boot
after this message:

CPU 0 irqstacks, hard=f3018000 soft=f301a000

or, sometimes, it would actually make it out to the console:

[ 0.000000] BUG: unable to handle kernel paging request at ffffffff

I eventually traced it down to the KVM async pagefault code.
This can be worked around by disabling that code, either at
compile-time (CONFIG_KVM_GUEST) or on the kernel command-line
(no-kvmapf).

The kvm async pagefault code was injecting page faults into
the guest, which the guest misinterpreted because its
"reason" was not being properly sent from the host.

The guest passes the physical address of a per-cpu async page
fault structure via an MSR to the host. Since __pa() is
broken on percpu data, the physical address it sent was
basically bogus and the host went scribbling on random data.
The guest never saw the real reason for the page fault (it
was injected by the host), assumed that the kernel had taken
a _real_ page fault, and panic()'d.
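
For reference, the registration being fixed boils down to a single
MSR write from the guest; a sketch of the kvm_guest_cpu_init() step
(the real hunks are below):

    /*
     * The guest hands the *physical* address of its per-cpu
     * apf_reason structure to the host.  When that address comes
     * from __pa() on remapped percpu memory, the host delivers
     * fault "reasons" to the wrong physical page.
     */
    u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
    wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);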

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/arch/x86/kernel/kvm.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvm.c
--- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2012-11-29 00:39:59.130213376 +0000
+++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2012-11-29 00:51:55.428091802 +0000
@@ -284,9 +284,9 @@ static void kvm_register_steal_time(void

memset(st, 0, sizeof(*st));

- wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
- printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
- cpu, __pa(st));
+ wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
+ printk(KERN_INFO "kvm-stealtime: cpu %d, msr %llx\n",
+ cpu, (unsigned long long)slow_virt_to_phys(st));
}

static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
@@ -311,7 +311,7 @@ void __cpuinit kvm_guest_cpu_init(void)
return;

if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
- u64 pa = __pa(&__get_cpu_var(apf_reason));
+ u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));

#ifdef CONFIG_PREEMPT
pa |= KVM_ASYNC_PF_SEND_ALWAYS;
@@ -327,7 +327,8 @@ void __cpuinit kvm_guest_cpu_init(void)
/* Size alignment is implied but just to make it explicit. */
BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
__get_cpu_var(kvm_apic_eoi) = 0;
- pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
+ pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi))
+ | KVM_MSR_ENABLED;
wrmsrl(MSR_KVM_PV_EOI_EN, pa);
}

_

2012-11-29 07:47:35

by Gleb Natapov

Subject: Re: [RFC][PATCH 2/2] fix kvm's use of __pa() on percpu areas

On Thu, Nov 29, 2012 at 12:53:56AM +0000, Dave Hansen wrote:
> [...]
> - wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
> + wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
Where is this slow_virt_to_phys() coming from? I do not find it in
kvm.git.

--
Gleb.

2012-11-29 08:34:31

by Gleb Natapov

Subject: Re: [RFC][PATCH 2/2] fix kvm's use of __pa() on percpu areas

On Thu, Nov 29, 2012 at 09:47:30AM +0200, Gleb Natapov wrote:
> On Thu, Nov 29, 2012 at 12:53:56AM +0000, Dave Hansen wrote:
> > [...]
> > - wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
> > + wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
> Where is this slow_virt_to_phys() coming from? I do not find it in
> kvm.git.
>
Disregard this. 1/2 didn't make it to my mailbox.

--
Gleb.