Hi all,
Beginning from commits close to v2.6.25-rc2, running lguest always oopses
the host kernel. Oops is at [1].
Bisection led to the following commit:
commit 37cc8d7f963ba2deec29c9b68716944516a3244f
x86/early_ioremap: don't assume we're using swapper_pg_dir
At the early stages of boot, before the kernel pagetable has been
fully initialized, a Xen kernel will still be running off the
Xen-provided pagetables rather than swapper_pg_dir[]. Therefore,
readback cr3 to determine the base of the pagetable rather than
assuming swapper_pg_dir[].
static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
{
- pgd_t *pgd = &swapper_pg_dir[pgd_index(addr)];
+ /* Don't assume we're using swapper_pg_dir at this point */
+ pgd_t *base = __va(read_cr3());
+ pgd_t *pgd = &base[pgd_index(addr)];
pud_t *pud = pud_offset(pgd, addr);
pmd_t *pmd = pmd_offset(pud, addr);
Trying to analyze the problem, it seems on the guest side of lguest,
%cr3 has a different value from &swapper_pg-dir (which
is AFAIK fine on a pravirt guest):
Putting some debugging messages in early_ioremap_pmd:
/* Appears 3 times */
[ 0.000000] ***************************
[ 0.000000] __va(%cr3) = c0000000, &swapper_pg_dir = c02cc000
[ 0.000000] ***************************
After 8 hours of debugging and staring on lguest code, I noticed something
strange in paravirt_ops->set_pmd hypercall invocation:
static void lguest_set_pmd(pmd_t *pmdp, pmd_t pmdval)
{
*pmdp = pmdval;
lazy_hcall(LHCALL_SET_PMD, __pa(pmdp)&PAGE_MASK,
(__pa(pmdp)&(PAGE_SIZE-1))/4, 0);
}
The first hcall parameter is global pgdir which looks fine. The second
parameter is the pmd index in the pgdir which is suspectful.
AFAIK, calculating the index of pmd does not need a divisoin over four.
Removing the division made lguest work fine again . Patch is at [2].
I am not sure why the division over four existed in the first place. It
seems bogus, maybe the Xen patch just made the problem appear ?
[2]: The patch:
[PATCH] lguest: fix pgdir pmd index cacluation
Remove an error in index calculation which leads to removing
a not existing shadow page table (leading to a Null dereference).
Signed-off-by: Ahmed S. Darwish <[email protected]>
---
diff --git a/arch/x86/lguest/boot.c b/arch/x86/lguest/boot.c
index 5afdde4..6636750 100644
--- a/arch/x86/lguest/boot.c
+++ b/arch/x86/lguest/boot.c
@@ -489,7 +489,7 @@ static void lguest_set_pmd(pmd_t *pmdp, pmd_t pmdval)
{
*pmdp = pmdval;
lazy_hcall(LHCALL_SET_PMD, __pa(pmdp)&PAGE_MASK,
- (__pa(pmdp)&(PAGE_SIZE-1))/4, 0);
+ (__pa(pmdp)&(PAGE_SIZE-1)), 0);
}
/* There are a couple of legacy places where the kernel sets a PTE, but we
[1]: The oops:
[ 9.936880] BUG: unable to handle kernel NULL pointer dereference at 00000ff8
[ 9.938015] IP: [<c0204ef6>] release_pgd+0x6/0x60
[ 9.938379] *pde = 00000000
[ 9.938618] Oops: 0000 [#1]
[ 9.938680] Modules linked in:
[ 9.938680]
[ 9.938680] Pid: 173, comm: lguest Not tainted (2.6.25-rc2-dirty #59)
[ 9.938680] EIP: 0060:[<c0204ef6>] EFLAGS: 00000202 CPU: 0
[ 9.938680] EIP is at release_pgd+0x6/0x60
[ 9.938680] EAX: c7cfe000 EBX: c7d06fb8 ECX: 00000000 EDX: 00000ff8
[ 9.938680] ESI: c7cfe004 EDI: 00000ff8 EBP: 00000000 ESP: c7cebe5c
[ 9.938680] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0058
[ 9.938680] Process lguest (pid: 173, ti=c7cea000 task=c7c94ac0 task.ti=c7cea000)
[ 9.938680] Stack: c7d06fb8 c7cfe004 c7cfe004 00000000 c0204a61 00000000 00000246 00000246
[ 9.938680] c7cbd528 c10ed500 b5d65000 00000002 c02110f4 076a8067 c0151ccc c7cc8668
[ 9.938680] 01d65000 c7cbd528 00000007 c7cc8668 00000000 b5d65000 c01531b8 00000246
[ 9.938680] Call Trace:
[ 9.938680] [<c0204a61>] do_hcall+0x1a1/0x230
[ 9.938680] [<c02110f4>] _spin_unlock+0x14/0x20
[ 9.938680] [<c0151ccc>] follow_page+0x9c/0x190
[ 9.938680] [<c01531b8>] get_user_pages+0x108/0x2c0
[ 9.938680] [<c01bd9af>] copy_to_user+0x3f/0x70
[ 9.938680] [<c0206ae0>] read+0x0/0xb8
[ 9.938680] [<c0204ba9>] do_hypercalls+0xb9/0x2a0
[ 9.938680] [<c0207b5a>] lguest_arch_run_guest+0xfa/0x1d0
[ 9.938680] [<c0204651>] run_guest+0xc1/0x110
[ 9.938680] [<c0206ae0>] read+0x0/0xb8
[ 9.938680] [<c02045bb>] run_guest+0x2b/0x110
[ 9.938680] [<c01655df>] vfs_read+0x8f/0xc0
[ 9.938680] [<c0165ab6>] sys_pread64+0x76/0x80
[ 9.938680] [<c0103864>] sysenter_past_esp+0x6d/0xc5
[ 9.938680] =======================
[ 9.938680] Code: 75 03 5b c3 90 89 d8 e8 39 dc f0 ff 90 8b 1d d8 44 4f c0 c1 e8 0c c1 e0 05 01 d8 5b e9 24 81 f4 ff 8d 74 26 00 55 57 89 d7 56 53 <8b> 02 e8 f3 db f0 ff 90 a8 01 74 3f 8b 07 e8 e7 db f0 ff 90 25
[ 9.938680] EIP: [<c0204ef6>] release_pgd+0x6/0x60 SS:ESP 0058:c7cebe5c
[ 9.939759] ---[ end trace 0cda9e589a597173 ]---
Regards,
--
"Better to light a candle, than curse the darkness"
Ahmed S. Darwish
Homepage: http://darwish.07.googlepages.com
Blog: http://darwish-07.blogspot.com
* Ahmed S. Darwish <[email protected]> wrote:
> I am not sure why the division over four existed in the first place.
> It seems bogus, maybe the Xen patch just made the problem appear ?
i think so - nice detective work and nice fix!
I've picked up your patch into x86.git#testing (until Rusty picks it
up), you can track it the following way:
http://people.redhat.com/mingo/x86.git/README
it has other lguest fixes as well - could you double-check that
x86.git#testing works fine for you as-is? (i've just updated the git
tree so it includes your fix as well)
Ingo
On Sun, Feb 24, 2008 at 05:18:14PM +0100, Ingo Molnar wrote:
>
> * Ahmed S. Darwish <[email protected]> wrote:
>
> > I am not sure why the division over four existed in the first place.
> > It seems bogus, maybe the Xen patch just made the problem appear ?
>
> i think so - nice detective work and nice fix!
>
Thanks!. Your turn-around time is super, which is so nice too ;).
> I've picked up your patch into x86.git#testing (until Rusty picks it
> up), you can track it the following way:
>
> http://people.redhat.com/mingo/x86.git/README
>
> it has other lguest fixes as well - could you double-check that
> x86.git#testing works fine for you as-is? (i've just updated the git
> tree so it includes your fix as well)
>
Ofcourse. I'll send you the results by GMT night.
Regards,
--
"Better to light a candle, than curse the darkness"
Ahmed S. Darwish
Homepage: http://darwish.07.googlepages.com
Blog: http://darwish-07.blogspot.com
On Sun, Feb 24, 2008 at 06:26:53PM +0200, Ahmed S. Darwish wrote:
> On Sun, Feb 24, 2008 at 05:18:14PM +0100, Ingo Molnar wrote:
>
> > I've picked up your patch into x86.git#testing (until Rusty picks it
> > up), you can track it the following way:
> >
> > http://people.redhat.com/mingo/x86.git/README
> >
> > it has other lguest fixes as well - could you double-check that
> > x86.git#testing works fine for you as-is? (i've just updated the git
> > tree so it includes your fix as well)
> >
>
> Ofcourse. I'll send you the results by GMT night.
>
This thread's main bug no longer appears. There's a new bug though,
but it looks nicer than the original ugly bug!.
The new bug does *not* appear in mainline with the same patch. It's
a panic, but this time on the _guest_ side (which is the same host's
kernel).
[ 0.023996] CPU: Intel(R) Pentium(R) M processor 1500MHz stepping 05
[ 0.023996] Kernel panic - not syncing: Kernel compiled for Pentium+, requires TSC feature!
[ 0.023996] Pid: 0, comm: swapper Not tainted 2.6.25-rc2-00815-g3db3a05 #64
[ 0.024996] [<c011bbc7>] panic+0x47/0x120
[ 0.024996] [<c0101b30>] lguest_power_off+0x0/0x20
[ 0.024996] [<c02a88c5>] check_bugs+0x155/0x170
[ 0.024996] [<c02a2b1f>] start_kernel+0x22f/0x2d0
[ 0.024996] [<c02a2390>] unknown_bootoption+0x0/0x1f0
[ 0.024996] [<c0101450>] lguest_restart+0x0/0x20
[ 0.024996] [<c0101450>] lguest_restart+0x0/0x20
[ 0.024996] [<c02a5e22>] lguest_init+0x282/0x290
[ 0.024996] =======================
I'll track it further in my first free time slot.
It seems lguest is very sensitive to x86 tree changes. Could it be
included in a sort of automatic guest-boot testing, please ?
Thank you,
--
"Better to light a candle, than curse the darkness"
Ahmed S. Darwish
Homepage: http://darwish.07.googlepages.com
Blog: http://darwish-07.blogspot.com
Hi Ingo/x86-folks,
On Mon, Feb 25, 2008 at 02:18:16AM +0200, Ahmed S. Darwish wrote:
...
>
> This thread's main bug no longer appears. There's a new bug though,
> but it looks nicer than the original ugly bug!.
>
> The new bug does *not* appear in mainline with the same patch. It's
> a panic, but this time on the _guest_ side (which is the same host's
> kernel).
>
> [ 0.023996] CPU: Intel(R) Pentium(R) M processor 1500MHz stepping 05
> [ 0.023996] Kernel panic - not syncing: Kernel compiled for Pentium+, requires TSC feature!
> [ 0.023996] Pid: 0, comm: swapper Not tainted 2.6.25-rc2-00815-g3db3a05 #64
>
The bug appeared in mainline due to the latest x86 merge. The commit
that caused the bug is not faulty though:
commit 12c247a6719987aad65f83158d2bb3e73c75c1f5
x86: fix boot failure on 486 due to TSC breakage
1. arch/x86/kernel/tsc_32.c:tsc_init() sees !cpu_has_tsc,
so bails and calls setup_clear_cpu_cap(X86_FEATURE_TSC).
2. include/asm-x86/cpufeature.h:setup_clear_cpu_cap(bit) clears
the bit in boot_cpu_data and sets it in cleared_cpu_caps
3. arch/x86/kernel/cpu/common.c:identify_cpu() XORs all caps
in with cleared_cpu_caps
HOWEVER, at this point c->x86_capability correctly has TSC
Off, cleared_cpu_caps has TSC On, so the XOR incorrectly
sets TSC to On in c->x86_capability, with disastrous results.
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f86a3c4..a38aafa 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -504,7 +504,7 @@ void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
/* Clear all flags overriden by options */
for (i = 0; i < NCAPINTS; i++)
- c->x86_capability[i] ^= cleared_cpu_caps[i];
+ c->x86_capability[i] &= ~cleared_cpu_caps[i];
Now the commit fixed everything, and x86_capability shows the right
thing (TSC = off).
On the lguest _guest_ side, 'cpu_has_tsc' is _always_ false (due to
lguest using his own clocksource ?), thus a guest with a pentium+
cpu always panics with:
#ifdef CONFIG_X86_TSC
if (!cpu_has_tsc)
panic("Kernel compiled for Pentium+, requires TSC feature!");
#endif
In older kernels (2.6.23), the problem was also hidden using 'tsc_disable':
tsc.c:
if (!cpu_has_tsc || tsc_disable)
tsc_disable = 1;
bug.c:
#ifdef CONFIG_X86_TSC
if (!cpu_has_tsc && !tsc_disable)
panic("Kernel compiled for Pentium+, requires TSC feature!");
#endif
I've tried solving the problem with several tweaks including something like:
/* If we're running on a pentium+, fake an enabled TSC to bypass
* kernel checks for processor bugs (see x86/cpu/bugs.c) */
if (boot_cpu_data.x86 >= 5)
setup_force_cpu_cap(X86_FEATURE_TSC);
with no luck at all.
Any idea how to solve this problem in a right/mergeable way ?
Thank you,
--
"Better to light a candle, than curse the darkness"
Ahmed S. Darwish
Homepage: http://darwish.07.googlepages.com
Blog: http://darwish-07.blogspot.com
* Ahmed S. Darwish <[email protected]> wrote:
> On the lguest _guest_ side, 'cpu_has_tsc' is _always_ false (due to
> lguest using his own clocksource ?), thus a guest with a pentium+ cpu
> always panics with:
does the "RDTSC" instruction work in an lguest guest? If not, then the
lguest kernel is correct in not exposing it - and then the solution is
to build a non-TSC guest kernel. Does the patch below help?
but if the RDTSC instruction does work in an lguest guest, then the
proper approach would be to expose it in the CPU features.
Ingo
---
arch/x86/Kconfig.cpu | 1 +
1 file changed, 1 insertion(+)
Index: linux-x86.q/arch/x86/Kconfig.cpu
===================================================================
--- linux-x86.q.orig/arch/x86/Kconfig.cpu
+++ linux-x86.q/arch/x86/Kconfig.cpu
@@ -393,6 +393,7 @@ config X86_P6_NOP
config X86_TSC
def_bool y
depends on ((MWINCHIP3D || MWINCHIP2 || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2) && !X86_NUMAQ) || X86_64
+ depends on !LGUEST_GUEST
# this should be set for all -march=.. options where the compiler
# generates cmov.
On Saturday 01 March 2008 06:58:38 Ingo Molnar wrote:
> * Ahmed S. Darwish <[email protected]> wrote:
> > On the lguest _guest_ side, 'cpu_has_tsc' is _always_ false (due to
> > lguest using his own clocksource ?), thus a guest with a pentium+ cpu
> > always panics with:
>
> does the "RDTSC" instruction work in an lguest guest? If not, then the
> lguest kernel is correct in not exposing it - and then the solution is
> to build a non-TSC guest kernel. Does the patch below help?
>
> but if the RDTSC instruction does work in an lguest guest, then the
> proper approach would be to expose it in the CPU features.
>
> Ingo
It does (assuming host support), but when I tried to use the normal TSC code
it poked at other things which a guest couldn't do.
Currently I've hit another issue, but revisiting this is on my TODO. The
cpu_has_tsc panic is overzealous, since we deal without it, but it'd be nice
to use the common TSC clock code.
Thanks,
Rusty.
On Saturday 01 March 2008 06:58:38 Ingo Molnar wrote:
> * Ahmed S. Darwish <[email protected]> wrote:
> > On the lguest _guest_ side, 'cpu_has_tsc' is _always_ false (due to
> > lguest using his own clocksource ?), thus a guest with a pentium+ cpu
> > always panics with:
>
> does the "RDTSC" instruction work in an lguest guest? If not, then the
> lguest kernel is correct in not exposing it - and then the solution is
> to build a non-TSC guest kernel. Does the patch below help?
>
> but if the RDTSC instruction does work in an lguest guest, then the
> proper approach would be to expose it in the CPU features.
Yes, and after some investigation, I came up with two patches. I'll queue
both to you if that's ok, since they're dependent.
When lguest was being developed we didn't have a get_tsc_khz hook, so I had to
kill the TSC cpuid bit. Nowdays, the code can actually be much more natural:
the Host tells us the cpu khz, and if it's zero, we fall back to the
dumb "lguest" clock.
Thanks,
Rusty.
The current tsc_init() clears the TSC feature bit if the TSC khz
cannot be calculated, causing us to panic in
arch/x86/kernel/cpu/bugs.c check_config(). We should simply mark it
unstable.
Frankly, someone should take an axe to this code. mark_tsc_unstable()
not only marks it unstable, but sets tsc_enabled to 0, which seems
redundant but is actually important here because means it won't be
used by sched_clock() either. Perhaps a tristate enum "UNUSABLE,
UNSTABLE, OK" would be clearer, and separate mark_tsc_unstable() and
mark_tsc_broken() functions?
Signed-off-by: Rusty Russell <[email protected]>
diff -r f9a80502dc46 arch/x86/kernel/tsc_32.c
--- a/arch/x86/kernel/tsc_32.c Tue Mar 04 21:06:41 2008 +1100
+++ b/arch/x86/kernel/tsc_32.c Tue Mar 04 22:55:10 2008 +1100
@@ -394,13 +394,15 @@ void __init tsc_init(void)
int cpu;
if (!cpu_has_tsc)
- goto out_no_tsc;
+ return;
cpu_khz = calculate_cpu_khz();
tsc_khz = cpu_khz;
- if (!cpu_khz)
- goto out_no_tsc;
+ if (!cpu_khz) {
+ mark_tsc_unstable("could not calculate TSC khz");
+ return;
+ }
printk("Detected %lu.%03lu MHz processor.\n",
(unsigned long)cpu_khz / 1000,
@@ -433,9 +435,4 @@ void __init tsc_init(void)
tsc_enabled = 1;
clocksource_register(&clocksource_tsc);
-
- return;
-
-out_no_tsc:
- setup_clear_cpu_cap(X86_FEATURE_TSC);
}
Now the TSC code handles a zero return from calculate_cpu_khz(),
lguest can simply pass through the value it gets from the Host: if
non-zero, all the normal TSC code applies.
Otherwise (or if the Host really doesn't support TSC), the clocksource
code will fall back to the slower but reasonable lguest clock.
Signed-off-by: Rusty Russell <[email protected]>
diff -r f9a80502dc46 arch/x86/lguest/boot.c
--- a/arch/x86/lguest/boot.c Tue Mar 04 21:06:41 2008 +1100
+++ b/arch/x86/lguest/boot.c Tue Mar 04 22:55:10 2008 +1100
@@ -84,7 +84,6 @@ struct lguest_data lguest_data = {
.blocked_interrupts = { 1 }, /* Block timer interrupts */
.syscall_vec = SYSCALL_VECTOR,
};
-static cycle_t clock_base;
/*G:037 async_hcall() is pretty simple: I'm quite proud of it really. We have a
* ring buffer of stored hypercalls which the Host will run though next time we
@@ -327,8 +326,8 @@ static void lguest_cpuid(unsigned int *a
case 1: /* Basic feature request. */
/* We only allow kernel to see SSE3, CMPXCHG16B and SSSE3 */
*cx &= 0x00002201;
- /* SSE, SSE2, FXSR, MMX, CMOV, CMPXCHG8B, FPU. */
- *dx &= 0x07808101;
+ /* SSE, SSE2, FXSR, MMX, CMOV, CMPXCHG8B, TSC, FPU. */
+ *dx &= 0x07808111;
/* The Host can do a nice optimization if it knows that the
* kernel mappings (addresses above 0xC0000000 or whatever
* PAGE_OFFSET is set to) haven't changed. But Linux calls
@@ -595,19 +594,25 @@ static unsigned long lguest_get_wallcloc
return lguest_data.time.tv_sec;
}
+/* The TSC is a Time Stamp Counter. The Host tells us what speed it runs at,
+ * or 0 if it's unusable as a reliable clock source. This matches what we want
+ * here: if we return 0 from this function, the x86 TSC clock will not register
+ * itself. */
+static unsigned long lguest_cpu_khz(void)
+{
+ return lguest_data.tsc_khz;
+}
+
+/* If we can't use the TSC, the kernel falls back to our "lguest_clock", where
+ * we read the time value given to us by the Host. */
static cycle_t lguest_clock_read(void)
{
unsigned long sec, nsec;
- /* If the Host tells the TSC speed, we can trust that. */
- if (lguest_data.tsc_khz)
- return native_read_tsc();
-
- /* If we can't use the TSC, we read the time value written by the Host.
- * Since it's in two parts (seconds and nanoseconds), we risk reading
- * it just as it's changing from 99 & 0.999999999 to 100 and 0, and
- * getting 99 and 0. As Linux tends to come apart under the stress of
- * time travel, we must be careful: */
+ /* Since the time is in two parts (seconds and nanoseconds), we risk
+ * reading it just as it's changing from 99 & 0.999999999 to 100 and 0,
+ * and getting 99 and 0. As Linux tends to come apart under the stress
+ * of time travel, we must be careful: */
do {
/* First we read the seconds part. */
sec = lguest_data.time.tv_sec;
@@ -622,26 +627,20 @@ static cycle_t lguest_clock_read(void)
/* Now if the seconds part has changed, try again. */
} while (unlikely(lguest_data.time.tv_sec != sec));
- /* Our non-TSC clock is in real nanoseconds. */
+ /* Our lguest clock is in real nanoseconds. */
return sec*1000000000ULL + nsec;
}
-/* This is what we tell the kernel is our clocksource. */
+/* This is the fallback clocksource: lower priority than the TSC clocksource. */
static struct clocksource lguest_clock = {
.name = "lguest",
- .rating = 400,
+ .rating = 200,
.read = lguest_clock_read,
.mask = CLOCKSOURCE_MASK(64),
.mult = 1 << 22,
.shift = 22,
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};
-
-/* The "scheduler clock" is just our real clock, adjusted to start at zero */
-static unsigned long long lguest_sched_clock(void)
-{
- return cyc2ns(&lguest_clock, lguest_clock_read() - clock_base);
-}
/* We also need a "struct clock_event_device": Linux asks us to set it to go
* off some time in the future. Actually, James Morris figured all this out, I
@@ -712,18 +711,7 @@ static void lguest_time_init(void)
/* Set up the timer interrupt (0) to go to our simple timer routine */
set_irq_handler(0, lguest_time_irq);
- /* Our clock structure looks like arch/x86/kernel/tsc_32.c if we can
- * use the TSC, otherwise it's a dumb nanosecond-resolution clock.
- * Either way, the "rating" is set so high that it's always chosen over
- * any other clocksource. */
- if (lguest_data.tsc_khz)
- lguest_clock.mult = clocksource_khz2mult(lguest_data.tsc_khz,
- lguest_clock.shift);
- clock_base = lguest_clock_read();
clocksource_register(&lguest_clock);
-
- /* Now we've set up our clock, we can use it as the scheduler clock */
- pv_time_ops.sched_clock = lguest_sched_clock;
/* We can't set cpumask in the initializer: damn C limitations! Set it
* here and register our timer device. */
@@ -995,6 +983,7 @@ __init void lguest_init(void)
/* time operations */
pv_time_ops.get_wallclock = lguest_get_wallclock;
pv_time_ops.time_init = lguest_time_init;
+ pv_time_ops.get_cpu_khz = lguest_cpu_khz;
/* Now is a good time to look at the implementations of these functions
* before returning to the rest of lguest_init(). */
On Monday 25 February 2008 02:55:15 Ahmed S. Darwish wrote:
> Hi all,
>
> Beginning from commits close to v2.6.25-rc2, running lguest always oopses
> the host kernel. Oops is at [1].
OK, whatever the guest does should not break the host. So your patch can't be
the whole solution.
> static void lguest_set_pmd(pmd_t *pmdp, pmd_t pmdval)
> {
> *pmdp = pmdval;
> lazy_hcall(LHCALL_SET_PMD, __pa(pmdp)&PAGE_MASK,
> (__pa(pmdp)&(PAGE_SIZE-1))/4, 0);
> }
>
> The first hcall parameter is global pgdir which looks fine. The second
> parameter is the pmd index in the pgdir which is suspectful.
>
> AFAIK, calculating the index of pmd does not need a divisoin over four.
> Removing the division made lguest work fine again . Patch is at [2].
Each pmd is 4 bytes long, so the divide by 4 gives the index into the (top
level) page. ie. a number in the range 0 to 1023.
Indeed, here's the correct fix. Does it work for you?
==
Revert 1ce70c4fac3c3954bd48c035f448793867592bc0, fix real problem.
Ahmed managed to crash the Host in release_pgd(), which cannot be a Guest
bug, and indeed it wasn't.
The bug was that handing a 0 as the address of the toplevel page table
being manipulated can cause the lookup code in find_pgdir() to return
an uninitialized cache entry (we shadow up to 4 top level page tables
for each Guest).
Commit 37cc8d7f963ba2deec29c9b68716944516a3244f introduced this
behaviour in the Guest, uncovering the bug.
The patch which he submitted (which removed the /4 from the index
calculation) simply ensured that these high-indexed entries hit the
early exit path of guest_set_pmd(). But you get lots of segfaults in
guest userspace as the PMDs aren't being updated.
Signed-off-by: Rusty Russell <[email protected]>
diff -r 26838579cab1 arch/x86/lguest/boot.c
--- a/arch/x86/lguest/boot.c Tue Mar 04 22:55:11 2008 +1100
+++ b/arch/x86/lguest/boot.c Tue Mar 04 23:43:10 2008 +1100
@@ -480,7 +480,7 @@ static void lguest_set_pmd(pmd_t *pmdp,
{
*pmdp = pmdval;
lazy_hcall(LHCALL_SET_PMD, __pa(pmdp)&PAGE_MASK,
- (__pa(pmdp)&(PAGE_SIZE-1)), 0);
+ (__pa(pmdp)&(PAGE_SIZE-1))/4, 0);
}
/* There are a couple of legacy places where the kernel sets a PTE, but we
diff -r 26838579cab1 drivers/lguest/page_tables.c
--- a/drivers/lguest/page_tables.c Tue Mar 04 22:55:11 2008 +1100
+++ b/drivers/lguest/page_tables.c Tue Mar 04 23:43:10 2008 +1100
@@ -391,7 +391,7 @@ static unsigned int find_pgdir(struct lg
{
unsigned int i;
for (i = 0; i < ARRAY_SIZE(lg->pgdirs); i++)
- if (lg->pgdirs[i].gpgdir == pgtable)
+ if (lg->pgdirs[i].pgdir && lg->pgdirs[i].gpgdir == pgtable)
break;
return i;
}
On Tue, Mar 04, 2008 at 11:55:10PM +1100, Rusty Russell wrote:
> On Monday 25 February 2008 02:55:15 Ahmed S. Darwish wrote:
> > Hi all,
> >
> > Beginning from commits close to v2.6.25-rc2, running lguest always oopses
> > the host kernel. Oops is at [1].
>
> OK, whatever the guest does should not break the host. So your patch can't be
> the whole solution.
>
> > static void lguest_set_pmd(pmd_t *pmdp, pmd_t pmdval)
> > {
> > *pmdp = pmdval;
> > lazy_hcall(LHCALL_SET_PMD, __pa(pmdp)&PAGE_MASK,
> > (__pa(pmdp)&(PAGE_SIZE-1))/4, 0);
> > }
> >
> > The first hcall parameter is global pgdir which looks fine. The second
> > parameter is the pmd index in the pgdir which is suspectful.
> >
> > AFAIK, calculating the index of pmd does not need a divisoin over four.
> > Removing the division made lguest work fine again . Patch is at [2].
>
> Each pmd is 4 bytes long, so the divide by 4 gives the index into the (top
> level) page. ie. a number in the range 0 to 1023.
>
aha, now it's clear.
>
> Indeed, here's the correct fix. Does it work for you?
>
Yes it does (after removing the unrelated TSC-if-pentium+ checks).
Thank you
--
"Better to light a candle, than curse the darkness"
Ahmed S. Darwish
Homepage: http://darwish.07.googlepages.com
Blog: http://darwish-07.blogspot.com