2011-03-10 20:37:21

by Konrad Rzeszutek Wilk

Subject: [PATCH] small bug-fixes

While testing git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git
#linux-next, I found that ballooning of PV guests had stopped working and
traced it down to commit 2f14ddc3a7146ea4cd5a3d1ecd993f85f2e4f948
"xen/setup: Inhibit resource API from using System RAM E820 gaps as PCI mem gaps",
which is correct for the initial domain, but not for true PV guests. This
patch fixes that:

xen/e820: Don't mark balloon memory as E820_UNUSABLE when running as guest.

I found the second bug while running the 2.6.32 pvops tree, after we
switched from using the level_irq handler for VIRQ to the percpu irq
handler. There is a race window that I was able to hit 66% of the time.
With 2.6.38 I can't seem to hit it and I am not sure why, but it still
makes sense to add this patch. The patch in question is:

xen/hvc: Disable probe_irq_on/off from poking the hvc-console IRQ line.


2011-03-10 20:37:20

by Konrad Rzeszutek Wilk

Subject: [PATCH] xen/hvc: Disable probe_irq_on/off from poking the hvc-console IRQ line.

This fixes a particularly nasty race found when using the
Xen hypervisor with the console (hvc) output routed to the
serial port and the serial port receiving data while
probe_irq_off(probe_irq_on()) is running.

Specifically the bug manifests itself with:

[ 4.470693] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 4.470693] IP: [<ffffffff810a8c65>] handle_IRQ_event+0xe/0xc9
..snip..
[ 4.470693] Call Trace:
[ 4.470693] <IRQ>
[ 4.470693] [<ffffffff810aa645>] handle_percpu_irq+0x3c/0x69
[ 4.470693] [<ffffffff8123cda7>] __xen_evtchn_do_upcall+0xfd/0x195
[ 4.470693] [<ffffffff810308cf>] ? xen_restore_fl_direct_end+0x0/0x1
[ 4.470693] [<ffffffff8123d873>] xen_evtchn_do_upcall+0x32/0x47
[ 4.470693] [<ffffffff81034dfe>] xen_do_hypervisor_callback+0x1e/0x30
[ 4.470693] <EOI>
[ 4.470693] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000
[ 4.470693] [<ffffffff8100922a>] ? hypercall_page+0x22a/0x1000
[ 4.470693] [<ffffffff810301c5>] ? xen_force_evtchn_callback+0xd/0xf
[ 4.470693] [<ffffffff810308e2>] ? check_events+0x12/0x20
[ 4.470693] [<ffffffff81030889>] ? xen_irq_enable_direct_end+0x0/0x7
[ 4.470693] [<ffffffff810ab0a0>] ? probe_irq_on+0x8f/0x1d7
[ 4.470693] [<ffffffff812b105e>] ? serial8250_config_port+0x7b7/0x9e6
[ 4.470693] [<ffffffff812ad66c>] ? uart_add_one_port+0x11b/0x305

The bug is triggered by three actors working together:
A). serial8250_config_port calling
probe_irq_off(probe_irq_on())
which starts up and then shuts off all of the IRQ handlers.
These functions sleep, so they run for at least 120 msec.
B). The Xen hypervisor receiving any character on the serial line and
setting the bits in the event channel during this 120 msec window.
C). The hvc API calling 'request_irq' (and hence setting desc->action
to a valid value) much later, when user space opens
/dev/console (hvc_open). To make the console usable during bootup,
the Xen HVC implementation sets the IRQ chip (and correspondingly
the event channel) much earlier. The IRQ chip handler that is used
is handle_percpu_irq (aaca49642b92c8a57d3ca5029a5a94019c7af69f).

Back to the issue. When A) runs, it ends up calling the
xen_percpu_chip's chip->startup twice and chip->shutdown once. Those
are set to default_startup and mask_irq (events.c) respectively.
If B) gets data from the serial port (this seems to depend on which
serial concentrator you use), it sets a pending bit in the event channel.
When A) calls chip->startup(), the masking of the pending bit, the
unmasking of the event channel mask, and the setting of the
upcall_pending flag are all done (since there is data present on the
event channel). If any IRQ handler is called before the 120 msec has
elapsed (Xen has one IRQ handler, which checks the event channel bitmap
to figure out which one to call), we end up calling handle_percpu_irq.
handle_percpu_irq calls desc->action (which is NULL) and we blow up.

Caveats: I could only reproduce this on 2.6.32 pvops. I am not sure
why it does not show up on the 2.6.38 kernel.

probe_irq_on/off has code to skip specific IRQ lines. Marking the
hvc-console IRQ with set_irq_noprobe() makes use of that, so we no
longer have to worry about handle_percpu_irq being called before the
IRQ action handler has been installed.

Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
drivers/tty/hvc/hvc_xen.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
index 3740e32..c35f1a7 100644
--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -177,6 +177,8 @@ static int __init xen_hvc_init(void)
 	}
 	if (xencons_irq < 0)
 		xencons_irq = 0; /* NO_IRQ */
+	else
+		set_irq_noprobe(xencons_irq);
 
 	hp = hvc_alloc(HVC_COOKIE, xencons_irq, ops, 256);
 	if (IS_ERR(hp))
--
1.7.1
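
For readers following the race described in the patch above, here is a
minimal user-space sketch of it. It is not kernel code: every name with
a model_ prefix is hypothetical, the struct is a stand-in and not the
real struct irq_desc, and the only kernel behaviour it assumes is the
one the description relies on, namely that probe_irq_on() starts up
IRQ lines that have no action installed and are not marked noprobe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an IRQ descriptor; not the kernel layout. */
struct model_irq_desc {
    void (*action)(void); /* NULL until request_irq() runs (hvc_open) */
    bool noprobe;         /* models the effect of set_irq_noprobe() */
    bool unmasked;        /* models the event channel being unmasked */
    bool pending;         /* models the pending bit set by the hypervisor */
};

enum model_result { MODEL_IGNORED, MODEL_HANDLED, MODEL_NULL_DEREF };

/* Models probe_irq_on(): start up lines with no action, unless noprobe. */
static void model_probe_irq_on(struct model_irq_desc *desc)
{
    if (desc->action == NULL && !desc->noprobe)
        desc->unmasked = true;
}

/* Models the upcall dispatching a pending, unmasked event. */
static enum model_result model_upcall(struct model_irq_desc *desc)
{
    if (!desc->pending || !desc->unmasked)
        return MODEL_IGNORED;
    desc->pending = false;
    if (desc->action == NULL)
        return MODEL_NULL_DEREF; /* the point where the kernel oopsed */
    desc->action();
    return MODEL_HANDLED;
}

/* Runs actors B) and A) in order, before C) has installed an action. */
static enum model_result model_race(bool noprobe)
{
    struct model_irq_desc desc = {
        .action = NULL, .noprobe = noprobe,
        .unmasked = false, .pending = false,
    };

    desc.pending = true;       /* B) a character arrives on the serial line */
    model_probe_irq_on(&desc); /* A) serial probing starts up the line */
    return model_upcall(&desc);
}

/* Small self-check covering both the unpatched and patched cases. */
static void model_race_self_check(void)
{
    assert(model_race(false) == MODEL_NULL_DEREF);
    assert(model_race(true) == MODEL_IGNORED);
}
```

Calling model_race(false) models the unpatched driver and reaches the
NULL action; model_race(true) models the driver with the IRQ marked
noprobe, where the probe never unmasks the line and the event stays
pending until a handler exists.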

2011-03-10 20:37:39

by Konrad Rzeszutek Wilk

Subject: [PATCH] xen/e820: Don't mark balloon memory as E820_UNUSABLE when running as guest.

If we have a guest that asked for:

memory=1024
maxmem=2048

which means we want 1GB now and pagetables created so that we can expand
up to 2GB, we would have this E820 layout:

[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[ 0.000000] Xen: 0000000000100000 - 0000000080800000 (usable)

Due to patch: "xen/setup: Inhibit resource API from using System RAM E820 gaps as PCI mem gaps."
we would mark the memory past the 1GB mark as unusable, resulting in:

[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
[ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable)
[ 0.000000] Xen: 0000000040000000 - 0000000080800000 (unusable)

which meant that we could no longer balloon up, although we could
balloon the guest down. The fix is to run the code introduced
by the above-mentioned patch only for the initial domain.

We will have to revisit this once we start introducing a modified
E820 for PCI passthrough so that we can utilize the P2M identity code.

Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/xen/setup.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 2a4add9..6e676fa 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -200,7 +200,8 @@ char * __init xen_memory_setup(void)
 			 * used as potential resource for I/O address (happens
 			 * when 'allocate_resource' is called).
 			 */
-			if (delta && end < 0x100000000UL)
+			if (delta &&
+			    (xen_initial_domain() && end < 0x100000000UL))
 				e820_add_region(end, delta, E820_UNUSABLE);
 		}

--
1.7.1
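
The effect of the changed condition in the patch above can be checked
with a small user-space sketch. The model_ names are hypothetical, the
boolean parameter stands in for xen_initial_domain(), and the 4GB
constant is written with a ULL suffix here (the patch as posted uses
UL) so that the model behaves the same on 32-bit builds. The values of
end and delta are taken from the E820 layout in the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_4GB 0x100000000ULL

/* Condition before the patch: applies to every domain. */
static bool model_marks_unusable_old(uint64_t end, uint64_t delta)
{
    return delta != 0 && end < MODEL_4GB;
}

/* Condition after the patch: only the initial domain qualifies. */
static bool model_marks_unusable_new(bool initial_domain,
                                     uint64_t end, uint64_t delta)
{
    return delta != 0 && (initial_domain && end < MODEL_4GB);
}

/* Checks the guest from the description: RAM ends at 1GB, map at 0x80800000. */
static void model_e820_self_check(void)
{
    const uint64_t end = 0x40000000ULL;         /* end of populated RAM */
    const uint64_t delta = 0x80800000ULL - end; /* balloon space above it */

    assert(model_marks_unusable_old(end, delta));         /* old: guest hit */
    assert(!model_marks_unusable_new(false, end, delta)); /* new: guest spared */
    assert(model_marks_unusable_new(true, end, delta));   /* new: dom0 still hit */
}
```

With these inputs the old condition marks the 1GB to 0x80800000 range
as unusable for a PV guest, while the new condition does so only when
the domain is the initial one.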

2011-03-11 15:23:39

by Ian Campbell

Subject: Re: [PATCH] xen/e820: Don't mark balloon memory as E820_UNUSABLE when running as guest.

On Thu, 2011-03-10 at 20:36 +0000, Konrad Rzeszutek Wilk wrote:
> If we have a guest that asked for:
>
> memory=1024
> maxmem=2048
>
> Which means we want 1GB now, and create pagetables so that we can expand
> up to 2GB, we would have this E820 layout:
>
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
> [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
> [ 0.000000] Xen: 0000000000100000 - 0000000080800000 (usable)
>
> Due to patch: "xen/setup: Inhibit resource API from using System RAM E820 gaps as PCI mem gaps."
> we would mark the memory past the 1GB mark as unusable, resulting in:
>
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
> [ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
> [ 0.000000] Xen: 0000000000100000 - 0000000040000000 (usable)
> [ 0.000000] Xen: 0000000040000000 - 0000000080800000 (unusable)
>
> which meant that we could not balloon up anymore. We could
> balloon the guest down. The fix is to run the code introduced
> by the above mentioned patch only for the initial domain.
>
> We will have to revisit this once we start introducing a modified
> E820 for PCI passthrough so that we can utilize the P2M identity code.
>
> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> ---
> arch/x86/xen/setup.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 2a4add9..6e676fa 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -200,7 +200,8 @@ char * __init xen_memory_setup(void)
> * used as potential resource for I/O address (happens
> * when 'allocate_resource' is called).
> */
> - if (delta && end < 0x100000000UL)
> + if (delta &&
> + (xen_initial_domain() && end < 0x100000000UL))

Not a new problem but 0x100000000 will overflow an unsigned long on 32
bit:
CC arch/x86/xen/setup.o
arch/x86/xen/setup.c: In function 'xen_memory_setup':
arch/x86/xen/setup.c:254: warning: integer constant is too large for 'unsigned long' type

I think you want ULL? (end is "unsigned long long").

Ian.
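
To illustrate the concern raised in the review above, here is a small
user-space sketch of the worst case, in which the 4GB constant ends up
truncated to 32 bits. Whether a given compiler truncates the constant
or silently widens it depends on the compiler and language mode, so
this is an assumption made for illustration, not a statement about what
the kernel build actually does. The model_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Worst case: the limit is squeezed into 32 bits and becomes 0. */
static bool model_below_4g_truncated(uint64_t end)
{
    const uint32_t limit = (uint32_t)0x100000000ULL; /* wraps to 0 */

    return end < limit; /* never true, since nothing is below 0 */
}

/* With a ULL constant the comparison is done in 64 bits. */
static bool model_below_4g_ull(uint64_t end)
{
    return end < 0x100000000ULL;
}

/* Compares both variants for an address at 1GB and one at exactly 4GB. */
static void model_suffix_self_check(void)
{
    assert(!model_below_4g_truncated(0x40000000ULL));
    assert(model_below_4g_ull(0x40000000ULL));
    assert(!model_below_4g_ull(0x100000000ULL));
}
```

In the truncated variant the check can never succeed, so the
E820_UNUSABLE region would never be added; the ULL variant matches the
"unsigned long long" type of end and compares as intended.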