These two patches (one for Linux, one for Xen) allow PVHVM guests to use
the per-cpu VCPU mechanism after migration. Currently, when a PVHVM guest
migrates, all of the per-cpu information is lost and we fall back on the
shared_info structure. This happens regardless of whether the HVM guest has
2 or 128 CPUs. Since that structure has an array for only 32 CPUs, a PVHVM
guest can only be migrated if it has at most 32 VCPUs.
These patches fix that and allow PVHVM Linux guests with more than 32 VCPUs
to be migrated.
The Linux diff is:
arch/x86/xen/enlighten.c | 21 ++++++++++++++++-----
arch/x86/xen/suspend.c | 6 +-----
arch/x86/xen/time.c | 3 +++
3 files changed, 20 insertions(+), 10 deletions(-)
while the Xen one is:
xen/arch/x86/hvm/hvm.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
From: Konrad Rzeszutek Wilk <[email protected]>
When we migrate an HVM guest, by default our shared_info can
only hold up to 32 CPUs. As such the hypercall
VCPUOP_register_vcpu_info was introduced, which allows us to
set up per-page areas for VCPUs. This means we can boot a PVHVM
guest with more than 32 VCPUs. During migration the per-cpu
structure is allocated afresh by the hypervisor (vcpu_info_mfn
is set to INVALID_MFN) so that the newly migrated guest
can make the VCPUOP_register_vcpu_info hypercall again.
Unfortunately we end up triggering this condition:
/* Run this command on yourself or on other offline VCPUS. */
if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
which means we are unable to set up the per-cpu VCPU structures
for running vCPUs. The Linux PV code paths make this work by
iterating over every vCPU and doing the following:
1) check whether the target vCPU is up (VCPUOP_is_up hypercall)
2) if it is up, use VCPUOP_down to pause it
3) VCPUOP_register_vcpu_info
4) if we paused it in step 2, use VCPUOP_up to bring it back up
But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
not allowed for HVM guests, we can't do this. This patch
allows these sub-ops for HVM guests.
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
xen/arch/x86/hvm/hvm.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 38c491e..b5b92fe 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3470,6 +3470,9 @@ static long hvm_vcpu_op(
case VCPUOP_stop_singleshot_timer:
case VCPUOP_register_vcpu_info:
case VCPUOP_register_vcpu_time_memory_area:
+ case VCPUOP_down:
+ case VCPUOP_up:
+ case VCPUOP_is_up:
rc = do_vcpu_op(cmd, vcpuid, arg);
break;
default:
--
1.7.7.6
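For reference, the "dance" described in the commit message maps onto the Linux
hypercall interface roughly as in the sketch below. This is only an
illustration (not part of either patch): error handling is elided and the
wrapper function name is invented.

#include <linux/types.h>            /* bool */
#include <linux/smp.h>              /* smp_processor_id() */
#include <xen/interface/vcpu.h>     /* VCPUOP_*, struct vcpu_register_vcpu_info */
#include <asm/xen/hypercall.h>      /* HYPERVISOR_vcpu_op() */

/* Sketch only: re-register the vcpu_info area for one vCPU after migration. */
static void reregister_vcpu_info(int cpu, struct vcpu_register_vcpu_info *info)
{
	bool other_cpu = (cpu != smp_processor_id());

	/* 1) Is the target vCPU currently up? */
	bool was_up = HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL) > 0;

	/* 2) If it is another, running vCPU, pause it first. */
	if (other_cpu && was_up)
		HYPERVISOR_vcpu_op(VCPUOP_down, cpu, NULL);

	/* 3) Register the new per-vCPU info area (mfn + offset). */
	HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, info);

	/* 4) If we paused it in step 2, bring it back up. */
	if (other_cpu && was_up)
		HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
}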
From: Konrad Rzeszutek Wilk <[email protected]>
When Xen migrates an HVM guest, by default its shared_info can
only hold up to 32 CPUs. As such the hypercall
VCPUOP_register_vcpu_info was introduced, which allows us to
set up per-page areas for VCPUs. This means we can boot a PVHVM
guest with more than 32 VCPUs. During migration the per-cpu
structure is allocated afresh by the hypervisor (vcpu_info_mfn
is set to INVALID_MFN) so that the newly migrated guest
can make the VCPUOP_register_vcpu_info hypercall again.
Unfortunately we end up triggering this condition in Xen:
/* Run this command on yourself or on other offline VCPUS. */
if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
which means we are unable to set up the per-cpu VCPU structures
for running vCPUs. The Linux PV code paths make this work by
iterating over every vCPU and doing the following:
1) check whether the target vCPU is up (VCPUOP_is_up hypercall)
2) if it is up, use VCPUOP_down to pause it
3) VCPUOP_register_vcpu_info
4) if we paused it in step 2, use VCPUOP_up to bring it back up
Historically VCPUOP_down, VCPUOP_is_up, and VCPUOP_up were
not allowed for HVM guests, so we couldn't do this. However, with the
Xen commit "hvm: Support more than 32 VCPUS when migrating."
we now can. As such, first check whether VCPUOP_is_up is actually
available before attempting this dance.
As most of this dance code is already done in 'xen_vcpu_restore',
make it callable on both PV and HVM. This means moving one
of the checks into 'xen_setup_runstate_info'.
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/xen/enlighten.c | 21 ++++++++++++++++-----
arch/x86/xen/suspend.c | 6 +-----
arch/x86/xen/time.c | 3 +++
3 files changed, 20 insertions(+), 10 deletions(-)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 201d09a..af8be96 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -236,12 +236,23 @@ static void xen_vcpu_setup(int cpu)
void xen_vcpu_restore(void)
{
int cpu;
+ bool vcpuops = true;
+ const struct cpumask *mask;
- for_each_possible_cpu(cpu) {
+ mask = xen_pv_domain() ? cpu_possible_mask : cpu_online_mask;
+
+ /* Only Xen 4.5 and higher supports this. */
+ if (HYPERVISOR_vcpu_op(VCPUOP_is_up, smp_processor_id(), NULL) == -ENOSYS)
+ vcpuops = false;
+
+ for_each_cpu(cpu, mask) {
bool other_cpu = (cpu != smp_processor_id());
- bool is_up = HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL);
+ bool is_up = false;
+
+ if (vcpuops)
+ is_up = HYPERVISOR_vcpu_op(VCPUOP_is_up, cpu, NULL);
- if (other_cpu && is_up &&
+ if (vcpuops && other_cpu && is_up &&
HYPERVISOR_vcpu_op(VCPUOP_down, cpu, NULL))
BUG();
@@ -250,7 +261,7 @@ void xen_vcpu_restore(void)
if (have_vcpu_info_placement)
xen_vcpu_setup(cpu);
- if (other_cpu && is_up &&
+ if (vcpuops && other_cpu && is_up &&
HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL))
BUG();
}
@@ -1751,7 +1762,7 @@ void __ref xen_hvm_init_shared_info(void)
for_each_online_cpu(cpu) {
/* Leave it to be NULL. */
if (cpu >= MAX_VIRT_CPUS)
- continue;
+ per_cpu(xen_vcpu, cpu) = NULL; /* Triggers xen_vcpu_setup.*/
per_cpu(xen_vcpu, cpu) = &HYPERVISOR_shared_info->vcpu_info[cpu];
}
}
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 45329c8..6fb3298 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -33,11 +33,7 @@ void xen_arch_hvm_post_suspend(int suspend_cancelled)
xen_hvm_init_shared_info();
xen_callback_vector();
xen_unplug_emulated_devices();
- if (xen_feature(XENFEAT_hvm_safe_pvclock)) {
- for_each_online_cpu(cpu) {
- xen_setup_runstate_info(cpu);
- }
- }
+ xen_vcpu_restore();
#endif
}
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 7b78f88..d4feb2e 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -105,6 +105,9 @@ void xen_setup_runstate_info(int cpu)
{
struct vcpu_register_runstate_memory_area area;
+ if (xen_hvm_domain() && !(xen_feature(XENFEAT_hvm_safe_pvclock)))
+ return;
+
area.addr.v = &per_cpu(xen_runstate, cpu);
if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
--
1.7.7.6
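The Linux side leans on xen_vcpu_setup(), whose body is not shown in the diff.
The following is a simplified sketch of what that function does (based on the
arch/x86/xen/enlighten.c of the time, not the verbatim code; the per-cpu
variables xen_vcpu and xen_vcpu_info are assumed to be declared there):

static void xen_vcpu_setup_sketch(int cpu)
{
	/* The per-cpu vcpu_info area we want Xen to use for this vCPU. */
	struct vcpu_info *vcpup = &per_cpu(xen_vcpu_info, cpu);
	struct vcpu_register_vcpu_info info;
	int err;

	/* Tell Xen the machine frame and in-page offset of that area. */
	info.mfn = arbitrary_virt_to_mfn(vcpup);
	info.offset = offset_in_page(vcpup);

	err = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info);
	if (err == 0)
		/* Success: event and time info now live in the per-cpu area. */
		per_cpu(xen_vcpu, cpu) = vcpup;
	else
		/* Fall back to the 32-entry shared_info array. */
		per_cpu(xen_vcpu, cpu) = &HYPERVISOR_shared_info->vcpu_info[cpu];
}

This is why losing the registration across migration silently limits a PVHVM
guest to the 32 vcpu_info slots in shared_info.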
On 08/04/14 19:25, [email protected] wrote:
> From: Konrad Rzeszutek Wilk <[email protected]>
>
> When we migrate an HVM guest, by default our shared_info can
> only hold up to 32 CPUs. As such the hypercall
> VCPUOP_register_vcpu_info was introduced which allowed us to
> setup per-page areas for VCPUs. This means we can boot PVHVM
> guest with more than 32 VCPUs. During migration the per-cpu
> structure is allocated fresh by the hypervisor (vcpu_info_mfn
> is set to INVALID_MFN) so that the newly migrated guest
> can do make the VCPUOP_register_vcpu_info hypercall.
>
> Unfortunatly we end up triggering this condition:
> /* Run this command on yourself or on other offline VCPUS. */
> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
>
> which means we are unable to setup the per-cpu VCPU structures
> for running vCPUS. The Linux PV code paths make this work by
> iterating over every vCPU with:
>
> 1) is target CPU up (VCPUOP_is_up hypercall?)
> 2) if yes, then VCPUOP_down to pause it.
> 3) VCPUOP_register_vcpu_info
> 4) if it was down, then VCPUOP_up to bring it back up
>
> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
> not allowed on HVM guests we can't do this. This patch
> enables this.
Hmmm, this looks like a very convoluted approach to something that could
be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into
suspension, which means that all vCPUs except vCPU#0 will be in the
cpususpend_handler, see:
http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
Then on resume we unblock the "suspended" CPUs, and the first thing they
do is call cpu_ops.cpu_resume which is basically going to setup the
vcpu_info using VCPUOP_register_vcpu_info. Not sure if something similar
is possible under Linux, but it seems easier and doesn't require any
Xen-side changes.
Roger.
On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
> On 08/04/14 19:25, [email protected] wrote:
> > From: Konrad Rzeszutek Wilk <[email protected]>
> >
> > When we migrate an HVM guest, by default our shared_info can
> > only hold up to 32 CPUs. As such the hypercall
> > VCPUOP_register_vcpu_info was introduced which allowed us to
> > setup per-page areas for VCPUs. This means we can boot PVHVM
> > guest with more than 32 VCPUs. During migration the per-cpu
> > structure is allocated fresh by the hypervisor (vcpu_info_mfn
> > is set to INVALID_MFN) so that the newly migrated guest
> > can do make the VCPUOP_register_vcpu_info hypercall.
> >
> > Unfortunatly we end up triggering this condition:
> > /* Run this command on yourself or on other offline VCPUS. */
> > if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
> >
> > which means we are unable to setup the per-cpu VCPU structures
> > for running vCPUS. The Linux PV code paths make this work by
> > iterating over every vCPU with:
> >
> > 1) is target CPU up (VCPUOP_is_up hypercall?)
> > 2) if yes, then VCPUOP_down to pause it.
> > 3) VCPUOP_register_vcpu_info
> > 4) if it was down, then VCPUOP_up to bring it back up
> >
> > But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
> > not allowed on HVM guests we can't do this. This patch
> > enables this.
>
> Hmmm, this looks like a very convoluted approach to something that could
> be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into
> suspension, which means that all vCPUs except vCPU#0 will be in the
> cpususpend_handler, see:
>
> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
How do you 'suspend' them? If I remember correctly there is a disadvantage to doing
this, as you have to bring all the CPUs "offline". In Linux that means using
stop_machine, which is a pretty big hammer and increases the latency of migration.
>
> Then on resume we unblock the "suspended" CPUs, and the first thing they
> do is call cpu_ops.cpu_resume which is basically going to setup the
> vcpu_info using VCPUOP_register_vcpu_info. Not sure if something similar
> is possible under Linux, but it seems easier and doesn't require any
> Xen-side changes.
>
> Roger.
>
>
On 08/04/14 20:53, Konrad Rzeszutek Wilk wrote:
> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
>> On 08/04/14 19:25, [email protected] wrote:
>>> From: Konrad Rzeszutek Wilk <[email protected]>
>>>
>>> When we migrate an HVM guest, by default our shared_info can
>>> only hold up to 32 CPUs. As such the hypercall
>>> VCPUOP_register_vcpu_info was introduced which allowed us to
>>> setup per-page areas for VCPUs. This means we can boot PVHVM
>>> guest with more than 32 VCPUs. During migration the per-cpu
>>> structure is allocated fresh by the hypervisor (vcpu_info_mfn
>>> is set to INVALID_MFN) so that the newly migrated guest
>>> can do make the VCPUOP_register_vcpu_info hypercall.
>>>
>>> Unfortunatly we end up triggering this condition:
>>> /* Run this command on yourself or on other offline VCPUS. */
>>> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
>>>
>>> which means we are unable to setup the per-cpu VCPU structures
>>> for running vCPUS. The Linux PV code paths make this work by
>>> iterating over every vCPU with:
>>>
>>> 1) is target CPU up (VCPUOP_is_up hypercall?)
>>> 2) if yes, then VCPUOP_down to pause it.
>>> 3) VCPUOP_register_vcpu_info
>>> 4) if it was down, then VCPUOP_up to bring it back up
>>>
>>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
>>> not allowed on HVM guests we can't do this. This patch
>>> enables this.
>>
>> Hmmm, this looks like a very convoluted approach to something that could
>> be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into
>> suspension, which means that all vCPUs except vCPU#0 will be in the
>> cpususpend_handler, see:
>>
>> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
>
> How do you 'suspend' them? If I remember there is a disadvantage of doing
> this as you have to bring all the CPUs "offline". That in Linux means using
> the stop_machine which is pretty big hammer and increases the latency for migration.
In order to suspend them an IPI_SUSPEND is sent to all vCPUs except vCPU#0:
http://fxr.watson.org/fxr/source/kern/subr_smp.c#L289
Which makes all APs call cpususpend_handler, so we know all APs are
stuck in a while loop with interrupts disabled:
http://fxr.watson.org/fxr/source/amd64/amd64/mp_machdep.c#L1459
Then on resume the APs are taken out of the while loop and the first
thing they do before returning from the IPI handler is registering the
new per-cpu vcpu_info area. But I'm not sure this is something that can
be accomplished easily on Linux.
I've tried to local-migrate a FreeBSD PVHVM guest with 33 vCPUs on my
8-way box, and it seems to be working fine :).
Roger.
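To make the FreeBSD flow concrete, the handler behaves roughly like the C
sketch below. This is an illustration of the idea only, not the actual
mp_machdep.c code; the flag and helper names are invented:

static volatile int cpus_suspended;      /* set by the BSP before migration */

static void cpususpend_handler_sketch(int cpu_id)
{
	/* Entered via IPI_SUSPEND, with interrupts disabled. */
	while (cpus_suspended)
		cpu_spinwait();              /* park here while the guest migrates */

	/*
	 * First thing on resume: the vCPU registers its own new vcpu_info
	 * area, so Xen's "self or offline vCPUs only" check is satisfied.
	 */
	xen_register_vcpu_info(cpu_id);      /* invented helper wrapping the hypercall */
}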
>>> On 08.04.14 at 19:25, <[email protected]> wrote:
> + /* Only Xen 4.5 and higher supports this. */
> + if (HYPERVISOR_vcpu_op(VCPUOP_is_up, smp_processor_id(), NULL) == -ENOSYS)
> + vcpuops = false;
Did you mean to say "for HVM guests" in the comment? And of
course the comment could quickly become stale if we backported
the Xen side change to e.g. 4.4.1.
Jan
On Tue, 2014-04-08 at 14:53 -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
> > On 08/04/14 19:25, [email protected] wrote:
> > > From: Konrad Rzeszutek Wilk <[email protected]>
> > >
> > > When we migrate an HVM guest, by default our shared_info can
> > > only hold up to 32 CPUs. As such the hypercall
> > > VCPUOP_register_vcpu_info was introduced which allowed us to
> > > setup per-page areas for VCPUs. This means we can boot PVHVM
> > > guest with more than 32 VCPUs. During migration the per-cpu
> > > structure is allocated fresh by the hypervisor (vcpu_info_mfn
> > > is set to INVALID_MFN) so that the newly migrated guest
> > > can do make the VCPUOP_register_vcpu_info hypercall.
> > >
> > > Unfortunatly we end up triggering this condition:
> > > /* Run this command on yourself or on other offline VCPUS. */
> > > if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
> > >
> > > which means we are unable to setup the per-cpu VCPU structures
> > > for running vCPUS. The Linux PV code paths make this work by
> > > iterating over every vCPU with:
> > >
> > > 1) is target CPU up (VCPUOP_is_up hypercall?)
> > > 2) if yes, then VCPUOP_down to pause it.
> > > 3) VCPUOP_register_vcpu_info
> > > 4) if it was down, then VCPUOP_up to bring it back up
> > >
> > > But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
> > > not allowed on HVM guests we can't do this. This patch
> > > enables this.
> >
> > Hmmm, this looks like a very convoluted approach to something that could
> > be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into
> > suspension, which means that all vCPUs except vCPU#0 will be in the
> > cpususpend_handler, see:
> >
> > http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
>
> How do you 'suspend' them? If I remember there is a disadvantage of doing
> this as you have to bring all the CPUs "offline". That in Linux means using
> the stop_machine which is pretty big hammer and increases the latency for migration.
Yes, this is why the ability to have the toolstack save/restore the
secondary vcpu state was added. It's especially important for
checkpointing, but it's relevant to regular migrate as a performance
improvement too.
It's not just stop-machine; IIRC it's a tonne of udev events relating to
CPUs going off/online etc. too, and all the userspace activity which that
implies.
Ian.
On 09/04/14 10:33, Ian Campbell wrote:
> On Tue, 2014-04-08 at 14:53 -0400, Konrad Rzeszutek Wilk wrote:
>> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
>>> On 08/04/14 19:25, [email protected] wrote:
>>>> From: Konrad Rzeszutek Wilk <[email protected]>
>>>>
>>>> When we migrate an HVM guest, by default our shared_info can
>>>> only hold up to 32 CPUs. As such the hypercall
>>>> VCPUOP_register_vcpu_info was introduced which allowed us to
>>>> setup per-page areas for VCPUs. This means we can boot PVHVM
>>>> guest with more than 32 VCPUs. During migration the per-cpu
>>>> structure is allocated fresh by the hypervisor (vcpu_info_mfn
>>>> is set to INVALID_MFN) so that the newly migrated guest
>>>> can do make the VCPUOP_register_vcpu_info hypercall.
>>>>
>>>> Unfortunatly we end up triggering this condition:
>>>> /* Run this command on yourself or on other offline VCPUS. */
>>>> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
>>>>
>>>> which means we are unable to setup the per-cpu VCPU structures
>>>> for running vCPUS. The Linux PV code paths make this work by
>>>> iterating over every vCPU with:
>>>>
>>>> 1) is target CPU up (VCPUOP_is_up hypercall?)
>>>> 2) if yes, then VCPUOP_down to pause it.
>>>> 3) VCPUOP_register_vcpu_info
>>>> 4) if it was down, then VCPUOP_up to bring it back up
>>>>
>>>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
>>>> not allowed on HVM guests we can't do this. This patch
>>>> enables this.
>>>
>>> Hmmm, this looks like a very convoluted approach to something that could
>>> be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into
>>> suspension, which means that all vCPUs except vCPU#0 will be in the
>>> cpususpend_handler, see:
>>>
>>> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
>>
>> How do you 'suspend' them? If I remember there is a disadvantage of doing
>> this as you have to bring all the CPUs "offline". That in Linux means using
>> the stop_machine which is pretty big hammer and increases the latency for migration.
>
> Yes, this is why the ability to have the toolstack save/restore the
> secondary vcpu state was added. It's especially important for
> checkpointing, but it's relevant to regular migrate as a performance
> improvement too.
>
> It's not just stop-machine, IIRC it's a tonne of udev events relating to
> cpus off/onlinign etc too and all the userspace activity which that
> implies.
Well, what is done on FreeBSD is nothing like that. It is called the
cpususpend handler, but it's not off-lining CPUs or anything like that;
it just places the CPU in a while loop inside an IPI handler, so we
can do something like this with all APs:
while (suspended)
pause();
register_vcpu_info();
So the registration of the vcpu_info area happens just after the CPU is
woken from suspension and before it leaves the IPI handler, and it is the
CPU itself that calls VCPUOP_register_vcpu_info (so we can avoid
the gate in Xen that prevents registering the vcpu_info area for vCPUs
other than the current one).
Roger.
>>> On 08.04.14 at 19:25, <[email protected]> wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -3470,6 +3470,9 @@ static long hvm_vcpu_op(
> case VCPUOP_stop_singleshot_timer:
> case VCPUOP_register_vcpu_info:
> case VCPUOP_register_vcpu_time_memory_area:
> + case VCPUOP_down:
> + case VCPUOP_up:
> + case VCPUOP_is_up:
This, if I checked it properly, leaves only VCPUOP_initialise,
VCPUOP_send_nmi, and VCPUOP_get_physid disallowed for HVM.
None of which look inherently bad to be used by HVM (but
VCPUOP_initialise certainly would need closer checking), so I
wonder whether either the wrapper shouldn't be dropped altogether
or at least be converted from a white list approach to a black list one.
Jan
On Apr 9, 2014 4:03 AM, Jan Beulich <[email protected]> wrote:
>
> >>> On 08.04.14 at 19:25, <[email protected]> wrote:
> > + /* Only Xen 4.5 and higher supports this. */
> > + if (HYPERVISOR_vcpu_op(VCPUOP_is_up, smp_processor_id(), NULL) == -ENOSYS)
> > + vcpuops = false;
>
> Did you mean to say "for HVM guests" in the comment? And of
> course the comment could quickly become stale if we backported
> the Xen side change to e.g. 4.4.1.
>
Right. But I thought that features like that (new hypercalls) weren't applicable for backporting?
Thanks
> Jan
>
>>> On 09.04.14 at 12:15, <[email protected]> wrote:
> On Apr 9, 2014 4:03 AM, Jan Beulich <[email protected]> wrote:
>>
>> >>> On 08.04.14 at 19:25, <[email protected]> wrote:
>> > + /* Only Xen 4.5 and higher supports this. */
>> > + if (HYPERVISOR_vcpu_op(VCPUOP_is_up, smp_processor_id(), NULL) ==
> -ENOSYS)
>> > + vcpuops = false;
>>
>> Did you mean to say "for HVM guests" in the comment? And of
>> course the comment could quickly become stale if we backported
>> the Xen side change to e.g. 4.4.1.
>>
>
> Right. But I thought that features like that (new hypercalls) wasn't
> applicable for backporting?
It's not really a new hypercall, and it fixes a problem, so I'd consider
it a candidate at least for 4.4.1.
Jan
On Wed, Apr 09, 2014 at 10:06:12AM +0100, Jan Beulich wrote:
> >>> On 08.04.14 at 19:25, <[email protected]> wrote:
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -3470,6 +3470,9 @@ static long hvm_vcpu_op(
> > case VCPUOP_stop_singleshot_timer:
> > case VCPUOP_register_vcpu_info:
> > case VCPUOP_register_vcpu_time_memory_area:
> > + case VCPUOP_down:
> > + case VCPUOP_up:
> > + case VCPUOP_is_up:
>
> This, if I checked it properly, leaves only VCPUOP_initialise,
> VCPUOP_send_nmi, and VCPUOP_get_physid disallowed for HVM.
> None of which look inherently bad to be used by HVM (but
> VCPUOP_initialise certainly would need closer checking), so I
> wonder whether either the wrapper shouldn't be dropped altogether
> or at least be converted from a white list approach to a black list one.
I was being conservative here because I did not want to allow the
other ones without at least testing them.
Perhaps that can be done as a separate patch, and this one taken just as
a bug-fix?
>
> Jan
>
On Wed, Apr 09, 2014 at 09:37:01AM +0200, Roger Pau Monné wrote:
> On 08/04/14 20:53, Konrad Rzeszutek Wilk wrote:
> > On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
> >> On 08/04/14 19:25, [email protected] wrote:
> >>> From: Konrad Rzeszutek Wilk <[email protected]>
> >>>
> >>> When we migrate an HVM guest, by default our shared_info can
> >>> only hold up to 32 CPUs. As such the hypercall
> >>> VCPUOP_register_vcpu_info was introduced which allowed us to
> >>> setup per-page areas for VCPUs. This means we can boot PVHVM
> >>> guest with more than 32 VCPUs. During migration the per-cpu
> >>> structure is allocated fresh by the hypervisor (vcpu_info_mfn
> >>> is set to INVALID_MFN) so that the newly migrated guest
> >>> can do make the VCPUOP_register_vcpu_info hypercall.
> >>>
> >>> Unfortunatly we end up triggering this condition:
> >>> /* Run this command on yourself or on other offline VCPUS. */
> >>> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
> >>>
> >>> which means we are unable to setup the per-cpu VCPU structures
> >>> for running vCPUS. The Linux PV code paths make this work by
> >>> iterating over every vCPU with:
> >>>
> >>> 1) is target CPU up (VCPUOP_is_up hypercall?)
> >>> 2) if yes, then VCPUOP_down to pause it.
> >>> 3) VCPUOP_register_vcpu_info
> >>> 4) if it was down, then VCPUOP_up to bring it back up
> >>>
> >>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
> >>> not allowed on HVM guests we can't do this. This patch
> >>> enables this.
> >>
> >> Hmmm, this looks like a very convoluted approach to something that could
> >> be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into
> >> suspension, which means that all vCPUs except vCPU#0 will be in the
> >> cpususpend_handler, see:
> >>
> >> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
> >
> > How do you 'suspend' them? If I remember there is a disadvantage of doing
> > this as you have to bring all the CPUs "offline". That in Linux means using
> > the stop_machine which is pretty big hammer and increases the latency for migration.
>
> In order to suspend them an IPI_SUSPEND is sent to all vCPUs except vCPU#0:
>
> http://fxr.watson.org/fxr/source/kern/subr_smp.c#L289
>
> Which makes all APs call cpususpend_handler, so we know all APs are
> stuck in a while loop with interrupts disabled:
>
> http://fxr.watson.org/fxr/source/amd64/amd64/mp_machdep.c#L1459
>
> Then on resume the APs are taken out of the while loop and the first
> thing they do before returning from the IPI handler is registering the
> new per-cpu vcpu_info area. But I'm not sure this is something that can
> be accomplished easily on Linux.
That is a bit of what the 'stop_machine' would do. It puts all of the
CPUs in whatever function you want. But I am not sure of the latency impact - as
in what if the migration takes longer and all of the CPUs sit there spinning.
Another variant of that is the 'smp_call_function'.
Then when we resume we need a shared mailbox (easily enough done, I think)
to tell us that the migration has finished, and then we need to call
VCPUOP_register_vcpu_info.
But if the migration has taken quite long, I fear that the watchdogs
might kick in and start complaining about the CPUs being stuck, especially
if we are migrating an overcommitted guest.
With this patch, the latency for them to be 'paused', 'initted', and
'unpaused' is, I think, much smaller.
Ugh, let's defer this exercise of using 'smp_call_function' until
the end of the summer and see. That functionality
should be shared with the PV code path, IMHO.
>
> I've tried to local-migrate a FreeBSD PVHVM guest with 33 vCPUs on my
> 8-way box, and it seems to be working fine :).
Awesome!
>
> Roger.
>
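The smp_call_function()/mailbox idea sketched in the message above could,
purely as a hypothetical illustration (not a proposed patch; the
migration_done mailbox and the reuse of xen_vcpu_setup() outside
enlighten.c are assumptions), look something like:

#include <linux/smp.h>
#include <linux/atomic.h>

static atomic_t migration_done;          /* shared mailbox, flipped on resume */

static void xen_park_and_reregister(void *unused)
{
	/* Runs on every other CPU in IPI context with interrupts disabled. */
	while (!atomic_read(&migration_done))
		cpu_relax();

	/* Migration finished: register this CPU's new vcpu_info area. */
	xen_vcpu_setup(smp_processor_id());
}

static void xen_suspend_secondary_cpus(void)
{
	atomic_set(&migration_done, 0);
	/* wait=0: do not block; the other CPUs spin until the mailbox flips. */
	smp_call_function(xen_park_and_reregister, NULL, 0);

	/* ... the toolstack migrates the guest; then, on the boot CPU: ... */
	atomic_set(&migration_done, 1);
}

The drawback, as noted above, is that every secondary CPU spins with
interrupts off for the whole migration, which is exactly what risks tripping
the watchdogs on a long or overcommitted migration.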
>>> On 09.04.14 at 17:27, <[email protected]> wrote:
> On Wed, Apr 09, 2014 at 10:06:12AM +0100, Jan Beulich wrote:
>> >>> On 08.04.14 at 19:25, <[email protected]> wrote:
>> > --- a/xen/arch/x86/hvm/hvm.c
>> > +++ b/xen/arch/x86/hvm/hvm.c
>> > @@ -3470,6 +3470,9 @@ static long hvm_vcpu_op(
>> > case VCPUOP_stop_singleshot_timer:
>> > case VCPUOP_register_vcpu_info:
>> > case VCPUOP_register_vcpu_time_memory_area:
>> > + case VCPUOP_down:
>> > + case VCPUOP_up:
>> > + case VCPUOP_is_up:
>>
>> This, if I checked it properly, leaves only VCPUOP_initialise,
>> VCPUOP_send_nmi, and VCPUOP_get_physid disallowed for HVM.
>> None of which look inherently bad to be used by HVM (but
>> VCPUOP_initialise certainly would need closer checking), so I
>> wonder whether either the wrapper shouldn't be dropped altogether
>> or at least be converted from a white list approach to a black list one.
>
> I was being conservative here because I did not want to allow the
> other ones without at least testing it.
>
> Perhaps that can be done as a seperate patch and this just as
> a bug-fix?
I'm clearly not in favor of the patch as is - minimally I'd want it to
convert the white list to a black list. And once you do this it would
seem rather natural to not pointlessly add entries.
Jan
On 09/04/14 16:34, Konrad Rzeszutek Wilk wrote:
> On Wed, Apr 09, 2014 at 09:37:01AM +0200, Roger Pau Monné wrote:
>> On 08/04/14 20:53, Konrad Rzeszutek Wilk wrote:
>>> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
>>>> On 08/04/14 19:25, [email protected] wrote:
>>>>> From: Konrad Rzeszutek Wilk <[email protected]>
>>>>>
>>>>> When we migrate an HVM guest, by default our shared_info can
>>>>> only hold up to 32 CPUs. As such the hypercall
>>>>> VCPUOP_register_vcpu_info was introduced which allowed us to
>>>>> setup per-page areas for VCPUs. This means we can boot PVHVM
>>>>> guest with more than 32 VCPUs. During migration the per-cpu
>>>>> structure is allocated fresh by the hypervisor (vcpu_info_mfn
>>>>> is set to INVALID_MFN) so that the newly migrated guest
>>>>> can do make the VCPUOP_register_vcpu_info hypercall.
>>>>>
>>>>> Unfortunatly we end up triggering this condition:
>>>>> /* Run this command on yourself or on other offline VCPUS. */
>>>>> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
>>>>>
>>>>> which means we are unable to setup the per-cpu VCPU structures
>>>>> for running vCPUS. The Linux PV code paths make this work by
>>>>> iterating over every vCPU with:
>>>>>
>>>>> 1) is target CPU up (VCPUOP_is_up hypercall?)
>>>>> 2) if yes, then VCPUOP_down to pause it.
>>>>> 3) VCPUOP_register_vcpu_info
>>>>> 4) if it was down, then VCPUOP_up to bring it back up
>>>>>
>>>>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
>>>>> not allowed on HVM guests we can't do this. This patch
>>>>> enables this.
>>>>
>>>> Hmmm, this looks like a very convoluted approach to something that could
>>>> be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into
>>>> suspension, which means that all vCPUs except vCPU#0 will be in the
>>>> cpususpend_handler, see:
>>>>
>>>> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
>>>
>>> How do you 'suspend' them? If I remember there is a disadvantage of doing
>>> this as you have to bring all the CPUs "offline". That in Linux means using
>>> the stop_machine which is pretty big hammer and increases the latency for migration.
>>
>> In order to suspend them an IPI_SUSPEND is sent to all vCPUs except vCPU#0:
>>
>> http://fxr.watson.org/fxr/source/kern/subr_smp.c#L289
>>
>> Which makes all APs call cpususpend_handler, so we know all APs are
>> stuck in a while loop with interrupts disabled:
>>
>> http://fxr.watson.org/fxr/source/amd64/amd64/mp_machdep.c#L1459
>>
>> Then on resume the APs are taken out of the while loop and the first
>> thing they do before returning from the IPI handler is registering the
>> new per-cpu vcpu_info area. But I'm not sure this is something that can
>> be accomplished easily on Linux.
>
> That is a bit of what the 'stop_machine' would do. It puts all of the
> CPUs in whatever function you want. But I am not sure of the latency impact - as
> in what if the migration takes longer and all of the CPUs sit there spinning.
> Another variant of that is the 'smp_call_function'.
I tested stop_machine() on all CPUs during suspend once and it was
awful: 100s of ms of additional downtime.
Perhaps a hand-rolled IPI-and-park-in-handler would be quicker than the full
stop_machine().
David
On Wed, Apr 09, 2014 at 04:38:37PM +0100, David Vrabel wrote:
> On 09/04/14 16:34, Konrad Rzeszutek Wilk wrote:
> > On Wed, Apr 09, 2014 at 09:37:01AM +0200, Roger Pau Monné wrote:
> >> On 08/04/14 20:53, Konrad Rzeszutek Wilk wrote:
> >>> On Tue, Apr 08, 2014 at 08:18:48PM +0200, Roger Pau Monné wrote:
> >>>> On 08/04/14 19:25, [email protected] wrote:
> >>>>> From: Konrad Rzeszutek Wilk <[email protected]>
> >>>>>
> >>>>> When we migrate an HVM guest, by default our shared_info can
> >>>>> only hold up to 32 CPUs. As such the hypercall
> >>>>> VCPUOP_register_vcpu_info was introduced which allowed us to
> >>>>> setup per-page areas for VCPUs. This means we can boot PVHVM
> >>>>> guest with more than 32 VCPUs. During migration the per-cpu
> >>>>> structure is allocated fresh by the hypervisor (vcpu_info_mfn
> >>>>> is set to INVALID_MFN) so that the newly migrated guest
> >>>>> can do make the VCPUOP_register_vcpu_info hypercall.
> >>>>>
> >>>>> Unfortunatly we end up triggering this condition:
> >>>>> /* Run this command on yourself or on other offline VCPUS. */
> >>>>> if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
> >>>>>
> >>>>> which means we are unable to setup the per-cpu VCPU structures
> >>>>> for running vCPUS. The Linux PV code paths make this work by
> >>>>> iterating over every vCPU with:
> >>>>>
> >>>>> 1) is target CPU up (VCPUOP_is_up hypercall?)
> >>>>> 2) if yes, then VCPUOP_down to pause it.
> >>>>> 3) VCPUOP_register_vcpu_info
> >>>>> 4) if it was down, then VCPUOP_up to bring it back up
> >>>>>
> >>>>> But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
> >>>>> not allowed on HVM guests we can't do this. This patch
> >>>>> enables this.
> >>>>
> >>>> Hmmm, this looks like a very convoluted approach to something that could
> >>>> be solved more easily IMHO. What we do on FreeBSD is put all vCPUs into
> >>>> suspension, which means that all vCPUs except vCPU#0 will be in the
> >>>> cpususpend_handler, see:
> >>>>
> >>>> http://svnweb.freebsd.org/base/head/sys/amd64/amd64/mp_machdep.c?revision=263878&view=markup#l1460
> >>>
> >>> How do you 'suspend' them? If I remember there is a disadvantage of doing
> >>> this as you have to bring all the CPUs "offline". That in Linux means using
> >>> the stop_machine which is pretty big hammer and increases the latency for migration.
> >>
> >> In order to suspend them an IPI_SUSPEND is sent to all vCPUs except vCPU#0:
> >>
> >> http://fxr.watson.org/fxr/source/kern/subr_smp.c#L289
> >>
> >> Which makes all APs call cpususpend_handler, so we know all APs are
> >> stuck in a while loop with interrupts disabled:
> >>
> >> http://fxr.watson.org/fxr/source/amd64/amd64/mp_machdep.c#L1459
> >>
> >> Then on resume the APs are taken out of the while loop and the first
> >> thing they do before returning from the IPI handler is registering the
> >> new per-cpu vcpu_info area. But I'm not sure this is something that can
> >> be accomplished easily on Linux.
> >
> > That is a bit of what the 'stop_machine' would do. It puts all of the
> > CPUs in whatever function you want. But I am not sure of the latency impact - as
> > in what if the migration takes longer and all of the CPUs sit there spinning.
> > Another variant of that is the 'smp_call_function'.
>
> I tested stop_machine() on all CPUs during suspend once and it was
> awful: 100s of ms of additional downtime.
Yikes.
>
> Perhaps a hand-rolled IPI-and-park-in-handler would be quicker the full
> stop_machine().
But that is clearly a bigger patch than this little bug-fix.
Do you want to just take this patch as is and then later on I can work on
prototyping the 'IPI-and-park-in-handler'?
On Wed, Apr 09, 2014 at 04:36:53PM +0100, Jan Beulich wrote:
> >>> On 09.04.14 at 17:27, <[email protected]> wrote:
> > On Wed, Apr 09, 2014 at 10:06:12AM +0100, Jan Beulich wrote:
> >> >>> On 08.04.14 at 19:25, <[email protected]> wrote:
> >> > --- a/xen/arch/x86/hvm/hvm.c
> >> > +++ b/xen/arch/x86/hvm/hvm.c
> >> > @@ -3470,6 +3470,9 @@ static long hvm_vcpu_op(
> >> > case VCPUOP_stop_singleshot_timer:
> >> > case VCPUOP_register_vcpu_info:
> >> > case VCPUOP_register_vcpu_time_memory_area:
> >> > + case VCPUOP_down:
> >> > + case VCPUOP_up:
> >> > + case VCPUOP_is_up:
> >>
> >> This, if I checked it properly, leaves only VCPUOP_initialise,
> >> VCPUOP_send_nmi, and VCPUOP_get_physid disallowed for HVM.
> >> None of which look inherently bad to be used by HVM (but
> >> VCPUOP_initialise certainly would need closer checking), so I
> >> wonder whether either the wrapper shouldn't be dropped altogether
> >> or at least be converted from a white list approach to a black list one.
> >
> > I was being conservative here because I did not want to allow the
> > other ones without at least testing it.
> >
> > Perhaps that can be done as a seperate patch and this just as
> > a bug-fix?
>
> I'm clearly not in favor of the patch as is - minimally I'd want it to
> convert the white list to a black list. And once you do this it would
> seem rather natural to not pointlessly add entries.
With this patch:
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index b5b92fe..5eee790 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3455,34 +3455,6 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
}
}
-static long hvm_vcpu_op(
- int cmd, int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
-{
- long rc;
-
- switch ( cmd )
- {
- case VCPUOP_register_runstate_memory_area:
- case VCPUOP_get_runstate_info:
- case VCPUOP_set_periodic_timer:
- case VCPUOP_stop_periodic_timer:
- case VCPUOP_set_singleshot_timer:
- case VCPUOP_stop_singleshot_timer:
- case VCPUOP_register_vcpu_info:
- case VCPUOP_register_vcpu_time_memory_area:
- case VCPUOP_down:
- case VCPUOP_up:
- case VCPUOP_is_up:
- rc = do_vcpu_op(cmd, vcpuid, arg);
- break;
- default:
- rc = -ENOSYS;
- break;
- }
-
- return rc;
-}
-
typedef unsigned long hvm_hypercall_t(
unsigned long, unsigned long, unsigned long, unsigned long, unsigned long,
unsigned long);
@@ -3517,30 +3489,6 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
return compat_memory_op(cmd, arg);
}
-static long hvm_vcpu_op_compat32(
- int cmd, int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
-{
- long rc;
-
- switch ( cmd )
- {
- case VCPUOP_register_runstate_memory_area:
- case VCPUOP_get_runstate_info:
- case VCPUOP_set_periodic_timer:
- case VCPUOP_stop_periodic_timer:
- case VCPUOP_set_singleshot_timer:
- case VCPUOP_stop_singleshot_timer:
- case VCPUOP_register_vcpu_info:
- case VCPUOP_register_vcpu_time_memory_area:
- rc = compat_vcpu_op(cmd, vcpuid, arg);
- break;
- default:
- rc = -ENOSYS;
- break;
- }
-
- return rc;
-}
static long hvm_physdev_op_compat32(
int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
@@ -3563,7 +3511,7 @@ static long hvm_physdev_op_compat32(
static hvm_hypercall_t *const hvm_hypercall64_table[NR_hypercalls] = {
[ __HYPERVISOR_memory_op ] = (hvm_hypercall_t *)hvm_memory_op,
[ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op,
- [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op,
+ HYPERCALL(vcpu_op),
[ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op,
HYPERCALL(xen_version),
HYPERCALL(console_io),
@@ -3583,7 +3531,7 @@ static hvm_hypercall_t *const hvm_hypercall64_table[NR_hypercalls] = {
static hvm_hypercall_t *const hvm_hypercall32_table[NR_hypercalls] = {
[ __HYPERVISOR_memory_op ] = (hvm_hypercall_t *)hvm_memory_op_compat32,
[ __HYPERVISOR_grant_table_op ] = (hvm_hypercall_t *)hvm_grant_table_op_compat32,
- [ __HYPERVISOR_vcpu_op ] = (hvm_hypercall_t *)hvm_vcpu_op_compat32,
+ COMPAT_CALL(vcpu_op),
[ __HYPERVISOR_physdev_op ] = (hvm_hypercall_t *)hvm_physdev_op_compat32,
COMPAT_CALL(xen_version),
HYPERCALL(console_io),
And with an HVM guest poking at the rest of the VCPUOPs - VCPUOP_get_physid,
VCPUOP_initialise, and VCPUOP_send_nmi - either before the vCPU is up or
when it is up - they work.
That is: VCPUOP_get_physid would return -EINVAL; VCPUOP_initialise
would return either -EEXIST or 0, and in either case
the content of the ctxt was full of zeros.
VCPUOP_send_nmi did cause the HVM guest to get an NMI and it spat out
'Dazed and confused'. It also reported corruption:
[ 3.611742] Corrupted low memory at c000fffc (fffc phys) = 00029b00
[ 2.386785] Corrupted low memory at ffff88000000fff8 (fff8 phys) = 2990000000000
Which is odd, because there does not seem to be anything in the
hypervisor path that would cause this.
>
> Jan
>
>>> On 22.04.14 at 20:34, <[email protected]> wrote:
> With this patch:
> [...]
> And with an HVM guest poking at the rest of VCPUOPs: VCPUOP_get_physid,
> VCPUOP_initialise, and VCPUOP_send_nmi - either before the CPU is up or
> when it is up - work.
>
> That is: the VCPUOP_get_physid would return -EINVAL; VCPUOP_initialise
> would return either -EEXIST or 0, and in either case
> the content of the ctxt was full of zeros.
Good.
> The VCPUOP_send_nmi did cause the HVM to get an NMI and it spitted out
> 'Dazed and confused'. It also noticed corruption:
>
> [ 3.611742] Corrupted low memory at c000fffc (fffc phys) = 00029b00
> [ 2.386785] Corrupted low memory at ffff88000000fff8 (fff8 phys) =
> 2990000000000
>
> Which is odd because there does not seem to be anything in the path
> of hypervisor that would cause this.
Indeed. This looks a little like a segment descriptor got modified here
with a descriptor table base of zero and a selector of 0xfff8. That
corruption needs to be hunted down in any case before enabling
VCPUOP_send_nmi for HVM.
Jan