Booting RHEL 5 i386 in kvm with -no-kvm-irqchip -smp 4 will hang in
udev. I bisected this to a change in the _guest_ kernel:
> commit 95492e4646e5de8b43d9a7908d6177fb737b61f0
> Author: Ingo Molnar <[email protected]>
> Date: Fri Feb 16 01:27:34 2007 -0800
>
> [PATCH] x86: rewrite SMP TSC sync code
>
> make the TSC synchronization code more robust, and unify it between
> x86_64 and i386.
>
> The biggest change is the removal of the 'fix up TSCs' code on x86_64
> and i386, in some rare cases it was /causing/ time-warps on SMP systems.
>
> The new code only checks for TSC asynchronity - and if it can prove a
> time-warp (if it can observe the TSC going backwards when going from
> one CPU to another within a critical section), then the TSC
> clock-source is turned off.
>
> The TSC synchronization-checking code also got moved into a separate
> file.
So, guest kernels prior to this commit will hang in kvm smp; after this
commit they will boot fine.
While the change mentions that it fixes a time warp bug, it also says it
should be rare. So clearly kvm smp tsc handling is buggy. Ingo/Thomas
(or anybody else), do you have any insight as to what kvm could be doing
wrong to trigger this behavior?
--
error compiling committee.c: too many arguments to function
* Avi Kivity <[email protected]> wrote:
> Booting RHEL 5 i386 in kvm with -no-kvm-irqchip -smp 4 will hang in udev.
> I bisected this to a change in the _guest_ kernel:
>
>> commit 95492e4646e5de8b43d9a7908d6177fb737b61f0
>> Author: Ingo Molnar <[email protected]>
>> Date: Fri Feb 16 01:27:34 2007 -0800
>>
>> [PATCH] x86: rewrite SMP TSC sync code
>>
>> make the TSC synchronization code more robust, and unify it between
>> x86_64 and i386.
>>
>> The biggest change is the removal of the 'fix up TSCs' code on x86_64
>> and i386, in some rare cases it was /causing/ time-warps on SMP systems.
>>
>> The new code only checks for TSC asynchronity - and if it can prove a
>> time-warp (if it can observe the TSC going backwards when going from
>> one CPU to another within a critical section), then the TSC
>> clock-source is turned off.
>>
>> The TSC synchronization-checking code also got moved into a separate
>> file.
>
> So, guest kernels prior to this commit will hang in kvm smp; after this
> commit they will boot fine.
>
> While the change mentions that it fixes a time warp bug, it also says
> it should be rare. So clearly kvm smp tsc handling is buggy.
> Ingo/Thomas, (or anybody else), do you have any insight as to what kvm
> can be doing wrong to trigger this behavior?
hm. Those time warps were really small, due to the small imperfections
in the "sync up all CPUs to the same moment and do a WRMSR to clear all
their TSCs" mechanism. I.e. at most a few usec time warps. I really don't
know how that should result in udevd hanging. Can you debug udevd in any
way?
so the only thing that KVM might be doing incorrectly here is the
emulation of the WRMSR that clears the TSC of each vcpu?
Ingo
Ingo Molnar wrote:
>> While the change mentions that it fixes a time warp bug, it also says
>> it should be rare. So clearly kvm smp tsc handling is buggy.
>> Ingo/Thomas, (or anybody else), do you have any insight as to what kvm
>> can be doing wrong to trigger this behavior?
>>
>
> hm. Those time warps were really small, due to the small imperfections
> in the "sync up all CPUs to the same moment and do a WRMSR to clear all
> their TSCs" mechanism. I.e. at most a few usec time warps. I really don't
> know how that should result in udevd hanging. Can you debug udevd in any
> way?
>
>
Adding debug didn't help. I'll try some sysrq keys to see what the
guest thinks is happening.
> so the only thing that KVM might be doing incorrectly here is the
> emulation of the WRMSR that clears the TSC of each vcpu?
>
By inspection, it is correct. Of course I may be missing something, so
I'll write a unit test for it. The emulated wrmsr should also be much
slower than the native one.
--
error compiling committee.c: too many arguments to function
Avi Kivity wrote:
> Ingo Molnar wrote:
>
>>> While the change mentions that it fixes a time warp bug, it also says
>>> it should be rare. So clearly kvm smp tsc handling is buggy.
>>> Ingo/Thomas, (or anybody else), do you have any insight as to what kvm
>>> can be doing wrong to trigger this behavior?
>>>
>>>
>> hm. Those time warps were really small, due to the small imperfections
>> in the "sync up all CPUs to the same moment and do a WRMSR to clear all
>> their TSCs" mechanism. I.e. at most a few usec time warps. I really don't
>> know how that should result in udevd hanging. Can you debug udevd in any
>> way?
>>
>>
>>
>
> Adding debug didn't help. I'll try some sysrq keys to see what the
> guest thinks is happening.
>
>
many udev children are exiting; udevd itself is sleeping:
> udevd S D5DCDF24 2924 573 372 594 629 535 (NOTLB)
> d5dcdf38 00000086 00000002 d5dcdf24 d5dcdf20 00000000 d5dcdefc d6169f68
> d7db7f68 d5dcdf68 00000001 d5dd7560 c13b8a90 749ae8d2 00000002 000326a1
> d5dd7684 c131c700 00000003 d74f8900 892d6946 00000402 ffffffff 00000000
> Call Trace:
> [<c060d2c9>] do_nanosleep+0x3b/0x66
> [<c0439b20>] hrtimer_nanosleep+0x50/0x106
> [<c04397ee>] hrtimer_wakeup+0x0/0x18
> [<c0439c1f>] sys_nanosleep+0x49/0x59
> [<c0404e4c>] syscall_call+0x7/0xb
> [<c0600000>] xfrm_state_find+0x49f/0x51e
So likely sleeping is screwed up somehow (though only on smp).
>> so the only thing that KVM might be doing incorrectly here is the
>> emulation of the WRMSR that clears the TSC of each vcpu?
>>
>>
>
> By inspection, it is correct. Of course I may be missing something, so
> I'll write a unit test for it. It should also be much slower than the
> native wrmsr.
>
>
Testing shows wrmsr and rdtsc function normally.
I'll try pinning the vcpus to cpus and see if that helps.
--
error compiling committee.c: too many arguments to function
Avi Kivity wrote:
>
> Testing shows wrmsr and rdtsc function normally.
>
> I'll try pinning the vcpus to cpus and see if that helps.
>
It does.
--
error compiling committee.c: too many arguments to function
* Avi Kivity <[email protected]> wrote:
> Avi Kivity wrote:
>> Testing shows wrmsr and rdtsc function normally.
>>
>> I'll try pinning the vcpus to cpus and see if that helps.
>>
>
> It does.
do we let the guest read the physical CPU's TSC? That would be trouble.
Ingo
Ingo Molnar wrote:
> * Avi Kivity <[email protected]> wrote:
>
>
>> Avi Kivity wrote:
>>
>>> Testing shows wrmsr and rdtsc function normally.
>>>
>>> I'll try pinning the vcpus to cpus and see if that helps.
>>>
>>>
>> It does.
>>
>
> do we let the guest read the physical CPU's TSC? That would be trouble.
>
>
vmx (and svm) allow us to add an offset to the physical tsc. We set it
on startup to -tsc (so that an rdtsc on boot would return 0), and
massage it on vcpu migration so that guest rdtsc is monotonic.
The net effect is that tsc on a vcpu can experience large forward jumps
and changes in rate, but no negative jumps.
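
A minimal sketch of the offset scheme described above, as a plain C
model rather than actual KVM code. The struct and function names are
hypothetical, and the host TSC values are passed in as arguments so the
logic can be followed without real hardware.

```c
#include <stdint.h>

/* Hypothetical per-vcpu state; the names are illustrative, not KVM's. */
struct vcpu_tsc {
    uint64_t offset;      /* added (mod 2^64) to the host TSC on guest rdtsc */
    uint64_t last_guest;  /* guest TSC recorded when the vcpu left its cpu */
};

/* What the guest sees: host TSC plus the per-vcpu offset. */
static uint64_t guest_rdtsc(const struct vcpu_tsc *v, uint64_t host_tsc)
{
    return host_tsc + v->offset;
}

/* At startup the offset is -tsc, so an rdtsc at boot returns 0. */
static void vcpu_tsc_init(struct vcpu_tsc *v, uint64_t host_tsc)
{
    v->offset = 0 - host_tsc;
    v->last_guest = 0;
}

/* Remember the guest-visible TSC when the vcpu is taken off a cpu. */
static void vcpu_tsc_put(struct vcpu_tsc *v, uint64_t host_tsc)
{
    v->last_guest = guest_rdtsc(v, host_tsc);
}

/*
 * When the vcpu lands on a (possibly different) cpu, bump the offset if
 * the guest would otherwise observe a backward step. Forward jumps are
 * left alone, which is why large forward jumps remain possible.
 */
static void vcpu_tsc_load(struct vcpu_tsc *v, uint64_t new_host_tsc)
{
    uint64_t now = guest_rdtsc(v, new_host_tsc);

    if ((int64_t)(now - v->last_guest) < 0)
        v->offset += v->last_guest - now;
}
```

Note that this only keeps each vcpu's own tsc monotonic; it says nothing
about how the tscs of two different vcpus compare with each other.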
--
error compiling committee.c: too many arguments to function
try this test perhaps in an SMP guest:
http://people.redhat.com/mingo/time-warp-test/time-warp-test.c
you can ignore TSC warps - but no GTOD or CLOCK warps should occur.
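
For readers without the test at hand, the core of a warp check can be
sketched as below. This is not the code of time-warp-test.c; it is a
simplified, single-threaded illustration with hypothetical names that
only counts backward steps in a sequence of clock samples taken in
order. The real test, as far as I understand it, compares readings
taken by several tasks on different CPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count the samples that are smaller than their predecessor and report
 * the largest backward step seen. "samples" holds clock readings in the
 * order they were taken (any unit, as long as it is consistent).
 */
static size_t count_warps(const uint64_t *samples, size_t n,
                          uint64_t *max_warp)
{
    size_t warps = 0;
    uint64_t worst = 0;

    assert(samples != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        if (samples[i] < samples[i - 1]) {
            uint64_t step = samples[i - 1] - samples[i];

            warps++;
            if (step > worst)
                worst = step;
        }
    }
    if (max_warp)
        *max_warp = worst;
    return warps;
}
```

Feeding it readings from gettimeofday() or clock_gettime() collected in
an SMP guest should yield zero warps on a healthy kernel.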
Ingo
Ingo Molnar wrote:
> try this test perhaps in an SMP guest:
>
> http://people.redhat.com/mingo/time-warp-test/time-warp-test.c
>
> you can ignore TSC warps - but no GTOD or CLOCK warps should occur.
>
>
On a broken guest kernel, I see gtod and clock warps. On a good guest
kernel, I do not, presumably because the tsc clocksource is marked as
unstable.
I see tsc warps on both. 8 threads on 4 cpus.
--
error compiling committee.c: too many arguments to function
On Dec 19, 2007 12:27 PM, Avi Kivity <[email protected]> wrote:
> Ingo Molnar wrote:
> > * Avi Kivity <[email protected]> wrote:
> >
> >
> >> Avi Kivity wrote:
> >>
> >>> Testing shows wrmsr and rdtsc function normally.
> >>>
> >>> I'll try pinning the vcpus to cpus and see if that helps.
> >>>
> >>>
> >> It does.
> >>
> >
> > do we let the guest read the physical CPU's TSC? That would be trouble.
> >
> >
>
> vmx (and svm) allow us to add an offset to the physical tsc. We set it
> on startup to -tsc (so that an rdtsc on boot would return 0), and
> massage it on vcpu migration so that guest rdtsc is monotonic.
>
> The net effect is that tsc on a vcpu can experience large forward jumps
> and changes in rate, but no negative jumps.
>
Changes in rate do not sound good. That is possibly what's screwing up
my paravirt clock implementation in smp.
Since the host updates guest time prior to putting vcpu to run, two
vcpus that start running at different times will have different system
values.
Now if the vcpu that started running later probes the time first,
we'll see time going backwards. A constant tsc rate is the only way
my limited mind sees around the problem (besides, obviously, _not_
making the system time per-vcpu).
--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net
"The less confident you are, the more serious you have to act."
Glauber de Oliveira Costa wrote:
> Changes in rate does not sound good. It's possibly what's screwing up
> my paravirt clock implementation in smp.
>
You should renew the timebase on vcpu migration, and hook cpufreq so
that changes in frequency are reflected in the timebase.
> Since the host updates guest time prior to putting vcpu to run, two
> vcpus that start running at different times will have different system
> values.
>
> Now if the vcpu that started running later probes the time first,
> we'll see time going backwards. A constant tsc rate is the only way
> my limited mind sees around the problem (besides, obviously, _not_
> making the system time per-vcpu).
>
I tried disabling frequency scaling (rmmod acpi_cpufreq) but that didn't
help my present problems.
--
error compiling committee.c: too many arguments to function
On Wednesday 19 December 2007 21:02:06 Glauber de Oliveira Costa wrote:
> On Dec 19, 2007 12:27 PM, Avi Kivity <[email protected]> wrote:
> > Ingo Molnar wrote:
> > > * Avi Kivity <[email protected]> wrote:
> > >> Avi Kivity wrote:
> > >>> Testing shows wrmsr and rdtsc function normally.
> > >>>
> > >>> I'll try pinning the vcpus to cpus and see if that helps.
> > >>
> > >> It does.
> > >
> > > do we let the guest read the physical CPU's TSC? That would be trouble.
> >
> > vmx (and svm) allow us to add an offset to the physical tsc. We set it
> > on startup to -tsc (so that an rdtsc on boot would return 0), and
> > massage it on vcpu migration so that guest rdtsc is monotonic.
> >
> > The net effect is that tsc on a vcpu can experience large forward jumps
> > and changes in rate, but no negative jumps.
>
> Changes in rate does not sound good. It's possibly what's screwing up
> my paravirt clock implementation in smp.
Do you mean in the case of VM migration, or just starting them on a single
host?
> Since the host updates guest time prior to putting vcpu to run, two
> vcpus that start running at different times will have different system
> values.
>
> Now if the vcpu that started running later probes the time first,
> we'll see time going backwards. A constant tsc rate is the only way
> my limited mind sees around the problem (besides, obviously, _not_
> making the system time per-vcpu).
Amit Shah wrote:
>
> On Wednesday 19 December 2007 21:02:06 Glauber de Oliveira Costa wrote:
> > On Dec 19, 2007 12:27 PM, Avi Kivity <[email protected]> wrote:
> > > Ingo Molnar wrote:
> > > > * Avi Kivity <[email protected]> wrote:
> > > >> Avi Kivity wrote:
> > > >>> Testing shows wrmsr and rdtsc function normally.
> > > >>>
> > > >>> I'll try pinning the vcpus to cpus and see if that helps.
> > > >>
> > > >> It does.
> > > >
> > > > > do we let the guest read the physical CPU's TSC? That would be
> > > > > trouble.
> > > >
> > > > vmx (and svm) allow us to add an offset to the physical tsc. We
> > > > set it on startup to -tsc (so that an rdtsc on boot would return
> > > > 0), and massage it on vcpu migration so that guest rdtsc is
> > > > monotonic.
> > > >
> > > > The net effect is that tsc on a vcpu can experience large forward
> > > > jumps and changes in rate, but no negative jumps.
> >
> > Changes in rate does not sound good. It's possibly what's screwing up
> > my paravirt clock implementation in smp.
>
> Do you mean in the case of VM migration, or just starting them on a single
> host?
>
It's the cpu preemption stuff on the local host, not VM migration.
>
> > Since the host updates guest time prior to putting vcpu to run, two
> > vcpus that start running at different times will have different system
> > values.
> >
> > Now if the vcpu that started running later probes the time first,
> > we'll see time going backwards. A constant tsc rate is the only way
> > my limited mind sees around the problem (besides, obviously, _not_
> > making the system time per-vcpu).
Dor Laor wrote:
>> > >
>> > > vmx (and svm) allow us to add an offset to the physical tsc. We
>> > > set it on startup to -tsc (so that an rdtsc on boot would return
>> > > 0), and massage it on vcpu migration so that guest rdtsc is
>> > > monotonic.
>> > >
>> > > The net effect is that tsc on a vcpu can experience large forward
>> > > jumps and changes in rate, but no negative jumps.
>> >
>> > Changes in rate does not sound good. It's possibly what's screwing up
>> > my paravirt clock implementation in smp.
>>
>> Do you mean in the case of VM migration, or just starting them on a
>> single host?
>>
> It's the cpu preemption stuff on local host and not VM migration
No, migrating a vcpu to another cpu.
--
error compiling committee.c: too many arguments to function
On Dec 19, 2007 1:41 PM, Avi Kivity <[email protected]> wrote:
> Glauber de Oliveira Costa wrote:
> > Changes in rate does not sound good. It's possibly what's screwing up
> > my paravirt clock implementation in smp.
> >
>
> You should renew the timebase on vcpu migration, and hook cpufreq so
> that changes in frequency are reflected in the timebase.
To be conservative, I do it on every vcpu run, and have every kind of
cpu frequency scaling disabled. And it does not work.
In a trace on the host, I see that vcpu runs happen very often on
vcpu 0 (probably because exits happen often there, so we have to go
back), and comparatively very few times on vcpu 1.
So what's probably happening is: vcpu 1 computes system_time + tsc_delta,
but vcpu 0 has already updated it so many times that the tsc does not
keep up, and it ends up going backwards.
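
A minimal model of that failure mode, assuming each vcpu derives time as
"system_time at its last update plus a scaled tsc delta". All names and
numbers here are hypothetical; the point is only that a small error in
the assumed tsc rate makes a rarely updated vcpu lag behind a freshly
updated one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu record, refreshed by the host on vcpu run. */
struct vcpu_time {
    uint64_t system_time_ns;  /* host time at the last refresh */
    uint64_t tsc_at_refresh;  /* tsc value at the last refresh */
};

/* Host side: stamp the vcpu with the current host time and tsc. */
static void host_refresh(struct vcpu_time *t, uint64_t host_ns, uint64_t tsc)
{
    t->system_time_ns = host_ns;
    t->tsc_at_refresh = tsc;
}

/*
 * Guest side: system_time + tsc_delta, with the delta converted to ns
 * using the rate the guest believes in (cycles per microsecond).
 */
static uint64_t guest_time_ns(const struct vcpu_time *t, uint64_t tsc,
                              uint64_t cycles_per_us)
{
    assert(cycles_per_us != 0);
    return t->system_time_ns +
           (tsc - t->tsc_at_refresh) * 1000 / cycles_per_us;
}
```

For example, take a tsc that really ticks at 2000 cycles/usec while the
conversion assumes 2020. If vcpu 1 was last refreshed at host time 0 and
vcpu 0 at host time 1000000 ns (tsc 2000000), vcpu 0 reads 1000000 ns,
and a read on vcpu 1 slightly later (tsc 2000200) gives only 990198 ns.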
On the host, I'm running the following test at module load time
(Ingo, please tell me if I'm doing something idiotic in it that
compromises my conclusions):
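/* Compare a nominal udelay() of foo usec against the elapsed tsc time. */
static void test(int foo)
{
	u64 start, stop;

	start = native_read_tsc();
	udelay(foo);
	stop = native_read_tsc();
	/* nominal delay in ns minus the delay as measured through the tsc */
	printk("%d Result: %lld\n", foo,
	       (long long)foo * 1000 - (long long)cycles_2_ns(stop - start));
}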
Output is:
30 Result: -126
90 Result: 576
300 Result: 2627
1000 Result: 9381
3000 Result: 28238
5000 Result: 48086
So the delta is expected to get bigger if a vcpu goes a long time
without having its time updated.
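
To make the growth easier to see, the results above can be expressed
relative to the nominal delay. The helper below is a hypothetical
user-space sketch, not part of the module test; it turns a (delay in
usec, difference in ns) pair into parts per million.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Discrepancy in parts per million between the nominal delay (delay_us,
 * in microseconds) and the tsc-derived measurement, given their
 * difference diff_ns in nanoseconds. ppm = diff_ns * 1e6 / (delay_us * 1000),
 * which reduces to diff_ns * 1000 / delay_us. Truncates toward zero.
 */
static int64_t discrepancy_ppm(int64_t delay_us, int64_t diff_ns)
{
    assert(delay_us > 0);
    return diff_ns * 1000 / delay_us;
}
```

Applied to the figures above, the three longest delays (1000, 3000 and
5000 usec) come out at 9381, 9412 and 9617 ppm, i.e. a bit under 1% in
each case, which looks more like a rate mismatch than a fixed overhead.
The shortest delays give different ratios, presumably because constant
overhead dominates there.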
Xen manages to keep the guest tsc stable and steady by doing
synchronization from time to time.
If I'm right about this, of course, we can either:
* put a periodic timer in the host to update the system time from time to time;
* use some sort of global timestamp instead of the per-cpu one; or
* do something akin to what xen does, and still rely on the tsc.
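
One possible reading of the "global timestamp" option, sketched in
portable C11 rather than kernel code: keep a single shared "last value
handed out" and never return anything smaller. The names are
hypothetical, and this only hides backward steps between vcpus; it does
not correct the underlying rate error.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Last timestamp returned to any caller, shared by all vcpus. */
static _Atomic uint64_t last_ns;

/*
 * Take a locally computed time (e.g. system_time + tsc_delta) and clamp
 * it against the global last value, so that no caller ever sees a time
 * earlier than one already returned elsewhere.
 */
static uint64_t global_monotonic_ns(uint64_t local_ns)
{
    uint64_t last = atomic_load(&last_ns);

    do {
        if (local_ns <= last)
            return last;   /* someone already reported a later time */
        /* on failure, "last" is reloaded with the current global value */
    } while (!atomic_compare_exchange_weak(&last_ns, &last, local_ns));

    return local_ns;
}
```

The cost is a shared cache line touched on every time read, which is
the usual trade-off against purely per-cpu time sources.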
Any thoughts?
--
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net
"The less confident you are, the more serious you have to act."