2014-02-11 18:35:00

by Stefan Bader

Subject: Another preempt folding issue?

Hi Peter,

I am currently looking at a weird issue that manifests itself when trying to run
kvm-enabled qemu on an i386 host (v3.13 kernel; oh, and potentially important: the
cpu is 64bit capable, so qemu-system-x86_64 is called). Sooner or later this
causes softlockup messages on the host. I tracked this down to __vcpu_run in
arch/x86/kvm/x86.c, which runs a loop that in this case never seems to make
progress or exit.

What I found is that vcpu_enter_guest will exit quickly, without causing the loop
to exit, when need_resched() is true. Looking at a crash dump I took, this was
the case (thread_info->flags had TIF_NEED_RESCHED set). So after immediately
returning, __vcpu_run has the following code:

	if (need_resched()) {
		srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
		kvm_resched(vcpu); // now cond_resched();
		vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
	}

The kvm_resched basically ends up doing a cond_resched(), which now checks
preempt_count() to be 0. If it is zero it will do the reschedule; otherwise it
just does nothing. Looking at the percpu variables in the dump, I saw that
the preempt_count was 0x80000000 (actually it was 0x80110000, but that was me
triggering the kexec crashdump with sysrq-c).

I saw that there have been some changes in the upstream kernel and have picked
the following:
1) x86, acpi, idle: Restructure the mwait idle routines
2) x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
3) sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
4) sched/preempt/x86: Fix voluntary preempt for x86

Patches 1) and 2) are dependencies of 3) (to get the mwait function correct and
moved to the other file). Finally, 4) fixes up 3). [Maybe worth suggesting for
3.13.y stable.]

Still, with all those applied I got the softlockup. Since I knew from the dump info
that something is wrong with the folding, I took the pragmatic approach and added
the following:

	if (need_resched()) {
		srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
+		preempt_fold_need_resched();
		kvm_resched(vcpu); // now cond_resched();
		vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
	}

And this lets the kvm guest run without the softlockups! However, I am less than
convinced that this is the right thing to do. Somehow, something done when
converting the preempt_count to a percpu variable has caused at least the i386 side
to get into this mess (as there has not been any whining about 64bit). I just fail
to see what.

-Stefan


Attachments:
signature.asc (901.00 B)
OpenPGP digital signature

2014-02-11 19:46:14

by Peter Zijlstra

Subject: Re: Another preempt folding issue?

On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote:
> Hi Peter,
>
> I am currently looking at a weird issue that manifest itself when trying to run
> kvm enabled qemu on a i386 host (v3.13 kernel, oh and potentially important the
> cpu is 64bit capable, so qemu-system-x86_64 is called).

AMD or Intel machine?

> Sooner or later this
> causes softlockup messages on the host. I tracked this down to __vcpu_run in
> arch/x86/kvm/x86.c which does a loop which in that case never seems to make
> progress or exit.
>
> What I found is that vcpu_enter_guest will exit quickly without causing the loop
> to exit when need_resched() is true. Looking at a crash dump I took, this was
> the case (thread_info->flags had TIF_NEED_RESCHED set). So after immediately
> returning __vcpu_run has the following code:
>
> if (need_resched()) {
> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> kvm_resched(vcpu); // now cond_resched();
> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> }
>
> The kvm_resched basically would end up doing a cond_resched() which now checks
> preempt_count() to be 0. If that is zero it will do the reschedule, otherwise it
> just does nothing. Looking at the percpu variables in the dump, I saw that
> the preempt_count was 0x8000000 (actually it was 0x80110000 but that was me
> triggering the kexec crashdump with sysrq-c).
>
> I saw that there have been some changes in the upstream kernel and have picked
> the following:
> 1) x86, acpi, idle: Restructure the mwait idle routines
> 2) x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
> 3) sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
> 4) sched/preempt/x86: Fix voluntary preempt for x86
>
> Patch 1) and 2) as dependencies of 3) (to get the mwait function correct and to
> the other file). Finally 4) is fixing up 3). [maybe worth suggesting to do for
> 3.13.y stable].
>
> Still, with all those I got the softlockup. Since I knew from the dump info that
> something is wrong with the folding, I made the pragmatic approach and added the
> following:
>
> if (need_resched()) {
> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> + preempt_fold_need_resched();
> kvm_resched(vcpu); // now cond_resched();
> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> }
>
> And this lets the kvm guest run without the softlockups! However I am less than
> convinced that this is the right thing to do. Somehow something done when
> converting the preempt_count into percpu has caused at least the i386 side to
> get into this mess (as there has not been any whining about 64bit). Just fail to
> see what.

I've been looking at the same thing too; I've got a trace from someone
who can reproduce it, and it's just not making sense.

Looks like the hardware is losing an interrupt (or worse).

With the below patch on top of current -git (might be whitespace damaged
due to copy-paste).

---
arch/x86/kvm/x86.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 39c28f09dfd5..f8b2fc7ce491 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6115,7 +6115,11 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 		}
 		if (need_resched()) {
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
-			cond_resched();
+			if (need_resched() != should_resched()) {
+				tracing_stop();
+				printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
+			}
+			schedule();
 			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
 		}
 	}

---

I got the following from an Athlon X2 (fam 0xf):

On Mon, Feb 10, 2014 at 05:23:15PM -0200, Marcelo Tosatti wrote:
> What is the problem exactly?

From the trace:

qemu-system-x86-2455 [001] d.s3 393.630586: smp_reschedule_interrupt <-reschedule_interrupt
qemu-system-x86-2455 [001] d.s3 393.630586: scheduler_ipi <-smp_reschedule_interrupt
qemu-system-x86-2455 [001] ..s3 393.630586: preempt_count_sub <-_raw_spin_unlock_irqrestore
qemu-system-x86-2455 [001] ..s2 393.630586: _raw_spin_lock_irq <-run_timer_softirq
qemu-system-x86-2455 [001] d.s2 393.630586: preempt_count_add <-_raw_spin_lock_irq
qemu-system-x86-2455 [001] d.s3 393.630586: _raw_spin_unlock_irq <-run_timer_softirq
qemu-system-x86-2455 [001] ..s3 393.630586: preempt_count_sub <-_raw_spin_unlock_irq
qemu-system-x86-2455 [001] ..s2 393.630586: rcu_bh_qs <-__do_softirq
qemu-system-x86-2455 [001] d.s2 393.630586: irqtime_account_irq <-__do_softirq
qemu-system-x86-2455 [001] d.s2 393.630586: __local_bh_enable <-__do_softirq
qemu-system-x86-2455 [001] d.s2 393.630586: preempt_count_sub <-__local_bh_enable


dmesg-2458 [000] d.s5 393.630614: resched_task <-check_preempt_wakeup
dmesg-2458 [000] d.s5 393.630614: native_smp_send_reschedule <-resched_task
dmesg-2458 [000] d.s5 393.630614: default_send_IPI_mask_logical <-native_smp_send_reschedule


qemu-system-x86-2455 [001] .n.. 393.630636: preempt_count_add <-kmap_atomic_prot
qemu-system-x86-2455 [001] .n.1 393.630639: __kunmap_atomic <-clear_huge_page
qemu-system-x86-2455 [001] .n.1 393.630639: preempt_count_sub <-__kunmap_atomic
qemu-system-x86-2455 [001] .n.. 393.630639: _cond_resched <-clear_huge_page
qemu-system-x86-2455 [001] .n.. 393.630640: kmap_atomic <-clear_huge_page
qemu-system-x86-2455 [001] .n.. 393.630640: kmap_atomic_prot <-kmap_atomic
qemu-system-x86-2455 [001] .n.. 393.630640: preempt_count_add <-kmap_atomic_prot


The resched_task() in the middle does:

	set_tsk_need_resched();
	smp_mb();
	smp_send_reschedule();

This means that the remote cpu must observe TIF_NEED_RESCHED (.n.. in
the traces) when the IPI (smp_reschedule_interrupt) hits.

Now given the machine has an unsynchronized TSC, the order in the above is
not necessarily the true order, but there are only two possible scenarios:

1) it's the right order; the smp_reschedule_interrupt() is from a
previous resched_task()/rcu kick or any of the other users that send
it.

In which case we should get another smp_reschedule_interrupt() _after_
the resched_task(), which should observe the 'n' and promote it to an
'N'.

This doesn't happen and we're stuck with the 'n' state..

2) it's the wrong order and the smp_reschedule_interrupt() actually
happened after the resched_task(), but in that case we should have
observed the 'n'; we did not.

Now given that 2) would mean a bug in the coherency fabric and 1) would
mean a lost interrupt, 1) seems like the more likely suspect.

Joerg wanted the extra kvm tracepoints to see what the vm-exit reason
is.

2014-02-12 08:20:35

by Stefan Bader

Subject: Re: Another preempt folding issue?

On 11.02.2014 20:45, Peter Zijlstra wrote:
> On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote:
>> Hi Peter,
>>
>> I am currently looking at a weird issue that manifest itself when trying to run
>> kvm enabled qemu on a i386 host (v3.13 kernel, oh and potentially important the
>> cpu is 64bit capable, so qemu-system-x86_64 is called).
>
> AMD or Intel machine?

Personally I am reproducing this on an AMD:

cpu family : 15
model : 72
model name : AMD Turion(tm) 64 X2 Mobile Technology TL-64

but the reports I got are showing the same issue on an Intel i7

cpu family : 6
model : 37
model name : Intel(R) Core(TM) i7 CPU M 620 @ 2.67GHz

Another reporter also saw this on an AMD and said it could not be reproduced on
the same hardware and the same software versions when using 64bit instead of 32.

In my case on a 32bit installation I will see this on every invocation of

qemu-system-x86_64 -cdrom <32bit live iso> -boot d -m 512

>
>> Sooner or later this
>> causes softlockup messages on the host. I tracked this down to __vcpu_run in
>> arch/x86/kvm/x86.c which does a loop which in that case never seems to make
>> progress or exit.
>>
>> What I found is that vcpu_enter_guest will exit quickly without causing the loop
>> to exit when need_resched() is true. Looking at a crash dump I took, this was
>> the case (thread_info->flags had TIF_NEED_RESCHED set). So after immediately
>> returning __vcpu_run has the following code:
>>
>> if (need_resched()) {
>> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>> kvm_resched(vcpu); // now cond_resched();
>> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>> }
>>
>> The kvm_resched basically would end up doing a cond_resched() which now checks
>> preempt_count() to be 0. If that is zero it will do the reschedule, otherwise it
>> just does nothing. Looking at the percpu variables in the dump, I saw that
>> the preempt_count was 0x8000000 (actually it was 0x80110000 but that was me
>> triggering the kexec crashdump with sysrq-c).
>>
>> I saw that there have been some changes in the upstream kernel and have picked
>> the following:
>> 1) x86, acpi, idle: Restructure the mwait idle routines
>> 2) x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
>> 3) sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
>> 4) sched/preempt/x86: Fix voluntary preempt for x86
>>
>> Patch 1) and 2) as dependencies of 3) (to get the mwait function correct and to
>> the other file). Finally 4) is fixing up 3). [maybe worth suggesting to do for
>> 3.13.y stable].
>>
>> Still, with all those I got the softlockup. Since I knew from the dump info that
>> something is wrong with the folding, I made the pragmatic approach and added the
>> following:
>>
>> if (need_resched()) {
>> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>> + preempt_fold_need_resched();
>> kvm_resched(vcpu); // now cond_resched();
>> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>> }
>>
>> And this lets the kvm guest run without the softlockups! However I am less than
>> convinced that this is the right thing to do. Somehow something done when
>> converting the preempt_count into percpu has caused at least the i386 side to
>> get into this mess (as there has not been any whining about 64bit). Just fail to
>> see what.
>
> I've been looking at the same thing too; I've got a trace from someone
> who can reproduce and its just not making sense.
>
> Looks like the hardware is loosing an interrupt (or worse).
>
> With the below patch on top of current -git (might be whitespace damaged
> due to copy-paste).
>
> ---
> arch/x86/kvm/x86.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 39c28f09dfd5..f8b2fc7ce491 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6115,7 +6115,11 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
> }
> if (need_resched()) {
> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> - cond_resched();
> + if (need_resched() != should_resched()) {
> + tracing_stop();
> + printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
> + }
> + schedule();
> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> }
> }
>
> ---
>
> I got the following from an Athlon X2 (fam 0xf):
>
> On Mon, Feb 10, 2014 at 05:23:15PM -0200, Marcelo Tosatti wrote:
>> What is the problem exactly?
>
> From the trace:
>
> qemu-system-x86-2455 [001] d.s3 393.630586: smp_reschedule_interrupt <-reschedule_interrupt
> qemu-system-x86-2455 [001] d.s3 393.630586: scheduler_ipi <-smp_reschedule_interrupt
> qemu-system-x86-2455 [001] ..s3 393.630586: preempt_count_sub <-_raw_spin_unlock_irqrestore
> qemu-system-x86-2455 [001] ..s2 393.630586: _raw_spin_lock_irq <-run_timer_softirq
> qemu-system-x86-2455 [001] d.s2 393.630586: preempt_count_add <-_raw_spin_lock_irq
> qemu-system-x86-2455 [001] d.s3 393.630586: _raw_spin_unlock_irq <-run_timer_softirq
> qemu-system-x86-2455 [001] ..s3 393.630586: preempt_count_sub <-_raw_spin_unlock_irq
> qemu-system-x86-2455 [001] ..s2 393.630586: rcu_bh_qs <-__do_softirq
> qemu-system-x86-2455 [001] d.s2 393.630586: irqtime_account_irq <-__do_softirq
> qemu-system-x86-2455 [001] d.s2 393.630586: __local_bh_enable <-__do_softirq
> qemu-system-x86-2455 [001] d.s2 393.630586: preempt_count_sub <-__local_bh_enable
>
>
> dmesg-2458 [000] d.s5 393.630614: resched_task <-check_preempt_wakeup
> dmesg-2458 [000] d.s5 393.630614: native_smp_send_reschedule <-resched_task
> dmesg-2458 [000] d.s5 393.630614: default_send_IPI_mask_logical <-native_smp_send_reschedule
>
>
> qemu-system-x86-2455 [001] .n.. 393.630636: preempt_count_add <-kmap_atomic_prot
> qemu-system-x86-2455 [001] .n.1 393.630639: __kunmap_atomic <-clear_huge_page
> qemu-system-x86-2455 [001] .n.1 393.630639: preempt_count_sub <-__kunmap_atomic
> qemu-system-x86-2455 [001] .n.. 393.630639: _cond_resched <-clear_huge_page
> qemu-system-x86-2455 [001] .n.. 393.630640: kmap_atomic <-clear_huge_page
> qemu-system-x86-2455 [001] .n.. 393.630640: kmap_atomic_prot <-kmap_atomic
> qemu-system-x86-2455 [001] .n.. 393.630640: preempt_count_add <-kmap_atomic_prot
>
>
> The resched_task() in the middle does:
>
> set_tsk_need_resched();
> smp_mb();
> smp_send_reschedule();
>
> This means that the remote cpu must observe TIF_NEED_RESCHED (.n.. in
> the traces) when the IPI (smp_reschedule_interrupt) hits.
>
> Now given the machine has unsynchronized TSC the order in the above is
> not necessarily the true order, but there's only two possible scenarios:
>
> 1) its the right order; the smp_reschedule_interrupt() is from a
> previous resched_task()/rcu kick or any of the other users that sends
> it.
>
> In which case we should get another smp_reschedule_interrupt() _after_
> the resched_task() which should observe the 'n' and promote it to an
> 'N'.
>
> This doesn't happen and we're stuck with the 'n' state..
>
> 2) its the wrong order and the smp_reschedule_interrupt() actually
> happened after the resched_task(), but in that case we should have
> observed the 'n', we did not.
>
> Now given that 2) would mean a bug in the coherency fabric and 1) would
> mean a lost interrupt, 1 seems like the more likely suspect.
>
> Joerg wanted the extra kvm tracepoints to see what the vm-exit reason
> is.
>

Let me think about this after I had another cup of coffee and my brain decides
to follow my body in the process of waking up...

-Stefan


Attachments:
signature.asc (901.00 B)
OpenPGP digital signature

2014-02-12 10:37:26

by Peter Zijlstra

Subject: Re: Another preempt folding issue?

On Wed, Feb 12, 2014 at 09:20:24AM +0100, Stefan Bader wrote:
> On 11.02.2014 20:45, Peter Zijlstra wrote:
> > On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote:
> >> Hi Peter,
> >>
> >> I am currently looking at a weird issue that manifest itself when trying to run
> >> kvm enabled qemu on a i386 host (v3.13 kernel, oh and potentially important the
> >> cpu is 64bit capable, so qemu-system-x86_64 is called).
> >
> > AMD or Intel machine?
>
> Personally I am reproducing this on an AMD:
>
> cpu family : 15
> model : 72
> model name : AMD Turion(tm) 64 X2 Mobile Technology TL-64
>
> but the reports I got are showing the same issue on an Intel i7
>
> cpu family : 6
> model : 37
> model name : Intel(R) Core(TM) i7 CPU M 620 @ 2.67GH
>
> Another reporter also saw this on an AMD and said it could not be reproduced on
> the same hardware and the same software versions when using 64bit instead of 32.
>
> In my case on a 32bit installation I will see this on every invocation of

Curious.. so it's both AMD and Intel.

That makes funny hardware less likely... but how can the vmexit lose an
interrupt like this?

2014-02-12 10:40:26

by Borislav Petkov

Subject: Re: Another preempt folding issue?

On Wed, Feb 12, 2014 at 11:37:13AM +0100, Peter Zijlstra wrote:
> > Another reporter also saw this on an AMD and said it could not be reproduced on
> > the same hardware and the same software versions when using 64bit instead of 32.
> >
> > In my case on a 32bit installation I will see this on every invocation of
>
> Curious.. so its both AMD and Intel.
>
> That makes funny hardware less likely... but how can the vmexit loose an
> interrupt like this.

Also what I'm wondering about and what's not clear from Stefan's reply
is whether this is purely a 32-bit issue, i.e. a 32-bit host running a
64-bit qemu running a 32-bit iso or what is it?

Or do we have reports for both 32-bit and 64-bit hosts?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-12 11:09:49

by Stefan Bader

Subject: Re: Another preempt folding issue?

On 12.02.2014 11:40, Borislav Petkov wrote:
> On Wed, Feb 12, 2014 at 11:37:13AM +0100, Peter Zijlstra wrote:
>>> Another reporter also saw this on an AMD and said it could not be reproduced on
>>> the same hardware and the same software versions when using 64bit instead of 32.
>>>
>>> In my case on a 32bit installation I will see this on every invocation of
>>
>> Curious.. so its both AMD and Intel.
>>
>> That makes funny hardware less likely... but how can the vmexit loose an
>> interrupt like this.
>
> Also what I'm wondering about and what's not clear from Stefan's reply
> is whether this is purely a 32-bit issue, i.e. a 32-bit host running a
> 64-bit qemu running a 32-bit iso or what is it?
>
> Or do we have reports for both 32-bit and 64-bit hosts?

Only for 32bit hosts. I have not tested a 64bit kernel plus 32bit userspace,
though. But 64bit kernel and 64bit userspace seemed ok. And one detail I should
add is that while it reproduces on every attempt, the timing can vary. In some
cases this happened even before running a Linux kernel in the guest.

Something else: here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
limited to 8 (for the 32bit kernel), so the default apic driver is used. Since
default_send_IPI_mask_logical is only used from there, I assume the trace you
got does the same. Maybe something there is wrong, which would explain why we
only see it on 32bit hosts.

>
> Thanks.
>



Attachments:
signature.asc (901.00 B)
OpenPGP digital signature

2014-02-12 11:12:11

by Joerg Roedel

Subject: Re: Another preempt folding issue?

On Wed, Feb 12, 2014 at 11:40:17AM +0100, Borislav Petkov wrote:
> Also what I'm wondering about and what's not clear from Stefan's reply
> is whether this is purely a 32-bit issue, i.e. a 32-bit host running a
> 64-bit qemu running a 32-bit iso or what is it?
>
> Or do we have reports for both 32-bit and 64-bit hosts?

As I understand it, the problem reproduces only on a 32bit host, which
also limits the guests to 32bit, at least when run with KVM. This is
true even if an x86-64 qemu is used. It is also independent of whether
the host cpu supports 64bit or not.


Joerg

2014-02-12 11:54:47

by Peter Zijlstra

Subject: Re: Another preempt folding issue?

On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
> default_send_IPI_mask_logical is only used from there, I assume the trace you
> got does the same. Maybe something there is wrong which would explain why we
> only see it on 32bit hosts.

Can you try with a different APIC driver to test this?

2014-02-13 17:00:49

by Stefan Bader

Subject: Re: Another preempt folding issue?

On 12.02.2014 12:54, Peter Zijlstra wrote:
> On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
>> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
>> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
>> default_send_IPI_mask_logical is only used from there, I assume the trace you
>> got does the same. Maybe something there is wrong which would explain why we
>> only see it on 32bit hosts.
>
> Can you try with a different APIC driver to test this?
>
I don't think I can. And I think the statement about this only being used for
32bit could be wrong. I was misled into thinking so because those are only
defined in probe_32, but the 64bit counterpart doesn't contain much apart from
that.

Anyway, I played around with tracing a bit more. So with this change:

	if (need_resched()) {
		srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
		if (need_resched() != should_resched()) {
+			trace_printk("need(%i) != should(%i)\n",
+				     need_resched(), should_resched());
+			trace_printk("exit_reason=%u\n",
+				     vcpu->run->exit_reason);
+			trace_printk("preempt_count=%lx\n",
+				     __this_cpu_read_4(__preempt_count));
+			tracing_stop();
+			printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
		}
-		cond_resched();
+		schedule();
		vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
	}

I get the following (weird) output:

Xorg-1078            [001] d...    71.270251: native_smp_send_reschedule <-resched_task
Xorg-1078            [001] d...    71.270251: default_send_IPI_mask_logical <-native_smp_send_reschedule
bamfdaemon-2318      [001] d...    71.270465: resched_task <-check_preempt_wakeup
bamfdaemon-2318      [001] d...    71.270539: resched_task <-check_preempt_wakeup
compiz-2365          [001] d...    71.270689: resched_task <-check_preempt_wakeup
compiz-2365          [001] d...    71.270827: resched_task <-check_preempt_wakeup
compiz-2365          [001] d...    71.270940: resched_task <-check_preempt_wakeup
qemu-system-x86-2679 [000] dn..    71.270999: smp_reschedule_interrupt <-reschedule_interrupt
qemu-system-x86-2679 [000] dn..    71.270999: scheduler_ipi <-smp_reschedule_interrupt
qemu-system-x86-2679 [000] .N..    71.271001: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
qemu-system-x86-2679 [000] .N..    71.271002: kvm_arch_vcpu_ioctl_run: exit_reason=2
qemu-system-x86-2679 [000] .N..    71.271003: kvm_arch_vcpu_ioctl_run: preempt_count=0

So am I reading this right: the interrupt did get delivered to cpu#0 while the
thread_info already had the resched flag set, so this really should have cleared
the bit in preempt_count. But while the trace info shows 'N', for some reason
should_resched() returns false, yet at the same time reading the preempt count
manually shows it as 0?



Attachments:
signature.asc (901.00 B)
OpenPGP digital signature

2014-02-13 17:39:09

by Peter Zijlstra

Subject: Re: Another preempt folding issue?

On Thu, Feb 13, 2014 at 06:00:19PM +0100, Stefan Bader wrote:
> On 12.02.2014 12:54, Peter Zijlstra wrote:
> > On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
> >> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
> >> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
> >> default_send_IPI_mask_logical is only used from there, I assume the trace you
> >> got does the same. Maybe something there is wrong which would explain why we
> >> only see it on 32bit hosts.
> >
> > Can you try with a different APIC driver to test this?
> >
> I don't think I can. And I think the statement about this only be used for 32bit
> could be wrong. I got mislead to think so because those are only defined in
> probe_32 but the 64bit counterpart isn't containing much aside that.
>
> Anyway, I played around with tracing a bit more. So with this change:
>
> if (need_resched()) {
> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> if (need_resched() != should_resched()) {
> + trace_printk("need(%i) != should(%i)\n",
> + need_resched(), should_resched());
> + trace_printk("exit_reason=%u\n",
> + vcpu->run->exit_reason);
> + trace_printk("preempt_count=%lx\n",
> + __this_cpu_read_4(__preempt_count));
> + tracing_stop();
> + printk(KERN_ERR "Stopped tracing, due to
> inconsistent state.\n");
> }
> + schedule();
> - cond_reschedule();
> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> }
>
> I get the following (weird) output:
>
> Xorg-1078 [001] d... 71.270251: native_smp_send_reschedule
> <-resched_task
> Xorg-1078 [001] d... 71.270251: default_send_IPI_mask_logical
> <-native_smp_send_reschedule
> bamfdaemon-2318 [001] d... 71.270465: resched_task <-check_preempt_wakeup
> bamfdaemon-2318 [001] d... 71.270539: resched_task <-check_preempt_wakeup
> compiz-2365 [001] d... 71.270689: resched_task <-check_preempt_wakeup
> compiz-2365 [001] d... 71.270827: resched_task <-check_preempt_wakeup
> compiz-2365 [001] d... 71.270940: resched_task <-check_preempt_wakeup
> qemu-system-x86-2679 [000] dn.. 71.270999: smp_reschedule_interrupt
> <-reschedule_interrupt
> qemu-system-x86-2679 [000] dn.. 71.270999: scheduler_ipi
> <-smp_reschedule_interrupt
> qemu-system-x86-2679 [000] .N.. 71.271001: kvm_arch_vcpu_ioctl_run: need(1)
> != should(0)
> qemu-system-x86-2679 [000] .N.. 71.271002: kvm_arch_vcpu_ioctl_run:
> exit_reason=2
> qemu-system-x86-2679 [000] .N.. 71.271003: kvm_arch_vcpu_ioctl_run:
> preempt_count=0
>
> So am I reading this right, that the interrupt did get delivered to cpu#0 while
> the thread info already had the resched flag set. So this really should have
> cleared the bit in preempt_count. But while the trace info shows 'N' for some
> reason should_reschedule returns false but at the same time reading the preempt
> count manually shows it 0?

*blink*... That's weird indeed... do you have the asm that goes along
with that?

2014-02-13 18:04:19

by Stefan Bader

Subject: Re: Another preempt folding issue?

On 13.02.2014 18:38, Peter Zijlstra wrote:
> On Thu, Feb 13, 2014 at 06:00:19PM +0100, Stefan Bader wrote:
>> On 12.02.2014 12:54, Peter Zijlstra wrote:
>>> On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
>>>> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
>>>> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
>>>> default_send_IPI_mask_logical is only used from there, I assume the trace you
>>>> got does the same. Maybe something there is wrong which would explain why we
>>>> only see it on 32bit hosts.
>>>
>>> Can you try with a different APIC driver to test this?
>>>
>> I don't think I can. And I think the statement about this only be used for 32bit
>> could be wrong. I got mislead to think so because those are only defined in
>> probe_32 but the 64bit counterpart isn't containing much aside that.
>>
>> Anyway, I played around with tracing a bit more. So with this change:
>>
>> if (need_resched()) {
>> srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>> if (need_resched() != should_resched()) {
>> + trace_printk("need(%i) != should(%i)\n",
>> + need_resched(), should_resched());
>> + trace_printk("exit_reason=%u\n",
>> + vcpu->run->exit_reason);
>> + trace_printk("preempt_count=%lx\n",
>> + __this_cpu_read_4(__preempt_count));
>> + tracing_stop();
>> + printk(KERN_ERR "Stopped tracing, due to
>> inconsistent state.\n");
>> }
>> + schedule();
>> - cond_reschedule();
>> vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>> }
>>
>> I get the following (weird) output:
>>
>> Xorg-1078 [001] d... 71.270251: native_smp_send_reschedule
>> <-resched_task
>> Xorg-1078 [001] d... 71.270251: default_send_IPI_mask_logical
>> <-native_smp_send_reschedule
>> bamfdaemon-2318 [001] d... 71.270465: resched_task <-check_preempt_wakeup
>> bamfdaemon-2318 [001] d... 71.270539: resched_task <-check_preempt_wakeup
>> compiz-2365 [001] d... 71.270689: resched_task <-check_preempt_wakeup
>> compiz-2365 [001] d... 71.270827: resched_task <-check_preempt_wakeup
>> compiz-2365 [001] d... 71.270940: resched_task <-check_preempt_wakeup
>> qemu-system-x86-2679 [000] dn.. 71.270999: smp_reschedule_interrupt
>> <-reschedule_interrupt
>> qemu-system-x86-2679 [000] dn.. 71.270999: scheduler_ipi
>> <-smp_reschedule_interrupt
>> qemu-system-x86-2679 [000] .N.. 71.271001: kvm_arch_vcpu_ioctl_run: need(1)
>> != should(0)
>> qemu-system-x86-2679 [000] .N.. 71.271002: kvm_arch_vcpu_ioctl_run:
>> exit_reason=2
>> qemu-system-x86-2679 [000] .N.. 71.271003: kvm_arch_vcpu_ioctl_run:
>> preempt_count=0
>>
>> So am I reading this right, that the interrupt did get delivered to cpu#0 while
>> the thread info already had the resched flag set. So this really should have
>> cleared the bit in preempt_count. But while the trace info shows 'N' for some
>> reason should_reschedule returns false but at the same time reading the preempt
>> count manually shows it 0?
>
> *blink*... That's weird indeed... do you have the asm that goes along
> with that?
>
Yeah... not sure whether the interleaved source helps or not...



Attachments:
kvm-x86.disass.gz (253.87 kB)
signature.asc (901.00 B)
OpenPGP digital signature

2014-02-13 18:25:36

by Peter Zijlstra

Subject: Re: Another preempt folding issue?

On Thu, Feb 13, 2014 at 06:00:19PM +0100, Stefan Bader wrote:
> On 12.02.2014 12:54, Peter Zijlstra wrote:
> > On Wed, Feb 12, 2014 at 12:09:29PM +0100, Stefan Bader wrote:
> >> Something else here I run a kernel with CONFIG_PREEMPT not set and NR_CPUS
> >> limited to 8 (for the 32bit kernel). So the default apic driver is used. Since
> >> default_send_IPI_mask_logical is only used from there, I assume the trace you
> >> got does the same. Maybe something there is wrong which would explain why we
> >> only see it on 32bit hosts.
> >
> > Can you try with a different APIC driver to test this?
> >
> I don't think I can. And I think the statement about this only be used for 32bit
> could be wrong. I got mislead to think so because those are only defined in
> probe_32 but the 64bit counterpart isn't containing much aside that.
>
> Anyway, I played around with tracing a bit more. So with this change:
>
> if (need_resched()) {
> 	srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> 	if (need_resched() != should_resched()) {
> +		trace_printk("need(%i) != should(%i)\n",
> +			     need_resched(), should_resched());
> +		trace_printk("exit_reason=%u\n",
> +			     vcpu->run->exit_reason);
> +		trace_printk("preempt_count=%lx\n",
> +			     __this_cpu_read_4(__preempt_count));
> +		tracing_stop();
> +		printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
> 	}
> -	cond_resched();
> +	schedule();
> 	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> }
>
> I get the following (weird) output:
>
> Xorg-1078 [001] d... 71.270251: native_smp_send_reschedule <-resched_task
> Xorg-1078 [001] d... 71.270251: default_send_IPI_mask_logical <-native_smp_send_reschedule
> bamfdaemon-2318 [001] d... 71.270465: resched_task <-check_preempt_wakeup
> bamfdaemon-2318 [001] d... 71.270539: resched_task <-check_preempt_wakeup
> compiz-2365 [001] d... 71.270689: resched_task <-check_preempt_wakeup
> compiz-2365 [001] d... 71.270827: resched_task <-check_preempt_wakeup
> compiz-2365 [001] d... 71.270940: resched_task <-check_preempt_wakeup
> qemu-system-x86-2679 [000] dn.. 71.270999: smp_reschedule_interrupt <-reschedule_interrupt
> qemu-system-x86-2679 [000] dn.. 71.270999: scheduler_ipi <-smp_reschedule_interrupt
> qemu-system-x86-2679 [000] .N.. 71.271001: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
> qemu-system-x86-2679 [000] .N.. 71.271002: kvm_arch_vcpu_ioctl_run: exit_reason=2
> qemu-system-x86-2679 [000] .N.. 71.271003: kvm_arch_vcpu_ioctl_run: preempt_count=0
>
> So am I reading this right: the interrupt did get delivered to cpu#0 while
> the thread info already had the resched flag set, so this really should have
> cleared the bit in preempt_count. But while the trace info shows 'N', for some
> reason should_resched() returns false while at the same time reading the
> preempt count manually shows it as 0?

So the assembly merges the first and second should_resched(), so it's
possible that load happened before the interrupt.

The 3rd preempt_count load gets re-issued and so that would show the
'true' value again.

If you want to force a reload after the condition; put in a barrier().

In any case; this looks like a false-positive. Please try again until
you get one where the interrupt doesn't happen and we stay in 'n' state.

2014-02-13 18:26:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On Thu, Feb 13, 2014 at 07:03:56PM +0100, Stefan Bader wrote:
> Yeah... not sure the interleaved source helps or not ...

It did, thanks!

2014-02-14 10:56:07

by Stefan Bader

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On 13.02.2014 19:25, Peter Zijlstra wrote:
> [...]
> So the assembly merges the first and second should_resched(), so its
> possible that load got before the interrupt().
>
> The 3rd preempt_count load gets re-issued and so that would show the
> 'true' value again.
>
> If you want to force a reload after the condition; put in a barrier().
>
> In any case; this looks like a false-positive. Please try again until
> you get one where the interrupt doesn't happen and we stay in 'n' state.
>
Ok, I think I now got a log of the actual issue. It seems cpu#1 missed out on
handling a reschedule interrupt but did send one to cpu#0. On cpu#0, while
handling the interrupt, the TIF flag was not set (yet?), but then, when it is
set, there is no more interrupt left to do the folding...

qemu-system-x86-2503 [000] d.h. 1145.887872: default_send_IPI_mask_logical <-native_smp_send_reschedule
qemu-system-x86-2503 [000] d... 1145.887888: resched_task <-check_preempt_curr
<idle>-0 [000] d.h. 1145.888001: resched_task <-check_preempt_curr
<idle>-0 [001] dn.. 1145.888788: smp_reschedule_interrupt <-reschedule_interrupt
<idle>-0 [001] dn.. 1145.888789: scheduler_ipi <-smp_reschedule_interrupt
gnome-terminal-2440 [000] d... 1145.888957: resched_task <-check_preempt_curr
gnome-terminal-2440 [000] d... 1145.888959: native_smp_send_reschedule <-resched_task
gnome-terminal-2440 [000] d... 1145.888959: default_send_IPI_mask_logical <-native_smp_send_reschedule
<idle>-0 [001] dn.. 1145.889296: smp_reschedule_interrupt <-reschedule_interrupt
<idle>-0 [001] dn.. 1145.889297: scheduler_ipi <-smp_reschedule_interrupt
gnome-terminal-2440 [000] d... 1145.889465: resched_task <-check_preempt_curr
gnome-terminal-2440 [000] d... 1145.889467: native_smp_send_reschedule <-resched_task
gnome-terminal-2440 [000] d... 1145.889467: default_send_IPI_mask_logical <-native_smp_send_reschedule
Xorg-1054 [001] d.h. 1145.889469: resched_task <-check_preempt_wakeup
Xorg-1054 [001] d.h. 1145.889469: native_smp_send_reschedule <-resched_task
Xorg-1054 [001] d.h. 1145.889469: default_send_IPI_mask_logical <-native_smp_send_reschedule
qemu-system-x86-2503 [000] d.s. 1145.889643: smp_reschedule_interrupt <-reschedule_interrupt
qemu-system-x86-2503 [000] d.s. 1145.889643: scheduler_ipi <-smp_reschedule_interrupt
qemu-system-x86-2503 [000] .n.. 1145.889647: kvm_arch_vcpu_ioctl_run: sched inconsistency
qemu-system-x86-2503 [000] .n.. 1145.889662: kvm_arch_vcpu_ioctl_run: sched inconsistency
qemu-system-x86-2503 [000] .n.. 1145.889664: kvm_arch_vcpu_ioctl_run: sched inconsistency
qemu-system-x86-2503 [000] .n.. 1145.889668: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
qemu-system-x86-2503 [000] .n.. 1145.889669: kvm_arch_vcpu_ioctl_run: exit_reason=2
qemu-system-x86-2503 [000] .n.. 1145.889669: kvm_arch_vcpu_ioctl_run: preempt_count=80000000



2014-02-14 11:25:11

by Stefan Bader

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On 13.02.2014 19:25, Peter Zijlstra wrote:
> [...]

Oh and one thing I was wondering. Not sure I do understand it right... When
initially converting to percpu counts, you changed the 32bit assembly like that:

--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -362,12 +362,9 @@ END(ret_from_exception)
#ifdef CONFIG_PREEMPT
ENTRY(resume_kernel)
DISABLE_INTERRUPTS(CLBR_ANY)
- cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ?
- jnz restore_all
need_resched:
- movl TI_flags(%ebp), %ecx # need_resched set ?
- testb $_TIF_NEED_RESCHED, %cl
- jz restore_all
+ cmpl $0,PER_CPU_VAR(__preempt_count)
+ jnz restore_all
testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ?
jz restore_all
call preempt_schedule_irq

This seems to say that if preempt_count was 0, then if the thread flag was set
and interrupts were not off(?) it would do a preempt ipi and then come back to
re-check the thread flag.
This would now be if preempt_count is 0 only... and I wonder whether that would
change from doing that loop...



2014-02-14 11:49:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On Fri, Feb 14, 2014 at 12:24:42PM +0100, Stefan Bader wrote:
> Oh and one thing I was wondering. Not sure I do understand it right... When
> initially converting to percpu counts, you changed the 32bit assembly like that:
>
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -362,12 +362,9 @@ END(ret_from_exception)
> #ifdef CONFIG_PREEMPT
> ENTRY(resume_kernel)
> DISABLE_INTERRUPTS(CLBR_ANY)
> - cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ?
> - jnz restore_all
> need_resched:
> - movl TI_flags(%ebp), %ecx # need_resched set ?
> - testb $_TIF_NEED_RESCHED, %cl
> - jz restore_all
> + cmpl $0,PER_CPU_VAR(__preempt_count)
> + jnz restore_all
> testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ?
> jz restore_all
> call preempt_schedule_irq
>
> This seems to say that if preempt_count was 0, then if the thread flag was set
> and interrupts were not off(?) it would do a preempt ipi and then come back to
> re-check the thread flag.

No not an IPI; it would reschedule.

So the old code:

if preempt_count != 0; continue out
if !TIF_NEED_RESCHED; continue out
if IRQs-off in calling context; continue out
preempt_schedule_irq

The new code:

if preempt_count != 0; continue out
if IRQs-off in calling context; continue out
preempt_schedule_irq

> This would now be if preempt_count is 0 only... and I wonder whether that would
> change from doing that loop...

We can do away with the TIF_NEED_RESCHED test because that state is
folded into the preempt_count by means of PREEMPT_NEED_RESCHED.

2014-02-14 13:18:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On Fri, Feb 14, 2014 at 11:55:25AM +0100, Stefan Bader wrote:
> Ok, I think I now got a log of the actual issue. It seems cpu#1 missed out on
> handling a reschedule interrupt but did send one to cpu#0 and on cpu#0 while
> handling the interrupt the tif flag was not set (yet?) but then when it is,
> there is no more interrupt to do the folding... (sorry for the wrapping)...
>
> qemu-system-x86-2503 [000] d.h. 1145.887872: default_send_IPI_mask_logical <-native_smp_send_reschedule
> qemu-system-x86-2503 [000] d... 1145.887888: resched_task <-check_preempt_curr
> <idle>-0 [000] d.h. 1145.888001: resched_task <-check_preempt_curr

> <idle>-0 [001] dn.. 1145.888788: smp_reschedule_interrupt <-reschedule_interrupt
> <idle>-0 [001] dn.. 1145.888789: scheduler_ipi <-smp_reschedule_interrupt

> gnome-terminal-2440 [000] d... 1145.888957: resched_task <-check_preempt_curr
> gnome-terminal-2440 [000] d... 1145.888959: native_smp_send_reschedule <-resched_task
> gnome-terminal-2440 [000] d... 1145.888959: default_send_IPI_mask_logical <-native_smp_send_reschedule

> <idle>-0 [001] dn.. 1145.889296: smp_reschedule_interrupt <-reschedule_interrupt
> <idle>-0 [001] dn.. 1145.889297: scheduler_ipi <-smp_reschedule_interrupt

> gnome-terminal-2440 [000] d... 1145.889465: resched_task <-check_preempt_curr
> gnome-terminal-2440 [000] d... 1145.889467: native_smp_send_reschedule <-resched_task
> gnome-terminal-2440 [000] d... 1145.889467: default_send_IPI_mask_logical <-native_smp_send_reschedule

> Xorg-1054 [001] d.h. 1145.889469: resched_task <-check_preempt_wakeup
> Xorg-1054 [001] d.h. 1145.889469: native_smp_send_reschedule <-resched_task
> Xorg-1054 [001] d.h. 1145.889469: default_send_IPI_mask_logical <-native_smp_send_reschedule

> qemu-system-x86-2503 [000] d.s. 1145.889643: smp_reschedule_interrupt <-reschedule_interrupt
> qemu-system-x86-2503 [000] d.s. 1145.889643: scheduler_ipi <-smp_reschedule_interrupt
> qemu-system-x86-2503 [000] .n.. 1145.889647: kvm_arch_vcpu_ioctl_run: sched inconsistency
> qemu-system-x86-2503 [000] .n.. 1145.889662: kvm_arch_vcpu_ioctl_run: sched inconsistency
> qemu-system-x86-2503 [000] .n.. 1145.889664: kvm_arch_vcpu_ioctl_run: sched inconsistency
> qemu-system-x86-2503 [000] .n.. 1145.889668: kvm_arch_vcpu_ioctl_run: need(1) != should(0)
> qemu-system-x86-2503 [000] .n.. 1145.889669: kvm_arch_vcpu_ioctl_run: exit_reason=2
> qemu-system-x86-2503 [000] .n.. 1145.889669: kvm_arch_vcpu_ioctl_run: preempt_count=80000000
>

So afaiu you're running this on an AMD fam 0xf which doesn't have
synchronized TSC; so again the timing might be off.. however:

The first block from cpu 0, sends an IPI to cpu 1, cpu 1 sees 'n'.

Then again, cpu 0 sends to cpu 1, we observe 'n'.

So far so good and timing seems to reflect causality.

Then however, cpu 0 sends, but nothing in this trace answers on cpu 1..
odd.

Then cpu 1 sends to cpu 0, cpu 0 does receive the IPI but does _not_
observe 'n', which is mighty odd. However after the IPI is handled cpu 0
magically gets the 'n'.



So one thing I did wonder about; if cpu 0 sends an IPI while cpu 1 is
already processing the same interrupt; cpu 1 should latch the interrupt
and it would re-trigger when we re-enable interrupts after we're
done with the first.

The only thing that would ruin that is late-ack, eg. ack (clear the
latch) when exiting the interrupt; but I checked and we ack before we
call scheduler_ipi().

Now; if this were to happen (but I don't exactly see how on a 2 CPU
system, nor do I see this in the trace), there might be a hole...

CPU 0 | CPU 1
--------------------------------+--------------------------------
|
/* ! resched_task */ |
smp_send_reschedule(1) |L
|L smp_reschedule_interrupt()
resched_task() |L
set_tsk_need_resched() |L
smp_mb() |L
smp_send_reschedule(1) |L
|L ack_APIC_irq()
| scheduler_ipi()
| need_resched();

Now strictly speaking that need_resched() load could creep up before
the store and we'd fail to observe 'n'.

However, x86 forbids reads from passing an earlier write to the same
location (it allows passing earlier writes in general). At which point
the only other option is GCC emitting that load far earlier, in fact
before ack_APIC_irq(), but that would be a violation of the C std
because scheduler_ipi() is a (real) function call and acts as a compiler
barrier AFAIK (as opposed to inlining which doesn't act as a compiler
barrier).

So colour me confused..


You could try putting an smp_mb() at the very start of scheduler_ipi()
just to make absolutely sure, but it _should_ not matter.

Of course, I could be completely off base with how interrupts work; and
there's a wide gaping hole in there, in which case, someone please
educate me ;-)

2014-02-14 13:34:38

by Borislav Petkov

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On Thu, Feb 13, 2014 at 07:26:05PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 13, 2014 at 07:03:56PM +0100, Stefan Bader wrote:
> > Yeah... not sure the interleaved source helps or not ...
>
> It did, thanks!

Stefan, can you also send sched/core.s? I'm particularly interested in
what resched_task() has been turned into by your build env.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-14 13:41:22

by Stefan Bader

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On 14.02.2014 14:34, Borislav Petkov wrote:
> On Thu, Feb 13, 2014 at 07:26:05PM +0100, Peter Zijlstra wrote:
>> On Thu, Feb 13, 2014 at 07:03:56PM +0100, Stefan Bader wrote:
>>> Yeah... not sure the interleaved source helps or not ...
>>
>> It did, thanks!
>
> Stefan, can you also send sched/core.s? I'm particularly interested in
> what resched_task() has been turned into by your build env.
>
> Thanks.
>
Sure. Here it is.


Attachments:
sched-core.disass.gz (193.88 kB)

2014-02-14 14:24:26

by Stefan Bader

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On 14.02.2014 14:34, Borislav Petkov wrote:
> On Thu, Feb 13, 2014 at 07:26:05PM +0100, Peter Zijlstra wrote:
>> On Thu, Feb 13, 2014 at 07:03:56PM +0100, Stefan Bader wrote:
>>> Yeah... not sure the interleaved source helps or not ...
>>
>> It did, thanks!
>
> Stefan, can you also send sched/core.s? I'm particularly interested in
> what resched_task() has been turned into by your build env.
>
> Thanks.
>
Actually, this code just makes so much more sense if I let objdump do relocation
info...



Attachments:
sched-core.disass.gz (205.58 kB)

2014-02-14 14:47:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On Fri, Feb 14, 2014 at 03:24:09PM +0100, Stefan Bader wrote:
> Actually, this code just makes so much more sense if I let objdump do
> relocation info...

Ok, we're pretty sure you have an MFENCE there in resched_task but can
you confirm it please.

First, does /proc/cpuinfo have the "sse2" string? It should but nowadays
I don't trust anything.

Then, can you boot that kernel in qemu with the -gdb flag so that it
starts the gdb stub, here's the manpage about it:

-gdb dev
Wait for gdb connection on device dev. Typical connections will likely be
TCP-based, but also UDP, pseudo TTY, or even stdio are reasonable use case.
The latter is allowing to start QEMU from within gdb and establish the
connection via a pipe:

(gdb) target remote | exec qemu-system-i386 -gdb stdio ...

-s Shorthand for -gdb tcp::1234, i.e. open a gdbserver on TCP port 1234.

then boot the guest and when it is up, do

$ gdb ./vmlinux
$ target remote localhost:1234

and type in the prompt

$ (gdb) x/50i resched_task

It gives here:

(gdb) x/50i resched_task
0xffffffff810836f0 <resched_task>: data32 data32 data32 xchg %ax,%ax
0xffffffff810836f5 <resched_task+5>: push %rbp
0xffffffff810836f6 <resched_task+6>: mov 0x85e123(%rip),%r10d # 0xffffffff818e1820 <debug_locks>
0xffffffff810836fd <resched_task+13>: mov %rsp,%rbp
0xffffffff81083700 <resched_task+16>: push %r12
0xffffffff81083702 <resched_task+18>: test %r10d,%r10d
0xffffffff81083705 <resched_task+21>: push %rbx
0xffffffff81083706 <resched_task+22>: mov %rdi,%rbx
0xffffffff81083709 <resched_task+25>: jne 0xffffffff81083760 <resched_task+112>
0xffffffff8108370b <resched_task+27>: mov 0x8(%rbx),%rax
0xffffffff8108370f <resched_task+31>: mov 0x10(%rax),%rdx
0xffffffff81083713 <resched_task+35>: and $0x8,%edx
0xffffffff81083716 <resched_task+38>: jne 0xffffffff8108373c <resched_task+76>
0xffffffff81083718 <resched_task+40>: lock orb $0x8,0x10(%rax)
0xffffffff8108371d <resched_task+45>: mov 0x8(%rbx),%rax
0xffffffff81083721 <resched_task+49>: mov 0x18(%rax),%r12d
0xffffffff81083725 <resched_task+53>: callq 0xffffffff812d8fc0 <debug_smp_processor_id>
0xffffffff8108372a <resched_task+58>: cmp %r12d,%eax
0xffffffff8108372d <resched_task+61>: je 0xffffffff810837a0 <resched_task+176>
0xffffffff8108372f <resched_task+63>: mfence
^^^^^^
I want to make sure the smp_mb() is really replaced with an MFENCE there.

Thanks!

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-14 15:21:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: Another preempt folding issue? (maybe bisect)

Oh, and just in case this is relatively easy to reproduce and in case we
don't have any other idea, bisection might be another option. I'm not
saying you should do it right away - I'm just putting it on the table...

:-)

:-)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-14 15:28:50

by Stefan Bader

[permalink] [raw]
Subject: Re: Another preempt folding issue? (maybe bisect)

On 14.02.2014 16:21, Borislav Petkov wrote:
> Oh, and just in case this is relatively easy to reproduce and in case we
> don't have any other idea, bisection might be another option. I'm not
> saying you should do it right away - I'm just putting it on the table...
>
> :-)
>
> :-)
>

Oh yeah, bisection is nearly as entertaining as doing my tax records. Hm, on the
other hand those will have to be done at some point too... :-P



2014-02-14 15:44:21

by Borislav Petkov

[permalink] [raw]
Subject: Re: Another preempt folding issue? (maybe bisect)

On Fri, Feb 14, 2014 at 04:28:16PM +0100, Stefan Bader wrote:
> Oh yeah, bisection is nearly as entertaining as doing my tax records.
> Hm, on the other hand those will have to be done at some point too...
> :-P

Ah, tax records, crap, I have those lying around in the corner since
forever... I need to do those too :-(

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-14 16:21:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Another preempt folding issue? (maybe bisect)

On Fri, Feb 14, 2014 at 04:21:32PM +0100, Borislav Petkov wrote:
> Oh, and just in case this is relatively easy to reproduce and in case we
> don't have any other idea, bisection might be another option. I'm not
> saying you should do it right away - I'm just putting it on the table...

I'm fairly sure you're going to hit the patch that introduced the whole
per-cpu preemption thing. Before that not receiving the IPI wasn't as
bad as it is now.

2014-02-14 17:02:54

by Stefan Bader

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On 14.02.2014 15:47, Borislav Petkov wrote:
> [...]
Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And
there is a mfence in the disassembly:

0xc107dcb0 <resched_task>: push %ebp
0xc107dcb1 <resched_task+1>: mov %esp,%ebp
0xc107dcb3 <resched_task+3>: lea %ds:0x0(%esi,%eiz,1),%esi
0xc107dcb8 <resched_task+8>: mov 0x4(%eax),%edx
0xc107dcbb <resched_task+11>: mov 0x8(%edx),%ecx
0xc107dcbe <resched_task+14>: and $0x8,%ecx
0xc107dcc1 <resched_task+17>: jne 0xc107dce7 <resched_task+55>
0xc107dcc3 <resched_task+19>: orb $0x8,%ds:0x8(%edx)
0xc107dcc8 <resched_task+24>: mov 0x4(%eax),%edx
0xc107dccb <resched_task+27>: mov %fs:0xc1a71010,%ecx
0xc107dcd2 <resched_task+34>: mov 0x10(%edx),%edx
0xc107dcd5 <resched_task+37>: cmp %ecx,%edx
0xc107dcd7 <resched_task+39>: je 0xc107dd00 <resched_task+80>
0xc107dcd9 <resched_task+41>: mfence
0xc107dcdc <resched_task+44>: mov %esi,%esi
0xc107dcde <resched_task+46>: mov 0x4(%eax),%eax
0xc107dce1 <resched_task+49>: testb $0x4,0xc(%eax)
0xc107dce5 <resched_task+53>: je 0xc107dcf0 <resched_task+64>
0xc107dce7 <resched_task+55>: pop %ebp
0xc107dce8 <resched_task+56>: ret
0xc107dce9 <resched_task+57>: lea 0x0(%esi,%eiz,1),%esi
0xc107dcf0 <resched_task+64>: mov %edx,%eax
0xc107dcf2 <resched_task+66>: call *0xc1917950
0xc107dcf8 <resched_task+72>: pop %ebp

Thinking about it, I guess Peter is quite right saying that I will likely end
up on the patch that converted preempt_count to percpu.

One thing I likely should do is to reinstall the exact same laptop with 64bit
kernel and userspace... maybe only 64bit kernel first... and make sure on my
side that this does not show up on 64bit, too. I took the word of reporters for
that (and the impression that otherwise many more people would have complained).




2014-02-14 17:22:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
> One thing I likely should do is to reinstall the exact same laptop with 64bit
> kernel and userspace... maybe only 64bit kernel first... and make sure on my
> side that this does not show up on 64bit, too. I took the word of reporters for
> that (and the impression that otherwise many more people would have complained).

Yeha, I'm going to try and install some 32bit userspace on a usb
harddisk I've got and see if I can boot my Core2 laptop from that to try
and reproduce.

But all that is probably going to be Monday :/

2014-02-14 17:33:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
> Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And
> there is a mfence in the disassembly:

Btw, I just realized booting the kernel in the guest was a dumb idea,
because, doh, the guest is not baremetal. The only reliable thing we
can say is that sse2 is present and that MFENCE alternative replacement
works :)

But for simplicity's sake let's just assume the machine can do MFENCE
just fine and it gets replaced by the alternatives code.

Besides, if that weren't true, we'd have a whole lot of other problems
on those boxes.

> Thinking about it, I guess Peter is quite right saying that I will likely
> end up on the patch that converted preempt_count to percpu.

Yeah, c2daa3bed53a81171cf8c1a36db798e82b91afe8 et al.

> One thing I likely should do is to reinstall the exact same laptop
> with 64bit kernel and userspace... maybe only 64bit kernel first...
> and make sure on my side that this does not show up on 64bit, too. I
> took the word of reporters for that (and the impression that otherwise
> many more people would have complained).

Yeah, that should be a prudent thing to do.

Also, Paolo and I were wondering whether you can trigger this thing
without kvm, i.e. virtualization involved... do you have any data on
that?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-14 18:23:36

by Stefan Bader

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On 14.02.2014 18:33, Borislav Petkov wrote:
> On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
>> Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And
>> there is a mfence in the disassembly:
>
> Btw, I just realized booting the kernel in the guest was a dumb idea,
> because, doh, the guest is not baremetal. The only reliable thing we
> can say is that sse2 is present and that MFENCE alternative replacement
> works :)
>
> But for simplicity's sake let's just assume the machine can do MFENCE
> just fine and it gets replaced by the alternatives code.
>
> Besides, if that weren't true, we'd have a whole lot of other problems
> on those boxes.
>
>> Thinking about it, I guess Peter is quite right saying that I likely
>> will end on the patch that converted preempt_count to percpu.
>
> Yeah, c2daa3bed53a81171cf8c1a36db798e82b91afe8 et al.
>
>> One thing I likely should do is to reinstall the exact same laptop
>> with 64bit kernel and userspace... maybe only 64bit kernel first...
>> and make sure on my side that this does not show up on 64bit, too. I
>> took the word of reporters for that (and the impression that otherwise
>> many more people would have complained).
>
> Yeah, that should be a prudent thing to do.
>
> Also, Paolo and I were wondering whether you can trigger this thing
> without kvm, i.e. virtualization involved... do you have any data on
> that?

Unfortunately no hard evidence. Kvm just happens to be a good way to notice
this, as it uses the reschedule interrupt itself and exits before running the
guest vcpu so the outer loop can handle it by calling cond_resched() and
repeating.
I find that running kvm seems to make that laptop quite sluggish in responding
to other tasks (in that install), and I see some oddness where lightdm quite
often refuses to take keyboard input unless some menu is first opened with the
mouse... But I cannot be sure whether that is the kernel or some new
user-space ... errr "feature".
At least Marcello (iirc that other report came from him directly or indirectly)
has seen it, too. And he likely has a completely different user-space.

So I will go and do that different (64bit) kernel and kernel + user-space test.
But as for Peter, it is likely a Monday thing...






2014-02-14 19:03:55

by Stefan Bader

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On 14.02.2014 19:23, Stefan Bader wrote:
> On 14.02.2014 18:33, Borislav Petkov wrote:
>> On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
>>> Okaaay, I think I did what you asked. So yes, there is sse2 in the cpu info. And
>>> there is a mfence in the disassembly:
>>
>> Btw, I just realized booting the kernel in the guest was a dumb idea,
>> because, doh, the guest is not baremetal. The only reliable thing we
>> can say is that sse2 is present and that MFENCE alternative replacement
>> works :)
>>
>> But for simplicity's sake let's just assume the machine can do MFENCE
>> just fine and it gets replaced by the alternatives code.
>>
>> Besides, if that weren't true, we'd have a whole lot of other problems
>> on those boxes.
>>
>>> Thinking about it, I guess Peter is quite right saying that I likely
>>> will end on the patch that converted preempt_count to percpu.
>>
>> Yeah, c2daa3bed53a81171cf8c1a36db798e82b91afe8 et al.
>>
>>> One thing I likely should do is to reinstall the exact same laptop
>>> with 64bit kernel and userspace... maybe only 64bit kernel first...
>>> and make sure on my side that this does not show up on 64bit, too. I
>>> took the word of reporters for that (and the impression that otherwise
>>> many more people would have complained).
>>
>> Yeah, that should be a prudent thing to do.
>>
>> Also, Paolo and I were wondering whether you can trigger this thing
>> without kvm, i.e. virtualization involved... do you have any data on
>> that?
>
> Unfortunately no hard evidence. Kvm just happens to be such a good way to notice
> this as it is using the reschedule interrupt itself and has this exit before
> running the guest vcpu to hadnle it in the outer loop by calling cond_resched()
> and repeat.
> I find running kvm seems to make that laptop quite sluggish in responding to
> other tasks (in that install) and I got some oddness going on when lightdm quite
> often refuses to take keyboard input without opening some menu with the mouse
> first... But I could not be sure whether that is the kernel or some new
> user-space ... errr "feature".
> At least Marcello (iirc that other report came from him directly or indirectly)
> has seen it, too. And he likely has complete different user-space.
>
> So I will go and do that different (64bit) kernel and kernel + user-space test.
> But like fo Peter, it likely is a Monday thing...
>

Ok, it is still Friday... So a quick test (2 boots of a 32bit guest, same as
before) on the 32bit user-space, but with the same kernel source compiled as
64bit (obviously not 100% the same config, but close). While I still see the
false-inconsistency messages (I modified the in-kernel test to trigger the
trace stop only if the __vcpu_run loop encounters an inconsistent state three
times in a row), I do not see the final stop message. Also (though that is more
of a feeling) the system seems to remain more responsive (switching to other
windows, opening terminal windows, ...) compared to the 32bit kernel.




2014-02-20 15:38:34

by Stefan Bader

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On 14.02.2014 18:21, Peter Zijlstra wrote:
> On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
>> One thing I likely should do is to reinstall the exact same laptop with 64bit
>> kernel and userspace... maybe only 64bit kernel first... and make sure on my
>> side that this does not show up on 64bit, too. I took the word of reporters for
>> that (and the impression that otherwise many more people would have complained).
>
> Yeha, I'm going to try and install some 32bit userspace on a usb
> harddisk I've got and see if I can boot my Core2 laptop from that to try
> and reproduce.
>
> But all that is probably going to be Monday :/
>
*sigh* Already Thursday...

Peter, did you get to reproduce this locally? Unfortunately I had some
interruptions and have no more information than on last Friday (which is that
the same hw with a 64bit kernel does not show it).

Meanwhile I wonder whether it would make sense to push the following (or more?)
to stable for 3.13.y:

1) 16824255394f55adf31b9a96a9965d8c15bdac4c
x86, acpi, idle: Restructure the mwait idle routines
2) 7e98b71920464b8d15fa95c74366416cd3c88861
x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
3) 8cb75e0c4ec9786b81439761eac1d18d4a931af3
sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
4) 215393bc1fab3d61a5a296838bdffce22f27ffda
sched/preempt/x86: Fix voluntary preempt for x86

1+2 would be to avoid touching 3 too much, and they looked like improvements on
their own. 3+4 would be clean cherry-picks if not for some fuzz of 2.
I saw a few more things labelled preempt between 3.13 and current HEAD but am
not sure whether or which of those are strictly required. Namely some fixes for
preempt_enable_no_resched() mis-use and maybe one fixing an issue of ftrace
locking up.

-Stefan



2014-02-20 15:50:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Another preempt folding issue?

On Thu, Feb 20, 2014 at 04:38:13PM +0100, Stefan Bader wrote:
> On 14.02.2014 18:21, Peter Zijlstra wrote:
> > On Fri, Feb 14, 2014 at 06:02:32PM +0100, Stefan Bader wrote:
> >> One thing I likely should do is to reinstall the exact same laptop with 64bit
> >> kernel and userspace... maybe only 64bit kernel first... and make sure on my
> >> side that this does not show up on 64bit, too. I took the word of reporters for
> >> that (and the impression that otherwise many more people would have complained).
> >
> > Yeha, I'm going to try and install some 32bit userspace on a usb
> > harddisk I've got and see if I can boot my Core2 laptop from that to try
> > and reproduce.
> >
> > But all that is probably going to be Monday :/
> >
> *sigh* Already Thursday...
>
> Peter, did you get to reproduce this locally? Unfortunately I had some
> interruption and have not more Information than on last Friday (which is that
> the same hw but 64bit kernel does not show it).

I got side-tracked as well; someone reported crashes, which take precedence
over weird behaviour :/

> Meanwhile I wonder whether it would make sense to push the following (or more?)
> to stable for 3.13.y:
>
> 1) 16824255394f55adf31b9a96a9965d8c15bdac4c
> x86, acpi, idle: Restructure the mwait idle routines
> 2) 7e98b71920464b8d15fa95c74366416cd3c88861
> x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
> 3) 8cb75e0c4ec9786b81439761eac1d18d4a931af3
> sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
> 4) 215393bc1fab3d61a5a296838bdffce22f27ffda
> sched/preempt/x86: Fix voluntary preempt for x86
>
> 1+2 would be to avoid touching 3 too much and looked to be improvements on their
> own. 3+4 would be cherry-picks if not for some fuzz 2.

I'll not object; but if someone wants a smaller set you could do with
just the idle.c hunk from 3.

> I saw a few more things labelled preempt between 3.13 and current HEAD but am
> not sure whether or which of those are strictly required. Namely some fixing to
> preempt_enable_no_resched() mis-usage and maybe one fixing an issue of ftrace
> locking up.

Yeah, that's all fresh breakage and doesn't go back to 3.13 I think.