2019-08-01 15:32:05

by Thomas Gleixner

Subject: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

TIF_NOTIFY_RESUME is evaluated on return to user space along with other TIF
flags.

From the kernel's point of view a VMENTER is more or less equivalent to a
return to user space, which means that at least a subset of TIF flags needs
to be evaluated and handled.

Currently KVM handles only TIF_SIGPENDING and TIF_NEED_RESCHED, but
TIF_NOTIFY_RESUME is ignored, so pending task_work etc. is not processed
until the vCPU thread actually goes all the way back to userspace/qemu.

Use the newly provided notify_resume_pending() and
tracehook_handle_notify_resume() to solve this, similar to the existing
handling of TIF_SIGPENDING.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: Radim Krcmar <[email protected]>
Cc: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/x86.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -52,6 +52,7 @@
#include <linux/irqbypass.h>
#include <linux/sched/stat.h>
#include <linux/sched/isolation.h>
+#include <linux/tracehook.h>
#include <linux/mem_encrypt.h>

#include <trace/events/kvm.h>
@@ -7972,7 +7973,8 @@ static int vcpu_enter_guest(struct kvm_v
kvm_x86_ops->sync_pir_to_irr(vcpu);

if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu)
- || need_resched() || signal_pending(current)) {
+ || need_resched() || signal_pending(current)
+ || notify_resume_pending()) {
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
local_irq_enable();
@@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
++vcpu->stat.signal_exits;
break;
}
+
+ if (notify_resume_pending())
+ tracehook_handle_notify_resume();
+
if (need_resched()) {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
cond_resched();



2019-08-01 18:01:14

by Oleg Nesterov

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

On 08/01, Thomas Gleixner wrote:
>
> @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> ++vcpu->stat.signal_exits;
> break;
> }
> +
> + if (notify_resume_pending())
> + tracehook_handle_notify_resume();

shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?

I don't understand this code at all, but vcpu_run() does this even before
cond_resched().

Oleg.

2019-08-01 22:54:57

by Thomas Gleixner

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> On 08/01, Thomas Gleixner wrote:
> >
> > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > ++vcpu->stat.signal_exits;
> > break;
> > }
> > +
> > + if (notify_resume_pending())
> > + tracehook_handle_notify_resume();
>
> shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
>
> I don't understand this code at all, but vcpu_run() does this even before
> cond_resched().

Yeah, I noticed that it's dropped around cond_resched().

My understanding is that for voluntary giving up the CPU via cond_resched()
it needs to be dropped.

For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
whole code section after preempt_enable() is fully preemptible.

Now the 1Mio$ question is whether any of the notify functions invokes
cond_resched() and whether that really matters. Paolo?

Thanks,

tglx

2019-08-02 02:18:36

by Sean Christopherson

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

On Thu, Aug 01, 2019 at 08:34:53PM +0200, Thomas Gleixner wrote:
> On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > On 08/01, Thomas Gleixner wrote:
> > >
> > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > > ++vcpu->stat.signal_exits;
> > > break;
> > > }
> > > +
> > > + if (notify_resume_pending())
> > > + tracehook_handle_notify_resume();
> >
> > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> >
> > I don't understand this code at all, but vcpu_run() does this even before
> > cond_resched().
>
> Yeah, I noticed that it's dropped around cond_resched().
>
> My understanding is that for voluntary giving up the CPU via cond_resched()
> it needs to be dropped.
>
> For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
> whole code section after preempt_enable() is fully preemptible.
>
> Now the 1Mio$ question is whether any of the notify functions invokes
> cond_resched() and whether that really matters. Paolo?

cond_resched() is called via tracehook_notify_resume()->task_work_run(),
and "kernel code can only call cond_resched() in places where it ...
cannot hold references to any RCU-protected data structures" according to
https://lwn.net/Articles/603252/.

2019-08-02 02:22:48

by Peter Zijlstra

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

On Thu, Aug 01, 2019 at 02:35:50PM -0700, Sean Christopherson wrote:
> On Thu, Aug 01, 2019 at 08:34:53PM +0200, Thomas Gleixner wrote:
> > On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > > On 08/01, Thomas Gleixner wrote:
> > > >
> > > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > > > ++vcpu->stat.signal_exits;
> > > > break;
> > > > }
> > > > +
> > > > + if (notify_resume_pending())
> > > > + tracehook_handle_notify_resume();
> > >
> > > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> > >
> > > I don't understand this code at all, but vcpu_run() does this even before
> > > cond_resched().
> >
> > Yeah, I noticed that it's dropped around cond_resched().
> >
> > My understanding is that for voluntary giving up the CPU via cond_resched()
> > it needs to be dropped.
> >
> > For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
> > whole code section after preempt_enable() is fully preemptible.
> >
> > Now the 1Mio$ question is whether any of the notify functions invokes
> > cond_resched() and whether that really matters. Paolo?
>
> cond_resched() is called via tracehook_notify_resume()->task_work_run(),
> and "kernel code can only call cond_resched() in places where it ...
> cannot hold references to any RCU-protected data structures" according to
> https://lwn.net/Articles/603252/.

This is SRCU, you can reschedule while holding that just fine. It will
just delay some kvm operations, like the memslot stuff. I don't think it
is terrible to keep it, but people more versed in KVM might know of a
good reason to drop it anyway.

2019-08-02 04:26:54

by Thomas Gleixner

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

On Thu, 1 Aug 2019, Thomas Gleixner wrote:

> On Thu, 1 Aug 2019, Sean Christopherson wrote:
> > On Thu, Aug 01, 2019 at 08:34:53PM +0200, Thomas Gleixner wrote:
> > > On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > > > On 08/01, Thomas Gleixner wrote:
> > > > >
> > > > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > > > > ++vcpu->stat.signal_exits;
> > > > > break;
> > > > > }
> > > > > +
> > > > > + if (notify_resume_pending())
> > > > > + tracehook_handle_notify_resume();
> > > >
> > > > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> > > >
> > > > I don't understand this code at all, but vcpu_run() does this even before
> > > > cond_resched().
> > >
> > > Yeah, I noticed that it's dropped around cond_resched().
> > >
> > > My understanding is that for voluntary giving up the CPU via cond_resched()
> > > it needs to be dropped.
> > >
> > > For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
> > > whole code section after preempt_enable() is fully preemptible.
> > >
> > > Now the 1Mio$ question is whether any of the notify functions invokes
> > > cond_resched() and whether that really matters. Paolo?
> >
> > cond_resched() is called via tracehook_notify_resume()->task_work_run(),
> > and "kernel code can only call cond_resched() in places where it ...
> > cannot hold references to any RCU-protected data structures" according to
> > https://lwn.net/Articles/603252/.
>
> Right you are.

Bah. Hit send too fast.

Right you are about cond_resched() being called, but for SRCU this does not
matter unless there is some way to do a synchronize operation on that SRCU
entity. It might have some other performance side effect though.

Thanks,

tglx


2019-08-02 04:55:26

by Thomas Gleixner

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

On Thu, 1 Aug 2019, Sean Christopherson wrote:
> On Thu, Aug 01, 2019 at 08:34:53PM +0200, Thomas Gleixner wrote:
> > On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > > On 08/01, Thomas Gleixner wrote:
> > > >
> > > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > > > ++vcpu->stat.signal_exits;
> > > > break;
> > > > }
> > > > +
> > > > + if (notify_resume_pending())
> > > > + tracehook_handle_notify_resume();
> > >
> > > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> > >
> > > I don't understand this code at all, but vcpu_run() does this even before
> > > cond_resched().
> >
> > Yeah, I noticed that it's dropped around cond_resched().
> >
> > My understanding is that for voluntary giving up the CPU via cond_resched()
> > it needs to be dropped.
> >
> > For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
> > whole code section after preempt_enable() is fully preemptible.
> >
> > Now the 1Mio$ question is whether any of the notify functions invokes
> > cond_resched() and whether that really matters. Paolo?
>
> cond_resched() is called via tracehook_notify_resume()->task_work_run(),
> and "kernel code can only call cond_resched() in places where it ...
> cannot hold references to any RCU-protected data structures" according to
> https://lwn.net/Articles/603252/.

Right you are.

2019-08-02 22:08:31

by Oleg Nesterov

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

On 08/01, Thomas Gleixner wrote:
>
> On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > On 08/01, Thomas Gleixner wrote:
> > >
> > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > > ++vcpu->stat.signal_exits;
> > > break;
> > > }
> > > +
> > > + if (notify_resume_pending())
> > > + tracehook_handle_notify_resume();
> >
> > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> >
> > I don't understand this code at all, but vcpu_run() does this even before
> > cond_resched().
>
> Yeah, I noticed that it's dropped around cond_resched().
>
> My understanding is that for voluntary giving up the CPU via cond_resched()
> it needs to be dropped.

I am not sure it really needs to be dropped, but this doesn't matter.

tracehook_handle_notify_resume() can do "anything", say it can run the
works queued by systemtap. I don't think it should delay synchronize_srcu().
And maybe this is simply unsafe, even if I don't think a task_work can
ever call synchronize_srcu(kvm->srcu) directly.

Oleg.

2019-08-04 03:02:23

by Paolo Bonzini

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

On 01/08/19 23:47, Thomas Gleixner wrote:
> Right you are about cond_resched() being called, but for SRCU this does not
> matter unless there is some way to do a synchronize operation on that SRCU
> entity. It might have some other performance side effect though.

I would use srcu_read_unlock/lock around the call.

However, I'm wondering if the API can be improved because basically we
have six functions for three checks of TIF flags. Does it make sense to
have something like task_has_request_flags and task_do_requests (names
are horrible I know) that can be used like

if (task_has_request_flags()) {
        int err;
        ...srcu_read_unlock...
        // return -EINTR if signal_pending
        err = task_do_requests();
        if (err < 0)
                goto exit_no_srcu_read_unlock;
        ...srcu_read_lock...
}

taking care of all three cases with a single hook? This is important
especially because all these checks are done by all KVM architectures in
slightly different ways, and a unified API would be a good reason to
make all architectures look the same.

(Of course I could also define this unified API in virt/kvm/kvm_main.c,
so this is not blocking the series in any way!).

Paolo

2019-08-04 03:08:51

by Thomas Gleixner

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT

On Fri, 2 Aug 2019, Paolo Bonzini wrote:
> On 01/08/19 23:47, Thomas Gleixner wrote:
> > Right you are about cond_resched() being called, but for SRCU this does not
> > matter unless there is some way to do a synchronize operation on that SRCU
> > entity. It might have some other performance side effect though.
>
> I would use srcu_read_unlock/lock around the call.
>
> However, I'm wondering if the API can be improved because basically we
> have six functions for three checks of TIF flags. Does it make sense to
> have something like task_has_request_flags and task_do_requests (names
> are horrible I know) that can be used like
>
> if (task_has_request_flags()) {
>         int err;
>         ...srcu_read_unlock...
>         // return -EINTR if signal_pending
>         err = task_do_requests();
>         if (err < 0)
>                 goto exit_no_srcu_read_unlock;
>         ...srcu_read_lock...
> }
>
> taking care of all three cases with a single hook? This is important
> especially because all these checks are done by all KVM architectures in
> slightly different ways, and a unified API would be a good reason to
> make all architectures look the same.
>
> (Of course I could also define this unified API in virt/kvm/kvm_main.c,
> so this is not blocking the series in any way!).

You're not holding anything up. Having a common function for this is
definitely the right approach.

This is virt specific because it only checks non-arch-specific bits
(TIF_NOTIFY_RESUME should be available on all KVM archs), and those bits
are a subset of the available TIF bits because the others make no sense
there. So this really should be a common function for KVM, so that all
other archs, which obviously lack a TIF_NOTIFY_RESUME check, can be fixed
up and consolidated. If we add another TIF check later, we then only have
to do it in one place.

Thanks,

tglx

2019-08-04 03:11:18

by Andy Lutomirski

Subject: Re: [patch 2/5] x86/kvm: Handle task_work on VMENTER/EXIT



> On Aug 2, 2019, at 3:22 PM, Thomas Gleixner <[email protected]> wrote:
>
>> On Fri, 2 Aug 2019, Paolo Bonzini wrote:
>>> On 01/08/19 23:47, Thomas Gleixner wrote:
>>> Right you are about cond_resched() being called, but for SRCU this does not
>>> matter unless there is some way to do a synchronize operation on that SRCU
>>> entity. It might have some other performance side effect though.
>>
>> I would use srcu_read_unlock/lock around the call.
>>
>> However, I'm wondering if the API can be improved because basically we
>> have six functions for three checks of TIF flags. Does it make sense to
>> have something like task_has_request_flags and task_do_requests (names
>> are horrible I know) that can be used like
>>
>> if (task_has_request_flags()) {
>>         int err;
>>         ...srcu_read_unlock...
>>         // return -EINTR if signal_pending
>>         err = task_do_requests();
>>         if (err < 0)
>>                 goto exit_no_srcu_read_unlock;
>>         ...srcu_read_lock...
>> }
>>
>> taking care of all three cases with a single hook? This is important
>> especially because all these checks are done by all KVM architectures in
>> slightly different ways, and a unified API would be a good reason to
>> make all architectures look the same.
>>
>> (Of course I could also define this unified API in virt/kvm/kvm_main.c,
>> so this is not blocking the series in any way!).
>
> You're not holding up something. Having a common function for this is
> definitely the right approach.
>
> As this is virt specific because it only checks for non arch specific bits
> (TIF_NOTIFY_RESUME should be available for all KVM archs) and the TIF bits
> are a subset of the available TIF bits because all others do not make any
> sense there, this really should be a common function for KVM so that all
> other archs which obviously lack a TIF_NOTIFY_RESUME check, can be fixed up
> and consolidated. If we add another TIF check later then we only have to do
> it in one place.
>
>

If we add a real API for this, can we make it, or a very similar API, work for exit_to_usermode_loop() too? Maybe:

bool usermode_work_pending(void);
bool guestmode_work_pending(void);

void do_usermode_work(void);
void do_guestmode_work(void);

The first two are called with IRQs off. The latter two are called with IRQs on.