Hi, Chris
I noticed your task isolation patch set at
https://lkml.org/lkml/2016/8/9/759 . Thanks a lot for the great effort.
I checked the patch and have some questions about how to use this
functionality in an NFV environment. In the NFV scenario, high-speed network
code runs in the VM's user space, and the VM is hosted by a KVM hypervisor. To
achieve high network speed, not only the network code but also the vCPU thread
should be isolated.
I checked your patch to see how to isolate the vCPU thread, and I have
some questions; I hope you can give some hints:
a) If task isolation requires a prctl() by which the task marks itself as
isolated, the vCPU thread possibly can't do that. First, the vCPU thread may
need system services while the guest OS is booting; second, it's the
application, rather than the vCPU thread itself, that decides whether the
vCPU thread should be isolated. So possibly we need a mechanism so that
another process can set up the vCPU thread's task isolation?
b) I noticed that currently task_isolation_prepare() is invoked from
exit_to_usermode_loop(), while we may need the same work on the VM exit path,
or in the interrupt handling done as part of VM exit handling.
I'm also considering how to use this task isolation for network workloads
that are not busy-loop, like interrupt-mode DPDK code. In interrupt mode, the
DPDK application may yield the CPU if there has been no network packet for a
very long time, and it re-enters busy-loop mode once a new packet arrives.
This interrupt mode helps both power management and resource utilization. Per
my understanding, the application should turn off task isolation before
yielding the CPU and restart task isolation after the new packet arrives,
right?
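To make sure I understand the flow, I sketched it roughly as below. This
assumes the prctl() interface from your series (PR_SET_TASK_ISOLATION with
PR_TASK_ISOLATION_ENABLE; the numeric values are copied from my reading of the
patches and may need adjusting), and poll_packets() plus the epoll fd for the
Rx interrupt are just hypothetical stand-ins for the DPDK side:

#include <stdio.h>
#include <sys/epoll.h>
#include <sys/prctl.h>

/* Constants from the task-isolation series; not in upstream headers yet. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION    48
#define PR_TASK_ISOLATION_ENABLE (1 << 0)
#endif

/* Hypothetical: busy-polls the NIC, returns 0 once it has been idle too long. */
extern int poll_packets(void);

/* epfd is an epoll fd armed with DPDK's Rx interrupt fd (setup omitted). */
void rx_loop(int epfd)
{
    struct epoll_event ev;

    for (;;) {
        /* Enter isolated busy-poll mode while packets keep arriving. */
        if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) < 0)
            perror("prctl enable");

        while (poll_packets())
            ;

        /* Turn isolation off before we block in the kernel ... */
        prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);

        /* ... yield the CPU until the NIC raises an interrupt, then loop. */
        epoll_wait(epfd, &ev, 1, -1);
    }
}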
Thanks
--jyh
On 12/1/2016 5:28 PM, yunhong jiang wrote:
> Hi, Chris
> I noticed your task isolation patch set at
> https://lkml.org/lkml/2016/8/9/759 . Thanks a lot for the great effort.
>
> I checked the patch and have some questions about how to use this
> functionality in an NFV environment. In the NFV scenario, high-speed network
> code runs in the VM's user space, and the VM is hosted by a KVM hypervisor.
> To achieve high network speed, not only the network code but also the vCPU
> thread should be isolated.
That's true.
> I checked your patch to see how to isolate the vCPU thread, and I have
> some questions; I hope you can give some hints:
>
> a) If task isolation requires a prctl() by which the task marks itself as
> isolated, the vCPU thread possibly can't do that. First, the vCPU thread may
> need system services while the guest OS is booting; second, it's the
> application, rather than the vCPU thread itself, that decides whether the
> vCPU thread should be isolated. So possibly we need a mechanism so that
> another process can set up the vCPU thread's task isolation?
These are good questions.  I think that we would probably want to
add a KVM mode that did the prctl() before transitioning back to the
guest. But then, in the same way that we currently allow another
prctl() from a task-isolated userspace process, we'd probably need to
allow a KVM exit from a guest to happen without triggering any
task-isolation errors.
Alternately, if we add the proposed NOSIG mode, then the hypervisor
can use that, and simply stay in task-isolation mode no matter what
system support is requested.
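Roughly, the vCPU thread would do something like this before entering its run
loop (a minimal sketch; PR_TASK_ISOLATION_NOSIG is just a placeholder name and
bit for the proposed mode, and the other constants are from the posted series):

#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION    48        /* value from the posted series */
#define PR_TASK_ISOLATION_ENABLE (1 << 0)
#define PR_TASK_ISOLATION_NOSIG  (1 << 1)  /* hypothetical flag for the proposed mode */
#endif

static void enable_vcpu_isolation(void)
{
    /* With NOSIG, later kernel entries (e.g. KVM exits) would not raise
     * a fatal signal; the thread just stays in isolation mode. */
    prctl(PR_SET_TASK_ISOLATION,
          PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG, 0, 0, 0);
}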
Presumably in the guest, we'd run task isolation as well, but
presumably we'd only exit to the hypervisor if we were already in the
guest kernel?  Although I guess you could imagine providing trapping
mmio mappings to the guest's userspace process that caused a
hypervisor exit. Perhaps we extend the hypervisor to know that a
guest might support task isolation, and to generate a warning on
behalf of the guest if the hypervisor sees an exit from the guest's
userspace and the guest task is in task-isolation mode?
> b) I noticed that currently task_isolation_prepare() is invoked from
> exit_to_usermode_loop(), while we may need the same work on the VM exit path,
> or in the interrupt handling done as part of VM exit handling.
Yes, something like that. I am not super-familiar with the KVM
internals (I did a KVM port to the tile architecture a while back, but
I'd have to swap all that memory back in before I could even have a
half-educated opinion.)
> I'm also considering how to use this task isolation for network workloads
> that are not busy-loop, like interrupt-mode DPDK code. In interrupt mode, the
> DPDK application may yield the CPU if there has been no network packet for a
> very long time, and it re-enters busy-loop mode once a new packet arrives.
> This interrupt mode helps both power management and resource utilization. Per
> my understanding, the application should turn off task isolation before
> yielding the CPU and restart task isolation after the new packet arrives,
> right?
Yes, exactly.
I am not likely to pursue KVM myself, at least until the basic patch series
has been accepted upstream.
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
On Fri, 2 Dec 2016 13:58:08 -0500
Chris Metcalf <[email protected]> wrote:
> On 12/1/2016 5:28 PM, yunhong jiang wrote:
> > Hi, Chris
> > I noticed your task isolation patch set at
> > https://lkml.org/lkml/2016/8/9/759 . Thanks a lot for the great
> > effort.
> >
> > I checked the patch and have some questions about how to use this
> > functionality in an NFV environment. In the NFV scenario, high-speed
> > network code runs in the VM's user space, and the VM is hosted by a KVM
> > hypervisor. To achieve high network speed, not only the network code but
> > also the vCPU thread should be isolated.
>
> That's true.
>
> > I checked your patch to see how to isolate the vCPU thread, and I
> > have some questions; I hope you can give some hints:
> >
> > a) If task isolation requires a prctl() by which the task marks itself
> > as isolated, the vCPU thread possibly can't do that. First, the vCPU
> > thread may need system services while the guest OS is booting; second,
> > it's the application, rather than the vCPU thread itself, that decides
> > whether the vCPU thread should be isolated. So possibly we need a
> > mechanism so that another process can set up the vCPU thread's task
> > isolation?
>
> These are good questions.  I think that we would probably want to
> add a KVM mode that did the prctl() before transitioning back to the
Would a prctl() when going back to the guest be too heavy?
> guest. But then, in the same way that we currently allow another
> prctl() from a task-isolated userspace process, we'd probably need to
You mean that with your current patch we can already do the prctl() from a
3rd-party process to task-isolate a userspace process? Sorry that I didn't
notice that part.
> allow a KVM exit from a guest to happen without triggering any
> task-isolation errors.
Yes.
>
> Alternately, if we add the proposed NOSIG mode, then the hypervisor
> can use that, and simply stay in task-isolation mode no matter what
> system support is requested.
Yes, possibly we will keep the vCPU thread in NOSIG mode all the time.
>
> Presumably in the guest, we'd run task isolation as well, but
> presumably we'd only exit to the hypervisor if we were already in the
> guest kernel? Although I guess you could imagine providing trapping
> mmio mappings to the guest's userspace process that caused a
> hypervisor exit. Perhaps we extend the hypervisor to know that a
> guest might support task isolation, and to generate a warning on
> behalf of the guest if the hypervisor sees an exit from the guest's
> userspace and the guest task is in task-isolation mode?
>
> > b) I noticed that currently task_isolation_prepare() is invoked from
> > exit_to_usermode_loop(), while we may need the same work on the VM exit
> > path, or in the interrupt handling done as part of VM exit handling.
>
> Yes, something like that. I am not super-familiar with the KVM
> internals (I did a KVM port to the tile architecture a while back, but
> I'd have to swap all that memory back in before I could even have a
> half-educated opinion.)
Got it.
>
> > I'm also considering how to use this task isolation for network
> > workloads that are not busy-loop, like interrupt-mode DPDK code. In
> > interrupt mode, the DPDK application may yield the CPU if there has been
> > no network packet for a very long time, and it re-enters busy-loop mode
> > once a new packet arrives. This interrupt mode helps both power management
> > and resource utilization. Per my understanding, the application should
> > turn off task isolation before yielding the CPU and restart task isolation
> > after the new packet arrives, right?
>
> Yes, exactly.
>
> I am not likely to pursue KVM myself, at least until the basic patch
> series has been accepted upstream.
Yes, and I will be more than happy to help with this.
Thanks
-jyh
>
Sorry for the slow response - I have been busy with some other things.
On 12/6/2016 4:43 PM, yunhong jiang wrote:
> On Fri, 2 Dec 2016 13:58:08 -0500
> Chris Metcalf <[email protected]> wrote:
>
>> On 12/1/2016 5:28 PM, yunhong jiang wrote:
>>> a) If task isolation requires a prctl() by which the task marks itself
>>> as isolated, the vCPU thread possibly can't do that. First, the vCPU
>>> thread may need system services while the guest OS is booting; second,
>>> it's the application, rather than the vCPU thread itself, that decides
>>> whether the vCPU thread should be isolated. So possibly we need a
>>> mechanism so that another process can set up the vCPU thread's task
>>> isolation?
>> These are good questions.  I think that we would probably want to
>> add a KVM mode that did the prctl() before transitioning back to the
> Would a prctl() when going back to the guest be too heavy?
It's a good question; it can be heavy. But the design for task isolation is that
the task isolated process is always running in userspace anyway. If you are
transitioning in and out of the guest or host kernels frequently, you probably
should not be using task isolation, but just regular NOHZ_FULL.
>> guest. But then, in the same way that we currently allow another
>> prctl() from a task-isolated userspace process, we'd probably need to
> You mean that with your current patch we can already do the prctl() from a
> 3rd-party process to task-isolate a userspace process? Sorry that I didn't
> notice that part.
Sorry, I think I wasn't clear. Normally when you are running task isolated
and you enter the kernel, you will get a fatal signal. The exception is if you
call prctl itself (or exit), the kernel tolerates it without a signal, since obviously
that's how you need to cleanly tell the kernel you are done with task isolation.
My point in the previous email was that we might need to similarly tolerate
a guest exit without causing a fatal signal to the userspace process. But as
I think about it, that's probably not true; we probably would want to notify
the guest kernel of the task isolation violation and have it kill the userspace
process just as if it had entered the guest kernel.
Perhaps the way to drive this is to have task isolation be triggered from
the guest's prctl up to the host, so there's some kind of KVM exit to
the host that indicates that the guest has a userspace process that
wants to run task isolated, at which point qemu invokes task isolation
on behalf of the guest then returns to the guest to set up its own
virtualized task isolation. It does get confusing!
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
On Fri, 16 Dec 2016 16:00:48 -0500
Chris Metcalf <[email protected]> wrote:
> Sorry for the slow response - I have been busy with some other things.
Thanks for the reply.
>
> On 12/6/2016 4:43 PM, yunhong jiang wrote:
> > On Fri, 2 Dec 2016 13:58:08 -0500
> > Chris Metcalf <[email protected]> wrote:
> >
> >> On 12/1/2016 5:28 PM, yunhong jiang wrote:
> >>> a) If the task isolation need prctl to mark itself as isolated,
> >>> possibly the vCPU thread can't achieve it. First, the vCPU thread
> >>> may need system service during OS booting time, also it's the
> >>> application, instead of the vCPU thread to decide if the vCPU
> >>> thread should be isolated. So possibly we need a mechanism so that
> >>> another process can set the vCPU thread's task isolation?
> >> These are good questions.  I think that we would probably want
> >> to add a KVM mode that did the prctl() before transitioning back
> >> to the
> > Would a prctl() when going back to the guest be too heavy?
>
> It's a good question; it can be heavy. But the design for task
> isolation is that the task isolated process is always running in
> userspace anyway. If you are transitioning in and out of the guest
> or host kernels frequently, you probably should not be using task
> isolation, but just regular NOHZ_FULL.
As you pointed out later, guest task isolation does not guarantee that there
will be no VM exits to the host, although we hope we can achieve a
VM-exit-free situation in the future.
>
> >> guest. But then, in the same way that we currently allow another
> >> prctl() from a task-isolated userspace process, we'd probably need
> >> to
> > You mean that with your current patch we can already do the prctl() from
> > a 3rd-party process to task-isolate a userspace process? Sorry that I
> > didn't notice that part.
>
> Sorry, I think I wasn't clear. Normally when you are running task
> isolated and you enter the kernel, you will get a fatal signal. The
> exception is if you call prctl itself (or exit), the kernel tolerates
> it without a signal, since obviously that's how you need to cleanly
> tell the kernel you are done with task isolation.
>
> My point in the previous email was that we might need to similarly
> tolerate a guest exit without causing a fatal signal to the userspace
> process. But as I think about it, that's probably not true; we
> probably would want to notify the guest kernel of the task isolation
> violation and have it kill the userspace process just as if it had
> entered the guest kernel.
Thanks for the clarification. It's clear now.
>
> Perhaps the way to drive this is to have task isolation be triggered
> from the guest's prctl up to the host, so there's some kind of KVM
> exit to the host that indicates that the guest has a userspace
> process that wants to run task isolated, at which point qemu invokes
> task isolation on behalf of the guest then returns to the guest to
> set up its own virtualized task isolation. It does get confusing!
Hmm, a PV solution is always an option in the virtualization world.
>
BTW, currently both CPU isolation and task isolation require kernel parameters
to reserve CPUs in advance. Possibly we can extend this to be dynamic, e.g.
through sysfs, in the future to avoid wasting resources.
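For example, today the isolated CPUs have to be carved out statically at boot
with something like "nohz_full=1-15 isolcpus=1-15" on the kernel command line
(plus whatever cpulist parameter your series requires, if I read it correctly),
and those CPUs stay reserved even when no isolated task is running on them.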
--jyh
On 16/12/2016 22:00, Chris Metcalf wrote:
>
> Sorry, I think I wasn't clear. Normally when you are running task
> isolated and you enter the kernel, you will get a fatal signal. The
> exception is if you call prctl itself (or exit), the kernel tolerates
> it without a signal, since obviously that's how you need to cleanly
> tell the kernel you are done with task isolation.
Running in a guest is pretty much the same as running in userspace.
Would it be possible to exclude the KVM_RUN ioctl as well? QEMU would
still have to run prctl when a CPU goes to sleep, and KVM_RUN would have
to enable/disable isolated mode when a VM executes HLT (which should
never happen anyway in NFV scenarios).
Paolo
On 12/20/2016 4:27 AM, Paolo Bonzini wrote:
> On 16/12/2016 22:00, Chris Metcalf wrote:
>> Sorry, I think I wasn't clear. Normally when you are running task
>> isolated and you enter the kernel, you will get a fatal signal. The
>> exception is if you call prctl itself (or exit), the kernel tolerates
>> it without a signal, since obviously that's how you need to cleanly
>> tell the kernel you are done with task isolation.
> Running in a guest is pretty much the same as running in userspace.
> Would it be possible to exclude the KVM_RUN ioctl as well? QEMU would
> still have to run prctl when a CPU goes to sleep, and KVM_RUN would have
> to enable/disable isolated mode when a VM executes HLT (which should
> never happen anyway in NFV scenarios).
I think that probably makes sense. The flow would be that qemu first executes
the prctl() for task isolation, then the KVM_RUN ioctl. We obviously can't
do it in the other order, so we'd need to make task isolation tolerate KVM_RUN.
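Concretely, the vCPU thread would look something like this sketch (the
task-isolation constants are from the posted series, not upstream headers, and
the usual KVM_CREATE_VCPU / mmap setup of vcpu_fd and run is omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION    48        /* value from the posted series */
#define PR_TASK_ISOLATION_ENABLE (1 << 0)
#endif

static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        /* First the prctl() for task isolation ... */
        prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);

        /* ... then KVM_RUN, which task isolation would have to
         * tolerate rather than treat as a violation. */
        ioctl(vcpu_fd, KVM_RUN, 0);

        if (run->exit_reason == KVM_EXIT_HLT) {
            /* Guest went idle: drop isolation before qemu blocks
             * waiting for the next event, as Paolo suggests. */
            prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
            /* ... wait for a wakeup, then loop around ... */
        }
    }
}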
I won't try to do it for my next patch series (based on 4.10) though, since I'd
like to get the basic support upstreamed before trying to extend it.
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com