2016-03-04 12:56:13

by Frederic Weisbecker

Subject: Re: [PATCH v9 04/13] task_isolation: add initial support

On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
> On 01/30/2016 04:11 PM, Frederic Weisbecker wrote:
> >We have reverted the patch that made isolcpus |= nohz_full. Too
> >many people complained about unusable machines with NO_HZ_FULL_ALL
> >
> >But the user can still set that parameter manually.
>
> Yes. What I was suggesting is that if the user specifies task_isolation=X-Y
> we should add cpus X-Y to both the nohz_full set and the isolcpus set.
> I've changed it to work that way for the v10 patch series.

Ok.

>
>
> >>>>>>+bool _task_isolation_ready(void)
> >>>>>>+{
> >>>>>>+	WARN_ON_ONCE(!irqs_disabled());
> >>>>>>+
> >>>>>>+	/* If we need to drain the LRU cache, we're not ready. */
> >>>>>>+	if (lru_add_drain_needed(smp_processor_id()))
> >>>>>>+		return false;
> >>>>>>+
> >>>>>>+	/* If vmstats need updating, we're not ready. */
> >>>>>>+	if (!vmstat_idle())
> >>>>>>+		return false;
> >>>>>>+
> >>>>>>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>>>>>+	if (!tick_nohz_tick_stopped()) {
> >>>>>>+		set_tsk_need_resched(current);
> >>>>>I'm not sure doing this will help getting the tick to get stopped.
> >>>>Well, I don't know that there is anything else we CAN do, right? If there's
> >>>>another task that can run, great - it may be that that's why full dynticks
> >>>>isn't happening yet. Or, it might be that we're waiting for an RCU tick and
> >>>>there's nothing else we can do, in which case we basically spend our time
> >>>>going around through the scheduler code and back out to the
> >>>>task_isolation_ready() test, but again, there's really nothing else more
> >>>>useful we can be doing at this point. Once the RCU tick fires (or whatever
> >>>>it was that was preventing full dynticks from engaging), we will pass this
> >>>>test and return to user space.
> >>>There is nothing at all you can do and setting TIF_RESCHED won't help either.
> >>>If there is another task that can run, the scheduler takes care of resched
> >>>by itself :-)
> >>The problem is that the scheduler will only take care of resched at a
> >>later time, typically when we get a timer interrupt later.
> >When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
> >target is remote it sends an IPI, if it's local then we wait the next reschedule
> >point (preemption points, voluntary reschedule, interrupts). There is just nothing
> >you can do to accelerate that.
>
> But that's exactly what I'm saying. If we're sitting in a loop here waiting
> for some short-lived process (maybe kernel thread) to run and get out of
> the way, we don't want to just spin sitting in prepare_exit_to_usermode().
> We want to call schedule(), get the short-lived process to run, then when
> it calls schedule() again, we're back in prepare_exit_to_usermode but now
> we can return to userspace.

Maybe, although I think returning to userspace with -EAGAIN or -EBUSY, or something like
that, would be better, so that userspace retries a bit later with prctl(). Otherwise we
may well be waiting forever in kernel mode.
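
To illustrate what I mean on the userspace side (just a sketch; the -EAGAIN/-EBUSY
return is exactly the part that doesn't exist in your series, and the PR_* constants
come from your <linux/prctl.h>):

#include <errno.h>
#include <err.h>
#include <unistd.h>
#include <sys/prctl.h>	/* PR_SET_TASK_ISOLATION etc. from the patched tree */

/* Hypothetical: assumes the prctl() is changed to return -EAGAIN/-EBUSY
 * instead of looping in the kernel until the cpu is quiesced. */
static void wait_for_isolation(void)
{
	while (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
		     0, 0, 0) != 0) {
		if (errno != EAGAIN && errno != EBUSY)
			err(1, "prctl(PR_SET_TASK_ISOLATION)");
		usleep(2000);	/* retry a bit later */
	}
}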

>
> We don't want to wait for preemption points or interrupts, and there are
> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
>
> If the other task had been woken up for some completion, then yes we would
> already have had TIF_RESCHED set, but if the other runnable task was (for
> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
> this point, and thus we might need to call schedule() explicitly.

There can't be another task in the runqueue waiting to be preempted since
we (the current task) are running on the CPU.

Besides, if we aren't alone in the runqueue, this breaks the task isolation
mode.

>
> Note that the prepare_exit_to_usermode() loop is exactly the point at
> which we normally call schedule() if we are in syscall exit, so we are
> just encouraging that schedule() to happen if otherwise it might not.
>
> >>By invoking the scheduler here, we allow any tasks that are ready to run to run
> >>immediately, rather than waiting for an interrupt to wake the scheduler.
> >Well, in this case here we are interested in the current CPU. And if a task
> >got awoken and waits for the current CPU, it will have an opportunity to get
> >schedule on syscall exit.
>
> That's true if TIF_RESCHED was set because a completion occurred that
> the other task was waiting for. But there might not be any such completion
> and the task just got preempted earlier and is still ready to run.

But if another task is waiting for the CPU, that breaks task isolation mode anyway. And
even assuming we want a pending task to resume so that we get the CPU to ourselves, we
have no idea whether the scheduler is actually going to pick that task; it depends on
vruntime and other things. TIF_RESCHED only makes us enter the scheduler; it doesn't
guarantee any context switch.

> My point is that setting TIF_RESCHED is never harmful, and there are
> cases like involuntary preemption where it might help.

Sure but we don't write code just because it doesn't harm. Strange code hurts
the brain of reviewers.

Now concerning involuntary preemption, it's a matter of a millisecond; userspace
needs to wait a few milliseconds before retrying anyway. Sleeping at that point is
what can be useful, as we leave the CPU to the resuming task.

Also, if we have any other task on the runqueue at all, whether we hope it resumes quickly
or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
correctly, or there is a kernel thread that might run again at some point ahead.

>
> >>Plenty of places in the kernel just call schedule() directly when they are
> >>waiting. Since we're waiting here regardless, we might as well
> >>immediately get any other runnable tasks dealt with.
> >>
> >>We could also just return "false" in _task_isolation_ready(), and then
> >>check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
> >>call schedule() explicitly there, but that seems a little more roundabout.
> >>Admittedly it's more usual to see kernel code call schedule() directly
> >>to yield the processor, but in this case I'm not convinced it's cleaner
> >>given we're already in a loop where the caller is checking TIF_RESCHED
> >>and then calling schedule() when it's set.
> >You could call cond_resched(), but really syscall exit is enough for what
> >you want. And the problem here if a task prevents the CPU from stopping the
> >tick is that task itself, not the fact it doesn't get scheduled.
>
> True, although in that case we just need to wait (e.g. for an RCU tick
> to occur to quiesce); we could spin, but spinning through the scheduler
> seems no better or worse in that case then just spinning with
> interrupts enabled in a loop. And (as I said above) it could help.

Let's just leave that waiting to userspace. Just sleep a few milliseconds.

>
> >If we have
> >other tasks than the current isolated one on the CPU, it means that the
> >environment is not ready for hard isolation.
>
> Right. But the model is that in that case, the task that wants hard
> isolation is just going to have to wait to return to userspace.

I think we shouldn't do that waiting for isolation in the kernel.

>
>
> >And in general: we shouldn't loop at all there: if something depends on the tick,
> >the CPU is not ready for isolation and something needs to be done: setting
> >some task affinity, etc... So we should just fail the prctl and let the user
> >deal with it.
>
> So there are potentially two cases here:
>
> (1) When we initially do the prctl(), should we check to see if there are
> other schedulable tasks, etc., and fail the prctl() if so? You could make a
> case for this, but I think in practice userspace would just end up looping
> back to retry the prctl if we created that semantic in the kernel.

That sounds saner to me. And if we still fail after one second, then just give up.
In fact, if it doesn't work the first time, that's a bad sign, like I said above:
the task that is running on the CPU may well run again later. Some preconditions
are not met.

>
> (2) What about times when we are leaving the kernel after already
> doing the prctl()? For example a core doing packet forwarding might
> want to report some error condition up to the kernel, and remove itself
> from the set of cores handling packets, then do some syscall(s) to generate
> logging data, and then go back and continue handling packets. Or, the
> process might have created some large anonymous mapping where
> every now and then it needs to cross a page boundary for some structure
> and touch a new page, and it knows to expect a page fault in that case.
> In those cases we are returning from the kernel, not at prctl() time, and
> we still want to enforce the semantics that no further interrupts will
> occur to disturb the task. These kinds of use cases are why we have
> as general-purpose a mechanism as we do for task isolation.

If any interrupt or any kind of disturbance happens, we should leave that
task isolation mode and warn the isolated task about that. SIGTERM?

Thanks.


2016-03-09 19:39:55

by Chris Metcalf

Subject: Re: [PATCH v9 04/13] task_isolation: add initial support

Frederic,

Thanks for the detailed feedback on the task isolation stuff.

This reply kind of turned into an essay, so I've added a little "TL;DR"
sentence before each section.


TL;DR: Let's make an explicit decision about whether task isolation
should be "persistent" or "one-shot". Both have some advantages.
=====

An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:

"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely. It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.

"One-shot mode": A task requests isolation via prctl(), the kernel
ensures it is isolated on return from the prctl(), but then as soon as
it enters the kernel again, task isolation is switched off until
another prctl is issued. This is what you recommended in your last
email.

There are a number of pros and cons to the two models. I think on
balance I still like the "persistent mode" approach, but here's all
the pros/cons I can think of:

PRO for persistent mode: A somewhat easier programming model. Users
can just imagine "task isolation" as a way for them to still be able
to use the kernel exactly as they always have; it's just slower to get
back out of the kernel so you use it judiciously. For example, a
process is free to call write() on a socket to perform a diagnostic,
but when returning from the write() syscall, the kernel will hold the
task in kernel mode until any timer ticks (perhaps from networking
stuff) are complete, and then let it return to userspace to continue
in task isolation mode. This is convenient to the user since they
don't have to fret about re-enabling task isolation after that
syscall, page fault, or whatever; they can just continue running.
With your suggestion, the user pretty much has to leave STRICT mode
enabled so he gets notified of any unexpected return to kernel space
(in fact we might make it required so you always get a signal when
leaving task isolation unless it's via a prctl or exit syscall).
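
To make that concrete, here is roughly the usage model I have in mind (sketch only;
the PR_* values are from the series' <linux/prctl.h>, and do_fast_path_work() is
just a placeholder for the application's real work):

#include <unistd.h>
#include <sys/prctl.h>

/* Sketch of a persistent-mode worker: enable isolation once, then just
 * keep running; an occasional diagnostic write() is allowed and we come
 * back out of the kernel still isolated. */
static void isolated_worker(int log_fd)
{
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);

	for (;;) {
		int rare_event = do_fast_path_work();	/* placeholder */

		if (rare_event) {
			static const char msg[] = "diagnostic event\n";
			/* The kernel holds us on syscall exit until the
			 * cpu is quiesced, then returns us isolated. */
			write(log_fd, msg, sizeof(msg) - 1);
		}
	}
}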

PRO for one-shot mode: A somewhat crisper interaction with
sched_setaffinity() etc. With a persistent mode approach, a task can
start up task isolation, then later another task can be placed on its
cpu and break it (it won't return to userspace until killed or the new
process affinitizes itself away or stops running). By contrast, in
one-shot mode, any return to kernel spaces turns off task isolation
anyway, so it's very clear what the interaction looks like. I suspect
this is more a theoretical advantage to one-shot mode than a practical
one, though.

CON for one-shot mode: It's actually hard to catch every kernel entry
so we can turn the task-isolation flag off again - and we really do
need to have a flag, just so that we can suitably debug any bad
actions that bring us into the kernel when we're not expecting it.
Right now there are things that bring us into the kernel that we don't
bother annotating for task isolation STRICT mode, just because they're
visible to the user anyway: e.g., a bus fault or segmentation
violation.

I think we can actually make both modes available to users with just
another flag bit, so maybe we can look at what that looks like in v11:
adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
isolation at the next syscall entry, page fault, etc. Then we can
think more specifically about whether we want to remove the flag or
not, and if we remove it, whether we want to make the code that was
controlled by it unconditionally true or unconditionally false
(i.e. remove it again).


TL;DR: We should be more willing to return -EINVAL from prctl().
=====

One thing you've argued is that we should be more aggressive about
failing the prctl() call. I think, in any case, that this is probably
reasonable. We already check that the task's affinity is limited to
the current core and that that core is a task_isolation cpu; I think we
can also require that can_stop_full_tick() return true (or the moral
equivalent given your recent patch series). This will mean you can't
even try to go into task isolation mode if another task is
schedulable, among other things, which seems like a good thing.
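
In pseudo-code, the prctl()-time check I'm describing would be something like this
(helper and field names are made up for illustration; can_stop_full_tick() is static
to tick-sched.c today, so the real thing needs some exported moral equivalent):

/* Sketch only; these helper names are illustrative, not the real API. */
int task_isolation_request(unsigned int flags)
{
	/* Affinity must be exactly one cpu, and a task_isolation cpu. */
	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
	    !task_isolation_possible(smp_processor_id()))
		return -EINVAL;

	/* Refuse if the tick can't be stopped right now, e.g. because
	 * another task is schedulable on this cpu. */
	if (!tick_nohz_can_stop_tick_now())	/* hypothetical export */
		return -EINVAL;

	current->task_isolation_flags = flags;
	return 0;
}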

However, it is important to note that the current task_isolation_ready
and task_isolation_enter calls that are in the prepare_exit_to_usermode()
routine are still required even with your proposed one-shot mode. We
have to be sure that no interrupts occur on the way back to userspace
that might then in principle lead to timer interrupts being scheduled,
and the way to do that is make sure task_isolation_ready returns true
with interrupts disabled, and interrupts are not then re-enabled before
return to userspace. Anything else is just keeping your fingers
crossed and guessing.
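
For reference, the shape of that loop on x86 looks roughly like this (heavily
simplified, names approximate, and the real code only makes the task-isolation
calls when the task-isolation TIF flag is set):

	/* Simplified sketch of the prepare_exit_to_usermode() loop: */
	local_irq_disable();
	while (true) {
		u32 cached_flags = READ_ONCE(current_thread_info()->flags);

		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
		    task_isolation_ready())
			break;		/* irqs stay off until user mode */

		local_irq_enable();
		if (cached_flags & _TIF_NEED_RESCHED)
			schedule();
		/* ... signal delivery, notify-resume work, etc ... */
		task_isolation_enter();
		local_irq_disable();
	}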


TL;DR: Returning -EBUSY from prctl() isn't really that helpful.
=====

Frederic wonders if we can test for various things not being ready
(dynticks not off yet, etc) and just return -EBUSY and let userspace
do the spinning.

First, note that this is only possible for one-shot mode. For
persistent mode, we have the potential to run up against this on
return from any syscall, and we obviously can't add new error returns
to other syscalls. So it doesn't really make sense to add EBUSY
semantics to prctl if nothing else can use it.

But even in one-shot mode, I'm not really sure what the advantage is
here. We still need to do something like task_isolation_ready() in
the prepare_exit_to_usermode() loop, since that's where we have
interrupts disabled and can do a final assessment of the state of the
kernel for this core. So, while you could imagine having that code
just hook in and call syscall_set_return_value() there instead of
causing things to loop back, that doesn't really save us much
complexity in the kernel, and instead pushes complexity back to
userspace, which may well handle it just by busywaiting on the prctl()
anyway. You might argue that if we just return to userspace, userspace
can sleep briefly and retry, thus avoiding spinning in the scheduler.
But it's relatively easy to do that (or better) in the kernel, so I'm
not sure that's more than a straw man. See the next point.


TL;DR: Should we arrange to actually use a completion in
task_isolation_enter when dynticks are ticking, and call complete()
in tick-sched.c when we shut down dynticks, or, just spin in
schedule() and not worry about burning a little cpu?
=====

One question that keeps getting asked is how useful it is to just call
schedule() while we're waiting for dynticks to shut off, since it
could just be a busy spin going into schedule() over and over. Even
if another task is ready to run we might not switch to it right away.
So one thing we could think about is arranging so that whenever we
turn off dynticks, we also notify any tasks that were waiting for it
to be turned off; that way we can just sleep in task_isolation_enter()
and wait to be notified, thus guaranteeing any other task that wants
to run can run, or even just waiting in cpu idle for a little while.
Does this seem like it's worth coding up? My impression has always
been that we wait pretty briefly for dynticks to shut down, so it
doesn't really matter if we spin - and even if we do spin, in
principle we already arranged for this cpu to be dedicated to this
task anyway, so it doesn't really do anything bad except maybe burn a
little bit of extra cpu power. But I'm willing to be convinced...
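
If we did decide it was worth coding up, I'd imagine something along these lines
(pure sketch, not actual code; initialization and races are hand-waved):

/* One completion per cpu, signaled when the tick actually stops. */
static DEFINE_PER_CPU(struct completion, tick_stopped_done);

/* In task_isolation_enter(), instead of nudging the scheduler: */
	if (!tick_nohz_tick_stopped()) {
		struct completion *done = this_cpu_ptr(&tick_stopped_done);

		reinit_completion(done);
		wait_for_completion(done);	/* sleep until the tick stops */
	}

/* And in tick-sched.c, at the point where we stop the tick: */
	complete_all(this_cpu_ptr(&tick_stopped_done));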


TL;DR: We should turn off task isolation mode for signals.
=====

One thing that occurs to me is that we should arrange so that
any signal delivery turns off task isolation mode. This is
easily documented semantics even in persistent mode, and it
allows the userspace program to run and discover that something bad
has happened, rather than potentially hanging in the kernel trying to
wait for isolation to be possible before calling the signal handler.
I'll make this change for v11 in any case.

Also, doing this is something of a requirement for the proposed
one-shot mode, since if we have STRICT mode enabled, then any entry
into the kernel is either a syscall, or else ends up causing a signal,
and by hooking the signal mechanism we have a place to catch all the
non-syscall entrypoints, more or less.
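
Concretely I expect it to be just a few lines in the signal setup path, something
like this (flag and field names approximate):

	/* In the signal delivery path (e.g. get_signal()), approximately: */
	if (test_thread_flag(TIF_TASK_ISOLATION)) {
		/* Any signal delivery turns task isolation off. */
		clear_thread_flag(TIF_TASK_ISOLATION);
		current->task_isolation_flags = 0;
	}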


TL;DR: Maybe we should use seccomp for STRICT mode syscall detection.
=====

This is being investigated in a separate email thread with Andy
Lutomirski. Whether it gets included in v11 is still TBD.


TL;DR: Various minor issues in answer to Frederic's comments :-)
=====

On 03/04/2016 07:56 AM, Frederic Weisbecker wrote:
> On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:
>> We don't want to wait for preemption points or interrupts, and there are
>> no other voluntary reschedules in the prepare_exit_to_usermode() loop.
>>
>> If the other task had been woken up for some completion, then yes we would
>> already have had TIF_RESCHED set, but if the other runnable task was (for
>> example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
>> this point, and thus we might need to call schedule() explicitly.
>
> There can't be another task in the runqueue waiting to be preempted since
> we (the current task) are running on the CPU.

My earlier sentence may not have been clear. By saying "if the other
runnable task was pre-empted on a timer tick", I meant that
TIF_RESCHED wasn't set on our task, and we'd only eventually schedule
to that other task once a timer interrupt fired and ended our
scheduler slice. I know you can't have a different task in the
runqueue waiting to be preempted, since that doesn't make sense :-)

> Besides, if we aren't alone in the runqueue, this breaks the task isolation
> mode.

Indeed. We can and will do better catching that at prctl() time.
So the question is, if we adopt the "persistent mode", how do we
handle this case on some other return from kernel space?

>>>> By invoking the scheduler here, we allow any tasks that are ready to run to run
>>>> immediately, rather than waiting for an interrupt to wake the scheduler.
>>> Well, in this case here we are interested in the current CPU. And if a task
>>> got awoken and waits for the current CPU, it will have an opportunity to get
>>> schedule on syscall exit.
>>
>> That's true if TIF_RESCHED was set because a completion occurred that
>> the other task was waiting for. But there might not be any such completion
>> and the task just got preempted earlier and is still ready to run.
>
> But if another task waits for the CPU, this break task isolation mode. Now
> assuming we want a pending task to resume such that we get the CPU for ourself,
> we have no idea if the scheduler is going to schedule that task, it depends on
> vruntime and other things. TIF_RESCHED only make entering the scheduler, it doesn't
> guarantee any context switch.

Yes, true. So we have to decide if we feel spinning into the
scheduler is so harmful that we should set up a new completion driven
by entering dynticks fullmode, and handle it that way instead.

>> My point is that setting TIF_RESCHED is never harmful, and there are
>> cases like involuntary preemption where it might help.
>
> Sure but we don't write code just because it doesn't harm. Strange code hurts
> the brain of reviewers.

Fair enough, and certainly means at a minimum we need a good comment there!

> Now concerning involuntary preemption, it's a matter of a millisecond, userspace
> needs to wait a few millisecond before retrying anyway. Sleeping at that point is
> what can be useful as we leave the CPU for the resuming task.
>
> Also if we have any task on the runqueue anyway, whether we hope that it resumes quickly
> or not, it's a very bad sign for a task isolation session. Either we did not affine tasks
> correctly or there is a kernel thread that might run again at some time ahead.

Note that it might also be a one-time kernel task or kworker that is
queued by some random syscall in "persistent mode" and we need to let
it run until it quiesces again. Then we can context switch back to
our task isolation task, and safely return from it to userspace.

>> (2) What about times when we are leaving the kernel after already
>> doing the prctl()? For example a core doing packet forwarding might
>> want to report some error condition up to the kernel, and remove itself
>> from the set of cores handling packets, then do some syscall(s) to generate
>> logging data, and then go back and continue handling packets. Or, the
>> process might have created some large anonymous mapping where
>> every now and then it needs to cross a page boundary for some structure
>> and touch a new page, and it knows to expect a page fault in that case.
>> In those cases we are returning from the kernel, not at prctl() time, and
>> we still want to enforce the semantics that no further interrupts will
>> occur to disturb the task. These kinds of use cases are why we have
>> as general-purpose a mechanism as we do for task isolation.
>
> If any interrupt or any kind of disturbance happens, we should leave that
> task isolation mode and warn the isolated task about that. SIGTERM?

That's the goal of STRICT mode. By default it uses SIGTERM. You can
also choose a different signal via the prctl() API.

Thanks again, Frederic! I'll work to put together a new version of
the patch incorporating a selectable one-shot mode, plus the other
things mentioned in this email. I think I will still not add the
suggested "dynticks full enabled completion" thing for now, and just
add a big comment on the code that makes us call schedule(), unless folks
really agree it's a necessary thing to have there.
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

2016-04-08 13:56:36

by Frederic Weisbecker

Subject: Re: [PATCH v9 04/13] task_isolation: add initial support

On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> Frederic,
>
> Thanks for the detailed feedback on the task isolation stuff.
>
> This reply kind of turned into an essay, so I've added a little "TL;DR"
> sentence before each section.

I think I'm going to cut my reply into several threads, because I really
can't get myself to write one giant reply all at once :-)

>
>
> TL;DR: Let's make an explicit decision about whether task isolation
> should be "persistent" or "one-shot". Both have some advantages.
> =====
>
> An important high-level issue is how "sticky" task isolation mode is.
> We need to choose one of these two options:
>
> "Persistent mode": A task switches state to "task isolation" mode
> (kind of a level-triggered analogy) and stays there indefinitely. It
> can make a syscall, take a page fault, etc., if it wants to, but the
> kernel protects it from incurring any further asynchronous interrupts.
> This is the model I've been advocating for.

But then, in this mode, what happens when an interrupt triggers?

>
> "One-shot mode": A task requests isolation via prctl(), the kernel
> ensures it is isolated on return from the prctl(), but then as soon as
> it enters the kernel again, task isolation is switched off until
> another prctl is issued. This is what you recommended in your last
> email.

No, I think we can issue syscalls, for example. But asynchronous interruptions
such as exceptions (actually somewhat synchronous, but they can be unexpected) and
interrupts are what we want to avoid.

>
> There are a number of pros and cons to the two models. I think on
> balance I still like the "persistent mode" approach, but here's all
> the pros/cons I can think of:
>
> PRO for persistent mode: A somewhat easier programming model. Users
> can just imagine "task isolation" as a way for them to still be able
> to use the kernel exactly as they always have; it's just slower to get
> back out of the kernel so you use it judiciously. For example, a
> process is free to call write() on a socket to perform a diagnostic,
> but when returning from the write() syscall, the kernel will hold the
> task in kernel mode until any timer ticks (perhaps from networking
> stuff) are complete, and then let it return to userspace to continue
> in task isolation mode.

So this is not hard isolation anymore. This is rather soft isolation with
best efforts to avoid disturbance.

Surely we can have different levels of isolation.

I'm still wondering what to do if the task migrates to another CPU. In fact,
perhaps what you're trying to do is more of a CPU property than a process property?

> This is convenient to the user since they
> don't have to fret about re-enabling task isolation after that
> syscall, page fault, or whatever; they can just continue running.
> With your suggestion, the user pretty much has to leave STRICT mode
> enabled so he gets notified of any unexpected return to kernel space
> (in fact we might make it required so you always get a signal when
> leaving task isolation unless it's via a prctl or exit syscall).

Right. Although we can allow all syscalls in this mode actually.

>
> PRO for one-shot mode: A somewhat crisper interaction with
> sched_setaffinity() etc. With a persistent mode approach, a task can
> start up task isolation, then later another task can be placed on its
> cpu and break it (it won't return to userspace until killed or the new
> process affinitizes itself away or stops running). By contrast, in
> one-shot mode, any return to kernel spaces turns off task isolation
> anyway, so it's very clear what the interaction looks like. I suspect
> this is more a theoretical advantage to one-shot mode than a practical
> one, though.

I think I heard about workloads that need such strict hard isolation.
Workloads that really cannot afford any disturbance. They even
use a userspace network stack. Maybe HFT?

> CON for one-shot mode: It's actually hard to catch every kernel entry
> so we can turn the task-isolation flag off again - and we really do
> need to have a flag, just so that we can suitably debug any bad
> actions that bring us into the kernel when we're not expecting it.
> Right now there are things that bring us into the kernel that we don't
> bother annotating for task isolation STRICT mode, just because they're
> visible to the user anyway: e.g., a bus fault or segmentation
> violation.
>
> I think we can actually make both modes available to users with just
> another flag bit, so maybe we can look at what that looks like in v11:
> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> isolation at the next syscall entry, page fault, etc. Then we can
> think more specifically about whether we want to remove the flag or
> not, and if we remove it, whether we want to make the code that was
> controlled by it unconditionally true or unconditionally false
> (i.e. remove it again).

I think we shouldn't bother with strict hard isolation if we don't need
it yet. The implementation may well be invasive. Let's wait for someone
who really needs it.

>
>
> TL;DR: We should be more willing to return -EINVAL from prctl().
> =====
>
> One thing you've argued is that we should be more aggressive about
> failing the prctl() call. I think, in any case, that this is probably
> reasonable. We already check that the task's affinity is limited to
> the current core and that that core is a task_isolation cpu; I think we
> can also require that can_stop_full_tick() return true (or the moral
> equivalent given your recent patch series). This will mean you can't
> even try to go into task isolation mode if another task is
> schedulable, among other things, which seems like a good thing.
>
> However, it is important to note that the current task_isolation_ready
> and task_isolation_enter calls that are in the prepare_exit_to_userspace
> routine are still required even with your proposed one-shot mode. We
> have to be sure that no interrupts occur on the way back to userspace
> that might then in principle lead to timer interrupts being scheduled,
> and the way to do that is make sure task_isolation_ready returns true
> with interrupts disabled, and interrupts are not then re-enabled before
> return to userspace. Anything else is just keeping your fingers
> crossed and guessing.

So your requirements are actually hard isolation but in userspace?

And what happens if you get interrupted in userspace? What about page
faults and other exceptions?

Thanks.

2016-04-08 16:51:13

by Chris Metcalf

Subject: Re: [PATCH v9 04/13] task_isolation: add initial support

On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> > TL;DR: Let's make an explicit decision about whether task isolation
> > should be "persistent" or "one-shot". Both have some advantages.
> > =====
> >
> > An important high-level issue is how "sticky" task isolation mode is.
> > We need to choose one of these two options:
> >
> > "Persistent mode": A task switches state to "task isolation" mode
> > (kind of a level-triggered analogy) and stays there indefinitely. It
> > can make a syscall, take a page fault, etc., if it wants to, but the
> > kernel protects it from incurring any further asynchronous interrupts.
> > This is the model I've been advocating for.
>
> But then in this mode, what happens when an interrupt triggers.

So here I'm taking "interrupt" to mean an external, asynchronous
interrupt, from another core or device, or asynchronously triggered
on the local core, like a timer interrupt. By contrast I use "exception"
or "fault" to refer to synchronous, locally-triggered interruptions.

So for interrupts, the short answer is, it's a bug! :-)

An interrupt could be a kernel bug, in which case we consider it a
"true" bug. This could be a timer interrupt occurring even after the
task isolation code thought there were none pending, or a hardware
device that incorrectly distributes interrupts to a task-isolation
cpu, or a global IPI that should be sent to fewer cores, or a kernel
TLB flush that could be deferred until the task-isolation task
re-enters the kernel later, etc. Regardless, I'd consider it a kernel
bug. I'm sure there are more such bugs that we can continue to fix
going forward; it depends on how arbitrary you want to allow code
running on other cores to be. For example, can another core unload a
kernel module without interrupting a task-isolation task? Not right now.

Or, it could be an application bug: the standard example is if you
have an application with task-isolated cores that also does occasional
unmaps on another thread in the same process, on another core. This
causes TLB flush interrupts under application control. The
application shouldn't do this, and we tell our customers not to build
their applications this way. The typical way we encourage our
customers to arrange this kind of "multi-threading" is by having a
pure memory API between the task isolation threads and what are
typically "control" threads running on non-task-isolated cores. The
two types of threads just both mmap some common, shared memory but run
as different processes.

So what happens if an interrupt does occur?

In the "base" task isolation mode, you just take the interrupt, then
wait to quiesce any further kernel timer ticks, etc., and return to
the process. This at least limits the damage to being a single
interruption rather than potentially additional ones, if the interrupt
also caused timers to get queued, etc.

If you enable "strict" mode, we disable task isolation mode for that
core and deliver a signal to it. This lets the application know that
an interrupt occurred, and it can take whatever kind of logging or
debugging action it wants to, re-enable task isolation if it wants to
and continue, or just exit or abort, etc.
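
For instance, a strict-mode application that wants to survive the signal might do
something like this (just a sketch; it assumes the app selected SIGUSR1 via the
prctl() signal-selection interface, and "isolation_flags" stands for whatever flags
it originally passed):

#include <signal.h>
#include <unistd.h>
#include <sys/prctl.h>

static unsigned long isolation_flags;	/* whatever the app passed to prctl() */

static void isolation_lost(int sig)
{
	static const char msg[] = "task isolation lost\n";

	/* Isolation is already off by the time we get here, so making
	 * syscalls is fine. */
	write(2, msg, sizeof(msg) - 1);

	/* Re-enable task isolation and keep going. */
	prctl(PR_SET_TASK_ISOLATION, isolation_flags, 0, 0, 0);
}

	/* At startup: */
	signal(SIGUSR1, isolation_lost);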

If you don't enable "strict" mode, but you do have
task_isolation_debug enabled as a boot flag, you will at least get a
console dump with a backtrace and whatever other data we have.
(Sometimes the debug info actually includes a backtrace of the
interrupting core, if it's an IPI or TLB flush from another core,
which can be pretty useful.)

> > "One-shot mode": A task requests isolation via prctl(), the kernel
> > ensures it is isolated on return from the prctl(), but then as soon as
> > it enters the kernel again, task isolation is switched off until
> > another prctl is issued. This is what you recommended in your last
> > email.
>
> No I think we can issue syscalls for exemple. But asynchronous interruptions
> such as exceptions (actually somewhat synchronous but can be unexpected) and
> interrupts are what we want to avoid.

Hmm, so I think I'm not really understanding what you are suggesting.

We're certainly in agreement that avoiding interrupts and exceptions
is important. I'm arguing that the way to deal with them is to
generate appropriate signals/printks, etc. I'm not actually sure what
you're recommending we do to avoid exceptions. Since they're
synchronous and deterministic, we can't really avoid them if the
program wants to issue them. For example, mmap() some anonymous
memory and then start running, and you'll take exceptions each time
you touch a page in that mapped region. I'd argue it's an application
bug; one should enable "strict" mode to catch and deal with such bugs.

(Typically the recommendation is to do an mlockall() before starting
task isolation mode, to handle the case of page faults. But you can
do that and still be screwed by another thread in your process doing a
fork() and then your pages end up read-only for COW and you have to
fault them back in. But, that's an application bug for a
task-isolation thread, and should just be treated as such.)
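
In other words the recipe is simply (with <sys/mman.h>, <sys/prctl.h>, and <stdio.h>):

	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		perror("mlockall");
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0)
		perror("prctl(PR_SET_TASK_ISOLATION)");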

> > There are a number of pros and cons to the two models. I think on
> > balance I still like the "persistent mode" approach, but here's all
> > the pros/cons I can think of:
> >
> > PRO for persistent mode: A somewhat easier programming model. Users
> > can just imagine "task isolation" as a way for them to still be able
> > to use the kernel exactly as they always have; it's just slower to get
> > back out of the kernel so you use it judiciously. For example, a
> > process is free to call write() on a socket to perform a diagnostic,
> > but when returning from the write() syscall, the kernel will hold the
> > task in kernel mode until any timer ticks (perhaps from networking
> > stuff) are complete, and then let it return to userspace to continue
> > in task isolation mode.
>
> So this is not hard isolation anymore. This is rather soft isolation with
> best efforts to avoid disturbance.

No, it's still hard isolation. The distinction is that we offer a way
to get in and out of the kernel "safely" if you want to run in that
mode. The syscalls can take a long time if the syscall ends up
requiring some additional timer ticks to finish sorting out whatever
it was you asked the kernel to do, but once you're back in userspace
you immediately regain "hard" isolation. It's under program control.

Or, you can enable "strict" mode, and then you get hard isolation
without the ability to get in and out of the kernel at all: the kernel
just kills you if you try to leave hard isolation other than by an
explicit prctl().

> Surely we can have different levels of isolation.

Well, we have nohz_full now, and by adding task-isolation, we have
two. Or three if you count "base" and "strict" mode task isolation as
two separate levels.

> I'm still wondering what to do if the task migrates to another CPU. In fact,
> perhaps what you're trying to do is rather a CPU property than a
> process property?

Well, we did go around on this issue once already (last August) and at
the time you were encouraging isolation to be a "task" property, not a
"cpu" property:

https://lkml.kernel.org/r/20150812160020.GG21542@lerouge

You convinced me at the time :-)

You're right that migration conflicts with task isolation. But
certainly, if a task has enabled "strict" semantics, it can't migrate;
it will lose task isolation entirely and get a signal instead,
regardless of whether it calls sched_setaffinity() on itself, or if
someone else changes its affinity and it gets a kick.

However, if a task doesn't have strict mode enabled, it can call
sched_setaffinity() and force itself onto a non-task_isolation cpu and
it won't get any isolation until it schedules itself back onto a
task_isolation cpu, at which point it wakes up on the new cpu with
hard isolation still in effect. I can make up reasons why this sort
of thing might be useful, but it's probably a corner case.

However, this makes me wonder if "strict" mode should be the default
for task isolation?? That way task isolation really doesn't conflict
semantically with migration. And we could provide a "weak" mode, or a
"kernel-friendly" mode, or some such nomenclature, and define the
migration semantics just for that case, where it makes it clear it's a
bit unusual.

> I think I heard about workloads that need such strict hard isolation.
> Workloads that really can not afford any disturbance. They even
> use userspace network stack. Maybe HFT?

Certainly HFT is one case.

A lot of TILE-Gx customers using task isolation (which we call
"dataplane" or "Zero-Overhead Linux") are doing high-speed network
applications with user-space networking stacks. It can be DPDK, or it
can be another TCP/IP stack (we ship one called tStack) or it
could just be an application directly messing with the network
hardware from userspace. These are exactly the applications that led
me into this part of kernel development in the first place.
Googling "Zero-Overhead Linux" does take you to some discussions
of customers that have used this functionality.

> > I think we can actually make both modes available to users with just
> > another flag bit, so maybe we can look at what that looks like in v11:
> > adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> > isolation at the next syscall entry, page fault, etc. Then we can
> > think more specifically about whether we want to remove the flag or
> > not, and if we remove it, whether we want to make the code that was
> > controlled by it unconditionally true or unconditionally false
> > (i.e. remove it again).
>
> I think we shouldn't bother with strict hard isolation if we don't need
> it yet. The implementation may well be invasive. Lets wait for someone
> who really needs it.

I'm not sure what part of the patch series you're saying you don't
think we need yet. I'd argue the whole patch series is "hard
isolation", and that the "strict" mode introduced in patch 06/13 isn't
particularly invasive.

> So your requirements are actually hard isolation but in userspace?

Yes, exactly. Were you thinking about a kernel-level hard isolation?
That would have some similarities, I guess, but in some ways might
actually be a harder problem.

> And what happens if you get interrupted in userspace? What about page
> faults and other exceptions?

See above :-)

I hope we're converging here. If you want to talk live or chat online
to help finish converging, perhaps that would make sense? I'd be
happy to take notes and publish a summary of wherever we get to.

Thanks for taking the time to review this!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

2016-04-12 18:42:09

by Chris Metcalf

Subject: Re: [PATCH v9 04/13] task_isolation: add initial support

On 4/8/2016 12:34 PM, Chris Metcalf wrote:
> However, this makes me wonder if "strict" mode should be the default
> for task isolation?? That way task isolation really doesn't conflict
> semantically with migration. And we could provide a "weak" mode, or a
> "kernel-friendly" mode, or some such nomenclature, and define the
> migration semantics just for that case, where it makes it clear it's a
> bit unusual.

I noodled around with this and decided it was a better default,
so I've made the changes and pushed it up to the branch:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Now, by default when you enter task isolation mode, you are in
what I used to call "strict" mode, i.e. you can't use the kernel.

You can select a user-specified signal you want to deliver instead of
the default SIGKILL, and if you select signal 0, then you don't get
a signal at all and instead you get to keep running in task
isolation mode after making a syscall, page fault, etc.

Thus the API now looks like this in <linux/prctl.h>:

#define PR_SET_TASK_ISOLATION 48
#define PR_GET_TASK_ISOLATION 49
# define PR_TASK_ISOLATION_ENABLE (1 << 0)
# define PR_TASK_ISOLATION_USERSIG (1 << 1)
# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
# define PR_TASK_ISOLATION_NOSIG \
(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
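
So, for example, an application would do something like this (sketch):

	/* Get SIGUSR1 instead of the default SIGKILL on any kernel entry: */
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_USERSIG |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);

	/* Or: no signal at all, i.e. keep running isolated across
	 * syscalls, page faults, etc.: */
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG, 0, 0, 0);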

I think this better matches what people should want to do in
their applications, and also matches the expectations people
have about what it means to go into task isolation mode in the
first place.

I got rid of the ONESHOT mode that I added in the v12 series, since
it didn't seem like it was what Frederic had been asking for anyway,
and it didn't seem particularly useful on its own.

Frederic, how does this align with your intuition for this stuff?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

2016-04-22 13:17:06

by Frederic Weisbecker

Subject: Re: [PATCH v9 04/13] task_isolation: add initial support

On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >> TL;DR: Let's make an explicit decision about whether task isolation
> >> should be "persistent" or "one-shot". Both have some advantages.
> >> =====
> >>
> >> An important high-level issue is how "sticky" task isolation mode is.
> >> We need to choose one of these two options:
> >>
> >> "Persistent mode": A task switches state to "task isolation" mode
> >> (kind of a level-triggered analogy) and stays there indefinitely. It
> >> can make a syscall, take a page fault, etc., if it wants to, but the
> >> kernel protects it from incurring any further asynchronous interrupts.
> >> This is the model I've been advocating for.
> >
> >But then in this mode, what happens when an interrupt triggers.
>
> So here I'm taking "interrupt" to mean an external, asynchronous
> interrupt, from another core or device, or asynchronously triggered
> on the local core, like a timer interrupt. By contrast I use "exception"
> or "fault" to refer to synchronous, locally-triggered interruptions.

Ok.

> So for interrupts, the short answer is, it's a bug! :-)
>
> An interrupt could be a kernel bug, in which case we consider it a
> "true" bug. This could be a timer interrupt occurring even after the
> task isolation code thought there were none pending, or a hardware
> device that incorrectly distributes interrupts to a task-isolation
> cpu, or a global IPI that should be sent to fewer cores, or a kernel
> TLB flush that could be deferred until the task-isolation task
> re-enters the kernel later, etc. Regardless, I'd consider it a kernel
> bug. I'm sure there are more such bugs that we can continue to fix
> going forward; it depends on how arbitrary you want to allow code
> running on other cores to be. For example, can another core unload a
> kernel module without interrupting a task-isolation task? Not right now.
>
> Or, it could be an application bug: the standard example is if you
> have an application with task-isolated cores that also does occasional
> unmaps on another thread in the same process, on another core. This
> causes TLB flush interrupts under application control. The
> application shouldn't do this, and we tell our customers not to build
> their applications this way. The typical way we encourage our
> customers to arrange this kind of "multi-threading" is by having a
> pure memory API between the task isolation threads and what are
> typically "control" threads running on non-task-isolated cores. The
> two types of threads just both mmap some common, shared memory but run
> as different processes.
>
> So what happens if an interrupt does occur?
>
> In the "base" task isolation mode, you just take the interrupt, then
> wait to quiesce any further kernel timer ticks, etc., and return to
> the process. This at least limits the damage to being a single
> interruption rather than potentially additional ones, if the interrupt
> also caused timers to get queued, etc.

So if we take an interrupt that we didn't expect, we want to wait some more at
the end of that interrupt for things to quiesce some more?

That doesn't look right. Things should be quiesced once and for all on
return from the initial prctl() call. We can't even expect to quiesce more
in case of interruptions; the tick can't be forced off anyway.

>
> If you enable "strict" mode, we disable task isolation mode for that
> core and deliver a signal to it. This lets the application know that
> an interrupt occurred, and it can take whatever kind of logging or
> debugging action it wants to, re-enable task isolation if it wants to
> and continue, or just exit or abort, etc.

That sounds sensible.

>
> If you don't enable "strict" mode, but you do have
> task_isolation_debug enabled as a boot flag, you will at least get a
> console dump with a backtrace and whatever other data we have.
> (Sometimes the debug info actually includes a backtrace of the
> interrupting core, if it's an IPI or TLB flush from another core,
> which can be pretty useful.)

Ok.

>
> >> "One-shot mode": A task requests isolation via prctl(), the kernel
> >> ensures it is isolated on return from the prctl(), but then as soon as
> >> it enters the kernel again, task isolation is switched off until
> >> another prctl is issued. This is what you recommended in your last
> >> email.
> >
> >No I think we can issue syscalls for exemple. But asynchronous interruptions
> >such as exceptions (actually somewhat synchronous but can be unexpected) and
> >interrupts are what we want to avoid.
>
> Hmm, so I think I'm not really understanding what you are suggesting.
>
> We're certainly in agreement that avoiding interrupts and exceptions
> is important. I'm arguing that the way to deal with them is to
> generate appropriate signals/printks, etc. I'm not actually sure what
> you're recommending we do to avoid exceptions. Since they're
> synchronous and deterministic, we can't really avoid them if the
> program wants to issue them. For example, mmap() some anonymous
> memory and then start running, and you'll take exceptions each time
> you touch a page in that mapped region. I'd argue it's an application
> bug; one should enable "strict" mode to catch and deal with such bugs.

Ok, that looks right.

>
> (Typically the recommendation is to do an mlockall() before starting
> task isolation mode, to handle the case of page faults. But you can
> do that and still be screwed by another thread in your process doing a
> fork() and then your pages end up read-only for COW and you have to
> fault them back in. But, that's an application bug for a
> task-isolation thread, and should just be treated as such.)

Ok.

>
> >> There are a number of pros and cons to the two models. I think on
> >> balance I still like the "persistent mode" approach, but here's all
> >> the pros/cons I can think of:
> >>
> >> PRO for persistent mode: A somewhat easier programming model. Users
> >> can just imagine "task isolation" as a way for them to still be able
> >> to use the kernel exactly as they always have; it's just slower to get
> >> back out of the kernel so you use it judiciously. For example, a
> >> process is free to call write() on a socket to perform a diagnostic,
> >> but when returning from the write() syscall, the kernel will hold the
> >> task in kernel mode until any timer ticks (perhaps from networking
> >> stuff) are complete, and then let it return to userspace to continue
> >> in task isolation mode.
> >
> >So this is not hard isolation anymore. This is rather soft isolation with
> >best efforts to avoid disturbance.
>
> No, it's still hard isolation. The distinction is that we offer a way
> to get in and out of the kernel "safely" if you want to run in that
> mode. The syscalls can take a long time if the syscall ends up
> requiring some additional timer ticks to finish sorting out whatever
> it was you asked the kernel to do, but once you're back in userspace
> you immediately regain "hard" isolation. It's under program control.

Yeah, indeed, the task should be allowed to perform syscalls. So we can assume
that interrupts are fine when they fire in kernel mode.

>
> Or, you can enable "strict" mode, and then you get hard isolation
> without the ability to get in and out of the kernel at all: the kernel
> just kills you if you try to leave hard isolation other than by an
> explicit prctl().

That would be extreme strict mode, yeah. We can still add such a mode later
if any user requests it.

Thanks.

(I'll reply to the rest of the email soonish.)

2016-04-25 20:52:52

by Chris Metcalf

Subject: Re: [PATCH v9 04/13] task_isolation: add initial support

On 4/22/2016 9:16 AM, Frederic Weisbecker wrote:
> On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
>> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
>>> On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>>>> TL;DR: Let's make an explicit decision about whether task isolation
>>>> should be "persistent" or "one-shot". Both have some advantages.
>>>> =====
>>>>
>>>> An important high-level issue is how "sticky" task isolation mode is.
>>>> We need to choose one of these two options:
>>>>
>>>> "Persistent mode": A task switches state to "task isolation" mode
>>>> (kind of a level-triggered analogy) and stays there indefinitely. It
>>>> can make a syscall, take a page fault, etc., if it wants to, but the
>>>> kernel protects it from incurring any further asynchronous interrupts.
>>>> This is the model I've been advocating for.
>>> But then in this mode, what happens when an interrupt triggers.
>> So here I'm taking "interrupt" to mean an external, asynchronous
>> interrupt, from another core or device, or asynchronously triggered
>> on the local core, like a timer interrupt. By contrast I use "exception"
>> or "fault" to refer to synchronous, locally-triggered interruptions.
> Ok.
>
>> So for interrupts, the short answer is, it's a bug! :-)
>>
>> An interrupt could be a kernel bug, in which case we consider it a
>> "true" bug. This could be a timer interrupt occurring even after the
>> task isolation code thought there were none pending, or a hardware
>> device that incorrectly distributes interrupts to a task-isolation
>> cpu, or a global IPI that should be sent to fewer cores, or a kernel
>> TLB flush that could be deferred until the task-isolation task
>> re-enters the kernel later, etc. Regardless, I'd consider it a kernel
>> bug. I'm sure there are more such bugs that we can continue to fix
>> going forward; it depends on how arbitrary you want to allow code
>> running on other cores to be. For example, can another core unload a
>> kernel module without interrupting a task-isolation task? Not right now.
>>
>> Or, it could be an application bug: the standard example is if you
>> have an application with task-isolated cores that also does occasional
>> unmaps on another thread in the same process, on another core. This
>> causes TLB flush interrupts under application control. The
>> application shouldn't do this, and we tell our customers not to build
>> their applications this way. The typical way we encourage our
>> customers to arrange this kind of "multi-threading" is by having a
>> pure memory API between the task isolation threads and what are
>> typically "control" threads running on non-task-isolated cores. The
>> two types of threads just both mmap some common, shared memory but run
>> as different processes.
>>
>> So what happens if an interrupt does occur?
>>
>> In the "base" task isolation mode, you just take the interrupt, then
>> wait to quiesce any further kernel timer ticks, etc., and return to
>> the process. This at least limits the damage to being a single
>> interruption rather than potentially additional ones, if the interrupt
>> also caused timers to get queued, etc.
> So if we take an interrupt that we didn't expect, we want to wait some more
> in the end of that interrupt to wait for things to quiesce some more?

I think it's actually pretty plausible.

Consider the "application bug" case, where you're running some code that does
packet dispatch to different cores. If a core seems to back up you stop
dispatching packets to it.

Now, we get a TLB flush. If handling the flush causes us to restart the tick
(maybe just as a side effect of entering the kernel in the first place) we
really are better off staying in the kernel until the tick is handled and
things are quiesced again. That way, although we may end up dropping a
bunch of packets that were queued up to that core, we only do so ONCE - we
don't do it again when the tick fires a little bit later on, when the core
has already caught up and is claiming to be able to handle packets again.

Also, pragmatically, we would require a whole bunch of machinery in the
kernel to figure out whether we were returning from a syscall, an exception,
or an interrupt, and only skip the task-isolation work for interrupts. We
don't actually have that information available to us at the point where we are
returning to userspace, so we'd need to add that tracking state
in each platform's code somehow.


> That doesn't look right. Things should be quiesced once and for all on
> return from the initial prctl() call. We can't even expect to quiesce more
> in case of interruptions, the tick can't be forced off anyway.

Yes, things are quiesced once and for all after prctl(). We also need to
be prepared to handle unexpected interrupts, though. It's true that we can't
force the tick off, but as I suggested above, just waiting for the tick may
well be a better strategy than subjecting the application to another interrupt
after some fraction of a second.

>> Or, you can enable "strict" mode, and then you get hard isolation
>> without the ability to get in and out of the kernel at all: the kernel
>> just kills you if you try to leave hard isolation other than by an
>> explicit prctl().
> That would be extreme strict mode yeah. We can still add such mode later
> if any user request it.

So, humorously, I have become totally convinced that "extreme strict mode"
is really the right default for isolation. It gives semantics that are easily
understandable: you stay in userspace until you do a prctl() to turn off
the flag, or exit(), or else the kernel kills you. And, it's probably what
people want by default anyway for userspace driver code. For code that
legitimately wants to make syscalls in this mode, you can just prctl() the
mode off, do whatever you need to do, then prctl() the mode back on again.
It's nominally a bit of overhead, but as a task-isolated application you
should be expecting tons of overhead from going into the kernel anyway.
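
In code, that pattern is just this (log_fd/buf/len being whatever the app wants to
log, and passing no flags to turn isolation off):

	prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);	/* drop isolation */
	write(log_fd, buf, len);			/* any syscalls we like */
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0);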

The "less extreme strict mode" is arguably reasonable if you want to allow
people to make occasional syscalls, but it has confusing performance
characteristics (sometimes the syscalls happen quickly, but sometimes they
take multiple ticks while we wait for interrupts to quiesce), and it has
confusing semantics (what happens if a third party re-affinitizes you to
a non-isolated core). So I like the idea of just having a separate flag
(PR_TASK_ISOLATION_NOSIG) that tells the kernel to let the user play in
the kernel without getting killed.

> (I'll reply the rest of the email soonish)

Thanks for the feedback. It makes me feel like we may get there eventually :-)

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com