2020-09-01 10:48:48

by Frederic Weisbecker

Subject: Requirements to control kernel isolation/nohz_full at runtime

Hi,

I'm currently working on making nohz_full/nohz_idle toggleable at runtime,
and some other people seem to be interested as well. So I've dumped
a few thoughts about the prerequisites to achieve that, for those
interested.

As you can see, there is a bit of hard work ahead. I'm iterating
on this at https://pad.kernel.org/p/isolation, feel free to edit:


== RCU nocb ==

Currently controllable with the "rcu_nocbs=" boot parameter and/or through nohz_full=/isolcpus=nohz.
We need to make it toggleable at runtime. Currently handling that:
v1: https://lwn.net/Articles/820544/
v2: coming soon

== TIF_NOHZ ==

Need to get rid of that in order not to trigger the syscall slowpath on CPUs that don't want nohz_full.
Also we don't want to iterate all threads and clear the flag when the last nohz_full CPU exits nohz_full
mode. Prefer static keys to gate the context tracking calls on each arch; x86 does that well.
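
For illustration, a minimal sketch of that static-key scheme (the names
below follow the kernel's context tracking code, but treat this as a
sketch rather than the final form):

/* Patched-out branch: near-zero cost on CPUs/kernels without nohz_full. */
DEFINE_STATIC_KEY_FALSE(context_tracking_key);

static __always_inline bool context_tracking_enabled(void)
{
	return static_branch_unlikely(&context_tracking_key);
}

/* Called by arch entry code on kernel entry from userspace: the call is
 * now conditional on per-CPU/global state, not on a per-task TIF flag. */
static __always_inline void user_exit_irqoff(void)
{
	if (context_tracking_enabled())
		__context_tracking_exit(CONTEXT_USER);
}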

== Proper entry code ==

We must make sure that a given arch never calls exception_enter() / exception_exit().
These save the previous context tracking state and temporarily switch to kernel mode (from the
context tracking POV). Since this state is saved on the stack, it prevents us from turning off
context tracking entirely on a CPU: the tracking must be done on all CPUs, and that takes some cycles.

This means that, considering early entry code (before the call to context tracking upon kernel entry,
and after the call to context tracking upon kernel exit), we must take care of a few things:

1) Make sure early entry code can't trigger exceptions. Or if it does, the given exception must not schedule
or use RCU (unless it calls rcu_nmi_enter()). Otherwise the exception would have to call
exception_enter()/exception_exit(), which we don't want.

2) No call to schedule_user().

3) Make sure early entry code is not interruptible, or preempt_schedule_irq() would rely on
exception_enter()/exception_exit().

4) Make sure early entry code can't be traced (no call to preempt_schedule_notrace()), or, if it is traced,
that it can't schedule.

I believe x86 does most of that well. In the end we should remove the exception_enter()/exit()
implementations in x86 and replace them with a check that makes sure the context tracking state is not
in USER. An arch meeting all the above conditions would earn CONFIG_ARCH_HAS_SANE_CONTEXT_TRACKING.
Being able to toggle nohz_full at runtime would depend on that.
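
For illustration, the replacement check could be as simple as the sketch
below (ct_state() and CONTEXT_USER exist in the context tracking code;
the helper name is made up):

/* Sketch: assert that an exception never runs in early entry code while
 * context tracking still believes we are in userspace. */
static __always_inline void assert_not_in_user_context(void)
{
	if (IS_ENABLED(CONFIG_CONTEXT_TRACKING))
		WARN_ON_ONCE(ct_state() == CONTEXT_USER);
}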


== Cputime accounting ==

Both the write side and the read side must switch to tick-based accounting and drop the use of seqlock in
task_cputime(), task_gtime(), kcpustat_field() and kcpustat_cpu_fetch(). Special ordering/state machinery
is required to do that without races.
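
For reference, the read side in question is the classic seqcount retry
loop, roughly as in today's task_cputime() (a simplified sketch, not the
exact code):

/* Simplified: retry until a consistent snapshot is read. The plan above
 * is to drop this loop once accounting is tick based again. */
void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
{
	struct vtime *vtime = &t->vtime;
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&vtime->seqcount);
		*utime = t->utime;
		*stime = t->stime;
		/* ... add in the still-accumulating vtime delta ... */
	} while (read_seqcount_retry(&vtime->seqcount, seq));
}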

== Nohz ==

Switch from nohz_full to nohz_idle. Mind a few details:

1) Turn off the 1Hz offloaded tick handled by housekeeping CPUs
2) Handle tick dependencies and take care of racing CPUs setting/clearing a tick dependency. It's much
trickier when we switch from nohz_idle to nohz_full

== Unbound affinity ==

Restore the wide affinity of kernel threads, workqueues, timers, etc. But take care of cpumasks that have
been set through other interfaces: sysfs, procfs, etc.


2020-09-03 18:27:57

by Marcelo Tosatti

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> Hi,

Hi Frederic,

Thanks for the summary! Looking forward to your comments...

[...]
> == RCU nocb ==
>
> Currently controllable with "rcu_nocbs=" boot parameter and/or through nohz_full=/isolcpus=nohz
> We need to make it toggleable at runtime. Currently handling that:
> v1: https://lwn.net/Articles/820544/
> v2: coming soon

Nice.

[...]
>
> == Unbound affinity ==
>
> Restore kernel threads, workqueue, timers, etc... wide affinity. But take care of cpumasks that have been set through other
> interfaces: sysfs, procfs, etc...

We were looking at a userspace interface: what a proper one would look
like (unified, similar to the isolcpus= interface) and how to implement it.

The simplest idea for an interface seemed to be exposing the integer list of
CPUs and isolation flags to userspace (probably via sysfs).

The scheme would allow flags to be separately enabled/disabled,
with not all flags being necessarily toggleable (we could, for example,
disallow nohz_full toggling until it is implemented, but allow the
other isolation features to be toggled).

This would require per-flag housekeeping masks (instead of a single one).

Back to the userspace interface: you mentioned earlier that cpusets
were a possibility for it. However:

"Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extends these two mechanisms as follows:"

The isolation flags do not necessarily have anything to do with
tasks, but with CPUs: a given feature is disabled or enabled on a
given CPU.
No?

---

Regarding locking of the masks: since housekeeping_cpumask() can be called
from hot paths (e.g. get_nohz_timer_target()), RCU seems a natural
fit, so userspace would:

1) use interface to change cpumask for a given feature:

-> set_rcu_pointer
-> wait for grace period

2) proceed to trigger actions that rely on housekeeping_cpumask(),
to validate that the cpumask from 1) is being used (a sketch of the
scheme follows).
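
Something like this, as a minimal sketch (rcu_assign_pointer(),
synchronize_rcu() and rcu_dereference() are the real RCU primitives;
hk_masks[], hk_update_lock, HK_FLAG_MAX and the function names are made
up for illustration):

static DEFINE_MUTEX(hk_update_lock);
static struct cpumask __rcu *hk_masks[HK_FLAG_MAX];	/* one mask per flag */

int housekeeping_update_mask(int flag, const struct cpumask *new)
{
	struct cpumask *copy, *old;

	copy = kmalloc(cpumask_size(), GFP_KERNEL);
	if (!copy)
		return -ENOMEM;
	cpumask_copy(copy, new);

	mutex_lock(&hk_update_lock);
	old = rcu_dereference_protected(hk_masks[flag],
					lockdep_is_held(&hk_update_lock));
	rcu_assign_pointer(hk_masks[flag], copy);	/* step 1) */
	mutex_unlock(&hk_update_lock);

	synchronize_rcu();	/* wait out readers of the old mask */
	kfree(old);

	/* step 2): kick timers/workqueues/kthreads to re-read the mask */
	return 0;
}

/* Hot-path read side (e.g. get_nohz_timer_target()), to be called
 * within an RCU read-side critical section. */
static inline const struct cpumask *housekeeping_cpumask_of(int flag)
{
	return rcu_dereference(hk_masks[flag]);
}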

---

Regarding nohz_full=, a way to get an immediate implementation
(without handling the issues you mention above) would be to boot
with a set of CPUs marked "nohz_full toggleable" and others not. For
the toggleable ones, you'd introduce a per-CPU tick dependency that
is enabled/disabled at runtime. Probably better to avoid this one
if possible... (a sketch follows anyway).
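
For what it's worth, a sketch of that stopgap
(tick_nohz_dep_set_cpu()/tick_nohz_dep_clear_cpu() are existing
interfaces; the dependency bit and the helpers are hypothetical,
assuming TICK_DEP_BIT_RUNTIME_TOGGLE were added to enum tick_dep_bits):

static void cpu_nohz_full_disable(int cpu)
{
	/* Force the tick back on: the CPU behaves like plain nohz_idle. */
	tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_RUNTIME_TOGGLE);
}

static void cpu_nohz_full_enable(int cpu)
{
	/* Drop the dependency so the tick can stop again in userspace. */
	tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_RUNTIME_TOGGLE);
}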


2020-09-03 18:32:21

by Marcelo Tosatti

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Thu, Sep 03, 2020 at 03:23:59PM -0300, Marcelo Tosatti wrote:
[...]
>
> The isolation flags do not necessarily have anything to do with
> tasks, but with CPUs: a given feature is disabled or enabled on a
> given CPU.
> No?

One cpumask per feature, implemented separately in sysfs, also
seems OK (modulo documentation about the RCU update and users
of the previous versions).

This is what is being done for rcu_nocbs= already...

2020-09-03 18:37:53

by Phil Auld

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Thu, Sep 03, 2020 at 03:30:15PM -0300 Marcelo Tosatti wrote:
> On Thu, Sep 03, 2020 at 03:23:59PM -0300, Marcelo Tosatti wrote:
[...]
>
> One cpumask per feature, implemented separately in sysfs, also
> seems OK (modulo documentation about the RCU update and users
> of the previous versions).
>
> This is what is being done for rcu_nocbs= already...
>

Exclusive cpusets are used now to control scheduler load balancing on
a group of CPUs. It seems to me that this is the same idea, and it is part
of the isolation concept. Having a toggle for each subsystem/feature in
cpusets could provide the needed userspace API.

Under the covers it might be implemented as twiddling various cpumasks.

We need to be shifting to managing load balancing with cpusets anyway.



Cheers,
Phil


2020-09-03 20:47:13

by Marcelo Tosatti

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Thu, Sep 03, 2020 at 02:36:36PM -0400, Phil Auld wrote:
[...]
>
> exclusive cpusets are used now to control scheduler load balancing on
> a group of cpus. It seems to me that this is the same idea and is part
> of the isolation concept. Having a toggle for each subsystem/feature in
> cpusets could provide the needed userspace api.
>
> Under the covers it might be implemented as twiddling various cpumasks.
>
> We need to be shifting to managing load balancing with cpusets anyway.

OK, adding a new file per isolation feature:

- cpuset.isolation_nohz_full
- cpuset.isolation_kthread
- cpuset.isolation_time

With a bool value per file, that is an option.
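
A sketch of how such files might be declared (struct cftype and its u64
handlers are the existing cgroup interface; the handler functions and
the ISOL_* constants are hypothetical):

static struct cftype isolation_files[] = {
	{
		.name = "isolation_nohz_full",
		.read_u64 = isolation_read_u64,
		.write_u64 = isolation_write_u64,
		.private = ISOL_NOHZ_FULL,
	},
	{
		.name = "isolation_kthread",
		.read_u64 = isolation_read_u64,
		.write_u64 = isolation_write_u64,
		.private = ISOL_KTHREAD,
	},
	{
		.name = "isolation_time",
		.read_u64 = isolation_read_u64,
		.write_u64 = isolation_write_u64,
		.private = ISOL_TIME,
	},
	{ }	/* terminate */
};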

2020-09-04 20:51:08

by Paul E. McKenney

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
[...]
>
> == RCU nocb ==
>
> Currently controllable with "rcu_nocbs=" boot parameter and/or through nohz_full=/isolcpus=nohz
> We need to make it toggleable at runtime. Currently handling that:
> v1: https://lwn.net/Articles/820544/
> v2: coming soon

Looking forward to seeing it!

> == TIF_NOHZ ==
>
> Need to get rid of that in order not to trigger syscall slowpath on CPUs that don't want nohz_full.
> Also we don't want to iterate all threads and clear the flag when the last nohz_full CPU exits nohz_full
> mode. Prefer static keys to call context tracking on archs. x86 does that well.

Would it help if RCU was able to, on a per-CPU basis, distinguish between
nohz_full userspace execution on the one hand and idle-loop execution
on the other? Or do you have some other trick in mind?

Thanx, Paul


2020-09-07 15:39:42

by Peter Zijlstra

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime


(your mailer broke and forgot to keep lines shorter than 78 chars)

On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:

> == TIF_NOHZ ==
>
> Need to get rid of that in order not to trigger syscall slowpath on
> CPUs that don't want nohz_full. Also we don't want to iterate all
> threads and clear the flag when the last nohz_full CPU exits nohz_full
> mode. Prefer static keys to call context tracking on archs. x86 does
> that well.

Build on the common entry code I suppose. Then any arch that uses that
gets to have the new features.

> == Proper entry code ==
>
> We must make sure that a given arch never calls exception_enter() /
> exception_exit(). This saves the previous state of context tracking
> and switch to kernel mode (from context tracking POV) temporarily.
> Since this state is saved on the stack, this prevents us from turning
> off context tracking entirely on a CPU: The tracking must be done on
> all CPUs and that takes some cycles.
>
> This means that, considering early entry code (before the call to
> context tracking upon kernel entry, and after the call to context
> tracking upon kernel exit), we must take care of few things:
>
> 1) Make sure early entry code can't trigger exceptions. Or if it does,
> the given exception can't schedule or use RCU (unless it calls
> rcu_nmi_enter()). Otherwise the exception must call
> exception_enter()/exception_exit() which we don't want.

I think this is true for x86. Early entry has interrupts disabled, any
exception that can still happen is NMI-like and will thus use
rcu_nmi_enter().

On x86 that now includes #DB (which is also excluded due to us refusing
to set execution breakpoints on entry code), #BP, NMI and MCE.

> 2) No call to schedule_user().

I'm not sure what that is supposed to do, but x86 doesn't appear to have
it, so all good :-)

> 3) Make sure early entry code is not interruptible or
> preempt_schedule_irq() would rely on
> exception_enter()/exception_exit()

This is so for x86.

> 4) Make sure early entry code can't be traced (no call to
> preempt_schedule_notrace()), or if it does it can't schedule

noinstr is your friend.

> I believe x86 does most of that well.

It does now.

2020-09-09 22:35:36

by Frederic Weisbecker

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Thu, Sep 03, 2020 at 03:23:59PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> > == Unbound affinity ==
> >
> > Restore kernel threads, workqueue, timers, etc... wide affinity. But take care of cpumasks that have been set through other
> > interfaces: sysfs, procfs, etc...
>
> We were looking at a userspace interface: what would be a proper
> (unified, similar to isolcpus= interface) and its implementation:
>
> The simplest idea for interface seemed to be exposing the integer list of
> CPUs and isolation flags to userspace (probably via sysfs).
>
> The scheme would allow flags to be separately enabled/disabled,
> with not all flags being necessary toggable (could for example
> disallow nohz_full= toggling until it is implemented, but allow for
> other isolation features to be toggable).
>
> This would require per flag housekeeping_masks (instead of a single).

Right, I think cpusets provide exactly that.

> Back to the userspace interface, you mentioned earlier that cpusets
> was a possibility for it. However:
>
> "Cpusets provide a Linux kernel mechanism to constrain which CPUs and
> Memory Nodes are used by a process or set of processes.
>
> The Linux kernel already has a pair of mechanisms to specify on which
> CPUs a task may be scheduled (sched_setaffinity) and on which Memory
> Nodes it may obtain memory (mbind, set_mempolicy).
>
> Cpusets extends these two mechanisms as follows:"
>
> The isolation flags do not necessarily have anything to do with
> tasks, but with CPUs: a given feature is disabled or enabled on a
> given CPU.
> No?

When cpusets are set as exclusive, they become strict CPU properties.
I think we'll need to enforce the exclusive property to set the isolated
flags.

Then you're free to move the tasks you like into any isolated cpusets.

> Regarding locking of the masks, since housekeeping_masks can be called
> from hot paths (eg: get_nohz_timer_target) it seems RCU is a natural
> fit, so userspace would:
>
> 1) use interface to change cpumask for a given feature:
>
> -> set_rcu_pointer
> -> wait for grace period

Yep, could be a solution.

> 2) proceed to trigger actions that rely on housekeeping_cpumask,
> to validate the cpumask at 1) is being used.

Exactly. I guess we can simply call directly into the subsystems (timers,
workqueues, kthreads, ...) from the isolation code upon a cpumask update.
This way we avoid the ordering surprises that would come with a notifier.

> Regarding nohz_full=, a way to get an immediate implementation
> (without handling the issues you mention above) would be to boot
> with a set of CPUs as "nohz_full toggable" and others not. For
> the nohz_full toggable ones, you'd introduce a per-CPU tick
> dependency that is enabled/disabled on runtime. Probably better
> to avoid this one if possible...

Right, but you would still have all the overhead that comes with nohz_full
(kernel entry/exit tracking, RCU userspace extended grace periods, offloaded RCU
callbacks, vtime accounting, ...). It will become really interesting once we
can switch all that overhead off.

Thanks.

2020-09-09 22:40:10

by Frederic Weisbecker

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Thu, Sep 03, 2020 at 03:52:00PM -0300, Marcelo Tosatti wrote:
> On Thu, Sep 03, 2020 at 02:36:36PM -0400, Phil Auld wrote:
> > exclusive cpusets are used now to control scheduler load balancing on
> > a group of cpus. It seems to me that this is the same idea and is part
> > of the isolation concept. Having a toggle for each subsystem/feature in
> > cpusets could provide the needed userspace api.
> >
> > Under the covers it might be implemented as twiddling various cpumasks.
> >
> > We need to be shifting to managing load balancing with cpusets anyway.
>
> OK, adding a new file per isolation feature:
>
> - cpuset.isolation_nohz_full
> - cpuset.isolation_kthread
> - cpuset.isolation_time
>
> With a bool value per file, is an option.

Exactly. I would merge kthread/timers/workqueue into
cpuset.isolation.unbound though. Unless anyone may need more
granularity there?

2020-09-10 02:35:23

by Frederic Weisbecker

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Fri, Sep 04, 2020 at 01:47:40PM -0700, Paul E. McKenney wrote:
> On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
[...]
> >
> > == RCU nocb ==
> >
> > Currently controllable with "rcu_nocbs=" boot parameter and/or through nohz_full=/isolcpus=nohz
> > We need to make it toggleable at runtime. Currently handling that:
> > v1: https://lwn.net/Articles/820544/
> > v2: coming soon
>
> Looking forward to seeing it!

So many ordering riddles I had to put on paper. But I'm getting close to
something RFC-postable now.

>
> > == TIF_NOHZ ==
> >
> > Need to get rid of that in order not to trigger syscall slowpath on CPUs that don't want nohz_full.
> > Also we don't want to iterate all threads and clear the flag when the last nohz_full CPU exits nohz_full
> > mode. Prefer static keys to call context tracking on archs. x86 does that well.
>
> Would it help if RCU was able to, on a per-CPU basis, distinguish between
> nohz_full userspace execution on the one hand and idle-loop execution
> on the other? Or do you have some other trick in mind?

No, it's more about context tracking. Initially it used TIF_NOHZ to enter
the syscall slow path and call context tracking on kernel entry and exit.

The problem is that it forces all CPUs, including housekeepers, into
that syscall slowpath. So we'd rather have the context tracking call conditional
on a per-CPU basis and not on a per-task basis. And static keys are good for
that. That's what x86 does.

So RCU can't help much, I fear (but hey, it's the first time I can say that! ;-))

Thanks.

2020-09-10 02:45:51

by Frederic Weisbecker

Subject: Re: Requirements to control kernel isolation/nohz_full at runtime

On Mon, Sep 07, 2020 at 05:34:17PM +0200, Peter Zijlstra wrote:
>
> (your mailer broke and forgot to keep lines shorter than 78 chars)

I manually reordered the lines and that's indeed quite a mess :o)

>
> On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
>
> > == TIF_NOHZ ==
> >
> > Need to get rid of that in order not to trigger syscall slowpath on
> > CPUs that don't want nohz_full. Also we don't want to iterate all
> > threads and clear the flag when the last nohz_full CPU exits nohz_full
> > mode. Prefer static keys to call context tracking on archs. x86 does
> > that well.
>
> Build on the common entry code I suppose. Then any arch that uses that
> gets to have the new features.

Yep, eventually I hope we can put all these crucial pieces into the common
entry code.

>
> > == Proper entry code ==
> >
> > We must make sure that a given arch never calls exception_enter() /
> > exception_exit(). This saves the previous state of context tracking
> > and switch to kernel mode (from context tracking POV) temporarily.
> > Since this state is saved on the stack, this prevents us from turning
> > off context tracking entirely on a CPU: The tracking must be done on
> > all CPUs and that takes some cycles.
> >
> > This means that, considering early entry code (before the call to
> > context tracking upon kernel entry, and after the call to context
> > tracking upon kernel exit), we must take care of few things:
> >
> > 1) Make sure early entry code can't trigger exceptions. Or if it does,
> > the given exception can't schedule or use RCU (unless it calls
> > rcu_nmi_enter()). Otherwise the exception must call
> > exception_enter()/exception_exit() which we don't want.
>
> I think this is true for x86. Early entry has interrupts disabled, any
> exception that can still happen is NMI-like and will thus use
> rcu_nmi_enter().
>
> On x86 that now includes #DB (which is also excluded due to us refusing
> to set execution breakpoints on entry code), #BP, NMI and MCE.

Perfect! That's what I assumed as well.

>
> > 2) No call to schedule_user().
>
> I'm not sure what that is supposed to do, but x86 doesn't appear to have
> it, so all good :-)

I think it was there in case an exception would schedule after context tracking
has switched to user mode but before we actually exit the kernel. But we removed
that (Andy, probably) when we made sure the early entry code was not interruptible.
Some other archs still use it; I'm just not sure if they do it for a good reason...

>
> > 3) Make sure early entry code is not interruptible or
> > preempt_schedule_irq() would rely on
> > exception_enter()/exception_exit()
>
> This is so for x86.

Perfect!

>
> > 4) Make sure early entry code can't be traced (no call to
> > preempt_schedule_notrace()), or if it does it can't schedule
>
> noinstr is your friend.

Right. My fear was rather about special regions that temporarily
enable instrumentation (instrumentation_begin()...instrumentation_end()),
but those should only happen with interrupts disabled in entry code,
where preempt_schedule_notrace() has no effect.
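
For reference, the pattern in question looks roughly like this (noinstr
and instrumentation_begin()/instrumentation_end() are the real
annotations; the function and the traced call are made-up examples):

/* Early entry: no instrumentation allowed by default. */
noinstr void arch_early_entry_example(void)
{
	/* ... non-instrumentable work, IRQs disabled ... */

	instrumentation_begin();
	/* Tracing is allowed in here, but IRQs are still disabled in
	 * entry code, so preempt_schedule_notrace() can't reschedule. */
	trace_example_event();
	instrumentation_end();
}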

>
> > I believe x86 does most of that well.
>
> It does now.

Thanks a lot for confirming! I guess I can remove
exception_enter()/exit() on x86. Fortunately any issue
will be very easily spotted.

Thanks!