2022-03-16 06:02:37

by Marcelo Tosatti

Subject: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

The logic to disable vmstat worker thread, when entering
nohz full, does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
the mlock, vmstat shepherd can restart vmstat worker thread on
the CPU in question.

To fix this, add task isolation prctl interface to quiesce
deferred actions when returning to userspace.

The patchset is based on ideas and code from the
task isolation patchset from Alex Belits:
https://lwn.net/Articles/816298/

Please refer to Documentation/userspace-api/task_isolation.rst
(patch 1) for details. It's attached at the end of this message
in .txt format as well.

Note: the prctl interface is independent of nohz_full=.

The userspace patches can be found at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/

- qemu-task-isolation.patch: activate task isolation from CPU execution loop
- rt-tests-task-isolation.patch: add task isolation activation to cyclictest/oslat
- util-linux-chisol.patch: add chisol tool to util-linux.

---

v12
- Set TIF_TASK_ISOL only when necessary (Frederic Weisbecker)
- Switch from raw_cpu_read to __this_cpu_read (Frederic Weisbecker)
- Add missing kvm-entry.h change (Frederic Weisbecker)

v11
- Add TIF_TASK_ISOL bit to thread info flags and use it
to decide whether to perform task isolation work on
return to userspace (Frederic Weisbecker)
- Fold patch to add task_isol_exit hooks into
"sync vmstats on return to userspace" patch. (Frederic Weisbecker)
- Fix typo on prctl_task_isol_cfg_get declaration (Oscar Shiang)
- Add preempt notifiers

v10
- no changes (resending series without changelog corruption).

v9
- Clarify that inheritance is propagated to all descendants (Frederic Weisbecker)
- Fix inheritance of one-shot mode across exec/fork (Frederic Weisbecker)
- Unify naming on "task_isol_..." (Frederic Weisbecker)
- Introduce CONFIG_TASK_ISOLATION (Frederic Weisbecker)

v8
- Document the possibility for ISOL_F_QUIESCE_ONE,
to configure individual features (Frederic Weisbecker).
- Fix PR_ISOL_CFG_GET typo in documentation (Frederic Weisbecker).
- Rebased against linux-2.6.git.

v7
- no changes (resending series without changelog corruption).

v6
- Move oneshot mode enablement to configuration time (Frederic Weisbecker).
- Allow more extensions to CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
- Update docs and samples regarding oneshot mode (Frederic Weisbecker).
- Update docs and samples regarding more extensibility of
CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
- prctl_task_isolation_activate_get should copy active_mask
to address in arg2.
- modify exit_to_user_mode_loop to cover exceptions
and interrupts.
- split exit hooks into its own patch

v5
- Add changelogs to individual patches (Peter Zijlstra).
- Add documentation to patchset intro (Peter Zijlstra).

v4:
- Switch to structures for parameters when possible
(which are more extensible).
- Switch to CFG_{S,G}ET naming and use drop
"internal configuration" prctls (Frederic Weisbecker).
- Add summary of terms to documentation (Frederic Weisbecker).
- Examples for compute and one-shot modes (Thomas G/Christoph L).

v3:

- Split in smaller patches (Nitesh Lal).
- Misc cleanups (Nitesh Lal).
- Clarify nohz_full is not a dependency (Nicolas Saenz).
- Fix incorrect values for prctl definitions (kernel robot).
- Save configured state, so applications
can activate externally configured
task isolation parameters.
- Remove "system default" notion (chisol should
make it obsolete).
- Update documentation: add new section with explanation
about configuration/activation and code example.
- Update samples.
- Report configuration/activation state at
/proc/pid/task_isolation.
- Condense dirty information of per-CPU vmstats counters
in a bool.
- In-kernel KVM support.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).

- Avoid potential regressions by allowing applications
to use ISOL_F_QUIESCE_DEFMASK (whose default value
is configurable in /sys/). (Nitesh Lal / Nicolas Saenz).

v11 can be found at:

https://lore.kernel.org/all/[email protected]/

---

Documentation/userspace-api/task_isolation.rst | 379 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
arch/s390/include/asm/thread_info.h | 2
arch/x86/include/asm/thread_info.h | 2
fs/proc/base.c | 68 ++++++++++
include/linux/entry-common.h | 2
include/linux/entry-kvm.h | 2
include/linux/sched.h | 5
include/linux/task_isolation.h | 136 ++++++++++++++++++++
include/linux/vmstat.h | 25 +++
include/uapi/linux/prctl.h | 47 +++++++
init/Kconfig | 16 ++
init/init_task.c | 3
kernel/Makefile | 2
kernel/entry/common.c | 4
kernel/entry/kvm.c | 4
kernel/exit.c | 2
kernel/fork.c | 23 +++
kernel/sys.c | 16 ++
kernel/task_isolation.c | 424 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/vmstat.c | 173 +++++++++++++++++++++-----
samples/Kconfig | 7 +
samples/Makefile | 1
samples/task_isolation/Makefile | 11 +
samples/task_isolation/task_isol.c | 92 ++++++++++++++
samples/task_isolation/task_isol.h | 9 +
samples/task_isolation/task_isol_computation.c | 89 +++++++++++++
samples/task_isolation/task_isol_oneshot.c | 104 +++++++++++++++
samples/task_isolation/task_isol_userloop.c | 54 ++++++++
28 files changed, 1664 insertions(+), 38 deletions(-)

---

Task isolation prctl interface
******************************

Certain types of applications benefit from running uninterrupted by
background OS activities. Realtime systems and high-bandwidth
networking applications with user-space drivers can fall into this
category.

To create an OS noise free environment for the application, this
interface allows userspace to inform the kernel of the start and end
of the latency sensitive application section (with configurable system
behaviour for that section).

Note: the prctl interface is independent of nohz_full=.

The prctl options are:

* PR_ISOL_FEAT_GET: Retrieve supported features.

* PR_ISOL_CFG_GET: Retrieve task isolation configuration.

* PR_ISOL_CFG_SET: Set task isolation configuration.

* PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.

* PR_ISOL_ACTIVATE_SET: Set task isolation activation state.

Summary of terms:

* feature:

A distinct attribute or aspect of task isolation. Examples of
features could be logging, new operating modes (e.g. syscalls
disallowed), userspace notifications, etc. The only feature
currently available is quiescing.

* configuration:

A specific choice from a given set of possible choices that
dictate how the particular feature in question should behave.

* activation state:

The activation state (whether active/inactive) of the task
isolation features (features must be configured before being
activated).

Inheritance of the isolation parameters and state, across fork(2) and
clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.

At a high-level, task isolation is divided in two steps:

1. Configuration.

2. Activation.

Section "Userspace support" describes how to use task isolation.

In terms of the interface, the sequence of steps to activate task
isolation are:

1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).

2. Configure task isolation features
(PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).

3. Activate or deactivate task isolation features
(PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).

This interface is based on ideas and code from the task isolation
patchset from Alex Belits: https://lwn.net/Articles/816298/

Note: if the need arises to configure an individual quiesce feature
with its own extensible structure, please add ISOL_F_QUIESCE_ONE to
PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on multiple
features per syscall currently).


Feature description
===================

* "ISOL_F_QUIESCE"

This feature allows quiescing selected kernel activities on return
from system calls.


Interface description
=====================

**PR_ISOL_FEAT_GET**:

Returns the supported features and feature capabilities, as a
bitmask:

prctl(PR_ISOL_FEAT_GET, feat, arg3, arg4, arg5);

The 'feat' argument specifies whether to return supported features
(if zero), or feature capabilities (if not zero). Possible values
for 'feat' are:

* "0":

Return the bitmask of supported features, in the location
pointed to by "(int *)arg3". The buffer should allow space
for 8 bytes.

* "ISOL_F_QUIESCE":

Return a structure containing which kernel activities are
supported for quiescing, in the location pointed to by "(int
*)arg3":

struct task_isol_quiesce_extensions {
__u64 flags;
__u64 supported_quiesce_bits;
__u64 pad[6];
};

Where:

*flags*: Additional flags (should be zero).

*supported_quiesce_bits*: Bitmask indicating
which features are supported for quiescing.

*pad*: Additional space for future enhancements.

Features and their capabilities are defined in
include/uapi/linux/task_isolation.h.
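
As a sketch of the query sequence described above (hedged: this interface is
not in mainline kernels, so the PR_ISOL_FEAT_GET prctl number below is a
placeholder, and the helper name query_isol_features is invented for
illustration; the struct layout is copied from the description):

```c
#include <sys/prctl.h>
#include <linux/types.h>

/* Placeholder value: the real number comes from the patchset's uapi headers. */
#ifndef PR_ISOL_FEAT_GET
#define PR_ISOL_FEAT_GET	62
#endif
#define ISOL_F_QUIESCE		(1ULL << 0)	/* placeholder bit value */

/* Layout as documented above: 8 * sizeof(__u64) == 64 bytes. */
struct task_isol_quiesce_extensions {
	__u64 flags;
	__u64 supported_quiesce_bits;
	__u64 pad[6];
};

static int query_isol_features(__u64 *fmask,
			       struct task_isol_quiesce_extensions *ext)
{
	/* feat == 0: return the bitmask of supported features in *fmask */
	if (prctl(PR_ISOL_FEAT_GET, 0, fmask, 0, 0) == -1)
		return -1;	/* likely a kernel without this patchset */

	/* feat == ISOL_F_QUIESCE: which kernel activities can be quiesced */
	if (*fmask & ISOL_F_QUIESCE)
		return prctl(PR_ISOL_FEAT_GET, ISOL_F_QUIESCE, ext, 0, 0);
	return 0;
}
```

On a kernel without the patchset, the first prctl fails with EINVAL, which
is why the helper treats -1 as "feature unsupported".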

**PR_ISOL_CFG_GET**:

Retrieve task isolation configuration. The general format is:

prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);

The 'what' argument specifies what to configure. Possible values
are:

* "I_CFG_FEAT":

Return configuration of task isolation features. The 'arg3'
argument specifies whether to return configured features (if
zero), or individual feature configuration (if not zero), as
follows.

* "0":

Return the bitmask of configured features, in the
location pointed to by "(int *)arg4". The buffer
should allow space for 8 bytes.

* "ISOL_F_QUIESCE":

If arg4 is QUIESCE_CONTROL, return the control structure
for quiescing of background kernel activities, in the
location pointed to by "(int *)arg5":

struct task_isol_quiesce_control {
__u64 flags;
__u64 quiesce_mask;
__u64 quiesce_oneshot_mask;
__u64 pad[5];
};

See PR_ISOL_CFG_SET description for meaning of fields.

* "I_CFG_INHERIT":

Retrieve inheritance configuration across fork/clone.

Return the structure which configures inheritance across
fork/clone, in the location pointed to by "(int *)arg4":

struct task_isol_inherit_control {
__u8 inherit_mask;
__u8 pad[7];
};

See PR_ISOL_CFG_SET description for meaning of fields.
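
Hedged sketch of reading the configuration back (the prctl number and the
I_CFG_FEAT/QUIESCE_CONTROL values are placeholders, since this interface is
not in mainline; the struct layout is taken from the description above):

```c
#include <sys/prctl.h>
#include <linux/types.h>

/* Placeholder values: the real ones come from the patchset's uapi headers. */
#define PR_ISOL_CFG_GET		63
#define I_CFG_FEAT		1
#define ISOL_F_QUIESCE		(1ULL << 0)
#define QUIESCE_CONTROL		1

/* Layout as documented above: 8 * sizeof(__u64) == 64 bytes. */
struct task_isol_quiesce_control {
	__u64 flags;
	__u64 quiesce_mask;
	__u64 quiesce_oneshot_mask;
	__u64 pad[5];
};

/* Read the configured-feature bitmask, then the quiesce control struct. */
static int read_isol_config(__u64 *fmask,
			    struct task_isol_quiesce_control *qctrl)
{
	/* arg3 == 0: bitmask of configured features, returned via arg4 */
	if (prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, fmask, 0) == -1)
		return -1;

	/* arg3 == ISOL_F_QUIESCE, arg4 == QUIESCE_CONTROL: struct via arg5 */
	if (*fmask & ISOL_F_QUIESCE)
		return prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, ISOL_F_QUIESCE,
			     QUIESCE_CONTROL, qctrl);
	return 0;
}
```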

**PR_ISOL_CFG_SET**:

Set task isolation configuration. The general format is:

prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);

The 'what' argument specifies what to configure. Possible values
are:

* "I_CFG_FEAT":

Set configuration of task isolation features. 'arg3' specifies
the feature. Possible values are:

* "ISOL_F_QUIESCE":

If arg4 is QUIESCE_CONTROL, set the control structure for
quiescing of background kernel activities, from the
location pointed to by "(int *)arg5":

struct task_isol_quiesce_control {
__u64 flags;
__u64 quiesce_mask;
__u64 quiesce_oneshot_mask;
__u64 pad[5];
};

Where:

*flags*: Additional flags (should be zero).

*quiesce_mask*: A bitmask containing which kernel
activities to quiesce.

*quiesce_oneshot_mask*: A bitmask indicating which kernel
activities should behave in oneshot mode, that is,
quiescing will happen on return from
prctl(PR_ISOL_ACTIVATE_SET), but not on return from
subsequent system calls. The corresponding bit(s) must
also be set in quiesce_mask.

*pad*: Additional space for future enhancements.

For quiesce_mask (and quiesce_oneshot_mask), possible bit
sets are:

* "ISOL_F_QUIESCE_VMSTATS"

VM statistics are maintained in per-CPU counters to
improve performance. When a CPU modifies a VM statistic,
this modification is kept in the per-CPU counter. Certain
activities require a global count, which involves
requesting each CPU to flush its local counters to the
global VM counters.

This flush is implemented via a workqueue item, which
might schedule work on isolated CPUs.

To avoid this interruption, task isolation can be
configured to synchronize the per-CPU counters to the
global counters on return from system calls.

* "I_CFG_INHERIT":

Set inheritance configuration when a new task is created via
fork and clone.

The "(int *)arg4" argument is a pointer to:

struct task_isol_inherit_control {
__u8 inherit_mask;
__u8 pad[7];
};

inherit_mask is a bitmask that specifies which part of task
isolation should be inherited:

* Bit ISOL_INHERIT_CONF: Inherit task isolation
configuration. This is the state written via
prctl(PR_ISOL_CFG_SET, ...).

* Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
(requires ISOL_INHERIT_CONF to be set). The new task should
behave, after fork/clone, in the same manner as the parent
task after it executed:

prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);

Note: the inheritance propagates to all the descendants and
not just the immediate children, unless the inheritance is
explicitly reconfigured by some children.
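
The two PR_ISOL_CFG_SET uses above can be sketched as follows (hedged: the
prctl number and the I_CFG_*/QUIESCE_CONTROL/ISOL_* values are placeholders,
since this interface is not in mainline; the struct layouts and argument
order are taken from the descriptions above):

```c
#include <string.h>
#include <sys/prctl.h>
#include <linux/types.h>

/* Placeholder values: the real ones come from the patchset's uapi headers. */
#define PR_ISOL_CFG_SET		64
#define I_CFG_FEAT		1
#define I_CFG_INHERIT		2
#define ISOL_F_QUIESCE		(1ULL << 0)
#define QUIESCE_CONTROL		1
#define ISOL_F_QUIESCE_VMSTATS	(1ULL << 0)
#define ISOL_INHERIT_CONF	(1 << 0)
#define ISOL_INHERIT_ACTIVE	(1 << 1)

/* Layouts as documented above: 64 bytes and 8 bytes respectively. */
struct task_isol_quiesce_control {
	__u64 flags;
	__u64 quiesce_mask;
	__u64 quiesce_oneshot_mask;
	__u64 pad[5];
};

struct task_isol_inherit_control {
	__u8 inherit_mask;
	__u8 pad[7];
};

/* Configure quiescing of vmstat sync on every syscall return. */
static int configure_vmstat_quiesce(void)
{
	struct task_isol_quiesce_control qctrl;

	memset(&qctrl, 0, sizeof(qctrl));
	qctrl.flags = 0;				/* must be zero */
	qctrl.quiesce_mask = ISOL_F_QUIESCE_VMSTATS;	/* sync vmstats */
	qctrl.quiesce_oneshot_mask = 0;			/* not oneshot */

	return prctl(PR_ISOL_CFG_SET, I_CFG_FEAT, ISOL_F_QUIESCE,
		     QUIESCE_CONTROL, &qctrl);
}

/* Make children inherit both configuration and activation state. */
static int configure_inheritance(void)
{
	struct task_isol_inherit_control ictrl;

	memset(&ictrl, 0, sizeof(ictrl));
	/* ISOL_INHERIT_ACTIVE requires ISOL_INHERIT_CONF as well */
	ictrl.inherit_mask = ISOL_INHERIT_CONF | ISOL_INHERIT_ACTIVE;

	return prctl(PR_ISOL_CFG_SET, I_CFG_INHERIT, &ictrl, 0, 0);
}
```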

**PR_ISOL_ACTIVATE_GET**:

Retrieve task isolation activation state.

The general format is:

prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);

'pmask' specifies the location of a feature mask, to which the
current active mask will be copied. See PR_ISOL_ACTIVATE_SET for a
description of the individual bits.

**PR_ISOL_ACTIVATE_SET**:

Set task isolation activation state (activates/deactivates task
isolation).

The general format is:

prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);

The 'pmask' argument specifies the location of an 8 byte mask
containing which features should be activated. Features whose bits
are cleared will be deactivated. The possible bits for this mask
are:

* "ISOL_F_QUIESCE":

Activate quiescing of background kernel activities. Quiescing
happens on return to userspace from this system call, and on
return from subsequent system calls (unless quiesce_oneshot_mask
has been set at PR_ISOL_CFG_SET time).

Quiescing can be adjusted (while active) by
prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).


Userspace support
*****************

Task isolation is divided in two main steps: configuration and
activation.

Each step can be performed by an external tool or the latency
sensitive application itself. util-linux contains the "chisol" tool
for this purpose.

This results in three combinations:

1. Both configuration and activation performed by the latency
sensitive application. Allows fine grained control of what task
isolation features are enabled and when (see samples section below).

2. Only activation is performed by the latency sensitive app (and
configuration performed by chisol). This allows the admin/user to
control task isolation parameters, and applications need to be
modified only once.

3. Configuration and activation performed by an external tool. This
allows unmodified applications to take advantage of task isolation.
Activation is performed by the "-a" option of chisol.


Examples
********

The "samples/task_isolation/" directory contains 3 examples:

* task_isol_userloop.c:

Example of a program with a busy loop in userspace.

* task_isol_computation.c:

Example of a program that enters task isolated mode, performs an
amount of computation, exits task isolated mode, and writes the
result of the computation to disk.

* task_isol_oneshot.c:

Example of program that enables one-shot mode for quiescing,
enters a processing loop, then upon an external event performs a
number of syscalls to handle that event.

This is a snippet of code to activate task isolation if it has been
previously configured (by chisol for example):

#include <stdio.h>
#include <sys/prctl.h>
#include <linux/types.h>

#ifdef PR_ISOL_CFG_GET
	unsigned long long fmask;
	int ret;

	/* retrieve the bitmask of configured features */
	ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
	if (ret != -1 && fmask != 0) {
		/* activate every configured feature */
		ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
		if (ret == -1) {
			perror("prctl PR_ISOL_ACTIVATE_SET");
			return ret;
		}
	}
#endif






2022-03-17 18:12:33

by Frederic Weisbecker

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Tue, Mar 15, 2022 at 12:31:32PM -0300, Marcelo Tosatti wrote:
> The logic to disable vmstat worker thread, when entering
> nohz full, does not cover all scenarios. For example, it is possible
> for the following to happen:
>
> 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> 2) app runs mlock, which increases counters for mlock'ed pages.
> 3) start -RT loop
>
> Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> the mlock, vmstat shepherd can restart vmstat worker thread on
> the CPU in question.
>
> To fix this, add task isolation prctl interface to quiesce
> deferred actions when returning to userspace.
>
> The patchset is based on ideas and code from the
> task isolation patchset from Alex Belits:
> https://lwn.net/Articles/816298/
>
> Please refer to Documentation/userspace-api/task_isolation.rst
> (patch 1) for details. Its attached at the end of this message
> in .txt format as well.
>
> Note: the prctl interface is independent of nohz_full=.
>
> The userspace patches can be found at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
>
> - qemu-task-isolation.patch: activate task isolation from CPU execution loop
> - rt-tests-task-isolation.patch: add task isolation activation to cyclictest/oslat
> - util-linux-chisol.patch: add chisol tool to util-linux.

I still see a few details to sort out but overall the whole thing looks good:

Acked-by: Frederic Weisbecker <[email protected]>

Perhaps it's time to apply this patchset on some branch and iterate from there.

Thomas, Peter, what do you think?

Thanks!

2022-04-25 20:47:33

by Marcelo Tosatti

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Thu, Mar 17, 2022 at 04:08:04PM +0100, Frederic Weisbecker wrote:
> On Tue, Mar 15, 2022 at 12:31:32PM -0300, Marcelo Tosatti wrote:
> > The logic to disable vmstat worker thread, when entering
> > nohz full, does not cover all scenarios. For example, it is possible
> > for the following to happen:
> >
> > 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> > 2) app runs mlock, which increases counters for mlock'ed pages.
> > 3) start -RT loop
> >
> > Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> > the mlock, vmstat shepherd can restart vmstat worker thread on
> > the CPU in question.
> >
> > To fix this, add task isolation prctl interface to quiesce
> > deferred actions when returning to userspace.
> >
> > The patchset is based on ideas and code from the
> > task isolation patchset from Alex Belits:
> > https://lwn.net/Articles/816298/
> >
> > Please refer to Documentation/userspace-api/task_isolation.rst
> > (patch 1) for details. Its attached at the end of this message
> > in .txt format as well.
> >
> > Note: the prctl interface is independent of nohz_full=.
> >
> > The userspace patches can be found at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
> >
> > - qemu-task-isolation.patch: activate task isolation from CPU execution loop
> > - rt-tests-task-isolation.patch: add task isolation activation to cyclictest/oslat
> > - util-linux-chisol.patch: add chisol tool to util-linux.
>
> I still see a few details to sort out but overall the whole thing looks good:
>
> Acked-by: Frederic Weisbecker <[email protected]>
>
> Perhaps it's time to apply this patchset on some branch and iterate from there.
>
> Thomas, Peter, what do you think?
>
> Thanks!

Ping?

2022-04-25 22:25:57

by Thomas Gleixner

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Mon, Apr 25 2022 at 13:29, Marcelo Tosatti wrote:
> On Thu, Mar 17, 2022 at 04:08:04PM +0100, Frederic Weisbecker wrote:
>>
>> I still see a few details to sort out but overall the whole thing looks good:

From a cursory inspection, there are more than a few details to sort out.

>> Acked-by: Frederic Weisbecker <[email protected]>
>>
>> Perhaps it's time to apply this patchset on some branch and iterate from there.
>>
>> Thomas, Peter, what do you think?
>>
>> Thanks!
>
> Ping ?

This does not apply against 5.18-rc1, which was released on April
3rd. Oh, well. You are really new to kernel development, right?

Don't bother to resend before I finished reviewing the pile.

Thanks,

tglx

2022-04-27 11:37:24

by Christoph Lameter

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

Ok, I have actually started an opensource project that may make use of the
oneshot interface. This is a bridging tool between two RDMA protocols
called ib2roce. See https://gentwo.org/christoph/2022-bridging-rdma.pdf

The relevant code can be found at
https://github.com/clameter/rdma-core/tree/ib2roce/ib2roce. In
particular look at the ib2roce.c source code. This is still
under development.

The ib2roce bridging can run in a busy loop mode (-k option) where it spins
on ibv_poll_cq(), which is an RDMA call to handle incoming packets without
kernel interaction. See busyloop() in ib2roce.c.

Currently I have configured the system to use CONFIG_NOHZ_FULL. With that
I am able to reliably forward packets at a rate that saturates 100G
Ethernet / EDR Infiniband from a single spinning thread.

Without CONFIG_NOHZ_FULL, any slight disturbance causes the forwarding to
fall behind, which leads to dramatic packet loss, since we are looking
here at a potential data rate of 12.5 Gbyte/sec, i.e. about 12.5 Mbyte per
msec. If the kernel interrupts the forwarding for, say, 10 msecs, then we
fall behind by 125 MB, which would have to be buffered and processed by
additional code. That complexity makes packet processing much slower,
which could cause the forwarding to slow down so that recovery is not
possible should the data continue to arrive at line rate.

Isolation of the threads was done through the following kernel parameters:

nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off
numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt

And systemd was configured with the following affinites:

system.conf:CPUAffinity=0-7,16-23

This means that the second socket will be generally free of tasks and
kernel threads.

The NUMA configuration:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 94798 MB
node 0 free: 92000 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 96765 MB
node 1 free: 96082 MB

node distances:
node 0 1
0: 10 21
1: 21 10


I could modify busyloop() in ib2roce.c to use the oneshot mode via the
prctl provided by this patch instead of NOHZ_FULL.

What kind of metric could I use to show the difference in idleness,
i.e. the quality of the CPU isolation?

The ib2roce tool already has a CLI mode where one can monitor the
latencies that the busyloop experiences. See the latency calculations in
busyloop() and the CLI command "core". Stats can be reset via the "zap"
command.

I can see the usefulness of the oneshot mode but (I am very very sorry) I
still think that this patchset overdoes what is needed, and I fail to
understand what the point of inheritance, per-syscall quiescing, etc. is.
Those cause needless overhead in syscall handling and increase the
complexity of managing a busy loop. Special handling when the scheduler
switches a task? If tasks that require low latency and no disturbance are
being switched, then something went very very wrong with the system
configuration, and the only thing I would suggest is to issue some
kernel warning that this is not the way one should configure the system.

2022-05-04 10:29:08

by Marcelo Tosatti

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Wed, Apr 27, 2022 at 11:19:02AM +0200, Christoph Lameter wrote:
> Ok I actually have started an opensource project that may make use of the
> onshot interface. This is a bridging tool between two RDMA protocols
> called ib2roce. See https://gentwo.org/christoph/2022-bridging-rdma.pdf
>
> The relevant code can be found at
> https://github.com/clameter/rdma-core/tree/ib2roce/ib2roce. In
> particular look at the ib2roce.c source code. This is still
> under development.
>
> The ib2roce briding can run in a busy loop mode (-k option) where it spins
> on ibv_poll_cq() which is an RDMA call to handle incoming packets without
> kernel interaction. See busyloop() in ib2roce.c
>
> Currently I have configured the system to use CONFIG_NOHZ_FULL. With that
> I am able to reliably forward packets at a rate that saturates 100G
> Ethernet / EDR Infiniband from a single spinning thread.
>
> Without CONFIG_NOHZ_FULL any slight disturbance causes the forwarding to
> fall behind which will lead to dramatic packet loss since we are looking
> here at a potential data rate of 12.5Gbyte/sec and about 12.5Mbyte per
> msec. If the kernel interrupts the forwarding by say 10 msecs then we are
> falling behind by 125MB which would have to be buffered and processing by
> additional codes. That complexity makes it processing packets much slower
> which could cause the forwarding to slow down so that a recovery is not
> possible should the data continue to arrive at line rate.

Right.

> Isolation of the threads was done through the following kernel parameters:
>
> nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off
> numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt
>
> And systemd was configured with the following affinites:
>
> system.conf:CPUAffinity=0-7,16-23
>
> This means that the second socket will be generally free of tasks and
> kernel threads.
>
> The NUMA configuration:
>
> $ numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 94798 MB
> node 0 free: 92000 MB
> node 1 cpus: 8 9 10 11 12 13 14 15
> node 1 size: 96765 MB
> node 1 free: 96082 MB
>
> node distances:
> node 0 1
> 0: 10 21
> 1: 21 10
>
>
> I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl
> provided by this patch instead of the NOHZ_FULL.
>
> What kind of metric could I be using to show the difference in idleness of
> the quality of the cpu isolation?

Interruption length and frequencies:

-------|xxxxx|---------------|xxx|---------
5us 3us

which is what should be reported by oslat ?

>
> The ib2roce tool already has a CLI mode where one can monitor the
> latencies that the busyloop experiences. See the latency calculations in
> busyloop() and the CLI command "core". Stats can be reset via the "zap"
> command.
>
> I can see the usefulness of the oneshot mode but (I am very very sorry)

It's in there...

> I
> still think that this patchset overdoes what is needed and I fail to
> understand what the point of inheritance, per syscall quiescint etc is.

Inheritance is an attempt to support unmodified binaries like so:

1) configure task isolation parameters (eg sync per-CPU vmstat to global
stats on system call returns).
2) enable inheritance (so that task isolation configuration and
activation states are copied across to child processes).
3) enable task isolation.
4) execv(binary, params)
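
The steps above could be sketched as a small wrapper (hedged: the prctl
numbers and ISOL_*/I_CFG_* values are placeholders, since this interface is
not in mainline kernels, and run_isolated is a hypothetical helper name;
struct layouts follow the documentation):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/types.h>

/* Placeholder values: the real ones come from the patchset's uapi headers. */
#define PR_ISOL_CFG_SET		64
#define PR_ISOL_ACTIVATE_SET	66
#define I_CFG_FEAT		1
#define I_CFG_INHERIT		2
#define ISOL_F_QUIESCE		(1ULL << 0)
#define QUIESCE_CONTROL		1
#define ISOL_F_QUIESCE_VMSTATS	(1ULL << 0)
#define ISOL_INHERIT_CONF	(1 << 0)
#define ISOL_INHERIT_ACTIVE	(1 << 1)

struct task_isol_quiesce_control {
	__u64 flags;
	__u64 quiesce_mask;
	__u64 quiesce_oneshot_mask;
	__u64 pad[5];
};

struct task_isol_inherit_control {
	__u8 inherit_mask;
	__u8 pad[7];
};

/* Steps 1-4: configure, enable inheritance, activate, exec the binary. */
static int run_isolated(char *const argv[])
{
	struct task_isol_quiesce_control qctrl = {
		.quiesce_mask = ISOL_F_QUIESCE_VMSTATS,	/* step 1 */
	};
	struct task_isol_inherit_control ictrl = {
		.inherit_mask = ISOL_INHERIT_CONF | ISOL_INHERIT_ACTIVE, /* step 2 */
	};
	__u64 amask = ISOL_F_QUIESCE;

	if (prctl(PR_ISOL_CFG_SET, I_CFG_FEAT, ISOL_F_QUIESCE,
		  QUIESCE_CONTROL, &qctrl) == -1)
		return -1;
	if (prctl(PR_ISOL_CFG_SET, I_CFG_INHERIT, &ictrl, 0, 0) == -1)
		return -1;
	if (prctl(PR_ISOL_ACTIVATE_SET, &amask, 0, 0, 0) == -1)	/* step 3 */
		return -1;

	execvp(argv[0], argv);			/* step 4: inherits the state */
	perror("execvp");
	return -1;
}
```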

Per-syscall quiescing? Not sure what you mean here.

> Those cause needless overhead in syscall handling and increase the
> complexity of managing a busyloop.

Inheritance seems like a useful feature to us, isn't it? (To be able to
configure and activate task isolation for unmodified binaries.)

> Special handling when the scheduler
> switches a task? If tasks are being switched that requires them to be low
> latency and undisturbed then something went very very wrong with the
> system configuration and the only thing I would suggest is to issue some
> kernel warning that this is not the way one should configure the system.

Trying to provide mechanisms, not policy?

Or from another POV: if the user desires, we can display the warning.


2022-05-04 14:21:20

by Marcelo Tosatti

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Mon, Apr 25, 2022 at 11:12:27PM +0200, Thomas Gleixner wrote:
> On Mon, Apr 25 2022 at 13:29, Marcelo Tosatti wrote:
> > On Thu, Mar 17, 2022 at 04:08:04PM +0100, Frederic Weisbecker wrote:
> >>
> >> I still see a few details to sort out but overall the whole thing looks good:
>
> From a cursory inspection, there are more than a few details to sort out.
>
> >> Acked-by: Frederic Weisbecker <[email protected]>
> >>
> >> Perhaps it's time to apply this patchset on some branch and iterate from there.
> >>
> >> Thomas, Peter, what do you think?
> >>
> >> Thanks!
> >
> > Ping ?
>
> This does not apply against 5.18-rc1, which was released on April
> 3rd. Oh, well. You are really new to kernel development, right?

I can resend.

> Don't bother to resend before I finished reviewing the pile.

Sure, thanks!!!


2022-05-04 16:17:42

by Thomas Gleixner

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Tue, May 03 2022 at 15:57, Marcelo Tosatti wrote:
> On Wed, Apr 27, 2022 at 11:19:02AM +0200, Christoph Lameter wrote:
>> I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl
>> provided by this patch instead of the NOHZ_FULL.
>>
>> What kind of metric could I be using to show the difference in idleness of
>> the quality of the cpu isolation?
>
> Interruption length and frequencies:
>
> -------|xxxxx|---------------|xxx|---------
> 5us 3us
>
> which is what should be reported by oslat ?

How is oslat helpful there? That's running artificial workload benchmarks
which do not necessarily represent the actual
idle->interrupt->idle... timing sequence of the real world usecase.

> Inheritance is an attempt to support unmodified binaries like so:
>
> 1) configure task isolation parameters (eg sync per-CPU vmstat to global
> stats on system call returns).
> 2) enable inheritance (so that task isolation configuration and
> activation states are copied across to child processes).
> 3) enable task isolation.
> 4) execv(binary, params)

What for? If an application has isolation requirements, then the
specific requirements are part of the application design and not of some
arbitrary wrapper. Can we please focus on the initial problem of
providing a sensible isolation mechanism with well defined semantics?

Inheritance is an orthogonal problem and there is no reason to have this
initially.

>> Special handling when the scheduler
>> switches a task? If tasks are being switched that requires them to be low
>> latency and undisturbed then something went very very wrong with the
>> system configuration and the only thing I would suggest is to issue some
>> kernel warning that this is not the way one should configure the system.
>
> Trying to provide mechanisms, not policy?

This preemption notifier is not a mechanism, it's simply mindless
hackery as I told you already.

Thanks,

tglx

2022-05-04 21:48:17

by Marcelo Tosatti

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Wed, May 04, 2022 at 03:20:03PM +0200, Thomas Gleixner wrote:
> On Tue, May 03 2022 at 15:57, Marcelo Tosatti wrote:
> > On Wed, Apr 27, 2022 at 11:19:02AM +0200, Christoph Lameter wrote:
> >> I could modify busyloop() in ib2roce.c to use the oneshot mode via prctl
> >> provided by this patch instead of the NOHZ_FULL.
> >>
> >> What kind of metric could I be using to show the difference in idleness of
> >> the quality of the cpu isolation?
> >
> > Interruption length and frequencies:
> >
> > -------|xxxxx|---------------|xxx|---------
> > 5us 3us
> >
> > which is what should be reported by oslat ?
>
> How is oslat helpful there? That's running artifical workload benchmarks
> which are not necessarily representing the actual
> idle->interrupt->idle... timing sequence of the real world usecase.

Ok, so what is happening today in production in some telco installations
(considering the virtualized RAN usecase) is this:

1) Basic testing: Verify the hardware, software and its configuration
(cpu isolation parameters etc) are able to achieve the desired maximum
interruption length/frequencies through a synthetic benchmark like
cyclictest and oslat, for a duration that is considered sufficient.

One might also use the actual application in a synthetic configuration
(for example FlexRAN with synthetic data).

2) From the above, assume the real-world usecase is able to achieve the
desired maximum interruption length/frequency.

Of course, that is sub-optimal. The rt-trace-bcc.py script instruments
certain functions in the kernel (say the smp_call_function family),
allowing one to check whether certain interruptions have happened
in production. Example:

$ sudo ./rt-trace-bcc.py -c 36-39
[There can be some warnings dumped; we can ignore them]
Enabled hook point: process_one_work
Enabled hook point: __queue_work
Enabled hook point: __queue_delayed_work
Enabled hook point: generic_exec_single
Enabled hook point: smp_call_function_many_cond
Enabled hook point: irq_work_queue
Enabled hook point: irq_work_queue_on
TIME(s) COMM CPU PID MSG
0.009599293 rcuc/8 8 75 irq_work_queue_on (target=36, func=nohz_full_kick_func)
0.009603039 rcuc/8 8 75 irq_work_queue_on (target=37, func=nohz_full_kick_func)
0.009604047 rcuc/8 8 75 irq_work_queue_on (target=38, func=nohz_full_kick_func)
0.009604848 rcuc/8 8 75 irq_work_queue_on (target=39, func=nohz_full_kick_func)
0.103600589 rcuc/8 8 75 irq_work_queue_on (target=36, func=nohz_full_kick_func)
...

Currently it does not record the length of each interruption, but it
could (or you can do that from its output).

Note however that the sum of the interruptions is not the entire
overhead caused by the interruptions: there might also be cachelines
thrown away, which are only going to be "counted" when the latency
sensitive app executes.

But you could say the overhead is _at least_ the sum of the
interruption lengths, plus the unaccounted-for cache effects.
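That lower bound can be computed mechanically from a recorded trace of
interruption intervals. A minimal sketch (the interval values in the
usage example below are illustrative, matching the 5us/3us diagram
earlier, not taken from a real rt-trace-bcc.py trace):

```c
#include <assert.h>

/* One interruption of an isolated CPU, in microseconds. */
struct interruption {
	unsigned long long start_us;
	unsigned long long end_us;
};

/* Lower bound on the interruption overhead: the sum of the lengths.
 * Cache effects come on top of this and are not accounted for here. */
static unsigned long long
interruption_overhead_us(const struct interruption *ivs, int n)
{
	unsigned long long sum = 0;

	for (int i = 0; i < n; i++)
		sum += ivs[i].end_us - ivs[i].start_us;
	return sum;
}
```

For the diagram above (a 5us interruption followed by a 3us one),
intervals such as {0, 5} and {20, 23} give a lower bound of 8us.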

Also note that for idle->interrupt->idle type scenarios (the vRAN
usecase above currently does not idle at all, but there is interest
from the field in that happening for power saving reasons), you'd
also have to add the return-from-idle time.

> > Inheritance is an attempt to support unmodified binaries like so:
> >
> > 1) configure task isolation parameters (eg sync per-CPU vmstat to global
> > stats on system call returns).
> > 2) enable inheritance (so that task isolation configuration and
> > activation states are copied across to child processes).
> > 3) enable task isolation.
> > 4) execv(binary, params)
>
> What for? If an application has isolation requirements, then the
> specific requirements are part of the application design and not of some
> arbitrary wrapper.

To be able to configure and activate task isolation for an unmodified
binary, which seems a useful feature. However, I have no problem with
not supporting unmodified binaries (the applications would then have
to be changed).

There are 3 types of application arrangements:

==================
Userspace support
==================

Task isolation is divided in two main steps: configuration and activation.

Each step can be performed by an external tool or the latency sensitive
application itself. util-linux contains the "chisol" tool for this
purpose.

This results in three combinations:

1. Both configuration and activation performed by the
latency sensitive application.
Allows fine grained control of what task isolation
features are enabled and when (see samples section below).

2. Only activation can be performed by the latency sensitive app
(and configuration performed by chisol).
This allows the admin/user to control task isolation parameters,
and applications have to be modified only once.

3. Configuration and activation performed by an external tool.
This allows unmodified applications to take advantage of
task isolation. Activation is performed by the "-a" option
of chisol.

---

Some features might not be supportable (or have awkward behavior) in a
given combination. For example, for a feature such as "warn if
sched_out/sched_in is ever performed while task isolation is
configured/activated", you'll get those warnings
for combination 3 (which is the case of unmodified binaries above).

> Inheritance is an orthogonal problem and there is no reason to have this
> initially.

No problem, will drop it.

> Can we please focus on the initial problem of
> providing a sensible isolation mechanism with well defined semantics?

Case 2, however, was implicitly suggested by you (or at least that is
how I understood it):

"Summary: The problem to be solved cannot be restricted to

self_defined_important_task(OWN_WORLD);

Policy is not a binary on/off problem. It's manifold across all levels
of the stack and only a kernel problem when it comes down to the last
line of defence.

Up to the point where the kernel puts the line of last defence, policy
is defined by the user/admin via mechanisms provided by the kernel.

Emphasis on "mechanisms provided by the kernel", aka. user API.

Just in case, I hope that I don't have to explain what level of scrutiny
and thought this requires."

The idea, as I understood it, was that certain task isolation features
(or their parameters) might have to be changed at runtime (which depends
on the task isolation features themselves, and the plan is to create
an extensible interface). So for case 2, all you'd have to do is to
modify the application only once and allow the admin to configure
the features. From the documentation:

This is a snippet of code to activate task isolation if
it has been previously configured (by chisol for example)::

    #include <sys/prctl.h>
    #include <linux/types.h>

    #ifdef PR_ISOL_CFG_GET
            unsigned long long fmask;
            int ret;

            ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
            if (ret != -1 && fmask != 0) {
                    ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
                    if (ret == -1) {
                            perror("prctl PR_ISOL_ACTIVATE_SET");
                            return ret;
                    }
            }
    #endif

This seemed pretty useful to me (and it is possible as long as the
features being discussed do not require further modifications on the
part of the application). For example, a new task isolation feature
can be enabled without having to modify the application.

Again, maybe that was misunderstood (and I'm OK with dropping this
and forcing both configuration and activation to be performed
inside the app), no problem.

> >> Special handling when the scheduler
> >> switches a task? If tasks are being switched that requires them to be low
> >> latency and undisturbed then something went very very wrong with the
> >> system configuration and the only thing I would suggest is to issue some
> >> kernel warning that this is not the way one should configure the system.
> >
> > Trying to provide mechanisms, not policy?
>
> This preemption notifier is not a mechanism, it's simply mindless
> hackery as I told you already.

Sure, if there is another way of checking "if per-CPU vmstats require
syncing" that is cheap (which it seems you suggested in the other
email), we can drop the preempt notifiers.


2022-05-04 23:08:05

by Tim Chen

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Tue, 2022-03-15 at 12:31 -0300, Marcelo Tosatti wrote:
> The logic to disable vmstat worker thread, when entering
> nohz full, does not cover all scenarios. For example, it is possible
> for the following to happen:
>
> 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> 2) app runs mlock, which increases counters for mlock'ed pages.
> 3) start -RT loop
>
> Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> the mlock, vmstat shepherd can restart vmstat worker thread on
> the CPU in question.
>
> To fix this, add task isolation prctl interface to quiesce
> deferred actions when returning to userspace.
>
> The patchset is based on ideas and code from the
> task isolation patchset from Alex Belits:
> https://lwn.net/Articles/816298/
>
> Please refer to Documentation/userspace-api/task_isolation.rst
> (patch 1) for details. It's attached at the end of this message

Patch 1 doesn't seem to be the documentation patch; the documentation
is rather in patch 4.

>
> Task isolation prctl interface
> ******************************
>
> Certain types of applications benefit from running uninterrupted by
> background OS activities. Realtime systems and high-bandwidth
> networking applications with user-space drivers can fall into the
> category.
>
> To create an OS noise free environment for the application, this
> interface allows userspace to inform the kernel of the start and end of
> the latency sensitive application section (with configurable system
> behaviour for that section).
>
> Note: the prctl interface is independent of nohz_full=.
>
> The prctl options are:
>
> * PR_ISOL_FEAT_GET: Retrieve supported features.
>
> * PR_ISOL_CFG_GET: Retrieve task isolation configuration.
>
> * PR_ISOL_CFG_SET: Set task isolation configuration.
>
> * PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
>
> * PR_ISOL_ACTIVATE_SET: Set task isolation activation state.
>
> Summary of terms:
>
> * feature:
>
> A distinct attribute or aspect of task isolation. Examples of
> features could be logging, new operating modes (eg: syscalls
> disallowed), userspace notifications, etc. The only feature
> currently available is quiescing.
>
> * configuration:
>
> A specific choice from a given set of possible choices that
> dictate how the particular feature in question should behave.
>
> * activation state:
>
> The activation state (whether active/inactive) of the task
> isolation features (features must be configured before being
> activated).
>
> Inheritance of the isolation parameters and state, across fork(2) and
> clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.
>
> At a high-level, task isolation is divided in two steps:
>
> 1. Configuration.
>
> 2. Activation.
>
> Section "Userspace support" describes how to use task isolation.
>
> In terms of the interface, the sequence of steps to activate task
> isolation are:
>
> 1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).
>
> 2. Configure task isolation features
> (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).
>
> 3. Activate or deactivate task isolation features
> (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).
>
> This interface is based on ideas and code from the task isolation
> patchset from Alex Belits: https://lwn.net/Articles/816298/
>
> Note: if the need arises to configure an individual quiesce feature
> with its own extensible structure, please add ISOL_F_QUIESCE_ONE to
> PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on multiple
> features per syscall currently).
>
>
> Feature description
> ===================
>
> * "ISOL_F_QUIESCE"
>
> This feature allows quiescing selected kernel activities on return
> from system calls.
>
>
> Interface description
> =====================
>
> **PR_ISOL_FEAT**:
>
> Returns the supported features and feature capabilities, as a
> bitmask:
>
> prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
>
> The 'feat' argument specifies whether to return supported features
> (if zero), or feature capabilities (if not zero). Possible values
> for 'feat' are:
>
> * "0":
>
> Return the bitmask of supported features, in the location
> pointed to by "(int *)arg3". The buffer should allow space
> for 8 bytes.
>
> * "ISOL_F_QUIESCE":
>
> Return a structure containing which kernel activities are
> supported for quiescing, in the location pointed to by "(int
> *)arg3":
>
> struct task_isol_quiesce_extensions {
>         __u64 flags;
>         __u64 supported_quiesce_bits;
>         __u64 pad[6];
> };
>
> Where:
>
> *flags*: Additional flags (should be zero).
>
> *supported_quiesce_bits*: Bitmask indicating
> which features are supported for quiescing.
>
> *pad*: Additional space for future enhancements.
>
> Features and their capabilities are defined at
> include/uapi/linux/task_isolation.h.
>
> **PR_ISOL_CFG_GET**:
>
> Retrieve task isolation configuration. The general format is:
>
> prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
>
> The 'what' argument specifies what to configure. Possible values
> are:
>
> * "I_CFG_FEAT":
>
> Return configuration of task isolation features. The 'arg3'
> argument specifies whether to return configured features (if
> zero), or individual feature configuration (if not zero), as
> follows.
>
> * "0":
>
> Return the bitmask of configured features, in the
> location pointed to by "(int *)arg4". The buffer
> should allow space for 8 bytes.
>
> * "ISOL_F_QUIESCE":
>
> If arg4 is QUIESCE_CONTROL, return the control structure
> for quiescing of background kernel activities, in the
> location pointed to by "(int *)arg5":
>
> struct task_isol_quiesce_control {
>         __u64 flags;
>         __u64 quiesce_mask;
>         __u64 quiesce_oneshot_mask;
>         __u64 pad[5];
> };
>
> See PR_ISOL_CFG_SET description for meaning of fields.
>
> * "I_CFG_INHERIT":
>
> Retrieve inheritance configuration across fork/clone.
>
> Return the structure which configures inheritance across
> fork/clone, in the location pointed to by "(int *)arg4":
>
> struct task_isol_inherit_control {
>         __u8 inherit_mask;
>         __u8 pad[7];
> };
>
> See PR_ISOL_CFG_SET description for meaning of fields.
>
> **PR_ISOL_CFG_SET**:
>
> Set task isolation configuration. The general format is:
>
> prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
>
> The 'what' argument specifies what to configure. Possible values
> are:
>
> * "I_CFG_FEAT":
>
> Set configuration of task isolation features. 'arg3' specifies
> the feature. Possible values are:
>
> * "ISOL_F_QUIESCE":

Is such fine-grained control over which kernel activity to quiesce
really necessary?

For most users, all they care about is that their task is not disturbed
by kernel activities; they don't want to be bothered with setting which
particular activities to quiesce. And in your patches there is only
ISOL_F_QUIESCE_VMSTATS and nothing else. I think you could probably
skip the QUIESCE control for now and add it when there's a true need
for fine-grained control. This will make the interface simpler for
user applications.

Tim






2022-05-05 12:51:49

by Thomas Gleixner

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Wed, May 04 2022 at 15:56, Marcelo Tosatti wrote:
> On Wed, May 04, 2022 at 03:20:03PM +0200, Thomas Gleixner wrote:
>> Can we please focus on the initial problem of
>> providing a sensible isolation mechanism with well defined semantics?
>
> Case 2, however, was implicitly suggested by you (or at least that is
> how I understood it):
>
> "Summary: The problem to be solved cannot be restricted to
>
> self_defined_important_task(OWN_WORLD);
>
> Policy is not a binary on/off problem. It's manifold across all levels
> of the stack and only a kernel problem when it comes down to the last
> line of defence.
>
> Up to the point where the kernel puts the line of last defence, policy
> is defined by the user/admin via mechanisms provided by the kernel.
>
> Emphasis on "mechanisms provided by the kernel", aka. user API.
>
> Just in case, I hope that I don't have to explain what level of scrutiny
> and thought this requires."

Correct. This reasoning is still valid and I haven't changed my opinion
on that since then.

My main objections against the proposed solution back then were the all
or nothing approach and the implicit hard coded policies.

> The idea, as I understood it, was that certain task isolation features (or
> their parameters) might have to be changed at runtime (which depends on
> the task isolation features themselves, and the plan is to create
> an extensible interface).

Again. I'm not against useful controls to select the isolation an
application requires. I'm neither against extensible interfaces.

But I'm against overengineered implementations which lack any form of
sensible design and have ill defined semantics at the user ABI.

Designing user space ABI is _hard_ and needs a lot of thoughts. It's not
done with throwing something 'extensible' at the kernel and hope it
sticks. As I showed you in the review, the ABI is inconsistent in
itself, it has ill defined semantics and lacks any form of justification
of the approach taken.

Can we please take a step back and:

1) Define what is trying to be solved and what are the pieces known
today which need to be controlled in order to achieve the desired
isolation properties.

2) Describe the usage scenarios and the resulting constraints.

3) Describe the requirements for features on top, e.g. inheritance
or external control.

Once we have that, we can have a discussion about the desired control
granularity and how to support the extra features in a consistent and
well defined way.

A good and extensible UABI design comes with well defined functionality
for the start and an obvious and maintainable extension path. The most
important part is the well defined functionality.

There have been enough examples in the past how well received approaches
are, which lack the well defined part. Linus really loves to get a pull
request for something which cannot be described what it does, but could
be used for cool things in the future.

> So for case 2, all you'd have to do is to modify the application only
> once and allow the admin to configure the features.

That's still an orthogonal problem, which can be solved once a sensible
mechanism to control the isolation and handle it at the transition
points is in place. You surely want to consider it when designing the
UABI, but it's not required to create the real isolation mechanism in
the first place.

Problem decomposition is not an entirely new concept, really.

Thanks,

tglx

2022-05-09 03:55:03

by Marcelo Tosatti

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Wed, May 04, 2022 at 10:01:47AM -0700, Tim Chen wrote:
> On Tue, 2022-03-15 at 12:31 -0300, Marcelo Tosatti wrote:
> > The logic to disable vmstat worker thread, when entering
> > nohz full, does not cover all scenarios. For example, it is possible
> > for the following to happen:
> >
> > 1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
> > 2) app runs mlock, which increases counters for mlock'ed pages.
> > 3) start -RT loop
> >
> > Since refresh_cpu_vm_stats from nohz_full logic can happen _before_
> > the mlock, vmstat shepherd can restart vmstat worker thread on
> > the CPU in question.
> >
> > To fix this, add task isolation prctl interface to quiesce
> > deferred actions when returning to userspace.
> >
> > The patchset is based on ideas and code from the
> > task isolation patchset from Alex Belits:
> > https://lwn.net/Articles/816298/
> >
> > Please refer to Documentation/userspace-api/task_isolation.rst
> > (patch 1) for details. It's attached at the end of this message
>
Patch 1 doesn't seem to be the documentation patch; the documentation
is rather in patch 4.

Hi Tim,

Yes, one might consider that awkward (but that order made sense when
writing the patches).

Is there a different order that makes more sense?

> >
> > Task isolation prctl interface
> > ******************************
> >
> > Certain types of applications benefit from running uninterrupted by
> > background OS activities. Realtime systems and high-bandwidth
> > networking applications with user-space drivers can fall into this
> > category.
> >
> > To create an OS noise free environment for the application, this
> > interface allows userspace to inform the kernel of the start and end of
> > the latency sensitive application section (with configurable system
> > behaviour for that section).
> >
> > Note: the prctl interface is independent of nohz_full=.
> >
> > The prctl options are:
> >
> > * PR_ISOL_FEAT_GET: Retrieve supported features.
> >
> > * PR_ISOL_CFG_GET: Retrieve task isolation configuration.
> >
> > * PR_ISOL_CFG_SET: Set task isolation configuration.
> >
> > * PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
> >
> > * PR_ISOL_ACTIVATE_SET: Set task isolation activation state.
> >
> > Summary of terms:
> >
> > * feature:
> >
> > A distinct attribute or aspect of task isolation. Examples of
> > features could be logging, new operating modes (eg: syscalls
> > disallowed), userspace notifications, etc. The only feature
> > currently available is quiescing.
> >
> > * configuration:
> >
> > A specific choice from a given set of possible choices that
> > dictate how the particular feature in question should behave.
> >
> > * activation state:
> >
> > The activation state (whether active/inactive) of the task
> > isolation features (features must be configured before being
> > activated).
> >
> > Inheritance of the isolation parameters and state, across fork(2) and
> > clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.
> >
> > At a high-level, task isolation is divided in two steps:
> >
> > 1. Configuration.
> >
> > 2. Activation.
> >
> > Section "Userspace support" describes how to use task isolation.
> >
> > In terms of the interface, the sequence of steps to activate task
> > isolation are:
> >
> > 1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).
> >
> > 2. Configure task isolation features
> > (PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).
> >
> > 3. Activate or deactivate task isolation features
> > (PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).
> >
> > This interface is based on ideas and code from the task isolation
> > patchset from Alex Belits: https://lwn.net/Articles/816298/
> >
> > Note: if the need arises to configure an individual quiesce feature
> > with its own extensible structure, please add ISOL_F_QUIESCE_ONE to
> > PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on multiple
> > features per syscall currently).
> >
> >
> > Feature description
> > ===================
> >
> > * "ISOL_F_QUIESCE"
> >
> > This feature allows quiescing selected kernel activities on return
> > from system calls.
> >
> >
> > Interface description
> > =====================
> >
> > **PR_ISOL_FEAT**:
> >
> > Returns the supported features and feature capabilities, as a
> > bitmask:
> >
> > prctl(PR_ISOL_FEAT, feat, arg3, arg4, arg5);
> >
> > The 'feat' argument specifies whether to return supported features
> > (if zero), or feature capabilities (if not zero). Possible values
> > for 'feat' are:
> >
> > * "0":
> >
> > Return the bitmask of supported features, in the location
> > pointed to by "(int *)arg3". The buffer should allow space
> > for 8 bytes.
> >
> > * "ISOL_F_QUIESCE":
> >
> > Return a structure containing which kernel activities are
> > supported for quiescing, in the location pointed to by "(int
> > *)arg3":
> >
> > struct task_isol_quiesce_extensions {
> >         __u64 flags;
> >         __u64 supported_quiesce_bits;
> >         __u64 pad[6];
> > };
> >
> > Where:
> >
> > *flags*: Additional flags (should be zero).
> >
> > *supported_quiesce_bits*: Bitmask indicating
> > which features are supported for quiescing.
> >
> > *pad*: Additional space for future enhancements.
> >
> > Features and their capabilities are defined at
> > include/uapi/linux/task_isolation.h.
> >
> > **PR_ISOL_CFG_GET**:
> >
> > Retrieve task isolation configuration. The general format is:
> >
> > prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);
> >
> > The 'what' argument specifies what to configure. Possible values
> > are:
> >
> > * "I_CFG_FEAT":
> >
> > Return configuration of task isolation features. The 'arg3'
> > argument specifies whether to return configured features (if
> > zero), or individual feature configuration (if not zero), as
> > follows.
> >
> > * "0":
> >
> > Return the bitmask of configured features, in the
> > location pointed to by "(int *)arg4". The buffer
> > should allow space for 8 bytes.
> >
> > * "ISOL_F_QUIESCE":
> >
> > If arg4 is QUIESCE_CONTROL, return the control structure
> > for quiescing of background kernel activities, in the
> > location pointed to by "(int *)arg5":
> >
> > struct task_isol_quiesce_control {
> >         __u64 flags;
> >         __u64 quiesce_mask;
> >         __u64 quiesce_oneshot_mask;
> >         __u64 pad[5];
> > };
> >
> > See PR_ISOL_CFG_SET description for meaning of fields.
> >
> > * "I_CFG_INHERIT":
> >
> > Retrieve inheritance configuration across fork/clone.
> >
> > Return the structure which configures inheritance across
> > fork/clone, in the location pointed to by "(int *)arg4":
> >
> > struct task_isol_inherit_control {
> >         __u8 inherit_mask;
> >         __u8 pad[7];
> > };
> >
> > See PR_ISOL_CFG_SET description for meaning of fields.
> >
> > **PR_ISOL_CFG_SET**:
> >
> > Set task isolation configuration. The general format is:
> >
> > prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);
> >
> > The 'what' argument specifies what to configure. Possible values
> > are:
> >
> > * "I_CFG_FEAT":
> >
> > Set configuration of task isolation features. 'arg3' specifies
> > the feature. Possible values are:
> >
> > * "ISOL_F_QUIESCE":
>
> Is such fine-grained control over which kernel activity to quiesce
> really necessary?
>
> For most users, all they care about is that their task is not disturbed
> by kernel activities; they don't want to be bothered with setting which
> particular activities to quiesce. And in your patches there
> is only ISOL_F_QUIESCE_VMSTATS and nothing else. I think you could
> probably skip the QUIESCE control for now and add it when there's
> a true need for fine-grained control. This will make the interface simpler
> for user applications.
>
> Tim

Yes, this could be done (for example, we can maintain the current
scheme, add a way to query all supported features, and enable them
in chisol). Then the application only has to be modified with:

This is a snippet of code to activate task isolation if
it has been previously configured (by chisol for example)::

    #include <sys/prctl.h>
    #include <linux/types.h>

    #ifdef PR_ISOL_CFG_GET
            unsigned long long fmask;
            int ret;

            ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
            if (ret != -1 && fmask != 0) {
                    ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
                    if (ret == -1) {
                            perror("prctl PR_ISOL_ACTIVATE_SET");
                            return ret;
                    }
            }
    #endif

Or not modified at all (well, not initially, since support for executing
unmodified binaries will be dropped).









2022-05-09 06:47:19

by Marcelo Tosatti

Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync


Hi Thomas,

On Wed, May 04, 2022 at 10:15:14PM +0200, Thomas Gleixner wrote:
> On Wed, May 04 2022 at 15:56, Marcelo Tosatti wrote:
> > On Wed, May 04, 2022 at 03:20:03PM +0200, Thomas Gleixner wrote:
> >> Can we please focus on the initial problem of
> >> providing a sensible isolation mechanism with well defined semantics?
> >
> > Case 2, however, was implicitly suggested by you (or at least that is
> > how I understood it):
> >
> > "Summary: The problem to be solved cannot be restricted to
> >
> > self_defined_important_task(OWN_WORLD);
> >
> > Policy is not a binary on/off problem. It's manifold across all levels
> > of the stack and only a kernel problem when it comes down to the last
> > line of defence.
> >
> > Up to the point where the kernel puts the line of last defence, policy
> > is defined by the user/admin via mechanisms provided by the kernel.
> >
> > Emphasis on "mechanisms provided by the kernel", aka. user API.
> >
> > Just in case, I hope that I don't have to explain what level of scrutiny
> > and thought this requires."
>
> Correct. This reasoning is still valid and I haven't changed my opinion
> on that since then.
>
> My main objections against the proposed solution back then were the all
> or nothing approach and the implicit hard coded policies.
>
> > The idea, as I understood it, was that certain task isolation features (or
> > their parameters) might have to be changed at runtime (which depends on
> > the task isolation features themselves, and the plan is to create
> > an extensible interface).
>
> Again. I'm not against useful controls to select the isolation an
> application requires. I'm neither against extensible interfaces.
>
> But I'm against overengineered implementations which lack any form of
> sensible design and have ill defined semantics at the user ABI.
>
> Designing user space ABI is _hard_ and needs a lot of thoughts. It's not
> done with throwing something 'extensible' at the kernel and hope it
> sticks. As I showed you in the review, the ABI is inconsistent in
> itself, it has ill defined semantics and lacks any form of justification
> of the approach taken.
>
> Can we please take a step back and:
>
> 1) Define what is trying to be solved

Avoid interruptions to application code execution on isolated CPUs.

Different use-cases might accept different lengths/frequencies
of interruptions (including no interruptions at all).

> and what are the pieces known
> today which need to be controlled in order to achieve the desired
> isolation properties.

I hope you don't mean the current CPU isolation features which have to
be enabled, but only the ones which are not enabled today:

"Isolation of the threads was done through the following kernel parameters:

nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off
numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt

And systemd was configured with the following affinites:

system.conf:CPUAffinity=0-7,16-23

This means that the second socket will be generally free of tasks and
kernel threads."

So here are some features which could be written on top of the proposed
task isolation via prctl:

1)

Enable or disable the following optional behaviour:

A.

	if (cpu->isolated_avoid_queue_work)
		return -EBUSY;

	queue_work_on(cpu, workfn);

(for the functions that can handle errors gracefully).

B.

	if (cpu->isolated_avoid_function_ipi)
		return -EBUSY;

	smp_call_function_single(cpu, fn);

(for the functions that can handle errors gracefully).
Those that can't handle errors gracefully should be changed
to either handle errors or to use remote work.

Not certain whether this should be on a per-case basis: say
"avoid action1|avoid action2|avoid action3|..." (one bit per
action) and an "ALL" control, where actionZ is an action
that triggers an IPI or remote work (then you would check
whether to fail not at smp_call_function_single
time but before the action starts).

Also, one might use something such as stalld (which schedules
tasks in/out for a short amount of time every given time window),
which might be acceptable for a given workload; in that case one
would disable cpu->isolated_avoid_queue_work (or expose this on a
per-case basis, unsure which is better).

As for IPIs, whether to block a function call to an isolated
CPU depends on whether that function call (and its frequency)
will cause the latency sensitive application to violate its "latency"
requirements.

Perhaps "ALL - action1, action2, action3" is useful.
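A user-space model of the per-action bitmask idea above (all names and
the bit layout here are hypothetical, not from the patchset):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-action bits; ISOL_AVOID_ALL is the "ALL" control. */
#define ISOL_AVOID_QUEUE_WORK   (1ULL << 0)
#define ISOL_AVOID_FUNCTION_IPI (1ULL << 1)
#define ISOL_AVOID_ALL          (~0ULL)

struct cpu_isol {
	uint64_t avoid_mask;	/* actions this isolated CPU refuses */
};

/* Gate an action against the CPU's isolation mask; callers must be
 * able to handle -EBUSY gracefully, as in the sketches above. */
static int isol_gate(const struct cpu_isol *cpu, uint64_t action)
{
	if (cpu->avoid_mask & action)
		return -EBUSY;
	return 0;	/* proceed with queue_work_on()/IPI/etc. */
}
```

"ALL - action1, action2" then becomes masking with
`ISOL_AVOID_ALL & ~(ISOL_AVOID_QUEUE_WORK | ISOL_AVOID_FUNCTION_IPI)`,
and a stalld-friendly configuration simply leaves the queue_work bit
clear.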

=======================================

2)

In general, preventing a CPU from caching per-CPU data (or uncaching
it on return to userspace), since cached per-CPU data might require
an IPI to invalidate later on (see point [1] below for more thoughts
on this issue).


For example, for KVM:

	/*
	 * MMU notifier 'invalidate_range_start' hook.
	 */
	void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
					       unsigned long end, bool may_block)
	{
		DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
		struct gfn_to_pfn_cache *gpc;
		bool wake_vcpus = false;
		...
		called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);

which will

	smp_call_function_many(cpus, ack_flush, NULL, wait);
	...


====================================================

3) Enabling a kernel warning when a task switch happens on a CPU
which runs a task isolated thread?

From Christoph:

Special handling when the scheduler
switches a task? If tasks are being switched that requires them to be low
latency and undisturbed then something went very very wrong with the
system configuration and the only thing I would suggest is to issue some
kernel warning that this is not the way one should configure the system.

====================================================

4) Sending a signal whenever an application is interrupted
(hum, this could be done via BPF).

Those are the ones I can think of at the moment.
Not sure what other people can think of.

> 2) Describe the usage scenarios and the resulting constraints.

Well, the constraints should be in the form:

"In a given window of time, there should be no more than N
CPU interruptions of length L each."

(It should be more complicated due to cache effects, but by
choosing lower N and L one can compensate for that.)

I believe?

Also, some memory bandwidth must be available to the application
(or to its data/code in shared caches), which depends on what other
CPUs in the system are doing, the cache hierarchy, the application,
etc.
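To make the "no more than N interruptions of length L per window" constraint above concrete, here is a small sketch in plain C (nothing kernel-specific; the function name and the trace representation are made up for illustration) that checks a recorded interruption trace against it:

```c
#include <stdint.h>

/*
 * Check a trace of CPU interruptions against the constraint:
 * no interruption longer than L, and no more than N interruptions
 * starting within any time window of length W.
 *
 * start[] holds interruption start times, sorted ascending;
 * len[] holds the corresponding interruption lengths.
 */
static int meets_constraint(const uint64_t *start, const uint64_t *len,
                            int n, uint64_t W, int N, uint64_t L)
{
    for (int i = 0; i < n; i++) {
        if (len[i] > L)
            return 0;       /* a single interruption is too long */

        /* count interruptions starting in [start[i], start[i] + W) */
        int count = 0;
        for (int j = i; j < n && start[j] < start[i] + W; j++)
            count++;
        if (count > N)
            return 0;       /* too many interruptions in one window */
    }
    return 1;
}
```

It only checks windows anchored at each interruption start, which is sufficient because any violating window can be shifted to start at an interruption without losing interruptions.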

[1]: There is also a question of whether to focus only on
applications that do not perform system calls on their latency
sensitive path, and applications that perform system calls.

Because some CPU interruptions can't be avoided if the application
is in the kernel: for example instruction cache flushes due to
static_key rewrites or kernel TLB flushes (well they could be avoided
with more infrastructure, but there is no such infrastructure at
the moment).

> 3) Describe the requirements for features on top, e.g. inheritance
> or external control.

1) Be able to use unmodified applications (as long as the features
to be enabled are compatible with such usage, for example "killing
/ sending signal to application if task is interrupted" is obviously
incompatible with unmodified applications).

2) External control: be able to modify what task isolation features are
enabled externally (not within the application itself). The latency
sensitive application should inform the kernel of the beginning of
the latency sensitive section (at this time, the task isolation
features configured externally will be activated).

3) One-shot mode: be able to quiesce certain kernel activities
only on the first time a syscall is made (because the overhead
of subsequent quiescing, for the subsequent system calls, is
undesired).
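The one-shot mode in item 3 can be modeled as a pending flag that is cleared after the first quiesce. This is a userspace sketch with invented names (struct task_isol, ret_to_user), not the actual prctl implementation:

```c
/* Hypothetical per-task state; names are illustrative only. */
struct task_isol {
    int oneshot;          /* quiesce only on the first syscall return */
    int quiesce_pending;  /* quiesce work still to be done */
};

static int quiesce_count; /* counts how often quiescing actually ran */

static void quiesce(void)
{
    quiesce_count++;      /* stand-in for syncing vmstat, flushing, ... */
}

/* Model of the return-to-userspace path after a syscall. */
static void ret_to_user(struct task_isol *t)
{
    if (!t->quiesce_pending)
        return;
    quiesce();
    if (t->oneshot)
        t->quiesce_pending = 0; /* skip the overhead on later syscalls */
}
```

In one-shot mode only the first syscall pays the quiescing cost; without it, every return to userspace quiesces again.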

> Once we have that, we can have a discussion about the desired control
> granularity and how to support the extra features in a consistent and
> well defined way.
>
> A good and extensible UABI design comes with well defined functionality
> for the start and an obvious and maintainable extension path. The most
> important part is the well defined functionality.
>
> There have been enough examples in the past how well received approaches
> are, which lack the well defined part. Linus really loves to get a pull
> request for something which cannot be described what it does, but could
> be used for cool things in the future.
>
> > So for case 2, all you'd have to do is to modify the application only
> > once and allow the admin to configure the features.
>
> That's still an orthogonal problem, which can be solved once a sensible
> mechanism to control the isolation and handle it at the transition
> points is in place. You surely want to consider it when designing the
> UABI, but it's not required to create the real isolation mechanism in
> the first place.

OK, I can drop all of that in favor of smaller patches with the handling
of transition points only (then later add one-shot mode, inheritance,
external control).

But I might wait for the discussion of the requirements you raised
first.

> Problem decomposition is not an entirely new concept, really.

Sure, thanks.


2022-06-01 21:12:04

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [patch v12 00/13] extensible prctl task isolation interface and vmstat sync

On Thu, May 05, 2022 at 01:52:35PM -0300, Marcelo Tosatti wrote:
>
> Hi Thomas,
>
> On Wed, May 04, 2022 at 10:15:14PM +0200, Thomas Gleixner wrote:
> > On Wed, May 04 2022 at 15:56, Marcelo Tosatti wrote:
> > > On Wed, May 04, 2022 at 03:20:03PM +0200, Thomas Gleixner wrote:
> > >> Can we please focus on the initial problem of
> > >> providing a sensible isolation mechanism with well defined semantics?
> > >
> > > Case 2, however, was implicitly suggested by you (or at least that is
> > > how I understood it):
> > >
> > > "Summary: The problem to be solved cannot be restricted to
> > >
> > > self_defined_important_task(OWN_WORLD);
> > >
> > > Policy is not a binary on/off problem. It's manifold across all levels
> > > of the stack and only a kernel problem when it comes down to the last
> > > line of defence.
> > >
> > > Up to the point where the kernel puts the line of last defence, policy
> > > is defined by the user/admin via mechanisms provided by the kernel.
> > >
> > > Emphasis on "mechanisms provided by the kernel", aka. user API.
> > >
> > > Just in case, I hope that I don't have to explain what level of scrutiny
> > > and thought this requires."
> >
> > Correct. This reasoning is still valid and I haven't changed my opinion
> > on that since then.
> >
> > My main objections against the proposed solution back then were the all
> > or nothing approach and the implicit hard coded policies.
> >
> > > The idea, as I understood it, was that certain task isolation features (or
> > > their parameters) might have to be changed at runtime (which depends on
> > > the task isolation features themselves, and the plan is to create
> > > an extensible interface).
> >
> > Again. I'm not against useful controls to select the isolation an
> > application requires. I'm neither against extensible interfaces.
> >
> > But I'm against overengineered implementations which lack any form of
> > sensible design and have ill defined semantics at the user ABI.
> >
> > Designing user space ABI is _hard_ and needs a lot of thoughts. It's not
> > done with throwing something 'extensible' at the kernel and hope it
> > sticks. As I showed you in the review, the ABI is inconsistent in
> > itself, it has ill defined semantics and lacks any form of justification
> > of the approach taken.
> >
> > Can we please take a step back and:
> >
> > 1) Define what is trying to be solved
>
> Avoid interruptions to application code execution on isolated CPUs.
>
> Different use-cases might accept different length/frequencies
> of interruptions (including no interruptions).
>
> > and what are the pieces known
> > today which need to be controlled in order to achieve the desired
> > isolation properties.
>
> I hope you don't mean the current CPU isolation features which have to
> be enabled, but only the ones which are not enabled today:
>
> "Isolation of the threads was done through the following kernel parameters:
>
> nohz_full=8-15,24-31 rcu_nocbs=8-15,24-31 poll_spectre_v2=off
> numa_balancing=disable rcutree.kthread_prio=3 intel_pstate=disable nosmt
>
> And systemd was configured with the following affinites:
>
> system.conf:CPUAffinity=0-7,16-23
>
> This means that the second socket will be generally free of tasks and
> kernel threads."
>
> So here are some features which could be written on top of the proposed
> task isolation via prctl:
>
> 1)
>
> Enable or disable the following optional behaviour
>
> A.
> if (cpu->isolated_avoid_queue_work)
>         return -EBUSY;
>
> queue_work_on(cpu, workfn);
>
> (for the functions that can handle errors gracefully).
>
> B.
> if (cpu->isolated_avoid_function_ipi)
>         return -EBUSY;
>
> smp_call_function_single(cpu, fn);
> (for the functions that can handle errors gracefully).
> Those that can't handle errors gracefully should be changed
> to either handle errors or be converted to remote work.
>
> Not certain if this should be on per-case basis: say
> "avoid action1|avoid action2|avoid action3|..." (bit per
> action) and a "ALL" control, where actionZ is an action
> that triggers an IPI or remote work (then you would check
> for whether to fail not at smp_call_function_single
> time but before the action starts).
>
> Also, one might use something such as stalld (which schedules
> tasks in/out for a short amount of time in every given time window),
> which might be acceptable for their workload, so they'd disable
> cpu->isolated_avoid_queue_work (or expose this on a per-case basis;
> unsure which is better).
>
> As for IPIs, whether to block a function call to an isolated
> CPU depends on whether that function call (and its frequency)
> will cause the latency sensitive application to violate its "latency"
> requirements.
>
> Perhaps "ALL - action1, action2, action3" is useful.
>
> =======================================
>
> 2)
>
> In general, preventing a CPU from caching per-CPU data (or
> uncaching it on return to userspace), since such cached data might
> require an IPI to invalidate later on (see point [1] below for more
> thoughts on this issue).
>
>
> For example, for KVM:
>
> /*
>  * MMU notifier 'invalidate_range_start' hook.
>  */
> void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
>                                        unsigned long end, bool may_block)
> {
>         DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
>         struct gfn_to_pfn_cache *gpc;
>         bool wake_vcpus = false;
>         ...
>         called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);
>
> which will
>
>         smp_call_function_many(cpus, ack_flush, NULL, wait);
>         ...
>
>
> ====================================================
>
> 3) Enabling a kernel warning when a task switch happens on a CPU
> which runs a task isolated thread?
>
> From Christoph:
>
> Special handling when the scheduler
> switches a task? If tasks are being switched that are required to be low
> latency and undisturbed, then something went very, very wrong with the
> system configuration and the only thing I would suggest is to issue some
> kernel warning that this is not the way one should configure the system.
>
> ====================================================
>
> 4) Sending a signal whenever an application is interrupted
> (hum, this could be done via BPF).
>
> Those are the ones I can think of at the moment.
> Not sure what other people can think of.
>
> > 2) Describe the usage scenarios and the resulting constraints.
>
> Well, the constraints should be in the form:
>
> "In a given window of time, there should be no more than N
> CPU interruptions of length L each."
>
> (It should be more complicated due to cache effects, but by
> choosing lower N and L one can compensate for that.)
>
> I believe?
>
> Also, some memory bandwidth must be available to the application
> (or to its data/code in shared caches), which depends on what other
> CPUs in the system are doing, the cache hierarchy, the application,
> etc.
>
> [1]: There is also a question of whether to focus only on
> applications that do not perform system calls on their latency
> sensitive path, and applications that perform system calls.
>
> Because some CPU interruptions can't be avoided if the application
> is in the kernel: for example instruction cache flushes due to
> static_key rewrites or kernel TLB flushes (well they could be avoided
> with more infrastructure, but there is no such infrastructure at
> the moment).
>
> > 3) Describe the requirements for features on top, e.g. inheritance
> > or external control.
>
> 1) Be able to use unmodified applications (as long as the features
> to be enabled are compatible with such usage, for example "killing
> / sending signal to application if task is interrupted" is obviously
> incompatible with unmodified applications).
>
> 2) External control: be able to modify what task isolation features are
> enabled externally (not within the application itself). The latency
> sensitive application should inform the kernel of the beginning of
> the latency sensitive section (at this time, the task isolation
> features configured externally will be activated).
>
> 3) One-shot mode: be able to quiesce certain kernel activities
> only on the first time a syscall is made (because the overhead
> of subsequent quiescing, for the subsequent system calls, is
> undesired).
>
> > Once we have that, we can have a discussion about the desired control
> > granularity and how to support the extra features in a consistent and
> > well defined way.
> >
> > A good and extensible UABI design comes with well defined functionality
> > for the start and an obvious and maintainable extension path. The most
> > important part is the well defined functionality.
> >
> > There have been enough examples in the past how well received approaches
> > are, which lack the well defined part. Linus really loves to get a pull
> > request for something which cannot be described what it does, but could
> > be used for cool things in the future.
> >
> > > So for case 2, all you'd have to do is to modify the application only
> > > once and allow the admin to configure the features.
> >
> > That's still an orthogonal problem, which can be solved once a sensible
> > mechanism to control the isolation and handle it at the transition
> > points is in place. You surely want to consider it when designing the
> > UABI, but it's not required to create the real isolation mechanism in
> > the first place.
>
> OK, I can drop all of that in favor of smaller patches with the handling
> of transition points only (then later add one-shot mode, inheritance,
> external control).
>
> But I might wait for the discussion of the requirements you raised
> first.
>
> > Problem decomposition is not an entirely new concept, really.
>
> Sure, thanks.

Actually, I hope that the patches from Aaron:

[RFC PATCH v3] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too

https://lore.kernel.org/all/[email protected]/T/

can enable syncing of vmstat on return to userspace, for nohz_full CPUs.

Then the remaining items, such as

> if (cpu->isolated_avoid_queue_work)
>         return -EBUSY;

can be enabled with a different (more flexible) interface, such as writes
to a filesystem interface (or a task attribute that is transferred to a
per-CPU variable on task initialization and removed from the per-CPU
variable when the task dies).
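The parenthetical idea (a per-task attribute copied into a per-CPU variable when the task is scheduled in and cleared when it dies) might look roughly like this; everything here is an illustrative userspace model with made-up names, not kernel code:

```c
#include <stdint.h>

#define NCPUS 4

/* Hypothetical per-task attribute. */
struct task {
    uint64_t isol_avoid_mask;
};

/* Models a per-CPU variable consulted by queue_work_on() and friends. */
static uint64_t percpu_avoid_mask[NCPUS];

/* On schedule-in: publish the task's isolation mask to its CPU. */
static void task_isol_sched_in(const struct task *t, int cpu)
{
    percpu_avoid_mask[cpu] = t->isol_avoid_mask;
}

/* On task exit (or schedule-out): clear the per-CPU copy. */
static void task_isol_exit(int cpu)
{
    percpu_avoid_mask[cpu] = 0;
}
```

The appeal of this shape is that the hot-path check (for example, before queueing remote work) only reads a per-CPU variable and never has to dereference the isolated task itself.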