2022-02-04 20:21:08

by Marcelo Tosatti

Subject: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync

The logic to disable the vmstat worker thread when entering
nohz_full does not cover all scenarios. For example, it is possible
for the following to happen:

1) enter nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) app runs mlock, which increases counters for mlock'ed pages.
3) start -RT loop

Since refresh_cpu_vm_stats from the nohz_full logic can happen _before_
the mlock, the vmstat shepherd can restart the vmstat worker thread on
the CPU in question.

To fix this, add a task isolation prctl interface to quiesce
deferred actions when returning to userspace.

The patchset is based on ideas and code from the
task isolation patchset from Alex Belits:
https://lwn.net/Articles/816298/

Please refer to Documentation/userspace-api/task_isolation.rst
(patch 1) for details. It is also attached at the end of this message
in .txt format.

Note: the prctl interface is independent of nohz_full=.

The userspace patches can be found at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/

- qemu-task-isolation.patch: activate task isolation from CPU execution loop
- rt-tests-task-isolation.patch: add task isolation activation to cyclictest/oslat
- util-linux-chisol.patch: add chisol tool to util-linux.

---

v11
- Add TIF_TASK_ISOL bit to thread info flags and use it
to decide whether to perform task isolation work on
return to userspace (Frederic Weisbecker)
- Fold patch to add task_isol_exit hooks into
"sync vmstats on return to userspace" patch. (Frederic Weisbecker)
- Fix typo on prctl_task_isol_cfg_get declaration (Oscar Shiang)
- Add preempt notifiers

v10
- no changes (resending series without changelog corruption).

v9
- Clarify inheritance is propagated to all descendants (Frederic Weisbecker)
- Fix inheritance of one-shot mode across exec/fork (Frederic Weisbecker)
- Unify naming on "task_isol_..." (Frederic Weisbecker)
- Introduce CONFIG_TASK_ISOLATION (Frederic Weisbecker)

v8
- Document the possibility for ISOL_F_QUIESCE_ONE,
to configure individual features (Frederic Weisbecker).
- Fix PR_ISOL_CFG_GET typo in documentation (Frederic Weisbecker).
- Rebased against linux-2.6.git.

v7
- no changes (resending series without changelog corruption).

v6
- Move oneshot mode enablement to configuration time (Frederic Weisbecker).
- Allow more extensions to CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
- Update docs and samples regarding oneshot mode (Frederic Weisbecker).
- Update docs and samples regarding more extensibility of
CFG_SET of ISOL_F_QUIESCE (Frederic Weisbecker).
- prctl_task_isolation_activate_get should copy active_mask
to address in arg2.
- modify exit_to_user_mode_loop to cover exceptions
and interrupts.
- split exit hooks into its own patch

v5
- Add changelogs to individual patches (Peter Zijlstra).
- Add documentation to patchset intro (Peter Zijlstra).

v4:
- Switch to structures for parameters when possible
(which are more extensible).
- Switch to CFG_{S,G}ET naming and drop the
"internal configuration" prctls (Frederic Weisbecker).
- Add summary of terms to documentation (Frederic Weisbecker).
- Examples for compute and one-shot modes (Thomas G/Christoph L).

v3:

- Split in smaller patches (Nitesh Lal).
- Misc cleanups (Nitesh Lal).
- Clarify nohz_full is not a dependency (Nicolas Saenz).
- Fix incorrect values for prctl definitions (kernel robot).
- Save configured state, so applications
can activate externally configured
task isolation parameters.
- Remove "system default" notion (chisol should
make it obsolete).
- Update documentation: add new section with explanation
about configuration/activation and code example.
- Update samples.
- Report configuration/activation state at
/proc/pid/task_isolation.
- Condense dirty information of per-CPU vmstats counters
in a bool.
- In-kernel KVM support.

v2:

- Finer-grained control of quiescing (Frederic Weisbecker / Nicolas Saenz).

- Avoid potential regressions by allowing applications
to use ISOL_F_QUIESCE_DEFMASK (whose default value
is configurable in /sys/). (Nitesh Lal / Nicolas Saenz).


v10 can be found at:

https://lore.kernel.org/lkml/[email protected]/

---

Documentation/userspace-api/task_isolation.rst | 379 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
arch/s390/include/asm/thread_info.h | 2
arch/x86/include/asm/thread_info.h | 2
fs/proc/base.c | 68 ++++++++++
include/linux/entry-common.h | 2
include/linux/sched.h | 5
include/linux/task_isolation.h | 136 +++++++++++++++++++++
include/linux/vmstat.h | 25 +++
include/uapi/linux/prctl.h | 47 +++++++
init/Kconfig | 16 ++
init/init_task.c | 3
kernel/Makefile | 2
kernel/entry/common.c | 4
kernel/entry/kvm.c | 4
kernel/exit.c | 2
kernel/fork.c | 23 +++
kernel/sys.c | 16 ++
kernel/task_isolation.c | 419 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/vmstat.c | 173 +++++++++++++++++++++-----
samples/Kconfig | 7 +
samples/Makefile | 1
samples/task_isolation/Makefile | 11 +
samples/task_isolation/task_isol.c | 92 ++++++++++++++
samples/task_isolation/task_isol.h | 9 +
samples/task_isolation/task_isol_computation.c | 89 +++++++++++++
samples/task_isolation/task_isol_oneshot.c | 104 ++++++++++++++++
samples/task_isolation/task_isol_userloop.c | 54 ++++++++
27 files changed, 1658 insertions(+), 37 deletions(-)

---

Task isolation prctl interface
******************************

Certain types of applications benefit from running uninterrupted by
background OS activities. Realtime systems and high-bandwidth
networking applications with user-space drivers can fall into this
category.

To create an OS noise free environment for the application, this
interface allows userspace to inform the kernel of the start and end of
the latency sensitive application section (with configurable system
behaviour for that section).

Note: the prctl interface is independent of nohz_full=.

The prctl options are:

* PR_ISOL_FEAT_GET: Retrieve supported features.

* PR_ISOL_CFG_GET: Retrieve task isolation configuration.

* PR_ISOL_CFG_SET: Set task isolation configuration.

* PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.

* PR_ISOL_ACTIVATE_SET: Set task isolation activation state.

Summary of terms:

* feature:

A distinct attribute or aspect of task isolation. Examples of
features could be logging, new operating modes (e.g. syscalls
disallowed), userspace notifications, etc. The only feature
currently available is quiescing.

* configuration:

A specific choice from a given set of possible choices that
dictates how the particular feature in question should behave.

* activation state:

The activation state (whether active/inactive) of the task
isolation features (features must be configured before being
activated).

Inheritance of the isolation parameters and state, across fork(2) and
clone(2), can be changed via PR_ISOL_CFG_GET/PR_ISOL_CFG_SET.

At a high level, task isolation is divided into two steps:

1. Configuration.

2. Activation.

Section "Userspace support" describes how to use task isolation.

In terms of the interface, the sequence of steps to activate task
isolation is:

1. Retrieve supported task isolation features (PR_ISOL_FEAT_GET).

2. Configure task isolation features
(PR_ISOL_CFG_GET/PR_ISOL_CFG_SET).

3. Activate or deactivate task isolation features
(PR_ISOL_ACTIVATE_GET/PR_ISOL_ACTIVATE_SET).

This interface is based on ideas and code from the task isolation
patchset from Alex Belits: https://lwn.net/Articles/816298/

Note: if the need arises to configure an individual quiesce feature
with its own extensible structure, please add ISOL_F_QUIESCE_ONE to
PR_ISOL_CFG_GET/PR_ISOL_CFG_SET (ISOL_F_QUIESCE operates on multiple
features per syscall currently).


Feature description
===================

* "ISOL_F_QUIESCE"

This feature allows quiescing selected kernel activities on return
from system calls.


Interface description
=====================

**PR_ISOL_FEAT_GET**:

Returns the supported features and feature capabilities, as a
bitmask:

prctl(PR_ISOL_FEAT_GET, feat, arg3, arg4, arg5);

The 'feat' argument specifies whether to return supported features
(if zero), or feature capabilities (if not zero). Possible values
for 'feat' are:

* "0":

Return the bitmask of supported features, in the location
pointed to by "(int *)arg3". The buffer should allow space
for 8 bytes.

* "ISOL_F_QUIESCE":

Return a structure containing which kernel activities are
supported for quiescing, in the location pointed to by "(int
*)arg3":

struct task_isol_quiesce_extensions {
	__u64 flags;
	__u64 supported_quiesce_bits;
	__u64 pad[6];
};

Where:

*flags*: Additional flags (should be zero).

*supported_quiesce_bits*: Bitmask indicating
which features are supported for quiescing.

*pad*: Additional space for future enhancements.

Features and their capabilities are defined in
include/uapi/linux/prctl.h.
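
For illustration, a minimal sketch of the feature-query step. The helper
below is illustrative only; it assumes the PR_ISOL_*/ISOL_F_* definitions
and struct task_isol_quiesce_extensions from this series are visible to
userspace (e.g. via patched headers):

#include <string.h>
#include <sys/prctl.h>
#include <linux/types.h>

/* Illustrative helper: return the bitmask of kernel activities that can
 * be quiesced on this kernel, or 0 if quiescing is not supported. */
static __u64 query_quiesce_support(void)
{
	__u64 feat_mask = 0;
	struct task_isol_quiesce_extensions qext;

	/* feat == 0: bitmask of supported features, written to arg3 */
	if (prctl(PR_ISOL_FEAT_GET, 0, &feat_mask, 0, 0) == -1)
		return 0;
	if (!(feat_mask & ISOL_F_QUIESCE))
		return 0;

	/* feat == ISOL_F_QUIESCE: quiesce capabilities, written to arg3 */
	memset(&qext, 0, sizeof(qext));
	if (prctl(PR_ISOL_FEAT_GET, ISOL_F_QUIESCE, &qext, 0, 0) == -1)
		return 0;

	return qext.supported_quiesce_bits;
}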

**PR_ISOL_CFG_GET**:

Retrieve task isolation configuration. The general format is:

prctl(PR_ISOL_CFG_GET, what, arg3, arg4, arg5);

The 'what' argument specifies what to configure. Possible values
are:

* "I_CFG_FEAT":

Return configuration of task isolation features. The 'arg3'
argument specifies whether to return configured features (if
zero), or individual feature configuration (if not zero), as
follows.

* "0":

Return the bitmask of configured features, in the
location pointed to by "(int *)arg4". The buffer
should allow space for 8 bytes.

* "ISOL_F_QUIESCE":

If arg4 is QUIESCE_CONTROL, return the control structure
for quiescing of background kernel activities, in the
location pointed to by "(int *)arg5":

struct task_isol_quiesce_control {
	__u64 flags;
	__u64 quiesce_mask;
	__u64 quiesce_oneshot_mask;
	__u64 pad[5];
};

See PR_ISOL_CFG_SET description for meaning of fields.

* "I_CFG_INHERIT":

Retrieve inheritance configuration across fork/clone.

Return the structure which configures inheritance across
fork/clone, in the location pointed to by "(int *)arg4":

struct task_isol_inherit_control {
	__u8 inherit_mask;
	__u8 pad[7];
};

See PR_ISOL_CFG_SET description for meaning of fields.

**PR_ISOL_CFG_SET**:

Set task isolation configuration. The general format is:

prctl(PR_ISOL_CFG_SET, what, arg3, arg4, arg5);

The 'what' argument specifies what to configure. Possible values
are:

* "I_CFG_FEAT":

Set configuration of task isolation features. 'arg3' specifies
the feature. Possible values are:

* "ISOL_F_QUIESCE":

If arg4 is QUIESCE_CONTROL, set the control structure for
quiescing of background kernel activities, from the
location pointed to by "(int *)arg5":

struct task_isol_quiesce_control {
	__u64 flags;
	__u64 quiesce_mask;
	__u64 quiesce_oneshot_mask;
	__u64 pad[5];
};

Where:

*flags*: Additional flags (should be zero).

*quiesce_mask*: A bitmask containing which kernel
activities to quiesce.

*quiesce_oneshot_mask*: A bitmask indicating which kernel
activities should behave in oneshot mode, that is,
quiescing will happen on return from
prctl(PR_ISOL_ACTIVATE_SET), but not on return of
subsequent system calls. The corresponding bit(s) must
also be set at quiesce_mask.

*pad*: Additional space for future enhancements.

For quiesce_mask (and quiesce_oneshot_mask), possible bit
sets are:

* "ISOL_F_QUIESCE_VMSTATS"

VM statistics are maintained in per-CPU counters to
improve performance. When a CPU modifies a VM statistic,
this modification is kept in the per-CPU counter. Certain
activities require a global count, which involves
requesting each CPU to flush its local counters to the
global VM counters.

This flush is implemented via a workqueue item, which
might schedule work on isolated CPUs.

To avoid this interruption, task isolation can be
configured to synchronize the per-CPU counters to the
global counters upon return from system calls.

* "I_CFG_INHERIT":

Set inheritance configuration when a new task is created via
fork and clone.

The "(int *)arg4" argument is a pointer to:

struct task_isol_inherit_control {
	__u8 inherit_mask;
	__u8 pad[7];
};

inherit_mask is a bitmask that specifies which part of task
isolation should be inherited:

* Bit ISOL_INHERIT_CONF: Inherit task isolation
configuration. This is the state written via
prctl(PR_ISOL_CFG_SET, ...).

* Bit ISOL_INHERIT_ACTIVE: Inherit task isolation activation
(requires ISOL_INHERIT_CONF to be set). The new task should
behave, after fork/clone, in the same manner as the parent
task after it executed:

prctl(PR_ISOL_ACTIVATE_SET, &mask, ...);

Note: the inheritance propagates to all the descendants and
not just the immediate children, unless the inheritance is
explicitly reconfigured by some children.
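
Putting the PR_ISOL_CFG_SET pieces above together, a minimal
configuration sketch (illustrative only; it assumes the definitions from
this series are available via patched headers, error handling is
abbreviated, and ISOL_INHERIT_* are assumed to be mask bits, as the
"Bit ..." wording above suggests):

#include <string.h>
#include <sys/prctl.h>
#include <linux/types.h>

/* Illustrative helper: configure (but do not yet activate) vmstat
 * quiescing, and request that configuration and activation state be
 * inherited across fork/clone. */
static int configure_vmstat_quiesce(void)
{
	struct task_isol_quiesce_control qctrl;
	struct task_isol_inherit_control ictrl;

	memset(&qctrl, 0, sizeof(qctrl));
	/* Sync per-CPU vmstat counters on return to userspace. */
	qctrl.quiesce_mask = ISOL_F_QUIESCE_VMSTATS;

	if (prctl(PR_ISOL_CFG_SET, I_CFG_FEAT, ISOL_F_QUIESCE,
		  QUIESCE_CONTROL, &qctrl) == -1)
		return -1;

	memset(&ictrl, 0, sizeof(ictrl));
	/* Assumption: ISOL_INHERIT_* are mask bits in inherit_mask. */
	ictrl.inherit_mask = ISOL_INHERIT_CONF | ISOL_INHERIT_ACTIVE;

	return prctl(PR_ISOL_CFG_SET, I_CFG_INHERIT, 0, &ictrl, 0);
}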

**PR_ISOL_ACTIVATE_GET**:

Retrieve task isolation activation state.

The general format is:

prctl(PR_ISOL_ACTIVATE_GET, pmask, arg3, arg4, arg5);

'pmask' specifies the location of a feature mask, where the current
active mask will be copied. See PR_ISOL_ACTIVATE_SET for
description of individual bits.

**PR_ISOL_ACTIVATE_SET**:

Set task isolation activation state (activates/deactivates task
isolation).

The general format is:

prctl(PR_ISOL_ACTIVATE_SET, pmask, arg3, arg4, arg5);

The 'pmask' argument specifies the location of an 8 byte mask
containing which features should be activated. Features whose bits
are cleared will be deactivated. The possible bits for this mask
are:

* "ISOL_F_QUIESCE":

Activate quiescing of background kernel activities. Quiescing
happens on return to userspace from this system call, and on
return from subsequent system calls (unless quiesce_oneshot_mask
has been set at PR_ISOL_CFG_SET time).

Quiescing can be adjusted (while active) by
prctl(PR_ISOL_ACTIVATE_SET, &new_mask, ...).
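
For illustration, a minimal sketch that activates quiescing before a
latency sensitive loop and deactivates it afterwards (again assuming the
series' definitions are present; the helper name is illustrative):

#include <sys/prctl.h>

/* Illustrative helper: activate (on != 0) or deactivate (on == 0)
 * quiescing for this task. Features whose bits are cleared in the mask
 * are deactivated. */
static int set_quiesce_active(int on)
{
	unsigned long long mask = on ? ISOL_F_QUIESCE : 0;

	return prctl(PR_ISOL_ACTIVATE_SET, &mask, 0, 0, 0);
}

With the default (non-oneshot) configuration, quiescing is repeated on
return from any system call made while isolation is active.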


Userspace support
*****************

Task isolation is divided into two main steps: configuration and
activation.

Each step can be performed by an external tool or the latency
sensitive application itself. util-linux contains the "chisol" tool
for this purpose.

This results in three combinations:

1. Both configuration and activation performed by the latency
sensitive application. Allows fine grained control of what task
isolation features are enabled and when (see samples section below).

2. Only activation performed by the latency sensitive app (and
configuration performed by chisol; see the example invocation after
this list). This allows the admin/user to control task isolation
parameters, and applications have to be modified only once.

3. Configuration and activation performed by an external tool. This
allows unmodified applications to take advantage of task isolation.
Activation is performed by the "-a" option of chisol.
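
As an example of combination 2, the invocation used later in this thread
runs oslat (from the patched rt-tests) under chisol (from the patched
util-linux); chisol configures vmstat quiescing before exec'ing the
application, and oslat then activates the pre-configured isolation
itself:

	chisol -q vmstat_sync -I conf oslat -c ...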


Examples
********

The "samples/task_isolation/" directory contains 3 examples:

* task_isol_userloop.c:

Example of a program that runs a simple loop in userspace with task
isolation active.

* task_isol_computation.c:

Example of a program that enters task isolated mode, performs an
amount of computation, exits task isolated mode, and writes the
result of the computation to disk.

* task_isol_oneshot.c:

Example of a program that enables one-shot mode for quiescing,
enters a processing loop, and then, upon an external event, performs
a number of syscalls to handle that event.

This is a snippet of code to activate task isolation if it has been
previously configured (by chisol for example):

#include <sys/prctl.h>
#include <linux/types.h>

#ifdef PR_ISOL_CFG_GET
	unsigned long long fmask;

	ret = prctl(PR_ISOL_CFG_GET, I_CFG_FEAT, 0, &fmask, 0);
	if (ret != -1 && fmask != 0) {
		ret = prctl(PR_ISOL_ACTIVATE_SET, &fmask, 0, 0, 0);
		if (ret == -1) {
			perror("prctl PR_ISOL_ACTIVATE_SET");
			return ret;
		}
	}
#endif





2022-02-21 03:19:20

by Oscar Shiang

Subject: Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync

Hi Marcelo,

I tried to apply your patches to kernel v5.15.18-rt28 and measured
the latencies through oslat [1].

It turns out that the peak latency (around 100us) can drop to about 90us.
The result is impressive since I only changed the guest's kernel
instead of installing the patched kernel to both host and guest.

However, I am still curious about:
1) Why did I see a bigger maximum latency in almost every one of the
results after applying the task isolation patches? Or does it come from
something else?
2) Why did we only get a 10us improvement from quiescing vmstat?

[1]: The result and the test scripts I used can be found at
https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat

Thanks,
Oscar

2022-02-24 00:44:09

by Marcelo Tosatti

Subject: Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync

Hi Oscar,

On Sat, Feb 19, 2022 at 04:02:10PM +0800, Oscar Shiang wrote:
> Hi Marcelo,
>
> I tried to apply your patches to kernel v5.15.18-rt28 and measured
> the latencies through oslat [1].
>
> It turns out that the peak latency (around 100us) can drop to about 90us.
> The result is impressive since I only changed the guest's kernel
> instead of installing the patched kernel to both host and guest.
>
> However, I am still curious about:
> 1) Why did I catch a bigger maximum latency in almost each of the
> results of applying task isolation patches? Or does it come from
> other reasons?

There are a number of things that need to be done in order to have a
"well enough" isolated CPU so you can measure latency reliably:

* Boot a kernel with isolated CPU (or better, use realtime-virtual-host profile of
https://github.com/redhat-performance/tuned.git, which does a bunch of
other things to avoid interruptions to isolated CPUs).
* Apply the userspace patches at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
to util-linux and rt-tests.

Run oslat with chisol:

chisol -q vmstat_sync -I conf oslat -c ...

Where chisol is from patched util-linux and oslat from patched rt-tests.

If you had "-f 1" (FIFO priority) on oslat, then the vmstat work would be hung.

Are you doing those things?

> 2) Why did we only get a 10us improvement on quiescing vmstat?

If you did not have FIFO priority on oslat, then other daemons
could be interrupting it, so better make sure the 10us improvement
you see is due to vmstat_flush workqueue work not executing anymore.

The testcase I use is:

Stock kernel:

terminal 1:
# oslat -f 1 -c X ...

terminal 2:
# echo 1 > /proc/sys/vm/stat_refresh
(hang)

Patched kernel:

terminal 1:
# chisol -q vmstat_sync -I conf oslat -f 1 -c X ...

terminal 2:
# echo 1 > /proc/sys/vm/stat_refresh
#

> [1]: The result and the test scripts I used can be found at
> https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat

OK, you seem to be doing everything necessary for chisol
to work. Does /proc/pid/task_isolation of the oslat worker thread
(note it's not the same pid as the main oslat thread) show "vmstat"
configured and activated for quiesce?

However, 100us is really high. You should be able to get < 10us with
realtime-virtual-host (I see 4us on an idle system).

The answer might be: because 10us is what it takes to execute
vmstat_worker on the isolated CPU (you can verify with tracepoints).

That time depends on the number of per-CPU vmstat variables that need flushing,
I suppose...


2022-03-08 08:51:19

by Oscar Shiang

Subject: Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync

Hi Marcelo,

I also tried to enable task isolation on arm64 with the following changes.

Maybe you can consider these in the next version :)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 6623c99f0984..c1257bca1763 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -72,6 +72,7 @@ int arch_dup_task_struct(struct task_struct *dst,
#define TIF_SYSCALL_TRACEPOINT 10 /* syscall tracepoint for ftrace */
#define TIF_SECCOMP 11 /* syscall secure computing */
#define TIF_SYSCALL_EMU 12 /* syscall emulation active */
+#define TIF_TASK_ISOL 13 /* task isolation work pending */
#define TIF_MEMDIE 18 /* is terminating due to OOM killer */
#define TIF_FREEZE 19
#define TIF_RESTORE_SIGMASK 20
@@ -85,6 +86,7 @@ int arch_dup_task_struct(struct task_struct *dst,
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
+#define _TIF_TASK_ISOL (1 << TIF_TASK_ISOL)
#define _TIF_FOREIGN_FPSTATE (1 << TIF_FOREIGN_FPSTATE)
#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index c1257bca1763..c136850d623c 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -103,7 +103,7 @@ int arch_dup_task_struct(struct task_struct *dst,
#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
_TIF_UPROBE | _TIF_MTE_ASYNC_FAULT | \
- _TIF_NOTIFY_SIGNAL)
+ _TIF_NOTIFY_SIGNAL | _TIF_TASK_ISOL)

#define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
_TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index c287b9407f28..8308f6dc5d4b 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
#include <linux/tracehook.h>
#include <linux/ratelimit.h>
#include <linux/syscalls.h>
+#include <linux/task_isolation.h>

#include <asm/daifflags.h>
#include <asm/debug-monitors.h>
@@ -945,6 +946,9 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)

if (thread_flags & _TIF_FOREIGN_FPSTATE)
fpsimd_restore_current_state();
+
+ if (thread_flags & _TIF_TASK_ISOL)
+ task_isol_exit_to_user_mode();
}

local_daif_mask();

Thanks,
Oscar

2022-03-08 23:19:02

by Marcelo Tosatti

Subject: Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync

On Tue, Mar 08, 2022 at 02:32:46PM +0800, Oscar Shiang wrote:
> On Feb 24, 2022, at 1:31 AM, Marcelo Tosatti <[email protected]> wrote:
> > Hi Oscar,
> >
> > On Sat, Feb 19, 2022 at 04:02:10PM +0800, Oscar Shiang wrote:
> > > Hi Marcelo,
> > >
> > > I tried to apply your patches to kernel v5.15.18-rt28 and measured
> > > the latencies through oslat [1].
> > >
> > > It turns out that the peak latency (around 100us) can drop to about 90us.
> > > The result is impressive since I only changed the guest's kernel
> > > instead of installing the patched kernel to both host and guest.
> > >
> > > However, I am still curious about:
> > > 1) Why did I catch a bigger maximum latency in almost each of the
> > > results of applying task isolation patches? Or does it come from
> > > other reasons?
> >
> > There are a number of things that need to be done in order to have an
> > "well enough" isolated CPU so you can measure latency reliably:
> >
> > * Boot a kernel with isolated CPU (or better, use realtime-virtual-host profile of
> > https://github.com/redhat-performance/tuned.git, which does a bunch of
> > other things to avoid interruptions to isolated CPUs).
> > * Apply the userspace patches at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
> > to util-linux and rt-tests.
> >
> > Run oslat with chisol:
> >
> > chisol -q vmstat_sync -I conf oslat -c ...
> >
> > Where chisol is from patched util-linux and oslat from patched rt-tests.
> >
> > If you had "-f 1" (FIFO priority), on oslat, then the vmstat work would be hung.
> >
> > Are you doing those things?
> >
> > > 2) Why did we only get a 10us improvement on quiescing vmstat?
> >
> > If you did not have FIFO priority on oslat, then other daemons
> > could be interrupting it, so better make sure the 10us improvement
> > you see is due to vmstat_flush workqueue work not executing anymore.
> >
> > The testcase i use is:
> >
> > Stock kernel:
> >
> > terminal 1:
> > # oslat -f 1 -c X ...
> >
> > terminal 2:
> > # echo 1 > /proc/sys/vm/stat_refresh
> > (hang)
> >
> > Patched kernel:
> >
> > terminal 1:
> > # chisol -q vmstat_sync -I conf oslat -f 1 -c X ...
> >
> > terminal 2:
> > # echo 1 > /proc/sys/vm/stat_refresh
> > #
>
> Sure, I did see the terminal hung during oslat with FIFO priority.
>
> BTW, thanks for providing this test case. I used to run all workload stuff to just
> verify the improvement of task isolation. It is a more straightr- forward way to do.
>
> > > [1]: The result and the test scripts I used can be found at
> > > https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat
> >
> > OK, you seem to be doing everything necessary for chisol
> > to work. Does /proc/pid/task_isolation of the oslat worker thread
> > (note its not the same pid as the main oslat thread) show "vmstat"
> > configured and activated for quiesce?
>
> The status of task_isolation seems to be set properly with "vmstat" and activated
>
> > However 100us is really high. You should be able to get < 10us with
> > realtime-virtual-host (i see 4us on an idle system).
> >
> > The answer might be: because 10us is what it takes to execute
> > vmstat_worker on the isolated CPU (you can verify with tracepoints).
> >
> > That time depends on the number of per-CPU vmstat variables that need flushing,
> > i suppose...
>
> Considering the interferences outside of the KVM, I have redone the measurements
> directly on my x86_64 computer [1].
>
> As result, most of the latencies are down to 60us (and below). There are still
> some latencies larger than 80us, I am working on and trying to figure out the reason.
>
> [1]: https://gist.github.com/OscarShiang/202eb691e649557fe3eaa5ec67a5aa82

Oscar,

Did you confirm with hwlatdetect that the BIOS does not have long
running SMIs?

Also, for the software part, you could save time by using the
realtime-virtual-host profile (check /usr/lib/tuned/realtime-virtual-host/
to see what it's doing in addition to isolcpus=).

2022-03-09 01:13:05

by Oscar Shiang

Subject: Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync

On Feb 24, 2022, at 1:31 AM, Marcelo Tosatti <[email protected]> wrote:
> Hi Oscar,
>
> On Sat, Feb 19, 2022 at 04:02:10PM +0800, Oscar Shiang wrote:
> > Hi Marcelo,
> >
> > I tried to apply your patches to kernel v5.15.18-rt28 and measured
> > the latencies through oslat [1].
> >
> > It turns out that the peak latency (around 100us) can drop to about 90us.
> > The result is impressive since I only changed the guest's kernel
> > instead of installing the patched kernel to both host and guest.
> >
> > However, I am still curious about:
> > 1) Why did I catch a bigger maximum latency in almost each of the
> > results of applying task isolation patches? Or does it come from
> > other reasons?
>
> There are a number of things that need to be done in order to have an
> "well enough" isolated CPU so you can measure latency reliably:
>
> * Boot a kernel with isolated CPU (or better, use realtime-virtual-host profile of
> https://github.com/redhat-performance/tuned.git, which does a bunch of
> other things to avoid interruptions to isolated CPUs).
> * Apply the userspace patches at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
> to util-linux and rt-tests.
>
> Run oslat with chisol:
>
> chisol -q vmstat_sync -I conf oslat -c ...
>
> Where chisol is from patched util-linux and oslat from patched rt-tests.
>
> If you had "-f 1" (FIFO priority), on oslat, then the vmstat work would be hung.
>
> Are you doing those things?
>
> > 2) Why did we only get a 10us improvement on quiescing vmstat?
>
> If you did not have FIFO priority on oslat, then other daemons
> could be interrupting it, so better make sure the 10us improvement
> you see is due to vmstat_flush workqueue work not executing anymore.
>
> The testcase i use is:
>
> Stock kernel:
>
> terminal 1:
> # oslat -f 1 -c X ...
>
> terminal 2:
> # echo 1 > /proc/sys/vm/stat_refresh
> (hang)
>
> Patched kernel:
>
> terminal 1:
> # chisol -q vmstat_sync -I conf oslat -f 1 -c X ...
>
> terminal 2:
> # echo 1 > /proc/sys/vm/stat_refresh
> #

Sure, I did see the terminal hang during oslat with FIFO priority.

BTW, thanks for providing this test case. I used to run full workloads just to
verify the improvement of task isolation. This is a more straightforward way to do it.

> > [1]: The result and the test scripts I used can be found at
> > https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat
>
> OK, you seem to be doing everything necessary for chisol
> to work. Does /proc/pid/task_isolation of the oslat worker thread
> (note its not the same pid as the main oslat thread) show "vmstat"
> configured and activated for quiesce?

The status of task_isolation seems to be set properly, with "vmstat" configured and activated.

> However 100us is really high. You should be able to get < 10us with
> realtime-virtual-host (i see 4us on an idle system).
>
> The answer might be: because 10us is what it takes to execute
> vmstat_worker on the isolated CPU (you can verify with tracepoints).
>
> That time depends on the number of per-CPU vmstat variables that need flushing,
> i suppose...

To rule out interference from outside of the KVM guest, I have redone the measurements
directly on my x86_64 computer [1].

As a result, most of the latencies are down to 60us (and below). There are still
some latencies larger than 80us, which I am working on trying to figure out.

[1]: https://gist.github.com/OscarShiang/202eb691e649557fe3eaa5ec67a5aa82

Thanks,
Oscar

2022-03-09 16:24:43

by Oscar Shiang

Subject: Re: [patch v11 00/13] extensible prctl task isolation interface and vmstat sync

On Mar 8, 2022, at 9:12 PM, Marcelo Tosatti <[email protected]> wrote:
> On Tue, Mar 08, 2022 at 02:32:46PM +0800, Oscar Shiang wrote:
> > On Feb 24, 2022, at 1:31 AM, Marcelo Tosatti <[email protected]> wrote:
> > > Hi Oscar,
> > >
> > > On Sat, Feb 19, 2022 at 04:02:10PM +0800, Oscar Shiang wrote:
> > > > Hi Marcelo,
> > > >
> > > > I tried to apply your patches to kernel v5.15.18-rt28 and measured
> > > > the latencies through oslat [1].
> > > >
> > > > It turns out that the peak latency (around 100us) can drop to about 90us.
> > > > The result is impressive since I only changed the guest's kernel
> > > > instead of installing the patched kernel to both host and guest.
> > > >
> > > > However, I am still curious about:
> > > > 1) Why did I catch a bigger maximum latency in almost each of the
> > > > results of applying task isolation patches? Or does it come from
> > > > other reasons?
> > >
> > > There are a number of things that need to be done in order to have an
> > > "well enough" isolated CPU so you can measure latency reliably:
> > >
> > > * Boot a kernel with isolated CPU (or better, use realtime-virtual-host profile of
> > > https://github.com/redhat-performance/tuned.git, which does a bunch of
> > > other things to avoid interruptions to isolated CPUs).
> > > * Apply the userspace patches at https://people.redhat.com/~mtosatti/task-isol-v6-userspace-patches/
> > > to util-linux and rt-tests.
> > >
> > > Run oslat with chisol:
> > >
> > > chisol -q vmstat_sync -I conf oslat -c ...
> > >
> > > Where chisol is from patched util-linux and oslat from patched rt-tests.
> > >
> > > If you had "-f 1" (FIFO priority), on oslat, then the vmstat work would be hung.
> > >
> > > Are you doing those things?
> > >
> > > > 2) Why did we only get a 10us improvement on quiescing vmstat?
> > >
> > > If you did not have FIFO priority on oslat, then other daemons
> > > could be interrupting it, so better make sure the 10us improvement
> > > you see is due to vmstat_flush workqueue work not executing anymore.
> > >
> > > The testcase i use is:
> > >
> > > Stock kernel:
> > >
> > > terminal 1:
> > > # oslat -f 1 -c X ...
> > >
> > > terminal 2:
> > > # echo 1 > /proc/sys/vm/stat_refresh
> > > (hang)
> > >
> > > Patched kernel:
> > >
> > > terminal 1:
> > > # chisol -q vmstat_sync -I conf oslat -f 1 -c X ...
> > >
> > > terminal 2:
> > > # echo 1 > /proc/sys/vm/stat_refresh
> > > #
> >
> > Sure, I did see the terminal hung during oslat with FIFO priority.
> >
> > BTW, thanks for providing this test case. I used to run all workload stuff to just
> > verify the improvement of task isolation. It is a more straightr- forward way to do.
> >
> > > > [1]: The result and the test scripts I used can be found at
> > > > https://gist.github.com/OscarShiang/8b530a00f472fd1c39f5979ee601516d#testing-task-isolation-via-oslat
> > >
> > > OK, you seem to be doing everything necessary for chisol
> > > to work. Does /proc/pid/task_isolation of the oslat worker thread
> > > (note its not the same pid as the main oslat thread) show "vmstat"
> > > configured and activated for quiesce?
> >
> > The status of task_isolation seems to be set properly with "vmstat" and activated
> >
> > > However 100us is really high. You should be able to get < 10us with
> > > realtime-virtual-host (i see 4us on an idle system).
> > >
> > > The answer might be: because 10us is what it takes to execute
> > > vmstat_worker on the isolated CPU (you can verify with tracepoints).
> > >
> > > That time depends on the number of per-CPU vmstat variables that need flushing,
> > > i suppose...
> >
> > Considering the interferences outside of the KVM, I have redone the measurements
> > directly on my x86_64 computer [1].
> >
> > As result, most of the latencies are down to 60us (and below). There are still
> > some latencies larger than 80us, I am working on and trying to figure out the reason.
> >
> > [1]: https://gist.github.com/OscarShiang/202eb691e649557fe3eaa5ec67a5aa82
>
> Oscar,
>
> Did you confirm with hwlatdetect that the BIOS does not have long
> running SMIs?

Marcelo,

I have run hwlatdetect and the result is shown below:

hwlatdetect: test duration 900 seconds
detector: tracer
parameters:
Latency threshold: 0us
Sample window: 1000000us
Sample width: 500000us
Non-sampling period: 500000us
Output File: test.report

Starting test
test finished
Max Latency: 48us
Samples recorded: 37
Samples exceeding threshold: 37
SMIs during run: 0
report saved to test.report (37 samples)
ts: 1646837340.346151918, inner:5, outer:9
ts: 1646837351.550312752, inner:46, outer:45
ts: 1646837381.549331178, inner:45, outer:0
ts: 1646837400.008623200, inner:0, outer:9
ts: 1646837425.578093371, inner:0, outer:8
ts: 1646837429.587363003, inner:0, outer:1
ts: 1646837436.549050243, inner:45, outer:45
ts: 1646837580.005173999, inner:0, outer:9
ts: 1646837605.591017161, inner:7, outer:8
ts: 1646837635.552410329, inner:0, outer:1
ts: 1646837639.246489489, inner:9, outer:5
ts: 1646837645.426611917, inner:9, outer:5
ts: 1646837651.550721975, inner:0, outer:1
ts: 1646837728.549928137, inner:40, outer:47
ts: 1646837756.281606376, inner:1, outer:8
ts: 1646837757.492693661, inner:0, outer:1
ts: 1646837759.355807689, inner:13, outer:13
ts: 1646837761.590570928, inner:0, outer:1
ts: 1646837762.382475433, inner:5, outer:9
ts: 1646837764.185172836, inner:0, outer:9
ts: 1646837768.675668348, inner:1, outer:0
ts: 1646837776.485184319, inner:0, outer:1
ts: 1646837777.544517878, inner:45, outer:41
ts: 1646837855.549140154, inner:45, outer:0
ts: 1646837886.492523509, inner:1, outer:0
ts: 1646837897.544172247, inner:45, outer:0
ts: 1646837933.550015925, inner:48, outer:48
ts: 1646837981.546557947, inner:45, outer:45
ts: 1646837998.051615598, inner:0, outer:1
ts: 1646838030.550099263, inner:45, outer:0
ts: 1646838072.549664559, inner:0, outer:46
ts: 1646838106.571038324, inner:0, outer:1
ts: 1646838107.547228682, inner:1, outer:0
ts: 1646838114.549686904, inner:45, outer:45
ts: 1646838162.549477219, inner:46, outer:47
ts: 1646838180.014465679, inner:5, outer:9
ts: 1646838237.486873064, inner:0, outer:1

It seems that there is no SMI occurring (but some other latencies around 40us?)

> Also, for the software part, you could save time by using the
> realtime-virtual-host profile (check /usr/lib/tuned/realtime-virtual-host/
> to see what its doing in addition to isolcpus=).

Yes, I have switched to the realtime-virtual-host profile with its kernel cmdline args.