2023-03-20 18:21:34

by Marcelo Tosatti

[permalink] [raw]
Subject: [PATCH v7 00/13] fold per-CPU vmstats remotely

This patch series addresses the following two problems:

1. A customer provided evidence indicating that a process
was stalled in direct reclaim:

- The process was trapped in throttle_direct_reclaim().
The function wait_event_killable() was called to wait condition
allow_direct_reclaim(pgdat) for current node to be true.
The allow_direct_reclaim(pgdat) examined the number of free pages
on the node by zone_page_state() which just returns value in
zone->vm_stat[NR_FREE_PAGES].

- On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
However, the freelist on this node was not empty.

- This inconsistent of vmstat value was caused by percpu vmstat on
nohz_full cpus. Every increment/decrement of vmstat is performed
on percpu vmstat counter at first, then pooled diffs are cumulated
to the zone's vmstat counter in timely manner. However, on nohz_full
cpus (in case of this customer's system, 48 of 52 cpus) these pooled
diffs were not cumulated once the cpu had no event on it so that
the cpu started sleeping infinitely.
I checked percpu vmstat and found there were total 69 counts not
cumulated to the zone's vmstat counter yet.

- In this situation, kswapd did not help the trapped process.
In pgdat_balanced(), zone_wakermark_ok_safe() examined the number
of free pages on the node by zone_page_state_snapshot() which
checks pending counts on percpu vmstat.
Therefore kswapd could know there were 69 free pages correctly.
Since zone->_watermark = {8, 20, 32}, kswapd did not work because
69 was greater than 32 as high watermark.

2. With a task that busy loops on a given CPU,
the kworker interruption to execute vmstat_update
is undesired and may exceed latency thresholds
for certain applications.

By having vmstat_shepherd flush the per-CPU counters to the
global counters from remote CPUs.

This is done using cmpxchg to manipulate the counters,
both CPU locally (via the account functions),
and remotely (via cpu_vm_stats_fold).

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.


Performance details for the kworker interruption:

oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.

v7:
- Fix allow_direct_reclaim issue by using
zone_page_state_snapshot (Michal Hocko)

v6:
- Add more information on throttle_direct_reclaim problem
to commit logs (Michal Hocko)

v5:
- Drop "mm/vmstat: remove remote node draining" (Vlastimil Babka)
- Implement remote node draining for cpu_vm_stats_fold (Vlastimil Babka)

v4:
- Switch per-CPU vmstat counters to s32, required
by RISC-V, ARC architectures

v3:
- Removed unused drain_zone_pages and changes variable (David Hildenbrand)
- Use xchg instead of cmpxchg in refresh_cpu_vm_stats (Peter Xu)
- Add drain_all_pages to vmstat_refresh to make
stats more accurate (Peter Xu)
- Improve changelog of
"mm/vmstat: switch counter modification to cmpxchg" (Peter Xu / David)
- Improve changelog of
"mm/vmstat: remove remote node draining" (David Hildenbrand)


v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
(Christoph Lameter)
- use try_cmpxchg for cmpxchg loops
(Uros Bizjak / Matthew Wilcox)

arch/arm64/include/asm/percpu.h | 16 ++
arch/loongarch/include/asm/percpu.h | 23 +++-
arch/s390/include/asm/percpu.h | 5
arch/x86/include/asm/percpu.h | 39 +++---
include/asm-generic/percpu.h | 17 ++
include/linux/mmzone.h | 8 -
include/linux/percpu-defs.h | 2
include/linux/vmstat.h | 2
kernel/fork.c | 2
kernel/scs.c | 2
mm/page_alloc.c | 5
mm/vmscan.c | 2
mm/vmstat.c | 440 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------
13 files changed, 361 insertions(+), 202 deletions(-)




2023-03-20 18:35:29

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> This patch series addresses the following two problems:
>
> 1. A customer provided evidence indicating that a process
> was stalled in direct reclaim:
>
This is addressed by the trivial patch 1.

[...]
> 2. With a task that busy loops on a given CPU,
> the kworker interruption to execute vmstat_update
> is undesired and may exceed latency thresholds
> for certain applications.

Yes it can but why does that matter?

> By having vmstat_shepherd flush the per-CPU counters to the
> global counters from remote CPUs.
>
> This is done using cmpxchg to manipulate the counters,
> both CPU locally (via the account functions),
> and remotely (via cpu_vm_stats_fold).
>
> Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> the initial patch series.
>
>
> Performance details for the kworker interruption:
>
> oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
>
> The example above shows an additional 7us for the
>
> oslat -> kworker -> oslat
>
> switches. In the case of a virtualized CPU, and the vmstat_update
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable
> latency threshold for certain applications.

I do not think we have ever promissed any specific latency guarantees
for vmstat. These are statistics have been mostly used for debugging
purposes AFAIK. I am not aware of any specific user space use case that
would be latency sensitive. Your changelog doesn't go into details there
either.

[...]
> mm/vmstat.c | 440 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------

This requires much more detailed story why we really need that.
--
Michal Hocko
SUSE Labs

2023-03-21 12:58:54

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > This patch series addresses the following two problems:
> >
> > 1. A customer provided evidence indicating that a process
> > was stalled in direct reclaim:
> >
> This is addressed by the trivial patch 1.
>
> [...]
> > 2. With a task that busy loops on a given CPU,
> > the kworker interruption to execute vmstat_update
> > is undesired and may exceed latency thresholds
> > for certain applications.
>
> Yes it can but why does that matter?

It matters for the application that is executing and expects
not to be interrupted.

> > By having vmstat_shepherd flush the per-CPU counters to the
> > global counters from remote CPUs.
> >
> > This is done using cmpxchg to manipulate the counters,
> > both CPU locally (via the account functions),
> > and remotely (via cpu_vm_stats_fold).
> >
> > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > the initial patch series.
> >
> >
> > Performance details for the kworker interruption:
> >
> > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> >
> > The example above shows an additional 7us for the
> >
> > oslat -> kworker -> oslat
> >
> > switches. In the case of a virtualized CPU, and the vmstat_update
> > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > observed in the guest is higher than 50us, violating the acceptable
> > latency threshold for certain applications.
>
> I do not think we have ever promissed any specific latency guarantees
> for vmstat. These are statistics have been mostly used for debugging
> purposes AFAIK. I am not aware of any specific user space use case that
> would be latency sensitive. Your changelog doesn't go into details there
> either.

There is a class of workloads for which response time can be
of interest. MAC scheduler is an example:

https://par.nsf.gov/servlets/purl/10090368

Thanks!


2023-03-22 10:21:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote:
> On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > > This patch series addresses the following two problems:
> > >
> > > 1. A customer provided evidence indicating that a process
> > > was stalled in direct reclaim:
> > >
> > This is addressed by the trivial patch 1.
> >
> > [...]
> > > 2. With a task that busy loops on a given CPU,
> > > the kworker interruption to execute vmstat_update
> > > is undesired and may exceed latency thresholds
> > > for certain applications.
> >
> > Yes it can but why does that matter?
>
> It matters for the application that is executing and expects
> not to be interrupted.

Those workloads shouldn't enter the kernel in the first place, no?
Otherwise the in kernel execution with all the direct or indirect
dependencies (e.g. via locks) can throw any latency expectations off the
window.

> > > By having vmstat_shepherd flush the per-CPU counters to the
> > > global counters from remote CPUs.
> > >
> > > This is done using cmpxchg to manipulate the counters,
> > > both CPU locally (via the account functions),
> > > and remotely (via cpu_vm_stats_fold).
> > >
> > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > > the initial patch series.
> > >
> > >
> > > Performance details for the kworker interruption:
> > >
> > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > >
> > > The example above shows an additional 7us for the
> > >
> > > oslat -> kworker -> oslat
> > >
> > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > observed in the guest is higher than 50us, violating the acceptable
> > > latency threshold for certain applications.
> >
> > I do not think we have ever promissed any specific latency guarantees
> > for vmstat. These are statistics have been mostly used for debugging
> > purposes AFAIK. I am not aware of any specific user space use case that
> > would be latency sensitive. Your changelog doesn't go into details there
> > either.
>
> There is a class of workloads for which response time can be
> of interest. MAC scheduler is an example:
>
> https://par.nsf.gov/servlets/purl/10090368

Yes, I am not disputing low latency workloads in general. I am just
saying that you haven't really established a very sound justification
here. Of course there are workloads which do not want to conflict with
any in kernel house keeping. Those have to be configured and implemented
very carefully though. Vmstat as such should not collide with those
workloads as long as they do not interact with the kernel in a way
counters are updated. Is this hard or impossible to avoid?

I can imagine that those workloads have an start up sequence where the
kernel is involved and counters updated so that deferred flushing could
interfere with the later and latency sensitive phase. Is that a real
problem in practice? Please tell us much more why we need to make the
vmstat code more complex.

Thanks!
--
Michal Hocko
SUSE Labs

2023-03-22 13:12:31

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote:
> On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote:
> > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > > > This patch series addresses the following two problems:
> > > >
> > > > 1. A customer provided evidence indicating that a process
> > > > was stalled in direct reclaim:
> > > >
> > > This is addressed by the trivial patch 1.
> > >
> > > [...]
> > > > 2. With a task that busy loops on a given CPU,
> > > > the kworker interruption to execute vmstat_update
> > > > is undesired and may exceed latency thresholds
> > > > for certain applications.
> > >
> > > Yes it can but why does that matter?
> >
> > It matters for the application that is executing and expects
> > not to be interrupted.
>
> Those workloads shouldn't enter the kernel in the first place, no?

It depends on the latency requirements and individual system calls.

> Otherwise the in kernel execution with all the direct or indirect
> dependencies (e.g. via locks) can throw any latency expectations off the
> window.
>
> > > > By having vmstat_shepherd flush the per-CPU counters to the
> > > > global counters from remote CPUs.
> > > >
> > > > This is done using cmpxchg to manipulate the counters,
> > > > both CPU locally (via the account functions),
> > > > and remotely (via cpu_vm_stats_fold).
> > > >
> > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > > > the initial patch series.
> > > >
> > > >
> > > > Performance details for the kworker interruption:
> > > >
> > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > >
> > > > The example above shows an additional 7us for the
> > > >
> > > > oslat -> kworker -> oslat
> > > >
> > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > observed in the guest is higher than 50us, violating the acceptable
> > > > latency threshold for certain applications.
> > >
> > > I do not think we have ever promissed any specific latency guarantees
> > > for vmstat. These are statistics have been mostly used for debugging
> > > purposes AFAIK. I am not aware of any specific user space use case that
> > > would be latency sensitive. Your changelog doesn't go into details there
> > > either.
> >
> > There is a class of workloads for which response time can be
> > of interest. MAC scheduler is an example:
> >
> > https://par.nsf.gov/servlets/purl/10090368
>
> Yes, I am not disputing low latency workloads in general. I am just
> saying that you haven't really established a very sound justification
> here.

The -v7 cover letter was updated with additional details,
as you requested (perhaps you missed it):

"Performance details for the kworker interruption:

oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications."

> Of course there are workloads which do not want to conflict with
> any in kernel house keeping. Those have to be configured and implemented
> very carefully though. Vmstat as such should not collide with those
> workloads as long as they do not interact with the kernel in a way
> counters are updated. Is this hard or impossible to avoid?

The practical problem we have been seeing is -RT app initialization.
For example:

1) mlock();
2) enter loop without system calls

> I can imagine that those workloads have an start up sequence where the
> kernel is involved and counters updated so that deferred flushing could
> interfere with the later and latency sensitive phase. Is that a real
> problem in practice? Please tell us much more why we need to make the
> vmstat code more complex.

Yes, it is. I have attached traces and performance numbers above.

2023-03-22 13:36:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed 22-03-23 08:23:21, Marcelo Tosatti wrote:
> On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote:
> > On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote:
> > > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> > > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > > > > This patch series addresses the following two problems:
> > > > >
> > > > > 1. A customer provided evidence indicating that a process
> > > > > was stalled in direct reclaim:
> > > > >
> > > > This is addressed by the trivial patch 1.
> > > >
> > > > [...]
> > > > > 2. With a task that busy loops on a given CPU,
> > > > > the kworker interruption to execute vmstat_update
> > > > > is undesired and may exceed latency thresholds
> > > > > for certain applications.
> > > >
> > > > Yes it can but why does that matter?
> > >
> > > It matters for the application that is executing and expects
> > > not to be interrupted.
> >
> > Those workloads shouldn't enter the kernel in the first place, no?
>
> It depends on the latency requirements and individual system calls.
>
> > Otherwise the in kernel execution with all the direct or indirect
> > dependencies (e.g. via locks) can throw any latency expectations off the
> > window.
> >
> > > > > By having vmstat_shepherd flush the per-CPU counters to the
> > > > > global counters from remote CPUs.
> > > > >
> > > > > This is done using cmpxchg to manipulate the counters,
> > > > > both CPU locally (via the account functions),
> > > > > and remotely (via cpu_vm_stats_fold).
> > > > >
> > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > > > > the initial patch series.
> > > > >
> > > > >
> > > > > Performance details for the kworker interruption:
> > > > >
> > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > >
> > > > > The example above shows an additional 7us for the
> > > > >
> > > > > oslat -> kworker -> oslat
> > > > >
> > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > latency threshold for certain applications.
> > > >
> > > > I do not think we have ever promissed any specific latency guarantees
> > > > for vmstat. These are statistics have been mostly used for debugging
> > > > purposes AFAIK. I am not aware of any specific user space use case that
> > > > would be latency sensitive. Your changelog doesn't go into details there
> > > > either.
> > >
> > > There is a class of workloads for which response time can be
> > > of interest. MAC scheduler is an example:
> > >
> > > https://par.nsf.gov/servlets/purl/10090368
> >
> > Yes, I am not disputing low latency workloads in general. I am just
> > saying that you haven't really established a very sound justification
> > here.
>
> The -v7 cover letter was updated with additional details,
> as you requested (perhaps you missed it):
>
> "Performance details for the kworker interruption:
>
> oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
>
> The example above shows an additional 7us for the
>
> oslat -> kworker -> oslat
>
> switches. In the case of a virtualized CPU, and the vmstat_update
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable
> latency threshold for certain applications."

Yes, I have seen that but it doesn't really give a wider context to
understand why those numbers matter.

> > Of course there are workloads which do not want to conflict with
> > any in kernel house keeping. Those have to be configured and implemented
> > very carefully though. Vmstat as such should not collide with those
> > workloads as long as they do not interact with the kernel in a way
> > counters are updated. Is this hard or impossible to avoid?
>
> The practical problem we have been seeing is -RT app initialization.
> For example:
>
> 1) mlock();
> 2) enter loop without system calls

OK, that is what I have kinda expected. Would have been better to
mention it explicitly.

I expect this to be a very common pattern and vmstat might not be the
only subsystem that could interfere later on. Would it make more sense
to address this by a more generic solution? E.g. a syscall to flush all
per-cpu caches so they won't interfere later unless userspace hits the
kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask, int flags)?
The above pattern could then be implemented as

do_initial_setup()
sched_setaffinity(getpid(), cpumask);
flush_cpu_caches(cpumask, 0);
do_userspace_loop()

--
Michal Hocko
SUSE Labs

2023-03-22 17:36:20

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> On Wed 22-03-23 08:23:21, Marcelo Tosatti wrote:
> > On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote:
> > > On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote:
> > > > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> > > > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > > > > > This patch series addresses the following two problems:
> > > > > >
> > > > > > 1. A customer provided evidence indicating that a process
> > > > > > was stalled in direct reclaim:
> > > > > >
> > > > > This is addressed by the trivial patch 1.
> > > > >
> > > > > [...]
> > > > > > 2. With a task that busy loops on a given CPU,
> > > > > > the kworker interruption to execute vmstat_update
> > > > > > is undesired and may exceed latency thresholds
> > > > > > for certain applications.
> > > > >
> > > > > Yes it can but why does that matter?
> > > >
> > > > It matters for the application that is executing and expects
> > > > not to be interrupted.
> > >
> > > Those workloads shouldn't enter the kernel in the first place, no?
> >
> > It depends on the latency requirements and individual system calls.
> >
> > > Otherwise the in kernel execution with all the direct or indirect
> > > dependencies (e.g. via locks) can throw any latency expectations off the
> > > window.
> > >
> > > > > > By having vmstat_shepherd flush the per-CPU counters to the
> > > > > > global counters from remote CPUs.
> > > > > >
> > > > > > This is done using cmpxchg to manipulate the counters,
> > > > > > both CPU locally (via the account functions),
> > > > > > and remotely (via cpu_vm_stats_fold).
> > > > > >
> > > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > > > > > the initial patch series.
> > > > > >
> > > > > >
> > > > > > Performance details for the kworker interruption:
> > > > > >
> > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > > >
> > > > > > The example above shows an additional 7us for the
> > > > > >
> > > > > > oslat -> kworker -> oslat
> > > > > >
> > > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > > latency threshold for certain applications.
> > > > >
> > > > > I do not think we have ever promissed any specific latency guarantees
> > > > > for vmstat. These are statistics have been mostly used for debugging
> > > > > purposes AFAIK. I am not aware of any specific user space use case that
> > > > > would be latency sensitive. Your changelog doesn't go into details there
> > > > > either.
> > > >
> > > > There is a class of workloads for which response time can be
> > > > of interest. MAC scheduler is an example:
> > > >
> > > > https://par.nsf.gov/servlets/purl/10090368
> > >
> > > Yes, I am not disputing low latency workloads in general. I am just
> > > saying that you haven't really established a very sound justification
> > > here.
> >
> > The -v7 cover letter was updated with additional details,
> > as you requested (perhaps you missed it):
> >
> > "Performance details for the kworker interruption:
> >
> > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> >
> > The example above shows an additional 7us for the
> >
> > oslat -> kworker -> oslat
> >
> > switches. In the case of a virtualized CPU, and the vmstat_update
> > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > observed in the guest is higher than 50us, violating the acceptable
> > latency threshold for certain applications."
>
> Yes, I have seen that but it doesn't really give a wider context to
> understand why those numbers matter.

OK.

"In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
interruption observed in a guest (which is above the safety
threshold for this application)."

Is that OK?


> > > Of course there are workloads which do not want to conflict with
> > > any in kernel house keeping. Those have to be configured and implemented
> > > very carefully though. Vmstat as such should not collide with those
> > > workloads as long as they do not interact with the kernel in a way
> > > counters are updated. Is this hard or impossible to avoid?
> >
> > The practical problem we have been seeing is -RT app initialization.
> > For example:
> >
> > 1) mlock();
> > 2) enter loop without system calls
>
> OK, that is what I have kinda expected. Would have been better to
> mention it explicitly.
>
> I expect this to be a very common pattern and vmstat might not be the
> only subsystem that could interfere later on. Would it make more sense
> to address this by a more generic solution? E.g. a syscall to flush all
> per-cpu caches so they won't interfere later unless userspace hits the
> kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask, int flags)?
> The above pattern could then be implemented as
>
> do_initial_setup()
> sched_setaffinity(getpid(), cpumask);
> flush_cpu_caches(cpumask, 0);
> do_userspace_loop()

I would argue that fixing this without introducing a userspace tunable
is more generic as all programs (modified to use a syscall or not)
benefit from the improvement. HPC workloads, for example.

But it might be necessary to do what you suggest for other
reasons (where you'd want a behaviour to be enabled which
is undesired for other application types).

2023-03-23 08:13:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
[...]
> > > "Performance details for the kworker interruption:
> > >
> > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > >
> > > The example above shows an additional 7us for the
> > >
> > > oslat -> kworker -> oslat
> > >
> > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > observed in the guest is higher than 50us, violating the acceptable
> > > latency threshold for certain applications."
> >
> > Yes, I have seen that but it doesn't really give a wider context to
> > understand why those numbers matter.
>
> OK.
>
> "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> interruption observed in a guest (which is above the safety
> threshold for this application)."
>
> Is that OK?

This might be a sufficient information for somebody familiar with the
matter (not me). So no, not enough. We need to hear a more complete
story.

--
Michal Hocko
SUSE Labs

2023-03-23 11:08:20

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> [...]
> > > > "Performance details for the kworker interruption:
> > > >
> > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > >
> > > > The example above shows an additional 7us for the
> > > >
> > > > oslat -> kworker -> oslat
> > > >
> > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > observed in the guest is higher than 50us, violating the acceptable
> > > > latency threshold for certain applications."
> > >
> > > Yes, I have seen that but it doesn't really give a wider context to
> > > understand why those numbers matter.
> >
> > OK.
> >
> > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > interruption observed in a guest (which is above the safety
> > threshold for this application)."
> >
> > Is that OK?
>
> This might be a sufficient information for somebody familiar with the
> matter (not me). So no, not enough. We need to hear a more complete
> story.

Michal,

Please refer to
https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf

2.3 Channel Dependent Scheduling
The purpose of scheduling is to decide which terminal will transmit data on which set
of resource blocks with what transport format to use. The objective is to assign
resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
same length of Transmission Time Interval (TTI) in LTE system.

In general:

https://en.wikipedia.org/wiki/Real-time_computing

Real-time computing (RTC) is the computer science term for hardware and
software systems subject to a "real-time constraint", for example from
event to system response.[1] Real-time programs must guarantee response
within specified time constraints, often referred to as "deadlines".[2]

Real-time responses are often understood to be in the order of
milliseconds, and sometimes microseconds. A system not specified as
operating in real time cannot usually guarantee a response within any
timeframe, although typical or expected response times may be given.
Real-time processing fails if not completed within a specified deadline
relative to an event; deadlines must always be met, regardless of system
load.

For example, for the MAC scheduler processing must occur every 1ms,
and a certain amount of computation takes place (and must finish before
the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
is considered a "failure".


2023-03-23 11:08:54

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu, Mar 23, 2023 at 07:52:22AM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> > [...]
> > > > > "Performance details for the kworker interruption:
> > > > >
> > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > >
> > > > > The example above shows an additional 7us for the
> > > > >
> > > > > oslat -> kworker -> oslat
> > > > >
> > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > latency threshold for certain applications."
> > > >
> > > > Yes, I have seen that but it doesn't really give a wider context to
> > > > understand why those numbers matter.
> > >
> > > OK.
> > >
> > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > > interruption observed in a guest (which is above the safety
> > > threshold for this application)."
> > >
> > > Is that OK?
> >
> > This might be a sufficient information for somebody familiar with the
> > matter (not me). So no, not enough. We need to hear a more complete
> > story.
>
> Michal,
>
> Please refer to
> https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf
>
> 2.3 Channel Dependent Scheduling
> The purpose of scheduling is to decide which terminal will transmit data on which set
> of resource blocks with what transport format to use. The objective is to assign
> resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
> Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
> same length of Transmission Time Interval (TTI) in LTE system.
>
> In general:
>
> https://en.wikipedia.org/wiki/Real-time_computing
>
> Real-time computing (RTC) is the computer science term for hardware and
> software systems subject to a "real-time constraint", for example from
> event to system response.[1] Real-time programs must guarantee response
> within specified time constraints, often referred to as "deadlines".[2]
>
> Real-time responses are often understood to be in the order of
> milliseconds, and sometimes microseconds. A system not specified as
> operating in real time cannot usually guarantee a response within any
> timeframe, although typical or expected response times may be given.
> Real-time processing fails if not completed within a specified deadline
> relative to an event; deadlines must always be met, regardless of system
> load.
>
> For example, for the MAC scheduler processing must occur every 1ms,
> and a certain amount of computation takes place (and must finish before
> the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> is considered a "failure".

If you need more detail, will have to ask someone else, because that is
all I know.

2023-03-23 12:19:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu 23-03-23 07:52:22, Marcelo Tosatti wrote:
> On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> > [...]
> > > > > "Performance details for the kworker interruption:
> > > > >
> > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > >
> > > > > The example above shows an additional 7us for the
> > > > >
> > > > > oslat -> kworker -> oslat
> > > > >
> > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > latency threshold for certain applications."
> > > >
> > > > Yes, I have seen that but it doesn't really give a wider context to
> > > > understand why those numbers matter.
> > >
> > > OK.
> > >
> > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > > interruption observed in a guest (which is above the safety
> > > threshold for this application)."
> > >
> > > Is that OK?
> >
> > This might be a sufficient information for somebody familiar with the
> > matter (not me). So no, not enough. We need to hear a more complete
> > story.
>
> Michal,
>
> Please refer to
> https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf
>
> 2.3 Channel Dependent Scheduling
> The purpose of scheduling is to decide which terminal will transmit data on which set
> of resource blocks with what transport format to use. The objective is to assign
> resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
> Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
> same length of Transmission Time Interval (TTI) in LTE system.
>
> In general:
>
> https://en.wikipedia.org/wiki/Real-time_computing

Thank you, but not something I was really asking for (repeatedly). I am
pretty aware of what RT computing is about. I am not really interested
in a generic fluff. I am asking about specific usecases you have in mind
when pushing these changes.

> For example, for the MAC scheduler processing must occur every 1ms,
> and a certain amount of computation takes place (and must finish before
> the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> is considered a "failure".

OK, you are claiming that much but you are not really filling up other
holes in your story. Let me just outline few questions I have. Your
measurements talk about 7us overhead the vmstat processing might add.
This is really far from > 50us above. You suggest that this is an effect
of the workload running in a guest without more details. I am quite
surprised to hear about RT expectations inside a guest system TBH.

All that being said, it would be really helpful if you were more
specific about the workload and why there is no other way but making
vmstat infrastructure more complex (it is quite complex on its own).

Thanks!

--
Michal Hocko
SUSE Labs

2023-03-23 13:43:56

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu, Mar 23, 2023 at 01:17:32PM +0100, Michal Hocko wrote:
> On Thu 23-03-23 07:52:22, Marcelo Tosatti wrote:
> > On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> > > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > > "Performance details for the kworker interruption:
> > > > > >
> > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > > >
> > > > > > The example above shows an additional 7us for the
> > > > > >
> > > > > > oslat -> kworker -> oslat
> > > > > >
> > > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > > latency threshold for certain applications."
> > > > >
> > > > > Yes, I have seen that but it doesn't really give a wider context to
> > > > > understand why those numbers matter.
> > > >
> > > > OK.
> > > >
> > > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > > > interruption observed in a guest (which is above the safety
> > > > threshold for this application)."
> > > >
> > > > Is that OK?
> > >
> > > This might be a sufficient information for somebody familiar with the
> > > matter (not me). So no, not enough. We need to hear a more complete
> > > story.
> >
> > Michal,
> >
> > Please refer to
> > https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf
> >
> > 2.3 Channel Dependent Scheduling
> > The purpose of scheduling is to decide which terminal will transmit data on which set
> > of resource blocks with what transport format to use. The objective is to assign
> > resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
> > Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
> > same length of Transmission Time Interval (TTI) in LTE system.
> >
> > In general:
> >
> > https://en.wikipedia.org/wiki/Real-time_computing
>
> Thank you, but not something I was really asking for (repeatedly). I am
> pretty aware of what RT computing is about. I am not really interested
> in a generic fluff. I am asking about specific usecases you have in mind
> when pushing these changes.
>
> > For example, for the MAC scheduler processing must occur every 1ms,
> > and a certain amount of computation takes place (and must finish before
> > the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> > is considered a "failure".
>
> OK, you are claiming that much but you are not really filling up other
> holes in your story. Let me just outline few questions I have. Your
> measurements talk about 7us overhead the vmstat processing might add.
> This is really far from > 50us above.

7us in the host, for the following sched_switch events:

oslat -> kworker
kworker -> oslat

However, if the impact is for a virtualized application:

oslat, executing via qemu-vcpu process in the host.

oslat executing
qemu-vcpu VM-EXIT
qemu-vcpu -> kworker
kworker -> qemu-vcpu
qemu-vcpu VM-ENTRY

is much higher than the 7us (can be above 100us).

> You suggest that this is an effect
> of the workload running in a guest without more details. I am quite
> surprised to hear about RT expectations inside a guest system TBH.

https://www.youtube.com/watch?v=zIDwc6uDszY

> All that being said, it would be really helpful if you were more
> specific about the workload and why there is no other way but making
> vmstat infrastructure more complex (it is quite complex on its own).

The patchset is just changing vmstat_shepherd from happening locally
to happening remotely. There are a number of algorithms in the kernel
that deal with concurrent access already.

What you think this particular patchset makes things complicated
and what can be done to make it simpler?



2023-03-23 13:46:03

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu, Mar 23, 2023 at 10:30:13AM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 23, 2023 at 01:17:32PM +0100, Michal Hocko wrote:
> > On Thu 23-03-23 07:52:22, Marcelo Tosatti wrote:
> > > On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> > > > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > > > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> > > > [...]
> > > > > > > "Performance details for the kworker interruption:
> > > > > > >
> > > > > > > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > > > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > > > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > > > >
> > > > > > > The example above shows an additional 7us for the
> > > > > > >
> > > > > > > oslat -> kworker -> oslat
> > > > > > >
> > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > > > latency threshold for certain applications."
> > > > > >
> > > > > > Yes, I have seen that but it doesn't really give a wider context to
> > > > > > understand why those numbers matter.
> > > > >
> > > > > OK.
> > > > >
> > > > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > > > > interruption observed in a guest (which is above the safety
> > > > > threshold for this application)."
> > > > >
> > > > > Is that OK?
> > > >
> > > > This might be a sufficient information for somebody familiar with the
> > > > matter (not me). So no, not enough. We need to hear a more complete
> > > > story.
> > >
> > > Michal,
> > >
> > > Please refer to
> > > https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf
> > >
> > > 2.3 Channel Dependent Scheduling
> > > The purpose of scheduling is to decide which terminal will transmit data on which set
> > > of resource blocks with what transport format to use. The objective is to assign
> > > resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
> > > Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
> > > same length of Transmission Time Interval (TTI) in LTE system.
> > >
> > > In general:
> > >
> > > https://en.wikipedia.org/wiki/Real-time_computing
> >
> > Thank you, but not something I was really asking for (repeatedly). I am
> > pretty aware of what RT computing is about. I am not really interested
> > in a generic fluff. I am asking about specific usecases you have in mind
> > when pushing these changes.
> >
> > > For example, for the MAC scheduler processing must occur every 1ms,
> > > and a certain amount of computation takes place (and must finish before
> > > the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> > > is considered a "failure".
> >
> > OK, you are claiming that much but you are not really filling up other
> > holes in your story. Let me just outline few questions I have. Your
> > measurements talk about 7us overhead the vmstat processing might add.
> > This is really far from > 50us above.
>
> 7us in the host, for the following sched_switch events:
>
> oslat -> kworker
> kworker -> oslat
>
> However, if the impact is for a virtualized application:
>
> oslat, executing via qemu-vcpu process in the host.
>
> oslat executing
> qemu-vcpu VM-EXIT
> qemu-vcpu -> kworker
> kworker -> qemu-vcpu
> qemu-vcpu VM-ENTRY
>
> is much higher than the 7us (can be above 100us).

And nothing prevents this from happening:

oslat executing
qemu-vcpu VM-EXIT
qemu-vcpu -> kworker (in the host, to handle vmstat_update)
kworker -> qemu-vcpu
qemu-vcpu VM-ENTRY
oslat -> kworker (in the guest, to handle vmstat_update)
kworker -> oslat



2023-04-18 22:16:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <[email protected]> wrote:

> This patch series addresses the following two problems:
>
> 1. A customer provided evidence indicating that a process
> was stalled in direct reclaim:
>
> ...
>
> 2. With a task that busy loops on a given CPU,
> the kworker interruption to execute vmstat_update
> is undesired and may exceed latency thresholds
> for certain applications.
>

I don't think I'll be sending this upstream in the next merge window.
Because it isn't clear that the added complexity in vmstat handling is
justified.

- Michal's request for more clarity on the end-user requirements
seems reasonable.

- You have indicated that additional changelog material is forthcoming.

- The alternative idea of adding a syscall which tells the kernel
"I'm about to go realtime, so please clear away all the pending crap
which might later interrupt me" sounds pretty good.

Partly because there are surely other places where we can use this.

Partly because it moves all the crap-clearing into special
crap-clearing code paths while adding less burden to the
commonly-executed code.

And I don't think this alternative has been fully investigated and
discussed.

2023-04-19 11:25:42

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> On Tue, Apr 18, 2023 at 03:02:00PM -0700, Andrew Morton wrote:
> > On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <[email protected]> wrote:
> >
> > > This patch series addresses the following two problems:
> > >
> > > 1. A customer provided evidence indicating that a process
> > > was stalled in direct reclaim:
> > >
> > > ...
> > >
> > > 2. With a task that busy loops on a given CPU,
> > > the kworker interruption to execute vmstat_update
> > > is undesired and may exceed latency thresholds
> > > for certain applications.
> > >
> >
> > I don't think I'll be sending this upstream in the next merge window.
> > Because it isn't clear that the added complexity in vmstat handling is
> > justified.
>
> From my POV this is an incorrect statement (that the complexity in
> vmstat handling is not justified).
>
> Andrew, this is the 3rd attempt to fix this problem:
>
> First try: https://lore.kernel.org/lkml/[email protected]/
>
> Second try: https://patchew.org/linux/[email protected]/
>
> Third try: syncing vmstats remotely from vmstat_shepherd (this
> patchset).
>
> And also, can you please explain: what is so complicated about the
> vmstat handling? cmpxchg has been around and is used all over the
> kernel, and nobody considers "excessively complicated".
>
> > - Michal's request for more clarity on the end-user requirements
> > seems reasonable.
>
> And i explained to Michal in great detail where the end-user
> requirements come from. For virtualized workloads, there are two
> types of use-cases:
>
> 1) For example, for the MAC scheduler processing must occur every 1ms,
> and a certain amount of computation takes place (and must finish before
> the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> is considered a "failure".
>
> I showed him a 7us trace caused by, and explained that will extend to >
> 50us in the case of virtualized vCPU.
>
> 2) PLCs. These workloads will also suffer > 50us latency spikes
> which is undesirable.
>
> Can you please explain what additional clarity is required?
>
> RH's performance team, for example, has been performing packet
> latency tests and waiting for this issue to be fixed for about 2
> years now.
>
> Andrew Theurer, can you please explain what problem is the vmstat_work
> interruption causing in your testing?

+CC Andrew.

2023-04-19 11:26:29

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Tue, Apr 18, 2023 at 03:02:00PM -0700, Andrew Morton wrote:
> On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <[email protected]> wrote:
>
> > This patch series addresses the following two problems:
> >
> > 1. A customer provided evidence indicating that a process
> > was stalled in direct reclaim:
> >
> > ...
> >
> > 2. With a task that busy loops on a given CPU,
> > the kworker interruption to execute vmstat_update
> > is undesired and may exceed latency thresholds
> > for certain applications.
> >
>
> I don't think I'll be sending this upstream in the next merge window.
> Because it isn't clear that the added complexity in vmstat handling is
> justified.

From my POV this is an incorrect statement (that the complexity in
vmstat handling is not justified).

Andrew, this is the 3rd attempt to fix this problem:

First try: https://lore.kernel.org/lkml/[email protected]/

Second try: https://patchew.org/linux/[email protected]/

Third try: syncing vmstats remotely from vmstat_shepherd (this
patchset).

And also, can you please explain: what is so complicated about the
vmstat handling? cmpxchg has been around and is used all over the
kernel, and nobody considers "excessively complicated".

> - Michal's request for more clarity on the end-user requirements
> seems reasonable.

And i explained to Michal in great detail where the end-user
requirements come from. For virtualized workloads, there are two
types of use-cases:

1) For example, for the MAC scheduler processing must occur every 1ms,
and a certain amount of computation takes place (and must finish before
the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
is considered a "failure".

I showed him a 7us trace caused by, and explained that will extend to >
50us in the case of virtualized vCPU.

2) PLCs. These workloads will also suffer > 50us latency spikes
which is undesirable.

Can you please explain what additional clarity is required?

RH's performance team, for example, has been performing packet
latency tests and waiting for this issue to be fixed for about 2
years now.

Andrew Theurer, can you please explain what problem is the vmstat_work
interruption causing in your testing?

> - You have indicated that additional changelog material is forthcoming.

Not really.

Do you think additional information on the changelog is necessary?

> - The alternative idea of adding a syscall which tells the kernel
> "I'm about to go realtime, so please clear away all the pending crap
> which might later interrupt me" sounds pretty good.
>
> Partly because there are surely other places where we can use this.
>
> Partly because it moves all the crap-clearing into special
> crap-clearing code paths while adding less burden to the
> commonly-executed code.
>
> And I don't think this alternative has been fully investigated and
> discussed.

This was tried before:
https://lore.kernel.org/lkml/[email protected]/

My conclusion from that discussion (and work) is that a special system
call:

1) Does not allow the benefits to be widely applied (only modified
applications will benefit). Is not portable across different operating systems.

Removing the vmstat_work interruption is a benefit for HPC workloads,
for example (in fact, it is a benefit for any kind of application,
since the interruption causes cache misses).

2) Increases the system call cost for applications which would use
the interface.

So avoiding the vmstat_update update interruption, without userspace
knowledge and modifications, is a better than solution than a modified
userspace.





2023-04-19 11:40:06

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> This was tried before:
> https://lore.kernel.org/lkml/[email protected]/
>
> My conclusion from that discussion (and work) is that a special system
> call:
>
> 1) Does not allow the benefits to be widely applied (only modified
> applications will benefit). Is not portable across different operating systems.
>
> Removing the vmstat_work interruption is a benefit for HPC workloads,
> for example (in fact, it is a benefit for any kind of application,
> since the interruption causes cache misses).
>
> 2) Increases the system call cost for applications which would use
> the interface.
>
> So avoiding the vmstat_update update interruption, without userspace
> knowledge and modifications, is a better than solution than a modified
> userspace.

Another important point is this: if an application dirties
its own per-CPU vmstat cache, while performing a system call,
and a vmstat sync event is triggered on a different CPU, you'd have to:

1) Wait for that CPU to return to userspace and sync its stats
(unfeasible).

2) Queue work to execute on that CPU (undesirable, as that causes
an interruption).

3) Remotely sync the vmstat for that CPU.


2023-04-19 12:03:39

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Apr 19, 2023 at 08:29:47AM -0300, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> > This was tried before:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > My conclusion from that discussion (and work) is that a special system
> > call:
> >
> > 1) Does not allow the benefits to be widely applied (only modified
> > applications will benefit). Is not portable across different operating systems.
> >
> > Removing the vmstat_work interruption is a benefit for HPC workloads,
> > for example (in fact, it is a benefit for any kind of application,
> > since the interruption causes cache misses).
> >
> > 2) Increases the system call cost for applications which would use
> > the interface.
> >
> > So avoiding the vmstat_update update interruption, without userspace
> > knowledge and modifications, is a better than solution than a modified
> > userspace.
>
> Another important point is this: if an application dirties
> its own per-CPU vmstat cache, while performing a system call,

Or while handling a VM-exit from a vCPU.

This are, in my mind, sufficient reasons to discard the "flush per-cpu
caches" idea. This is also why i chose to abandon the prctrl interface
patchset.

> and a vmstat sync event is triggered on a different CPU, you'd have to:
>
> 1) Wait for that CPU to return to userspace and sync its stats
> (unfeasible).
>
> 2) Queue work to execute on that CPU (undesirable, as that causes
> an interruption).
>
> 3) Remotely sync the vmstat for that CPU.


2023-04-19 12:30:07

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

Le Wed, Apr 19, 2023 at 08:59:28AM -0300, Marcelo Tosatti a ?crit :
> On Wed, Apr 19, 2023 at 08:29:47AM -0300, Marcelo Tosatti wrote:
> > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> > > This was tried before:
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > My conclusion from that discussion (and work) is that a special system
> > > call:
> > >
> > > 1) Does not allow the benefits to be widely applied (only modified
> > > applications will benefit). Is not portable across different operating systems.
> > >
> > > Removing the vmstat_work interruption is a benefit for HPC workloads,
> > > for example (in fact, it is a benefit for any kind of application,
> > > since the interruption causes cache misses).
> > >
> > > 2) Increases the system call cost for applications which would use
> > > the interface.
> > >
> > > So avoiding the vmstat_update update interruption, without userspace
> > > knowledge and modifications, is a better than solution than a modified
> > > userspace.
> >
> > Another important point is this: if an application dirties
> > its own per-CPU vmstat cache, while performing a system call,
>
> Or while handling a VM-exit from a vCPU.
>
> This are, in my mind, sufficient reasons to discard the "flush per-cpu
> caches" idea. This is also why i chose to abandon the prctrl interface
> patchset.

If you're running your isolated workloads on guests, which sounds quite
challenging but I guess you guys managed, I'd expect that VMEXITs are
absolutely out of question while the task runs critical code, so I'm not
sure why you would care. I guess not only your guests but also your hosts
run nohz_full, right?

I can't tell if the prctl solution which quiesces everything is the solution
for you, I don't know well enough your workloads, but I would expect that
the pattern is as follows:

1) Arrange for full isolation (no more interrupts/exceptions/VMEXITs)
2) Run critical code
3) Optionally do something once you're done

If vmstat is going to be the only thing to wait for on 1), then the remote
solution looks good enough (although I leave that to -mm guys as I'm too
clueless about those matters), if there is more to be expected, I guess the
quiescing prctl (or whatever syscall) is something to consider.

Thanks.

2023-04-19 13:54:38

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely



> On Apr 19, 2023, at 6:15 AM, Marcelo Tosatti <[email protected]> wrote:
>
> On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
>> On Tue, Apr 18, 2023 at 03:02:00PM -0700, Andrew Morton wrote:
>>> On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <[email protected]> wrote:
>>>
>>>> This patch series addresses the following two problems:
>>>>
>>>> 1. A customer provided evidence indicating that a process
>>>> was stalled in direct reclaim:
>>>>
>>>> ...
>>>>
>>>> 2. With a task that busy loops on a given CPU,
>>>> the kworker interruption to execute vmstat_update
>>>> is undesired and may exceed latency thresholds
>>>> for certain applications.
>>>>
>>>
>>> I don't think I'll be sending this upstream in the next merge window.
>>> Because it isn't clear that the added complexity in vmstat handling is
>>> justified.
>>
>> From my POV this is an incorrect statement (that the complexity in
>> vmstat handling is not justified).
>>
>> Andrew, this is the 3rd attempt to fix this problem:
>>
>> First try: https://lore.kernel.org/lkml/[email protected]/
>>
>> Second try: https://patchew.org/linux/[email protected]/
>>
>> Third try: syncing vmstats remotely from vmstat_shepherd (this
>> patchset).
>>
>> And also, can you please explain: what is so complicated about the
>> vmstat handling? cmpxchg has been around and is used all over the
>> kernel, and nobody considers "excessively complicated".
>>
>>> - Michal's request for more clarity on the end-user requirements
>>> seems reasonable.
>>
>> And i explained to Michal in great detail where the end-user
>> requirements come from. For virtualized workloads, there are two
>> types of use-cases:
>>
>> 1) For example, for the MAC scheduler processing must occur every 1ms,
>> and a certain amount of computation takes place (and must finish before
>> the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
>> is considered a "failure".
>>
>> I showed him a 7us trace caused by, and explained that will extend to >
>> 50us in the case of virtualized vCPU.
>>
>> 2) PLCs. These workloads will also suffer > 50us latency spikes
>> which is undesirable.
>>
>> Can you please explain what additional clarity is required?
>>
>> RH's performance team, for example, has been performing packet
>> latency tests and waiting for this issue to be fixed for about 2
>> years now.
>>
>> Andrew Theurer, can you please explain what problem is the vmstat_work
>> interruption causing in your testing?
>
> +CC Andrew.

Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency as measured by cyclictest & oslat. We cannot achieve under 20 usec with the vmstats interruption.

-Andrew

2023-04-19 14:18:23

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> Le Wed, Apr 19, 2023 at 08:59:28AM -0300, Marcelo Tosatti a ?crit :
> > On Wed, Apr 19, 2023 at 08:29:47AM -0300, Marcelo Tosatti wrote:
> > > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> > > > This was tried before:
> > > > https://lore.kernel.org/lkml/[email protected]/
> > > >
> > > > My conclusion from that discussion (and work) is that a special system
> > > > call:
> > > >
> > > > 1) Does not allow the benefits to be widely applied (only modified
> > > > applications will benefit). Is not portable across different operating systems.
> > > >
> > > > Removing the vmstat_work interruption is a benefit for HPC workloads,
> > > > for example (in fact, it is a benefit for any kind of application,
> > > > since the interruption causes cache misses).
> > > >
> > > > 2) Increases the system call cost for applications which would use
> > > > the interface.
> > > >
> > > > So avoiding the vmstat_update update interruption, without userspace
> > > > knowledge and modifications, is a better than solution than a modified
> > > > userspace.
> > >
> > > Another important point is this: if an application dirties
> > > its own per-CPU vmstat cache, while performing a system call,
> >
> > Or while handling a VM-exit from a vCPU.
> >
> > This are, in my mind, sufficient reasons to discard the "flush per-cpu
> > caches" idea. This is also why i chose to abandon the prctrl interface
> > patchset.
>
> If you're running your isolated workloads on guests, which sounds quite
> challenging but I guess you guys managed, I'd expect that VMEXITs are
> absolutely out of question while the task runs critical code, so I'm not
> sure why you would care. I guess not only your guests but also your hosts
> run nohz_full, right?

The answer is: there are VM-exits. For example to write MSRs to program
LAPIC timer.

Yes both host and guest are nohz_full (but for example, cyclictest
or a PLC program can call nanosleep in the guest which translate to
MSR writes to program LAPIC timer which is a VM-exit).

> I can't tell if the prctl solution which quiesces everything is the solution
> for you, I don't know well enough your workloads, but I would expect that
> the pattern is as follows:
>
> 1) Arrange for full isolation (no more interrupts/exceptions/VMEXITs)

Yes, this in the general scheme. Full isolation is automated by
tuned (realtime-virtual-host/realtime-virtual-guest profiles).

There are VM-exits in our use-case.
There might be use-cases where interrupts are desired.

For more details:
https://www.youtube.com/watch?v=SyhfctYqjc8

> 2) Run critical code
> 3) Optionally do something once you're done
>
> If vmstat is going to be the only thing to wait for on 1), then the remote
> solution looks good enough (although I leave that to -mm guys as I'm too
> clueless about those matters),

I am mostly clueless too, but i don't see a problem with the proposed
patch (and no one has pointed any problem either).

> if there is more to be expected, I guess the
> quiescing prctl (or whatever syscall) is something to consider.
>
> Thanks.

I don't know of anything else to consider ATM, and for all cases we have
analyzed so far there has always been the possibility to do the work remotely,
via RCU or some other locking scheme, rather than requiring the application
to be modified (which decreases the number of userspace applications that
can benefit).




2023-04-19 14:39:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
[...]
> > 2) Run critical code
> > 3) Optionally do something once you're done
> >
> > If vmstat is going to be the only thing to wait for on 1), then the remote
> > solution looks good enough (although I leave that to -mm guys as I'm too
> > clueless about those matters),
>
> I am mostly clueless too, but i don't see a problem with the proposed
> patch (and no one has pointed any problem either).

I really hate to repeat myself again. The biggest pushback has been on
a) justification and b) single purpose solution which is very likely
incomplete. For a) we are getting the story piece by piece which doesn't
speed up the process. You are proposing a non-trivial change to an
already convoluted code so having a solid justification is something
that shouldn't be all that surprising.

b) is what concerns me more though. There are other per-cpu specific
things going on that require some regular flushing. Just to mention
another one that your group has been brought up was the memcg pcp
caches. Again with a non-trivial proposal to deal with that problem
[1]. It has turned out that we can do a simpler thing [2]. I do not
think it is a stretch to expect that similar things will pop out every
now and then and rather than dealing with each one in its own way it
kinda makes sense to come up with a more general concept so that all
those cases can be handled at a single place at least. All I hear about
that is that the code of those special applications would need to be
changed to use that. Well, true but is that bar so impractical that we
are going to grow kernel complexity and therefore a maintenance burden?
Everything for a very specialized workloads?

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2023-04-19 16:58:12

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On 4/19/23 13:29, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
>> This was tried before:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> My conclusion from that discussion (and work) is that a special system
>> call:
>>
>> 1) Does not allow the benefits to be widely applied (only modified
>> applications will benefit). Is not portable across different operating systems.
>>
>> Removing the vmstat_work interruption is a benefit for HPC workloads,
>> for example (in fact, it is a benefit for any kind of application,
>> since the interruption causes cache misses).
>>
>> 2) Increases the system call cost for applications which would use
>> the interface.
>>
>> So avoiding the vmstat_update update interruption, without userspace
>> knowledge and modifications, is a better than solution than a modified
>> userspace.
>
> Another important point is this: if an application dirties
> its own per-CPU vmstat cache, while performing a system call,
> and a vmstat sync event is triggered on a different CPU, you'd have to:
>
> 1) Wait for that CPU to return to userspace and sync its stats
> (unfeasible).
>
> 2) Queue work to execute on that CPU (undesirable, as that causes
> an interruption).

So you're saying the application might do a syscall from the isolcpu, so
IIUC it cannot expect any latency guarantees at that very moment, but then
it immediately starts expecting them again after returning to userspace, and
a single interruption for a one-time flush after the syscall would be too
intrusive?

(elsewhere in the thread you described an RT app initialization that may
generate vmstats to flush and then entry userspace loop, again, would a
single interruption soon after entering the loop be so critical?)

> 3) Remotely sync the vmstat for that CPU.
>
>
>

2023-04-20 02:44:54

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Apr 19, 2023 at 06:47:30PM +0200, Vlastimil Babka wrote:
> On 4/19/23 13:29, Marcelo Tosatti wrote:
> > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> >> This was tried before:
> >> https://lore.kernel.org/lkml/[email protected]/
> >>
> >> My conclusion from that discussion (and work) is that a special system
> >> call:
> >>
> >> 1) Does not allow the benefits to be widely applied (only modified
> >> applications will benefit). Is not portable across different operating systems.
> >>
> >> Removing the vmstat_work interruption is a benefit for HPC workloads,
> >> for example (in fact, it is a benefit for any kind of application,
> >> since the interruption causes cache misses).
> >>
> >> 2) Increases the system call cost for applications which would use
> >> the interface.
> >>
> >> So avoiding the vmstat_update update interruption, without userspace
> >> knowledge and modifications, is a better than solution than a modified
> >> userspace.
> >
> > Another important point is this: if an application dirties
> > its own per-CPU vmstat cache, while performing a system call,
> > and a vmstat sync event is triggered on a different CPU, you'd have to:
> >
> > 1) Wait for that CPU to return to userspace and sync its stats
> > (unfeasible).
> >
> > 2) Queue work to execute on that CPU (undesirable, as that causes
> > an interruption).
>
> So you're saying the application might do a syscall from the isolcpu, so
> IIUC it cannot expect any latency guarantees at that very moment,

Why not? cyclictest uses nanosleep and its the main tool for measuring
latency.

> but then
> it immediately starts expecting them again after returning to userspace,

No, the expectation more generally is this:

For certain types of applications (for example PLC software or
RAN processing), upon occurrence of an event, it is necessary to
complete a certain task in a maximum amount of time (deadline).

One way to express this requirement is with a pair of numbers,
deadline time and execution time, where:

* deadline time: length of time between event and deadline.
* execution time: length of time it takes for processing of event
to occur on a particular hardware platform
(uninterrupted).

The particular values depend on use-case. For the case
where the realtime application executes in a virtualized
guest, an interruption which must be serviced in the host will cause
the following sequence of events:

1) VM-exit
2) execution of IPI (and function call) (or switch to kwork
thread to execute some work item).
3) VM-entry

Which causes an excess of 50us latency as observed by cyclictest
(this violates the latency requirement of vRAN application with 1ms TTI,
for example).

> and
> a single interruption for a one-time flush after the syscall would be too
> intrusive?

Generally, if you can't complete the task (which involves executing a
number of instructions) before the deadline, then its a problem.

One-time flush? You mean to switch between:

rt-app -> kworker (to execute vmstat_update flush) -> rt-app

My measurement, which probably had vmstat_update code/data in cache, took 7us.
It might be the case that the code to execute must be brought in from
memory, which takes even longer.

> (elsewhere in the thread you described an RT app initialization that may
> generate vmstats to flush and then entry userspace loop, again, would a
> single interruption soon after entering the loop be so critical?)

1) It depends on the application. For the use-case above, where < 50us
interruption is desired, yes it is critical.

2) The interruptions can come from different sources.

Time
0 rt-app executing instruction 1
1 rt-app executing instruction 2
2 scheduler switches between rt-app and kworker
3 kworker runs vmstat_work
4 scheduler switches between kworker and rt-app
5 rt-app executing instruction 3
6 ipi to handle a KVM request IPI
7 fill in your preferred IPI handler

So the argument "a single interruption might not cause your deadline
to be exceeded" fails (because the time to handle the
different interruptions might sum).

Does that make sense?

> > 3) Remotely sync the vmstat for that CPU.
> >
> >
> >
>
>

2023-04-20 02:51:10

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

Hi Michal,

On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote:
> On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> [...]
> > > 2) Run critical code
> > > 3) Optionally do something once you're done
> > >
> > > If vmstat is going to be the only thing to wait for on 1), then the remote
> > > solution looks good enough (although I leave that to -mm guys as I'm too
> > > clueless about those matters),
> >
> > I am mostly clueless too, but i don't see a problem with the proposed
> > patch (and no one has pointed any problem either).
>
> I really hate to repeat myself again. The biggest pushback has been on
> a) justification and b) single purpose solution which is very likely
> incomplete. For a) we are getting the story piece by piece which doesn't
> speed up the process. You are proposing a non-trivial change to an
> already convoluted code so having a solid justification is something
> that shouldn't be all that surprising.

The justification is simple and concise:

2. With a task that busy loops on a given CPU,
the kworker interruption to execute vmstat_update
is undesired and may exceed latency thresholds
for certain applications.

Performance details for the kworker interruption:

oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.

---

An additional use-case is what has been noted by Andrew Theurer:

Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
as measured by cyclictest & oslat. We cannot achieve under 20 usec with
the vmstats interruption.

---

It seems to me this is solid justification (it seems you want
particular microsecond values, but those depend on application
and the CPU). The point is there are several applications which do not
want to be interrupted (we can ignore the specifics and focus on
that fact).

Moreover, unrelated interruptions might occur close in time
(for example, random function call IPIs generated by other
kernel subsystems), which renders the "lets just consider this
one application, running on this CPU, to decide what to do"
short sighted.

> b) is what concerns me more though. There are other per-cpu specific
> things going on that require some regular flushing. Just to mention
> another one that your group has been brought up was the memcg pcp
> caches. Again with a non-trivial proposal to deal with that problem
> [1].

Yes.

> It has turned out that we can do a simpler thing [2].

For the particular memcg case, there was a simpler fix.

For the vmstat_update case, i don't see a simpler fix.

> I do not think it is a stretch to expect that similar things will pop
> out every now and then

Agree.

> and rather than dealing with each one in its own way it
> kinda makes sense to come up with a more general concept so that all
> those cases can be handled at a single place at least.

I can understand where you are coming from. Unfortunately,
for some cases it is appropriate to perform the work from a
remote CPU (and i think this is one such case).

> All I hear about
> that is that the code of those special applications would need to be
> changed to use that.

This is a burden for application writers and for system configuration.

Or it could be done automatically (from outside of the application).
Which is what is described and implemented here:

https://lore.kernel.org/lkml/[email protected]/

"Task isolation is divided in two main steps: configuration and
activation.

Each step can be performed by an external tool or the latency
sensitive application itself. util-linux contains the "chisol" tool
for this purpose."

But not only that, the second thing is:

"> Another important point is this: if an application dirties
> its own per-CPU vmstat cache, while performing a system call,

Or while handling a VM-exit from a vCPU.

This are, in my mind, sufficient reasons to discard the "flush per-cpu
caches" idea. This is also why i chose to abandon the prctrl interface
patchset.

> and a vmstat sync event is triggered on a different CPU, you'd have to:
>
> 1) Wait for that CPU to return to userspace and sync its stats
> (unfeasible).
>
> 2) Queue work to execute on that CPU (undesirable, as that causes
> an interruption).
>
> 3) Remotely sync the vmstat for that CPU."

So the only option is to remotely sync vmstat for the CPU
(unless you have a better suggestion).

> Well, true but is that bar so impractical that we
> are going to grow kernel complexity and therefore a maintenance burden?

Honestly, this patchset is just using cmpxchg to transfer data from
per-CPU counters to global counters. I don't see why its that
complicated.

If you have a suggestion on how to reduce the apparent complexity,
that would be great.

> Everything for a very specialized workloads?

Well the kernel has been increasing in complexity, and the maintenance
burden has increased as a side-effect, to accomodate more workloads
than it was initially designed for.

> [1] http://lkml.kernel.org/r/[email protected]
> [2] http://lkml.kernel.org/r/[email protected]
> --
> Michal Hocko
> SUSE Labs
>
>

2023-04-20 07:57:08

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed 19-04-23 08:44:23, Andrew Theurer wrote:
> > On Apr 19, 2023, at 6:15 AM, Marcelo Tosatti <[email protected]> wrote:
> >> Andrew Theurer, can you please explain what problem is the vmstat_work
> >> interruption causing in your testing?
> >
> > +CC Andrew.
>
> Nearly every telco we work with for 5G RAN is demanding <20 usec CPU
> latency as measured by cyclictest & oslat. We cannot achieve under 20
> usec with the vmstats interruption.

Are you able to get those latency requirements with PREEMPT_RT?
--
Michal Hocko
SUSE Labs

2023-04-20 08:49:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed 19-04-23 13:35:12, Marcelo Tosatti wrote:
[...]
> This is a burden for application writers and for system configuration.

Yes. And I find it reasonable to expect that burden put there as there
are non-trivial requirements for those workloads anyway. It is not
out-of-the-box thing, right?

> Or it could be done automatically (from outside of the application).
> Which is what is described and implemented here:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> "Task isolation is divided in two main steps: configuration and
> activation.
>
> Each step can be performed by an external tool or the latency
> sensitive application itself. util-linux contains the "chisol" tool
> for this purpose."

I cannot say I would be a fan of prctl interfaces in general but I do
agree with the overal idea to forcing a quiescent state on a set of
CPUs.

> But not only that, the second thing is:
>
> "> Another important point is this: if an application dirties
> > its own per-CPU vmstat cache, while performing a system call,
>
> Or while handling a VM-exit from a vCPU.

Do you have any specific examples on this?

> This are, in my mind, sufficient reasons to discard the "flush per-cpu
> caches" idea. This is also why i chose to abandon the prctrl interface
> patchset.
>
> > and a vmstat sync event is triggered on a different CPU, you'd have to:
> >
> > 1) Wait for that CPU to return to userspace and sync its stats
> > (unfeasible).
> >
> > 2) Queue work to execute on that CPU (undesirable, as that causes
> > an interruption).
> >
> > 3) Remotely sync the vmstat for that CPU."
>
> So the only option is to remotely sync vmstat for the CPU
> (unless you have a better suggestion).

`echo 1 > /proc/sys/vm/stat_refresh' achieves essentially the same
without any kernel changes.

But let me repeat, this is not just about vmstats. Just have a look at
other queue_work_on users. You do not want to handy pick each and every
one and do so in the future as well.
--
Michal Hocko
SUSE Labs

2023-04-23 01:35:08

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu, Apr 20, 2023 at 09:55:40AM +0200, Michal Hocko wrote:
> On Wed 19-04-23 08:44:23, Andrew Theurer wrote:
> > > On Apr 19, 2023, at 6:15 AM, Marcelo Tosatti <[email protected]> wrote:
> > >> Andrew Theurer, can you please explain what problem is the vmstat_work
> > >> interruption causing in your testing?
> > >
> > > +CC Andrew.
> >
> > Nearly every telco we work with for 5G RAN is demanding <20 usec CPU
> > latency as measured by cyclictest & oslat. We cannot achieve under 20
> > usec with the vmstats interruption.
>
> Are you able to get those latency requirements with PREEMPT_RT?

What do you mean, exactly?

PREEMPT_RT allows for the preemption of tasks in kernel context
(so that higher priority tasks can interrupt lower priority tasks).
It also enables IRQ handling to happen in thread context
(so that a given thread might be given higher priority than executing
a particular IRQ handler).

If the question is: "Are you able to achieve <20 usec latency while
allowing switching between different tasks ?" The answer with current
processor and memory speeds is probably: no.
But with more performant processors, you might.

However, with a fully isolated processor which does not require
switching between tasks, yes you can achieve < 20 usec latency.


2023-04-23 01:57:53

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu, Apr 20, 2023 at 10:40:25AM +0200, Michal Hocko wrote:
> On Wed 19-04-23 13:35:12, Marcelo Tosatti wrote:
> [...]
> > This is a burden for application writers and for system configuration.
>
> Yes. And I find it reasonable to expect that burden put there as there
> are non-trivial requirements for those workloads anyway. It is not
> out-of-the-box thing, right?

See below.

> > Or it could be done automatically (from outside of the application).
> > Which is what is described and implemented here:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > "Task isolation is divided in two main steps: configuration and
> > activation.
> >
> > Each step can be performed by an external tool or the latency
> > sensitive application itself. util-linux contains the "chisol" tool
> > for this purpose."
>
> I cannot say I would be a fan of prctl interfaces in general but I do
> agree with the overal idea to forcing a quiescent state on a set of
> CPUs.

This has been avoided with success so far.

> > But not only that, the second thing is:
> >
> > "> Another important point is this: if an application dirties
> > > its own per-CPU vmstat cache, while performing a system call,
> >
> > Or while handling a VM-exit from a vCPU.
>
> Do you have any specific examples on this?

Interrupt handling freeing a page.

handle_access_fault (ARM64) ->
handle_changed_spte_acc_track (x86) -> kvm_set_pfn_accessed -> kvm_set_page_accessed -> mark_page_accessed ->
folio_mark_accessed -> folio_activate -> folio_activate_fn ->
lruvec_add_folio -> update_lru_size -> __update_lru_size ->
__mod_zone_page_state(&pgdat->node_zones[zid],
NR_ZONE_LRU_BASE + lru, nr_pages);


The other option would be to _FORBID_ use of __mod_zone_page_state in certain code sections.

> > This are, in my mind, sufficient reasons to discard the "flush per-cpu
> > caches" idea. This is also why i chose to abandon the prctrl interface
> > patchset.
> >
> > > and a vmstat sync event is triggered on a different CPU, you'd have to:
> > >
> > > 1) Wait for that CPU to return to userspace and sync its stats
> > > (unfeasible).
> > >
> > > 2) Queue work to execute on that CPU (undesirable, as that causes
> > > an interruption).
> > >
> > > 3) Remotely sync the vmstat for that CPU."
> >
> > So the only option is to remotely sync vmstat for the CPU
> > (unless you have a better suggestion).
>
> `echo 1 > /proc/sys/vm/stat_refresh' achieves essentially the same
> without any kernel changes.

It is unsuitable. You'd have to guarantee that, by the time you return
from the write() system call to that file, there has been no other
mod_zone_page_state call. For example, no interrupt
or exception that frees or allocates a page through rmqueue
(NR_FREE_PAGES counter), or that bounce_end_io cannot be called
(since it calls dec_zone_page_state).

It has been used internally as a workaround, but it is not reliable.

> But let me repeat, this is not just about vmstats. Just have a look at
> other queue_work_on users. You do not want to handy pick each and every
> one and do so in the future as well.

The ones that are problematic are being fixed for sometime now. For example:

commit 2de79ee27fdb52626ac4ac48ec6d8d52ba6f9047
Author: Paolo Abeni <[email protected]>
Date: Thu Sep 10 23:33:18 2020 +0200

net: try to avoid unneeded backlog flush

flush_all_backlogs() may cause deadlock on systems
running processes with FIFO scheduling policy.

The above is critical in -RT scenarios, where user-space
specifically ensure no network activity is scheduled on
the CPU running the mentioned FIFO process, but still get
stuck.

This commit tries to address the problem checking the
backlog status on the remote CPUs before scheduling the
flush operation. If the backlog is empty, we can skip it.

v1 -> v2:
- explicitly clear flushed cpu mask - Eric

Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: David S. Miller <[email protected]>


And it has been a normal process so far.

I think what needs to be done is to avoid new queue_work_on()
users from being introduced in the tree (the number of
existing ones is finite and can therefore be fixed).

2023-04-23 02:27:35

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Apr 19, 2023 at 01:35:12PM -0300, Marcelo Tosatti wrote:
> Hi Michal,
>
> On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote:
> > On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> > > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> > [...]
> > > > 2) Run critical code
> > > > 3) Optionally do something once you're done
> > > >
> > > > If vmstat is going to be the only thing to wait for on 1), then the remote
> > > > solution looks good enough (although I leave that to -mm guys as I'm too
> > > > clueless about those matters),
> > >
> > > I am mostly clueless too, but i don't see a problem with the proposed
> > > patch (and no one has pointed any problem either).
> >
> > I really hate to repeat myself again. The biggest pushback has been on
> > a) justification and b) single purpose solution which is very likely
> > incomplete. For a) we are getting the story piece by piece which doesn't
> > speed up the process. You are proposing a non-trivial change to an
> > already convoluted code so having a solid justification is something
> > that shouldn't be all that surprising.
>
> The justification is simple and concise:
>
> 2. With a task that busy loops on a given CPU,
> the kworker interruption to execute vmstat_update
> is undesired and may exceed latency thresholds
> for certain applications.
>
> Performance details for the kworker interruption:
>
> oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
>
> The example above shows an additional 7us for the
>
> oslat -> kworker -> oslat
>
> switches. In the case of a virtualized CPU, and the vmstat_update
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable
> latency threshold for certain applications.
>
> ---
>
> An additional use-case is what has been noted by Andrew Theurer:
>
> Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
> as measured by cyclictest & oslat. We cannot achieve under 20 usec with
> the vmstats interruption.
>
> ---
>
> It seems to me this is solid justification (it seems you want
> particular microsecond values, but those depend on application
> and the CPU). The point is there are several applications which do not
> want to be interrupted (we can ignore the specifics and focus on
> that fact).
>
> Moreover, unrelated interruptions might occur close in time
> (for example, random function call IPIs generated by other
> kernel subsystems), which renders the "lets just consider this
> one application, running on this CPU, to decide what to do"
> short sighted.
>
> > b) is what concerns me more though. There are other per-cpu specific
> > things going on that require some regular flushing. Just to mention
> > another one that your group has been brought up was the memcg pcp
> > caches. Again with a non-trivial proposal to deal with that problem
> > [1].
>
> Yes.
>
> > It has turned out that we can do a simpler thing [2].
>
> For the particular memcg case, there was a simpler fix.
>
> For the vmstat_update case, i don't see a simpler fix.
>
> > I do not think it is a stretch to expect that similar things will pop
> > out every now and then
>
> Agree.
>
> > and rather than dealing with each one in its own way it
> > kinda makes sense to come up with a more general concept so that all
> > those cases can be handled at a single place at least.
>
> I can understand where you are coming from. Unfortunately,
> for some cases it is appropriate to perform the work from a
> remote CPU (and i think this is one such case).
>
> > All I hear about
> > that is that the code of those special applications would need to be
> > changed to use that.
>
> This is a burden for application writers and for system configuration.
>
> Or it could be done automatically (from outside of the application).
> Which is what is described and implemented here:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> "Task isolation is divided in two main steps: configuration and
> activation.
>
> Each step can be performed by an external tool or the latency
> sensitive application itself. util-linux contains the "chisol" tool
> for this purpose."
>
> But not only that, the second thing is:
>
> "> Another important point is this: if an application dirties
> > its own per-CPU vmstat cache, while performing a system call,
>
> Or while handling a VM-exit from a vCPU.
>
> This are, in my mind, sufficient reasons to discard the "flush per-cpu
> caches" idea. This is also why i chose to abandon the prctrl interface
> patchset.

There are additional details that were not mentioned. When we think
of flushing caches, or disabling per-CPU caches, this means that the
isolated application loses the benefit of those caches (which means you
are turning a "general purpose" programming environment into
potentially slower environment for applications to execute).

(yes, of course, one has to be mindful of which system calls can be
used, for example the execution time of system calls which take locks will
depend on whether, and how many, users of those locks there are at a
given moment).

For example, if we flush the vm-stats at every system call return,
and the application happens to switch between different phases of

isolated, userspace only behaviour (computation)
system call intensive behaviour

(which is an HPC example Thomas Gleixner mentioned in a different
thread), then you see the disadvantage of the "special" environment
where flushing is performed on return to system calls.

So it seems to me (unless there are points that show otherwise, which
would indicate that explicit userspace interfaces are preferred) _not_
requiring userspace changes is a superior solution.

Perhaps the complexity should be judged for individual cases
of interruptions, and if a given interruption-free conversion
is seen as too complex, then a "disable feature which makes use of per-CPU
caches" style solution can be made (and then userspace has to
explicitly request for that per-CPU feature to be disabled).

But i don't see that this patchset introduces unmanageable complexity,
neither:

01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock


2023-04-26 14:48:11

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 01:35:12PM -0300, Marcelo Tosatti wrote:
> > Hi Michal,
> >
> > On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote:
> > > On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> > > > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> > > [...]
> > > > > 2) Run critical code
> > > > > 3) Optionally do something once you're done
> > > > >
> > > > > If vmstat is going to be the only thing to wait for on 1), then the remote
> > > > > solution looks good enough (although I leave that to -mm guys as I'm too
> > > > > clueless about those matters),
> > > >
> > > > I am mostly clueless too, but i don't see a problem with the proposed
> > > > patch (and no one has pointed any problem either).
> > >
> > > I really hate to repeat myself again. The biggest pushback has been on
> > > a) justification and b) single purpose solution which is very likely
> > > incomplete. For a) we are getting the story piece by piece which doesn't
> > > speed up the process. You are proposing a non-trivial change to an
> > > already convoluted code so having a solid justification is something
> > > that shouldn't be all that surprising.
> >
> > The justification is simple and concise:
> >
> > 2. With a task that busy loops on a given CPU,
> > the kworker interruption to execute vmstat_update
> > is undesired and may exceed latency thresholds
> > for certain applications.
> >
> > Performance details for the kworker interruption:
> >
> > oslat 1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > oslat 1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > oslat 1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> >
> > The example above shows an additional 7us for the
> >
> > oslat -> kworker -> oslat
> >
> > switches. In the case of a virtualized CPU, and the vmstat_update
> > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > observed in the guest is higher than 50us, violating the acceptable
> > latency threshold for certain applications.
> >
> > ---
> >
> > An additional use-case is what has been noted by Andrew Theurer:
> >
> > Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
> > as measured by cyclictest & oslat. We cannot achieve under 20 usec with
> > the vmstats interruption.
> >
> > ---
> >
> > It seems to me this is solid justification (it seems you want
> > particular microsecond values, but those depend on application
> > and the CPU). The point is there are several applications which do not
> > want to be interrupted (we can ignore the specifics and focus on
> > that fact).
> >
> > Moreover, unrelated interruptions might occur close in time
> > (for example, random function call IPIs generated by other
> > kernel subsystems), which renders the "lets just consider this
> > one application, running on this CPU, to decide what to do"
> > short sighted.
> >
> > > b) is what concerns me more though. There are other per-cpu specific
> > > things going on that require some regular flushing. Just to mention
> > > another one that your group has been brought up was the memcg pcp
> > > caches. Again with a non-trivial proposal to deal with that problem
> > > [1].
> >
> > Yes.
> >
> > > It has turned out that we can do a simpler thing [2].
> >
> > For the particular memcg case, there was a simpler fix.
> >
> > For the vmstat_update case, i don't see a simpler fix.
> >
> > > I do not think it is a stretch to expect that similar things will pop
> > > out every now and then
> >
> > Agree.
> >
> > > and rather than dealing with each one in its own way it
> > > kinda makes sense to come up with a more general concept so that all
> > > those cases can be handled at a single place at least.
> >
> > I can understand where you are coming from. Unfortunately,
> > for some cases it is appropriate to perform the work from a
> > remote CPU (and i think this is one such case).
> >
> > > All I hear about
> > > that is that the code of those special applications would need to be
> > > changed to use that.
> >
> > This is a burden for application writers and for system configuration.
> >
> > Or it could be done automatically (from outside of the application).
> > Which is what is described and implemented here:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > "Task isolation is divided in two main steps: configuration and
> > activation.
> >
> > Each step can be performed by an external tool or the latency
> > sensitive application itself. util-linux contains the "chisol" tool
> > for this purpose."
> >
> > But not only that, the second thing is:
> >
> > "> Another important point is this: if an application dirties
> > > its own per-CPU vmstat cache, while performing a system call,
> >
> > Or while handling a VM-exit from a vCPU.
> >
> > This are, in my mind, sufficient reasons to discard the "flush per-cpu
> > caches" idea. This is also why i chose to abandon the prctrl interface
> > patchset.
>
> There are additional details that were not mentioned. When we think
> of flushing caches, or disabling per-CPU caches, this means that the
> isolated application loses the benefit of those caches (which means you
> are turning a "general purpose" programming environment into
> potentially slower environment for applications to execute).

https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html

> (yes, of course, one has to be mindful of which system calls can be
> used, for example the execution time of system calls which take locks will
> depend on whether, and how many, users of those locks there are at a
> given moment).
>
> For example, if we flush the vm-stats at every system call return,
> and the application happens to switch between different phases of
>
> isolated, userspace only behaviour (computation)
> system call intensive behaviour
>
> (which is an HPC example Thomas Gleixner mentioned in a different
> thread), then you see the disadvantage of the "special" environment
> where flushing is performed on return to system calls.

https://lore.kernel.org/linux-mm/[email protected]/

"In a real-world usecase we had the situation of compute bursts and an
unfortunate hw enforced requirement to go into the kernel between them
for synchronization between the compute threads and hardware (A quick
hardware assisted save/load).

Unmodified NOHZ full accumulated to more than 6% loss compared to a
fully undisturbed run. Most of it was caused by cache effects and not by
the actually used CPU time.

A full enforced quiescing upfront gained ~2-3%, but a lazy approach of
accepting that some stuff might happen once and does not happen again
gained almost 5%. In that particular scenario 5% _is_ a huge win."

>
> So it seems to me (unless there are points that show otherwise, which
> would indicate that explicit userspace interfaces are preferred) _not_
> requiring userspace changes is a superior solution.
>
> Perhaps the complexity should be judged for individual cases
> of interruptions, and if a given interruption-free conversion
> is seen as too complex, then a "disable feature which makes use of per-CPU
> caches" style solution can be made (and then userspace has to
> explicitly request for that per-CPU feature to be disabled).
>
> But i don't see that this patchset introduces unmanageable complexity,
> neither:
>
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
>
>

2023-04-26 15:06:37

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On 4/20/23 15:45, Marcelo Tosatti wrote:
> Perhaps the complexity should be judged for individual cases
> of interruptions, and if a given interruption-free conversion
> is seen as too complex, then a "disable feature which makes use of per-CPU
> caches" style solution can be made (and then userspace has to
> explicitly request for that per-CPU feature to be disabled).
>
> But i don't see that this patchset introduces unmanageable complexity,
> neither:
>
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock

Well that one is a bit different, as there was one kind of lock replaced
with other kind of lock, the lock is uncontended unless there's remote
flushes happening so it's not causing extra overhead for the fast paths,
and later even the irq disabling was removed, which should even improve
things. But this patchset is turning all vmstat counter increments a
cmpxchg.

2023-04-26 16:22:52

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Apr 26, 2023 at 05:04:49PM +0200, Vlastimil Babka wrote:
> On 4/20/23 15:45, Marcelo Tosatti wrote:
> > Perhaps the complexity should be judged for individual cases
> > of interruptions, and if a given interruption-free conversion
> > is seen as too complex, then a "disable feature which makes use of per-CPU
> > caches" style solution can be made (and then userspace has to
> > explicitly request for that per-CPU feature to be disabled).
> >
> > But i don't see that this patchset introduces unmanageable complexity,
> > neither:
> >
> > 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> > 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
>
> Well that one is a bit different, as there was one kind of lock replaced
> with other kind of lock,

local_lock is defined to NULL if CONFIG_PREEMPT_RT is not set.
So for the !CONFIG_PREEMPT_RT case, it introduced a lock.

> the lock is uncontended unless there's remote
> flushes happening so it's not causing extra overhead for the fast paths,
> and later even the irq disabling was removed, which should even improve
> things. But this patchset is turning all vmstat counter increments a
> cmpxchg.

Yes, and we have a similar situation in this case:

1) CMPXCHG is already used to protect many vmstat counter increments.
2) The patchset adds "LOCK CMPXCHG" to existing CMPXCHG user.
3) The performance decrease is negligible, because cache locking is
effective.

"To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

/** Loop to measure **/
for (i = 0; i < rec->loops; i++) {
my_page = alloc_page(gfp_mask);
if (unlikely(my_page == NULL))
return 0;
__free_page(my_page);
}

Unit is cycles.

Vanilla Patched Diff
115.25 117 1.4%"

To be honest, that 1.4% difference was not stable but fluctuated between
positive and negative percentages (so the performance difference was in
the noise).

So performance is not a decisive factor in this case.

2023-04-27 08:39:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed 26-04-23 11:34:00, Marcelo Tosatti wrote:
> On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote:
[...]
> > There are additional details that were not mentioned. When we think
> > of flushing caches, or disabling per-CPU caches, this means that the
> > isolated application loses the benefit of those caches (which means you
> > are turning a "general purpose" programming environment into
> > potentially slower environment for applications to execute).

I do not really buy this argument! Nothing is really free and somebody
has to pay for the overhead. You want highly specialized workload to
enjoy all the performance while having high demand on latency yet the
overhead has to pay everybody else.

> https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html

This is just talking about who benefits from isolation and I do not
think there is any dispute in that regard. I haven't questioned that. My
main argument was that those really need to be special and careful to
achieve their goal and Thomas says a very similar thing. I do not see
any objection to an explicit programming model to achieve that goal.

> > (yes, of course, one has to be mindful of which system calls can be
> > used, for example the execution time of system calls which take locks will
> > depend on whether, and how many, users of those locks there are at a
> > given moment).

This is simply not maintainble state. Once you enter the kernel you
cannot really expect your _ultra low_ latency expectations.

[...]
> > So it seems to me (unless there are points that show otherwise, which
> > would indicate that explicit userspace interfaces are preferred) _not_
> > requiring userspace changes is a superior solution.
> >
> > Perhaps the complexity should be judged for individual cases
> > of interruptions, and if a given interruption-free conversion
> > is seen as too complex, then a "disable feature which makes use of per-CPU
> > caches" style solution can be made (and then userspace has to
> > explicitly request for that per-CPU feature to be disabled).
> >
> > But i don't see that this patchset introduces unmanageable complexity,
> > neither:

As I've tried to explain, I disagree about the approach you are taking.
You are fixing your problem at a wrong layer. You really need to address
the fundamental issue and that is that you do not want housekeeping done
on isolated cpu(s) while your workload is running there.

vmstat updates are just one of schedule_on_cpu users who could disturb
your workload. We do not want to chase every single one and keep
doing that for ever as new callers of that API are added. See the
point? "Fixing" vmstat will not make your isolated workload more
reliable. You really need a more generic solution rather than a quick
hack.

Also vmstat already has a concept of silencing - i.e. quiet_vmstat. IIRC
this is used by NOHZ. I do not remember any details but if anything this
is something I would have a look into.

There is close to 0 benefit to teaching remote stat flushing. As I've
said stats are only for debugging purposes and imprecise values
shouldn't matter. So this just adds a complexity without any actual real
benefit.

--
Michal Hocko
SUSE Labs

2023-04-27 08:41:29

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed 26-04-23 13:10:54, Marcelo Tosatti wrote:
[...]
> "To test the performance difference, a page allocator microbenchmark:
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
> with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
>
> For the single_page_alloc_free test, which does
>
> /** Loop to measure **/
> for (i = 0; i < rec->loops; i++) {
> my_page = alloc_page(gfp_mask);
> if (unlikely(my_page == NULL))
> return 0;
> __free_page(my_page);
> }
>
> Unit is cycles.
>
> Vanilla Patched Diff
> 115.25 117 1.4%"
>
> To be honest, that 1.4% difference was not stable but fluctuated between
> positive and negative percentages (so the performance difference was in
> the noise).
>
> So performance is not a decisive factor in this case.

It is not neglible considering that majority worklods will not benefit
from this change. You are clearly ignoring that vmstat code has been
highly optimized for local per-cpu access exactly to avoid locked
operations and cache line bouncing.
--
Michal Hocko
SUSE Labs

2023-04-27 15:07:17

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu, Apr 27, 2023 at 10:31:21AM +0200, Michal Hocko wrote:
> On Wed 26-04-23 11:34:00, Marcelo Tosatti wrote:
> > On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote:
> [...]
> > > There are additional details that were not mentioned. When we think
> > > of flushing caches, or disabling per-CPU caches, this means that the
> > > isolated application loses the benefit of those caches (which means you
> > > are turning a "general purpose" programming environment into
> > > potentially slower environment for applications to execute).
>
> I do not really buy this argument! Nothing is really free and somebody
> has to pay for the overhead.

About the overhead, modern processors perform "cache locking":

https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-261.html

Which means that as long as memory is completly contained in a cacheline
(for write-back memory), then it is not necessary to perform the LOCK
operation on the bus.
Multiple experiments have confirmed this is the case.

This is the case with per-CPU vmstat memory as well.

> You want highly specialized workload to
> enjoy all the performance while having high demand on latency yet the
> overhead has to pay everybody else.

Yes, the overhead is that code should avoid interrupting CPUs.

Your argument is that: "Avoiding the interruptions adds unsurmountable
complexity therefore those workloads should not be supported" ?

It seems to me this argument can be used against any new piece of code
of functionality that is added to the kernel, isnt it ?

The same argument could be used to reject (at the time) new additions
such as RCU (because systems with large number of processors are a
a highly specialized workload), memory hotplug (same thing), PCI hotplug
(same thing).

> > https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html
>
> This is just talking about who benefits from isolation and I do not
> think there is any dispute in that regard. I haven't questioned that. My
> main argument was that those really need to be special and careful to
> achieve their goal

I see.

> and Thomas says a very similar thing. I do not see
> any objection to an explicit programming model to achieve that goal.

Yes, but it seems to me that the best possible (and most widely
applicable) solution is to avoid any explicit programming if possible.

> > > (yes, of course, one has to be mindful of which system calls can be
> > > used, for example the execution time of system calls which take locks will
> > > depend on whether, and how many, users of those locks there are at a
> > > given moment).
>
> This is simply not maintainble state. Once you enter the kernel you
> cannot really expect your _ultra low_ latency expectations.

Whether or not its OK to perform system calls is up to the application:
What matters is the latency expectation from the outside world [1]
VS how long it takes to execute a set of instructions.

I can give two concrete example:

1) Cyclictest use sys_nanosleep(). It makes sense
to abstract the details of HLT'ing to the operating system.

A whole class of programs (which must handle periodic tasks)
will sleep via the kernel.

2) The HPC example from Thomas, where:

"
1 read_data_set() <- involving syscalls/OS obviously
2 compute_set() <- let me alone
3 save_data_set() <- involving syscalls/OS obviously

repeat the above...

then it's at his discretion to decide to inflict a particular isolation
set on the task which is obviously ineffective while doing #1 and #3 but
might provide the so desired 0.9% boost for compute_set() which
dominates the judgement."

It seems the operating system is capable of providing an isolation free
environment for the application without explicit knowledge from it
(other than taskset).

So all the above are good reasons to try and avoid an explicit
programming interface (again, i did write an explicit programming
interface, and seen in practice its downsides). I will assume you now
understand and agree that the additional complexity added to the kernel
is worthwhile (since its not that huge complexity to perform particular
work remotely, with appropriate locking and/or usage of lockless
algorithms, these things have been around and used in the kernel for a
while now).

As for the next topic...

> [...]
> > > So it seems to me (unless there are points that show otherwise, which
> > > would indicate that explicit userspace interfaces are preferred) _not_
> > > requiring userspace changes is a superior solution.
> > >
> > > Perhaps the complexity should be judged for individual cases
> > > of interruptions, and if a given interruption-free conversion
> > > is seen as too complex, then a "disable feature which makes use of per-CPU
> > > caches" style solution can be made (and then userspace has to
> > > explicitly request for that per-CPU feature to be disabled).
> > >
> > > But i don't see that this patchset introduces unmanageable complexity,
> > > neither:
>
> As I've tried to explain, I disagree about the approach you are taking.
> You are fixing your problem at a wrong layer. You really need to address
> the fundamental issue and that is that you do not want housekeeping done
> on isolated cpu(s) while your workload is running there.

OK, that is a problem. But the fact is that there are interfaces to request
work to be performed on remote CPUs (usually on per-CPU data), and those
must be addressed one-by-one. We (as in the community) are looking into
ways to address multiple classes of interruptions at once, for example:

https://lpc.events/event/16/contributions/1218/
https://www.spinics.net/lists/linux-s390/msg57118.html

"The current CPU isolation is a best effort approach and I agree that for
more strict isolation modes we need to be able to enforce that and hunt
down offenders and think about them one by one."

But we can't block IPIs or requests for work to be executed remotely.

> vmstat updates are just one of schedule_on_cpu users who could disturb
> your workload.

Yes. But its one case which we see right now. So our approach so far has
been to monitor the workload and remove individual interruptions that
are observed.

But as long as you have code that is doing:

queue_work_on(CPU, &this_work);
flush_work(&this_work);

There is nothing one can do about it other than:

1) Return an error to avoid the interruption
(which is what we are trying here:
https://lpc.events/event/16/contributions/1218/).
However this might not be suitable for all cases
(because you rely on the functionality).
2) Convert it to avoid the remote work somehow.
3) Don't queue the work (which might result in incorrect
system operation).

Again, we are trying to widen the number of callsites that can be
handled with certain approaches (see the above URLs).

> We do not want to chase every single one and keep
> doing that for ever as new callers of that API are added. See the
> point?

Yes, agree with this point. Possible solutions:

1) Change the APIs so that any new users that attempt
to use the APIs are encouraged to avoid executing code on
isolated CPUs (or have to handle the errors).

/**
* queue_work_on - queue work on specific cpu
* @cpu: CPU number to execute work on
* @wq: workqueue to use
* @work: work to queue
*
* We queue the work to a specific CPU, the caller must ensure it
* can't go away. Callers that fail to ensure that the specified
* CPU cannot go away will execute on a randomly chosen CPU.
*
* Return: %false if @work was already on a queue, %true otherwise.
*/
bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{

Perhaps _fail variants of APIs to queue remote work
(https://lore.kernel.org/lkml/[email protected]/T/#mc25ddea62ff095dba61d244fbfdca1f61221c915)
plus
a checkpatch.pl error in case of addition of queue_work_on would
be helpful?

2) Change the patch acceptance criteria to avoid introducing
queue_work_on's to isolated CPUs.

If you have additional suggestions or ideas, they are welcome.

> "Fixing" vmstat will not make your isolated workload more
> reliable. You really need a more generic solution rather than a quick
> hack.

It does as long as no one else executes queue_work_on() :-)

Yes, i understand and agree this is weak and fragile.

> Also vmstat already has a concept of silencing - i.e. quiet_vmstat. IIRC
> this is used by NOHZ. I do not remember any details but if anything this
> is something I would have a look into.

Its not sufficient since what it does is only flushing the per-CPU
vmstats when entering idle. There has been work on a patchset
(https://lkml.org/lkml/2022/9/24/306) to improve the infrastructure, to
call quiet_vmstat on return to userspace when nohz_full is used, but
honestly the remote flushing is superior: it potentially avoids many
unecessary flushes (which can happen if the application performs a lot
of system calls).

> There is close to 0 benefit to teaching remote stat flushing. As I've
> said stats are only for debugging purposes and imprecise values
> shouldn't matter. So this just adds a complexity without any actual real
> benefit.

Well there is the need to request a synchronization of the events from
all CPUs, right? For example:

int vmstat_refresh(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
long val;
int err;
int i;

/*
* The regular update, every sysctl_stat_interval, may come later
* than expected: leaving a significant amount in per_cpu buckets.
* This is particularly misleading when checking a quantity of HUGE
* pages, immediately after running a test. /proc/sys/vm/stat_refresh,
* which can equally be echo'ed to or cat'ted from (by root),
* can be used to update the stats just before reading them.
*
* Oh, and since global_zone_page_state() etc. are so careful to hide
* transiently negative values, report an error here if any of
* the stats is negative, so we know to go looking for imbalance.
*/
err = schedule_on_each_cpu(refresh_vm_stats);

So if you let a remote CPU dirty its own cache, then at somepoint you
need to flush back that cache.

Are you suggesting to disable synchronization of the per-CPU
vmstats for a given CPU (if isolated, for example), and then any
callsites which require proper values must loop over all CPUs
when requesting uptodate statistics?

Works for me and the addresses the requests from users
(since it removes the interruption to isolated CPUs).

2023-04-27 16:48:23

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Thu, Apr 27, 2023 at 10:39:29AM +0200, Michal Hocko wrote:
> On Wed 26-04-23 13:10:54, Marcelo Tosatti wrote:
> [...]
> > "To test the performance difference, a page allocator microbenchmark:
> > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
> > with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
> >
> > For the single_page_alloc_free test, which does
> >
> > /** Loop to measure **/
> > for (i = 0; i < rec->loops; i++) {
> > my_page = alloc_page(gfp_mask);
> > if (unlikely(my_page == NULL))
> > return 0;
> > __free_page(my_page);
> > }
> >
> > Unit is cycles.
> >
> > Vanilla Patched Diff
> > 115.25 117 1.4%"
> >
> > To be honest, that 1.4% difference was not stable but fluctuated between
> > positive and negative percentages (so the performance difference was in
> > the noise).
> >
> > So performance is not a decisive factor in this case.
>
> It is not neglible considering that majority worklods will not benefit
> from this change. You are clearly ignoring that vmstat code has been
> highly optimized for local per-cpu access exactly to avoid locked
> operations and cache line bouncing.
> --
> Michal Hocko
> SUSE Labs

Again, the values fluctuate between positive and negative
performance difference (i happen to have copied a positive value).

So the performance difference is in the noise (its not stable at 1.4%),
but rather close to 0%.

So the data is showing that there is no negative performance impact.

2023-05-03 14:08:50

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

On Wed, Apr 19, 2023 at 04:15:50PM -0300, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 06:47:30PM +0200, Vlastimil Babka wrote:
> > On 4/19/23 13:29, Marcelo Tosatti wrote:
> > > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> > >> This was tried before:
> > >> https://lore.kernel.org/lkml/[email protected]/
> > >>
> > >> My conclusion from that discussion (and work) is that a special system
> > >> call:
> > >>
> > >> 1) Does not allow the benefits to be widely applied (only modified
> > >> applications will benefit). Is not portable across different operating systems.
> > >>
> > >> Removing the vmstat_work interruption is a benefit for HPC workloads,
> > >> for example (in fact, it is a benefit for any kind of application,
> > >> since the interruption causes cache misses).
> > >>
> > >> 2) Increases the system call cost for applications which would use
> > >> the interface.
> > >>
> > >> So avoiding the vmstat_update update interruption, without userspace
> > >> knowledge and modifications, is a better than solution than a modified
> > >> userspace.
> > >
> > > Another important point is this: if an application dirties
> > > its own per-CPU vmstat cache, while performing a system call,
> > > and a vmstat sync event is triggered on a different CPU, you'd have to:
> > >
> > > 1) Wait for that CPU to return to userspace and sync its stats
> > > (unfeasible).
> > >
> > > 2) Queue work to execute on that CPU (undesirable, as that causes
> > > an interruption).
> >
> > So you're saying the application might do a syscall from the isolcpu, so
> > IIUC it cannot expect any latency guarantees at that very moment,
>
> Why not? cyclictest uses nanosleep and its the main tool for measuring
> latency.
>
> > but then
> > it immediately starts expecting them again after returning to userspace,
>
> No, the expectation more generally is this:
>
> For certain types of applications (for example PLC software or
> RAN processing), upon occurrence of an event, it is necessary to
> complete a certain task in a maximum amount of time (deadline).
>
> One way to express this requirement is with a pair of numbers,
> deadline time and execution time, where:
>
> * deadline time: length of time between event and deadline.
> * execution time: length of time it takes for processing of event
> to occur on a particular hardware platform
> (uninterrupted).
>
> The particular values depend on use-case. For the case
> where the realtime application executes in a virtualized
> guest, an interruption which must be serviced in the host will cause
> the following sequence of events:
>
> 1) VM-exit
> 2) execution of IPI (and function call) (or switch to kwork
> thread to execute some work item).
> 3) VM-entry
>
> Which causes an excess of 50us latency as observed by cyclictest
> (this violates the latency requirement of vRAN application with 1ms TTI,
> for example).
>
> > and
> > a single interruption for a one-time flush after the syscall would be too
> > intrusive?
>
> Generally, if you can't complete the task (which involves executing a
> number of instructions) before the deadline, then its a problem.
>
> One-time flush? You mean to switch between:
>
> rt-app -> kworker (to execute vmstat_update flush) -> rt-app
>
> My measurement, which probably had vmstat_update code/data in cache, took 7us.
> It might be the case that the code to execute must be brought in from
> memory, which takes even longer.
>
> > (elsewhere in the thread you described an RT app initialization that may
> > generate vmstats to flush and then entry userspace loop, again, would a
> > single interruption soon after entering the loop be so critical?)
>
> 1) It depends on the application. For the use-case above, where < 50us
> interruption is desired, yes it is critical.
>
> 2) The interruptions can come from different sources.
>
> Time
> 0 rt-app executing instruction 1
> 1 rt-app executing instruction 2
> 2 scheduler switches between rt-app and kworker
> 3 kworker runs vmstat_work
> 4 scheduler switches between kworker and rt-app
> 5 rt-app executing instruction 3
> 6 ipi to handle a KVM request IPI
> 7 fill in your preferred IPI handler
>
> So the argument "a single interruption might not cause your deadline
> to be exceeded" fails (because the time to handle the
> different interruptions might sum).
>
> Does that make sense?

Ping ? (just want to double check the reasoning above makes sense).