2023-03-13 16:29:18

by Marcelo Tosatti

Subject: [PATCH v5 00/12] fold per-CPU vmstats remotely

This patch series addresses the following two problems:

1. A customer provided evidence indicating that the idle
tick was stopped while CPU-specific vmstat counters
still remained populated.

Thus one can only assume quiet_vmstat() was not
invoked on return to the idle loop. If I understand
correctly, I suspect this divergence might erroneously
prevent a reclaim attempt by kswapd. If the number of
zone-specific free pages is below the per-CPU drift
value, then zone_page_state_snapshot() is used to
compute a more accurate view of that statistic. Thus
any task blocked on the NUMA node-specific
pfmemalloc_wait queue will be unable to make
significant progress via direct reclaim unless it is
killed after being woken up by kswapd
(see throttle_direct_reclaim()).

2. With a SCHED_FIFO task that busy loops on a given CPU,
and the kworker for that CPU at SCHED_OTHER priority,
queuing work to sync per-CPU vmstats will either cause
that work to never execute, or cause stalld (i.e. the
stall daemon) to boost the kworker priority, which
causes a latency violation.

Both problems are addressed by having vmstat_shepherd flush
the per-CPU counters to the global counters from remote CPUs.

This is done using cmpxchg to manipulate the counters,
both CPU locally (via the account functions),
and remotely (via cpu_vm_stats_fold).
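
As a rough illustration of the idea (not the code from the series), here
is a minimal userspace sketch using C11 atomics. The names cpu_diff,
global_count, counter_mod, take_pending, fold_remote and fold_all are
hypothetical stand-ins, and plain arrays replace the kernel's per-CPU
machinery. Both the local update and the remote fold go through a
compare-and-exchange loop, so neither side can lose the other's update.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS 4

/* Hypothetical stand-ins for the per-CPU diffs and the global counter. */
static _Atomic int cpu_diff[NR_CPUS];
static _Atomic long global_count;

/* Local update: retry the compare-and-exchange until it succeeds, so a
 * concurrent remote fold of the same slot cannot lose this delta. */
void counter_mod(int cpu, int delta)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    int old = atomic_load(&cpu_diff[cpu]);
    /* On failure, 'old' is refreshed with the current value. */
    while (!atomic_compare_exchange_weak(&cpu_diff[cpu], &old, old + delta))
        ;
}

/* Atomically replace the pending diff with 0 and return what was there. */
int take_pending(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    int old = atomic_load(&cpu_diff[cpu]);
    while (!atomic_compare_exchange_weak(&cpu_diff[cpu], &old, 0))
        ;
    return old;
}

/* Remote fold: any CPU may move another CPU's pending diff into the
 * global counter without running code on the target CPU. */
void fold_remote(int cpu)
{
    int pending = take_pending(cpu);
    if (pending)
        atomic_fetch_add(&global_count, pending);
}

/* What a shepherd-like thread would do: fold every CPU's diff. */
void fold_all(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        fold_remote(cpu);
}
```

In this sketch, calling counter_mod(1, 40) and counter_mod(2, 29) leaves
global_count untouched until fold_all() runs, after which the global
counter holds the sum of the folded diffs and the per-CPU slots are 0.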

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.

v5:
- Drop "mm/vmstat: remove remote node draining" (Vlastimil Babka)
- Implement remote node draining for cpu_vm_stats_fold (Vlastimil Babka)

v4:
- Switch per-CPU vmstat counters to s32, required
by RISC-V, ARC architectures

v3:
- Removed unused drain_zone_pages and changes variable (David Hildenbrand)
- Use xchg instead of cmpxchg in refresh_cpu_vm_stats (Peter Xu)
- Add drain_all_pages to vmstat_refresh to make
stats more accurate (Peter Xu)
- Improve changelog of
"mm/vmstat: switch counter modification to cmpxchg" (Peter Xu / David)
- Improve changelog of
"mm/vmstat: remove remote node draining" (David Hildenbrand)


v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
(Christoph Lameter)
- use try_cmpxchg for cmpxchg loops
(Uros Bizjak / Matthew Wilcox)


arch/arm64/include/asm/percpu.h | 16 ++
arch/loongarch/include/asm/percpu.h | 23 +++-
arch/s390/include/asm/percpu.h | 5
arch/x86/include/asm/percpu.h | 39 +++----
include/asm-generic/percpu.h | 17 +++
include/linux/mmzone.h | 3
include/linux/percpu-defs.h | 2
kernel/fork.c | 2
kernel/scs.c | 2
mm/page_alloc.c | 23 ----
mm/vmstat.c | 424 +++++++++++++++++++++++++++++++++++++++++------------------------------------
11 files changed, 307 insertions(+), 249 deletions(-)




2023-03-14 12:34:41

by Michal Hocko

Subject: Re: [PATCH v5 00/12] fold per-CPU vmstats remotely

On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> This patch series addresses the following two problems:
>
> 1. A customer provided some evidence which indicates that
> the idle tick was stopped; albeit, CPU-specific vmstat
> counters still remained populated.
>
> Thus one can only assume quiet_vmstat() was not
> invoked on return to the idle loop. If I understand
> correctly, I suspect this divergence might erroneously
> prevent a reclaim attempt by kswapd. If the number of
> zone specific free pages are below their per-cpu drift
> value then zone_page_state_snapshot() is used to
> compute a more accurate view of the aforementioned
> statistic. Thus any task blocked on the NUMA node
> specific pfmemalloc_wait queue will be unable to make
> significant progress via direct reclaim unless it is
> killed after being woken up by kswapd
> (see throttle_direct_reclaim())

I have a hard time following the actual problem described above. Are you
suggesting that a lack of pcp vmstat counter updates has led to
reclaim issues? What is the said "evidence"? Could you share more of the
story, please?

> 2. With a SCHED_FIFO task that busy loops on a given CPU,
> and kworker for that CPU at SCHED_OTHER priority,
> queuing work to sync per-vmstats will either cause that
> work to never execute, or stalld (i.e. stall daemon)
> boosts kworker priority which causes a latency
> violation

Why is that a problem? Out-of-sync stats shouldn't cause major problems.
Or can they?

Thanks!
--
Michal Hocko
SUSE Labs

2023-03-14 13:05:58

by Marcelo Tosatti

Subject: Re: [PATCH v5 00/12] fold per-CPU vmstats remotely

On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > This patch series addresses the following two problems:
> >
> > 1. A customer provided some evidence which indicates that
> > the idle tick was stopped; albeit, CPU-specific vmstat
> > counters still remained populated.
> >
> > Thus one can only assume quiet_vmstat() was not
> > invoked on return to the idle loop. If I understand
> > correctly, I suspect this divergence might erroneously
> > prevent a reclaim attempt by kswapd. If the number of
> > zone specific free pages are below their per-cpu drift
> > value then zone_page_state_snapshot() is used to
> > compute a more accurate view of the aforementioned
> > statistic. Thus any task blocked on the NUMA node
> > specific pfmemalloc_wait queue will be unable to make
> > significant progress via direct reclaim unless it is
> > killed after being woken up by kswapd
> > (see throttle_direct_reclaim())
>
> I have hard time to follow the actual problem described above. Are you
> suggesting that a lack of pcp vmstat counters update has led to
> reclaim issues? What is the said "evidence"? Could you share more of the
> story please?


- The process was trapped in throttle_direct_reclaim().
  wait_event_killable() was called to wait for the condition
  allow_direct_reclaim(pgdat) to become true for the current node.
  allow_direct_reclaim(pgdat) examined the number of free pages
  on the node via zone_page_state(), which just returns the value in
  zone->vm_stat[NR_FREE_PAGES].

- On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
  However, the freelist on this node was not empty.

- This inconsistency in the vmstat value was caused by the per-CPU vmstat
  on nohz_full CPUs. Every increment/decrement of vmstat is first
  performed on the per-CPU vmstat counter; the pooled diffs are then
  accumulated into the zone's vmstat counter in a timely manner. However,
  on nohz_full CPUs (48 of 52 CPUs on this customer's system) these
  pooled diffs were not accumulated once the CPU had no events on it, so
  the CPU slept indefinitely.
  I checked the per-CPU vmstat and found a total of 69 counts not yet
  accumulated into the zone's vmstat counter.

- In this situation, kswapd did not help the trapped process.
  In pgdat_balanced(), zone_watermark_ok_safe() examined the number
  of free pages on the node via zone_page_state_snapshot(), which
  checks the pending counts in the per-CPU vmstat.
  Therefore kswapd correctly saw that there were 69 free pages.
  Since zone->_watermark = {8, 20, 32}, kswapd did no work because
  69 was greater than the high watermark of 32.

- As a result:
  - The process waited for allow_direct_reclaim(pgdat) to become true.
    - allow_direct_reclaim() saw 0 via zone_page_state().
      It woke kswapd since 0 was lower than the min watermark.
  - kswapd did nothing.
    - kswapd saw 69 via zone_page_state_snapshot().
      It woke the waiters without performing memory reclaim
      since 69 is greater than the high watermark.
  - The process woken by kswapd soon resumed waiting for kswapd.
    - allow_direct_reclaim() still saw 0 via zone_page_state().
      It woke kswapd since 0 was lower than the min watermark.
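
The disagreement between the two readers can be sketched with a small
userspace model. This is a simplification under stated assumptions, not
the kernel code: struct zone_model and all function names are
hypothetical, a single zone stands in for the node, the throttling check
is reduced to a comparison against the min watermark as described above,
and the split of the 69 pending counts across CPUs is made up (only the
total comes from the report).

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Hypothetical model of one zone: a global counter plus per-CPU diffs
 * that have not been folded into it yet. */
struct zone_model {
    long vm_stat_free;          /* stands in for zone->vm_stat[NR_FREE_PAGES] */
    int  pcp_diff[NR_CPUS];     /* pending per-CPU deltas */
    long watermark[NR_WMARK];   /* stands in for zone->_watermark */
};

/* Cheap reader: global counter only, like zone_page_state(). */
long page_state(const struct zone_model *z)
{
    return z->vm_stat_free;
}

/* Precise reader: global counter plus every pending per-CPU diff,
 * like zone_page_state_snapshot(). */
long page_state_snapshot(const struct zone_model *z)
{
    long x = z->vm_stat_free;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        x += z->pcp_diff[cpu];
    return x < 0 ? 0 : x;
}

/* Throttled task's view, reduced to one min-watermark comparison. */
bool direct_reclaim_allowed(const struct zone_model *z)
{
    return page_state(z) > z->watermark[WMARK_MIN];
}

/* kswapd's view: above the high watermark means nothing to reclaim. */
bool kswapd_sees_balanced(const struct zone_model *z)
{
    return page_state_snapshot(z) > z->watermark[WMARK_HIGH];
}

/* The reported state: global counter 0, 69 counts pending,
 * watermarks {8, 20, 32}. */
struct zone_model reported_zone(void)
{
    struct zone_model z = {
        .vm_stat_free = 0,
        .pcp_diff = { 40, 29, 0, 0 },   /* made-up split; the sum is 69 */
        .watermark = { 8, 20, 32 },
    };
    assert(page_state_snapshot(&z) == 69);
    return z;
}
```

With the state returned by reported_zone(), direct_reclaim_allowed() is
false while kswapd_sees_balanced() is true, which is the combination that
keeps the task throttled and kswapd idle in this model.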

>
> > 2. With a SCHED_FIFO task that busy loops on a given CPU,
> > and kworker for that CPU at SCHED_OTHER priority,
> > queuing work to sync per-vmstats will either cause that
> > work to never execute, or stalld (i.e. stall daemon)
> > boosts kworker priority which causes a latency
> > violation
>
> Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> Or can they?

Consider a SCHED_FIFO task that is polling the network queue (say
testpmd).

    do {
        if (net_registers->state & DATA_AVAILABLE) {
            process_data();
        }
    } while (!stopped);

Since this task runs at SCHED_FIFO priority, kworker won't
be scheduled to run (therefore per-CPU vmstats won't be
flushed to global vmstats).

Or, if testpmd runs at SCHED_OTHER, then the work item to
flush per-CPU vmstats causes

testpmd -> kworker
kworker: flush per-CPU vmstats
kworker -> testpmd

And this might cause undesired latencies to the packets being
processed by the testpmd task.


2023-03-14 13:07:34

by Marcelo Tosatti

Subject: Re: [PATCH v5 00/12] fold per-CPU vmstats remotely

On Tue, Mar 14, 2023 at 09:59:37AM -0300, Marcelo Tosatti wrote:
> > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > Or can they?
>
> Consider SCHED_FIFO task that is polling the network queue (say
> testpmd).
>
> do {
> if (net_registers->state & DATA_AVAILABLE) {
> process_data)();
> }
> } while (!stopped);
>
> Since this task runs at SCHED_FIFO priority, kworker won't
> be scheduled to run (therefore per-CPU vmstats won't be
> flushed to global vmstats).
>
> Or, if testpmd runs at SCHED_OTHER, then the work item to
> flush per-CPU vmstats causes
>
> testpmd -> kworker
> kworker: flush per-CPU vmstats
> kworker -> testpmd
>
> And this might cause undesired latencies to the packets being
> processed by the testpmd task.

This problem is unrelated to the kswapd problem, but both are addressed
by the patchset.


2023-03-14 14:31:55

by Michal Hocko

Subject: Re: [PATCH v5 00/12] fold per-CPU vmstats remotely

On Tue 14-03-23 09:59:37, Marcelo Tosatti wrote:
> On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> > On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > > This patch series addresses the following two problems:
> > >
> > > 1. A customer provided some evidence which indicates that
> > > the idle tick was stopped; albeit, CPU-specific vmstat
> > > counters still remained populated.
> > >
> > > Thus one can only assume quiet_vmstat() was not
> > > invoked on return to the idle loop. If I understand
> > > correctly, I suspect this divergence might erroneously
> > > prevent a reclaim attempt by kswapd. If the number of
> > > zone specific free pages are below their per-cpu drift
> > > value then zone_page_state_snapshot() is used to
> > > compute a more accurate view of the aforementioned
> > > statistic. Thus any task blocked on the NUMA node
> > > specific pfmemalloc_wait queue will be unable to make
> > > significant progress via direct reclaim unless it is
> > > killed after being woken up by kswapd
> > > (see throttle_direct_reclaim())
> >
> > I have hard time to follow the actual problem described above. Are you
> > suggesting that a lack of pcp vmstat counters update has led to
> > reclaim issues? What is the said "evidence"? Could you share more of the
> > story please?
>
>
> - The process was trapped in throttle_direct_reclaim().
> The function wait_event_killable() was called to wait condition
> allow_direct_reclaim(pgdat) for current node to be true.
> The allow_direct_reclaim(pgdat) examined the number of free pages
> on the node by zone_page_state() which just returns value in
> zone->vm_stat[NR_FREE_PAGES].
>
> - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
> However, the freelist on this node was not empty.
>
> - This inconsistent of vmstat value was caused by percpu vmstat on
> nohz_full cpus. Every increment/decrement of vmstat is performed
> on percpu vmstat counter at first, then pooled diffs are cumulated
> to the zone's vmstat counter in timely manner. However, on nohz_full
> cpus (in case of this customer's system, 48 of 52 cpus) these pooled
> diffs were not cumulated once the cpu had no event on it so that
> the cpu started sleeping infinitely.
> I checked percpu vmstat and found there were total 69 counts not
> cumulated to the zone's vmstat counter yet.
>
> - In this situation, kswapd did not help the trapped process.
> In pgdat_balanced(), zone_wakermark_ok_safe() examined the number
> of free pages on the node by zone_page_state_snapshot() which
> checks pending counts on percpu vmstat.
> Therefore kswapd could know there were 69 free pages correctly.
> Since zone->_watermark = {8, 20, 32}, kswapd did not work because
> 69 was greater than 32 as high watermark.

If the imprecision of allow_direct_reclaim is the underlying problem, why
haven't you used zone_page_state_snapshot instead?

Anyway, this is the kind of information that is really helpful to have in
the patch description.

[...]
> > > 2. With a SCHED_FIFO task that busy loops on a given CPU,
> > > and kworker for that CPU at SCHED_OTHER priority,
> > > queuing work to sync per-vmstats will either cause that
> > > work to never execute, or stalld (i.e. stall daemon)
> > > boosts kworker priority which causes a latency
> > > violation
> >
> > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > Or can they?
>
> Consider SCHED_FIFO task that is polling the network queue (say
> testpmd).
>
> do {
> if (net_registers->state & DATA_AVAILABLE) {
> process_data)();
> }
> } while (!stopped);
>
> Since this task runs at SCHED_FIFO priority, kworker won't
> be scheduled to run (therefore per-CPU vmstats won't be
> flushed to global vmstats).

Yes, that is certainly possible. But my main point is that vmstat
imprecision shouldn't cause functional problems. That is why we have
_snapshot readers to get an exact value where it matters for
consistency.

> Or, if testpmd runs at SCHED_OTHER, then the work item to
> flush per-CPU vmstats causes
>
> testpmd -> kworker
> kworker: flush per-CPU vmstats
> kworker -> testpmd
>
> And this might cause undesired latencies to the packets being
> processed by the testpmd task.

Right, but can you have any latency expectations in a situation like
that?

--
Michal Hocko
SUSE Labs

2023-03-14 19:02:29

by Marcelo Tosatti

Subject: Re: [PATCH v5 00/12] fold per-CPU vmstats remotely

On Tue, Mar 14, 2023 at 03:31:21PM +0100, Michal Hocko wrote:
> On Tue 14-03-23 09:59:37, Marcelo Tosatti wrote:
> > On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> > > On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > > > This patch series addresses the following two problems:
> > > >
> > > > 1. A customer provided some evidence which indicates that
> > > > the idle tick was stopped; albeit, CPU-specific vmstat
> > > > counters still remained populated.
> > > >
> > > > Thus one can only assume quiet_vmstat() was not
> > > > invoked on return to the idle loop. If I understand
> > > > correctly, I suspect this divergence might erroneously
> > > > prevent a reclaim attempt by kswapd. If the number of
> > > > zone specific free pages are below their per-cpu drift
> > > > value then zone_page_state_snapshot() is used to
> > > > compute a more accurate view of the aforementioned
> > > > statistic. Thus any task blocked on the NUMA node
> > > > specific pfmemalloc_wait queue will be unable to make
> > > > significant progress via direct reclaim unless it is
> > > > killed after being woken up by kswapd
> > > > (see throttle_direct_reclaim())
> > >
> > > I have hard time to follow the actual problem described above. Are you
> > > suggesting that a lack of pcp vmstat counters update has led to
> > > reclaim issues? What is the said "evidence"? Could you share more of the
> > > story please?
> >
> >
> > - The process was trapped in throttle_direct_reclaim().
> > The function wait_event_killable() was called to wait condition
> > allow_direct_reclaim(pgdat) for current node to be true.
> > The allow_direct_reclaim(pgdat) examined the number of free pages
> > on the node by zone_page_state() which just returns value in
> > zone->vm_stat[NR_FREE_PAGES].
> >
> > - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
> > However, the freelist on this node was not empty.
> >
> > - This inconsistent of vmstat value was caused by percpu vmstat on
> > nohz_full cpus. Every increment/decrement of vmstat is performed
> > on percpu vmstat counter at first, then pooled diffs are cumulated
> > to the zone's vmstat counter in timely manner. However, on nohz_full
> > cpus (in case of this customer's system, 48 of 52 cpus) these pooled
> > diffs were not cumulated once the cpu had no event on it so that
> > the cpu started sleeping infinitely.
> > I checked percpu vmstat and found there were total 69 counts not
> > cumulated to the zone's vmstat counter yet.
> >
> > - In this situation, kswapd did not help the trapped process.
> > In pgdat_balanced(), zone_wakermark_ok_safe() examined the number
> > of free pages on the node by zone_page_state_snapshot() which
> > checks pending counts on percpu vmstat.
> > Therefore kswapd could know there were 69 free pages correctly.
> > Since zone->_watermark = {8, 20, 32}, kswapd did not work because
> > 69 was greater than 32 as high watermark.
>
> If the imprecision of allow_direct_reclaim is the underlying problem why
> haven't you used zone_page_state_snapshot instead?

It might have dealt with problem #1 for this particular case. However,
looking at the callers of zone_page_state:

5 2227 mm/compaction.c <<compaction_suitable>>
zone_page_state(zone, NR_FREE_PAGES));
6 124 mm/highmem.c <<__nr_free_highpages>>
pages += zone_page_state(zone, NR_FREE_PAGES);
7 283 mm/page-writeback.c <<node_dirtyable_memory>>
nr_pages += zone_page_state(zone, NR_FREE_PAGES);
8 318 mm/page-writeback.c <<highmem_dirtyable_memory>>
nr_pages = zone_page_state(z, NR_FREE_PAGES);
9 321 mm/page-writeback.c <<highmem_dirtyable_memory>>
nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
10 322 mm/page-writeback.c <<highmem_dirtyable_memory>>
nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
11 3091 mm/page_alloc.c <<__rmqueue>>
zone_page_state(zone, NR_FREE_CMA_PAGES) >
12 3092 mm/page_alloc.c <<__rmqueue>>
zone_page_state(zone, NR_FREE_PAGES) / 2) {

The suggested patchset fixes the problem where, due to nohz_full,
the delayed timer for vmstat_work can be armed but not executed (which means
the per-CPU counters can be out of sync for as long as the CPU is
idle in nohz_full mode).

You probably do not want to convert all callers of zone_page_state
into zone_page_state_snapshot (as a justification for the proposed
patchset).

> Anyway, this is kind of information that is really helpful to have in
> the patch description.

Agreed: I will resend a new version with an updated commit message.

> [...]
> > > > 2. With a SCHED_FIFO task that busy loops on a given CPU,
> > > > and kworker for that CPU at SCHED_OTHER priority,
> > > > queuing work to sync per-vmstats will either cause that
> > > > work to never execute, or stalld (i.e. stall daemon)
> > > > boosts kworker priority which causes a latency
> > > > violation
> > >
> > > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > > Or can they?
> >
> > Consider SCHED_FIFO task that is polling the network queue (say
> > testpmd).
> >
> > do {
> > if (net_registers->state & DATA_AVAILABLE) {
> > process_data)();
> > }
> > } while (!stopped);
> >
> > Since this task runs at SCHED_FIFO priority, kworker won't
> > be scheduled to run (therefore per-CPU vmstats won't be
> > flushed to global vmstats).
>
> Yes, that is certainly possible. But my main point is that vmstat
> imprecision shouldn't cause functional problems. That is why we have
> _snapshot readers to get an exact value where it matters for
> consistency.

Understood. Perhaps allow_direct_reclaim should use
zone_page_state_snapshot, as otherwise it is only precise
at sysctl_stat_interval intervals?

>
> > Or, if testpmd runs at SCHED_OTHER, then the work item to
> > flush per-CPU vmstats causes
> >
> > testpmd -> kworker
> > kworker: flush per-CPU vmstats
> > kworker -> testpmd
>
> > And this might cause undesired latencies to the packets being
> > processed by the testpmd task.

> Right but can you have any latencies expectation in a situation like
> that?

Not sure I understand what you mean. Example:

https://www.gabrieleara.it/assets/documents/papers/conferences/2021-ieee-nfv-sdn.pdf

In general, UDPDK exhibits a much lower
latency than the in-kernel UDP stack used through the POSIX
API (e.g., a 69 % reduction from 95 µs down to 29 µs), thanks
to its ability to bypass the kernel entirely, which in turn
outperforms the in-kernel TCP stack as available through the
POSIX API, as expected.
...
Alternatively, application processes can use UDPDK
with the non-blocking API calls (using the O_NONBLOCK flag)
and perform some other action while waiting for packets to
be ready to be sent/received to/from the UDPDK Process,
instead of performing continuous busy-loops on packet queues.
However, in this case the cost of a single CPU fully busy due
to the UDPDK Process itself is anyway unavoidab


2023-03-14 21:02:01

by Michal Hocko

Subject: Re: [PATCH v5 00/12] fold per-CPU vmstats remotely

On Tue 14-03-23 15:49:09, Marcelo Tosatti wrote:
> On Tue, Mar 14, 2023 at 03:31:21PM +0100, Michal Hocko wrote:
> > On Tue 14-03-23 09:59:37, Marcelo Tosatti wrote:
> > > On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> > > > On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > > > > This patch series addresses the following two problems:
> > > > >
> > > > > 1. A customer provided some evidence which indicates that
> > > > > the idle tick was stopped; albeit, CPU-specific vmstat
> > > > > counters still remained populated.
> > > > >
> > > > > Thus one can only assume quiet_vmstat() was not
> > > > > invoked on return to the idle loop. If I understand
> > > > > correctly, I suspect this divergence might erroneously
> > > > > prevent a reclaim attempt by kswapd. If the number of
> > > > > zone specific free pages are below their per-cpu drift
> > > > > value then zone_page_state_snapshot() is used to
> > > > > compute a more accurate view of the aforementioned
> > > > > statistic. Thus any task blocked on the NUMA node
> > > > > specific pfmemalloc_wait queue will be unable to make
> > > > > significant progress via direct reclaim unless it is
> > > > > killed after being woken up by kswapd
> > > > > (see throttle_direct_reclaim())
> > > >
> > > > I have hard time to follow the actual problem described above. Are you
> > > > suggesting that a lack of pcp vmstat counters update has led to
> > > > reclaim issues? What is the said "evidence"? Could you share more of the
> > > > story please?
> > >
> > >
> > > - The process was trapped in throttle_direct_reclaim().
> > > The function wait_event_killable() was called to wait condition
> > > allow_direct_reclaim(pgdat) for current node to be true.
> > > The allow_direct_reclaim(pgdat) examined the number of free pages
> > > on the node by zone_page_state() which just returns value in
> > > zone->vm_stat[NR_FREE_PAGES].
> > >
> > > - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
> > > However, the freelist on this node was not empty.
> > >
> > > - This inconsistent of vmstat value was caused by percpu vmstat on
> > > nohz_full cpus. Every increment/decrement of vmstat is performed
> > > on percpu vmstat counter at first, then pooled diffs are cumulated
> > > to the zone's vmstat counter in timely manner. However, on nohz_full
> > > cpus (in case of this customer's system, 48 of 52 cpus) these pooled
> > > diffs were not cumulated once the cpu had no event on it so that
> > > the cpu started sleeping infinitely.
> > > I checked percpu vmstat and found there were total 69 counts not
> > > cumulated to the zone's vmstat counter yet.
> > >
> > > - In this situation, kswapd did not help the trapped process.
> > > In pgdat_balanced(), zone_wakermark_ok_safe() examined the number
> > > of free pages on the node by zone_page_state_snapshot() which
> > > checks pending counts on percpu vmstat.
> > > Therefore kswapd could know there were 69 free pages correctly.
> > > Since zone->_watermark = {8, 20, 32}, kswapd did not work because
> > > 69 was greater than 32 as high watermark.
> >
> > If the imprecision of allow_direct_reclaim is the underlying problem why
> > haven't you used zone_page_state_snapshot instead?
>
> It might have dealt with problem #1 for this particular case. However,
> looking at the callers of zone_page_state:
>
> 5 2227 mm/compaction.c <<compaction_suitable>>
> zone_page_state(zone, NR_FREE_PAGES));
> 6 124 mm/highmem.c <<__nr_free_highpages>>
> pages += zone_page_state(zone, NR_FREE_PAGES);
> 7 283 mm/page-writeback.c <<node_dirtyable_memory>>
> nr_pages += zone_page_state(zone, NR_FREE_PAGES);
> 8 318 mm/page-writeback.c <<highmem_dirtyable_memory>>
> nr_pages = zone_page_state(z, NR_FREE_PAGES);
> 9 321 mm/page-writeback.c <<highmem_dirtyable_memory>>
> nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
> 10 322 mm/page-writeback.c <<highmem_dirtyable_memory>>
> nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
> 11 3091 mm/page_alloc.c <<__rmqueue>>
> zone_page_state(zone, NR_FREE_CMA_PAGES) >
> 12 3092 mm/page_alloc.c <<__rmqueue>>
> zone_page_state(zone, NR_FREE_PAGES) / 2) {
>
> The suggested patchset fixes the problem of where due to nohz_full,
> the delayed timer for vmstat_work can be armed but not executed (which means
> the per-cpu counters can be out of sync for as long as the cpu is in
> idle while in nohz_full mode).
>
> You probably do not want to convert all callers of zone_page_state
> into zone_page_state_snapshot (as a justification for the proposed
> patchset).

Yes, I do not really think we want or even need to convert all of them.
But it seems that your direct reclaim throttling example really requires
that. The thing with remote flushing is that it would suffer from
similar imprecision, as the flushing could be deferred and, under
certain conditions, really starved. So it is definitely worth fixing the
issue you are seeing without such a complex scheme.

> > Anyway, this is kind of information that is really helpful to have in
> > the patch description.
>
> Agree: resending a new version with updated commit.

I would really recommend trying out the simple fix to see if it changes
the behavior.

> > [...]
> > > > > 2. With a SCHED_FIFO task that busy loops on a given CPU,
> > > > > and kworker for that CPU at SCHED_OTHER priority,
> > > > > queuing work to sync per-vmstats will either cause that
> > > > > work to never execute, or stalld (i.e. stall daemon)
> > > > > boosts kworker priority which causes a latency
> > > > > violation
> > > >
> > > > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > > > Or can they?
> > >
> > > Consider SCHED_FIFO task that is polling the network queue (say
> > > testpmd).
> > >
> > > do {
> > > if (net_registers->state & DATA_AVAILABLE) {
> > > process_data)();
> > > }
> > > } while (!stopped);
> > >
> > > Since this task runs at SCHED_FIFO priority, kworker won't
> > > be scheduled to run (therefore per-CPU vmstats won't be
> > > flushed to global vmstats).
> >
> > Yes, that is certainly possible. But my main point is that vmstat
> > imprecision shouldn't cause functional problems. That is why we have
> > _snapshot readers to get an exact value where it matters for
> > consistency.
>
> Understood. Perhaps allow_direct_reclaim should use
> zone_page_state_snapshot, as otherwise it is only precise
> at sysctl_stat_interval intervals?

Or even much less often than that. The flusher uses the WQ infrastructure,
and even when a WQ_MEM_RECLAIM workqueue is used, this doesn't guarantee
that the workers cannot all be jammed.

> > > Or, if testpmd runs at SCHED_OTHER, then the work item to
> > > flush per-CPU vmstats causes
> > >
> > > testpmd -> kworker
> > > kworker: flush per-CPU vmstats
> > > kworker -> testpmd
> >
> > And this might cause undesired latencies to the packets being
> > processed by the testpmd task.
>
> > Right but can you have any latencies expectation in a situation like
> > that?
>
> Not sure i understand what you mean. Example:
>
> https://www.gabrieleara.it/assets/documents/papers/conferences/2021-ieee-nfv-sdn.pdf
>
> In general, UDPDK exhibits a much lower
> latency than the in-kernel UDP stack used through the POSIX
> API (e.g., a 69 % reduction from 95 µs down to 29 µs), thanks
> to its ability to bypass the kernel entirely, which in turn
> outperforms the in-kernel TCP stack as available through the
> POSIX API, as expected.
> ...
> Alternatively, application processes can use UDPDK
> with the non-blocking API calls (using the O_NONBLOCK flag)
> and perform some other action while waiting for packets to
> be ready to be sent/received to/from the UDPDK Process,
> instead of performing continuous busy-loops on packet queues.
> However, in this case the cost of a single CPU fully busy due
> to the UDPDK Process itself is anyway unavoidab

If the userspace workload avoids the kernel completely then it is quite
unlikely that there is any pcp work to be flushed for in-kernel
counters.

That being said, I am not saying remote flushing is not useful. I just
think that the issue you are reporting here could be fixed by a much
simpler fix that doesn't change how the flushing is performed.
Such a large rework should be justified by performance numbers. It
should also be explained how we end up doing a lot of work on
isolated CPUs or for a pure user space workload.

--
Michal Hocko
SUSE Labs

2023-03-15 14:24:01

by Marcelo Tosatti

Subject: Re: [PATCH v5 00/12] fold per-CPU vmstats remotely

On Tue, Mar 14, 2023 at 10:01:06PM +0100, Michal Hocko wrote:
> On Tue 14-03-23 15:49:09, Marcelo Tosatti wrote:
> > On Tue, Mar 14, 2023 at 03:31:21PM +0100, Michal Hocko wrote:
> > > On Tue 14-03-23 09:59:37, Marcelo Tosatti wrote:
> > > > On Tue, Mar 14, 2023 at 01:25:53PM +0100, Michal Hocko wrote:
> > > > > On Mon 13-03-23 13:25:07, Marcelo Tosatti wrote:
> > > > > > This patch series addresses the following two problems:
> > > > > >
> > > > > > 1. A customer provided some evidence which indicates that
> > > > > > the idle tick was stopped; albeit, CPU-specific vmstat
> > > > > > counters still remained populated.
> > > > > >
> > > > > > Thus one can only assume quiet_vmstat() was not
> > > > > > invoked on return to the idle loop. If I understand
> > > > > > correctly, I suspect this divergence might erroneously
> > > > > > prevent a reclaim attempt by kswapd. If the number of
> > > > > > zone specific free pages are below their per-cpu drift
> > > > > > value then zone_page_state_snapshot() is used to
> > > > > > compute a more accurate view of the aforementioned
> > > > > > statistic. Thus any task blocked on the NUMA node
> > > > > > specific pfmemalloc_wait queue will be unable to make
> > > > > > significant progress via direct reclaim unless it is
> > > > > > killed after being woken up by kswapd
> > > > > > (see throttle_direct_reclaim())
> > > > >
> > > > > I have hard time to follow the actual problem described above. Are you
> > > > > suggesting that a lack of pcp vmstat counters update has led to
> > > > > reclaim issues? What is the said "evidence"? Could you share more of the
> > > > > story please?
> > > >
> > > >
> > > > - The process was trapped in throttle_direct_reclaim().
> > > > The function wait_event_killable() was called to wait condition
> > > > allow_direct_reclaim(pgdat) for current node to be true.
> > > > The allow_direct_reclaim(pgdat) examined the number of free pages
> > > > on the node by zone_page_state() which just returns value in
> > > > zone->vm_stat[NR_FREE_PAGES].
> > > >
> > > > - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
> > > > However, the freelist on this node was not empty.
> > > >
> > > > - This inconsistency in the vmstat value was caused by percpu vmstat on
> > > > nohz_full cpus. Every increment/decrement of vmstat is performed
> > > > on the percpu vmstat counter first, then the pooled diffs are accumulated
> > > > into the zone's vmstat counter in a timely manner. However, on nohz_full
> > > > cpus (in the case of this customer's system, 48 of 52 cpus) these pooled
> > > > diffs were not accumulated once the cpu had no events on it, so
> > > > the cpu slept indefinitely.
> > > > I checked the percpu vmstat and found a total of 69 counts not yet
> > > > accumulated into the zone's vmstat counter.
> > > >
> > > > - In this situation, kswapd did not help the trapped process.
> > > > In pgdat_balanced(), zone_watermark_ok_safe() examined the number
> > > > of free pages on the node via zone_page_state_snapshot(), which
> > > > also counts the pending values in the percpu vmstat.
> > > > Therefore kswapd correctly saw that there were 69 free pages.
> > > > Since zone->_watermark = {8, 20, 32}, kswapd did nothing because
> > > > 69 was greater than the high watermark of 32.
> > >
> > > If the imprecision of allow_direct_reclaim is the underlying problem why
> > > haven't you used zone_page_state_snapshot instead?
> >
> > It might have dealt with problem #1 for this particular case. However,
> > looking at the callers of zone_page_state:
> >
> > 5 2227 mm/compaction.c <<compaction_suitable>>
> >     zone_page_state(zone, NR_FREE_PAGES));
> > 6 124 mm/highmem.c <<__nr_free_highpages>>
> >     pages += zone_page_state(zone, NR_FREE_PAGES);
> > 7 283 mm/page-writeback.c <<node_dirtyable_memory>>
> >     nr_pages += zone_page_state(zone, NR_FREE_PAGES);
> > 8 318 mm/page-writeback.c <<highmem_dirtyable_memory>>
> >     nr_pages = zone_page_state(z, NR_FREE_PAGES);
> > 9 321 mm/page-writeback.c <<highmem_dirtyable_memory>>
> >     nr_pages += zone_page_state(z, NR_ZONE_INACTIVE_FILE);
> > 10 322 mm/page-writeback.c <<highmem_dirtyable_memory>>
> >     nr_pages += zone_page_state(z, NR_ZONE_ACTIVE_FILE);
> > 11 3091 mm/page_alloc.c <<__rmqueue>>
> >     zone_page_state(zone, NR_FREE_CMA_PAGES) >
> > 12 3092 mm/page_alloc.c <<__rmqueue>>
> >     zone_page_state(zone, NR_FREE_PAGES) / 2) {
> >
> > The suggested patchset fixes the problem where, due to nohz_full,
> > the delayed timer for vmstat_work can be armed but not executed (which means
> > the per-cpu counters can be out of sync for as long as the cpu is
> > idle in nohz_full mode).
> >
> > You probably do not want to convert all callers of zone_page_state
> > into zone_page_state_snapshot (as a justification for the proposed
> > patchset).
>
> Yes, I do not really think we want or even need to convert all of them.

OK.

> But it seems that your direct reclaim throttling example really requires
> that. The thing with the remote flushing is that it would suffer from
> a similar imprecision, as the flushing could be deferred and under
> certain conditions really starved.

> So it is definitely worth fixing the
> issue you are seeing without such a complex scheme.

The scheme is necessary for other reasons.
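
To make the scheme concrete, here is a minimal userspace sketch of the
idea from the cover letter: the owning CPU updates its delta with a
cmpxchg loop, and a remote CPU folds that delta into the global counter
with an xchg. It uses C11 atomics; pcp_diff, global_count,
model_mod_state() and model_fold_remote() are hypothetical names for
illustration only, not the code in the series.

```c
#include <stdatomic.h>

#define NR_CPUS 4

/* Hypothetical model: one pending delta per CPU plus one global counter. */
static _Atomic int  pcp_diff[NR_CPUS];
static _Atomic long global_count;

/* Local path: the owning CPU adjusts its own delta with a cmpxchg loop. */
static void model_mod_state(int cpu, int delta)
{
	int old = atomic_load(&pcp_diff[cpu]);

	/* On failure 'old' is reloaded with the current value, so just retry. */
	while (!atomic_compare_exchange_weak(&pcp_diff[cpu], &old, old + delta))
		;
}

/* Remote path: any CPU can take another CPU's delta and fold it globally. */
static void model_fold_remote(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		/* Atomically read the pending delta and reset it to zero. */
		int v = atomic_exchange(&pcp_diff[cpu], 0);

		if (v)
			atomic_fetch_add(&global_count, v);
	}
}
```

Because both paths use atomic read-modify-write operations on the same
word, the folding side does not need to run on the CPU that owns the
delta, which is the point for isolated or busy-looping CPUs.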

> > > Anyway, this is kind of information that is really helpful to have in
> > > the patch description.
> >
> > Agree: resending a new version with updated commit.
>
> I would really recommend trying out the simple fix and see if it changes
> the behavior.
>
> > > [...]
> > > > > > 2. With a SCHED_FIFO task that busy loops on a given CPU,
> > > > > > and kworker for that CPU at SCHED_OTHER priority,
> > > > > > queuing work to sync per-vmstats will either cause that
> > > > > > work to never execute, or stalld (i.e. stall daemon)
> > > > > > boosts kworker priority which causes a latency
> > > > > > violation
> > > > >
> > > > > Why is that a problem? Out-of-sync stats shouldn't cause major problems.
> > > > > Or can they?
> > > >
> > > > Consider a SCHED_FIFO task that is polling the network queue (say
> > > > testpmd).
> > > >
> > > > do {
> > > >         if (net_registers->state & DATA_AVAILABLE) {
> > > >                 process_data();
> > > >         }
> > > > } while (!stopped);
> > > >
> > > > Since this task runs at SCHED_FIFO priority, kworker won't
> > > > be scheduled to run (therefore per-CPU vmstats won't be
> > > > flushed to global vmstats).
> > >
> > > Yes, that is certainly possible. But my main point is that vmstat
> > > imprecision shouldn't cause functional problems. That is why we have
> > > _snapshot readers to get an exact value where it matters for
> > > consistency.
> >
> > Understood. Perhaps allow_direct_reclaim should use
> > zone_page_state_snapshot, as otherwise it is only precise
> > at sysctl_stat_interval intervals?
>
> or even much less than that. The flusher uses WQ infrastructure and even
> when a WQ_MEM_RECLAIM one is used this doesn't mean that the workers
> could not be jammed.
>
> > > > Or, if testpmd runs at SCHED_OTHER, then the work item to
> > > > flush per-CPU vmstats causes
> > > >
> > > > testpmd -> kworker
> > > > kworker: flush per-CPU vmstats
> > > > kworker -> testpmd
> > >
> > > And this might cause undesired latencies to the packets being
> > > processed by the testpmd task.
> >
> > > Right but can you have any latencies expectation in a situation like
> > > that?
> >
> > Not sure I understand what you mean. Example:
> >
> > https://www.gabrieleara.it/assets/documents/papers/conferences/2021-ieee-nfv-sdn.pdf
> >
> > In general, UDPDK exhibits a much lower
> > latency than the in-kernel UDP stack used through the POSIX
> > API (e.g., a 69 % reduction from 95 µs down to 29 µs), thanks
> > to its ability to bypass the kernel entirely, which in turn
> > outperforms the in-kernel TCP stack as available through the
> > POSIX API, as expected.
> > ...
> > Alternatively, application processes can use UDPDK
> > with the non-blocking API calls (using the O_NONBLOCK flag)
> > and perform some other action while waiting for packets to
> > be ready to be sent/received to/from the UDPDK Process,
> > instead of performing continuous busy-loops on packet queues.
> > However, in this case the cost of a single CPU fully busy due
> > to the UDPDK Process itself is anyway unavoidable.
>
> If the userspace workload avoids the kernel completely then it is quite
> unlikely that there is any pcp work to be flushed for in-kernel
> counters.

This particular workload avoids the kernel. Others (where latency is
still a concern) don't.

> That being said, I am not saying remote flushing is not useful.

> I just think that the issue you are reporting here could be fixed by
> a much simpler fix that doesn't change how the flushing is
> performed.

OK. Must change flushing anyway, but fixing allow_direct_reclaim to
use zone_page_state_snapshot won't hurt.
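
For reference, a minimal userspace model of the difference between the
two readers, using the numbers from the report above (global counter 0,
69 pages pending in per-CPU diffs, high watermark 32). struct zone_model
and the model_* functions are hypothetical stand-ins for illustration,
not the kernel definitions, and the split of the 69 pages across CPUs is
made up.

```c
#define NR_CPUS 4

/* Hypothetical, simplified stand-in for a zone's free-page accounting. */
struct zone_model {
	long global_free;        /* models zone->vm_stat[NR_FREE_PAGES] */
	int  pcp_diff[NR_CPUS];  /* models the pending per-CPU deltas */
};

/* Cheap reader: global counter only, pending per-CPU deltas are ignored. */
static long model_page_state(const struct zone_model *z)
{
	long x = z->global_free;

	return x < 0 ? 0 : x;
}

/* Snapshot reader: global counter plus every pending per-CPU delta. */
static long model_page_state_snapshot(const struct zone_model *z)
{
	long x = z->global_free;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		x += z->pcp_diff[cpu];

	return x < 0 ? 0 : x;
}

/* Returns 1 if the snapshot value is above the given watermark. */
static int model_above_watermark(const struct zone_model *z, long wmark)
{
	return model_page_state_snapshot(z) > wmark;
}
```

With global_free = 0 and pcp_diff = {20, 30, 19, 0}, the cheap reader
sees no free pages while the snapshot reader sees 69, which is above the
high watermark of 32. That is the combination in which the throttled
task keeps waiting and kswapd considers the node balanced.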

> Such a large rework should be justified by performance numbers.

OK.

> It should also be explained how we end up doing a lot of work on
> isolated cpus or a pure user space workload.

Again, a pure user space workload is one example where latency matters;
it was given in response to the "can you have any latencies expectation
in a situation like that?" question.

Will resend -v8 with allow_direct_reclaim fix.