2023-02-09 15:34:20

by Marcelo Tosatti

Subject: [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely

Now that the counters are modified via cmpxchg both CPU-locally
(via the account functions) and remotely (via cpu_vm_stats_fold),
it is possible to switch vmstat_shepherd to perform the per-CPU
vmstat folding remotely.
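
For illustration only, a minimal sketch of what a remote fold of a single
zone counter could look like (this is not the series' actual
cpu_vm_stats_fold() implementation; it assumes the mainline
per_cpu_zonestat layout with its vm_stat_diff[] array):

static void fold_zone_counter_remotely(struct zone *zone, int cpu,
				       enum zone_stat_item item)
{
	struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
	long v;

	/*
	 * The local accounting functions update vm_stat_diff[] with cmpxchg,
	 * so a remote CPU can atomically steal the accumulated delta and
	 * fold it into the global zone counter.
	 */
	do {
		v = pzstats->vm_stat_diff[item];
		if (!v)
			return;
	} while (cmpxchg(&pzstats->vm_stat_diff[item], v, 0) != v);

	zone_page_state_add(v, zone, item);
}

vmstat_shepherd can then perform this kind of fold for every online CPU
from a housekeeping CPU, instead of queueing per-CPU work.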

This fixes the following two problems:

1. A customer provided evidence indicating that the idle tick
   was stopped, yet the CPU-specific vmstat counters remained
   populated.

   One can therefore only assume that quiet_vmstat() was not
   invoked on return to the idle loop. If I understand
   correctly, I suspect this divergence might erroneously
   prevent a reclaim attempt by kswapd. If the number of
   zone-specific free pages is below the per-CPU drift value,
   zone_page_state_snapshot() is used to compute a more
   accurate view of that statistic. Any task blocked on the
   NUMA-node-specific pfmemalloc_wait queue would then be
   unable to make significant progress via direct reclaim
   unless it is killed after being woken up by kswapd (see
   throttle_direct_reclaim(); a sketch of the snapshot logic
   follows below the list).

2. With a SCHED_FIFO task that busy loops on a given CPU, and
   the kworker for that CPU at SCHED_OTHER priority, queueing
   work to sync the per-CPU vmstats will either cause that
   work to never execute, or stalld (the stall daemon) will
   boost the kworker's priority, which causes a latency
   violation.
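
For reference on problem 1, a simplified sketch of the
zone_page_state_snapshot() logic (roughly as found in
include/linux/vmstat.h) illustrates why stale per-CPU diffs matter:

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);

#ifdef CONFIG_SMP
	int cpu;

	/* Sum each CPU's pending, not-yet-folded delta into the total. */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
#endif
	return x;
}

If quiet_vmstat() is not invoked before the idle tick is stopped, the
vm_stat_diff[] entries summed above can remain populated, which is the
divergence described in problem 1.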

Signed-off-by: Marcelo Tosatti <[email protected]>

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -2007,6 +2007,23 @@ static void vmstat_shepherd(struct work_

 static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);

+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		cpu_vm_stats_fold(cpu);
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
+}
+#else
 static void vmstat_shepherd(struct work_struct *w)
 {
 	int cpu;
@@ -2026,6 +2043,7 @@ static void vmstat_shepherd(struct work_
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }
+#endif

 static void __init start_shepherd_timer(void)
 {




2023-03-02 21:03:23

by Peter Xu

Subject: Re: [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely

On Thu, Feb 09, 2023 at 12:02:00PM -0300, Marcelo Tosatti wrote:
> +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> +/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
> +static void vmstat_shepherd(struct work_struct *w)
> +{
> + int cpu;
> +
> + cpus_read_lock();
> + for_each_online_cpu(cpu) {
> + cpu_vm_stats_fold(cpu);

Nitpick: IIUC this line is the only change with CONFIG_HAVE_CMPXCHG_LOCAL
to replace the queuing. Would it be cleaner to move the ifdef into
vmstat_shepherd, then, and keep the common logic?

> + cond_resched();
> + }
> + cpus_read_unlock();
> +
> + schedule_delayed_work(&shepherd,
> + round_jiffies_relative(sysctl_stat_interval));
> +}
> +#else
> static void vmstat_shepherd(struct work_struct *w)
> {
> int cpu;
> @@ -2026,6 +2043,7 @@ static void vmstat_shepherd(struct work_
> schedule_delayed_work(&shepherd,
> round_jiffies_relative(sysctl_stat_interval));
> }
> +#endif
>
> static void __init start_shepherd_timer(void)
> {
>
>
>

--
Peter Xu


2023-03-02 22:03:36

by Peter Xu

Subject: Re: [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely

On Thu, Mar 02, 2023 at 06:16:42PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 02, 2023 at 04:01:07PM -0500, Peter Xu wrote:
> > On Thu, Feb 09, 2023 at 12:02:00PM -0300, Marcelo Tosatti wrote:
> > > +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > > +/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
> > > +static void vmstat_shepherd(struct work_struct *w)
> > > +{
> > > + int cpu;
> > > +
> > > + cpus_read_lock();
> > > + for_each_online_cpu(cpu) {
> > > + cpu_vm_stats_fold(cpu);
> >
> > Nitpick: IIUC this line is the only change with CONFIG_HAVE_CMPXCHG_LOCAL
> > to replace the queuing. Would it be cleaner to move the ifdef into
> > vmstat_shepherd, then, and keep the common logic?
>
> https://lore.kernel.org/lkml/20221223144150.GA79369@lothringen/

:-)

[...]

> So it seems the current separation is quite readable
> (unless you have a suggestion).

No, feel free to ignore any of my nitpicks if you don't think they fit. :)
Keeping it as is is fine with me.

--
Peter Xu


2023-03-02 22:19:59

by Marcelo Tosatti

Subject: Re: [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely

On Thu, Mar 02, 2023 at 04:01:07PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:02:00PM -0300, Marcelo Tosatti wrote:
> > +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > +/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
> > +static void vmstat_shepherd(struct work_struct *w)
> > +{
> > + int cpu;
> > +
> > + cpus_read_lock();
> > + for_each_online_cpu(cpu) {
> > + cpu_vm_stats_fold(cpu);
>
> Nitpick: IIUC this line is the only change with CONFIG_HAVE_CMPXCHG_LOCAL
> to replace the queuing. Would it be cleaner to move the ifdef into
> vmstat_shepherd, then, and keep the common logic?

https://lore.kernel.org/lkml/20221223144150.GA79369@lothringen/

Could have:

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
static void cpu_flush_vm_stats(int cpu)
{
	cpu_vm_stats_fold(cpu);
}
#else
static void cpu_flush_vm_stats(int cpu)
{
	struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

	if (!delayed_work_pending(dw) && need_update(cpu))
		queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
}
#endif

static void vmstat_shepherd(struct work_struct *w)
{
	int cpu;

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		cpu_flush_vm_stats(cpu);
		cond_resched();
	}
	cpus_read_unlock();

	schedule_delayed_work(&shepherd,
		round_jiffies_relative(sysctl_stat_interval));
}

This looks really awkward to me. But then, we don't want
schedule_delayed_work if !CONFIG_HAVE_CMPXCHG_LOCAL.
The common part would be the cpus_read_lock and for_each_online_cpu
loop.

So it seems the current separation is quite readable
(unless you have a suggestion).

> > + cond_resched();
> > + }
> > + cpus_read_unlock();
> > +
> > + schedule_delayed_work(&shepherd,
> > + round_jiffies_relative(sysctl_stat_interval));
> > +}
> > +#else
> > static void vmstat_shepherd(struct work_struct *w)
> > {
> > int cpu;
> > @@ -2026,6 +2043,7 @@ static void vmstat_shepherd(struct work_
> > schedule_delayed_work(&shepherd,
> > round_jiffies_relative(sysctl_stat_interval));
> > }
> > +#endif
> >
> > static void __init start_shepherd_timer(void)
> > {
> >
> >
> >
>
> --
> Peter Xu
>
>