2022-09-24 15:51:40

by Aaron Tomlin

Subject: [PATCH v8 3/5] mm/vmstat: Do not queue vmstat_update if tick is stopped

From: Marcelo Tosatti <[email protected]>

From the vmstat shepherd, for CPUs that have the tick stopped, do not
queue local work to flush the per-CPU vmstats, since in that case the
flush is performed on return to userspace or when entering idle. Also
cancel any delayed work on the local CPU, when entering idle on nohz
full CPUs. Per-CPU pages can be freed remotely from housekeeping CPUs.

Signed-off-by: Marcelo Tosatti <[email protected]>
---
mm/vmstat.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 472175642bd9..3b9a497965b4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -29,6 +29,7 @@
#include <linux/page_ext.h>
#include <linux/page_owner.h>
#include <linux/migrate.h>
+#include <linux/tick.h>

#include "internal.h"

@@ -1990,19 +1991,23 @@ static void vmstat_update(struct work_struct *w)
*/
void quiet_vmstat(void)
{
+ struct delayed_work *dw;
+
if (system_state != SYSTEM_RUNNING)
return;

if (!is_vmstat_dirty())
return;

+ refresh_cpu_vm_stats(false);
+
/*
- * Just refresh counters and do not care about the pending delayed
- * vmstat_update. It doesn't fire that often to matter and canceling
- * it would be too expensive from this path.
- * vmstat_shepherd will take care about that for us.
+ * If the tick is stopped, cancel any delayed work to avoid
+ * interruptions to this CPU in the future.
*/
- refresh_cpu_vm_stats(false);
+ dw = &per_cpu(vmstat_work, smp_processor_id());
+ if (delayed_work_pending(dw) && tick_nohz_tick_stopped())
+ cancel_delayed_work(dw);
}

/*
@@ -2024,6 +2029,9 @@ static void vmstat_shepherd(struct work_struct *w)
for_each_online_cpu(cpu) {
struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

+ if (tick_nohz_tick_stopped_cpu(cpu))
+ continue;
+
if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

--
2.37.1


2022-10-24 11:32:26

by Frederic Weisbecker

Subject: Re: [PATCH v8 3/5] mm/vmstat: Do not queue vmstat_update if tick is stopped

On Sat, Sep 24, 2022 at 04:22:25PM +0100, Aaron Tomlin wrote:
> From: Marcelo Tosatti <[email protected]>
>
> From the vmstat shepherd, for CPUs that have the tick stopped, do not
> queue local work to flush the per-CPU vmstats, since in that case the
> flush is performed on return to userspace or when entering idle. Also
> cancel any delayed work on the local CPU, when entering idle on nohz
> full CPUs. Per-CPU pages can be freed remotely from housekeeping CPUs.
>
> Signed-off-by: Marcelo Tosatti <[email protected]>
> ---
> mm/vmstat.c | 18 +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 472175642bd9..3b9a497965b4 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -29,6 +29,7 @@
> #include <linux/page_ext.h>
> #include <linux/page_owner.h>
> #include <linux/migrate.h>
> +#include <linux/tick.h>
>
> #include "internal.h"
>
> @@ -1990,19 +1991,23 @@ static void vmstat_update(struct work_struct *w)
> */
> void quiet_vmstat(void)
> {
> + struct delayed_work *dw;
> +
> if (system_state != SYSTEM_RUNNING)
> return;
>
> if (!is_vmstat_dirty())
> return;
>
> + refresh_cpu_vm_stats(false);
> +
> /*
> - * Just refresh counters and do not care about the pending delayed
> - * vmstat_update. It doesn't fire that often to matter and canceling
> - * it would be too expensive from this path.
> - * vmstat_shepherd will take care about that for us.
> + * If the tick is stopped, cancel any delayed work to avoid
> + * interruptions to this CPU in the future.
> */
> - refresh_cpu_vm_stats(false);
> + dw = &per_cpu(vmstat_work, smp_processor_id());
> + if (delayed_work_pending(dw) && tick_nohz_tick_stopped())
> + cancel_delayed_work(dw);
> }
>
> /*
> @@ -2024,6 +2029,9 @@ static void vmstat_shepherd(struct work_struct *w)
> for_each_online_cpu(cpu) {
> struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>
> + if (tick_nohz_tick_stopped_cpu(cpu))
> + continue;
> +
> if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
> queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

All these checks are racy though. You may well eventually:

1) Arm the timer after the CPU has entered in userspace
2) Not arm the timer when the CPU has entered the kernel

How about converting that to an IPI instead? This should be a good candidate
for the future IPI deferment.
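
To make that concrete, a rough sketch of the IPI variant (names invented; whether
refresh_cpu_vm_stats() may run from the IPI handler's IRQ context is exactly the
kind of thing the conversion would have to sort out, and with IPI deferment the
call would only land on the next kernel entry anyway):

static void vmstat_flush_remote(void *info)
{
	/* Runs on the target CPU, from the function-call IPI. */
	refresh_cpu_vm_stats(false);
}

/* Called from vmstat_shepherd(), which already runs under cpus_read_lock(). */
static void vmstat_shepherd_kick(int cpu)
{
	if (per_cpu(vmstat_dirty, cpu))
		smp_call_function_single(cpu, vmstat_flush_remote, NULL, 0);
}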

Another possible way to go is this:

1) vmstat_shepherd completely ignores nohz_full CPUs
2) vmstat_work is only ever armed locally
3) A nohz_full CPU turning its local vmstat as dirty checks if vmstat_work is
pending. If not, queue it, possibly through a self IPI (IRQ_WORK) to get
away with current locking context.
4) Fold on idle if dirty
5) Fold on user enter and disarm vmstat_work if pending
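
To make 2) and 3) concrete, a purely illustrative sketch (assuming the series'
per-CPU vmstat_dirty flag; vmstat_irq_work and the helper names are invented, and
queueing the delayed work from the irq_work handler is what gets around the
locking context):

static void vmstat_queue_local_work(struct irq_work *work)
{
	/* irq_work handler: a safe context to arm the local delayed work. */
	queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
			      this_cpu_ptr(&vmstat_work),
			      round_jiffies_relative(sysctl_stat_interval));
}

static DEFINE_PER_CPU(struct irq_work, vmstat_irq_work) = {
	.func = vmstat_queue_local_work,
};

/* hot path (preemption assumed disabled by the counter update itself) */
static void vmstat_mark_dirty(void)
{
	this_cpu_write(vmstat_dirty, true);
	if (tick_nohz_full_cpu(smp_processor_id()) &&
	    !delayed_work_pending(this_cpu_ptr(&vmstat_work)))
		irq_work_queue(this_cpu_ptr(&vmstat_irq_work));
}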

Does that sound possible?

Thanks.


>
> --
> 2.37.1
>

2022-10-24 13:16:06

by Frederic Weisbecker

Subject: Re: [PATCH v8 3/5] mm/vmstat: Do not queue vmstat_update if tick is stopped

On Sat, Sep 24, 2022 at 04:22:25PM +0100, Aaron Tomlin wrote:
> From: Marcelo Tosatti <[email protected]>
>
> From the vmstat shepherd, for CPUs that have the tick stopped, do not
> queue local work to flush the per-CPU vmstats, since in that case the
> flush is performed on return to userspace or when entering idle. Also
> cancel any delayed work on the local CPU, when entering idle on nohz
> full CPUs. Per-CPU pages can be freed remotely from housekeeping CPUs.
>
> Signed-off-by: Marcelo Tosatti <[email protected]>
> ---
> mm/vmstat.c | 18 +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 472175642bd9..3b9a497965b4 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -29,6 +29,7 @@
> #include <linux/page_ext.h>
> #include <linux/page_owner.h>
> #include <linux/migrate.h>
> +#include <linux/tick.h>
>
> #include "internal.h"
>
> @@ -1990,19 +1991,23 @@ static void vmstat_update(struct work_struct *w)
> */
> void quiet_vmstat(void)
> {
> + struct delayed_work *dw;
> +
> if (system_state != SYSTEM_RUNNING)
> return;
>
> if (!is_vmstat_dirty())
> return;
>
> + refresh_cpu_vm_stats(false);
> +
> /*
> - * Just refresh counters and do not care about the pending delayed
> - * vmstat_update. It doesn't fire that often to matter and canceling
> - * it would be too expensive from this path.
> - * vmstat_shepherd will take care about that for us.
> + * If the tick is stopped, cancel any delayed work to avoid
> + * interruptions to this CPU in the future.
> */
> - refresh_cpu_vm_stats(false);
> + dw = &per_cpu(vmstat_work, smp_processor_id());
> + if (delayed_work_pending(dw) && tick_nohz_tick_stopped())
> + cancel_delayed_work(dw);

This is doing the costly cancel_delayed_work(), which is only necessary
right before entering userspace.

There are places where the tick is stopped but it's not necessary to
cancel the work:

* nohz_full enter idle
* idle IRQs
* nohz_full exit idle
* nohz_full IRQ exit

I suggest having quiet_vmstat_enter_user() which does:

void quiet_vmstat_enter_user(void)
{
	struct delayed_work *dw = &per_cpu(vmstat_work, smp_processor_id());

	quiet_vmstat();
	if (delayed_work_pending(dw) && tick_nohz_tick_stopped())
		cancel_delayed_work(dw);
}

And call this one only before leaving the kernel. The rest can use quiet_vmstat().
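
The split in callers would then roughly look like this (where exactly each call
sits is up to the rest of the series):

	/* idle entry / other tick-stop points: fold only, keep it cheap */
	quiet_vmstat();

	/* nohz_full return to userspace: fold and also disarm the work */
	quiet_vmstat_enter_user();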

Thanks.

2022-11-09 20:54:30

by Marcelo Tosatti

Subject: Re: [PATCH v8 3/5] mm/vmstat: Do not queue vmstat_update if tick is stopped

On Mon, Oct 24, 2022 at 01:03:11PM +0200, Frederic Weisbecker wrote:
> On Sat, Sep 24, 2022 at 04:22:25PM +0100, Aaron Tomlin wrote:
> > From: Marcelo Tosatti <[email protected]>
> >
> > From the vmstat shepherd, for CPUs that have the tick stopped, do not
> > queue local work to flush the per-CPU vmstats, since in that case the
> > flush is performed on return to userspace or when entering idle. Also
> > cancel any delayed work on the local CPU, when entering idle on nohz
> > full CPUs. Per-CPU pages can be freed remotely from housekeeping CPUs.
> >
> > Signed-off-by: Marcelo Tosatti <[email protected]>
> > ---
> > mm/vmstat.c | 18 +++++++++++++-----
> > 1 file changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 472175642bd9..3b9a497965b4 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -29,6 +29,7 @@
> > #include <linux/page_ext.h>
> > #include <linux/page_owner.h>
> > #include <linux/migrate.h>
> > +#include <linux/tick.h>
> >
> > #include "internal.h"
> >
> > @@ -1990,19 +1991,23 @@ static void vmstat_update(struct work_struct *w)
> > */
> > void quiet_vmstat(void)
> > {
> > + struct delayed_work *dw;
> > +
> > if (system_state != SYSTEM_RUNNING)
> > return;
> >
> > if (!is_vmstat_dirty())
> > return;
> >
> > + refresh_cpu_vm_stats(false);
> > +
> > /*
> > - * Just refresh counters and do not care about the pending delayed
> > - * vmstat_update. It doesn't fire that often to matter and canceling
> > - * it would be too expensive from this path.
> > - * vmstat_shepherd will take care about that for us.
> > + * If the tick is stopped, cancel any delayed work to avoid
> > + * interruptions to this CPU in the future.
> > */
> > - refresh_cpu_vm_stats(false);
> > + dw = &per_cpu(vmstat_work, smp_processor_id());
> > + if (delayed_work_pending(dw) && tick_nohz_tick_stopped())
> > + cancel_delayed_work(dw);
> > }
> >
> > /*
> > @@ -2024,6 +2029,9 @@ static void vmstat_shepherd(struct work_struct *w)
> > for_each_online_cpu(cpu) {
> > struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> >
> > + if (tick_nohz_tick_stopped_cpu(cpu))
> > + continue;
> > +
> > if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
> > queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
> All these checks are racy though. You may well eventually:
>
> 1) Arm the timer after the CPU has entered in userspace
> 2) Not arm the timer when the CPU has entered the kernel
>
> How about converting that to an IPI instead? This should be a good candidate
> for the future IPI deferment.
>
> Another possible way to go is this:
>
> 1) vmstat_shepherd completely ignores nohz_full CPUs
> 2) vmstat_work is only ever armed locally
> 3) A nohz_full CPU turning its local vmstat as dirty checks if vmstat_work is
> pending. If not, queue it, possibly through a self IPI (IRQ_WORK) to get
> away with current locking context.

I'm afraid there might be workloads where local vmstat touch is a
hot-path.
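
For reference, the fast path in question is every per-CPU counter update, which
with this series also marks vmstat dirty; roughly (illustrative, assuming a
vmstat_mark_dirty()-style helper):

void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
			   long delta)
{
	/* ... existing per-CPU diff update ... */

	/*
	 * Marking dirty is cheap; adding a delayed_work_pending() test and
	 * a possible irq_work_queue() here is the hot path worry.
	 */
	vmstat_mark_dirty();
}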

> > 4) Fold on idle if dirty
> > 5) Fold on user enter and disarm vmstat_work if pending
>
> Does that sound possible?
>
> Thanks.

I guess so, but proper barriers would also work.

Do you have any particular reason for the 1-5 sequence above
instead of barriers?


2022-11-10 21:17:36

by Marcelo Tosatti

Subject: Re: [PATCH v8 3/5] mm/vmstat: Do not queue vmstat_update if tick is stopped

On Wed, Nov 09, 2022 at 04:40:26PM -0300, Marcelo Tosatti wrote:
> On Mon, Oct 24, 2022 at 01:03:11PM +0200, Frederic Weisbecker wrote:
> > On Sat, Sep 24, 2022 at 04:22:25PM +0100, Aaron Tomlin wrote:
> > > From: Marcelo Tosatti <[email protected]>
> > >
> > > From the vmstat shepherd, for CPUs that have the tick stopped, do not
> > > queue local work to flush the per-CPU vmstats, since in that case the
> > > flush is performed on return to userspace or when entering idle. Also
> > > cancel any delayed work on the local CPU, when entering idle on nohz
> > > full CPUs. Per-CPU pages can be freed remotely from housekeeping CPUs.
> > >
> > > Signed-off-by: Marcelo Tosatti <[email protected]>
> > > ---
> > > mm/vmstat.c | 18 +++++++++++++-----
> > > 1 file changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 472175642bd9..3b9a497965b4 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -29,6 +29,7 @@
> > > #include <linux/page_ext.h>
> > > #include <linux/page_owner.h>
> > > #include <linux/migrate.h>
> > > +#include <linux/tick.h>
> > >
> > > #include "internal.h"
> > >
> > > @@ -1990,19 +1991,23 @@ static void vmstat_update(struct work_struct *w)
> > > */
> > > void quiet_vmstat(void)
> > > {
> > > + struct delayed_work *dw;
> > > +
> > > if (system_state != SYSTEM_RUNNING)
> > > return;
> > >
> > > if (!is_vmstat_dirty())
> > > return;
> > >
> > > + refresh_cpu_vm_stats(false);
> > > +
> > > /*
> > > - * Just refresh counters and do not care about the pending delayed
> > > - * vmstat_update. It doesn't fire that often to matter and canceling
> > > - * it would be too expensive from this path.
> > > - * vmstat_shepherd will take care about that for us.
> > > + * If the tick is stopped, cancel any delayed work to avoid
> > > + * interruptions to this CPU in the future.
> > > */
> > > - refresh_cpu_vm_stats(false);
> > > + dw = &per_cpu(vmstat_work, smp_processor_id());
> > > + if (delayed_work_pending(dw) && tick_nohz_tick_stopped())
> > > + cancel_delayed_work(dw);
> > > }
> > >
> > > /*
> > > @@ -2024,6 +2029,9 @@ static void vmstat_shepherd(struct work_struct *w)
> > > for_each_online_cpu(cpu) {
> > > struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> > >
> > > + if (tick_nohz_tick_stopped_cpu(cpu))
> > > + continue;
> > > +
> > > if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
> > > queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> >
> > All these checks are racy though. You may well eventually:
> >
> > 1) Arm the timer after the CPU has entered in userspace
> > 2) Not arm the timer when the CPU has entered the kernel
> >
> > How about converting that to an IPI instead? This should be a good candidate
> > for the future IPI deferment.
> >
> > Another possible way to go is this:
> >
> > 1) vmstat_shepherd completely ignores nohz_full CPUs
> > 2) vmstat_work is only ever armed locally
> > 3) A nohz_full CPU turning its local vmstat as dirty checks if vmstat_work is
> > pending. If not, queue it, possibly through a self IPI (IRQ_WORK) to get
> > away with current locking context.
>
> I'm afraid there might be workloads where local vmstat touch is a
> hot-path.
>
> > 4) Fold on idle if dirty
> > 5) Fold on user enter and disarm vmstat_work if pending
> >
> > Does that sound possible?
> >
> > Thanks.
>
> I guess so, but proper barriers would also work.
>
> Do you have any particular reason for the 1-5 sequence above
> instead of barriers?

I think a per-CPU atomic variable might be necessary, not just barriers.
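
One way to picture the per-CPU atomic (names invented; this alone doesn't close
every window, it just shows what cmpxchg ownership of a dirty interval buys):

enum { VMSTAT_CLEAN, VMSTAT_DIRTY, VMSTAT_WORK_QUEUED };

static DEFINE_PER_CPU(atomic_t, vmstat_state);

/* local fold, e.g. on user enter or idle: makes the shepherd back off */
static void vmstat_local_fold(void)
{
	atomic_t *st = this_cpu_ptr(&vmstat_state);

	if (atomic_cmpxchg(st, VMSTAT_DIRTY, VMSTAT_CLEAN) == VMSTAT_DIRTY)
		refresh_cpu_vm_stats(false);
}

/* shepherd: queue the work only if it wins the DIRTY -> QUEUED transition */
static void vmstat_shepherd_one(int cpu)
{
	atomic_t *st = per_cpu_ptr(&vmstat_state, cpu);

	if (atomic_cmpxchg(st, VMSTAT_DIRTY, VMSTAT_WORK_QUEUED) == VMSTAT_DIRTY)
		queue_delayed_work_on(cpu, mm_percpu_wq,
				      &per_cpu(vmstat_work, cpu), 0);
}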

Thanks.


2022-11-14 13:24:14

by Frederic Weisbecker

Subject: Re: [PATCH v8 3/5] mm/vmstat: Do not queue vmstat_update if tick is stopped

On Wed, Nov 09, 2022 at 04:40:26PM -0300, Marcelo Tosatti wrote:
> On Mon, Oct 24, 2022 at 01:03:11PM +0200, Frederic Weisbecker wrote:
> > On Sat, Sep 24, 2022 at 04:22:25PM +0100, Aaron Tomlin wrote:
> > > From: Marcelo Tosatti <[email protected]>
> > >
> > > From the vmstat shepherd, for CPUs that have the tick stopped, do not
> > > queue local work to flush the per-CPU vmstats, since in that case the
> > > flush is performed on return to userspace or when entering idle. Also
> > > cancel any delayed work on the local CPU, when entering idle on nohz
> > > full CPUs. Per-CPU pages can be freed remotely from housekeeping CPUs.
> > >
> > > Signed-off-by: Marcelo Tosatti <[email protected]>
> > > ---
> > > mm/vmstat.c | 18 +++++++++++++-----
> > > 1 file changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 472175642bd9..3b9a497965b4 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -29,6 +29,7 @@
> > > #include <linux/page_ext.h>
> > > #include <linux/page_owner.h>
> > > #include <linux/migrate.h>
> > > +#include <linux/tick.h>
> > >
> > > #include "internal.h"
> > >
> > > @@ -1990,19 +1991,23 @@ static void vmstat_update(struct work_struct *w)
> > > */
> > > void quiet_vmstat(void)
> > > {
> > > + struct delayed_work *dw;
> > > +
> > > if (system_state != SYSTEM_RUNNING)
> > > return;
> > >
> > > if (!is_vmstat_dirty())
> > > return;
> > >
> > > + refresh_cpu_vm_stats(false);
> > > +
> > > /*
> > > - * Just refresh counters and do not care about the pending delayed
> > > - * vmstat_update. It doesn't fire that often to matter and canceling
> > > - * it would be too expensive from this path.
> > > - * vmstat_shepherd will take care about that for us.
> > > + * If the tick is stopped, cancel any delayed work to avoid
> > > + * interruptions to this CPU in the future.
> > > */
> > > - refresh_cpu_vm_stats(false);
> > > + dw = &per_cpu(vmstat_work, smp_processor_id());
> > > + if (delayed_work_pending(dw) && tick_nohz_tick_stopped())
> > > + cancel_delayed_work(dw);
> > > }
> > >
> > > /*
> > > @@ -2024,6 +2029,9 @@ static void vmstat_shepherd(struct work_struct *w)
> > > for_each_online_cpu(cpu) {
> > > struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> > >
> > > + if (tick_nohz_tick_stopped_cpu(cpu))
> > > + continue;
> > > +
> > > if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
> > > queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> >
> > All these checks are racy though. You may well eventually:
> >
> > 1) Arm the timer after the CPU has entered in userspace
> > 2) Not arm the timer when the CPU has entered the kernel
> >
> > How about converting that to an IPI instead? This should be a good candidate
> > for the future IPI deferment.
> >
> > Another possible way to go is this:
> >
> > 1) vmstat_shepherd completely ignores nohz_full CPUs
> > 2) vmstat_work is only ever armed locally
> > 3) A nohz_full CPU turning its local vmstat as dirty checks if vmstat_work is
> > pending. If not, queue it, possibly through a self IPI (IRQ_WORK) to get
> > away with current locking context.
>
> I'm afraid there might be workloads where local vmstat touch is a
> hot-path.
>
> > 4) Fold on idle if dirty
> > 5) Fold on user enter and disarm vmstat_work if pending
> >
> > Does that sound possible?
> >
> > Thanks.
>
> I guess so, but proper barriers would also work.
>
> Do you have any particular reason for the 1-5 sequence above
> instead of barriers?

That means adding an smp_mb() on user enter, or atomic_cmpxchg().
But then deferred IPI would handle all that for you, right?