2022-02-09 13:23:54

by Marcelo Tosatti

Subject: Re: [PATCH 2/2] mm/page_alloc: Add remote draining support to per-cpu lists

On Tue, Feb 08, 2022 at 11:07:50AM +0100, Nicolas Saenz Julienne wrote:
> The page allocator's per-cpu page lists (pcplists) are currently
> protected using local_locks. While good for performance, this doesn't
> allow for remote access to these structures. CPUs requiring
> system-wide changes to the per-cpu lists get around this by scheduling
> workers on each CPU. That said, some setups, like NOHZ_FULL CPUs,
> aren't well suited to this since they can't handle interruptions
> of any sort.
>
> To mitigate this, replace the current draining mechanism with one that
> allows remotely draining the lists:
>
> - Each CPU now has two pcplists pointers: one, 'pcp->lp', that points
> to the instance currently in use, and another, 'pcp->drain', that
> points to an idle and empty instance. CPUs access their local
> pcplists through 'pcp->lp', and the pointer is dereferenced atomically.
>
> - When a CPU decides it needs to empty some remote pcplists, it'll
> atomically exchange the remote CPU's 'pcp->lp' and 'pcp->drain'
> pointers. A remote CPU racing with this will either have:
>
> - An old 'pcp->lp' reference; the instance it points to will soon be
> emptied by the drain process, so we just have to wait for the
> remote CPU to finish using it.
>
> - The new 'pcp->lp' reference, that is, an empty pcplists instance.
> rcu_replace_pointer()'s release semantics ensure any prior
> changes will be visible to the remote CPU, for example: changes
> to 'pcp->high' and 'pcp->batch' when disabling the pcplists.
>
> - The CPU that started the drain can now wait for an RCU grace period
> to make sure the remote CPU is done using the old pcplists.
> synchronize_rcu() counts as a full memory barrier, so any changes the
> local CPU makes to the soon to be drained pcplists will be visible to
> the draining CPU once it returns.
>
> - Then the CPU can safely free the old pcplists, as nobody else holds
> a reference to it. Note that concurrent access to the remote
> pcplists drain is serialized by the 'pcpu_drain_mutex'. A sketch of
> this sequence follows the list.
>
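To make the sequence above concrete, here is a minimal sketch of the
drain side, assuming the names from the commit message ('pcp->lp',
'pcp->drain', 'pcpu_drain_mutex'); the function name and the
page-freeing helper are placeholders, not the patch's actual code:

    /* Minimal sketch of the drain side; drain_remote_pcp() and
     * free_old_pcplists() are hypothetical names, not the patch's code. */
    static void drain_remote_pcp(struct zone *zone, struct per_cpu_pages *pcp)
    {
            struct pcplists *old;

            mutex_lock(&pcpu_drain_mutex);

            /* Atomically swap the in-use and idle instances. The release
             * semantics publish prior updates (e.g. pcp->high/pcp->batch)
             * to the remote CPU. */
            old = rcu_replace_pointer(pcp->lp, pcp->drain,
                                      lockdep_is_held(&pcpu_drain_mutex));

            /* Wait for the remote CPU to drop any reference to 'old'. */
            synchronize_rcu();

            /* 'old' is now private to us: free its pages and park it as
             * the new idle instance. */
            free_old_pcplists(zone, old);
            pcp->drain = old;

            mutex_unlock(&pcpu_drain_mutex);
    }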
> From an RCU perspective, we're only protecting access to the pcplists
> pointer; the drain operation is the writer and the local_lock critical
> sections are the readers. RCU guarantees atomicity both while
> dereferencing the pcplists pointer and while replacing it. It also
> checks for RCU critical section/locking correctness, as all readers
> have to hold their per-cpu pagesets local_lock, which also counts as a
> critical section from RCU's perspective.
>
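And the reader side, per the description above (again a sketch; the
'pagesets' local_lock name and the lockdep check expression are
assumptions based on mm/page_alloc.c of that era, not verified against
the patch):

    /* Sketch of a reader; 'pagesets.lock' is assumed, not verified. */
    unsigned long flags;
    struct pcplists *lp;

    local_lock_irqsave(&pagesets.lock, flags);
    lp = rcu_dereference_check(pcp->lp,
                    lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
    /* ... add/remove pages on lp->lists, adjust lp->count ... */
    local_unlock_irqrestore(&pagesets.lock, flags);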
> From a performance perspective, on production configurations, the patch
> adds an extra dereference to all hot paths (under such circumstances
> rcu_dereference() will simplify to READ_ONCE()). Extensive measurements
> have been performed on different architectures to ascertain the
> performance impact is minimal. Most platforms don't see any difference
> and the worst-case scenario shows a 1-3% degradation on a page
> allocation micro-benchmark. See cover letter for in-depth results.
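In other words, on builds without RCU debugging the hot-path cost
described above reduces to roughly this (illustrative):

    /* With CONFIG_PROVE_RCU=n, rcu_dereference_check() boils down to a
     * plain dependency-ordered load: */
    lp = READ_ONCE(pcp->lp);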
>
> Accesses to the pcplists like the ones in mm/vmstat.c don't require
> RCU supervision, since they can handle outdated data, but they do use
> rcu_access_pointer() to avoid compiler weirdness and keep sparse happy.
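A stale-tolerant read of that kind might look like this (a sketch, not
necessarily the patch's exact mm/vmstat.c code):

    /* Outdated data is acceptable here, so no read-side critical
     * section is needed; rcu_access_pointer() just keeps sparse quiet. */
    count = rcu_access_pointer(pcp->lp)->count;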
>
> Note that special care has been taken to verify there are no races with
> the memory hotplug code paths. Notably with calls to zone_pcp_reset().
> As Mel Gorman explains in a previous patch[1]: "The existing hotplug
> paths guarantees the pcplists cannot be used after zone_pcp_enable()
> [the one in offline_pages()]. That should be the case already because
> all the pages have been freed and there is no page to put on the PCP
> lists."
>
> All in all, this technique allows for remote draining on all setups with
> an acceptable performance impact. It benefits all sorts of use cases:
> low-latency, real-time, HPC, idle systems, KVM guests.
>
> [1] 8ca559132a2d ("mm/memory_hotplug: remove broken locking of zone PCP
> structures during hot remove")
>
> Signed-off-by: Nicolas Saenz Julienne <[email protected]>
> ---
>
> Changes since RFC:
> - Avoid unnecessary spin_lock_irqsave/restore() in free_pcppages_bulk()
> - Add more detail to commit and code comments.
> - Use synchronize_rcu() instead of synchronize_rcu_expedited(), the RCU
> documentation says to avoid it unless really justified. I don't think
> it's justified here; if we can schedule and join works, waiting for
> an RCU grace period is OK.

https://patchwork.ozlabs.org/project/netdev/patch/1306228052.3026.16.camel@edumazet-laptop/

Adding 100ms to the direct reclaim path might be problematic. It will also
slow down kcompactd (note it'll call drain_all_pages() for each zone).

> - Avoid sparse warnings by using rcu_access_pointer() and
> rcu_dereference_protected().
>
> include/linux/mmzone.h | 22 +++++-
> mm/page_alloc.c | 155 ++++++++++++++++++++++++++---------------
> mm/vmstat.c | 6 +-
> 3 files changed, 120 insertions(+), 63 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b4cb85d9c6e8..b0b593fd8e48 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -388,13 +388,31 @@ struct per_cpu_pages {
> short expire; /* When 0, remote pagesets are drained */
> #endif
>
> - struct pcplists *lp;
> + /*
> + * As a rule of thumb, any access to struct per_cpu_pages's 'lp' has
> + * happen with the pagesets local_lock held and using
> + * rcu_dereference_check(). If there is a need to modify both
> + * 'lp->count' and 'lp->lists' in the same critical section 'pcp->lp'
> + * can only be derefrenced once. See for example:

Typo.



2022-02-15 14:32:36

by Nicolas Saenz Julienne

Subject: Re: [PATCH 2/2] mm/page_alloc: Add remote draining support to per-cpu lists

On Tue, 2022-02-08 at 12:47 -0300, Marcelo Tosatti wrote:
> > Changes since RFC:
> > - Avoid unnecessary spin_lock_irqsave/restore() in free_pcppages_bulk()
> > - Add more detail to commit and code comments.
> > - Use synchronize_rcu() instead of synchronize_rcu_expedited(), the RCU
> > documentation says to avoid it unless really justified. I don't think
> > it's justified here; if we can schedule and join works, waiting for
> > an RCU grace period is OK.
>
> https://patchwork.ozlabs.org/project/netdev/patch/1306228052.3026.16.camel@edumazet-laptop/
>
> Adding 100ms to the direct reclaim path might be problematic. It will also
> slow down kcompactd (note it'll call drain_all_pages() for each zone).

I did some measurements on an idle machine; the worst case was ~30ms. I agree
that might be too much for direct reclaim, so I'll switch back to expedited
and add a comment.
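In diff form, the change described here would amount to something like
this (a sketch; the comment wording is illustrative):

    -	synchronize_rcu();
    +	/* Direct reclaim and kcompactd cannot afford to wait tens of
    +	 * milliseconds for a normal grace period. */
    +	synchronize_rcu_expedited();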

> > - Avoid sparse warnings by using rcu_access_pointer() and
> > rcu_dereference_protected().
> >
> > include/linux/mmzone.h | 22 +++++-
> > mm/page_alloc.c | 155 ++++++++++++++++++++++++++---------------
> > mm/vmstat.c | 6 +-
> > 3 files changed, 120 insertions(+), 63 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index b4cb85d9c6e8..b0b593fd8e48 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -388,13 +388,31 @@ struct per_cpu_pages {
> > short expire; /* When 0, remote pagesets are drained */
> > #endif
> >
> > - struct pcplists *lp;
> > + /*
> > + * As a rule of thumb, any access to struct per_cpu_pages's 'lp' has
> > + * happen with the pagesets local_lock held and using
> > + * rcu_dereference_check(). If there is a need to modify both
> > + * 'lp->count' and 'lp->lists' in the same critical section 'pcp->lp'
> > + * can only be derefrenced once. See for example:
>
> Typo.

Noted.

Thanks!

--
Nicolás Sáenz

2022-02-15 20:02:10

by Paul E. McKenney

Subject: Re: [PATCH 2/2] mm/page_alloc: Add remote draining support to per-cpu lists

On Tue, Feb 15, 2022 at 09:47:35AM +0100, Nicolas Saenz Julienne wrote:
> On Tue, 2022-02-08 at 12:47 -0300, Marcelo Tosatti wrote:
> > > Changes since RFC:
> > > - Avoid unnecessary spin_lock_irqsave/restore() in free_pcppages_bulk()
> > > - Add more detail to commit and code comments.
> > > - Use synchronize_rcu() instead of synchronize_rcu_expedited(), the RCU
> > > documentation says to avoid it unless really justified. I don't think
> > > it's justified here; if we can schedule and join works, waiting for
> > > an RCU grace period is OK.
> >
> > https://patchwork.ozlabs.org/project/netdev/patch/1306228052.3026.16.camel@edumazet-laptop/
> >
> > Adding 100ms to the direct reclaim path might be problematic. It will also
> > slow down kcompactd (note it'll call drain_all_pages() for each zone).
>
> I did some measurements on an idle machine; the worst case was ~30ms. I agree
> that might be too much for direct reclaim, so I'll switch back to expedited
> and add a comment.

Given your measurements, it looks to me like this is a case where use
of expedited grace periods really is justified.

For one thing, expedited grace periods are much less disruptive than
they were in the old days, for example, back when they used stop-machine.
For another thing, systems that cannot tolerate the disturbance (an IPI
per non-idle non-nohz_full CPU per grace period, less than a wakeup)
can always be booted with rcupdate.rcu_normal=1, which will make
synchronize_rcu_expedited() act like synchronize_rcu(), at least once
RCU has spawned its kthreads. And CONFIG_PREEMPT_RT=y kernels forcibly
set this mode. ;-)
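For reference, the override Paul mentions is just a kernel
command-line parameter, e.g. appended at boot:

    rcupdate.rcu_normal=1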

Nevertheless, expedited grace periods should not be used lightly because
they do increase overhead.

Thanx, Paul

> > > - Avoid sparse warnings by using rcu_access_pointer() and
> > > rcu_dereference_protected().
> > >
> > > include/linux/mmzone.h | 22 +++++-
> > > mm/page_alloc.c | 155 ++++++++++++++++++++++++++---------------
> > > mm/vmstat.c | 6 +-
> > > 3 files changed, 120 insertions(+), 63 deletions(-)
> > >
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index b4cb85d9c6e8..b0b593fd8e48 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -388,13 +388,31 @@ struct per_cpu_pages {
> > > short expire; /* When 0, remote pagesets are drained */
> > > #endif
> > >
> > > - struct pcplists *lp;
> > > + /*
> > > + * As a rule of thumb, any access to struct per_cpu_pages's 'lp' has
> > > + * happen with the pagesets local_lock held and using
> > > + * rcu_dereference_check(). If there is a need to modify both
> > > + * 'lp->count' and 'lp->lists' in the same critical section 'pcp->lp'
> > > + * can only be derefrenced once. See for example:
> >
> > Typo.
>
> Noted.
>
> Thanks!
>
> --
> Nicolás Sáenz
>