2021-03-29 12:10:59

by Mel Gorman

Subject: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead

This series requires patches in Andrew's tree so the series is also
available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15

tldr: Jesper and Chuck, it would be nice to verify if this series helps
the allocation rate of the bulk page allocator. RT people, this
*partially* addresses some problems PREEMPT_RT has with the page
allocator but it needs review.

The PCP (per-cpu page allocator in page_alloc.c) shares locking requirements
with vmstat, which is inconvenient and causes some issues. Possibly because
of that, the PCP lists and vmstat share the same per-cpu space, meaning that
vmstat updates can dirty the cache lines holding the per-cpu lists
across CPUs unless padding is used. The series splits that structure and
separates the locking.

Second, PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

local_irq_disable();
spin_lock(&lock);

The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
separates the locking requirements for the PCP list (local_lock) and stat
updates (irqs disabled). Once that is done, the length of time IRQs are
disabled can be reduced and in some cases, IRQ disabling can be replaced
with preempt_disable.
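
As a rough illustration of the end state (a simplified sketch pieced together
from the rmqueue_pcplist hunk in patch 5/6 below, not a verbatim copy of any
one patch), the fast path changes shape roughly as follows:

	/* Before: IRQs are disabled around both the per-cpu list
	 * manipulation and the statistics update.
	 */
	local_irq_save(flags);
	page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
	if (page) {
		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
		zone_statistics(preferred_zone, zone);
	}
	local_irq_restore(flags);

	/* After the series: the per-cpu lists are protected by a local_lock
	 * and the NUMA stats are plain per-cpu counters, so correctness no
	 * longer depends on IRQs being disabled here.
	 */
	local_lock_irqsave(&pagesets.lock, flags);
	page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
	if (page) {
		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
		zone_statistics(preferred_zone, zone, 1);
	}
	local_unlock_irqrestore(&pagesets.lock, flags);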

After that, it was very obvious that zone_statistics in particular has way
too much overhead and leaves IRQs disabled for longer than necessary. It
maintains perfectly accurate counters that require IRQs to be disabled to
protect the RMW sequences, when inaccurate counters like vm_events would do.
The series makes the NUMA statistics (NUMA_HIT and friends) inaccurate
counters that only require preemption to be disabled.
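
The difference between the two kinds of counter update is roughly the
following (a conceptual sketch only; the field name on the accurate side is
illustrative of the pre-series code, while the raw_cpu_inc() form is what
patch 4/6 below actually adds):

	/* Accurate counter: the per-cpu delta is periodically folded into a
	 * global value, so the read-modify-write must not race with the same
	 * update from interrupt context.
	 */
	local_irq_save(flags);
	__this_cpu_inc(pcp->vm_numa_stat_diff[item]);	/* illustrative name */
	local_irq_restore(flags);

	/* Inaccurate, vm_event-style counter: an occasional lost update is
	 * tolerable, so a raw per-cpu increment with only preemption
	 * disabled is enough.
	 */
	raw_cpu_inc(pzstats->vm_numa_event[item]);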

Finally the bulk page allocator can then do all the stat updates in bulk
with IRQs enabled which should improve the efficiency of the bulk page
allocator. Technically, this could have been done without the local_lock
and vmstat conversion work and the order simply reflects the timing of
when different series were implemented.
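
The resulting shape of __alloc_pages_bulk (shown in full in patch 5/6 below)
is roughly:

	/* inside the locked section: only count successful allocations */
	while (nr_populated < nr_pages) {
		page = __rmqueue_pcplist(zone, ...);	/* arguments elided */
		if (!page)
			break;
		nr_account++;
		/* prep_new_page() and list/array handling elided */
	}

	/* one batched update per call instead of one per page */
	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
	zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);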

No performance data is included because, despite the overhead of the
stats, it is within the noise for most workloads, but Jesper and Chuck may
observe a significant difference with the same tests used for the bulk
page allocator. The series is more likely to be interesting to the RT
folk in terms of slowly getting the PREEMPT_RT tree into mainline.

drivers/base/node.c | 18 +--
include/linux/mmzone.h | 29 +++--
include/linux/vmstat.h | 65 ++++++-----
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 173 ++++++++++++++++------------
mm/vmstat.c | 254 +++++++++++++++--------------------------
6 files changed, 254 insertions(+), 287 deletions(-)

--
2.26.2


2021-03-29 12:11:38

by Mel Gorman

Subject: [PATCH 4/6] mm/vmstat: Inline NUMA event counter updates

__count_numa_event is small enough to be treated similarly to
__count_vm_event, so inline it.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vmstat.h | 9 +++++++++
mm/vmstat.c | 9 ---------
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fc14415223c5..dde4dec4e7dd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -237,6 +237,15 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
}

#ifdef CONFIG_NUMA
+/* See __count_vm_event comment on why raw_cpu_inc is used. */
+static inline void
+__count_numa_event(struct zone *zone, enum numa_stat_item item)
+{
+ struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+
+ raw_cpu_inc(pzstats->vm_numa_event[item]);
+}
+
extern void __count_numa_event(struct zone *zone, enum numa_stat_item item);
extern unsigned long sum_zone_node_page_state(int node,
enum zone_stat_item item);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 46bc61184afc..a326483dd4ab 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -906,15 +906,6 @@ void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats)
#endif

#ifdef CONFIG_NUMA
-/* See __count_vm_event comment on why raw_cpu_inc is used. */
-void __count_numa_event(struct zone *zone,
- enum numa_stat_item item)
-{
- struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-
- raw_cpu_inc(pzstats->vm_numa_event[item]);
-}
-
/*
* Determine the per node value of a stat item. This function
* is called frequently in a NUMA machine, so try to be as
--
2.26.2

2021-03-29 12:11:50

by Mel Gorman

Subject: [PATCH 5/6] mm/page_alloc: Batch the accounting updates in the bulk allocator

Now that the zone_statistics are simple counters that do not require
special protection, the bulk allocator accounting updates can be
batched without requiring IRQs to be disabled.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vmstat.h | 8 ++++++++
mm/page_alloc.c | 30 +++++++++++++-----------------
2 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index dde4dec4e7dd..8473b8fa9756 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -246,6 +246,14 @@ __count_numa_event(struct zone *zone, enum numa_stat_item item)
raw_cpu_inc(pzstats->vm_numa_event[item]);
}

+static inline void
+__count_numa_events(struct zone *zone, enum numa_stat_item item, long delta)
+{
+ struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
+
+ raw_cpu_add(pzstats->vm_numa_event[item], delta);
+}
+
extern void __count_numa_event(struct zone *zone, enum numa_stat_item item);
extern unsigned long sum_zone_node_page_state(int node,
enum zone_stat_item item);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7eb48632bcac..32c64839c145 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3398,7 +3398,8 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt)
*
* Must be called with interrupts disabled.
*/
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
+static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
+ long nr_account)
{
#ifdef CONFIG_NUMA
enum numa_stat_item local_stat = NUMA_LOCAL;
@@ -3411,12 +3412,12 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
local_stat = NUMA_OTHER;

if (zone_to_nid(z) == zone_to_nid(preferred_zone))
- __count_numa_event(z, NUMA_HIT);
+ __count_numa_events(z, NUMA_HIT, nr_account);
else {
- __count_numa_event(z, NUMA_MISS);
- __count_numa_event(preferred_zone, NUMA_FOREIGN);
+ __count_numa_events(z, NUMA_MISS, nr_account);
+ __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
}
- __count_numa_event(z, local_stat);
+ __count_numa_events(z, local_stat, nr_account);
#endif
}

@@ -3462,7 +3463,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
- zone_statistics(preferred_zone, zone);
+ zone_statistics(preferred_zone, zone, 1);
}
local_unlock_irqrestore(&pagesets.lock, flags);
return page;
@@ -3523,7 +3524,7 @@ struct page *rmqueue(struct zone *preferred_zone,
get_pcppage_migratetype(page));

__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
- zone_statistics(preferred_zone, zone);
+ zone_statistics(preferred_zone, zone, 1);
local_irq_restore(flags);

out:
@@ -5006,7 +5007,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
struct alloc_context ac;
gfp_t alloc_gfp;
unsigned int alloc_flags;
- int nr_populated = 0;
+ int nr_populated = 0, nr_account = 0;

if (unlikely(nr_pages <= 0))
return 0;
@@ -5079,15 +5080,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
goto failed_irq;
break;
}
-
- /*
- * Ideally this would be batched but the best way to do
- * that cheaply is to first convert zone_statistics to
- * be inaccurate per-cpu counter like vm_events to avoid
- * a RMW cycle then do the accounting with IRQs enabled.
- */
- __count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
- zone_statistics(ac.preferred_zoneref->zone, zone);
+ nr_account++;

prep_new_page(page, 0, gfp, 0);
if (page_list)
@@ -5097,6 +5090,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nr_populated++;
}

+ __count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
+ zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
+
local_unlock_irqrestore(&pagesets.lock, flags);

return nr_populated;
--
2.26.2

2021-03-30 18:53:56

by Jesper Dangaard Brouer

Subject: Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead

On Mon, 29 Mar 2021 13:06:42 +0100
Mel Gorman <[email protected]> wrote:

> This series requires patches in Andrew's tree so the series is also
> available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
>
> tldr: Jesper and Chuck, it would be nice to verify if this series helps
> the allocation rate of the bulk page allocator. RT people, this
> *partially* addresses some problems PREEMPT_RT has with the page
> allocator but it needs review.

I've run a new micro-benchmark[1] which shows:
(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)

BASELINE
single_page alloc+put: 194 cycles(tsc) 54.106 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size

Per elem: 200 cycles(tsc) 55.667 ns (step:1)
Per elem: 143 cycles(tsc) 39.755 ns (step:2)
Per elem: 132 cycles(tsc) 36.758 ns (step:3)
Per elem: 128 cycles(tsc) 35.795 ns (step:4)
Per elem: 123 cycles(tsc) 34.339 ns (step:8)
Per elem: 120 cycles(tsc) 33.396 ns (step:16)
Per elem: 118 cycles(tsc) 32.806 ns (step:32)
Per elem: 115 cycles(tsc) 32.169 ns (step:64)
Per elem: 116 cycles(tsc) 32.258 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

Per elem: 195 cycles(tsc) 54.225 ns (step:1)
Per elem: 127 cycles(tsc) 35.492 ns (step:2)
Per elem: 117 cycles(tsc) 32.643 ns (step:3)
Per elem: 111 cycles(tsc) 30.992 ns (step:4)
Per elem: 106 cycles(tsc) 29.606 ns (step:8)
Per elem: 102 cycles(tsc) 28.532 ns (step:16)
Per elem: 99 cycles(tsc) 27.728 ns (step:32)
Per elem: 98 cycles(tsc) 27.252 ns (step:64)
Per elem: 97 cycles(tsc) 27.090 ns (step:128)

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk-local_lock-v1r15

This should be seen in comparison with the older micro-benchmark[2]
done on branch mm-bulk-rebase-v5r9.

BASELINE
single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size

Per elem: 206 cycles(tsc) 57.478 ns (step:1)
Per elem: 154 cycles(tsc) 42.861 ns (step:2)
Per elem: 145 cycles(tsc) 40.536 ns (step:3)
Per elem: 142 cycles(tsc) 39.477 ns (step:4)
Per elem: 142 cycles(tsc) 39.610 ns (step:8)
Per elem: 137 cycles(tsc) 38.155 ns (step:16)
Per elem: 135 cycles(tsc) 37.739 ns (step:32)
Per elem: 134 cycles(tsc) 37.282 ns (step:64)
Per elem: 133 cycles(tsc) 36.993 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

Per elem: 202 cycles(tsc) 56.383 ns (step:1)
Per elem: 144 cycles(tsc) 40.047 ns (step:2)
Per elem: 134 cycles(tsc) 37.339 ns (step:3)
Per elem: 128 cycles(tsc) 35.578 ns (step:4)
Per elem: 120 cycles(tsc) 33.592 ns (step:8)
Per elem: 116 cycles(tsc) 32.362 ns (step:16)
Per elem: 113 cycles(tsc) 31.476 ns (step:32)
Per elem: 110 cycles(tsc) 30.633 ns (step:64)
Per elem: 110 cycles(tsc) 30.596 ns (step:128)

[2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk

This new patchset does show some improvements in the micro-benchmark.


> The PCP (per-cpu page allocator in page_alloc.c) shares locking requirements
> with vmstat, which is inconvenient and causes some issues. Possibly because
> of that, the PCP lists and vmstat share the same per-cpu space, meaning that
> vmstat updates can dirty the cache lines holding the per-cpu lists
> across CPUs unless padding is used. The series splits that structure and
> separates the locking.
>
> Second, PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
>
> local_irq_disable();
> spin_lock(&lock);
>
> The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
> -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
> separates the locking requirements for the PCP list (local_lock) and stat
> updates (irqs disabled). Once that is done, the length of time IRQs are
> disabled can be reduced and in some cases, IRQ disabling can be replaced
> with preempt_disable.
>
> After that, it was very obvious that zone_statistics in particular has way
> too much overhead and leaves IRQs disabled for longer than necessary. It
> maintains perfectly accurate counters that require IRQs to be disabled to
> protect the RMW sequences, when inaccurate counters like vm_events would do.
> The series makes the NUMA statistics (NUMA_HIT and friends) inaccurate
> counters that only require preemption to be disabled.
>
> Finally the bulk page allocator can then do all the stat updates in bulk
> with IRQs enabled which should improve the efficiency of the bulk page
> allocator. Technically, this could have been done without the local_lock
> and vmstat conversion work and the order simply reflects the timing of
> when different series were implemented.
>
> No performance data is included because, despite the overhead of the
> stats, it is within the noise for most workloads, but Jesper and Chuck may
> observe a significant difference with the same tests used for the bulk
> page allocator. The series is more likely to be interesting to the RT
> folk in terms of slowly getting the PREEMPT_RT tree into mainline.

I'll try to run some longer packet benchmarks later. A quick test
showed performance was within the same range, and slightly better. The
perf report and objdump output did reveal that the code layout prefers
the label "failed:" as the primary code path, which should only be used
for single page allocs, which is kind of weird. (But as performance is
the same or slightly better, I will not complain).


> drivers/base/node.c | 18 +--
> include/linux/mmzone.h | 29 +++--
> include/linux/vmstat.h | 65 ++++++-----
> mm/mempolicy.c | 2 +-
> mm/page_alloc.c | 173 ++++++++++++++++------------
> mm/vmstat.c | 254 +++++++++++++++--------------------------
> 6 files changed, 254 insertions(+), 287 deletions(-)

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-03-31 07:42:12

by Mel Gorman

Subject: Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead

On Tue, Mar 30, 2021 at 08:51:54PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 29 Mar 2021 13:06:42 +0100
> Mel Gorman <[email protected]> wrote:
>
> > This series requires patches in Andrew's tree so the series is also
> > available at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> >
> > tldr: Jesper and Chuck, it would be nice to verify if this series helps
> > the allocation rate of the bulk page allocator. RT people, this
> > *partially* addresses some problems PREEMPT_RT has with the page
> > allocator but it needs review.
>
> I've run a new micro-benchmark[1] which shows:
> (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
>
> <Editing to focus on arrays>
> BASELINE
> single_page alloc+put: 194 cycles(tsc) 54.106 ns
>
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
>
> Per elem: 195 cycles(tsc) 54.225 ns (step:1)
> Per elem: 127 cycles(tsc) 35.492 ns (step:2)
> Per elem: 117 cycles(tsc) 32.643 ns (step:3)
> Per elem: 111 cycles(tsc) 30.992 ns (step:4)
> Per elem: 106 cycles(tsc) 29.606 ns (step:8)
> Per elem: 102 cycles(tsc) 28.532 ns (step:16)
> Per elem: 99 cycles(tsc) 27.728 ns (step:32)
> Per elem: 98 cycles(tsc) 27.252 ns (step:64)
> Per elem: 97 cycles(tsc) 27.090 ns (step:128)
>
> This should be seen in comparison with the older micro-benchmark[2]
> done on branch mm-bulk-rebase-v5r9.
>
> BASELINE
> single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
>
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
>
> Per elem: 202 cycles(tsc) 56.383 ns (step:1)
> Per elem: 144 cycles(tsc) 40.047 ns (step:2)
> Per elem: 134 cycles(tsc) 37.339 ns (step:3)
> Per elem: 128 cycles(tsc) 35.578 ns (step:4)
> Per elem: 120 cycles(tsc) 33.592 ns (step:8)
> Per elem: 116 cycles(tsc) 32.362 ns (step:16)
> Per elem: 113 cycles(tsc) 31.476 ns (step:32)
> Per elem: 110 cycles(tsc) 30.633 ns (step:64)
> Per elem: 110 cycles(tsc) 30.596 ns (step:128)
>

Ok, so bulk allocation is faster than allocating single pages, no surprise
there. Putting the array figures for bulk allocation into tabular format
and comparing, we get:

Array variant (time to allocate a page in nanoseconds, lower is better)
            Baseline    Patched
    1         56.383     54.225 (+3.83%)
    2         40.047     35.492 (+11.38%)
    3         37.339     32.643 (+12.58%)
    4         35.578     30.992 (+12.89%)
    8         33.592     29.606 (+11.87%)
   16         32.362     28.532 (+11.85%)
   32         31.476     27.728 (+11.91%)
   64         30.633     27.252 (+11.04%)
  128         30.596     27.090 (+11.46%)
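
(The percentages are the relative reduction in allocation time, e.g. for
step 2: (40.047 - 35.492) / 40.047 ~= 11.38%.)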

The series is 11-12% faster when allocating multiple pages. That's a
fairly positive outcome and I'll include this in the series leader if
you have no objections.

Thanks Jesper!

--
Mel Gorman
SUSE Labs

2021-03-31 08:19:52

by Jesper Dangaard Brouer

Subject: Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead

On Wed, 31 Mar 2021 08:38:05 +0100
Mel Gorman <[email protected]> wrote:

> On Tue, Mar 30, 2021 at 08:51:54PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 29 Mar 2021 13:06:42 +0100
> > Mel Gorman <[email protected]> wrote:
> >
> > > This series requires patches in Andrew's tree so the series is also
> > > available at
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> > >
> > > tldr: Jesper and Chuck, it would be nice to verify if this series helps
> > > the allocation rate of the bulk page allocator. RT people, this
> > > *partially* addresses some problems PREEMPT_RT has with the page
> > > allocator but it needs review.
> >
> > I've run a new micro-benchmark[1] which shows:
> > (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
> >
> > <Editing to focus on arrays>
> > BASELINE
> > single_page alloc+put: 194 cycles(tsc) 54.106 ns
> >
> > ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> >
> > Per elem: 195 cycles(tsc) 54.225 ns (step:1)
> > Per elem: 127 cycles(tsc) 35.492 ns (step:2)
> > Per elem: 117 cycles(tsc) 32.643 ns (step:3)
> > Per elem: 111 cycles(tsc) 30.992 ns (step:4)
> > Per elem: 106 cycles(tsc) 29.606 ns (step:8)
> > Per elem: 102 cycles(tsc) 28.532 ns (step:16)
> > Per elem: 99 cycles(tsc) 27.728 ns (step:32)
> > Per elem: 98 cycles(tsc) 27.252 ns (step:64)
> > Per elem: 97 cycles(tsc) 27.090 ns (step:128)
> >
> > This should be seen in comparison with the older micro-benchmark[2]
> > done on branch mm-bulk-rebase-v5r9.
> >
> > BASELINE
> > single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
> >
> > ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> >
> > Per elem: 202 cycles(tsc) 56.383 ns (step:1)
> > Per elem: 144 cycles(tsc) 40.047 ns (step:2)
> > Per elem: 134 cycles(tsc) 37.339 ns (step:3)
> > Per elem: 128 cycles(tsc) 35.578 ns (step:4)
> > Per elem: 120 cycles(tsc) 33.592 ns (step:8)
> > Per elem: 116 cycles(tsc) 32.362 ns (step:16)
> > Per elem: 113 cycles(tsc) 31.476 ns (step:32)
> > Per elem: 110 cycles(tsc) 30.633 ns (step:64)
> > Per elem: 110 cycles(tsc) 30.596 ns (step:128)
> >
>
> Ok, so bulk allocation is faster than allocating single pages, no surprise
> there. Putting the array figures for bulk allocation into tabular format
> and comparing, we get:
>
> Array variant (time to allocate a page in nanoseconds, lower is better)
>             Baseline    Patched
>     1         56.383     54.225 (+3.83%)
>     2         40.047     35.492 (+11.38%)
>     3         37.339     32.643 (+12.58%)
>     4         35.578     30.992 (+12.89%)
>     8         33.592     29.606 (+11.87%)
>    16         32.362     28.532 (+11.85%)
>    32         31.476     27.728 (+11.91%)
>    64         30.633     27.252 (+11.04%)
>   128         30.596     27.090 (+11.46%)
>
> The series is 11-12% faster when allocating multiple pages. That's a
> fairly positive outcome and I'll include this in the series leader if
> you have no objections.

That is fine by me to add this to the cover letter. I like your
tabular format as it makes it easier to compare. If you use the
nanosec measurements and not the cycles, you should state that
this was run on a CPU E5-1650 v4 @ 3.60GHz. You might notice that the
factor between cycles(tsc) and ns is very close to 3.6.
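(For example, the single-page baseline above: 194 cycles / 54.106 ns ~= 3.59.)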

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-03-31 08:56:00

by Mel Gorman

Subject: Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead

Ingo, Thomas or Peter, is there any chance one of you could take a look
at patch "[PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to
local_lock" from this series? It's partially motivated by PREEMPT_RT. More
details below.

On Mon, Mar 29, 2021 at 01:06:42PM +0100, Mel Gorman wrote:
> This series requires patches in Andrew's tree so the series is also
> available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
>
> The PCP (per-cpu page allocator in page_alloc.c) shares locking requirements
> with vmstat, which is inconvenient and causes some issues. Possibly because
> of that, the PCP lists and vmstat share the same per-cpu space, meaning that
> vmstat updates can dirty the cache lines holding the per-cpu lists
> across CPUs unless padding is used. The series splits that structure and
> separates the locking.
>

The bulk page allocation series that the local_lock work builds on had an
additional fix, so I've rebased this series onto

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r16

> Second, PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
>
> local_irq_disable();
> spin_lock(&lock);
>
> The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
> -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
> separates the locking requirements for the PCP list (local_lock) and stat
> updates (irqs disabled). Once that is done, the length of time IRQs are
> disabled can be reduced and in some cases, IRQ disabling can be replaced
> with preempt_disable.
>

It's this part I'm interested in even though it only partially addresses
the preempt-rt tree concerns. More legwork is needed for preempt-rt which
is outside the context of this series. At minimum, it involves

1. Split locking of pcp and buddy allocator instead of using spin_lock()
when it's "known" that IRQs are disabled (not necessarily a valid
assumption on PREEMPT_RT)
2. Split the zone lock into what protects the zone metadata and what
protects the free lists

This looks straightforward but it involves audit work and it may be
difficult to avoid regressing non-PREEMPT_RT kernels by disabling/enabling
IRQs when switching between the pcp allocator and the buddy allocator.

> After that, it was very obvious that zone_statistics in particular has way
> too much overhead and leaves IRQs disabled for longer than necessary. It
> maintains perfectly accurate counters that require IRQs to be disabled to
> protect the RMW sequences, when inaccurate counters like vm_events would do.
> The series makes the NUMA statistics (NUMA_HIT and friends) inaccurate
> counters that only require preemption to be disabled.
>
> Finally the bulk page allocator can then do all the stat updates in bulk
> with IRQs enabled which should improve the efficiency of the bulk page
> allocator. Technically, this could have been done without the local_lock
> and vmstat conversion work and the order simply reflects the timing of
> when different series were implemented.
>
> No performance data is included because, despite the overhead of the
> stats, it is within the noise for most workloads, but Jesper and Chuck may
> observe a significant difference with the same tests used for the bulk
> page allocator. The series is more likely to be interesting to the RT
> folk in terms of slowly getting the PREEMPT_RT tree into mainline.
>
> drivers/base/node.c | 18 +--
> include/linux/mmzone.h | 29 +++--
> include/linux/vmstat.h | 65 ++++++-----
> mm/mempolicy.c | 2 +-
> mm/page_alloc.c | 173 ++++++++++++++++------------
> mm/vmstat.c | 254 +++++++++++++++--------------------------
> 6 files changed, 254 insertions(+), 287 deletions(-)
>
> --
> 2.26.2
>

--
Mel Gorman
SUSE Labs

2021-03-31 09:53:34

by Thomas Gleixner

Subject: Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead

On Wed, Mar 31 2021 at 09:52, Mel Gorman wrote:
> Ingo, Thomas or Peter, is there any chance one of you could take a look
> at patch "[PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to
> local_lock" from this series? It's partially motivated by PREEMPT_RT. More
> details below.

Sure.