Greetings,
FYI, we noticed a -27.2% regression of will-it-scale.per_process_ops due to commit:
commit: e27be240df53f1a20c659168e722b5d9f16cc7f4 ("mm: memcg: make sure memory.events is uptodate when waking pollers")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
in testcase: will-it-scale
on test machine: 72 threads Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz with 128G memory
with the following parameters:
nr_task: 100%
mode: process
test: page_fault3
cpufreq_governor: performance
test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a threads-based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
Details are as below:
-------------------------------------------------------------------------------------------------->
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2016-08-31.cgz/lkp-hsw-ep4/page_fault3/will-it-scale
commit:
a38c015f31 ("mm/ksm.c: fix inconsistent accounting of zero pages")
e27be240df ("mm: memcg: make sure memory.events is uptodate when waking pollers")
a38c015f3156895b e27be240df53f1a20c659168e7
---------------- --------------------------
%stddev %change %stddev
\ | \
639324 -27.2% 465226 will-it-scale.per_process_ops
46031421 -27.2% 33496351 will-it-scale.workload
17.55 -3.2 14.38 mpstat.cpu.usr%
1130383 ± 6% -19.6% 909067 ± 4% softirqs.RCU
95892 ± 2% -7.5% 88706 ± 3% vmstat.system.in
2714 +2.0% 2768 turbostat.Avg_MHz
0.43 ± 9% -33.3% 0.29 ± 15% turbostat.CPU%c1
15.72 -2.5% 15.33 turbostat.RAMWatt
15220184 -26.9% 11118535 numa-numastat.node0.local_node
15223689 -26.9% 11125573 numa-numastat.node0.numa_hit
15236149 -22.2% 11857182 numa-numastat.node1.local_node
15246716 -22.2% 11864179 numa-numastat.node1.numa_hit
8676822 -22.6% 6714739 numa-vmstat.node0.numa_hit
8673095 -22.7% 6707502 numa-vmstat.node0.numa_local
8661159 -19.7% 6951620 numa-vmstat.node1.numa_hit
8481025 -20.1% 6775023 numa-vmstat.node1.numa_local
30466411 -24.6% 22979746 proc-vmstat.numa_hit
30452327 -24.6% 22965700 proc-vmstat.numa_local
30512939 -24.6% 23021801 proc-vmstat.pgalloc_normal
1.386e+10 -27.2% 1.008e+10 proc-vmstat.pgfault
28718588 ± 3% -24.0% 21818568 ± 5% proc-vmstat.pgfree
62.72 ± 10% -21.8% 49.06 ± 2% sched_debug.cfs_rq:/.exec_clock.stddev
80883 ± 10% -14.1% 69503 ± 6% sched_debug.cfs_rq:/.min_vruntime.stddev
2.04 ± 3% +10.0% 2.24 ± 2% sched_debug.cfs_rq:/.nr_spread_over.stddev
119225 ± 11% -58.0% 50132 ± 59% sched_debug.cfs_rq:/.spread0.avg
199133 ± 7% -35.3% 128853 ± 23% sched_debug.cfs_rq:/.spread0.max
80591 ± 10% -14.1% 69247 ± 6% sched_debug.cfs_rq:/.spread0.stddev
6.275e+12 -27.3% 4.565e+12 perf-stat.branch-instructions
4.772e+10 ± 2% -26.7% 3.498e+10 perf-stat.branch-misses
55.58 -20.5 35.13 perf-stat.cache-miss-rate%
2.658e+10 -20.4% 2.116e+10 perf-stat.cache-misses
4.782e+10 +26.0% 6.025e+10 perf-stat.cache-references
1.86 +40.3% 2.60 perf-stat.cpi
5.875e+13 +2.0% 5.994e+13 perf-stat.cpu-cycles
8.997e+12 -27.4% 6.532e+12 perf-stat.dTLB-loads
2.94 -0.5 2.48 perf-stat.dTLB-store-miss-rate%
1.599e+11 -38.9% 9.764e+10 perf-stat.dTLB-store-misses
5.27e+12 -27.2% 3.838e+12 perf-stat.dTLB-stores
2.684e+10 -27.3% 1.95e+10 perf-stat.iTLB-load-misses
3.166e+13 -27.3% 2.303e+13 perf-stat.instructions
0.54 -28.7% 0.38 perf-stat.ipc
1.386e+10 -27.2% 1.009e+10 perf-stat.minor-faults
0.57 ± 10% +10.9 11.49 perf-stat.node-load-miss-rate%
67281213 ± 10% +1624.2% 1.16e+09 perf-stat.node-load-misses
1.179e+10 -24.2% 8.934e+09 perf-stat.node-loads
5.02 +0.6 5.64 perf-stat.node-store-miss-rate%
7.36e+08 -15.5% 6.216e+08 perf-stat.node-store-misses
1.393e+10 -25.3% 1.041e+10 perf-stat.node-stores
1.386e+10 -27.2% 1.009e+10 perf-stat.page-faults
will-it-scale.per_process_ops
[ASCII plot: bisect-good samples ([*], parent commit) hold steady around 600000-650000 ops; bisect-bad samples ([O], e27be240df) drop to roughly 420000-470000 ops]
will-it-scale.workload
[ASCII plot: bisect-good samples ([*], parent commit) sit around 4.2e+07-4.8e+07; bisect-bad samples ([O], e27be240df) drop to roughly 3e+07-3.4e+07]
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Xiaolong
Hello,
On Tue, May 08, 2018 at 01:34:51PM +0800, kernel test robot wrote:
> FYI, we noticed a -27.2% regression of will-it-scale.per_process_ops due to commit:
> commit: e27be240df53f1a20c659168e722b5d9f16cc7f4 ("mm: memcg: make sure memory.events is uptodate when waking pollers")
> [...]
This is surprising. Do you run these tests in a memory cgroup with a
limit set? Can you dump that cgroup's memory.events after the run?
Thanks
On Tue, May 08, 2018 at 01:26:40PM -0400, Johannes Weiner wrote:
> [...]
>
> This is surprising. Do you run these tests in a memory cgroup with a
> limit set? Can you dump that cgroup's memory.events after the run?
There is no cgroup-related setup, so yes, this is surprising.
But the result is quite stable; I have just confirmed it on another
Haswell-EP machine.
perf shows increased cycles spent in lock_page_memcg and
unlock_page_memcg; maybe this can shed some light. Full profiles for this
commit and its parent are attached.
I have also attached dmesg for both commits in case they are useful;
please feel free to let me know if you need any other information. We
also collected a ton of other information during the run, like
/proc/vmstat, /proc/meminfo, /proc/interrupts, etc.
On Wed, May 09, 2018 at 10:32:11AM +0800, Aaron Lu wrote:
> [...]
>
> perf shows increased cycles spent in lock_page_memcg and
> unlock_page_memcg; maybe this can shed some light. Full profiles for this
> commit and its parent are attached.
A test on Broadwell-EP also showed a 35% regression; here is a list of
functions that take more CPU cycles with this commit, according to perf:
a38c015f3156895b e27be240df53f1a20c659168e7
---------------- --------------------------
%stddev %change %stddev
\ | \
58033709 -35.0% 37727244 will-it-scale.workload
... ...
3.82 +6.1 9.97 perf-profile.self.cycles-pp.handle_mm_fault
3.19 +6.2 9.37 perf-profile.self.cycles-pp.page_remove_rmap
0.25 +6.5 6.71 perf-profile.self.cycles-pp.__unlock_page_memcg
3.63 +7.5 11.15 perf-profile.self.cycles-pp.page_add_file_rmap
0.60 +8.1 8.70 perf-profile.self.cycles-pp.lock_page_memcg
On Tue, May 08, 2018 at 01:26:40PM -0400, Johannes Weiner wrote:
> [...]
>
> This is surprising. Do you run these tests in a memory cgroup with a
> limit set? Can you dump that cgroup's memory.events after the run?
"Some background in case it's forgotten: we do not set any memory control
group specifically and the test machine is using ramfs as its root.
The machine has plenty memory, no swap is setup. All pages belong to
root_mem_cgroup"
It turned out the performance change is due to a 'struct mem_cgroup'
layout change, i.e. if I do:
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71bc2c66..c767db1da0bb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -205,7 +205,6 @@ struct mem_cgroup {
int oom_kill_disable;
/* memory.events */
- atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
struct cgroup_file events_file;
/* protect arrays of thresholds */
@@ -238,6 +237,7 @@ struct mem_cgroup {
struct mem_cgroup_stat_cpu __percpu *stat_cpu;
atomic_long_t stat[MEMCG_NR_STAT];
atomic_long_t events[NR_VM_EVENT_ITEMS];
+ atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
unsigned long socket_pressure;
The performance is then restored.
More information:
With this patch, perf profile+annotate showed increased cycles spent on
accessing root_mem_cgroup->stat_cpu in count_memcg_event_mm() (called by
handle_mm_fault()):
│ x = count + __this_cpu_read(memcg->stat_cpu->events[idx]);
92.31 │ mov 0x308(%rcx),%rax
0.58 │ mov %gs:0x1b0(%rax),%rdx
0.09 │ add $0x1,%rdx
And in __mod_memcg_state() called by page_add_file_rmap():
│ x = val + __this_cpu_read(memcg->stat_cpu->count[idx]);
70.89 │ mov 0x308(%rdi),%rdx
0.43 │ mov %gs:0x68(%rdx),%rax
0.08 │ add %rbx,%rax
│ if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
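For reference, the function around that second annotated load is roughly
the following (simplified from that era's mm/memcontrol.c, so treat it as
a sketch). The point is that memcg->stat_cpu is dereferenced on every
call, which is why the cacheline it lives in matters so much:

	/* Simplified sketch of __mod_memcg_state(): per-cpu deltas are
	 * accumulated and only flushed to the atomic counters in
	 * MEMCG_CHARGE_BATCH chunks, so the hot path boils down to the
	 * load of the memcg->stat_cpu pointer plus a per-cpu update. */
	void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
	{
		long x;

		if (mem_cgroup_disabled())
			return;

		x = val + __this_cpu_read(memcg->stat_cpu->count[idx]);
		if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
			atomic_long_add(x, &memcg->stat[idx]);
			x = 0;
		}
		__this_cpu_write(memcg->stat_cpu->count[idx], x);
	}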
My first thought was that, with the patch changing the structure layout,
the stat_cpu field might end up in a cacheline that is constantly being
written to. With the help of pahole, I got:
1. After this patch (bad):
/* --- cacheline 12 boundary (768 bytes) --- */
long unsigned int move_lock_flags; /* 768 8 */
struct mem_cgroup_stat_cpu * stat_cpu; /* 776 8 */
atomic_long_t stat[34]; /* 784 0 */
stat[0] - stat[5] fall in this cacheline.
2. Before this patch (good):
/* --- cacheline 11 boundary (704 bytes) was 8 bytes ago --- */
long unsigned int move_charge_at_immigrate; /* 712 8 */
atomic_t moving_account; /* 720 0 */
/* XXX 4 bytes hole, try to pack */
spinlock_t move_lock; /* 724 0 */
/* XXX 4 bytes hole, try to pack */
struct task_struct * move_lock_task; /* 728 8 */
long unsigned int move_lock_flags; /* 736 8 */
struct mem_cgroup_stat_cpu * stat_cpu; /* 744 8 */
atomic_long_t stat[34]; /* 752 0 */
stat[0] - stat[1] fall in this cacheline.
We now have more stat[]s falling in this cacheline, but then I realized
that stat[0] - stat[12] are never written to for a memory control group;
the first field written to is stat[13] (NR_FILE_MAPPED).
So I think my first reaction was wrong.
Looking at the good layout, there is a field, moving_account, that is
accessed during the test in lock_page_memcg(), and that access is always
read-only since no page changes memcg there. So the good performance
might be due to having the two fields in the same cacheline. I moved the
moving_account field into the same cacheline as stat_cpu for the bad
case; performance recovered a lot, but still not to the level of the base.
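The fast path in question is roughly this (simplified sketch of that
era's __lock_page_memcg(); the slow path only runs while pages are
actually moving between cgroups):

	/* Simplified sketch: when no cgroup move is in flight,
	 * moving_account is only *read* and we return immediately, so it
	 * behaves like a read-mostly hot field in this workload. */
	struct mem_cgroup *__lock_page_memcg(struct page *page)
	{
		struct mem_cgroup *memcg;
		unsigned long flags;

		rcu_read_lock();
	again:
		memcg = page->mem_cgroup;
		if (unlikely(!memcg))
			return NULL;

		if (atomic_read(&memcg->moving_account) <= 0)
			return memcg;	/* read-only fast path */

		/* slow path: pin the memcg against a concurrent move */
		spin_lock_irqsave(&memcg->move_lock, flags);
		if (memcg != page->mem_cgroup) {
			spin_unlock_irqrestore(&memcg->move_lock, flags);
			goto again;
		}
		memcg->move_lock_task = current;
		memcg->move_lock_flags = flags;

		return memcg;
	}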
I'm not sure where to go next and would like to seek some
suggestions. Based on my analysis, it appears the good performance of the
base is entirely accidental (having moving_account and stat_cpu in the
same cacheline); we never ensure that. At the same time, it might not be
a good idea to ensure that, since stat_cpu should be an always-read field
while moving_account is modified when needed.
Or any idea what else might be the cause? Thanks.
On Mon 28-05-18 16:52:01, Aaron Lu wrote:
> [...]
> It turned out the performance change is due to a 'struct mem_cgroup'
> layout change, i.e. if I do:
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d99b71bc2c66..c767db1da0bb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -205,7 +205,6 @@ struct mem_cgroup {
> int oom_kill_disable;
>
> /* memory.events */
> - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
> struct cgroup_file events_file;
>
> /* protect arrays of thresholds */
> @@ -238,6 +237,7 @@ struct mem_cgroup {
> struct mem_cgroup_stat_cpu __percpu *stat_cpu;
> atomic_long_t stat[MEMCG_NR_STAT];
> atomic_long_t events[NR_VM_EVENT_ITEMS];
> + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
>
> unsigned long socket_pressure;
Well, I do not mind moving memory_events down to the other stats/events.
I suspect Johannes chose the current location to be close to
events_file.
> [...]
>
> We now have more stat[]s falling in this cacheline, but then I realized
> that stat[0] - stat[12] are never written to for a memory control group;
> the first field written to is stat[13] (NR_FILE_MAPPED).
This is a bit scary though. Seeing a 27% regression just because of this
is really unexpected and fragile wrt. future changes.
> [...]
>
> I'm not sure where to go next and would like to seek some
> suggestions. Based on my analysis, it appears the good performance of the
> base is entirely accidental (having moving_account and stat_cpu in the
> same cacheline); we never ensure that. At the same time, it might not be
> a good idea to ensure that, since stat_cpu should be an always-read field
> while moving_account is modified when needed.
> Or any idea what else might be the cause? Thanks.
Can you actually prepare a patch with all these numbers and a big fat
comment in the structure to keep the hottest counters close to
moving_account? Maybe we want to re-organize this some more and pull
the move_lock* fields out of the sensitive cacheline because they are
slow-path stuff. We would have more stats in the same cacheline then.
What do you think?
--
Michal Hocko
SUSE Labs
On Tue, May 29, 2018 at 10:48:16AM +0200, Michal Hocko wrote:
> [...]
>
> This is a bit scary though. Seeing a 27% regression just because of this
> is really unexpected and fragile wrt. future changes.
Agree.
And it turned out that as long as moving_account, move_lock_task and
stat_cpu are placed in the same cacheline, performance is restored.
moving_account is accessed in lock_page_memcg(), move_lock_task is
accessed in unlock_page_memcg() and stat_cpu is accessed in multiple
places that call __mod_memcg_state(). Placing the three fields in the
same cacheline really helps this workload.
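For completeness, the unlock side touches move_lock_task in the same
read-mostly way; here is a simplified sketch of that era's
__unlock_page_memcg():

	/* Simplified sketch: in the common case no move is in flight, so
	 * the memcg->move_lock_task == current check is a pure read and
	 * the function only drops the RCU read lock. */
	void __unlock_page_memcg(struct mem_cgroup *memcg)
	{
		if (memcg && memcg->move_lock_task == current) {
			unsigned long flags = memcg->move_lock_flags;

			memcg->move_lock_task = NULL;
			memcg->move_lock_flags = 0;

			spin_unlock_irqrestore(&memcg->move_lock, flags);
		}

		rcu_read_unlock();
	}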
>
> > [...]
>
> Can you actually prepare a patch with all these numbers and a big fat
> comment in the structure to keep the hottest counters close to
> moving_account? Maybe we want to re-organize this some more and pull
> the move_lock* fields out of the sensitive cacheline because they are
> slow-path stuff. We would have more stats in the same cacheline then.
> What do you think?
Below is what I did to get the performance back for this workload. I will
need to verify how it works with the other one ([LKP] [lkp-robot] [mm,
memcontrol] 309fe96bfc: vm-scalability.throughput +23.0% improvement):
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71bc2c66..c79972a78d6c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -158,6 +158,15 @@ enum memcg_kmem_state {
KMEM_ONLINE,
};
+#if defined(CONFIG_SMP)
+struct memcg_padding {
+ char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define MEMCG_PADDING(name) struct memcg_padding name;
+#else
+#define MEMCG_PADDING(name)
+#endif
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -225,17 +234,23 @@ struct mem_cgroup {
* mem_cgroup ? And what type of charges should we move ?
*/
unsigned long move_charge_at_immigrate;
+ /* taken only while moving_account > 0 */
+ spinlock_t move_lock;
+ unsigned long move_lock_flags;
+
+ MEMCG_PADDING(_pad1_);
+
/*
* set > 0 if pages under this cgroup are moving to other cgroup.
*/
atomic_t moving_account;
- /* taken only while moving_account > 0 */
- spinlock_t move_lock;
struct task_struct *move_lock_task;
- unsigned long move_lock_flags;
/* memory.stat */
struct mem_cgroup_stat_cpu __percpu *stat_cpu;
+
+ MEMCG_PADDING(_pad2_);
+
atomic_long_t stat[MEMCG_NR_STAT];
atomic_long_t events[NR_VM_EVENT_ITEMS];
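(For clarity on the padding trick: the zero-size MEMCG_PADDING struct
works like the padding markers used elsewhere in the kernel. On SMP
builds, ____cacheline_internodealigned_in_smp forces whatever follows the
marker to start on a fresh internode cacheline boundary, and on !SMP the
macro compiles away to nothing, as the #else branch above shows.)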
I'm not entirely sure this is the right thing to do, considering
move_lock_task and moving_account could be modified while stat_cpu will
remain unchanged. But I think it is a rare case for move_lock_task and
moving_account to be modified: per my understanding, only when a task
moves to another memory cgroup will those fields be modified. If so,
I think the above diff should be OK. Thoughts?
The LKP robot found a 27% will-it-scale/page_fault3 performance
regression caused by commit e27be240df53 ("mm: memcg: make sure
memory.events is uptodate when waking pollers").
What the test does is:
1 mkstemp() a 128M file on a tmpfs;
2 start $nr_cpu processes, each to loop the following:
2.1 mmap() this file in shared write mode;
2.2 write 0 to this file in a PAGE_SIZE step till the end of the file;
2.3 munmap() this file and repeat this process.
3 After 5 minutes, check how many loops they managed to complete,
the higher the better.
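In rough C terms, each worker runs something like the following (a
minimal sketch for illustration only, not the actual will-it-scale
source):

	/* Minimal sketch of a page_fault3-style worker (illustration only,
	 * not the actual will-it-scale source): repeatedly mmap() the 128M
	 * tmpfs file shared-writable, store one byte per page to fault
	 * every page in, then munmap() and start over. */
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define FILESIZE (128UL * 1024 * 1024)

	static volatile int stop;	/* set by the harness after 5 minutes */

	static unsigned long worker(int fd)	/* fd from mkstemp() on tmpfs */
	{
		unsigned long page_size = getpagesize();
		unsigned long iterations = 0;

		while (!stop) {
			char *map = mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE,
					 MAP_SHARED, fd, 0);
			unsigned long off;

			if (map == MAP_FAILED)
				exit(EXIT_FAILURE);
			for (off = 0; off < FILESIZE; off += page_size)
				map[off] = 0;	/* one minor fault per page */
			munmap(map, FILESIZE);
			iterations++;
		}
		return iterations;
	}

Each pass over the file triggers one minor fault per page, and every
fault goes through handle_mm_fault(), page_add_file_rmap() and the memcg
accounting paths discussed above.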
The commit itself looks innocent enough, as it merely changed some event
counting mechanism and this test doesn't trigger those events at all.
Perf shows increased cycles spent on accessing root_mem_cgroup->stat_cpu
in count_memcg_event_mm() (called by handle_mm_fault()) and in
__mod_memcg_state() (called by page_add_file_rmap()). So it's likely due
to the changed layout of 'struct mem_cgroup': it either makes stat_cpu
fall into a constantly-written cacheline or causes some hot fields to
stop sharing a cacheline.
I verified this by moving memory_events[] back to where it was:
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71b..c767db1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -205,7 +205,6 @@ struct mem_cgroup {
int oom_kill_disable;
/* memory.events */
- atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
struct cgroup_file events_file;
/* protect arrays of thresholds */
@@ -238,6 +237,7 @@ struct mem_cgroup {
struct mem_cgroup_stat_cpu __percpu *stat_cpu;
atomic_long_t stat[MEMCG_NR_STAT];
atomic_long_t events[NR_VM_EVENT_ITEMS];
+ atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
unsigned long socket_pressure;
And performance was restored.
Later investigation found that as long as the three fields
moving_account, move_lock_task and stat_cpu are in the same cacheline,
performance is good. To avoid future performance surprises from
other commits changing the layout of 'struct mem_cgroup', this patch
makes sure the three fields stay in the same cacheline.
One concern with this approach is that moving_account and move_lock_task
could be modified when a process changes memory cgroup, while stat_cpu
is an always-read field, so it might hurt to place them in the same
cacheline. I assume it is rare for a process to change memory cgroup,
so this should be OK.
Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop
Reported-by: kernel test robot <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
include/linux/memcontrol.h | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71bc2c66..c79972a78d6c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -158,6 +158,15 @@ enum memcg_kmem_state {
KMEM_ONLINE,
};
+#if defined(CONFIG_SMP)
+struct memcg_padding {
+ char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define MEMCG_PADDING(name) struct memcg_padding name;
+#else
+#define MEMCG_PADDING(name)
+#endif
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -225,17 +234,23 @@ struct mem_cgroup {
* mem_cgroup ? And what type of charges should we move ?
*/
unsigned long move_charge_at_immigrate;
+ /* taken only while moving_account > 0 */
+ spinlock_t move_lock;
+ unsigned long move_lock_flags;
+
+ MEMCG_PADDING(_pad1_);
+
/*
* set > 0 if pages under this cgroup are moving to other cgroup.
*/
atomic_t moving_account;
- /* taken only while moving_account > 0 */
- spinlock_t move_lock;
struct task_struct *move_lock_task;
- unsigned long move_lock_flags;
/* memory.stat */
struct mem_cgroup_stat_cpu __percpu *stat_cpu;
+
+ MEMCG_PADDING(_pad2_);
+
atomic_long_t stat[MEMCG_NR_STAT];
atomic_long_t events[NR_VM_EVENT_ITEMS];
--
2.17.0