2016-10-17 02:29:06

by kernel test robot

Subject: [lkp] [ipc/sem.c] 5864a2fd30: aim9.shared_memory.ops_per_sec -13.0% regression


FYI, we noticed a -13.0% regression of aim9.shared_memory.ops_per_sec due to commit:

commit 5864a2fd3088db73d47942370d0f7210a807b9bc ("ipc/sem.c: fix complex_count vs. simple op race")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

in testcase: aim9
on test machine: 144 threads Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory
with following parameters:

testtime: 300s
test: shared_memory
cpufreq_governor: performance

Suite IX is the "AIM Independent Resource Benchmark", the well-known synthetic benchmark.



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-6/performance/x86_64-rhel-7.2/debian-x86_64-2016-08-31.cgz/lkp-hsx04/shared_memory/aim9/300s

commit:
65deb8af76 ("kcov: do not instrument lib/stackdepot.c")
5864a2fd30 ("ipc/sem.c: fix complex_count vs. simple op race")

65deb8af76defeae 5864a2fd3088db73d47942370d
---------------- --------------------------
%stddev %change %stddev
\ | \
1450659 ± 0% -13.0% 1262704 ± 0% aim9.shared_memory.ops_per_sec
4353113 ± 0% -13.0% 3789220 ± 0% aim9.time.minor_page_faults
251.75 ± 0% +2.5% 258.09 ± 0% aim9.time.system_time
48.25 ± 0% -13.1% 41.91 ± 0% aim9.time.user_time
967475 ± 8% -26.3% 713361 ± 9% cpuidle.C1-HSW.time
6351555 ± 7% +10.8% 7036941 ± 0% meminfo.DirectMap2M
4856604 ± 0% -40.6% 2882512 ± 65% numa-numastat.node0.local_node
7725 ± 74% +80.1% 13913 ± 0% numa-numastat.node0.numa_foreign
4856610 ± 0% -40.6% 2882518 ± 65% numa-numastat.node0.numa_hit
7725 ± 74% +80.1% 13913 ± 0% numa-numastat.node0.numa_miss
16976 ± 2% +15.2% 19550 ± 4% slabinfo.kmalloc-512.active_objs
17132 ± 2% +15.1% 19712 ± 4% slabinfo.kmalloc-512.num_objs
133771 ± 3% -13.6% 115576 ± 9% slabinfo.kmalloc-8.active_objs
134332 ± 3% -13.4% 116281 ± 9% slabinfo.kmalloc-8.num_objs
5553071 ± 0% -11.7% 4903071 ± 0% proc-vmstat.numa_hit
5553039 ± 0% -11.7% 4903039 ± 0% proc-vmstat.numa_local
7787767 ± 0% -11.2% 6913807 ± 0% proc-vmstat.pgalloc_normal
5088685 ± 0% -11.0% 4527573 ± 0% proc-vmstat.pgfault
7812098 ± 0% -11.3% 6930089 ± 0% proc-vmstat.pgfree
39721 ± 35% -46.2% 21378 ± 72% numa-meminfo.node1.Active
32736 ± 42% -54.8% 14793 ±108% numa-meminfo.node1.Active(anon)
25797 ± 50% -61.0% 10062 ±127% numa-meminfo.node1.AnonHugePages
32721 ± 43% -54.8% 14777 ±108% numa-meminfo.node1.AnonPages
916.00 ± 14% -50.5% 453.67 ± 18% numa-meminfo.node1.PageTables
15945 ± 11% +23.8% 19737 ± 10% numa-meminfo.node2.SReclaimable
647.67 ± 18% +60.8% 1041 ± 15% numa-meminfo.node3.PageTables
2195 ± 80% +128.1% 5008 ± 14% numa-meminfo.node3.Shmem
5.152e+09 ± 5% +6.4% 5.481e+09 ± 4% perf-stat.branch-misses
0.19 ± 6% +9.0% 0.20 ± 2% perf-stat.dTLB-load-miss-rate%
1.28e+09 ± 1% -3.7% 1.233e+09 ± 0% perf-stat.dTLB-load-misses
6.934e+11 ± 7% -12.0% 6.101e+11 ± 2% perf-stat.dTLB-loads
3.348e+08 ± 0% -2.3% 3.27e+08 ± 0% perf-stat.dTLB-store-misses
84.70 ± 1% -4.4% 81.00 ± 2% perf-stat.iTLB-load-miss-rate%
1.219e+09 ± 1% -16.5% 1.018e+09 ± 14% perf-stat.iTLB-load-misses
2.202e+08 ± 4% +6.6% 2.348e+08 ± 2% perf-stat.iTLB-loads
2102 ± 29% +39.9% 2941 ± 13% perf-stat.instructions-per-iTLB-miss
5064989 ± 0% -11.1% 4500438 ± 0% perf-stat.minor-faults
57745625 ± 10% -44.9% 31799796 ± 53% perf-stat.node-store-misses
9693133 ± 9% -20.9% 7668803 ± 23% perf-stat.node-stores
5064525 ± 0% -11.1% 4500444 ± 0% perf-stat.page-faults
2.35 ± 23% -24.4% 1.78 ± 14% perf-profile.calltrace.cycles-pp.__tick_nohz_idle_enter.tick_nohz_irq_exit.irq_exit.smp_apic_timer_interrupt.apic_timer_interrupt
3.00 ± 23% -26.4% 2.21 ± 14% perf-profile.calltrace.cycles-pp.irq_exit.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter.call_cpuidle
2.42 ± 23% -24.3% 1.83 ± 14% perf-profile.calltrace.cycles-pp.tick_nohz_irq_exit.irq_exit.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter
2.44 ± 23% -24.5% 1.84 ± 14% perf-profile.children.cycles-pp.__tick_nohz_idle_enter
0.96 ± 17% -24.9% 0.72 ± 19% perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
0.72 ± 28% -27.2% 0.53 ± 14% perf-profile.children.cycles-pp.hrtimer_start_range_ns
3.13 ± 23% -25.1% 2.34 ± 14% perf-profile.children.cycles-pp.irq_exit
1.01 ± 76% -65.2% 0.35 ± 24% perf-profile.children.cycles-pp.rest_init
1.01 ± 76% -65.2% 0.35 ± 24% perf-profile.children.cycles-pp.start_kernel
2.49 ± 24% -24.3% 1.88 ± 13% perf-profile.children.cycles-pp.tick_nohz_irq_exit
1.01 ± 76% -65.2% 0.35 ± 24% perf-profile.children.cycles-pp.x86_64_start_kernel
1.01 ± 76% -65.2% 0.35 ± 24% perf-profile.children.cycles-pp.x86_64_start_reservations
2.34 ± 16% +48.1% 3.47 ± 16% perf-profile.self.cycles-pp.SYSC_semtimedop
0.96 ± 17% -24.9% 0.72 ± 19% perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
33058 ± 17% +19.8% 39614 ± 1% numa-vmstat.node0.numa_foreign
2553587 ± 0% -38.9% 1560692 ± 59% numa-vmstat.node0.numa_hit
2553582 ± 0% -38.9% 1560686 ± 59% numa-vmstat.node0.numa_local
33058 ± 17% +19.8% 39614 ± 1% numa-vmstat.node0.numa_miss
8185 ± 42% -54.8% 3699 ±108% numa-vmstat.node1.nr_active_anon
8179 ± 43% -54.8% 3693 ±108% numa-vmstat.node1.nr_anon_pages
226.33 ± 15% -50.7% 111.67 ± 19% numa-vmstat.node1.nr_page_table_pages
8186 ± 42% -54.8% 3699 ±108% numa-vmstat.node1.nr_zone_active_anon
3986 ± 11% +23.8% 4933 ± 10% numa-vmstat.node2.nr_slab_reclaimable
161.00 ± 18% +61.1% 259.33 ± 15% numa-vmstat.node3.nr_page_table_pages
548.33 ± 80% +128.3% 1251 ± 14% numa-vmstat.node3.nr_shmem
88775 ± 4% -7.1% 82488 ± 4% numa-vmstat.node3.numa_foreign
177250 ± 9% +378.1% 847409 ±109% numa-vmstat.node3.numa_hit
177245 ± 9% +378.1% 847405 ±109% numa-vmstat.node3.numa_local
88775 ± 4% -7.1% 82488 ± 4% numa-vmstat.node3.numa_miss
0.00 ± 0% +3.7e+09% 36.92 ± 38% sched_debug.cfs_rq:/.MIN_vruntime.avg
0.00 ± 0% +3.6e+11% 3602 ± 15% sched_debug.cfs_rq:/.MIN_vruntime.max
0.00 ± 0% +1.8e+25% 345.99 ± 12% sched_debug.cfs_rq:/.MIN_vruntime.stddev
191.05 ± 2% +9.4% 208.96 ± 4% sched_debug.cfs_rq:/.load_avg.avg
573.00 ± 16% +53.3% 878.22 ± 11% sched_debug.cfs_rq:/.load_avg.max
49.42 ± 14% +51.3% 74.77 ± 9% sched_debug.cfs_rq:/.load_avg.stddev
0.00 ± 0% +3.7e+09% 36.92 ± 38% sched_debug.cfs_rq:/.max_vruntime.avg
0.00 ± 0% +3.6e+11% 3602 ± 15% sched_debug.cfs_rq:/.max_vruntime.max
0.00 ± 0% +1.8e+25% 345.99 ± 12% sched_debug.cfs_rq:/.max_vruntime.stddev
849009 ± 9% -36.3% 540564 ± 25% sched_debug.cfs_rq:/.min_vruntime.max
69571 ± 9% -36.5% 44149 ± 26% sched_debug.cfs_rq:/.min_vruntime.stddev
0.19 ± 40% -57.0% 0.08 ± 89% sched_debug.cfs_rq:/.nr_spread_over.stddev
2.85 ± 29% +71.1% 4.87 ± 11% sched_debug.cfs_rq:/.runnable_load_avg.avg
313.78 ± 36% +96.8% 617.61 ± 14% sched_debug.cfs_rq:/.runnable_load_avg.max
27.10 ± 35% +90.5% 51.62 ± 13% sched_debug.cfs_rq:/.runnable_load_avg.stddev
813659 ± 9% -38.0% 504715 ± 27% sched_debug.cfs_rq:/.spread0.max
69571 ± 9% -36.5% 44151 ± 26% sched_debug.cfs_rq:/.spread0.stddev
65.19 ± 9% -62.2% 24.62 ± 27% sched_debug.cpu.clock.stddev
65.19 ± 9% -62.2% 24.62 ± 27% sched_debug.cpu.clock_task.stddev
2.84 ± 29% +70.4% 4.85 ± 11% sched_debug.cpu.cpu_load[0].avg
313.78 ± 36% +96.8% 617.61 ± 14% sched_debug.cpu.cpu_load[0].max
27.10 ± 35% +90.5% 51.61 ± 13% sched_debug.cpu.cpu_load[0].stddev
3.00 ± 34% +73.6% 5.21 ± 7% sched_debug.cpu.cpu_load[1].avg
312.39 ± 36% +102.1% 631.44 ± 12% sched_debug.cpu.cpu_load[1].max
27.24 ± 36% +94.8% 53.07 ± 11% sched_debug.cpu.cpu_load[1].stddev
2.88 ± 34% +78.3% 5.13 ± 8% sched_debug.cpu.cpu_load[2].avg
310.06 ± 37% +104.0% 632.61 ± 12% sched_debug.cpu.cpu_load[2].max
26.75 ± 36% +98.8% 53.18 ± 11% sched_debug.cpu.cpu_load[2].stddev
2.68 ± 34% +86.8% 5.01 ± 9% sched_debug.cpu.cpu_load[3].avg
303.11 ± 37% +110.0% 636.39 ± 12% sched_debug.cpu.cpu_load[3].max
25.82 ± 36% +106.2% 53.24 ± 11% sched_debug.cpu.cpu_load[3].stddev
2.44 ± 33% +99.3% 4.87 ± 9% sched_debug.cpu.cpu_load[4].avg
289.67 ± 36% +119.5% 635.78 ± 12% sched_debug.cpu.cpu_load[4].max
24.41 ± 35% +117.4% 53.07 ± 11% sched_debug.cpu.cpu_load[4].stddev
0.00 ± 12% -54.4% 0.00 ± 19% sched_debug.cpu.next_balance.stddev
44337 ± 46% +66.1% 73663 ± 4% sched_debug.cpu.nr_switches.max
5304 ± 31% +34.0% 7105 ± 5% sched_debug.cpu.nr_switches.stddev
-21.06 ±-17% -38.5% -12.94 ±-35% sched_debug.cpu.nr_uninterruptible.min
20897 ± 48% +72.1% 35970 ± 4% sched_debug.cpu.sched_goidle.max
2524 ± 33% +37.8% 3478 ± 5% sched_debug.cpu.sched_goidle.stddev
21067 ± 45% +88.3% 39667 ± 11% sched_debug.cpu.ttwu_count.max
2771 ± 34% +39.0% 3851 ± 8% sched_debug.cpu.ttwu_count.stddev
10886 ± 41% +117.4% 23669 ± 14% sched_debug.cpu.ttwu_local.max
1101 ± 31% +84.7% 2033 ± 15% sched_debug.cpu.ttwu_local.stddev



perf-stat.page-faults

6e+06 ++------------------------------------------------------------------+
| |
5e+06 **.****.***.****.***.****.***.****.****.***.****.***.****.***.** *.**
OO OO O OOO OOOO OOO OOOO OOO OOOO OOO OOO OO O O : : |
| O O O O : : |
4e+06 ++ : : |
| : : |
3e+06 ++ : : |
| :: |
2e+06 ++ : |
| : |
| : |
1e+06 ++ : |
| : |
0 ++-----------------------------------------------O--------------*---+


perf-stat.minor-faults

6e+06 ++------------------------------------------------------------------+
| |
5e+06 **.****.***.****.***.****.***.****.****.***.****.***.****.***.** *.**
OO OO O OOO OOOO OOO OOOO OOO OOOO OOO OOO OO O O : : |
| O O O O : : |
4e+06 ++ : : |
| : : |
3e+06 ++ : : |
| :: |
2e+06 ++ : |
| : |
| : |
1e+06 ++ : |
| : |
0 ++-----------------------------------------------O--------------*---+


aim9.shared_memory.ops_per_sec

1.6e+06 ++----------------------------------------------------------------+
** .** .*** .** *.* **. ***.****.****.* **. ** .** *.****.* * ***
1.4e+06 ++* ** * * * * * * * * *: : |
OOO OOOO OOOO OOOO OOOO OOOO OOOO OOOO OOOO OOO OO : : |
1.2e+06 ++ : : |
1e+06 ++ : : |
| : : |
800000 ++ : : |
| :: |
600000 ++ :: |
400000 ++ :: |
| : |
200000 ++ : |
| : |
0 ++---------------------------------------------O--------------*---+


aim9.time.minor_page_faults

4.5e+06 ++-----------------------*----***-**-------------------*-*-*-*--***
***.****.****.****.****.* **.* **.****.****.****.* * *: : |
4e+06 OOO OOOO OOOO OOOO OOOO OOOO OOOO OOOO OOOO OOO OO : : |
3.5e+06 ++ : : |
| : : |
3e+06 ++ : : |
2.5e+06 ++ : : |
| :: |
2e+06 ++ :: |
1.5e+06 ++ :: |
| :: |
1e+06 ++ : |
500000 ++ : |
| : |
0 ++---------------------------------------------O--------------*---+

[*] bisect-good sample
[O] bisect-bad sample





Thanks,
Xiaolong


Attachments:
config-4.8.0-11974-g5864a2f (150.06 kB)
job-script (6.36 kB)
job.yaml (3.99 kB)
reproduce (102.00 B)

2016-10-19 04:38:44

by Manfred Spraul

Subject: [PATCH 1/2] ipc/sem.c: Avoid using spin_unlock_wait()

a) The ACQUIRE in spin_lock() applies to the read, not to the store,
at least for powerpc. This forces us to add an smp_mb() into the fast
path.

b) The memory barrier provided by spin_unlock_wait() is right now
arch dependent.

Therefore: Use spin_lock()/spin_unlock() instead of spin_unlock_wait().

Advantage: faster single-op semop() calls, observed +8.9% on
x86 (the other solution would be arch dependencies in ipc/sem).

Disadvantage: slower complex op semop calls, if (and only if)
there are no sleeping operations.

The next patch adds hysteresis; this further reduces the
probability that the slow path is used.

Signed-off-by: Manfred Spraul <[email protected]>
---
ipc/sem.c | 25 +++----------------------
1 file changed, 3 insertions(+), 22 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 5e318c5..d5f2710 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -280,24 +280,13 @@ static void complexmode_enter(struct sem_array *sma)
return;
}

- /* We need a full barrier after seting complex_mode:
- * The write to complex_mode must be visible
- * before we read the first sem->lock spinlock state.
- */
- smp_store_mb(sma->complex_mode, true);
+ sma->complex_mode = true;

for (i = 0; i < sma->sem_nsems; i++) {
sem = sma->sem_base + i;
- spin_unlock_wait(&sem->lock);
+ spin_lock(&sem->lock);
+ spin_unlock(&sem->lock);
}
- /*
- * spin_unlock_wait() is not a memory barriers, it is only a
- * control barrier. The code must pair with spin_unlock(&sem->lock),
- * thus just the control barrier is insufficient.
- *
- * smp_rmb() is sufficient, as writes cannot pass the control barrier.
- */
- smp_rmb();
}

/*
@@ -363,14 +352,6 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
*/
spin_lock(&sem->lock);

- /*
- * See 51d7d5205d33
- * ("powerpc: Add smp_mb() to arch_spin_is_locked()"):
- * A full barrier is required: the write of sem->lock
- * must be visible before the read is executed
- */
- smp_mb();
-
if (!smp_load_acquire(&sma->complex_mode)) {
/* fast path successful! */
return sops->sem_num;
--
2.7.4
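
The fast path kept above still relies on smp_load_acquire(&sma->complex_mode), which pairs with the smp_store_release() in complexmode_tryleave(). For readers less familiar with that pairing, here is a minimal userspace sketch of the release/acquire idea, using C11 atomics and pthreads instead of the kernel primitives. The names (complex_op, simple_op, shared_state) are illustrative only, and it models just the ordering guarantee, not the sem_lock() logic:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool complex_mode = true;  /* stands in for sma->complex_mode */
static int shared_state;                 /* data written while in complex mode */

static void *complex_op(void *arg)
{
	(void)arg;
	shared_state = 42;                            /* plain writes ...             */
	atomic_store_explicit(&complex_mode, false,
			      memory_order_release);  /* ... published by the release */
	return NULL;
}

static void *simple_op(void *arg)
{
	(void)arg;
	/* Acquire load, pairs with the release store above. */
	while (atomic_load_explicit(&complex_mode, memory_order_acquire))
		;                                     /* wait until complex mode ends */
	printf("simple op sees shared_state=%d\n", shared_state);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, simple_op, NULL);
	pthread_create(&t2, NULL, complex_op, NULL);
	pthread_join(t2, NULL);
	pthread_join(t1, NULL);
	return 0;
}

Once simple_op() observes complex_mode == false through the acquire load, everything complex_op() wrote before its release store (here shared_state = 42) is guaranteed to be visible, which is exactly the property the fast path needs without an extra smp_mb().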

2016-10-19 04:38:41

by Manfred Spraul

Subject: Re: [lkp] [ipc/sem.c] 5864a2fd30: aim9.shared_memory.ops_per_sec -13.0%

Hi,

as discussed before:
The root cause for the performance regression is the smp_mb() that was
added into the fast path.

I see two options:
1) switch to full spin_lock()/spin_unlock() for the rare codepath,
then the fast path doesn't need the smp_mb() anymore.

2) confirm that no arch needs the smp_mb(), then remove it.
- powerpc is ok after commit
6262db7c088b ("powerpc/spinlock: Fix spin_unlock_wait()")
- arm is ok after commit
d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")
- x86 is ok after commit
2c6100227116 ("locking/qspinlock: Fix spin_unlock_wait() some more")
- for the remaining SMP architectures, I don't have a status.

I would prefer approach 1:
The memory ordering provided by spin_lock()/spin_unlock() is clear.

Thus:
Attached are patches for approach 1:

- Patch 1 replaces spin_unlock_wait() with spin_lock()/spin_unlock() and
removes all memory barriers that are then unnecessary.

- Patch 2 adds the hysteresis code: This makes the rare codepath
extremely rare.
It also corrects some wrong comments, e.g. regarding switching
from global lock to per-sem lock (we "must" switch, not that we "can"
switch, as written right now).

The patches passed stress-testing.

What do you think?
My initial idea was to aim for 4.10, then we have more time to decide.

--
Manfred
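
As background for the "fast path" vs. "rare codepath" wording above: in ipc/sem.c a "simple" operation is a semop() call with a single sembuf, which is eligible for the per-semaphore lock, while a call that changes several semaphores atomically is a "complex" operation and needs the global lock. A small, hypothetical userspace illustration of the two call shapes (error handling reduced to perror, names chosen for clarity):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

int main(void)
{
	/* Array of 4 semaphores, private to this process. */
	int semid = semget(IPC_PRIVATE, 4, IPC_CREAT | 0600);
	if (semid < 0) {
		perror("semget");
		return 1;
	}

	/* Simple op: one sembuf, one semaphore -> per-semaphore lock candidate. */
	struct sembuf simple = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
	if (semop(semid, &simple, 1) < 0)
		perror("semop (simple)");

	/* Complex op: two semaphores changed atomically -> global lock path. */
	struct sembuf multi[2] = {
		{ .sem_num = 1, .sem_op = 1, .sem_flg = 0 },
		{ .sem_num = 2, .sem_op = 1, .sem_flg = 0 },
	};
	if (semop(semid, multi, 2) < 0)
		perror("semop (complex)");

	semctl(semid, 0, IPC_RMID);
	return 0;
}

The regression report's perf profile (the SYSC_semtimedop self-cycles increase) points at the simple-op fast path, which is where the added smp_mb() sits.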

2016-10-19 04:39:05

by Manfred Spraul

Subject: [PATCH 2/2] ipc/sem: Add hysteresis.

sysv sem has two lock modes: one with per-semaphore locks, and one
with a single global lock for the whole array.
When switching from the per-semaphore locks to the global lock, all
per-semaphore locks must be scanned for ongoing operations.

The patch adds hysteresis for switching from the global lock to the
per-semaphore locks. This reduces how often the per-semaphore locks
must be scanned.

Compared to the initial patch, this is a simplified solution:
Setting USE_GLOBAL_LOCK_HYSTERESIS to 1 restores the current behavior.

In theory, a workload with exactly 10 simple sops and then
one complex op now scales a bit worse, but this is pure theory:
If there is concurrency, it won't be exactly 10:1:10:1:10:1:...
If there is no concurrency, then there is no need for scalability.

Signed-off-by: Manfred Spraul <[email protected]>
---
include/linux/sem.h | 2 +-
ipc/sem.c | 86 +++++++++++++++++++++++++++++++++++++----------------
2 files changed, 62 insertions(+), 26 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index d0efd6e..4fc222f 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,7 +21,7 @@ struct sem_array {
struct list_head list_id; /* undo requests on this array */
int sem_nsems; /* no. of semaphores in array */
int complex_count; /* pending complex operations */
- bool complex_mode; /* no parallel simple ops */
+ unsigned int use_global_lock;/* >0: global lock required */
};

#ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index 94df1bf..e84703d 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -161,22 +161,42 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
#define SEMOPM_FAST 64 /* ~ 372 bytes on stack */

/*
+ * Switching from the mode suitable for simple ops
+ * to the mode for complex ops is costly. Therefore:
+ * use some hysteresis
+ */
+#define USE_GLOBAL_LOCK_HYSTERESIS 10
+
+/*
* Locking:
* a) global sem_lock() for read/write
* sem_undo.id_next,
* sem_array.complex_count,
- * sem_array.complex_mode
* sem_array.pending{_alter,_const},
* sem_array.sem_undo
*
* b) global or semaphore sem_lock() for read/write:
* sem_array.sem_base[i].pending_{const,alter}:
- * sem_array.complex_mode (for read)
*
* c) special:
* sem_undo_list.list_proc:
* * undo_list->lock for write
* * rcu for read
+ * use_global_lock:
+ * * global sem_lock() for write
+ * * either local or global sem_lock() for read.
+ *
+ * Memory ordering:
+ * Most ordering is enforced by using spin_lock() and spin_unlock().
+ * The special case is use_global_lock:
+ * Setting it from non-zero to 0 is a RELEASE, this is ensured by
+ * using smp_store_release().
+ * Testing if it is non-zero is an ACQUIRE, this is ensured by using
+ * smp_load_acquire().
+ * Setting it from 0 to non-zero must be ordered with regards to
+ * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
+ * is inside a spin_lock() and after a write from 0 to non-zero a
+ * spin_lock()+spin_unlock() is done.
*/

#define sc_semmsl sem_ctls[0]
@@ -275,12 +295,16 @@ static void complexmode_enter(struct sem_array *sma)
int i;
struct sem *sem;

- if (sma->complex_mode) {
- /* We are already in complex_mode. Nothing to do */
+ if (sma->use_global_lock > 0) {
+ /*
+ * We are already in global lock mode.
+ * Nothing to do, just reset the
+ * counter until we return to simple mode.
+ */
+ sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;
return;
}
-
- sma->complex_mode = true;
+ sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;

for (i = 0; i < sma->sem_nsems; i++) {
sem = sma->sem_base + i;
@@ -301,13 +325,17 @@ static void complexmode_tryleave(struct sem_array *sma)
*/
return;
}
- /*
- * Immediately after setting complex_mode to false,
- * a simple op can start. Thus: all memory writes
- * performed by the current operation must be visible
- * before we set complex_mode to false.
- */
- smp_store_release(&sma->complex_mode, false);
+ if (sma->use_global_lock == 1) {
+ /*
+ * Immediately after setting use_global_lock to 0,
+ * a simple op can start. Thus: all memory writes
+ * performed by the current operation must be visible
+ * before we set use_global_lock to 0.
+ */
+ smp_store_release(&sma->use_global_lock, 0);
+ } else {
+ sma->use_global_lock--;
+ }
}

#define SEM_GLOBAL_LOCK (-1)
@@ -337,22 +365,23 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
* Optimized locking is possible if no complex operation
* is either enqueued or processed right now.
*
- * Both facts are tracked by complex_mode.
+ * Both facts are tracked by use_global_mode.
*/
sem = sma->sem_base + sops->sem_num;

/*
- * Initial check for complex_mode. Just an optimization,
+ * Initial check for use_global_lock. Just an optimization,
* no locking, no memory barrier.
*/
- if (!sma->complex_mode) {
+ if (!sma->use_global_lock) {
/*
* It appears that no complex operation is around.
* Acquire the per-semaphore lock.
*/
spin_lock(&sem->lock);

- if (!smp_load_acquire(&sma->complex_mode)) {
+ /* pairs with smp_store_release() */
+ if (!smp_load_acquire(&sma->use_global_lock)) {
/* fast path successful! */
return sops->sem_num;
}
@@ -362,19 +391,26 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
/* slow path: acquire the full lock */
ipc_lock_object(&sma->sem_perm);

- if (sma->complex_count == 0) {
- /* False alarm:
- * There is no complex operation, thus we can switch
- * back to the fast path.
+ if (sma->use_global_lock == 0) {
+ /*
+ * The use_global_lock mode ended while we waited for
+ * sma->sem_perm.lock. Thus we must switch to locking
+ * with sem->lock.
+ * Unlike in the fast path, there is no need to recheck
+ * sma->use_global_lock after we have acquired sem->lock:
+ * We own sma->sem_perm.lock, thus use_global_lock cannot
+ * change.
*/
spin_lock(&sem->lock);
+
ipc_unlock_object(&sma->sem_perm);
return sops->sem_num;
} else {
- /* Not a false alarm, thus complete the sequence for a
- * full lock.
+ /*
+ * Not a false alarm, thus continue to use the global lock
+ * mode. No need for complexmode_enter(), this was done by
+ * the caller that has set use_global_mode to non-zero.
*/
- complexmode_enter(sma);
return SEM_GLOBAL_LOCK;
}
}
@@ -535,7 +571,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
}

sma->complex_count = 0;
- sma->complex_mode = true; /* dropped by sem_unlock below */
+ sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;
INIT_LIST_HEAD(&sma->pending_alter);
INIT_LIST_HEAD(&sma->pending_const);
INIT_LIST_HEAD(&sma->list_id);
--
2.7.4
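
Stripped of the locking and the rest of struct sem_array, the hysteresis above boils down to a counter: each complex operation (re)arms it to USE_GLOBAL_LOCK_HYSTERESIS, each completed global-lock section counts it down, and only the final 1 -> 0 step (an smp_store_release() in the real code) re-enables the per-semaphore fast path. Below is a standalone model of just that counter, with the complex_count check and all locking omitted; the *_model names are illustrative only, not kernel code:

#include <stdio.h>

#define USE_GLOBAL_LOCK_HYSTERESIS 10

struct sem_array_model {
	unsigned int use_global_lock;	/* >0: global lock required */
};

/* A complex operation needs the global lock: (re)arm the counter. */
static void complexmode_enter_model(struct sem_array_model *sma)
{
	sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;
}

/* A global-lock section ends: count down; only the 1 -> 0 step
 * re-enables the per-semaphore fast path. */
static void complexmode_tryleave_model(struct sem_array_model *sma)
{
	if (sma->use_global_lock == 1)
		sma->use_global_lock = 0;	/* smp_store_release() in the kernel */
	else
		sma->use_global_lock--;
}

int main(void)
{
	struct sem_array_model sma = { .use_global_lock = 0 };
	int i;

	complexmode_enter_model(&sma);
	for (i = 0; i < USE_GLOBAL_LOCK_HYSTERESIS; i++) {
		complexmode_tryleave_model(&sma);
		printf("after global-lock section %2d: use_global_lock=%u\n",
		       i + 1, sma.use_global_lock);
	}
	return 0;
}

A complex operation thus keeps the array in global-lock mode for the next ten lock/unlock cycles, so closely spaced complex operations no longer force a fresh scan of every per-semaphore lock each time the array switches back into global-lock mode.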

2016-10-20 00:21:05

by Andrew Morton

Subject: Re: [lkp] [ipc/sem.c] 5864a2fd30: aim9.shared_memory.ops_per_sec -13.0%

On Wed, 19 Oct 2016 06:38:14 +0200 Manfred Spraul <[email protected]> wrote:

> Hi,
>
> as discussed before:
> The root cause for the performance regression is the smp_mb() that was
> added into the fast path.
>
> I see two options:
> 1) switch to full spin_lock()/spin_unlock() for the rare codepath,
> then the fast path doesn't need the smp_mb() anymore.
>
> 2) confirm that no arch needs the smp_mb(), then remove it.
> - powerpc is ok after commit
> 6262db7c088b ("powerpc/spinlock: Fix spin_unlock_wait()")
> - arm is ok after commit
> d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")
> - x86 is ok after commit
> 2c6100227116 ("locking/qspinlock: Fix spin_unlock_wait() some more")
> - for the remaining SMP architectures, I don't have a status.
>
> I would prefer approach 1:
> The memory ordering provided by spin_lock()/spin_unlock() is clear.
>
> Thus:
> Attached are patches for approach 1:
>
> - Patch 1 replaces spin_unlock_wait() with spin_lock()/spin_unlock() and
> removes all memory barriers that are then unnecessary.
>
> - Patch 2 adds the hysteresis code: This makes the rare codepath
> extremely rare.
> It also corrects some wrong comments, e.g. regarding switching
> from global lock to per-sem lock (we "must" switch, not that we "can"
> switch, as written right now).
>
> The patches passed stress-testing.
>
> What do you think?

Are you able to confirm that the performance issues are fixed?

> My initial idea was to aim for 4.10, then we have more time to decide.

I suppose I can slip these into -next and see what the effect is upon
the Intel test results. But a) I don't know if they test linux-next(?)
and b) I don't know where the test results are published, assuming they
are published(?).

2016-10-20 04:46:09

by Manfred Spraul

Subject: Re: [lkp] [ipc/sem.c] 5864a2fd30: aim9.shared_memory.ops_per_sec -13.0%

On 10/20/2016 02:21 AM, Andrew Morton wrote:
> On Wed, 19 Oct 2016 06:38:14 +0200 Manfred Spraul <[email protected]> wrote:
>
>> Hi,
>>
>> as discussed before:
>> The root cause for the performance regression is the smp_mb() that was
>> added into the fast path.
>>
>> I see two options:
>> 1) switch to full spin_lock()/spin_unlock() for the rare codepath,
>> then the fast path doesn't need the smp_mb() anymore.
>>
>> 2) confirm that no arch needs the smp_mb(), then remove it.
>> - powerpc is ok after commit
>> 6262db7c088b ("powerpc/spinlock: Fix spin_unlock_wait()")
>> - arm is ok after commit
>> d86b8da04dfa ("arm64: spinlock: serialise spin_unlock_wait against concurrent lockers")
>> - x86 is ok after commit
>> 2c6100227116 ("locking/qspinlock: Fix spin_unlock_wait() some more")
>> - for the remaining SMP architectures, I don't have a status.
>>
>> I would prefer approach 1:
>> The memory ordering provided by spin_lock()/spin_unlock() is clear.
>>
>> Thus:
>> Attached are patches for approach 1:
>>
>> - Patch 1 replaces spin_unlock_wait() with spin_lock()/spin_unlock() and
>> removes all memory barriers that are then unnecessary.
>>
>> - Patch 2 adds the hysteresis code: This makes the rare codepath
>> extremely rare.
>> It also corrects some wrong comments, e.g. regarding switching
>> from global lock to per-sem lock (we "must" switch, not that we "can"
>> switch, as written right now).
>>
>> The patches passed stress-testing.
>>
>> What do you think?
> Are you able to confirm that the performance issues are fixed?
I don't know why we are now at -13%.
The previous -9% regression was resolved by the patches:

http://marc.info/?l=linux-kernel&m=147608082017551&w=2
>> My initial idea was to aim for 4.10, then we have more time to decide.
> I suppose I can slip these into -next and see what the effect is upon
> the Intel test results. But a) I don't know if they test linux-next(?)
> and b) I don't know where the test results are published, assuming they
> are published(?).

--

Manfred