From: Ma Ling <[email protected]>
Hi all,
Wire latency (RC delay) dominates modern computer performance.
Conventional serialized work causes severe cache-line ping-pong,
so the process spends a lot of time and power to complete,
especially on multi-core platforms.

However, if the serialized work is sent to one core and executed there
when lock contention happens, much time and power can be saved,
because all shared data stays in the private cache of one core.
We call this mechanism Acceleration from Lock Integration
(ali spinlock).

Usually, when requests are queued, we have to wait for the work to be
submitted one by one. To improve overall throughput further, we
introduce LOCK_FREE (ALI_LOCK_FREE in the code): once a request has been
sent to the lock owner, the requester may do other work in parallel, and
the ali_spin_is_completed() function tells it whether the work has
been completed.

The new code is based on qspinlock and implements Lock Integration.
It improves performance by up to 3X on an Intel platform with 72 cores
(18x2HTx2S HSW) and by 2X on an ARM platform with 96 cores. Additional
trivial changes to Makefile/Kconfig enable compiling this feature on
the x86 platform.
(We would be glad to run further experiments if you need them.)
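
To illustrate the hand-off described above, here is a minimal user-space sketch using C11 atomics. It is not the kernel code in this patch: the names model_lock, model_req, model_submit, model_is_completed and MODEL_LOCK_FREE are hypothetical stand-ins, the control flow is simplified, and default sequentially consistent atomics replace the kernel primitives. The demo queues a second request from inside the first callback, so the owner finds work queued behind it and runs it on the same core.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_LOCK_FREE 1	/* mirrors ALI_LOCK_FREE: do not wait for completion */

/* Hypothetical user-space stand-in for struct ali_spinlock_info. */
struct model_req {
	_Atomic(struct model_req *) next;
	atomic_int pending;		/* 1 until the owner has run fn */
	int flags;
	void (*fn)(void *);
	void *para;
};

/* Hypothetical stand-in for struct ali_spinlock: tail of the request queue. */
struct model_lock {
	_Atomic(struct model_req *) tail;
};

static int model_is_completed(struct model_req *r)
{
	return atomic_load(&r->pending) == 0;
}

static void model_submit(struct model_lock *l, struct model_req *r)
{
	struct model_req *prev, *cur, *next, *expected;

	atomic_store(&r->next, NULL);
	atomic_store(&r->pending, 1);
	prev = atomic_exchange(&l->tail, r);

	if (prev) {
		/* Someone else owns the lock: hand the work over. */
		atomic_store(&prev->next, r);
		if (r->flags & MODEL_LOCK_FREE)
			return;		/* caller polls model_is_completed() */
		while (!model_is_completed(r))
			;		/* spin until the owner has run our fn */
		return;
	}

	/* We are the owner: run our work and everything queued behind it. */
	for (cur = r;; cur = next) {
		cur->fn(cur->para);
		expected = cur;
		if (atomic_compare_exchange_strong(&l->tail, &expected, NULL)) {
			atomic_store(&cur->pending, 0);	/* queue drained */
			return;
		}
		/* A successor exists; wait until it has linked itself. */
		while (!(next = atomic_load(&cur->next)))
			;
		atomic_store(&cur->pending, 0);	/* cur may be reused from here */
	}
}

struct demo {
	struct model_lock lock;
	struct model_req second;
	int counter;			/* shared data, touched only by the owner */
};

static void bump(void *p)
{
	((struct demo *)p)->counter++;
}

static void bump_and_queue(void *p)
{
	struct demo *d = p;

	d->counter++;
	/* The lock is held right now, so this request is queued, not run. */
	d->second.fn = bump;
	d->second.para = d;
	d->second.flags = MODEL_LOCK_FREE;
	model_submit(&d->lock, &d->second);
}

/* Returns the number of callbacks that ran under the lock. */
static int demo_run(void)
{
	struct demo d = { .counter = 0 };
	struct model_req first = { .fn = bump_and_queue, .para = &d };

	atomic_init(&d.lock.tail, NULL);
	model_submit(&d.lock, &first);
	assert(model_is_completed(&d.second));
	return d.counter;
}
```

In real use the requests would come from different threads; the single-threaded demo only makes the queued path deterministic. Like the patch, the sketch lets the owner keep serving new requests for as long as they arrive.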
Happy New Year 2016!
Ling
Signed-off-by: Ma Ling <[email protected]>
---
arch/x86/Kconfig | 1 +
include/linux/alispinlock.h | 41 ++++++++++++++++++
kernel/Kconfig.locks | 7 +++
kernel/locking/Makefile | 1 +
kernel/locking/alispinlock.c | 97 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 147 insertions(+), 0 deletions(-)
create mode 100644 include/linux/alispinlock.h
create mode 100644 kernel/locking/alispinlock.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index db3622f..47d9277 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -42,6 +42,7 @@ config X86
select ARCH_USE_CMPXCHG_LOCKREF if X86_64
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
+ select ARCH_USE_ALI_SPINLOCKS
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANT_FRAME_POINTERS
diff --git a/include/linux/alispinlock.h b/include/linux/alispinlock.h
new file mode 100644
index 0000000..5207c41
--- /dev/null
+++ b/include/linux/alispinlock.h
@@ -0,0 +1,41 @@
+#ifndef ALI_SPINLOCK_H
+#define ALI_SPINLOCK_H
+/*
+ * Acceleration from Lock Integration
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2015 Alibaba Group.
+ *
+ * Authors: Ma Ling <[email protected]>
+ *
+ */
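+typedef struct ali_spinlock {
+	void *lock_p;			/* tail of the request queue, NULL if free */
+} ali_spinlock_t;
+
+struct ali_spinlock_info {
+	struct ali_spinlock_info *next;	/* next queued request */
+	int flags;			/* ALI_LOCK_FREE or 0 */
+	int locked;			/* 1 until fn has been run */
+	void (*fn)(void *);		/* serialized work */
+	void *para;			/* argument passed to fn */
+};
+
+static __always_inline int ali_spin_is_completed(struct ali_spinlock_info *ali)
+{
+	return (READ_ONCE(ali->locked) == 0);
+}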
+
+void alispinlock(struct ali_spinlock *lock, struct ali_spinlock_info *ali);
+
+#define ALI_LOCK_FREE 1
+#endif /* ALI_SPINLOCK_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index ebdb004..5130c63 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -235,6 +235,13 @@ config LOCK_SPIN_ON_OWNER
def_bool y
depends on MUTEX_SPIN_ON_OWNER || RWSEM_SPIN_ON_OWNER
+config ARCH_USE_ALI_SPINLOCKS
+ bool
+
+config ALI_SPINLOCKS
+ def_bool y if ARCH_USE_ALI_SPINLOCKS
+ depends on SMP
+
config ARCH_USE_QUEUED_SPINLOCKS
bool
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 8e96f6c..a4241f8 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -13,6 +13,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep.o
ifeq ($(CONFIG_PROC_FS),y)
obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
endif
+obj-$(CONFIG_ALI_SPINLOCKS) += alispinlock.o
obj-$(CONFIG_SMP) += spinlock.o
obj-$(CONFIG_LOCK_SPIN_ON_OWNER) += osq_lock.o
obj-$(CONFIG_SMP) += lglock.o
diff --git a/kernel/locking/alispinlock.c b/kernel/locking/alispinlock.c
new file mode 100644
index 0000000..43078b4
--- /dev/null
+++ b/kernel/locking/alispinlock.c
@@ -0,0 +1,97 @@
+/*
+ * Acceleration from Lock Integration
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2015 Alibaba Group.
+ *
+ * Authors: Ma Ling <[email protected]>
+ *
+ */
+#include <linux/init.h>
+#include <linux/delay.h>
+#include <linux/spinlock.h>
+#include <linux/alispinlock.h>
+/*
+ * Wire latency (RC delay) dominates modern computer performance.
+ * Conventional serialized work causes severe cache-line ping-pong,
+ * so the process spends a lot of time and power to complete,
+ * especially on multi-core platforms.
+ *
+ * However, if the serialized work is sent to one core and executed
+ * there when lock contention happens, much time and power is saved,
+ * because all shared data stays in the private cache of one core.
+ * We call this mechanism Acceleration from Lock Integration
+ * (ali spinlock).
+ *
+ * Usually, when requests are queued, we have to wait for the work to
+ * be submitted one by one. To improve overall throughput further, we
+ * introduce ALI_LOCK_FREE: once a request has been sent to the lock
+ * owner, the requester may do other work in parallel, and
+ * ali_spin_is_completed() tells it whether the work has completed.
+ *
+ */
+void alispinlock(struct ali_spinlock *lock, struct ali_spinlock_info *ali)
+{
+	struct ali_spinlock_info *next, *old;
+
+	ali->next = NULL;
+	ali->locked = 1;
+	old = xchg(&lock->lock_p, ali);
+
+	/* Non-NULL: the lock has an owner, so queue our work behind it */
+	if (old) {
+		WRITE_ONCE(old->next, ali);
+		if (ali->flags & ALI_LOCK_FREE)
+			return;
+		while (READ_ONCE(ali->locked))
+			cpu_relax_lowlatency();
+		return;
+	}
+	old = READ_ONCE(lock->lock_p);
+
+	/* Handle all pending work */
+repeat:
+	if (old == ali)
+		goto end;
+
+	while (!(next = READ_ONCE(ali->next)))
+		cpu_relax();
+
+	ali->fn(ali->para);
+	ali->locked = 0;
+
+	if (old != next) {
+		while (!(ali = READ_ONCE(next->next)))
+			cpu_relax();
+		next->fn(next->para);
+		next->locked = 0;
+		goto repeat;
+
+	} else
+		ali = next;
+end:
+	ali->fn(ali->para);
+	/* If we are the last one, clear lock and return */
+	old = cmpxchg(&lock->lock_p, old, NULL);
+
+	if (old != ali) {
+		/* There is still work to do */
+		while (!(next = READ_ONCE(ali->next)))
+			cpu_relax();
+		ali->locked = 0;
+		ali = next;
+		goto repeat;
+	}
+
+	ali->locked = 0;
+	return;
+}
--
1.7.1
Hi Longman,
> with some modest increase in performance. That can be hard to justify. Maybe
> you should find other use cases that involve less changes, but still have
> noticeable performance improvement. That will make it easier to be accepted.
The attachment covers another use case for the new lock optimization.
It includes two files: main.c (user-space workload) and
fcntl-lock-opt.patch (kernel patch against 4.3.0-rc4).
(The hardware platform is an Intel E5 2699 V3 with 72 threads: 18 cores * 2 sockets * 2 HT.)
1. When we run a.out (built from main.c) on the original 4.3.0-rc4 kernel,
the average throughput is 1887592 (98% CPU cost from perf top -d1).
2. When we run a.out with fcntl-lock-opt.patch applied,
the average throughput is 5277281 (91% CPU cost from perf top -d1).
So the new mechanism gives us about a 2.8x (5277281 / 1887592) improvement.
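
For anyone who wants to recheck the ratio, the arithmetic can be sketched as below. The helper name speedup and the two macros are made up for this illustration; only the two throughput numbers come from the runs above.

```c
#include <assert.h>

/* Average throughput reported by a.out in the two runs above. */
#define BASELINE_THROUGHPUT 1887592.0	/* original 4.3.0-rc4 */
#define PATCHED_THROUGHPUT  5277281.0	/* with fcntl-lock-opt.patch */

/* Hypothetical helper: speedup of the patched run over the baseline. */
static double speedup(double baseline, double patched)
{
	assert(baseline > 0.0);
	return patched / baseline;
}
```

Calling speedup(BASELINE_THROUGHPUT, PATCHED_THROUGHPUT) gives a value just under 2.8.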
Appreciate your comments.
Thanks
Ling
Is the performance improvement acceptable, or are there more comments on this patch?
Thanks
Ling
2016-04-05 11:44 GMT+08:00 Ling Ma <[email protected]>:
> Hi Longman,
>
>> with some modest increase in performance. That can be hard to justify. Maybe
>> you should find other use cases that involve less changes, but still have
>> noticeable performance improvement. That will make it easier to be accepted.
>
> The attachment is for other use case with the new lock optimization.
> It include two files: main.c (user space workload),
> fcntl-lock-opt.patch (kernel patch on 4.3.0-rc4 version)
> (The hardware platform is on Intel E5 2699 V3, 72 threads (18core *2Socket *2HT)
>
> 1. when we run a.out from main.c on original 4.3.0-rc4 version,
> the average throughput from a.out is 1887592( 98% cpu cost from perf top -d1)
>
> 2. when we run a.out from main.c with the fcntl-lock-opt.patch ,
> the average throughput from a.out is 5277281 (91% cpu cost from perf top -d1)
>
> So we say the new mechanism give us about 2.79x (5277281 / 1887592) improvement.
>
> Appreciate your comments.
>
> Thanks
> Ling