2019-02-07 19:09:06

by Waiman Long

Subject: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

This patchset revamps the current rwsem-xadd implementation to make
it saner and easier to work with. It removes all the architecture
specific assembly code and uses generic C code for all architectures,
which eases maintenance and lets us enhance the code more easily.

This patchset also implements the following 3 new features:

1) Waiter lock handoff
2) Reader optimistic spinning
3) Store write-lock owner in the atomic count (x86-64 only)

Waiter lock handoff is similar to the mechanism currently used in the
mutex code. It ensures that lock starvation won't happen.
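As a rough illustration of the idea (the bit names and layout below are
purely illustrative, not the encoding this patchset actually uses): the
waiter at the head of the wait queue sets a handoff bit once it has
waited long enough, and every other acquirer then has to back off until
that waiter gets the lock.

/*
 * Conceptual sketch of waiter lock handoff. The bit definitions are
 * hypothetical and exist only to show the back-off rule.
 */
#define RWSEM_SKETCH_WRITER	(1UL << 0)	/* hypothetical writer bit */
#define RWSEM_SKETCH_HANDOFF	(1UL << 1)	/* hypothetical handoff bit */

static inline bool sketch_write_trylock(struct rw_semaphore *sem, bool first)
{
	long count = atomic_long_read(&sem->count);

	/*
	 * Once the first waiter has set the handoff bit, optimistic
	 * spinners and new arrivals must fail the trylock so that the
	 * lock can only go to that waiter.
	 */
	if ((count & RWSEM_SKETCH_HANDOFF) && !first)
		return false;

	/* Fail if the lock is still held (ignoring the handoff bit itself). */
	if (count & ~RWSEM_SKETCH_HANDOFF)
		return false;

	/* Take the write lock and clear the handoff bit in one step. */
	return atomic_long_cmpxchg_acquire(&sem->count, count,
					   RWSEM_SKETCH_WRITER) == count;
}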

Reader optimistic spinning enables readers to acquire the lock more
quickly. So workloads that use a mix of readers and writers should
see an increase in performance.

Finally, storing the write-lock owner into the count allows optimistic
spinners to get to the lock holder's task structure more quickly and
eliminates the timing gap where the write lock has been acquired but
the owner isn't known yet. This is important for RT tasks, which are
not allowed to spin on a lock with an unknown owner.
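With the current xadd fast path (see the __down_write_trylock() hunk in
patch 5 further down), the count is updated first and sem->owner is
written afterwards, which is where that gap comes from. A rough sketch
of the difference follows; the merged form is conceptual only and is
not the actual encoding used by patch 15, which also has to reserve
bits of the count word for readers.

/* Today: two separate stores -- the lock first, the owner afterwards. */
static inline bool two_step_write_trylock(struct rw_semaphore *sem)
{
	long tmp = atomic_long_cmpxchg_acquire(&sem->count,
			RWSEM_UNLOCKED_VALUE, RWSEM_ACTIVE_WRITE_BIAS);

	if (tmp != RWSEM_UNLOCKED_VALUE)
		return false;
	rwsem_set_owner(sem);	/* <- gap: locked, owner not yet visible */
	return true;
}

/* Conceptual: one cmpxchg publishes the lock and the owner together. */
static inline bool merged_write_trylock(struct rw_semaphore *sem)
{
	long lockval = (long)current;	/* hypothetical owner-in-count value */

	return atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
					   lockval) == RWSEM_UNLOCKED_VALUE;
}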

Because multiple readers can share the same lock, there is a natural
preference for readers in terms of locking throughput, as more readers
are likely to get through the locking fast path than writers. With
waiter lock handoff, we are not going to starve the writers.

Patches 1-2 rework the qspinlock_stat code into a generic lock event
counting facility that can be used by all architectures and all
locking code.

Patch 3 relocates the rwsem_down_read_failed() and associated functions
to below the optimistic spinning functions.

Patch 4 eliminates all the architecture specific code and uses generic
C code for all architectures.

Patch 5 moves code that manages the owner field closer to the rwsem
lock fast path as it is not needed by the rwsem-spinlock code.

Patch 6 renames rwsem.h to rwsem-xadd.h as it is now specific to
rwsem-xadd.c only.

Patch 7 hides the internal rwsem-xadd functions from the public.

Patch 8 moves the DEBUG_RWSEMS_WARN_ON checks from rwsem.c to
kernel/locking/rwsem-xadd.h and adds some new ones.

Patch 9 enhances the DEBUG_RWSEMS_WARN_ON macro to print out rwsem
internal states that can be useful for debugging purposes.

Patch 10 enables lock event counting in the rwsem code.

Patch 11 implements a new rwsem locking scheme similar to what qrwlock
is currently doing. The write lock is acquired by atomic_cmpxchg()
while the read lock is still acquired by atomic_add(). A rough sketch
of this scheme follows the patch summaries below.

Patch 12 implements lock handoff to prevent lock starvation.

Patch 13 removes rwsem_wake() wakeup optimization as it doesn't work
with lock handoff.

Patch 14 adds some new rwsem owner access helper functions.

Patch 15 merges the write-lock owner task pointer into the count. Only
a 64-bit count has enough space to provide a reasonable number of bits
for the reader count. ARM64 seems to have problems with the current
encoding scheme, so this owner merging is currently limited to x86-64
only.

Patch 16 eliminates redundant computation of the merged owner-count.

Patch 17 reduces the chance of missed optimistic spinning opportunity
because of some race conditions.

Patch 18 makes rwsem_spin_on_owner() return a tri-state value.

Patch 19 enables readers to spin on a writer-owned rwsem.

Patch 20 enables lock waiters to spin on a reader-owned rwsem with a
limited number of tries.

Patch 21 makes reader wakeup wake up all the readers in the wait queue
instead of just those at the front.

Patch 22 disallows RT tasks from spinning on a rwsem with an unknown owner.
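
Here is the promised sketch of the qrwlock-like scheme used by patch 11.
The bit values are made up purely for illustration and do not match the
actual definitions in the patch:

#define SKETCH_WRITER_LOCKED	(1UL << 0)	/* hypothetical writer bit */
#define SKETCH_READER_BIAS	(1UL << 8)	/* hypothetical reader increment */

static inline bool sketch_write_lock_fastpath(struct rw_semaphore *sem)
{
	/* Writer: a single cmpxchg from "completely unlocked". */
	return atomic_long_cmpxchg_acquire(&sem->count, 0,
					   SKETCH_WRITER_LOCKED) == 0;
}

static inline bool sketch_read_lock_fastpath(struct rw_semaphore *sem)
{
	/* Reader: unconditional add, back out if a writer is present. */
	long count = atomic_long_add_return_acquire(SKETCH_READER_BIAS,
						    &sem->count);

	if (count & SKETCH_WRITER_LOCKED) {
		atomic_long_sub(SKETCH_READER_BIAS, &sem->count);
		return false;	/* fall back to the slow path */
	}
	return true;
}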

Eliminating the architecture specific assembly code and using generic
code doesn't seem to have any impact on performance.

Supporting lock handoff does have a minor performance impact on highly
contended rwsems, but it is a price worth paying for preventing lock
starvation.

Reader optimistic spinning is generally good for performance. Of course,
there will be some corner cases where performance may suffer.

Merging owner into count does have a minor performance impact. We can
discuss if this is a feature we want to have in the rwsem code.

There are also some performance data scattered in some of the patches.


Waiman Long (22):
locking/qspinlock_stat: Introduce a generic lockevent counting APIs
locking/lock_events: Make lock_events available for all archs & other
locks
locking/rwsem: Relocate rwsem_down_read_failed()
locking/rwsem: Remove arch specific rwsem files
locking/rwsem: Move owner setting code from rwsem.c to rwsem.h
locking/rwsem: Rename kernel/locking/rwsem.h
locking/rwsem: Move rwsem internal function declarations to
rwsem-xadd.h
locking/rwsem: Add debug check for __down_read*()
locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro
locking/rwsem: Enable lock event counting
locking/rwsem: Implement a new locking scheme
locking/rwsem: Implement lock handoff to prevent lock starvation
locking/rwsem: Remove rwsem_wake() wakeup optimization
locking/rwsem: Add more rwsem owner access helpers
locking/rwsem: Merge owner into count on x86-64
locking/rwsem: Remove redundant computation of writer lock word
locking/rwsem: Recheck owner if it is not on cpu
locking/rwsem: Make rwsem_spin_on_owner() return a tri-state value
locking/rwsem: Enable readers spinning on writer
locking/rwsem: Enable count-based spinning on reader
locking/rwsem: Wake up all readers in wait queue
locking/rwsem: Ensure an RT task will not spin on reader

MAINTAINERS | 1 -
arch/Kconfig | 10 +
arch/alpha/include/asm/rwsem.h | 211 -----------
arch/arm/include/asm/Kbuild | 1 -
arch/arm64/include/asm/Kbuild | 1 -
arch/hexagon/include/asm/Kbuild | 1 -
arch/ia64/include/asm/rwsem.h | 172 ---------
arch/powerpc/include/asm/Kbuild | 1 -
arch/s390/include/asm/Kbuild | 1 -
arch/sh/include/asm/Kbuild | 1 -
arch/sparc/include/asm/Kbuild | 1 -
arch/x86/Kconfig | 8 -
arch/x86/include/asm/rwsem.h | 237 -------------
arch/x86/lib/Makefile | 1 -
arch/x86/lib/rwsem.S | 156 ---------
arch/xtensa/include/asm/Kbuild | 1 -
include/asm-generic/rwsem.h | 140 --------
include/linux/rwsem.h | 11 +-
kernel/locking/Makefile | 1 +
kernel/locking/lock_events.c | 153 ++++++++
kernel/locking/lock_events.h | 55 +++
kernel/locking/lock_events_list.h | 71 ++++
kernel/locking/percpu-rwsem.c | 4 +
kernel/locking/qspinlock.c | 8 +-
kernel/locking/qspinlock_paravirt.h | 19 +-
kernel/locking/qspinlock_stat.h | 242 +++----------
kernel/locking/rwsem-xadd.c | 682 +++++++++++++++++++++---------------
kernel/locking/rwsem-xadd.h | 436 +++++++++++++++++++++++
kernel/locking/rwsem.c | 31 +-
kernel/locking/rwsem.h | 134 -------
30 files changed, 1197 insertions(+), 1594 deletions(-)
delete mode 100644 arch/alpha/include/asm/rwsem.h
delete mode 100644 arch/ia64/include/asm/rwsem.h
delete mode 100644 arch/x86/include/asm/rwsem.h
delete mode 100644 arch/x86/lib/rwsem.S
delete mode 100644 include/asm-generic/rwsem.h
create mode 100644 kernel/locking/lock_events.c
create mode 100644 kernel/locking/lock_events.h
create mode 100644 kernel/locking/lock_events_list.h
create mode 100644 kernel/locking/rwsem-xadd.h
delete mode 100644 kernel/locking/rwsem.h

--
1.8.3.1



2019-02-07 19:09:40

by Waiman Long

Subject: [PATCH-tip 02/22] locking/lock_events: Make lock_events available for all archs & other locks

The QUEUED_LOCK_STAT option to report queued spinlock event counts
was previously available only on the x86 architecture. To make the
locking event counting code more useful, it is now renamed to a more
generic LOCK_EVENT_COUNTS config option. This new option will be
available to all the architectures that use qspinlock at the moment.

Other locking code can now start to use the generic locking event
counting code by including lock_events.h and putting the new locking
event names into the lock_events_list.h header file.
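For example, a client would add its counters to lock_events_list.h and
bump them with the lockevent_*() helpers; the event names below are
hypothetical and only show the flow:

/* kernel/locking/lock_events_list.h -- hypothetical new entries */
LOCK_EVENT(rwsem_sleep_reader)	/* # of times a reader goes to sleep */
LOCK_EVENT(rwsem_sleep_writer)	/* # of times a writer goes to sleep */

/* In the locking code itself */
#include "lock_events.h"

static void hypothetical_reader_slowpath(void)
{
	lockevent_inc(rwsem_sleep_reader);	/* bump the per-cpu counter */
}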

My experience with lock event counting is that it gives valuable insight
into how the locking code works and what can be done to make it better. I
would like to extend this benefit to other locking code like mutex and
rwsem in the near future.

The PV qspinlock specific code will stay in qspinlock_stat.h. The
locking event counters will now reside in the <debugfs>/lock_event_counts
directory.

Signed-off-by: Waiman Long <[email protected]>
---
arch/Kconfig | 10 +++
arch/x86/Kconfig | 8 ---
kernel/locking/Makefile | 1 +
kernel/locking/lock_events.c | 153 ++++++++++++++++++++++++++++++++++++++++
kernel/locking/lock_events.h | 6 +-
kernel/locking/qspinlock_stat.h | 141 ++++--------------------------------
6 files changed, 179 insertions(+), 140 deletions(-)
create mode 100644 kernel/locking/lock_events.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 9f02132..af147c2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -888,6 +888,16 @@ config HAVE_ARCH_PREL32_RELOCATIONS
config ARCH_USE_MEMREMAP_PROT
bool

+config LOCK_EVENT_COUNTS
+ bool "Locking event counts collection"
+ depends on DEBUG_FS
+ depends on QUEUED_SPINLOCKS
+ ---help---
+ Enable light-weight counting of various locking related events
+ in the system with minimal performance impact. This reduces
+ the chance of application behavior change because of timing
+ differences. The counts are reported via debugfs.
+
source "kernel/gcov/Kconfig"

source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a7e7a60..e4f620f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -784,14 +784,6 @@ config PARAVIRT_SPINLOCKS

If you are unsure how to answer this question, answer Y.

-config QUEUED_LOCK_STAT
- bool "Paravirt queued spinlock statistics"
- depends on PARAVIRT_SPINLOCKS && DEBUG_FS
- ---help---
- Enable the collection of statistical data on the slowpath
- behavior of paravirtualized queued spinlocks and report
- them on debugfs.
-
source "arch/x86/xen/Kconfig"

config KVM_GUEST
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 392c7f2..f2257d7 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
obj-$(CONFIG_QUEUED_RWLOCKS) += qrwlock.o
obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o
obj-$(CONFIG_WW_MUTEX_SELFTEST) += test-ww_mutex.o
+obj-$(CONFIG_LOCK_EVENT_COUNTS) += lock_events.o
diff --git a/kernel/locking/lock_events.c b/kernel/locking/lock_events.c
new file mode 100644
index 0000000..71c36d1
--- /dev/null
+++ b/kernel/locking/lock_events.c
@@ -0,0 +1,153 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+
+/*
+ * Collect locking event counts
+ */
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+#include <linux/sched/clock.h>
+#include <linux/fs.h>
+
+#include "lock_events.h"
+
+#undef LOCK_EVENT
+#define LOCK_EVENT(name) [LOCKEVENT_ ## name] = #name,
+
+#define LOCK_EVENTS_DIR "lock_event_counts"
+
+/*
+ * When CONFIG_LOCK_EVENT_COUNTS is enabled, event counts of different
+ * types of locks will be reported under the <debugfs>/lock_event_counts/
+ * directory. See lock_events_list.h for the list of available locking
+ * events.
+ *
+ * Writing to the special ".reset_counts" file will reset all the above
+ * locking event counts. This is a very slow operation and so should not
+ * be done frequently.
+ *
+ * These event counts are implemented as per-cpu variables which are
+ * summed and computed whenever the corresponding debugfs files are read. This
+ * minimizes added overhead making the counts usable even in a production
+ * environment.
+ */
+static const char * const lockevent_names[lockevent_num + 1] = {
+
+#include "lock_events_list.h"
+
+ [LOCKEVENT_reset_cnts] = ".reset_counts",
+};
+
+/*
+ * Per-cpu counts
+ */
+DEFINE_PER_CPU(unsigned long, lockevents[lockevent_num]);
+
+/*
+ * The lockevent_read() function can be overridden.
+ */
+ssize_t __weak lockevent_read(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ char buf[64];
+ int cpu, id, len;
+ u64 sum = 0;
+
+ /*
+ * Get the counter ID stored in file->f_inode->i_private
+ */
+ id = (long)file_inode(file)->i_private;
+
+ if (id >= lockevent_num)
+ return -EBADF;
+
+ for_each_possible_cpu(cpu)
+ sum += per_cpu(lockevents[id], cpu);
+ len = snprintf(buf, sizeof(buf) - 1, "%llu\n", sum);
+
+ return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+/*
+ * Function to handle write request
+ *
+ * When idx = reset_cnts, reset all the counts.
+ */
+static ssize_t lockevent_write(struct file *file, const char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ int cpu;
+
+ /*
+ * Get the counter ID stored in file->f_inode->i_private
+ */
+ if ((long)file_inode(file)->i_private != LOCKEVENT_reset_cnts)
+ return count;
+
+ for_each_possible_cpu(cpu) {
+ int i;
+ unsigned long *ptr = per_cpu_ptr(lockevents, cpu);
+
+ for (i = 0 ; i < lockevent_num; i++)
+ WRITE_ONCE(ptr[i], 0);
+ }
+ return count;
+}
+
+/*
+ * Debugfs data structures
+ */
+static const struct file_operations fops_lockevent = {
+ .read = lockevent_read,
+ .write = lockevent_write,
+ .llseek = default_llseek,
+};
+
+/*
+ * Initialize debugfs for the locking event counts.
+ */
+static int __init init_lockevent_counts(void)
+{
+ struct dentry *d_counts = debugfs_create_dir(LOCK_EVENTS_DIR, NULL);
+ int i;
+
+ if (!d_counts)
+ goto out;
+
+ /*
+ * Create the debugfs files
+ *
+ * As reading from and writing to the stat files can be slow, only
+ * root is allowed to do the read/write to limit impact to system
+ * performance.
+ */
+ for (i = 0; i < lockevent_num; i++)
+ if (!debugfs_create_file(lockevent_names[i], 0400, d_counts,
+ (void *)(long)i, &fops_lockevent))
+ goto fail_undo;
+
+ if (!debugfs_create_file(lockevent_names[LOCKEVENT_reset_cnts], 0200,
+ d_counts, (void *)(long)LOCKEVENT_reset_cnts,
+ &fops_lockevent))
+ goto fail_undo;
+
+ return 0;
+fail_undo:
+ debugfs_remove_recursive(d_counts);
+out:
+ pr_warn("Could not create '%s' debugfs entries\n", LOCK_EVENTS_DIR);
+ return -ENOMEM;
+}
+fs_initcall(init_lockevent_counts);
diff --git a/kernel/locking/lock_events.h b/kernel/locking/lock_events.h
index 4009e07..d965470 100644
--- a/kernel/locking/lock_events.h
+++ b/kernel/locking/lock_events.h
@@ -21,7 +21,7 @@ enum lock_events {
LOCKEVENT_reset_cnts = lockevent_num,
};

-#ifdef CONFIG_QUEUED_LOCK_STAT
+#ifdef CONFIG_LOCK_EVENT_COUNTS
/*
* Per-cpu counters
*/
@@ -46,10 +46,10 @@ static inline void __lockevent_add(enum lock_events event, int inc)

#define lockevent_add(ev, c) __lockevent_add(LOCKEVENT_ ##ev, c)

-#else /* CONFIG_QUEUED_LOCK_STAT */
+#else /* CONFIG_LOCK_EVENT_COUNTS */

#define lockevent_inc(ev)
#define lockevent_add(ev, c)
#define lockevent_cond_inc(ev, c)

-#endif /* CONFIG_QUEUED_LOCK_STAT */
+#endif /* CONFIG_LOCK_EVENT_COUNTS */
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 1db5b37..5415267 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -9,76 +9,29 @@
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
- * Authors: Waiman Long <[email protected]>
+ * Authors: Waiman Long <[email protected]>
*/

-/*
- * When queued spinlock statistical counters are enabled, the following
- * debugfs files will be created for reporting the counter values:
- *
- * <debugfs>/qlockstat/
- * pv_hash_hops - average # of hops per hashing operation
- * pv_kick_unlock - # of vCPU kicks issued at unlock time
- * pv_kick_wake - # of vCPU kicks used for computing pv_latency_wake
- * pv_latency_kick - average latency (ns) of vCPU kick operation
- * pv_latency_wake - average latency (ns) from vCPU kick to wakeup
- * pv_lock_stealing - # of lock stealing operations
- * pv_spurious_wakeup - # of spurious wakeups in non-head vCPUs
- * pv_wait_again - # of wait's after a queue head vCPU kick
- * pv_wait_early - # of early vCPU wait's
- * pv_wait_head - # of vCPU wait's at the queue head
- * pv_wait_node - # of vCPU wait's at a non-head queue node
- * lock_pending - # of locking operations via pending code
- * lock_slowpath - # of locking operations via MCS lock queue
- * lock_use_node2 - # of locking operations that use 2nd per-CPU node
- * lock_use_node3 - # of locking operations that use 3rd per-CPU node
- * lock_use_node4 - # of locking operations that use 4th per-CPU node
- * lock_no_node - # of locking operations without using per-CPU node
- *
- * Subtracting lock_use_node[234] from lock_slowpath will give you
- * lock_use_node1.
- *
- * Writing to the special ".reset_counts" file will reset all the above
- * counter values.
- *
- * These statistical counters are implemented as per-cpu variables which are
- * summed and computed whenever the corresponding debugfs files are read. This
- * minimizes added overhead making the counters usable even in a production
- * environment.
- *
- * There may be slight difference between pv_kick_wake and pv_kick_unlock.
- */
#include "lock_events.h"

-#ifdef CONFIG_QUEUED_LOCK_STAT
+#ifdef CONFIG_LOCK_EVENT_COUNTS
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
/*
- * Collect pvqspinlock statistics
+ * Collect pvqspinlock locking event counts
*/
-#include <linux/debugfs.h>
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/fs.h>

#define EVENT_COUNT(ev) lockevents[LOCKEVENT_ ## ev]

-#undef LOCK_EVENT
-#define LOCK_EVENT(name) [LOCKEVENT_ ## name] = #name,
-
-static const char * const lockevent_names[lockevent_num + 1] = {
-
-#include "lock_events_list.h"
-
- [LOCKEVENT_reset_cnts] = ".reset_counts",
-};
-
/*
- * Per-cpu counters
+ * PV specific per-cpu counter
*/
-DEFINE_PER_CPU(unsigned long, lockevents[lockevent_num]);
static DEFINE_PER_CPU(u64, pv_kick_time);

/*
- * Function to read and return the qlock statistical counter values
+ * Function to read and return the PV qspinlock counts.
*
* The following counters are handled specially:
* 1. pv_latency_kick
@@ -88,8 +41,8 @@
* 3. pv_hash_hops
* Average hops/hash = pv_hash_hops/pv_kick_unlock
*/
-static ssize_t lockevent_read(struct file *file, char __user *user_buf,
- size_t count, loff_t *ppos)
+ssize_t lockevent_read(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
{
char buf[64];
int cpu, id, len;
@@ -150,78 +103,6 @@ static ssize_t lockevent_read(struct file *file, char __user *user_buf,
}

/*
- * Function to handle write request
- *
- * When id = .reset_cnts, reset all the counter values.
- */
-static ssize_t lockevent_write(struct file *file, const char __user *user_buf,
- size_t count, loff_t *ppos)
-{
- int cpu;
-
- /*
- * Get the counter ID stored in file->f_inode->i_private
- */
- if ((long)file_inode(file)->i_private != LOCKEVENT_reset_cnts)
- return count;
-
- for_each_possible_cpu(cpu) {
- int i;
- unsigned long *ptr = per_cpu_ptr(lockevents, cpu);
-
- for (i = 0 ; i < lockevent_num; i++)
- WRITE_ONCE(ptr[i], 0);
- }
- return count;
-}
-
-/*
- * Debugfs data structures
- */
-static const struct file_operations fops_lockevent = {
- .read = lockevent_read,
- .write = lockevent_write,
- .llseek = default_llseek,
-};
-
-/*
- * Initialize debugfs for the qspinlock statistical counters
- */
-static int __init init_qspinlock_stat(void)
-{
- struct dentry *d_counts = debugfs_create_dir("qlockstat", NULL);
- int i;
-
- if (!d_counts)
- goto out;
-
- /*
- * Create the debugfs files
- *
- * As reading from and writing to the stat files can be slow, only
- * root is allowed to do the read/write to limit impact to system
- * performance.
- */
- for (i = 0; i < lockevent_num; i++)
- if (!debugfs_create_file(lockevent_names[i], 0400, d_counts,
- (void *)(long)i, &fops_lockevent))
- goto fail_undo;
-
- if (!debugfs_create_file(lockevent_names[LOCKEVENT_reset_cnts], 0200,
- d_counts, (void *)(long)LOCKEVENT_reset_cnts,
- &fops_lockevent))
- goto fail_undo;
-
- return 0;
-fail_undo:
- debugfs_remove_recursive(d_counts);
-out:
- pr_warn("Could not create 'qlockstat' debugfs entries\n");
- return -ENOMEM;
-}
-fs_initcall(init_qspinlock_stat);
-
-/*
* PV hash hop count
*/
static inline void lockevent_pv_hop(int hopcnt)
@@ -260,8 +141,10 @@ static inline void __pv_wait(u8 *ptr, u8 val)
#define pv_kick(c) __pv_kick(c)
#define pv_wait(p, v) __pv_wait(p, v)

-#else /* CONFIG_QUEUED_LOCK_STAT */
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_LOCK_EVENT_COUNTS */

static inline void lockevent_pv_hop(int hopcnt) { }

-#endif /* CONFIG_QUEUED_LOCK_STAT */
+#endif /* CONFIG_LOCK_EVENT_COUNTS */
--
1.8.3.1


2019-02-07 19:09:45

by Waiman Long

Subject: [PATCH-tip 05/22] locking/rwsem: Move owner setting code from rwsem.c to rwsem.h

The setting of the owner field is specific to the rwsem-xadd code; it is
not needed by rwsem-spinlock. This patch moves all the owner setting code
closer to the rwsem-xadd fast paths, directly within the rwsem.h file.

For __down_read() and __down_read_killable(), it is assumed that the
rwsem will be marked as reader-owned when the functions return. That
is currently true except in the transient case where the waiter queue
has just become empty, so a rwsem_set_reader_owned() call is added for
that case. The __rwsem_set_reader_owned() call in __rwsem_mark_wake()
is now necessary.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 6 +++---
kernel/locking/rwsem.c | 19 ++-----------------
kernel/locking/rwsem.h | 17 +++++++++++++++--
3 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 0d518b5..c213869 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -176,9 +176,8 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
goto try_reader_grant;
}
/*
- * It is not really necessary to set it to reader-owned here,
- * but it gives the spinners an early indication that the
- * readers now have the lock.
+ * Set it to reader-owned to give spinners an early
+ * indication that readers now have the lock.
*/
__rwsem_set_reader_owned(sem, waiter->task);
}
@@ -441,6 +440,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
*/
if (atomic_long_read(&sem->count) >= 0) {
raw_spin_unlock_irq(&sem->wait_lock);
+ rwsem_set_reader_owned(sem);
return sem;
}
adjustment += RWSEM_WAITING_BIAS;
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index e586f0d..59e5848 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -24,7 +24,6 @@ void __sched down_read(struct rw_semaphore *sem)
rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);

LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
- rwsem_set_reader_owned(sem);
}

EXPORT_SYMBOL(down_read);
@@ -39,7 +38,6 @@ int __sched down_read_killable(struct rw_semaphore *sem)
return -EINTR;
}

- rwsem_set_reader_owned(sem);
return 0;
}

@@ -52,10 +50,8 @@ int down_read_trylock(struct rw_semaphore *sem)
{
int ret = __down_read_trylock(sem);

- if (ret == 1) {
+ if (ret == 1)
rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
- rwsem_set_reader_owned(sem);
- }
return ret;
}

@@ -70,7 +66,6 @@ void __sched down_write(struct rw_semaphore *sem)
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);

LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
- rwsem_set_owner(sem);
}

EXPORT_SYMBOL(down_write);
@@ -88,7 +83,6 @@ int __sched down_write_killable(struct rw_semaphore *sem)
return -EINTR;
}

- rwsem_set_owner(sem);
return 0;
}

@@ -101,10 +95,8 @@ int down_write_trylock(struct rw_semaphore *sem)
{
int ret = __down_write_trylock(sem);

- if (ret == 1) {
+ if (ret == 1)
rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
- rwsem_set_owner(sem);
- }

return ret;
}
@@ -119,7 +111,6 @@ void up_read(struct rw_semaphore *sem)
rwsem_release(&sem->dep_map, 1, _RET_IP_);
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED));

- rwsem_clear_reader_owned(sem);
__up_read(sem);
}

@@ -133,7 +124,6 @@ void up_write(struct rw_semaphore *sem)
rwsem_release(&sem->dep_map, 1, _RET_IP_);
DEBUG_RWSEMS_WARN_ON(sem->owner != current);

- rwsem_clear_owner(sem);
__up_write(sem);
}

@@ -147,7 +137,6 @@ void downgrade_write(struct rw_semaphore *sem)
lock_downgrade(&sem->dep_map, _RET_IP_);
DEBUG_RWSEMS_WARN_ON(sem->owner != current);

- rwsem_set_reader_owned(sem);
__downgrade_write(sem);
}

@@ -161,7 +150,6 @@ void down_read_nested(struct rw_semaphore *sem, int subclass)
rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);

LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
- rwsem_set_reader_owned(sem);
}

EXPORT_SYMBOL(down_read_nested);
@@ -172,7 +160,6 @@ void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);

LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
- rwsem_set_owner(sem);
}

EXPORT_SYMBOL(_down_write_nest_lock);
@@ -193,7 +180,6 @@ void down_write_nested(struct rw_semaphore *sem, int subclass)
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);

LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
- rwsem_set_owner(sem);
}

EXPORT_SYMBOL(down_write_nested);
@@ -208,7 +194,6 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
return -EINTR;
}

- rwsem_set_owner(sem);
return 0;
}

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 067e265..f568391 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -161,6 +161,8 @@ static inline void __down_read(struct rw_semaphore *sem)
{
if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0))
rwsem_down_read_failed(sem);
+ else
+ rwsem_set_reader_owned(sem);
}

static inline int __down_read_killable(struct rw_semaphore *sem)
@@ -168,8 +170,9 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
if (IS_ERR(rwsem_down_read_failed_killable(sem)))
return -EINTR;
+ } else {
+ rwsem_set_reader_owned(sem);
}
-
return 0;
}

@@ -180,6 +183,7 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
while ((tmp = atomic_long_read(&sem->count)) >= 0) {
if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
tmp + RWSEM_ACTIVE_READ_BIAS)) {
+ rwsem_set_reader_owned(sem);
return 1;
}
}
@@ -197,6 +201,7 @@ static inline void __down_write(struct rw_semaphore *sem)
&sem->count);
if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
rwsem_down_write_failed(sem);
+ rwsem_set_owner(sem);
}

static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -208,6 +213,7 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
if (IS_ERR(rwsem_down_write_failed_killable(sem)))
return -EINTR;
+ rwsem_set_owner(sem);
return 0;
}

@@ -217,7 +223,11 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)

tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
RWSEM_ACTIVE_WRITE_BIAS);
- return tmp == RWSEM_UNLOCKED_VALUE;
+ if (tmp == RWSEM_UNLOCKED_VALUE) {
+ rwsem_set_owner(sem);
+ return true;
+ }
+ return false;
}

/*
@@ -227,6 +237,7 @@ static inline void __up_read(struct rw_semaphore *sem)
{
long tmp;

+ rwsem_clear_reader_owned(sem);
tmp = atomic_long_dec_return_release(&sem->count);
if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
rwsem_wake(sem);
@@ -237,6 +248,7 @@ static inline void __up_read(struct rw_semaphore *sem)
*/
static inline void __up_write(struct rw_semaphore *sem)
{
+ rwsem_clear_owner(sem);
if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
&sem->count) < 0))
rwsem_wake(sem);
@@ -257,6 +269,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* write side. As such, rely on RELEASE semantics.
*/
tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+ rwsem_set_reader_owned(sem);
if (tmp < 0)
rwsem_downgrade_wake(sem);
}
--
1.8.3.1


2019-02-07 19:09:45

by Waiman Long

Subject: [PATCH-tip 04/22] locking/rwsem: Remove arch specific rwsem files

As the generic rwsem-xadd code uses the appropriate acquire and release
versions of the atomic operations, the arch specific rwsem.h files will
not be much faster than the generic code. So we can remove those arch
specific rwsem.h files and stop building asm/rwsem.h to reduce the
maintenance effort.

The generic asm-generic/rwsem.h can also be merged into
kernel/locking/rwsem.h, as no code other than that under kernel/locking
needs to access the internal rwsem macros and functions.

Signed-off-by: Waiman Long <[email protected]>
---
MAINTAINERS | 1 -
arch/alpha/include/asm/rwsem.h | 211 -----------------------------------
arch/arm/include/asm/Kbuild | 1 -
arch/arm64/include/asm/Kbuild | 1 -
arch/hexagon/include/asm/Kbuild | 1 -
arch/ia64/include/asm/rwsem.h | 172 -----------------------------
arch/powerpc/include/asm/Kbuild | 1 -
arch/s390/include/asm/Kbuild | 1 -
arch/sh/include/asm/Kbuild | 1 -
arch/sparc/include/asm/Kbuild | 1 -
arch/x86/include/asm/rwsem.h | 237 ----------------------------------------
arch/x86/lib/Makefile | 1 -
arch/x86/lib/rwsem.S | 156 --------------------------
arch/xtensa/include/asm/Kbuild | 1 -
include/asm-generic/rwsem.h | 140 ------------------------
include/linux/rwsem.h | 4 +-
kernel/locking/percpu-rwsem.c | 2 +
kernel/locking/rwsem.h | 130 ++++++++++++++++++++++
18 files changed, 133 insertions(+), 929 deletions(-)
delete mode 100644 arch/alpha/include/asm/rwsem.h
delete mode 100644 arch/ia64/include/asm/rwsem.h
delete mode 100644 arch/x86/include/asm/rwsem.h
delete mode 100644 arch/x86/lib/rwsem.S
delete mode 100644 include/asm-generic/rwsem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 08e989a..0b921f2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8922,7 +8922,6 @@ F: arch/*/include/asm/spinlock*.h
F: include/linux/rwlock*.h
F: include/linux/mutex*.h
F: include/linux/rwsem*.h
-F: arch/*/include/asm/rwsem.h
F: include/linux/seqlock.h
F: lib/locking*.[ch]
F: kernel/locking/
diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
deleted file mode 100644
index cf8fc8f9..0000000
--- a/arch/alpha/include/asm/rwsem.h
+++ /dev/null
@@ -1,211 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ALPHA_RWSEM_H
-#define _ALPHA_RWSEM_H
-
-/*
- * Written by Ivan Kokshaysky <[email protected]>, 2001.
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/compiler.h>
-
-#define RWSEM_UNLOCKED_VALUE 0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS 0x0000000000000001L
-#define RWSEM_ACTIVE_MASK 0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS (-0x0000000100000000L)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-static inline int ___down_read(struct rw_semaphore *sem)
-{
- long oldcount;
-#ifndef CONFIG_SMP
- oldcount = sem->count.counter;
- sem->count.counter += RWSEM_ACTIVE_READ_BIAS;
-#else
- long temp;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- " mb\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
- :"Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count) : "memory");
-#endif
- return (oldcount < 0);
-}
-
-static inline void __down_read(struct rw_semaphore *sem)
-{
- if (unlikely(___down_read(sem)))
- rwsem_down_read_failed(sem);
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
- if (unlikely(___down_read(sem)))
- if (IS_ERR(rwsem_down_read_failed_killable(sem)))
- return -EINTR;
-
- return 0;
-}
-
-/*
- * trylock for reading -- returns 1 if successful, 0 if contention
- */
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- long old, new, res;
-
- res = atomic_long_read(&sem->count);
- do {
- new = res + RWSEM_ACTIVE_READ_BIAS;
- if (new <= 0)
- break;
- old = res;
- res = atomic_long_cmpxchg(&sem->count, old, new);
- } while (res != old);
- return res >= 0 ? 1 : 0;
-}
-
-static inline long ___down_write(struct rw_semaphore *sem)
-{
- long oldcount;
-#ifndef CONFIG_SMP
- oldcount = sem->count.counter;
- sem->count.counter += RWSEM_ACTIVE_WRITE_BIAS;
-#else
- long temp;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- " mb\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
- :"Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count) : "memory");
-#endif
- return oldcount;
-}
-
-static inline void __down_write(struct rw_semaphore *sem)
-{
- if (unlikely(___down_write(sem)))
- rwsem_down_write_failed(sem);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
- if (unlikely(___down_write(sem))) {
- if (IS_ERR(rwsem_down_write_failed_killable(sem)))
- return -EINTR;
- }
-
- return 0;
-}
-
-/*
- * trylock for writing -- returns 1 if successful, 0 if contention
- */
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- long ret = atomic_long_cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- if (ret == RWSEM_UNLOCKED_VALUE)
- return 1;
- return 0;
-}
-
-static inline void __up_read(struct rw_semaphore *sem)
-{
- long oldcount;
-#ifndef CONFIG_SMP
- oldcount = sem->count.counter;
- sem->count.counter -= RWSEM_ACTIVE_READ_BIAS;
-#else
- long temp;
- __asm__ __volatile__(
- " mb\n"
- "1: ldq_l %0,%1\n"
- " subq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
- :"Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count) : "memory");
-#endif
- if (unlikely(oldcount < 0))
- if ((int)oldcount - RWSEM_ACTIVE_READ_BIAS == 0)
- rwsem_wake(sem);
-}
-
-static inline void __up_write(struct rw_semaphore *sem)
-{
- long count;
-#ifndef CONFIG_SMP
- sem->count.counter -= RWSEM_ACTIVE_WRITE_BIAS;
- count = sem->count.counter;
-#else
- long temp;
- __asm__ __volatile__(
- " mb\n"
- "1: ldq_l %0,%1\n"
- " subq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- " subq %0,%3,%0\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (count), "=m" (sem->count), "=&r" (temp)
- :"Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count) : "memory");
-#endif
- if (unlikely(count))
- if ((int)count == 0)
- rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- long oldcount;
-#ifndef CONFIG_SMP
- oldcount = sem->count.counter;
- sem->count.counter -= RWSEM_WAITING_BIAS;
-#else
- long temp;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- " mb\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
- :"Ir" (-RWSEM_WAITING_BIAS), "m" (sem->count) : "memory");
-#endif
- if (unlikely(oldcount < 0))
- rwsem_downgrade_wake(sem);
-}
-
-#endif /* __KERNEL__ */
-#endif /* _ALPHA_RWSEM_H */
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index 1d66db9..989e1a7 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -12,7 +12,6 @@ generic-y += mm-arch-hooks.h
generic-y += msi.h
generic-y += parport.h
generic-y += preempt.h
-generic-y += rwsem.h
generic-y += seccomp.h
generic-y += segment.h
generic-y += serial.h
diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 1e17ea5..60a933b 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -16,7 +16,6 @@ generic-y += mm-arch-hooks.h
generic-y += msi.h
generic-y += qrwlock.h
generic-y += qspinlock.h
-generic-y += rwsem.h
generic-y += segment.h
generic-y += serial.h
generic-y += set_memory.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index b25fd42..0a12feb 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -26,7 +26,6 @@ generic-y += mm-arch-hooks.h
generic-y += pci.h
generic-y += percpu.h
generic-y += preempt.h
-generic-y += rwsem.h
generic-y += sections.h
generic-y += segment.h
generic-y += serial.h
diff --git a/arch/ia64/include/asm/rwsem.h b/arch/ia64/include/asm/rwsem.h
deleted file mode 100644
index 9179106..0000000
--- a/arch/ia64/include/asm/rwsem.h
+++ /dev/null
@@ -1,172 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * R/W semaphores for ia64
- *
- * Copyright (C) 2003 Ken Chen <[email protected]>
- * Copyright (C) 2003 Asit Mallick <[email protected]>
- * Copyright (C) 2005 Christoph Lameter <[email protected]>
- *
- * Based on asm-i386/rwsem.h and other architecture implementation.
- *
- * The MSW of the count is the negated number of active writers and
- * waiting lockers, and the LSW is the total number of active locks.
- *
- * The lock count is initialized to 0 (no active and no waiting lockers).
- *
- * When a writer subtracts WRITE_BIAS, it'll get 0xffffffff00000001 for
- * the case of an uncontended lock. Readers increment by 1 and see a positive
- * value when uncontended, negative if there are writers (and maybe) readers
- * waiting (in which case it goes to sleep).
- */
-
-#ifndef _ASM_IA64_RWSEM_H
-#define _ASM_IA64_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error "Please don't include <asm/rwsem.h> directly, use <linux/rwsem.h> instead."
-#endif
-
-#include <asm/intrinsics.h>
-
-#define RWSEM_UNLOCKED_VALUE __IA64_UL_CONST(0x0000000000000000)
-#define RWSEM_ACTIVE_BIAS (1L)
-#define RWSEM_ACTIVE_MASK (0xffffffffL)
-#define RWSEM_WAITING_BIAS (-0x100000000L)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-/*
- * lock for reading
- */
-static inline int
-___down_read (struct rw_semaphore *sem)
-{
- long result = ia64_fetchadd8_acq((unsigned long *)&sem->count.counter, 1);
-
- return (result < 0);
-}
-
-static inline void
-__down_read (struct rw_semaphore *sem)
-{
- if (___down_read(sem))
- rwsem_down_read_failed(sem);
-}
-
-static inline int
-__down_read_killable (struct rw_semaphore *sem)
-{
- if (___down_read(sem))
- if (IS_ERR(rwsem_down_read_failed_killable(sem)))
- return -EINTR;
-
- return 0;
-}
-
-/*
- * lock for writing
- */
-static inline long
-___down_write (struct rw_semaphore *sem)
-{
- long old, new;
-
- do {
- old = atomic_long_read(&sem->count);
- new = old + RWSEM_ACTIVE_WRITE_BIAS;
- } while (atomic_long_cmpxchg_acquire(&sem->count, old, new) != old);
-
- return old;
-}
-
-static inline void
-__down_write (struct rw_semaphore *sem)
-{
- if (___down_write(sem))
- rwsem_down_write_failed(sem);
-}
-
-static inline int
-__down_write_killable (struct rw_semaphore *sem)
-{
- if (___down_write(sem)) {
- if (IS_ERR(rwsem_down_write_failed_killable(sem)))
- return -EINTR;
- }
-
- return 0;
-}
-
-/*
- * unlock after reading
- */
-static inline void
-__up_read (struct rw_semaphore *sem)
-{
- long result = ia64_fetchadd8_rel((unsigned long *)&sem->count.counter, -1);
-
- if (result < 0 && (--result & RWSEM_ACTIVE_MASK) == 0)
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void
-__up_write (struct rw_semaphore *sem)
-{
- long old, new;
-
- do {
- old = atomic_long_read(&sem->count);
- new = old - RWSEM_ACTIVE_WRITE_BIAS;
- } while (atomic_long_cmpxchg_release(&sem->count, old, new) != old);
-
- if (new < 0 && (new & RWSEM_ACTIVE_MASK) == 0)
- rwsem_wake(sem);
-}
-
-/*
- * trylock for reading -- returns 1 if successful, 0 if contention
- */
-static inline int
-__down_read_trylock (struct rw_semaphore *sem)
-{
- long tmp;
- while ((tmp = atomic_long_read(&sem->count)) >= 0) {
- if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp, tmp+1)) {
- return 1;
- }
- }
- return 0;
-}
-
-/*
- * trylock for writing -- returns 1 if successful, 0 if contention
- */
-static inline int
-__down_write_trylock (struct rw_semaphore *sem)
-{
- long tmp = atomic_long_cmpxchg_acquire(&sem->count,
- RWSEM_UNLOCKED_VALUE, RWSEM_ACTIVE_WRITE_BIAS);
- return tmp == RWSEM_UNLOCKED_VALUE;
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void
-__downgrade_write (struct rw_semaphore *sem)
-{
- long old, new;
-
- do {
- old = atomic_long_read(&sem->count);
- new = old - RWSEM_WAITING_BIAS;
- } while (atomic_long_cmpxchg_release(&sem->count, old, new) != old);
-
- if (old < 0)
- rwsem_downgrade_wake(sem);
-}
-
-#endif /* _ASM_IA64_RWSEM_H */
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 77ff7fb..75bd77a 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -9,6 +9,5 @@ generic-y += irq_work.h
generic-y += local64.h
generic-y += mcs_spinlock.h
generic-y += preempt.h
-generic-y += rwsem.h
generic-y += vtime.h
generic-y += msi.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index e323977..4b77b42 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -21,7 +21,6 @@ generic-y += local64.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += preempt.h
-generic-y += rwsem.h
generic-y += trace_clock.h
generic-y += unaligned.h
generic-y += word-at-a-time.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index a6ef3fe..b89fce4 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -16,7 +16,6 @@ generic-y += mm-arch-hooks.h
generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
-generic-y += rwsem.h
generic-y += serial.h
generic-y += sizes.h
generic-y += trace_clock.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index b82f64e..e843fc0 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -17,7 +17,6 @@ generic-y += mm-arch-hooks.h
generic-y += module.h
generic-y += msi.h
generic-y += preempt.h
-generic-y += rwsem.h
generic-y += serial.h
generic-y += trace_clock.h
generic-y += word-at-a-time.h
diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
deleted file mode 100644
index 4c25cf6..0000000
--- a/arch/x86/include/asm/rwsem.h
+++ /dev/null
@@ -1,237 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* rwsem.h: R/W semaphores implemented using XADD/CMPXCHG for i486+
- *
- * Written by David Howells ([email protected]).
- *
- * Derived from asm-x86/semaphore.h
- *
- *
- * The MSW of the count is the negated number of active writers and waiting
- * lockers, and the LSW is the total number of active locks
- *
- * The lock count is initialized to 0 (no active and no waiting lockers).
- *
- * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
- * uncontended lock. This can be determined because XADD returns the old value.
- * Readers increment by 1 and see a positive value when uncontended, negative
- * if there are writers (and maybe) readers waiting (in which case it goes to
- * sleep).
- *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
- *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
- *
- * This should be totally fair - if anything is waiting, a process that wants a
- * lock will go to the back of the queue. When the currently active lock is
- * released, if there's a writer at the front of the queue, then that and only
- * that will be woken up; if there's a bunch of consecutive readers at the
- * front, then they'll all be woken up, but no other readers will be.
- */
-
-#ifndef _ASM_X86_RWSEM_H
-#define _ASM_X86_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-#include <asm/asm.h>
-
-/*
- * The bias values and the counter type limits the number of
- * potential readers/writers to 32767 for 32 bits and 2147483647
- * for 64 bits.
- */
-
-#ifdef CONFIG_X86_64
-# define RWSEM_ACTIVE_MASK 0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK 0x0000ffffL
-#endif
-
-#define RWSEM_UNLOCKED_VALUE 0x00000000L
-#define RWSEM_ACTIVE_BIAS 0x00000001L
-#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-/*
- * lock for reading
- */
-#define ____down_read(sem, slow_path) \
-({ \
- struct rw_semaphore* ret; \
- asm volatile("# beginning down_read\n\t" \
- LOCK_PREFIX _ASM_INC "(%[sem])\n\t" \
- /* adds 0x00000001 */ \
- " jns 1f\n" \
- " call " slow_path "\n" \
- "1:\n\t" \
- "# ending down_read\n\t" \
- : "+m" (sem->count), "=a" (ret), \
- ASM_CALL_CONSTRAINT \
- : [sem] "a" (sem) \
- : "memory", "cc"); \
- ret; \
-})
-
-static inline void __down_read(struct rw_semaphore *sem)
-{
- ____down_read(sem, "call_rwsem_down_read_failed");
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
- if (IS_ERR(____down_read(sem, "call_rwsem_down_read_failed_killable")))
- return -EINTR;
- return 0;
-}
-
-/*
- * trylock for reading -- returns 1 if successful, 0 if contention
- */
-static inline bool __down_read_trylock(struct rw_semaphore *sem)
-{
- long result, tmp;
- asm volatile("# beginning __down_read_trylock\n\t"
- " mov %[count],%[result]\n\t"
- "1:\n\t"
- " mov %[result],%[tmp]\n\t"
- " add %[inc],%[tmp]\n\t"
- " jle 2f\n\t"
- LOCK_PREFIX " cmpxchg %[tmp],%[count]\n\t"
- " jnz 1b\n\t"
- "2:\n\t"
- "# ending __down_read_trylock\n\t"
- : [count] "+m" (sem->count), [result] "=&a" (result),
- [tmp] "=&r" (tmp)
- : [inc] "i" (RWSEM_ACTIVE_READ_BIAS)
- : "memory", "cc");
- return result >= 0;
-}
-
-/*
- * lock for writing
- */
-#define ____down_write(sem, slow_path) \
-({ \
- long tmp; \
- struct rw_semaphore* ret; \
- \
- asm volatile("# beginning down_write\n\t" \
- LOCK_PREFIX " xadd %[tmp],(%[sem])\n\t" \
- /* adds 0xffff0001, returns the old value */ \
- " test " __ASM_SEL(%w1,%k1) "," __ASM_SEL(%w1,%k1) "\n\t" \
- /* was the active mask 0 before? */\
- " jz 1f\n" \
- " call " slow_path "\n" \
- "1:\n" \
- "# ending down_write" \
- : "+m" (sem->count), [tmp] "=d" (tmp), \
- "=a" (ret), ASM_CALL_CONSTRAINT \
- : [sem] "a" (sem), "[tmp]" (RWSEM_ACTIVE_WRITE_BIAS) \
- : "memory", "cc"); \
- ret; \
-})
-
-static inline void __down_write(struct rw_semaphore *sem)
-{
- ____down_write(sem, "call_rwsem_down_write_failed");
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
- if (IS_ERR(____down_write(sem, "call_rwsem_down_write_failed_killable")))
- return -EINTR;
-
- return 0;
-}
-
-/*
- * trylock for writing -- returns 1 if successful, 0 if contention
- */
-static inline bool __down_write_trylock(struct rw_semaphore *sem)
-{
- bool result;
- long tmp0, tmp1;
- asm volatile("# beginning __down_write_trylock\n\t"
- " mov %[count],%[tmp0]\n\t"
- "1:\n\t"
- " test " __ASM_SEL(%w1,%k1) "," __ASM_SEL(%w1,%k1) "\n\t"
- /* was the active mask 0 before? */
- " jnz 2f\n\t"
- " mov %[tmp0],%[tmp1]\n\t"
- " add %[inc],%[tmp1]\n\t"
- LOCK_PREFIX " cmpxchg %[tmp1],%[count]\n\t"
- " jnz 1b\n\t"
- "2:\n\t"
- CC_SET(e)
- "# ending __down_write_trylock\n\t"
- : [count] "+m" (sem->count), [tmp0] "=&a" (tmp0),
- [tmp1] "=&r" (tmp1), CC_OUT(e) (result)
- : [inc] "er" (RWSEM_ACTIVE_WRITE_BIAS)
- : "memory");
- return result;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- long tmp;
- asm volatile("# beginning __up_read\n\t"
- LOCK_PREFIX " xadd %[tmp],(%[sem])\n\t"
- /* subtracts 1, returns the old value */
- " jns 1f\n\t"
- " call call_rwsem_wake\n" /* expects old value in %edx */
- "1:\n"
- "# ending __up_read\n"
- : "+m" (sem->count), [tmp] "=d" (tmp)
- : [sem] "a" (sem), "[tmp]" (-RWSEM_ACTIVE_READ_BIAS)
- : "memory", "cc");
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- long tmp;
- asm volatile("# beginning __up_write\n\t"
- LOCK_PREFIX " xadd %[tmp],(%[sem])\n\t"
- /* subtracts 0xffff0001, returns the old value */
- " jns 1f\n\t"
- " call call_rwsem_wake\n" /* expects old value in %edx */
- "1:\n\t"
- "# ending __up_write\n"
- : "+m" (sem->count), [tmp] "=d" (tmp)
- : [sem] "a" (sem), "[tmp]" (-RWSEM_ACTIVE_WRITE_BIAS)
- : "memory", "cc");
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- asm volatile("# beginning __downgrade_write\n\t"
- LOCK_PREFIX _ASM_ADD "%[inc],(%[sem])\n\t"
- /*
- * transitions 0xZZZZ0001 -> 0xYYYY0001 (i386)
- * 0xZZZZZZZZ00000001 -> 0xYYYYYYYY00000001 (x86_64)
- */
- " jns 1f\n\t"
- " call call_rwsem_downgrade_wake\n"
- "1:\n\t"
- "# ending __downgrade_write\n"
- : "+m" (sem->count)
- : [sem] "a" (sem), [inc] "er" (-RWSEM_WAITING_BIAS)
- : "memory", "cc");
-}
-
-#endif /* __KERNEL__ */
-#endif /* _ASM_X86_RWSEM_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 140e618..9866520 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,7 +23,6 @@ obj-$(CONFIG_SMP) += msr-smp.o cache-smp.o
lib-y := delay.o misc.o cmdline.o cpu.o
lib-y += usercopy_$(BITS).o usercopy.o getuser.o putuser.o
lib-y += memcpy_$(BITS).o
-lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
lib-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
diff --git a/arch/x86/lib/rwsem.S b/arch/x86/lib/rwsem.S
deleted file mode 100644
index dc2ab6e..0000000
--- a/arch/x86/lib/rwsem.S
+++ /dev/null
@@ -1,156 +0,0 @@
-/*
- * x86 semaphore implementation.
- *
- * (C) Copyright 1999 Linus Torvalds
- *
- * Portions Copyright 1999 Red Hat, Inc.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- *
- * rw semaphores implemented November 1999 by Benjamin LaHaise <[email protected]>
- */
-
-#include <linux/linkage.h>
-#include <asm/alternative-asm.h>
-#include <asm/frame.h>
-
-#define __ASM_HALF_REG(reg) __ASM_SEL(reg, e##reg)
-#define __ASM_HALF_SIZE(inst) __ASM_SEL(inst##w, inst##l)
-
-#ifdef CONFIG_X86_32
-
-/*
- * The semaphore operations have a special calling sequence that
- * allow us to do a simpler in-line version of them. These routines
- * need to convert that sequence back into the C sequence when
- * there is contention on the semaphore.
- *
- * %eax contains the semaphore pointer on entry. Save the C-clobbered
- * registers (%eax, %edx and %ecx) except %eax which is either a return
- * value or just gets clobbered. Same is true for %edx so make sure GCC
- * reloads it after the slow path, by making it hold a temporary, for
- * example see ____down_write().
- */
-
-#define save_common_regs \
- pushl %ecx
-
-#define restore_common_regs \
- popl %ecx
-
- /* Avoid uglifying the argument copying x86-64 needs to do. */
- .macro movq src, dst
- .endm
-
-#else
-
-/*
- * x86-64 rwsem wrappers
- *
- * This interfaces the inline asm code to the slow-path
- * C routines. We need to save the call-clobbered regs
- * that the asm does not mark as clobbered, and move the
- * argument from %rax to %rdi.
- *
- * NOTE! We don't need to save %rax, because the functions
- * will always return the semaphore pointer in %rax (which
- * is also the input argument to these helpers)
- *
- * The following can clobber %rdx because the asm clobbers it:
- * call_rwsem_down_write_failed
- * call_rwsem_wake
- * but %rdi, %rsi, %rcx, %r8-r11 always need saving.
- */
-
-#define save_common_regs \
- pushq %rdi; \
- pushq %rsi; \
- pushq %rcx; \
- pushq %r8; \
- pushq %r9; \
- pushq %r10; \
- pushq %r11
-
-#define restore_common_regs \
- popq %r11; \
- popq %r10; \
- popq %r9; \
- popq %r8; \
- popq %rcx; \
- popq %rsi; \
- popq %rdi
-
-#endif
-
-/* Fix up special calling conventions */
-ENTRY(call_rwsem_down_read_failed)
- FRAME_BEGIN
- save_common_regs
- __ASM_SIZE(push,) %__ASM_REG(dx)
- movq %rax,%rdi
- call rwsem_down_read_failed
- __ASM_SIZE(pop,) %__ASM_REG(dx)
- restore_common_regs
- FRAME_END
- ret
-ENDPROC(call_rwsem_down_read_failed)
-
-ENTRY(call_rwsem_down_read_failed_killable)
- FRAME_BEGIN
- save_common_regs
- __ASM_SIZE(push,) %__ASM_REG(dx)
- movq %rax,%rdi
- call rwsem_down_read_failed_killable
- __ASM_SIZE(pop,) %__ASM_REG(dx)
- restore_common_regs
- FRAME_END
- ret
-ENDPROC(call_rwsem_down_read_failed_killable)
-
-ENTRY(call_rwsem_down_write_failed)
- FRAME_BEGIN
- save_common_regs
- movq %rax,%rdi
- call rwsem_down_write_failed
- restore_common_regs
- FRAME_END
- ret
-ENDPROC(call_rwsem_down_write_failed)
-
-ENTRY(call_rwsem_down_write_failed_killable)
- FRAME_BEGIN
- save_common_regs
- movq %rax,%rdi
- call rwsem_down_write_failed_killable
- restore_common_regs
- FRAME_END
- ret
-ENDPROC(call_rwsem_down_write_failed_killable)
-
-ENTRY(call_rwsem_wake)
- FRAME_BEGIN
- /* do nothing if still outstanding active readers */
- __ASM_HALF_SIZE(dec) %__ASM_HALF_REG(dx)
- jnz 1f
- save_common_regs
- movq %rax,%rdi
- call rwsem_wake
- restore_common_regs
-1: FRAME_END
- ret
-ENDPROC(call_rwsem_wake)
-
-ENTRY(call_rwsem_downgrade_wake)
- FRAME_BEGIN
- save_common_regs
- __ASM_SIZE(push,) %__ASM_REG(dx)
- movq %rax,%rdi
- call rwsem_downgrade_wake
- __ASM_SIZE(pop,) %__ASM_REG(dx)
- restore_common_regs
- FRAME_END
- ret
-ENDPROC(call_rwsem_downgrade_wake)
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index e255683..7026b37 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -23,7 +23,6 @@ generic-y += mm-arch-hooks.h
generic-y += param.h
generic-y += percpu.h
generic-y += preempt.h
-generic-y += rwsem.h
generic-y += sections.h
generic-y += topology.h
generic-y += trace_clock.h
diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
deleted file mode 100644
index 93e67a0..0000000
--- a/include/asm-generic/rwsem.h
+++ /dev/null
@@ -1,140 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_GENERIC_RWSEM_H
-#define _ASM_GENERIC_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error "Please don't include <asm/rwsem.h> directly, use <linux/rwsem.h> instead."
-#endif
-
-#ifdef __KERNEL__
-
-/*
- * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <[email protected]>.
- */
-
-/*
- * the semaphore definition
- */
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK 0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK 0x0000ffffL
-#endif
-
-#define RWSEM_UNLOCKED_VALUE 0x00000000L
-#define RWSEM_ACTIVE_BIAS 0x00000001L
-#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0))
- rwsem_down_read_failed(sem);
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
- if (IS_ERR(rwsem_down_read_failed_killable(sem)))
- return -EINTR;
- }
-
- return 0;
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- long tmp;
-
- while ((tmp = atomic_long_read(&sem->count)) >= 0) {
- if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
- tmp + RWSEM_ACTIVE_READ_BIAS)) {
- return 1;
- }
- }
- return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
- rwsem_down_write_failed(sem);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
- if (IS_ERR(rwsem_down_write_failed_killable(sem)))
- return -EINTR;
- return 0;
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- long tmp;
-
- tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- return tmp == RWSEM_UNLOCKED_VALUE;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- long tmp;
-
- tmp = atomic_long_dec_return_release(&sem->count);
- if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count) < 0))
- rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- long tmp;
-
- /*
- * When downgrading from exclusive to shared ownership,
- * anything inside the write-locked region cannot leak
- * into the read side. In contrast, anything in the
- * read-locked region is ok to be re-ordered into the
- * write side. As such, rely on RELEASE semantics.
- */
- tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
- if (tmp < 0)
- rwsem_downgrade_wake(sem);
-}
-
-#endif /* __KERNEL__ */
-#endif /* _ASM_GENERIC_RWSEM_H */
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 67dbb57..6e56006 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -57,15 +57,13 @@ struct rw_semaphore {
extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);

-/* Include the arch specific part */
-#include <asm/rwsem.h>
-
/* In all implementations count != 0 means locked */
static inline int rwsem_is_locked(struct rw_semaphore *sem)
{
return atomic_long_read(&sem->count) != 0;
}

+#define RWSEM_UNLOCKED_VALUE 0L
#define __RWSEM_INIT_COUNT(name) .count = ATOMIC_LONG_INIT(RWSEM_UNLOCKED_VALUE)
#endif

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 883cf1b..f17dad9 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -7,6 +7,8 @@
#include <linux/sched.h>
#include <linux/errno.h>

+#include "rwsem.h"
+
int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
const char *name, struct lock_class_key *rwsem_key)
{
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index bad2bca..067e265 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -32,6 +32,26 @@
# define DEBUG_RWSEMS_WARN_ON(c)
#endif

+/*
+ * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
+ * Adapted largely from include/asm-i386/rwsem.h
+ * by Paul Mackerras <[email protected]>.
+ */
+
+/*
+ * the semaphore definition
+ */
+#ifdef CONFIG_64BIT
+# define RWSEM_ACTIVE_MASK 0xffffffffL
+#else
+# define RWSEM_ACTIVE_MASK 0x0000ffffL
+#endif
+
+#define RWSEM_ACTIVE_BIAS 0x00000001L
+#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -132,3 +152,113 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
{
}
#endif
+
+#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0))
+ rwsem_down_read_failed(sem);
+}
+
+static inline int __down_read_killable(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+ if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+ return -EINTR;
+ }
+
+ return 0;
+}
+
+static inline int __down_read_trylock(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ while ((tmp = atomic_long_read(&sem->count)) >= 0) {
+ if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
+ tmp + RWSEM_ACTIVE_READ_BIAS)) {
+ return 1;
+ }
+ }
+ return 0;
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
+ &sem->count);
+ if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ rwsem_down_write_failed(sem);
+}
+
+static inline int __down_write_killable(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
+ &sem->count);
+ if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+ return -EINTR;
+ return 0;
+}
+
+static inline int __down_write_trylock(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_ACTIVE_WRITE_BIAS);
+ return tmp == RWSEM_UNLOCKED_VALUE;
+}
+
+/*
+ * unlock after reading
+ */
+static inline void __up_read(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_dec_return_release(&sem->count);
+ if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+ rwsem_wake(sem);
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
+ &sem->count) < 0))
+ rwsem_wake(sem);
+}
+
+/*
+ * downgrade write lock to read lock
+ */
+static inline void __downgrade_write(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ /*
+ * When downgrading from exclusive to shared ownership,
+ * anything inside the write-locked region cannot leak
+ * into the read side. In contrast, anything in the
+ * read-locked region is ok to be re-ordered into the
+ * write side. As such, rely on RELEASE semantics.
+ */
+ tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+ if (tmp < 0)
+ rwsem_downgrade_wake(sem);
+}
+
+#endif /* CONFIG_RWSEM_XCHGADD_ALGORITHM */
--
1.8.3.1


2019-02-07 19:09:52

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 01/22] locking/qspinlock_stat: Introduce generic lockevent counting APIs

The percpu event counts used by the qspinlock code can be useful for
other locking code as well. So a new set of lockevent_* counting APIs
is introduced, with the lock event names extracted out into the new
lock_events_list.h header file to make future additions easier.

The existing qstat_inc() calls are replaced by either lockevent_inc() or
lockevent_cond_inc() calls.
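
As a rough illustration of the conversion (mirroring two of the call
sites changed in the diff below), an unconditional count and a
conditional count become:

    /* before: qspinlock-only statistics API */
    qstat_inc(qstat_lock_pending, true);            /* always count */
    qstat_inc(qstat_pv_wait_early, wait_early);     /* count only if wait_early */

    /* after: generic lock event API */
    lockevent_inc(lock_pending);                    /* always count */
    lockevent_cond_inc(pv_wait_early, wait_early);  /* count only if wait_early */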

The qstat_hop() call is renamed to lockevent_pv_hop(). The "reset_counters"
debugfs file is also renamed to ".reset_counts".
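
Adding a new event in the future should then only require one new line
in lock_events_list.h plus the counting call itself. A hypothetical
sketch (the rwsem_sleep event name below is made up for illustration):

    /* kernel/locking/lock_events_list.h */
    LOCK_EVENT(rwsem_sleep)    /* # of times a waiter went to sleep (hypothetical) */

    /* at the call site, with lock_events.h included */
    lockevent_inc(rwsem_sleep);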

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events.h | 55 ++++++++++++
kernel/locking/lock_events_list.h | 50 +++++++++++
kernel/locking/qspinlock.c | 8 +-
kernel/locking/qspinlock_paravirt.h | 19 +++--
kernel/locking/qspinlock_stat.h | 163 ++++++++++++++----------------------
5 files changed, 181 insertions(+), 114 deletions(-)
create mode 100644 kernel/locking/lock_events.h
create mode 100644 kernel/locking/lock_events_list.h

diff --git a/kernel/locking/lock_events.h b/kernel/locking/lock_events.h
new file mode 100644
index 0000000..4009e07
--- /dev/null
+++ b/kernel/locking/lock_events.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+
+enum lock_events {
+
+#include "lock_events_list.h"
+
+ lockevent_num, /* Total number of lock event counts */
+ LOCKEVENT_reset_cnts = lockevent_num,
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Per-cpu counters
+ */
+DECLARE_PER_CPU(unsigned long, lockevents[lockevent_num]);
+
+/*
+ * Increment the PV qspinlock statistical counters
+ */
+static inline void __lockevent_inc(enum lock_events event, bool cond)
+{
+ if (cond)
+ __this_cpu_inc(lockevents[event]);
+}
+
+#define lockevent_inc(ev) __lockevent_inc(LOCKEVENT_ ##ev, true)
+#define lockevent_cond_inc(ev, c) __lockevent_inc(LOCKEVENT_ ##ev, c)
+
+static inline void __lockevent_add(enum lock_events event, int inc)
+{
+ __this_cpu_add(lockevents[event], inc);
+}
+
+#define lockevent_add(ev, c) __lockevent_add(LOCKEVENT_ ##ev, c)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+#define lockevent_inc(ev)
+#define lockevent_add(ev, c)
+#define lockevent_cond_inc(ev, c)
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
new file mode 100644
index 0000000..8b4d2e1
--- /dev/null
+++ b/kernel/locking/lock_events_list.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+
+#ifndef LOCK_EVENT
+#define LOCK_EVENT(name) LOCKEVENT_ ## name,
+#endif
+
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/*
+ * Locking events for PV qspinlock.
+ */
+LOCK_EVENT(pv_hash_hops) /* Average # of hops per hashing operation */
+LOCK_EVENT(pv_kick_unlock) /* # of vCPU kicks issued at unlock time */
+LOCK_EVENT(pv_kick_wake) /* # of vCPU kicks for pv_latency_wake */
+LOCK_EVENT(pv_latency_kick) /* Average latency (ns) of vCPU kick */
+LOCK_EVENT(pv_latency_wake) /* Average latency (ns) of kick-to-wakeup */
+LOCK_EVENT(pv_lock_stealing) /* # of lock stealing operations */
+LOCK_EVENT(pv_spurious_wakeup) /* # of spurious wakeups in non-head vCPUs */
+LOCK_EVENT(pv_wait_again) /* # of wait's after queue head vCPU kick */
+LOCK_EVENT(pv_wait_early) /* # of early vCPU wait's */
+LOCK_EVENT(pv_wait_head) /* # of vCPU wait's at the queue head */
+LOCK_EVENT(pv_wait_node) /* # of vCPU wait's at non-head queue node */
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+/*
+ * Locking events for qspinlock
+ *
+ * Subtracting lock_use_node[234] from lock_slowpath will give you
+ * lock_use_node1.
+ */
+LOCK_EVENT(lock_pending) /* # of locking ops via pending code */
+LOCK_EVENT(lock_slowpath) /* # of locking ops via MCS lock queue */
+LOCK_EVENT(lock_use_node2) /* # of locking ops that use 2nd percpu node */
+LOCK_EVENT(lock_use_node3) /* # of locking ops that use 3rd percpu node */
+LOCK_EVENT(lock_use_node4) /* # of locking ops that use 4th percpu node */
+LOCK_EVENT(lock_no_node) /* # of locking ops w/o using percpu node */
+#endif /* CONFIG_QUEUED_SPINLOCKS */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 21ee51b..fe0dc3c 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -398,7 +398,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* 0,1,0 -> 0,0,1
*/
clear_pending_set_locked(lock);
- qstat_inc(qstat_lock_pending, true);
+ lockevent_inc(lock_pending);
return;

/*
@@ -406,7 +406,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* queuing.
*/
queue:
- qstat_inc(qstat_lock_slowpath, true);
+ lockevent_inc(lock_slowpath);
pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
@@ -422,7 +422,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* simple enough.
*/
if (unlikely(idx >= MAX_NODES)) {
- qstat_inc(qstat_lock_no_node, true);
+ lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
@@ -433,7 +433,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
/*
* Keep counts of non-zero index values:
*/
- qstat_inc(qstat_lock_use_node2 + idx - 1, idx);
+ lockevent_cond_inc(lock_use_node2 + idx - 1, idx);

/*
* Ensure that we increment the head node->count before initialising
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 8f36c27..89bab07 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -89,7 +89,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)

if (!(val & _Q_LOCKED_PENDING_MASK) &&
(cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
- qstat_inc(qstat_pv_lock_stealing, true);
+ lockevent_inc(pv_lock_stealing);
return true;
}
if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
@@ -219,7 +219,7 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
hopcnt++;
if (!cmpxchg(&he->lock, NULL, lock)) {
WRITE_ONCE(he->node, node);
- qstat_hop(hopcnt);
+ lockevent_pv_hop(hopcnt);
return &he->lock;
}
}
@@ -320,8 +320,8 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
smp_store_mb(pn->state, vcpu_halted);

if (!READ_ONCE(node->locked)) {
- qstat_inc(qstat_pv_wait_node, true);
- qstat_inc(qstat_pv_wait_early, wait_early);
+ lockevent_inc(pv_wait_node);
+ lockevent_cond_inc(pv_wait_early, wait_early);
pv_wait(&pn->state, vcpu_halted);
}

@@ -339,7 +339,8 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
* So it is better to spin for a while in the hope that the
* MCS lock will be released soon.
*/
- qstat_inc(qstat_pv_spurious_wakeup, !READ_ONCE(node->locked));
+ lockevent_cond_inc(pv_spurious_wakeup,
+ !READ_ONCE(node->locked));
}

/*
@@ -416,7 +417,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
/*
* Tracking # of slowpath locking operations
*/
- qstat_inc(qstat_lock_slowpath, true);
+ lockevent_inc(lock_slowpath);

for (;; waitcnt++) {
/*
@@ -464,8 +465,8 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
}
}
WRITE_ONCE(pn->state, vcpu_hashed);
- qstat_inc(qstat_pv_wait_head, true);
- qstat_inc(qstat_pv_wait_again, waitcnt);
+ lockevent_inc(pv_wait_head);
+ lockevent_cond_inc(pv_wait_again, waitcnt);
pv_wait(&lock->locked, _Q_SLOW_VAL);

/*
@@ -528,7 +529,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
* vCPU is harmless other than the additional latency in completing
* the unlock.
*/
- qstat_inc(qstat_pv_kick_unlock, true);
+ lockevent_inc(pv_kick_unlock);
pv_kick(node->cpu);
}

diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index d73f853..1db5b37 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -38,8 +38,8 @@
* Subtracting lock_use_node[234] from lock_slowpath will give you
* lock_use_node1.
*
- * Writing to the "reset_counters" file will reset all the above counter
- * values.
+ * Writing to the special ".reset_counts" file will reset all the above
+ * counter values.
*
* These statistical counters are implemented as per-cpu variables which are
* summed and computed whenever the corresponding debugfs files are read. This
@@ -48,27 +48,7 @@
*
* There may be slight difference between pv_kick_wake and pv_kick_unlock.
*/
-enum qlock_stats {
- qstat_pv_hash_hops,
- qstat_pv_kick_unlock,
- qstat_pv_kick_wake,
- qstat_pv_latency_kick,
- qstat_pv_latency_wake,
- qstat_pv_lock_stealing,
- qstat_pv_spurious_wakeup,
- qstat_pv_wait_again,
- qstat_pv_wait_early,
- qstat_pv_wait_head,
- qstat_pv_wait_node,
- qstat_lock_pending,
- qstat_lock_slowpath,
- qstat_lock_use_node2,
- qstat_lock_use_node3,
- qstat_lock_use_node4,
- qstat_lock_no_node,
- qstat_num, /* Total number of statistical counters */
- qstat_reset_cnts = qstat_num,
-};
+#include "lock_events.h"

#ifdef CONFIG_QUEUED_LOCK_STAT
/*
@@ -79,99 +59,91 @@ enum qlock_stats {
#include <linux/sched/clock.h>
#include <linux/fs.h>

-static const char * const qstat_names[qstat_num + 1] = {
- [qstat_pv_hash_hops] = "pv_hash_hops",
- [qstat_pv_kick_unlock] = "pv_kick_unlock",
- [qstat_pv_kick_wake] = "pv_kick_wake",
- [qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
- [qstat_pv_latency_kick] = "pv_latency_kick",
- [qstat_pv_latency_wake] = "pv_latency_wake",
- [qstat_pv_lock_stealing] = "pv_lock_stealing",
- [qstat_pv_wait_again] = "pv_wait_again",
- [qstat_pv_wait_early] = "pv_wait_early",
- [qstat_pv_wait_head] = "pv_wait_head",
- [qstat_pv_wait_node] = "pv_wait_node",
- [qstat_lock_pending] = "lock_pending",
- [qstat_lock_slowpath] = "lock_slowpath",
- [qstat_lock_use_node2] = "lock_use_node2",
- [qstat_lock_use_node3] = "lock_use_node3",
- [qstat_lock_use_node4] = "lock_use_node4",
- [qstat_lock_no_node] = "lock_no_node",
- [qstat_reset_cnts] = "reset_counters",
+#define EVENT_COUNT(ev) lockevents[LOCKEVENT_ ## ev]
+
+#undef LOCK_EVENT
+#define LOCK_EVENT(name) [LOCKEVENT_ ## name] = #name,
+
+static const char * const lockevent_names[lockevent_num + 1] = {
+
+#include "lock_events_list.h"
+
+ [LOCKEVENT_reset_cnts] = ".reset_counts",
};

/*
* Per-cpu counters
*/
-static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);
+DEFINE_PER_CPU(unsigned long, lockevents[lockevent_num]);
static DEFINE_PER_CPU(u64, pv_kick_time);

/*
* Function to read and return the qlock statistical counter values
*
* The following counters are handled specially:
- * 1. qstat_pv_latency_kick
+ * 1. pv_latency_kick
* Average kick latency (ns) = pv_latency_kick/pv_kick_unlock
- * 2. qstat_pv_latency_wake
+ * 2. pv_latency_wake
* Average wake latency (ns) = pv_latency_wake/pv_kick_wake
- * 3. qstat_pv_hash_hops
+ * 3. pv_hash_hops
* Average hops/hash = pv_hash_hops/pv_kick_unlock
*/
-static ssize_t qstat_read(struct file *file, char __user *user_buf,
- size_t count, loff_t *ppos)
+static ssize_t lockevent_read(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
{
char buf[64];
- int cpu, counter, len;
- u64 stat = 0, kicks = 0;
+ int cpu, id, len;
+ u64 sum = 0, kicks = 0;

/*
* Get the counter ID stored in file->f_inode->i_private
*/
- counter = (long)file_inode(file)->i_private;
+ id = (long)file_inode(file)->i_private;

- if (counter >= qstat_num)
+ if (id >= lockevent_num)
return -EBADF;

for_each_possible_cpu(cpu) {
- stat += per_cpu(qstats[counter], cpu);
+ sum += per_cpu(lockevents[id], cpu);
/*
- * Need to sum additional counter for some of them
+ * Need to sum additional counters for some of them
*/
- switch (counter) {
+ switch (id) {

- case qstat_pv_latency_kick:
- case qstat_pv_hash_hops:
- kicks += per_cpu(qstats[qstat_pv_kick_unlock], cpu);
+ case LOCKEVENT_pv_latency_kick:
+ case LOCKEVENT_pv_hash_hops:
+ kicks += per_cpu(EVENT_COUNT(pv_kick_unlock), cpu);
break;

- case qstat_pv_latency_wake:
- kicks += per_cpu(qstats[qstat_pv_kick_wake], cpu);
+ case LOCKEVENT_pv_latency_wake:
+ kicks += per_cpu(EVENT_COUNT(pv_kick_wake), cpu);
break;
}
}

- if (counter == qstat_pv_hash_hops) {
+ if (id == LOCKEVENT_pv_hash_hops) {
u64 frac = 0;

if (kicks) {
- frac = 100ULL * do_div(stat, kicks);
+ frac = 100ULL * do_div(sum, kicks);
frac = DIV_ROUND_CLOSEST_ULL(frac, kicks);
}

/*
* Return a X.XX decimal number
*/
- len = snprintf(buf, sizeof(buf) - 1, "%llu.%02llu\n", stat, frac);
+ len = snprintf(buf, sizeof(buf) - 1, "%llu.%02llu\n",
+ sum, frac);
} else {
/*
* Round to the nearest ns
*/
- if ((counter == qstat_pv_latency_kick) ||
- (counter == qstat_pv_latency_wake)) {
+ if ((id == LOCKEVENT_pv_latency_kick) ||
+ (id == LOCKEVENT_pv_latency_wake)) {
if (kicks)
- stat = DIV_ROUND_CLOSEST_ULL(stat, kicks);
+ sum = DIV_ROUND_CLOSEST_ULL(sum, kicks);
}
- len = snprintf(buf, sizeof(buf) - 1, "%llu\n", stat);
+ len = snprintf(buf, sizeof(buf) - 1, "%llu\n", sum);
}

return simple_read_from_buffer(user_buf, count, ppos, buf, len);
@@ -180,11 +152,9 @@ static ssize_t qstat_read(struct file *file, char __user *user_buf,
/*
* Function to handle write request
*
- * When counter = reset_cnts, reset all the counter values.
- * Since the counter updates aren't atomic, the resetting is done twice
- * to make sure that the counters are very likely to be all cleared.
+ * When id = .reset_cnts, reset all the counter values.
*/
-static ssize_t qstat_write(struct file *file, const char __user *user_buf,
+static ssize_t lockevent_write(struct file *file, const char __user *user_buf,
size_t count, loff_t *ppos)
{
int cpu;
@@ -192,14 +162,14 @@ static ssize_t qstat_write(struct file *file, const char __user *user_buf,
/*
* Get the counter ID stored in file->f_inode->i_private
*/
- if ((long)file_inode(file)->i_private != qstat_reset_cnts)
+ if ((long)file_inode(file)->i_private != LOCKEVENT_reset_cnts)
return count;

for_each_possible_cpu(cpu) {
int i;
- unsigned long *ptr = per_cpu_ptr(qstats, cpu);
+ unsigned long *ptr = per_cpu_ptr(lockevents, cpu);

- for (i = 0 ; i < qstat_num; i++)
+ for (i = 0 ; i < lockevent_num; i++)
WRITE_ONCE(ptr[i], 0);
}
return count;
@@ -208,9 +178,9 @@ static ssize_t qstat_write(struct file *file, const char __user *user_buf,
/*
* Debugfs data structures
*/
-static const struct file_operations fops_qstat = {
- .read = qstat_read,
- .write = qstat_write,
+static const struct file_operations fops_lockevent = {
+ .read = lockevent_read,
+ .write = lockevent_write,
.llseek = default_llseek,
};

@@ -219,10 +189,10 @@ static ssize_t qstat_write(struct file *file, const char __user *user_buf,
*/
static int __init init_qspinlock_stat(void)
{
- struct dentry *d_qstat = debugfs_create_dir("qlockstat", NULL);
+ struct dentry *d_counts = debugfs_create_dir("qlockstat", NULL);
int i;

- if (!d_qstat)
+ if (!d_counts)
goto out;

/*
@@ -232,18 +202,19 @@ static int __init init_qspinlock_stat(void)
* root is allowed to do the read/write to limit impact to system
* performance.
*/
- for (i = 0; i < qstat_num; i++)
- if (!debugfs_create_file(qstat_names[i], 0400, d_qstat,
- (void *)(long)i, &fops_qstat))
+ for (i = 0; i < lockevent_num; i++)
+ if (!debugfs_create_file(lockevent_names[i], 0400, d_counts,
+ (void *)(long)i, &fops_lockevent))
goto fail_undo;

- if (!debugfs_create_file(qstat_names[qstat_reset_cnts], 0200, d_qstat,
- (void *)(long)qstat_reset_cnts, &fops_qstat))
+ if (!debugfs_create_file(lockevent_names[LOCKEVENT_reset_cnts], 0200,
+ d_counts, (void *)(long)LOCKEVENT_reset_cnts,
+ &fops_lockevent))
goto fail_undo;

return 0;
fail_undo:
- debugfs_remove_recursive(d_qstat);
+ debugfs_remove_recursive(d_counts);
out:
pr_warn("Could not create 'qlockstat' debugfs entries\n");
return -ENOMEM;
@@ -251,20 +222,11 @@ static int __init init_qspinlock_stat(void)
fs_initcall(init_qspinlock_stat);

/*
- * Increment the PV qspinlock statistical counters
- */
-static inline void qstat_inc(enum qlock_stats stat, bool cond)
-{
- if (cond)
- this_cpu_inc(qstats[stat]);
-}
-
-/*
* PV hash hop count
*/
-static inline void qstat_hop(int hopcnt)
+static inline void lockevent_pv_hop(int hopcnt)
{
- this_cpu_add(qstats[qstat_pv_hash_hops], hopcnt);
+ this_cpu_add(EVENT_COUNT(pv_hash_hops), hopcnt);
}

/*
@@ -276,7 +238,7 @@ static inline void __pv_kick(int cpu)

per_cpu(pv_kick_time, cpu) = start;
pv_kick(cpu);
- this_cpu_add(qstats[qstat_pv_latency_kick], sched_clock() - start);
+ this_cpu_add(EVENT_COUNT(pv_latency_kick), sched_clock() - start);
}

/*
@@ -289,9 +251,9 @@ static inline void __pv_wait(u8 *ptr, u8 val)
*pkick_time = 0;
pv_wait(ptr, val);
if (*pkick_time) {
- this_cpu_add(qstats[qstat_pv_latency_wake],
+ this_cpu_add(EVENT_COUNT(pv_latency_wake),
sched_clock() - *pkick_time);
- qstat_inc(qstat_pv_kick_wake, true);
+ lockevent_inc(pv_kick_wake);
}
}

@@ -300,7 +262,6 @@ static inline void __pv_wait(u8 *ptr, u8 val)

#else /* CONFIG_QUEUED_LOCK_STAT */

-static inline void qstat_inc(enum qlock_stats stat, bool cond) { }
-static inline void qstat_hop(int hopcnt) { }
+static inline void lockevent_pv_hop(int hopcnt) { }

#endif /* CONFIG_QUEUED_LOCK_STAT */
--
1.8.3.1


2019-02-07 19:09:53

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 06/22] locking/rwsem: Rename kernel/locking/rwsem.h

The content of kernel/locking/rwsem.h is now specific to rwsem-xadd only.
Rename it to rwsem-xadd.h to reflect that, and include it only when
CONFIG_RWSEM_XCHGADD_ALGORITHM is set.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/percpu-rwsem.c | 4 +-
kernel/locking/rwsem-xadd.c | 2 +-
kernel/locking/rwsem-xadd.h | 274 +++++++++++++++++++++++++++++++++++++++++
kernel/locking/rwsem.c | 7 +-
kernel/locking/rwsem.h | 277 ------------------------------------------
5 files changed, 284 insertions(+), 280 deletions(-)
create mode 100644 kernel/locking/rwsem-xadd.h
delete mode 100644 kernel/locking/rwsem.h

diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f17dad9..d06f7c3 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -7,7 +7,9 @@
#include <linux/sched.h>
#include <linux/errno.h>

-#include "rwsem.h"
+#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
+#include "rwsem-xadd.h"
+#endif

int __percpu_init_rwsem(struct percpu_rw_semaphore *sem,
const char *name, struct lock_class_key *rwsem_key)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index c213869..62422a6 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -19,7 +19,7 @@
#include <linux/sched/debug.h>
#include <linux/osq_lock.h>

-#include "rwsem.h"
+#include "rwsem-xadd.h"

/*
* Guide to the rw_semaphore's count field for common values.
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
new file mode 100644
index 0000000..2eaa762
--- /dev/null
+++ b/kernel/locking/rwsem-xadd.h
@@ -0,0 +1,274 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * The least significant 2 bits of the owner value has the following
+ * meanings when set.
+ * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
+ * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
+ * i.e. the owner(s) cannot be readily determined. It can be reader
+ * owned or the owning writer is indeterminate.
+ *
+ * When a writer acquires a rwsem, it puts its task_struct pointer
+ * into the owner field. It is cleared after an unlock.
+ *
+ * When a reader acquires a rwsem, it will also puts its task_struct
+ * pointer into the owner field with both the RWSEM_READER_OWNED and
+ * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * largely be left untouched. So for a free or reader-owned rwsem,
+ * the owner value may contain information about the last reader that
+ * acquires the rwsem. The anonymous bit is set because that particular
+ * reader may or may not still own the lock.
+ *
+ * That information may be helpful in debugging cases where the system
+ * seems to hang on a reader owned rwsem especially if only one reader
+ * is involved. Ideally we would like to track all the readers that own
+ * a rwsem, but the overhead is simply too big.
+ */
+#define RWSEM_READER_OWNED (1UL << 0)
+#define RWSEM_ANONYMOUSLY_OWNED (1UL << 1)
+
+#ifdef CONFIG_DEBUG_RWSEMS
+# define DEBUG_RWSEMS_WARN_ON(c) DEBUG_LOCKS_WARN_ON(c)
+#else
+# define DEBUG_RWSEMS_WARN_ON(c)
+#endif
+
+/*
+ * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
+ * Adapted largely from include/asm-i386/rwsem.h
+ * by Paul Mackerras <[email protected]>.
+ */
+
+/*
+ * the semaphore definition
+ */
+#ifdef CONFIG_64BIT
+# define RWSEM_ACTIVE_MASK 0xffffffffL
+#else
+# define RWSEM_ACTIVE_MASK 0x0000ffffL
+#endif
+
+#define RWSEM_ACTIVE_BIAS 0x00000001L
+#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * All writes to owner are protected by WRITE_ONCE() to make sure that
+ * store tearing can't happen as optimistic spinners may read and use
+ * the owner value concurrently without lock. Read from owner, however,
+ * may not need READ_ONCE() as long as the pointer value is only used
+ * for comparison and isn't being dereferenced.
+ */
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+ WRITE_ONCE(sem->owner, current);
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+ WRITE_ONCE(sem->owner, NULL);
+}
+
+/*
+ * The task_struct pointer of the last owning reader will be left in
+ * the owner field.
+ *
+ * Note that the owner value just indicates the task has owned the rwsem
+ * previously, it may not be the real owner or one of the real owners
+ * anymore when that field is examined, so take it with a grain of salt.
+ */
+static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
+ struct task_struct *owner)
+{
+ unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
+ | RWSEM_ANONYMOUSLY_OWNED;
+
+ WRITE_ONCE(sem->owner, (struct task_struct *)val);
+}
+
+static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
+{
+ __rwsem_set_reader_owned(sem, current);
+}
+
+/*
+ * Return true if the a rwsem waiter can spin on the rwsem's owner
+ * and steal the lock, i.e. the lock is not anonymously owned.
+ * N.B. !owner is considered spinnable.
+ */
+static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+{
+ return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
+}
+
+/*
+ * Return true if rwsem is owned by an anonymous writer or readers.
+ */
+static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
+{
+ return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+}
+
+#ifdef CONFIG_DEBUG_RWSEMS
+/*
+ * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
+ * is a task pointer in owner of a reader-owned rwsem, it will be the
+ * real owner or one of the real owners. The only exception is when the
+ * unlock is done by up_read_non_owner().
+ */
+#define rwsem_clear_reader_owned rwsem_clear_reader_owned
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+ unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
+ | RWSEM_ANONYMOUSLY_OWNED;
+ if (READ_ONCE(sem->owner) == (struct task_struct *)val)
+ cmpxchg_relaxed((unsigned long *)&sem->owner, val,
+ RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+}
+#endif
+
+#else
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+}
+
+static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
+ struct task_struct *owner)
+{
+}
+
+static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
+{
+}
+#endif
+
+#ifndef rwsem_clear_reader_owned
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+}
+#endif
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0))
+ rwsem_down_read_failed(sem);
+ else
+ rwsem_set_reader_owned(sem);
+}
+
+static inline int __down_read_killable(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+ if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+ return -EINTR;
+ } else {
+ rwsem_set_reader_owned(sem);
+ }
+ return 0;
+}
+
+static inline int __down_read_trylock(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ while ((tmp = atomic_long_read(&sem->count)) >= 0) {
+ if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
+ tmp + RWSEM_ACTIVE_READ_BIAS)) {
+ rwsem_set_reader_owned(sem);
+ return 1;
+ }
+ }
+ return 0;
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
+ &sem->count);
+ if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ rwsem_down_write_failed(sem);
+ rwsem_set_owner(sem);
+}
+
+static inline int __down_write_killable(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
+ &sem->count);
+ if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+ return -EINTR;
+ rwsem_set_owner(sem);
+ return 0;
+}
+
+static inline int __down_write_trylock(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_ACTIVE_WRITE_BIAS);
+ if (tmp == RWSEM_UNLOCKED_VALUE) {
+ rwsem_set_owner(sem);
+ return true;
+ }
+ return false;
+}
+
+/*
+ * unlock after reading
+ */
+static inline void __up_read(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ rwsem_clear_reader_owned(sem);
+ tmp = atomic_long_dec_return_release(&sem->count);
+ if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+ rwsem_wake(sem);
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ rwsem_clear_owner(sem);
+ if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
+ &sem->count) < 0))
+ rwsem_wake(sem);
+}
+
+/*
+ * downgrade write lock to read lock
+ */
+static inline void __downgrade_write(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ /*
+ * When downgrading from exclusive to shared ownership,
+ * anything inside the write-locked region cannot leak
+ * into the read side. In contrast, anything in the
+ * read-locked region is ok to be re-ordered into the
+ * write side. As such, rely on RELEASE semantics.
+ */
+ tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+ rwsem_set_reader_owned(sem);
+ if (tmp < 0)
+ rwsem_downgrade_wake(sem);
+}
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 59e5848..b3b4582 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -13,7 +13,12 @@
#include <linux/rwsem.h>
#include <linux/atomic.h>

-#include "rwsem.h"
+#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
+#include "rwsem-xadd.h"
+#else
+#define __rwsem_set_reader_owned(a, b)
+#define DEBUG_RWSEMS_WARN_ON(a)
+#endif

/*
* lock for reading
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
deleted file mode 100644
index f568391..0000000
--- a/kernel/locking/rwsem.h
+++ /dev/null
@@ -1,277 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * The least significant 2 bits of the owner value has the following
- * meanings when set.
- * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- * i.e. the owner(s) cannot be readily determined. It can be reader
- * owned or the owning writer is indeterminate.
- *
- * When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
- *
- * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
- *
- * That information may be helpful in debugging cases where the system
- * seems to hang on a reader owned rwsem especially if only one reader
- * is involved. Ideally we would like to track all the readers that own
- * a rwsem, but the overhead is simply too big.
- */
-#define RWSEM_READER_OWNED (1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED (1UL << 1)
-
-#ifdef CONFIG_DEBUG_RWSEMS
-# define DEBUG_RWSEMS_WARN_ON(c) DEBUG_LOCKS_WARN_ON(c)
-#else
-# define DEBUG_RWSEMS_WARN_ON(c)
-#endif
-
-/*
- * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <[email protected]>.
- */
-
-/*
- * the semaphore definition
- */
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK 0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK 0x0000ffffL
-#endif
-
-#define RWSEM_ACTIVE_BIAS 0x00000001L
-#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-/*
- * All writes to owner are protected by WRITE_ONCE() to make sure that
- * store tearing can't happen as optimistic spinners may read and use
- * the owner value concurrently without lock. Read from owner, however,
- * may not need READ_ONCE() as long as the pointer value is only used
- * for comparison and isn't being dereferenced.
- */
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
- WRITE_ONCE(sem->owner, current);
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
- WRITE_ONCE(sem->owner, NULL);
-}
-
-/*
- * The task_struct pointer of the last owning reader will be left in
- * the owner field.
- *
- * Note that the owner value just indicates the task has owned the rwsem
- * previously, it may not be the real owner or one of the real owners
- * anymore when that field is examined, so take it with a grain of salt.
- */
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
- struct task_struct *owner)
-{
- unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
-
- WRITE_ONCE(sem->owner, (struct task_struct *)val);
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
- __rwsem_set_reader_owned(sem, current);
-}
-
-/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
- * N.B. !owner is considered spinnable.
- */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
-{
- return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
- return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
-}
-
-#ifdef CONFIG_DEBUG_RWSEMS
-/*
- * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
- * is a task pointer in owner of a reader-owned rwsem, it will be the
- * real owner or one of the real owners. The only exception is when the
- * unlock is done by up_read_non_owner().
- */
-#define rwsem_clear_reader_owned rwsem_clear_reader_owned
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
- unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
- if (READ_ONCE(sem->owner) == (struct task_struct *)val)
- cmpxchg_relaxed((unsigned long *)&sem->owner, val,
- RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
-}
-#endif
-
-#else
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
- struct task_struct *owner)
-{
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-#ifndef rwsem_clear_reader_owned
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0))
- rwsem_down_read_failed(sem);
- else
- rwsem_set_reader_owned(sem);
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
- if (IS_ERR(rwsem_down_read_failed_killable(sem)))
- return -EINTR;
- } else {
- rwsem_set_reader_owned(sem);
- }
- return 0;
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- long tmp;
-
- while ((tmp = atomic_long_read(&sem->count)) >= 0) {
- if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
- tmp + RWSEM_ACTIVE_READ_BIAS)) {
- rwsem_set_reader_owned(sem);
- return 1;
- }
- }
- return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
- rwsem_down_write_failed(sem);
- rwsem_set_owner(sem);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
- if (IS_ERR(rwsem_down_write_failed_killable(sem)))
- return -EINTR;
- rwsem_set_owner(sem);
- return 0;
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- long tmp;
-
- tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- if (tmp == RWSEM_UNLOCKED_VALUE) {
- rwsem_set_owner(sem);
- return true;
- }
- return false;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- long tmp;
-
- rwsem_clear_reader_owned(sem);
- tmp = atomic_long_dec_return_release(&sem->count);
- if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- rwsem_clear_owner(sem);
- if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count) < 0))
- rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- long tmp;
-
- /*
- * When downgrading from exclusive to shared ownership,
- * anything inside the write-locked region cannot leak
- * into the read side. In contrast, anything in the
- * read-locked region is ok to be re-ordered into the
- * write side. As such, rely on RELEASE semantics.
- */
- tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
- rwsem_set_reader_owned(sem);
- if (tmp < 0)
- rwsem_downgrade_wake(sem);
-}
-
-#endif /* CONFIG_RWSEM_XCHGADD_ALGORITHM */
--
1.8.3.1


2019-02-07 19:10:19

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 03/22] locking/rwsem: Relocate rwsem_down_read_failed()

The rwsem_down_read_failed*() functions were relocated from above the
optimistic spinning section to below that section. This enables the
reader functions to use optimistic spinning in future patches. There
is no code change.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 172 ++++++++++++++++++++++----------------------
1 file changed, 86 insertions(+), 86 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index fbe9634..0d518b5 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -225,92 +225,6 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
}

/*
- * Wait for the read lock to be granted
- */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
-{
- long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
- struct rwsem_waiter waiter;
- DEFINE_WAKE_Q(wake_q);
-
- waiter.task = current;
- waiter.type = RWSEM_WAITING_FOR_READ;
-
- raw_spin_lock_irq(&sem->wait_lock);
- if (list_empty(&sem->wait_list)) {
- /*
- * In case the wait queue is empty and the lock isn't owned
- * by a writer, this reader can exit the slowpath and return
- * immediately as its RWSEM_ACTIVE_READ_BIAS has already
- * been set in the count.
- */
- if (atomic_long_read(&sem->count) >= 0) {
- raw_spin_unlock_irq(&sem->wait_lock);
- return sem;
- }
- adjustment += RWSEM_WAITING_BIAS;
- }
- list_add_tail(&waiter.list, &sem->wait_list);
-
- /* we're now waiting on the lock, but no longer actively locking */
- count = atomic_long_add_return(adjustment, &sem->count);
-
- /*
- * If there are no active locks, wake the front queued process(es).
- *
- * If there are no writers and we are first in the queue,
- * wake our own waiter to join the existing active readers !
- */
- if (count == RWSEM_WAITING_BIAS ||
- (count > RWSEM_WAITING_BIAS &&
- adjustment != -RWSEM_ACTIVE_READ_BIAS))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
- raw_spin_unlock_irq(&sem->wait_lock);
- wake_up_q(&wake_q);
-
- /* wait to be given the lock */
- while (true) {
- set_current_state(state);
- if (!waiter.task)
- break;
- if (signal_pending_state(state, current)) {
- raw_spin_lock_irq(&sem->wait_lock);
- if (waiter.task)
- goto out_nolock;
- raw_spin_unlock_irq(&sem->wait_lock);
- break;
- }
- schedule();
- }
-
- __set_current_state(TASK_RUNNING);
- return sem;
-out_nolock:
- list_del(&waiter.list);
- if (list_empty(&sem->wait_list))
- atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
- raw_spin_unlock_irq(&sem->wait_lock);
- __set_current_state(TASK_RUNNING);
- return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
-/*
* This function must be called with the sem->wait_lock held to prevent
* race conditions between checking the rwsem wait list and setting the
* sem->count accordingly.
@@ -505,6 +419,92 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
#endif

/*
+ * Wait for the read lock to be granted
+ */
+static inline struct rw_semaphore __sched *
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+{
+ long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+ struct rwsem_waiter waiter;
+ DEFINE_WAKE_Q(wake_q);
+
+ waiter.task = current;
+ waiter.type = RWSEM_WAITING_FOR_READ;
+
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (list_empty(&sem->wait_list)) {
+ /*
+ * In case the wait queue is empty and the lock isn't owned
+ * by a writer, this reader can exit the slowpath and return
+ * immediately as its RWSEM_ACTIVE_READ_BIAS has already
+ * been set in the count.
+ */
+ if (atomic_long_read(&sem->count) >= 0) {
+ raw_spin_unlock_irq(&sem->wait_lock);
+ return sem;
+ }
+ adjustment += RWSEM_WAITING_BIAS;
+ }
+ list_add_tail(&waiter.list, &sem->wait_list);
+
+ /* we're now waiting on the lock, but no longer actively locking */
+ count = atomic_long_add_return(adjustment, &sem->count);
+
+ /*
+ * If there are no active locks, wake the front queued process(es).
+ *
+ * If there are no writers and we are first in the queue,
+ * wake our own waiter to join the existing active readers !
+ */
+ if (count == RWSEM_WAITING_BIAS ||
+ (count > RWSEM_WAITING_BIAS &&
+ adjustment != -RWSEM_ACTIVE_READ_BIAS))
+ __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+
+ /* wait to be given the lock */
+ while (true) {
+ set_current_state(state);
+ if (!waiter.task)
+ break;
+ if (signal_pending_state(state, current)) {
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (waiter.task)
+ goto out_nolock;
+ raw_spin_unlock_irq(&sem->wait_lock);
+ break;
+ }
+ schedule();
+ }
+
+ __set_current_state(TASK_RUNNING);
+ return sem;
+out_nolock:
+ list_del(&waiter.list);
+ if (list_empty(&sem->wait_list))
+ atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ __set_current_state(TASK_RUNNING);
+ return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed(struct rw_semaphore *sem)
+{
+ return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed_killable(struct rw_semaphore *sem)
+{
+ return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed_killable);
+
+/*
* Wait until we successfully acquire the write lock
*/
static inline struct rw_semaphore *
--
1.8.3.1


2019-02-07 19:10:24

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 07/22] locking/rwsem: Move rwsem internal function declarations to rwsem-xadd.h

We don't need to expose rwsem internal functions which are not supposed
to be called directly from other kernel code.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/rwsem.h | 7 -------
kernel/locking/rwsem-xadd.h | 7 +++++++
2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 6e56006..1e0457b 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -50,13 +50,6 @@ struct rw_semaphore {
*/
#define RWSEM_OWNER_UNKNOWN ((struct task_struct *)-2L)

-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
/* In all implementations count != 0 means locked */
static inline int rwsem_is_locked(struct rw_semaphore *sem)
{
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index 2eaa762..64e7d62 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -153,6 +153,13 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
}
#endif

+extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
+
/*
* lock for reading
*/
--
1.8.3.1


2019-02-07 19:10:26

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 08/22] locking/rwsem: Add debug check for __down_read*()

When rwsem_down_read_failed*() returns, the read lock has been acquired
indirectly on the caller's behalf. So debug checks are added in
__down_read() and __down_read_killable() to make sure the rwsem is
really reader-owned.

The other debug check calls in kernel/locking/rwsem.c, except the
one in up_read_non_owner(), are also moved over to rwsem-xadd.h.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.h | 12 ++++++++++--
kernel/locking/rwsem.c | 3 ---
2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index 64e7d62..77151c3 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -165,10 +165,13 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
*/
static inline void __down_read(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0))
+ if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
rwsem_down_read_failed(sem);
- else
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+ RWSEM_READER_OWNED));
+ } else {
rwsem_set_reader_owned(sem);
+ }
}

static inline int __down_read_killable(struct rw_semaphore *sem)
@@ -176,6 +179,8 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
if (IS_ERR(rwsem_down_read_failed_killable(sem)))
return -EINTR;
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+ RWSEM_READER_OWNED));
} else {
rwsem_set_reader_owned(sem);
}
@@ -243,6 +248,7 @@ static inline void __up_read(struct rw_semaphore *sem)
{
long tmp;

+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED));
rwsem_clear_reader_owned(sem);
tmp = atomic_long_dec_return_release(&sem->count);
if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
@@ -254,6 +260,7 @@ static inline void __up_read(struct rw_semaphore *sem)
*/
static inline void __up_write(struct rw_semaphore *sem)
{
+ DEBUG_RWSEMS_WARN_ON(sem->owner != current);
rwsem_clear_owner(sem);
if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
&sem->count) < 0))
@@ -274,6 +281,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* read-locked region is ok to be re-ordered into the
* write side. As such, rely on RELEASE semantics.
*/
+ DEBUG_RWSEMS_WARN_ON(sem->owner != current);
tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
if (tmp < 0)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index b3b4582..598fc7c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -114,7 +114,6 @@ int down_write_trylock(struct rw_semaphore *sem)
void up_read(struct rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, 1, _RET_IP_);
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED));

__up_read(sem);
}
@@ -127,7 +126,6 @@ void up_read(struct rw_semaphore *sem)
void up_write(struct rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, 1, _RET_IP_);
- DEBUG_RWSEMS_WARN_ON(sem->owner != current);

__up_write(sem);
}
@@ -140,7 +138,6 @@ void up_write(struct rw_semaphore *sem)
void downgrade_write(struct rw_semaphore *sem)
{
lock_downgrade(&sem->dep_map, _RET_IP_);
- DEBUG_RWSEMS_WARN_ON(sem->owner != current);

__downgrade_write(sem);
}
--
1.8.3.1


2019-02-07 19:10:40

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 11/22] locking/rwsem: Implement a new locking scheme

The current way of using various reader, writer and waiting biases
in the rwsem code is confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to optimize.

To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used. The count is now a single atomic word on
all architectures. The current bit definitions are:

Bit 0 - writer locked bit
Bit 1 - waiters present bit
Bits 2-7 - reserved for future extension
Bits 8-X - reader count (24/56 bits)

Now the cmpxchg instruction is used to acquire the write lock. The
read lock is still acquired with the xadd instruction, so there is no
change there. This scheme allows up to 16M active readers, which
should be more than enough. We can always use some more reserved bits
if necessary.

With that change, we can deterministically know whether a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain whether a rwsem is owned by readers, as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.
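
As a rough sketch of what the fast paths reduce to under this scheme
(illustrative only; the function names below are made up, while the
bit macros follow the definitions above and helpers already used
elsewhere in this series; the actual patch may also bail out to the
slowpath when the waiters bit is set):

    /* illustrative bit layout matching the description above */
    #define RWSEM_WRITER_LOCKED    (1UL << 0)    /* Bit 0: writer owns the lock  */
    #define RWSEM_FLAG_WAITERS     (1UL << 1)    /* Bit 1: waiters are queued    */
    #define RWSEM_READER_BIAS      (1UL << 8)    /* Bits 8+: one unit per reader */

    /* write lock: a single cmpxchg from the unlocked value to the writer bit */
    static inline bool write_trylock_sketch(struct rw_semaphore *sem)
    {
            return atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
                            RWSEM_WRITER_LOCKED) == RWSEM_UNLOCKED_VALUE;
    }

    /*
     * read lock fast path: xadd the reader bias; if a writer turns out to
     * hold the lock, the caller has to back out in the slowpath.
     */
    static inline bool read_lock_fastpath_sketch(struct rw_semaphore *sem)
    {
            return !(atomic_long_add_return_acquire(RWSEM_READER_BIAS,
                            &sem->count) & RWSEM_WRITER_LOCKED);
    }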

With a locking microbenchmark running on a 5.0-based kernel, the total
locking rates (in kops/s) of the benchmark on a 4-socket 56-core
x86-64 system before and after the patch were as follows:

                   Before Patch        After Patch
   # of Threads    wlock    rlock     wlock    rlock
   ------------    -----    -----     -----    -----
         1        28,773   30,164    29,063   29,967
         2         7,435   15,167     6,746   13,198
         4         7,181   14,438     6,515   12,763
         8         6,918   13,796     6,747   13,419
        16         6,554   16,030     6,653   15,534

The locking rates of the benchmark on a Power9 system were as follows:

                   Before Patch        After Patch
   # of Threads    wlock    rlock     wlock    rlock
   ------------    -----    -----     -----    -----
         1        11,527   12,378    11,527   12,195
         2         6,318   15,116     7,690   15,065
         4         3,957   16,812     4,495   16,884
         8         3,617   16,496     3,993   16,588

The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:

                   Before Patch        After Patch
   # of Threads    wlock    rlock     wlock    rlock
   ------------    -----    -----     -----    -----
         1        23,703   23,208    23,923   23,167
         2         6,206   11,996     6,756   13,268
         4         6,098   12,343     6,928   15,064
         8         6,007   13,152     7,117   15,335

For x86, the locking rate dropped somewhat at low contention levels with
the patch; at high contention levels, however, the performance recovered.
For both Power9 and ARM64, the patch caused no performance drop. In fact,
performance improved a little.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 145 +++++++++++++++-----------------------------
kernel/locking/rwsem-xadd.h | 85 +++++++++++++-------------
2 files changed, 89 insertions(+), 141 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index fff231a..18a414e 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
*
* Optimistic spinning by Tim Chen <[email protected]>
* and Davidlohr Bueso <[email protected]>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <[email protected]>.
*/
#include <linux/rwsem.h>
#include <linux/init.h>
@@ -23,52 +25,20 @@
#include "lock_events.h"

/*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X (1) X readers active or attempting lock, no writer waiting
- * X = #active_readers + #readers attempting to lock
- * (X*ACTIVE_BIAS)
- *
- * 0x00000000 rwsem is unlocked, and no one is waiting for the lock or
- * attempting to read lock or write lock.
- *
- * 0xffff000X (1) X readers active or attempting lock, with waiters for lock
- * X = #active readers + # readers attempting lock
- * (X*ACTIVE_BIAS + WAITING_BIAS)
- * (2) 1 writer attempting lock, no waiters for lock
- * X-1 = #active readers + #readers attempting lock
- * ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- * (3) 1 writer active, no waiters for lock
- * X-1 = #active readers + #readers attempting lock
- * ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001 (1) 1 reader active or attempting lock, waiters for lock
- * (WAITING_BIAS + ACTIVE_BIAS)
- * (2) 1 writer active or attempting lock, no waiters for lock
- * (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
*
- * 0xffff0000 (1) There are writers or readers queued but none active
- * or in the process of attempting lock.
- * (WAITING_BIAS)
- * Note: writer can attempt to steal lock for this count by adding
- * ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
*
- * 0xfffe0001 (1) 1 writer active, or attempting lock. Waiters on queue.
- * (ACTIVE_WRITE_BIAS + WAITING_BIAS)
- *
- * Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
- * the count becomes more than 0 for successful lock acquisition,
- * i.e. the case where there are only readers or nobody has lock.
- * (1st and 2nd case above).
- *
- * Writers attempt to lock by adding ACTIVE_WRITE_BIAS in down_write and
- * checking the count becomes ACTIVE_WRITE_BIAS for successful lock
- * acquisition (i.e. nobody else has lock or attempts lock). If
- * unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- * are only waiters but none active (5th case above), and attempt to
- * steal the lock.
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READ_OWNED bit set.
*
+ * Having some reader bits set is not enough to guarantee a reader-owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
*/

/*
@@ -114,9 +84,8 @@ enum rwsem_wake_type {

/*
* handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then:
- * - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
- * - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ * have been set.
* - there must be someone on the queue
* - the wait_lock must be held by the caller
* - tasks are marked for wakeup, the caller must later invoke wake_up_q()
@@ -160,22 +129,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
* so we can bail out early if a writer stole the lock.
*/
if (wake_type != RWSEM_WAKE_READ_OWNED) {
- adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
+ adjustment = RWSEM_READER_BIAS;
oldcount = atomic_long_fetch_add(adjustment, &sem->count);
- if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
- /*
- * If the count is still less than RWSEM_WAITING_BIAS
- * after removing the adjustment, it is assumed that
- * a writer has stolen the lock. We have to undo our
- * reader grant.
- */
- if (atomic_long_add_return(-adjustment, &sem->count) <
- RWSEM_WAITING_BIAS)
- return;
-
- /* Last active locker left. Retry waking readers. */
- goto try_reader_grant;
+ if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+ atomic_long_sub(adjustment, &sem->count);
+ return;
}
/*
* Set it to reader-owned to give spinners an early
@@ -215,11 +173,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
wake_q_add_safe(wake_q, tsk);
}

- adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+ adjustment = woken * RWSEM_READER_BIAS - adjustment;
lockevent_cond_inc(rwsem_wake_reader, woken);
if (list_empty(&sem->wait_list)) {
/* hit end of list above */
- adjustment -= RWSEM_WAITING_BIAS;
+ adjustment -= RWSEM_FLAG_WAITERS;
}

if (adjustment)
@@ -233,22 +191,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
*/
static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
{
- /*
- * Avoid trying to acquire write lock if count isn't RWSEM_WAITING_BIAS.
- */
- if (count != RWSEM_WAITING_BIAS)
+ long new;
+
+ if (RWSEM_COUNT_LOCKED(count))
return false;

- /*
- * Acquire the lock by trying to set it to ACTIVE_WRITE_BIAS. If there
- * are other tasks on the wait list, we need to add on WAITING_BIAS.
- */
- count = list_is_singular(&sem->wait_list) ?
- RWSEM_ACTIVE_WRITE_BIAS :
- RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS;
+ new = count + RWSEM_WRITER_LOCKED -
+ (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);

- if (atomic_long_cmpxchg_acquire(&sem->count, RWSEM_WAITING_BIAS, count)
- == RWSEM_WAITING_BIAS) {
+ if (atomic_long_cmpxchg_acquire(&sem->count, count, new) == count) {
rwsem_set_owner(sem);
return true;
}
@@ -265,11 +216,11 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
long old, count = atomic_long_read(&sem->count);

while (true) {
- if (!(count == 0 || count == RWSEM_WAITING_BIAS))
+ if (RWSEM_COUNT_LOCKED(count))
return false;

old = atomic_long_cmpxchg_acquire(&sem->count, count,
- count + RWSEM_ACTIVE_WRITE_BIAS);
+ count + RWSEM_WRITER_LOCKED);
if (old == count) {
rwsem_set_owner(sem);
lockevent_inc(rwsem_opt_wlock);
@@ -428,7 +379,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
static inline struct rw_semaphore __sched *
__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
{
- long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+ long count, adjustment = -RWSEM_READER_BIAS;
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);

@@ -440,16 +391,16 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
/*
* In case the wait queue is empty and the lock isn't owned
* by a writer, this reader can exit the slowpath and return
- * immediately as its RWSEM_ACTIVE_READ_BIAS has already
- * been set in the count.
+ * immediately as its RWSEM_READER_BIAS has already been
+ * set in the count.
*/
- if (atomic_long_read(&sem->count) >= 0) {
+ if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
lockevent_inc(rwsem_rlock_fast);
return sem;
}
- adjustment += RWSEM_WAITING_BIAS;
+ adjustment += RWSEM_FLAG_WAITERS;
}
list_add_tail(&waiter.list, &sem->wait_list);

@@ -462,9 +413,8 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
* If there are no writers and we are first in the queue,
* wake our own waiter to join the existing active readers !
*/
- if (count == RWSEM_WAITING_BIAS ||
- (count > RWSEM_WAITING_BIAS &&
- adjustment != -RWSEM_ACTIVE_READ_BIAS))
+ if (!RWSEM_COUNT_LOCKED(count) ||
+ (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);

raw_spin_unlock_irq(&sem->wait_lock);
@@ -492,7 +442,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
out_nolock:
list_del(&waiter.list);
if (list_empty(&sem->wait_list))
- atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+ atomic_long_add(-RWSEM_FLAG_WAITERS, &sem->count);
raw_spin_unlock_irq(&sem->wait_lock);
__set_current_state(TASK_RUNNING);
lockevent_inc(rwsem_rlock_fail);
@@ -525,9 +475,6 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);

- /* undo write bias from down_write operation, stop active locking */
- count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
-
/* do optimistic spinning and steal lock if possible */
if (rwsem_optimistic_spin(sem))
return sem;
@@ -553,10 +500,12 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)

/*
* If there were already threads queued before us and there are
- * no active writers, the lock must be read owned; so we try to
- * wake any read locks that were queued ahead of us.
+ * no active writers and some readers, the lock must be read
+ * owned; so we try to wake any read locks that were queued ahead
+ * of us.
*/
- if (count > RWSEM_WAITING_BIAS) {
+ if (!(count & RWSEM_WRITER_MASK) &&
+ (count & RWSEM_READER_MASK)) {
__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
/*
* The wakeup is normally called _after_ the wait_lock
@@ -573,8 +522,9 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
wake_q_init(&wake_q);
}

- } else
- count = atomic_long_add_return(RWSEM_WAITING_BIAS, &sem->count);
+ } else {
+ count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+ }

/* wait until we successfully acquire the lock */
set_current_state(state);
@@ -591,7 +541,8 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
schedule();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
- } while ((count = atomic_long_read(&sem->count)) & RWSEM_ACTIVE_MASK);
+ count = atomic_long_read(&sem->count);
+ } while (RWSEM_COUNT_LOCKED(count));

raw_spin_lock_irq(&sem->wait_lock);
}
@@ -607,7 +558,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
raw_spin_lock_irq(&sem->wait_lock);
list_del(&waiter.list);
if (list_empty(&sem->wait_list))
- atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+ atomic_long_add(-RWSEM_FLAG_WAITERS, &sem->count);
else
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index 6d2478c..1febd17 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -37,24 +37,26 @@
#endif

/*
- * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <[email protected]>.
- */
-
-/*
- * the semaphore definition
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
*/
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK 0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK 0x0000ffffL
-#endif
+#define RWSEM_WRITER_LOCKED (1UL << 0)
+#define RWSEM_FLAG_WAITERS (1UL << 1)
+#define RWSEM_READER_SHIFT 8
+#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)

-#define RWSEM_ACTIVE_BIAS 0x00000001L
-#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_COUNT_LOCKED(c) ((c) & RWSEM_LOCK_MASK)

#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
/*
@@ -169,7 +171,8 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
*/
static inline void __down_read(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
rwsem_down_read_failed(sem);
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
RWSEM_READER_OWNED), sem);
@@ -180,7 +183,8 @@ static inline void __down_read(struct rw_semaphore *sem)

static inline int __down_read_killable(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
if (IS_ERR(rwsem_down_read_failed_killable(sem)))
return -EINTR;
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
@@ -195,9 +199,10 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
{
long tmp;

- while ((tmp = atomic_long_read(&sem->count)) >= 0) {
+ while (!((tmp = atomic_long_read(&sem->count)) &
+ RWSEM_READ_FAILED_MASK)) {
if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
- tmp + RWSEM_ACTIVE_READ_BIAS)) {
+ tmp + RWSEM_READER_BIAS)) {
rwsem_set_reader_owned(sem);
return 1;
}
@@ -210,22 +215,16 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
*/
static inline void __down_write(struct rw_semaphore *sem)
{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
rwsem_down_write_failed(sem);
rwsem_set_owner(sem);
}

static inline int __down_write_killable(struct rw_semaphore *sem)
{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
if (IS_ERR(rwsem_down_write_failed_killable(sem)))
return -EINTR;
rwsem_set_owner(sem);
@@ -234,15 +233,11 @@ static inline int __down_write_killable(struct rw_semaphore *sem)

static inline int __down_write_trylock(struct rw_semaphore *sem)
{
- long tmp;
-
- tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- if (tmp == RWSEM_UNLOCKED_VALUE) {
+ bool taken = !atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED);
+ if (taken)
rwsem_set_owner(sem);
- return true;
- }
- return false;
+ return taken;
}

/*
@@ -255,8 +250,9 @@ static inline void __up_read(struct rw_semaphore *sem)
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
sem);
rwsem_clear_reader_owned(sem);
- tmp = atomic_long_dec_return_release(&sem->count);
- if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+ tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+ if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+ == RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}

@@ -267,8 +263,8 @@ static inline void __up_write(struct rw_semaphore *sem)
{
DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
rwsem_clear_owner(sem);
- if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count) < 0))
+ if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+ &sem->count) & RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}

@@ -287,8 +283,9 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* write side. As such, rely on RELEASE semantics.
*/
DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
- tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+ tmp = atomic_long_fetch_add_release(
+ -RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
- if (tmp < 0)
+ if (tmp & RWSEM_FLAG_WAITERS)
rwsem_downgrade_wake(sem);
}
--
1.8.3.1


2019-02-07 19:10:41

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 12/22] locking/rwsem: Implement lock handoff to prevent lock starvation

Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely, leading to lock starvation.

The mutex code has a lock handoff mechanism to prevent lock starvation.
This patch implements a similar mechanism: once a waiter has been
queued for at least 5ms, lock stealing is disabled and the lock is
handed off to the first waiter in the queue. The waiting period avoids
disabling lock stealing so aggressively that performance would suffer.
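
A minimal sketch of the handoff decision, assuming the RWSEM_FLAG_HANDOFF
bit added by this patch (should_set_handoff() is a made-up helper; the
real check is done inline in the slowpaths below):

#include <linux/jiffies.h>

/*
 * At least 5ms, rounded up to a whole number of jiffies
 * (5 ticks at HZ=1000, 1 tick at HZ=100).
 */
#define RWSEM_WAIT_TIMEOUT	((HZ - 1)/200 + 1)

static inline bool should_set_handoff(long count, unsigned long timeout)
{
	/*
	 * Ask for a handoff only if no handoff is already pending and
	 * this waiter has been queued longer than the timeout.
	 */
	return !(count & RWSEM_FLAG_HANDOFF) && time_after(jiffies, timeout);
}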

A rwsem microbenchmark was run for 8 seconds on a 4-socket 56-core
128-thread x86-64 system with a v5.0 based kernel, using 112 write_lock
threads each with a 5us sleep critical section.

Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,708/542,988. After the patch, the figures
became 6,031/7,267/9,476. The rwsem clearly became much fairer, at the
cost of a drop of about 6% in the mean number of locking operations.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events_list.h | 2 +
kernel/locking/rwsem-xadd.c | 110 +++++++++++++++++++++++++++++++-------
kernel/locking/rwsem-xadd.h | 23 +++++---
3 files changed, 109 insertions(+), 26 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index c33c5df..4cde507 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -62,6 +62,8 @@
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
+LOCK_EVENT(rwsem_rlock_handoff) /* # of read lock handoffs */
LOCK_EVENT(rwsem_wlock) /* # of write locks acquired */
LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */
+LOCK_EVENT(rwsem_wlock_handoff) /* # of write lock handoffs */
#endif /* CONFIG_RWSEM_XCHGADD_ALGORITHM */
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 18a414e..12b1d61 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -74,6 +74,7 @@ struct rwsem_waiter {
struct list_head list;
struct task_struct *task;
enum rwsem_waiter_type type;
+ unsigned long timeout;
};

enum rwsem_wake_type {
@@ -83,6 +84,12 @@ enum rwsem_wake_type {
};

/*
+ * The minimum waiting time (5ms) in the wait queue before initiating the
+ * handoff protocol.
+ */
+#define RWSEM_WAIT_TIMEOUT ((HZ - 1)/200 + 1)
+
+/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
* have been set.
@@ -132,6 +139,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
adjustment = RWSEM_READER_BIAS;
oldcount = atomic_long_fetch_add(adjustment, &sem->count);
if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+ /*
+ * Initiate handoff to reader, if applicable.
+ */
+ if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
+ time_after(jiffies, waiter->timeout)) {
+ adjustment -= RWSEM_FLAG_HANDOFF;
+ lockevent_inc(rwsem_rlock_handoff);
+ }
+
atomic_long_sub(adjustment, &sem->count);
return;
}
@@ -180,6 +196,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
adjustment -= RWSEM_FLAG_WAITERS;
}

+ /*
+ * Clear the handoff flag
+ */
+ if (woken && RWSEM_COUNT_HANDOFF(atomic_long_read(&sem->count)))
+ adjustment -= RWSEM_FLAG_HANDOFF;
+
if (adjustment)
atomic_long_add(adjustment, &sem->count);
}
@@ -189,15 +211,19 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
* race conditions between checking the rwsem wait list and setting the
* sem->count accordingly.
*/
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+static inline bool
+rwsem_try_write_lock(long count, struct rw_semaphore *sem, bool first)
{
long new;

if (RWSEM_COUNT_LOCKED(count))
return false;

- new = count + RWSEM_WRITER_LOCKED -
- (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+ if (!first && RWSEM_COUNT_HANDOFF(count))
+ return false;
+
+ new = (count & ~RWSEM_FLAG_HANDOFF) + RWSEM_WRITER_LOCKED -
+ (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);

if (atomic_long_cmpxchg_acquire(&sem->count, count, new) == count) {
rwsem_set_owner(sem);
@@ -216,7 +242,7 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
long old, count = atomic_long_read(&sem->count);

while (true) {
- if (RWSEM_COUNT_LOCKED(count))
+ if (RWSEM_COUNT_LOCKED_OR_HANDOFF(count))
return false;

old = atomic_long_cmpxchg_acquire(&sem->count, count,
@@ -374,6 +400,16 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
#endif

/*
+ * This is safe to be called without holding the wait_lock.
+ */
+static inline bool
+rwsem_waiter_is_first(struct rw_semaphore *sem, struct rwsem_waiter *waiter)
+{
+ return list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
+ == waiter;
+}
+
+/*
* Wait for the read lock to be granted
*/
static inline struct rw_semaphore __sched *
@@ -385,6 +421,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)

waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_READ;
+ waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;

raw_spin_lock_irq(&sem->wait_lock);
if (list_empty(&sem->wait_list)) {
@@ -441,8 +478,12 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
return sem;
out_nolock:
list_del(&waiter.list);
- if (list_empty(&sem->wait_list))
- atomic_long_add(-RWSEM_FLAG_WAITERS, &sem->count);
+ if (list_empty(&sem->wait_list)) {
+ int adjustment = -RWSEM_FLAG_WAITERS -
+ (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF);
+
+ atomic_long_add(adjustment, &sem->count);
+ }
raw_spin_unlock_irq(&sem->wait_lock);
__set_current_state(TASK_RUNNING);
lockevent_inc(rwsem_rlock_fail);
@@ -469,8 +510,8 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
static inline struct rw_semaphore *
__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
{
- long count;
- bool waiting = true; /* any queued threads before us */
+ long count, adjustment;
+ bool first; /* First one in queue */
struct rwsem_waiter waiter;
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);
@@ -485,17 +526,17 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
*/
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_WRITE;
+ waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;

raw_spin_lock_irq(&sem->wait_lock);

/* account for this before adding a new element to the list */
- if (list_empty(&sem->wait_list))
- waiting = false;
+ first = list_empty(&sem->wait_list);

list_add_tail(&waiter.list, &sem->wait_list);

/* we're now waiting on the lock, but no longer actively locking */
- if (waiting) {
+ if (!first) {
count = atomic_long_read(&sem->count);

/*
@@ -529,12 +570,13 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
/* wait until we successfully acquire the lock */
set_current_state(state);
while (true) {
- if (rwsem_try_write_lock(count, sem))
+ if (rwsem_try_write_lock(count, sem, first))
break;
+
raw_spin_unlock_irq(&sem->wait_lock);

/* Block until there are no active lockers. */
- do {
+ for (;;) {
if (signal_pending_state(state, current))
goto out_nolock;

@@ -542,7 +584,29 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
count = atomic_long_read(&sem->count);
- } while (RWSEM_COUNT_LOCKED(count));
+
+ if (!first)
+ first = rwsem_waiter_is_first(sem, &waiter);
+
+ if (!RWSEM_COUNT_LOCKED(count))
+ break;
+
+ if (first && !RWSEM_COUNT_HANDOFF(count) &&
+ time_after(jiffies, waiter.timeout)) {
+ atomic_long_or(RWSEM_FLAG_HANDOFF, &sem->count);
+ /*
+ * Make sure the handoff bit is seen by
+ * others before proceeding.
+ */
+ smp_mb__after_atomic();
+ lockevent_inc(rwsem_wlock_handoff);
+ /*
+ * Do a trylock after setting the handoff
+ * flag to avoid missed wakeup.
+ */
+ break;
+ }
+ }

raw_spin_lock_irq(&sem->wait_lock);
}
@@ -557,9 +621,15 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
__set_current_state(TASK_RUNNING);
raw_spin_lock_irq(&sem->wait_lock);
list_del(&waiter.list);
+ adjustment = 0;
if (list_empty(&sem->wait_list))
- atomic_long_add(-RWSEM_FLAG_WAITERS, &sem->count);
- else
+ adjustment -= RWSEM_FLAG_WAITERS;
+ if (first)
+ adjustment -= (atomic_long_read(&sem->count) &
+ RWSEM_FLAG_HANDOFF);
+ if (adjustment)
+ atomic_long_add(adjustment, &sem->count);
+ if (!list_empty(&sem->wait_list))
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
@@ -587,7 +657,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
* - up_read/up_write has decremented the active part of count if we come here
*/
__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
{
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
@@ -620,7 +690,9 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
smp_rmb();

/*
- * If a spinner is present, it is not necessary to do the wakeup.
+ * If a spinner is present and the handoff flag isn't set, it is
+ * not necessary to do the wakeup.
+ *
* Try to do wakeup only if the trylock succeeds to minimize
* spinlock contention which may introduce too much delay in the
* unlock operation.
@@ -639,7 +711,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
* rwsem_has_spinner() is true, it will guarantee at least one
* trylock attempt on the rwsem later on.
*/
- if (rwsem_has_spinner(sem)) {
+ if (rwsem_has_spinner(sem) && !RWSEM_COUNT_HANDOFF(count)) {
/*
* The smp_rmb() here is to make sure that the spinner
* state is consulted before reading the wait_lock.
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index 1febd17..6d4890d 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -41,7 +41,8 @@
*
* Bit 0 - writer locked bit
* Bit 1 - waiters present bit
- * Bits 2-7 - reserved
+ * Bit 2 - lock handoff bit
+ * Bits 3-7 - reserved
* Bits 8-X - 24-bit (32-bit) or 56-bit reader count
*
* atomic_long_fetch_add() is used to obtain reader lock, whereas
@@ -49,14 +50,20 @@
*/
#define RWSEM_WRITER_LOCKED (1UL << 0)
#define RWSEM_FLAG_WAITERS (1UL << 1)
+#define RWSEM_FLAG_HANDOFF (1UL << 2)
+
#define RWSEM_READER_SHIFT 8
#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
+ RWSEM_FLAG_HANDOFF)

#define RWSEM_COUNT_LOCKED(c) ((c) & RWSEM_LOCK_MASK)
+#define RWSEM_COUNT_HANDOFF(c) ((c) & RWSEM_FLAG_HANDOFF)
+#define RWSEM_COUNT_LOCKED_OR_HANDOFF(c) \
+ ((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))

#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
/*
@@ -163,7 +170,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count);
extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);

/*
@@ -253,7 +260,7 @@ static inline void __up_read(struct rw_semaphore *sem)
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
== RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
+ rwsem_wake(sem, tmp);
}

/*
@@ -261,11 +268,13 @@ static inline void __up_read(struct rw_semaphore *sem)
*/
static inline void __up_write(struct rw_semaphore *sem)
{
+ long tmp;
+
DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
rwsem_clear_owner(sem);
- if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
- &sem->count) & RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
+ tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+ if (unlikely(tmp & RWSEM_FLAG_WAITERS))
+ rwsem_wake(sem, tmp);
}

/*
--
1.8.3.1


2019-02-07 19:11:09

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 16/22] locking/rwsem: Remove redundant computation of writer lock word

On 64-bit architectures, each rwsem writer has its own unique lock
word for acquiring the lock. Right now, the writer code recomputes that
lock word on every acquisition attempt, which is wasted work. The lock
word is now computed once, cached and reused whenever it is needed.

On 32-bit architectures, the extra constant argument to
rwsem_try_write_lock() and rwsem_try_write_lock_unqueued() should be
optimized out by the compiler.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 0869fbf..16dc7a1 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -216,8 +216,8 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
* race conditions between checking the rwsem wait list and setting the
* sem->count accordingly.
*/
-static inline bool
-rwsem_try_write_lock(long count, struct rw_semaphore *sem, bool first)
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+ const long wlock, bool first)
{
long new;

@@ -227,7 +227,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
if (!first && RWSEM_COUNT_HANDOFF(count))
return false;

- new = (count & ~RWSEM_FLAG_HANDOFF) + RWSEM_WRITER_LOCKED -
+ new = (count & ~RWSEM_FLAG_HANDOFF) + wlock -
(list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);

if (atomic_long_cmpxchg_acquire(&sem->count, count, new) == count) {
@@ -242,7 +242,8 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
/*
* Try to acquire write lock before the writer has been put on wait queue.
*/
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
+ const long wlock)
{
long old, count = atomic_long_read(&sem->count);

@@ -251,7 +252,7 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
return false;

old = atomic_long_cmpxchg_acquire(&sem->count, count,
- count + RWSEM_WRITER_LOCKED);
+ count + wlock);
if (old == count) {
rwsem_set_owner(sem);
lockevent_inc(rwsem_opt_wlock);
@@ -338,7 +339,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
return is_rwsem_owner_spinnable(rwsem_get_owner(sem));
}

-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
{
bool taken = false;

@@ -362,7 +363,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
/*
* Try to acquire the lock
*/
- if (rwsem_try_write_lock_unqueued(sem)) {
+ if (rwsem_try_write_lock_unqueued(sem, wlock)) {
taken = true;
break;
}
@@ -392,7 +393,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
return taken;
}
#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
{
return false;
}
@@ -514,9 +515,10 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
struct rwsem_waiter waiter;
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);
+ const long wlock = RWSEM_WRITER_LOCKED;

/* do optimistic spinning and steal lock if possible */
- if (rwsem_optimistic_spin(sem))
+ if (rwsem_optimistic_spin(sem, wlock))
return sem;

/*
@@ -569,7 +571,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
/* wait until we successfully acquire the lock */
set_current_state(state);
while (true) {
- if (rwsem_try_write_lock(count, sem, first))
+ if (rwsem_try_write_lock(count, sem, wlock, first))
break;

raw_spin_unlock_irq(&sem->wait_lock);
--
1.8.3.1


2019-02-07 19:11:09

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 18/22] locking/rwsem: Make rwsem_spin_on_owner() return a tri-state value

This patch modifies rwsem_spin_on_owner() to return a tri-state value
that better reflects the state of the lock holder, enabling the caller
to make a better decision about what to do next.
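
For illustration, a hypothetical caller could branch on the three states
as follows (keep_spinning() is a made-up wrapper; later patches in the
series make these decisions inline and cap spinning on a reader-owned
lock):

static bool keep_spinning(struct rw_semaphore *sem)
{
	switch (rwsem_spin_on_owner(sem)) {
	case OWNER_SPINNABLE:
		return true;	/* write owner still running: keep spinning */
	case OWNER_READER:
		return true;	/* reader-owned: spin, but with a limit */
	default:
		return false;	/* owner sleeping or unknown: stop spinning */
	}
}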

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 25 +++++++++++++++++++------
kernel/locking/rwsem-xadd.h | 5 +++++
2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 21d462f..0a29aac 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -305,14 +305,24 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
}

/*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * Return the following three values depending on the lock owner state.
+ * 1 when owner changes and no reader is detected yet.
+ * 0 when owner changes and the new owner may be a reader.
+ * -1 when optimistic spinning has to stop because either the owner stops
+ * running, is unknown, or its timeslice has been used up.
*/
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+ OWNER_SPINNABLE = 1,
+ OWNER_READER = 0,
+ OWNER_NONSPINNABLE = -1,
+};
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
{
struct task_struct *owner = rwsem_get_owner(sem);

if (!is_rwsem_owner_spinnable(owner))
- return false;
+ return OWNER_NONSPINNABLE;

rcu_read_lock();
/*
@@ -337,7 +347,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
*/
if (need_resched() || !owner_on_cpu(owner, sem)) {
rcu_read_unlock();
- return false;
+ return OWNER_NONSPINNABLE;
}

cpu_relax();
@@ -348,7 +358,10 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
* If there is a new owner or the owner is not set, we continue
* spinning.
*/
- return is_rwsem_owner_spinnable(rwsem_get_owner(sem));
+ owner = rwsem_get_owner(sem);
+ if (!is_rwsem_owner_spinnable(owner))
+ return OWNER_NONSPINNABLE;
+ return is_rwsem_owner_reader(owner) ? OWNER_READER : OWNER_SPINNABLE;
}

static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
@@ -371,7 +384,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
* 2) readers own the lock as we can't determine if they are
* actively running or not.
*/
- while (rwsem_spin_on_owner(sem)) {
+ while (rwsem_spin_on_owner(sem) == OWNER_SPINNABLE) {
/*
* Try to acquire the lock
*/
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index d54b5db..1de6f1e 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -200,6 +200,11 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
}

+static inline bool is_rwsem_owner_reader(struct task_struct *owner)
+{
+ return (unsigned long)owner & RWSEM_READER_OWNED;
+}
+
/*
* Return true if the rwsem is owned by a reader.
*/
--
1.8.3.1


2019-02-07 19:11:19

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 22/22] locking/rwsem: Ensure an RT task will not spin on reader

An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, the
high priority of the RT task may block forward progress of the lock
holder if they happen to reside on the same CPU, leading to a
live-lock. So we have to make sure that an RT task will not spin on a
reader-owned rwsem.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3cf2e84..ac1f6c4 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -349,12 +349,14 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)

/*
* Return the following three values depending on the lock owner state.
+ * 2 when owner is currently NULL
* 1 when owner changes and no reader is detected yet.
* 0 when owner changes and the new owner may be a reader.
* -1 when optimistic spinning has to stop because either the owner stops
* running, is unknown, or its timeslice has been used up.
*/
enum owner_state {
+ OWNER_NULL = 2,
OWNER_SPINNABLE = 1,
OWNER_READER = 0,
OWNER_NONSPINNABLE = -1,
@@ -405,7 +407,8 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
owner = rwsem_get_owner(sem);
if (!is_rwsem_owner_spinnable(owner))
return OWNER_NONSPINNABLE;
- return is_rwsem_owner_reader(owner) ? OWNER_READER : OWNER_SPINNABLE;
+ return !owner ? OWNER_NULL :
+ is_rwsem_owner_reader(owner) ? OWNER_READER : OWNER_SPINNABLE;
}

static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
@@ -442,12 +445,13 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
break;

/*
- * When there's no owner, we might have preempted between the
- * owner acquiring the lock and setting the owner field. If
- * we're an RT task that will live-lock because we won't let
- * the owner complete.
+ * An RT task cannot do optimistic spinning if it cannot be
+ * sure the lock holder is running. When there is no owner or
+ * the lock is reader-owned, an RT task has to stop spinning,
+ * or a live-lock may happen if the current task and the lock
+ * holder happen to run on the same CPU.
*/
- if (!rwsem_get_owner(sem) &&
+ if ((owner_state != OWNER_SPINNABLE) &&
(need_resched() || rt_task(current)))
break;

--
1.8.3.1


2019-02-07 19:11:29

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 19/22] locking/rwsem: Enable readers spinning on writer

This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly. The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
__rwsem_down_read_failed_common() and __rwsem_down_write_failed_common().
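
In outline, the reader slow path now looks roughly like this (a
simplified sketch; rwsem_down_read_queue() is a made-up name for the
existing queueing code, and the real code also remembers whether the
reader bias was undone):

static struct rw_semaphore *down_read_slowpath_sketch(struct rw_semaphore *sem)
{
	if (!rwsem_can_spin_on_owner(sem))
		goto queue;

	/* Undo the reader bias added by the fast path, then spin. */
	atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
	if (rwsem_optimistic_spin(sem, 0))	/* wlock == 0: spin for a read lock */
		return sem;

queue:
	/* Fall back to queueing on the wait list and sleeping. */
	return rwsem_down_read_queue(sem);
}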

This patch may actually reduce performance under certain circumstances
for reader-mostly workloads, as the readers may no longer be grouped
together in the wait queue. We may end up with a number of small reader
groups among writers instead of one large reader group. However, this
change is needed for some of the subsequent patches.

With a locking microbenchmark running on a 5.0 based kernel, the total
locking rates (in kops/s) of the benchmark on a 4-socket 56-core x86-64
system with equal numbers of readers and writers before and after the
patch were as follows:

 # of Threads    Pre-patch    Post-patch
 ------------    ---------    ----------
       2           1,926         2,120
       4           1,391         1,320
       8             716           694
      16             618           606
      32             501           487
      64              61            57

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events_list.h | 1 +
kernel/locking/rwsem-xadd.c | 80 ++++++++++++++++++++++++++++++++++-----
kernel/locking/rwsem-xadd.h | 3 ++
3 files changed, 74 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 4cde507..54b6650 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -57,6 +57,7 @@
LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps */
LOCK_EVENT(rwsem_wake_reader) /* # of reader wakeups */
LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */
+LOCK_EVENT(rwsem_opt_rlock) /* # of read locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_wlock) /* # of write locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 0a29aac..015edd6 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -240,6 +240,30 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,

#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
/*
+ * Try to acquire read lock before the reader is put on wait queue.
+ * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
+ * is ongoing.
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+ long count = atomic_long_read(&sem->count);
+
+ if (RWSEM_COUNT_WLOCKED_OR_HANDOFF(count))
+ return false;
+
+ count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+ if (!RWSEM_COUNT_WLOCKED_OR_HANDOFF(count)) {
+ rwsem_set_reader_owned(sem);
+ lockevent_inc(rwsem_opt_rlock);
+ return true;
+ }
+
+ /* Back out the change */
+ atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+ return false;
+}
+
+/*
* Try to acquire write lock before the writer has been put on wait queue.
*/
static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
@@ -291,8 +315,10 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)

BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));

- if (need_resched())
+ if (need_resched()) {
+ lockevent_inc(rwsem_opt_fail);
return false;
+ }

rcu_read_lock();
owner = rwsem_get_owner(sem);
@@ -301,6 +327,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
owner_on_cpu(owner, sem);
}
rcu_read_unlock();
+ lockevent_cond_inc(rwsem_opt_fail, !ret);
return ret;
}

@@ -371,9 +398,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
preempt_disable();

/* sem->wait_lock should not be held when doing optimistic spinning */
- if (!rwsem_can_spin_on_owner(sem))
- goto done;
-
if (!osq_lock(&sem->osq))
goto done;

@@ -388,10 +412,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
/*
* Try to acquire the lock
*/
- if (rwsem_try_write_lock_unqueued(sem, wlock)) {
- taken = true;
+ taken = wlock ? rwsem_try_write_lock_unqueued(sem, wlock)
+ : rwsem_try_read_lock_unqueued(sem);
+
+ if (taken)
break;
- }

/*
* When there's no owner, we might have preempted between the
@@ -418,7 +443,13 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
return taken;
}
#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+ return false;
+}
+
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+ const long wlock)
{
return false;
}
@@ -444,6 +475,33 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);

+ if (!rwsem_can_spin_on_owner(sem))
+ goto queue;
+
+ /*
+ * Undo read bias from down_read() and do optimistic spinning.
+ */
+ atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+ adjustment = 0;
+ if (rwsem_optimistic_spin(sem, 0)) {
+ unsigned long flags;
+
+ /*
+ * Opportunistically wake up other readers in the wait queue.
+ * It has another chance of wakeup at unlock time.
+ */
+ if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS) &&
+ raw_spin_trylock_irqsave(&sem->wait_lock, flags)) {
+ if (!list_empty(&sem->wait_list))
+ __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
+ &wake_q);
+ raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+ wake_up_q(&wake_q);
+ }
+ return sem;
+ }
+
+queue:
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_READ;
waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
@@ -456,7 +514,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
* immediately as its RWSEM_READER_BIAS has already been
* set in the count.
*/
- if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+ if (adjustment &&
+ !(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
lockevent_inc(rwsem_rlock_fast);
@@ -543,7 +602,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
const long wlock = RWSEM_WRITER_LOCKED;

/* do optimistic spinning and steal lock if possible */
- if (rwsem_optimistic_spin(sem, wlock))
+ if (rwsem_can_spin_on_owner(sem) &&
+ rwsem_optimistic_spin(sem, wlock))
return sem;

/*
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index 1de6f1e..eb4ef36 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -109,9 +109,12 @@
RWSEM_FLAG_HANDOFF)

#define RWSEM_COUNT_LOCKED(c) ((c) & RWSEM_LOCK_MASK)
+#define RWSEM_COUNT_WLOCKED(c) ((c) & RWSEM_WRITER_MASK)
#define RWSEM_COUNT_HANDOFF(c) ((c) & RWSEM_FLAG_HANDOFF)
#define RWSEM_COUNT_LOCKED_OR_HANDOFF(c) \
((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))
+#define RWSEM_COUNT_WLOCKED_OR_HANDOFF(c) \
+ ((c) & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))

/*
* Task structure pointer compression (64-bit only):
--
1.8.3.1


2019-02-07 19:11:51

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 10/22] locking/rwsem: Enable lock event counting

Add lock event counting calls so that we can track the number of lock
events happening in the rwsem code.
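
The pattern is the one shown in the hunks below. As a sketch
(rwsem_example is a made-up event name), adding a new counter takes one
entry in the event list plus a call at the point of interest:

/* In kernel/locking/lock_events_list.h: */
LOCK_EVENT(rwsem_example)	/* # of example events (hypothetical) */

/* In the rwsem code: */
lockevent_inc(rwsem_example);			/* count unconditionally */
lockevent_cond_inc(rwsem_example, woken);	/* count only if 'woken' is nonzero */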

With CONFIG_LOCK_EVENT_COUNTS on and booting a 1-socket 22-core
44-thread x86-64 system, the non-zero rwsem counts after system bootup
were as follows:

rwsem_opt_fail=113
rwsem_opt_wlock=13647
rwsem_rlock=176
rwsem_rlock_fast=10
rwsem_wake_reader=153
rwsem_wake_writer=139
rwsem_wlock=113

It can be seen that most of the lock acquisitions in the slowpath were
writer-locks in the optimistic spinning code path with no sleeping at
all. Only about 4% of locks were acquired after sleeping.

Signed-off-by: Waiman Long <[email protected]>
---
arch/Kconfig | 2 +-
kernel/locking/lock_events_list.h | 17 +++++++++++++++++
kernel/locking/rwsem-xadd.c | 12 ++++++++++++
3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af147c2..7471791 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -891,7 +891,7 @@ config ARCH_USE_MEMREMAP_PROT
config LOCK_EVENT_COUNTS
bool "Locking event counts collection"
depends on DEBUG_FS
- depends on QUEUED_SPINLOCKS
+ depends on (QUEUED_SPINLOCKS || RWSEM_XCHGADD_ALGORITHM)
---help---
Enable light-weight counting of various locking related events
in the system with minimal performance impact. This reduces
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 8b4d2e1..c33c5df 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -48,3 +48,20 @@
LOCK_EVENT(lock_use_node4) /* # of locking ops that use 4th percpu node */
LOCK_EVENT(lock_no_node) /* # of locking ops w/o using percpu node */
#endif /* CONFIG_QUEUED_SPINLOCKS */
+
+#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
+/*
+ * Locking events for rwsem
+ */
+LOCK_EVENT(rwsem_sleep_reader) /* # of reader sleeps */
+LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps */
+LOCK_EVENT(rwsem_wake_reader) /* # of reader wakeups */
+LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */
+LOCK_EVENT(rwsem_opt_wlock) /* # of write locks opt-spin acquired */
+LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
+LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
+LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
+LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
+LOCK_EVENT(rwsem_wlock) /* # of write locks acquired */
+LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */
+#endif /* CONFIG_RWSEM_XCHGADD_ALGORITHM */
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 62422a6..fff231a 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -20,6 +20,7 @@
#include <linux/osq_lock.h>

#include "rwsem-xadd.h"
+#include "lock_events.h"

/*
* Guide to the rw_semaphore's count field for common values.
@@ -147,6 +148,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
* will notice the queued writer.
*/
wake_q_add(wake_q, waiter->task);
+ lockevent_inc(rwsem_wake_writer);
}

return;
@@ -214,6 +216,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
}

adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+ lockevent_cond_inc(rwsem_wake_reader, woken);
if (list_empty(&sem->wait_list)) {
/* hit end of list above */
adjustment -= RWSEM_WAITING_BIAS;
@@ -269,6 +272,7 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
count + RWSEM_ACTIVE_WRITE_BIAS);
if (old == count) {
rwsem_set_owner(sem);
+ lockevent_inc(rwsem_opt_wlock);
return true;
}

@@ -394,6 +398,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
osq_unlock(&sem->osq);
done:
preempt_enable();
+ lockevent_cond_inc(rwsem_opt_fail, !taken);
return taken;
}

@@ -441,6 +446,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
if (atomic_long_read(&sem->count) >= 0) {
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
+ lockevent_inc(rwsem_rlock_fast);
return sem;
}
adjustment += RWSEM_WAITING_BIAS;
@@ -477,9 +483,11 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
break;
}
schedule();
+ lockevent_inc(rwsem_sleep_reader);
}

__set_current_state(TASK_RUNNING);
+ lockevent_inc(rwsem_rlock);
return sem;
out_nolock:
list_del(&waiter.list);
@@ -487,6 +495,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
raw_spin_unlock_irq(&sem->wait_lock);
__set_current_state(TASK_RUNNING);
+ lockevent_inc(rwsem_rlock_fail);
return ERR_PTR(-EINTR);
}

@@ -580,6 +589,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
goto out_nolock;

schedule();
+ lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
} while ((count = atomic_long_read(&sem->count)) & RWSEM_ACTIVE_MASK);

@@ -588,6 +598,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
__set_current_state(TASK_RUNNING);
list_del(&waiter.list);
raw_spin_unlock_irq(&sem->wait_lock);
+ lockevent_inc(rwsem_wlock);

return ret;

@@ -601,6 +612,7 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
+ lockevent_inc(rwsem_wlock_fail);

return ERR_PTR(-EINTR);
}
--
1.8.3.1


2019-02-07 19:11:58

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 13/22] locking/rwsem: Remove rwsem_wake() wakeup optimization

With commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention
in wakeup after up_read()/up_write()"), rwsem_wake() forgoes doing
a wakeup if the wait_lock cannot be directly acquired while an
optimistic spinning locker is present. This can help performance by
avoiding spinning on the wait_lock when it is contended.

With the later commit 133e89ef5ef3 ("locking/rwsem: Enable lockless
waiter wakeup(s)"), the performance advantage of the above optimization
diminishes as the average wait_lock hold time becomes much shorter.

With rwsem lock handoff, we can no longer rely on the presence of an
optimistic spinning locker to ensure that the lock will be acquired
by a task soon. Keeping the optimization could therefore lead to missed
wakeups and application hangs. So commit 59aabfc7e959 ("locking/rwsem:
Reduce spinlock contention in wakeup after up_read()/up_write()")
has to be reverted.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 74 ---------------------------------------------
1 file changed, 74 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 12b1d61..5f74bae 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -378,25 +378,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
lockevent_cond_inc(rwsem_opt_fail, !taken);
return taken;
}
-
-/*
- * Return true if the rwsem has active spinner
- */
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
- return osq_is_locked(&sem->osq);
-}
-
#else
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
{
return false;
}
-
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
- return false;
-}
#endif

/*
@@ -662,67 +648,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
unsigned long flags;
DEFINE_WAKE_Q(wake_q);

- /*
- * __rwsem_down_write_failed_common(sem)
- * rwsem_optimistic_spin(sem)
- * osq_unlock(sem->osq)
- * ...
- * atomic_long_add_return(&sem->count)
- *
- * - VS -
- *
- * __up_write()
- * if (atomic_long_sub_return_release(&sem->count) < 0)
- * rwsem_wake(sem)
- * osq_is_locked(&sem->osq)
- *
- * And __up_write() must observe !osq_is_locked() when it observes the
- * atomic_long_add_return() in order to not miss a wakeup.
- *
- * This boils down to:
- *
- * [S.rel] X = 1 [RmW] r0 = (Y += 0)
- * MB RMB
- * [RmW] Y += 1 [L] r1 = X
- *
- * exists (r0=1 /\ r1=0)
- */
- smp_rmb();
-
- /*
- * If a spinner is present and the handoff flag isn't set, it is
- * not necessary to do the wakeup.
- *
- * Try to do wakeup only if the trylock succeeds to minimize
- * spinlock contention which may introduce too much delay in the
- * unlock operation.
- *
- * spinning writer up_write/up_read caller
- * --------------- -----------------------
- * [S] osq_unlock() [L] osq
- * MB RMB
- * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
- *
- * Here, it is important to make sure that there won't be a missed
- * wakeup while the rwsem is free and the only spinning writer goes
- * to sleep without taking the rwsem. Even when the spinning writer
- * is just going to break out of the waiting loop, it will still do
- * a trylock in rwsem_down_write_failed() before sleeping. IOW, if
- * rwsem_has_spinner() is true, it will guarantee at least one
- * trylock attempt on the rwsem later on.
- */
- if (rwsem_has_spinner(sem) && !RWSEM_COUNT_HANDOFF(count)) {
- /*
- * The smp_rmb() here is to make sure that the spinner
- * state is consulted before reading the wait_lock.
- */
- smp_rmb();
- if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
- return sem;
- goto locked;
- }
raw_spin_lock_irqsave(&sem->wait_lock, flags);
-locked:

if (!list_empty(&sem->wait_list))
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
--
1.8.3.1


2019-02-07 19:12:05

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 14/22] locking/rwsem: Add more rwsem owner access helpers

Before combining the owner and count fields, add two new helpers for
accessing the owner value in the rwsem:

1) struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
2) bool is_rwsem_reader_owned(struct rw_semaphore *sem)

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 11 ++++++-----
kernel/locking/rwsem-xadd.h | 32 ++++++++++++++++++++++++++------
kernel/locking/rwsem.c | 3 +--
3 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 5f74bae..719d390 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -277,7 +277,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
return false;

rcu_read_lock();
- owner = READ_ONCE(sem->owner);
+ owner = rwsem_get_owner(sem);
if (owner) {
ret = is_rwsem_owner_spinnable(owner) &&
owner_on_cpu(owner);
@@ -291,13 +291,13 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
*/
static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
{
- struct task_struct *owner = READ_ONCE(sem->owner);
+ struct task_struct *owner = rwsem_get_owner(sem);

if (!is_rwsem_owner_spinnable(owner))
return false;

rcu_read_lock();
- while (owner && (READ_ONCE(sem->owner) == owner)) {
+ while (owner && (rwsem_get_owner(sem) == owner)) {
/*
* Ensure we emit the owner->on_cpu, dereference _after_
* checking sem->owner still matches owner, if that fails,
@@ -323,7 +323,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
* If there is a new owner or the owner is not set, we continue
* spinning.
*/
- return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+ return is_rwsem_owner_spinnable(rwsem_get_owner(sem));
}

static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -361,7 +361,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
* we're an RT task that will live-lock because we won't let
* the owner complete.
*/
- if (!sem->owner && (need_resched() || rt_task(current)))
+ if (!rwsem_get_owner(sem) &&
+ (need_resched() || rt_task(current)))
break;

/*
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index 6d4890d..277a134 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -83,6 +83,11 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
WRITE_ONCE(sem->owner, NULL);
}

+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+ return READ_ONCE(sem->owner);
+}
+
/*
* The task_struct pointer of the last owning reader will be left in
* the owner field.
@@ -116,6 +121,23 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
}

/*
+ * Return true if the rwsem is owned by a reader.
+ */
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_DEBUG_RWSEMS
+ /*
+ * Check the count to see if it is write-locked.
+ */
+ long count = atomic_long_read(&sem->count);
+
+ if (count & RWSEM_WRITER_MASK)
+ return false;
+#endif
+ return (unsigned long)sem->owner & RWSEM_READER_OWNED;
+}
+
+/*
* Return true if rwsem is owned by an anonymous writer or readers.
*/
static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
@@ -135,6 +157,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
{
unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
| RWSEM_ANONYMOUSLY_OWNED;
+
if (READ_ONCE(sem->owner) == (struct task_struct *)val)
cmpxchg_relaxed((unsigned long *)&sem->owner, val,
RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
@@ -181,8 +204,7 @@ static inline void __down_read(struct rw_semaphore *sem)
if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
&sem->count) & RWSEM_READ_FAILED_MASK)) {
rwsem_down_read_failed(sem);
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -194,8 +216,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
&sem->count) & RWSEM_READ_FAILED_MASK)) {
if (IS_ERR(rwsem_down_read_failed_killable(sem)))
return -EINTR;
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -254,8 +275,7 @@ static inline void __up_read(struct rw_semaphore *sem)
{
long tmp;

- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
- sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index bdfca7c..79fa6e4 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -203,8 +203,7 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)

void up_read_non_owner(struct rw_semaphore *sem)
{
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
- sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
__up_read(sem);
}

--
1.8.3.1


2019-02-07 19:12:07

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 09/22] locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro

Currently, the DEBUG_RWSEMS_WARN_ON() macro just dumps a stack trace
when the rwsem isn't in the right state. It does not show the actual
state of the rwsem, which limits how helpful it is in the debugging
process.

Enhance the DEBUG_RWSEMS_WARN_ON() macro to also show the current
content of the rwsem count and owner fields to give more information
about what is wrong with the rwsem.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.h | 19 ++++++++++++-------
kernel/locking/rwsem.c | 5 +++--
2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index 77151c3..6d2478c 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -27,9 +27,13 @@
#define RWSEM_ANONYMOUSLY_OWNED (1UL << 1)

#ifdef CONFIG_DEBUG_RWSEMS
-# define DEBUG_RWSEMS_WARN_ON(c) DEBUG_LOCKS_WARN_ON(c)
+# define DEBUG_RWSEMS_WARN_ON(c, sem) \
+ WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
+ #c, atomic_long_read(&(sem)->count), \
+ (long)((sem)->owner), (long)current, \
+ list_empty(&(sem)->wait_list) ? "" : "not ")
#else
-# define DEBUG_RWSEMS_WARN_ON(c)
+# define DEBUG_RWSEMS_WARN_ON(c, sem)
#endif

/*
@@ -168,7 +172,7 @@ static inline void __down_read(struct rw_semaphore *sem)
if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
rwsem_down_read_failed(sem);
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED));
+ RWSEM_READER_OWNED), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -180,7 +184,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
if (IS_ERR(rwsem_down_read_failed_killable(sem)))
return -EINTR;
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED));
+ RWSEM_READER_OWNED), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -248,7 +252,8 @@ static inline void __up_read(struct rw_semaphore *sem)
{
long tmp;

- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED));
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
+ sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_dec_return_release(&sem->count);
if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
@@ -260,7 +265,7 @@ static inline void __up_read(struct rw_semaphore *sem)
*/
static inline void __up_write(struct rw_semaphore *sem)
{
- DEBUG_RWSEMS_WARN_ON(sem->owner != current);
+ DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
rwsem_clear_owner(sem);
if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
&sem->count) < 0))
@@ -281,7 +286,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* read-locked region is ok to be re-ordered into the
* write side. As such, rely on RELEASE semantics.
*/
- DEBUG_RWSEMS_WARN_ON(sem->owner != current);
+ DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
if (tmp < 0)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 598fc7c..bdfca7c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -17,7 +17,7 @@
#include "rwsem-xadd.h"
#else
#define __rwsem_set_reader_owned(a, b)
-#define DEBUG_RWSEMS_WARN_ON(a)
+#define DEBUG_RWSEMS_WARN_ON(a, b)
#endif

/*
@@ -203,7 +203,8 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)

void up_read_non_owner(struct rw_semaphore *sem)
{
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED));
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
+ sem);
__up_read(sem);
}

--
1.8.3.1


2019-02-07 19:12:11

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 15/22] locking/rwsem: Merge owner into count on x86-64

With separate count and owner, there are timing windows where the two
values are inconsistent. That can cause problems when trying to figure
out the exact state of the rwsem. For instance, an RT task will stop
optimistic spinning if the lock is acquired by a writer but the owner
field isn't set yet. That can be solved by combining the count and
owner together in a single atomic value.

On 32-bit architectures, there aren't enough bits to hold both.
64-bit architectures, however, do have enough bits for that. For
x86-64, the physical address can use up to 52 bits. That is 4PB of
memory. That leaves 12 bits available for other use. The task structure
pointer is also aligned to the L1 cache size. That means another 6 bits
(64-byte cacheline) will be available. Reserving 2 bits for status
flags, we will have 16 bits for the reader count. That supports
up to (64k-1) readers.
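
As a quick sanity check of that arithmetic, here is a stand-alone
sketch; the constants are assumptions made for the example and are not
taken from the kernel headers or from this patch:

/* Bit budget worked out above, spelled out as compile-time constants. */
#define EX_PA_BITS      52      /* assumed x86-64 physical address bits */
#define EX_VA_BITS      64      /* width of the count word */
#define EX_L1_SHIFT     6       /* task_struct assumed 64-byte aligned */
#define EX_FLAG_BITS    2       /* waiters + handoff status flags */

/*
 * A compressed task pointer needs EX_PA_BITS bits once PAGE_OFFSET is
 * subtracted, minus the EX_L1_SHIFT alignment bits that are always zero.
 */
#define EX_OWNER_BITS   (EX_PA_BITS - EX_L1_SHIFT)                      /* 46 */

/* What is left of the 64-bit word for the reader count. */
#define EX_READER_BITS  (EX_VA_BITS - EX_OWNER_BITS - EX_FLAG_BITS)     /* 16 */

_Static_assert(EX_READER_BITS == 16, "matches the 16-bit reader count above");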

The owner value will still be duplicated in the owner field for the
purpose of signalling that the task is in the process of acquiring or
releasing a rwsem.

This change is currently for x86-64 only. Other 64-bit architectures may
be enabled in the future if the need arises.

With a locking microbenchmark running on a 5.0-based kernel, the total
locking rates (in kops/s) of the benchmark on a 4-socket 56-core
x86-64 system before and after the patch were as follows:

                    Before Patch       After Patch
  # of Threads     wlock    rlock     wlock    rlock
  ------------     -----    -----     -----    -----
        1         29,085   30,179    27,892   29,514
        2          7,341   14,084     6,240   14,304
        4          7,393   14,246     5,216   11,754
        8          7,139   13,860     5,400   11,308
       16          6,650   15,773     5,744   15,405

This change does have an impact on both read and write lock performance.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 20 +++++++--
kernel/locking/rwsem-xadd.h | 105 +++++++++++++++++++++++++++++++++++++++-----
2 files changed, 110 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 719d390..0869fbf 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -27,11 +27,11 @@
/*
* Guide to the rw_semaphore's count field.
*
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
+ * When any of the RWSEM_WRITER_MASK bits in count is set, the lock is
+ * owned by a writer.
*
* The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (1) none of the RWSEM_WRITER_MASK bits is set in count,
* (2) some of the reader bits are set in count, and
* (3) the owner field has RWSEM_READ_OWNED bit set.
*
@@ -47,6 +47,11 @@
void __init_rwsem(struct rw_semaphore *sem, const char *name,
struct lock_class_key *key)
{
+ /*
+ * We should support at least (4k-1) concurrent readers
+ */
+ BUILD_BUG_ON(sizeof(long) * 8 - RWSEM_READER_SHIFT < 12);
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
* Make sure we are not reinitializing a held semaphore:
@@ -297,7 +302,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
return false;

rcu_read_lock();
- while (owner && (rwsem_get_owner(sem) == owner)) {
+ /*
+ * In case the owner task pointer is also stored in the count,
+ * checking the sem->owner value alone will give an early indication
+ * if the owner is about to release the lock (sem->owner cleared).
+ * This enables the spinner to move forward and do a trylock
+ * earlier.
+ */
+ while (owner && (READ_ONCE(sem->owner) == owner)) {
/*
* Ensure we emit the owner->on_cpu, dereference _after_
* checking sem->owner still matches owner, if that fails,
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index 277a134..d54b5db 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -37,25 +37,73 @@
#endif

/*
- * The definition of the atomic counter in the semaphore:
+ * With separate count and owner, there are timing windows where the two
+ * values are inconsistent. That can cause problem when trying to figure
+ * out the exact state of the rwsem. That can be solved by combining
+ * the count and owner together in a single atomic value.
*
- * Bit 0 - writer locked bit
- * Bit 1 - waiters present bit
- * Bit 2 - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * On 64-bit architectures, the owner task structure pointer can be
+ * compressed and combined with reader count and other status flags.
+ * A simple compression method is to map the virtual address back to
+ * the physical address by subtracting PAGE_OFFSET. On 32-bit
+ * architectures, the long integer value just isn't big enough for
+ * combining owner and count. So they remain separate.
+ *
+ * For x86-64, the physical address can use up to 52 bits. That is 4PB
+ * of memory. That leaves 12 bits available for other use. The task
+ * structure pointer is also aligned to the L1 cache size. That means
+ * another 6 bits (64 bytes cacheline) will be available. Reserving
+ * 2 bits for status flags, we will have 16 bits for the reader count.
+ * That supports up to (64k-1) readers.
+ *
+ * On x86-64, the bit definitions of the count are:
+ *
+ * Bit 0 - waiters present bit
+ * Bit 1 - lock handoff bit
+ * Bits 2-47 - compressed task structure pointer
+ * Bits 48-63 - 16-bit reader counts
+ *
+ * On other 64-bit architectures, the bit definitions are:
+ *
+ * Bit 0 - waiters present bit
+ * Bit 1 - lock handoff bit
+ * Bits 2-6 - reserved
+ * Bit 7 - writer lock bit
+ * Bits 8-63 - 56-bit reader counts
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit 0 - waiters present bit
+ * Bit 1 - lock handoff bit
+ * Bits 2-6 - reserved
+ * Bit 7 - writer lock bit
+ * Bits 8-31 - 24-bit reader counts
*
* atomic_long_fetch_add() is used to obtain reader lock, whereas
* atomic_long_cmpxchg() will be used to obtain writer lock.
*/
-#define RWSEM_WRITER_LOCKED (1UL << 0)
-#define RWSEM_FLAG_WAITERS (1UL << 1)
-#define RWSEM_FLAG_HANDOFF (1UL << 2)
+#define RWSEM_FLAG_WAITERS (1UL << 0)
+#define RWSEM_FLAG_HANDOFF (1UL << 1)

+#ifdef CONFIG_X86_64
+
+#ifdef __PHYSICAL_MASK_SHIFT
+#define RWSEM_PA_MASK_SHIFT __PHYSICAL_MASK_SHIFT
+#else
+#define RWSEM_PA_MASK_SHIFT 52
+#endif
+#define RWSEM_READER_SHIFT (RWSEM_PA_MASK_SHIFT - L1_CACHE_SHIFT + 2)
+#define RWSEM_WRITER_MASK ((1UL << RWSEM_READER_SHIFT) - 4)
+#define RWSEM_WRITER_LOCKED rwsem_owner_count(current)
+
+#else /* CONFIG_X86_64 */
+#define RWSEM_WRITER_MASK (1UL << 7)
#define RWSEM_READER_SHIFT 8
+#define RWSEM_WRITER_LOCKED RWSEM_WRITER_MASK
+#endif /* CONFIG_X86_64 */
+
#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
RWSEM_FLAG_HANDOFF)
@@ -65,6 +113,21 @@
#define RWSEM_COUNT_LOCKED_OR_HANDOFF(c) \
((c) & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))

+/*
+ * Task structure pointer compression (64-bit only):
+ * (owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2)
+ */
+static inline unsigned long rwsem_owner_count(struct task_struct *owner)
+{
+ return ((unsigned long)owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2);
+}
+
+static inline unsigned long rwsem_count_owner(long count)
+{
+ return (((unsigned long)count & RWSEM_WRITER_MASK)
+ << (L1_CACHE_SHIFT - 2)) + PAGE_OFFSET;
+}
+
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -72,7 +135,12 @@
* the owner value concurrently without lock. Read from owner, however,
* may not need READ_ONCE() as long as the pointer value is only used
* for comparison and isn't being dereferenced.
+ *
+ * On 32-bit architectures, the owner and count are separate. On 64-bit
+ * architectures, however, the writer task structure pointer is written
+ * to the count as well in addition to the owner field.
*/
+
static inline void rwsem_set_owner(struct rw_semaphore *sem)
{
WRITE_ONCE(sem->owner, current);
@@ -83,10 +151,22 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
WRITE_ONCE(sem->owner, NULL);
}

+#ifdef CONFIG_X86_64
+/*
+ * Get the owner value from count to have early access to the task structure.
+ */
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+ return (struct task_struct *)
+ (rwsem_count_owner(atomic_long_read(&sem->count)) |
+ ((unsigned long)READ_ONCE(sem->owner) & 3));
+}
+#else /* !CONFIG_X86_64 */
static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
{
return READ_ONCE(sem->owner);
}
+#endif /* CONFIG_X86_64 */

/*
* The task_struct pointer of the last owning reader will be left in
@@ -291,8 +371,11 @@ static inline void __up_write(struct rw_semaphore *sem)
long tmp;

DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+#ifdef CONFIG_X86_64
+ DEBUG_RWSEMS_WARN_ON(sem->owner != rwsem_get_owner(sem), sem);
+#endif
rwsem_clear_owner(sem);
- tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+ tmp = atomic_long_fetch_and_release(~RWSEM_WRITER_MASK, &sem->count);
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
rwsem_wake(sem, tmp);
}
--
1.8.3.1


2019-02-07 19:12:13

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 17/22] locking/rwsem: Recheck owner if it is not on cpu

After merging the owner value directly into the count field, it was
found that the number of failed optimistic spinning operations increased
significantly during the boot up process.

The cause of these increased failures was tracked down to the condition
that a lock holder might have just released the lock and gone to sleep
right after its owner value was fetched by a spinner. So the task might
have slept, but it was no longer the lock holder. Merging the owner
into the count increases the chance that this condition can happen.

To close this failure mode, the owner value is now rechecked to see if
it has changed in the case where the owner does not appear to be on cpu.

On a 1-socket x86-64 system, the lock event counts before the patch
were:

rwsem_opt_fail=5847
rwsem_opt_wlock=7880
rwsem_wlock=5847

After the patch, the counts were:

rwsem_opt_fail=225
rwsem_opt_wlock=8541
rwsem_wlock=225

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 16dc7a1..21d462f 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -263,13 +263,25 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
}
}

-static inline bool owner_on_cpu(struct task_struct *owner)
+static inline bool owner_on_cpu(struct task_struct *owner,
+ struct rw_semaphore *sem)
{
/*
* As lock holder preemption issue, we both skip spinning if
* task is not on cpu or its cpu is preempted
*/
- return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+ bool oncpu = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+
+ /*
+ * There is a slight chance that the lock holder might have
+ * just released the rwsem and gone to sleep right after we
+ * fetched the owner value. So we double-check the sem->owner
+ * field again to see if it has been changed. The sem->owner
+ * would have been cleared right before the lock was released.
+ */
+ if (!oncpu && (READ_ONCE(sem->owner) != owner))
+ return true; /* Assume the new owner is on cpu */
+ return oncpu;
}

static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
@@ -286,7 +298,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
owner = rwsem_get_owner(sem);
if (owner) {
ret = is_rwsem_owner_spinnable(owner) &&
- owner_on_cpu(owner);
+ owner_on_cpu(owner, sem);
}
rcu_read_unlock();
return ret;
@@ -323,7 +335,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
* abort spinning when need_resched or owner is not running or
* owner's cpu is preempted.
*/
- if (need_resched() || !owner_on_cpu(owner)) {
+ if (need_resched() || !owner_on_cpu(owner, sem)) {
rcu_read_unlock();
return false;
}
--
1.8.3.1


2019-02-07 19:12:37

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 20/22] locking/rwsem: Enable count-based spinning on reader

When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for spinning on a reader-owned
rwsem. It is a threshold-based spinning loop where the loop count is
reset whenever the rwsem reader count value changes, indicating
that the rwsem is still active. A separate maximum count value
limits the total number of spinnings that can happen.

When the loop or max count reaches 0, a bit will be set in the owner
field to indicate that no more optimistic spinning should be done on
this rwsem until it becomes writer-owned again. Not even readers
are allowed to acquire the reader-locked rwsem, for better fairness.

The spinning threshold and maximum values can be overridden by an
architecture-specific header file, if necessary. The current default
threshold value is 512 iterations.
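
For illustration, an architecture wanting different limits could define
the override macros that rwsem-xadd.c tests; the values below are made
up for the example and are not part of the patch:

/* Hypothetical arch-specific override of the reader-spin limits. */
#define ARCH_RWSEM_RSPIN_THRESHOLD      (1 << 10)       /* 1024 iterations */
#define ARCH_RWSEM_RSPIN_MAX            (1 << 13)       /* 8192 total spins */

With these defined before rwsem-xadd.c is built, the defaults of 1 << 9
and 1 << 12 are not used.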

With a locking microbenchmark running on a 5.0-based kernel, the total
locking rates (in kops/s) of the benchmark on a 4-socket 56-core x86-64
system with equal numbers of readers and writers before all the reader
spinning patches, before this patch and after this patch were as follows:

  # of Threads    Pre-rspin    Pre-Patch    Post-patch
  ------------    ---------    ---------    ----------
        2             1,926        2,120         8,057
        4             1,391        1,320         7,680
        8               716          694         7,284
       16               618          606         6,542
       32               501          487         1,449
       64                61           57           480

This patch gives a big boost in performance for mixed reader/writer
workloads.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events_list.h | 1 +
kernel/locking/rwsem-xadd.c | 63 +++++++++++++++++++++++++++++++++++----
kernel/locking/rwsem-xadd.h | 45 +++++++++++++++++++++-------
3 files changed, 94 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 54b6650..0052534 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -60,6 +60,7 @@
LOCK_EVENT(rwsem_opt_rlock) /* # of read locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_wlock) /* # of write locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
+LOCK_EVENT(rwsem_opt_nospin) /* # of disabled reader opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 015edd6..3beb942 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -95,6 +95,22 @@ enum rwsem_wake_type {
#define RWSEM_WAIT_TIMEOUT ((HZ - 1)/200 + 1)

/*
+ * Reader-owned rwsem spinning threshold and maximum value
+ *
+ * This threshold and maximum values can be overridden by architecture
+ * specific value. The loop count will be reset whenever the rwsem count
+ * value changes. The max value constrains the total number of reader-owned
+ * lock spinnings that can happen.
+ */
+#ifdef ARCH_RWSEM_RSPIN_THRESHOLD
+# define RWSEM_RSPIN_THRESHOLD ARCH_RWSEM_RSPIN_THRESHOLD
+# define RWSEM_RSPIN_MAX ARCH_RWSEM_RSPIN_MAX
+#else
+# define RWSEM_RSPIN_THRESHOLD (1 << 9)
+# define RWSEM_RSPIN_MAX (1 << 12)
+#endif
+
+/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
* have been set.
@@ -324,7 +340,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
owner = rwsem_get_owner(sem);
if (owner) {
ret = is_rwsem_owner_spinnable(owner) &&
- owner_on_cpu(owner, sem);
+ (is_rwsem_owner_reader(owner) || owner_on_cpu(owner, sem));
}
rcu_read_unlock();
lockevent_cond_inc(rwsem_opt_fail, !ret);
@@ -359,7 +375,8 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
* This enables the spinner to move forward and do a trylock
* earlier.
*/
- while (owner && (READ_ONCE(sem->owner) == owner)) {
+ while (owner && !is_rwsem_owner_reader(owner)
+ && (READ_ONCE(sem->owner) == owner)) {
/*
* Ensure we emit the owner->on_cpu, dereference _after_
* checking sem->owner still matches owner, if that fails,
@@ -394,6 +411,10 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
{
bool taken = false;
+ enum owner_state owner_state;
+ int rspin_cnt = RWSEM_RSPIN_THRESHOLD;
+ int rspin_max = RWSEM_RSPIN_MAX;
+ int old_rcount = 0;

preempt_disable();

@@ -401,14 +422,16 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
if (!osq_lock(&sem->osq))
goto done;

+ if (!is_rwsem_spinnable(sem))
+ rspin_cnt = 0;
+
/*
* Optimistically spin on the owner field and attempt to acquire the
* lock whenever the owner changes. Spinning will be stopped when:
* 1) the owning writer isn't running; or
- * 2) readers own the lock as we can't determine if they are
- * actively running or not.
+ * 2) readers own the lock and spinning count has reached 0.
*/
- while (rwsem_spin_on_owner(sem) == OWNER_SPINNABLE) {
+ while ((owner_state = rwsem_spin_on_owner(sem)) != OWNER_NONSPINNABLE) {
/*
* Try to acquire the lock
*/
@@ -429,6 +452,36 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
break;

/*
+ * We only decrement rspin_cnt when a writer is trying to
+ * acquire a lock owned by readers. In which case,
+ * rwsem_spin_on_owner() will essentially be a no-op
+ * and we will be spinning in this main loop. The spinning
+ * count will be reset whenever the rwsem count value
+ * changes.
+ */
+ if (wlock && (owner_state == OWNER_READER)) {
+ int rcount;
+
+ if (!rspin_cnt || !rspin_max) {
+ if (is_rwsem_spinnable(sem)) {
+ rwsem_set_nonspinnable(sem);
+ lockevent_inc(rwsem_opt_nospin);
+ }
+ break;
+ }
+
+ rcount = atomic_long_read(&sem->count)
+ >> RWSEM_READER_SHIFT;
+ if (rcount != old_rcount) {
+ old_rcount = rcount;
+ rspin_cnt = RWSEM_RSPIN_THRESHOLD;
+ } else {
+ rspin_cnt--;
+ }
+ rspin_max--;
+ }
+
+ /*
* The cpu_relax() call is a compiler barrier which forces
* everything in this loop to be re-loaded. We don't need
* memory barriers as we'll eventually observe the right
diff --git a/kernel/locking/rwsem-xadd.h b/kernel/locking/rwsem-xadd.h
index eb4ef36..be67dbd 100644
--- a/kernel/locking/rwsem-xadd.h
+++ b/kernel/locking/rwsem-xadd.h
@@ -5,18 +5,20 @@
* - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
* - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
* i.e. the owner(s) cannot be readily determined. It can be reader
- * owned or the owning writer is indeterminate.
+ * owned or the owning writer is indeterminate. Optimistic spinning
+ * should be disabled if this flag is set.
*
* When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
+ * into the owner field or the count itself (64-bit only). It should
+ * be cleared after an unlock.
*
* When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
+ * pointer into the owner field with the RWSEM_READER_OWNED bit set.
+ * On unlock, the owner field will largely be left untouched. So
+ * for a free or reader-owned rwsem, the owner value may contain
+ * information about the last reader that acquires the rwsem. The
+ * anonymous bit may also be set to permanently disable optimistic
+ * spinning on a reader-owned rwsem until a writer comes along.
*
* That information may be helpful in debugging cases where the system
* seems to hang on a reader owned rwsem especially if only one reader
@@ -182,8 +184,7 @@ static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
- unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
+ unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;

WRITE_ONCE(sem->owner, (struct task_struct *)val);
}
@@ -209,6 +210,14 @@ static inline bool is_rwsem_owner_reader(struct task_struct *owner)
}

/*
+ * Return true if the rwsem is spinnable.
+ */
+static inline bool is_rwsem_spinnable(struct rw_semaphore *sem)
+{
+ return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+}
+
+/*
* Return true if the rwsem is owned by a reader.
*/
static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
@@ -226,6 +235,22 @@ static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
}

/*
+ * Set the RWSEM_ANONYMOUSLY_OWNED flag if the RWSEM_READER_OWNED flag
+ * remains set. Otherwise, the operation will be aborted.
+ */
+static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
+{
+ long owner = (long)READ_ONCE(sem->owner);
+
+ while (is_rwsem_owner_reader((struct task_struct *)owner)) {
+ if (!is_rwsem_owner_spinnable((struct task_struct *)owner))
+ break;
+ owner = cmpxchg((long *)&sem->owner, owner,
+ owner | RWSEM_ANONYMOUSLY_OWNED);
+ }
+}
+
+/*
* Return true if rwsem is owned by an anonymous writer or readers.
*/
static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
--
1.8.3.1


2019-02-07 19:12:39

by Waiman Long

[permalink] [raw]
Subject: [PATCH-tip 21/22] locking/rwsem: Wake up all readers in wait queue

When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between, the readers
behind the writer will not be woken up.

Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.

Assuming that the lock hold times of the other readers still in the
queue will be about the same as those of the readers being woken up,
there is really not much additional cost other than the extra
latency of waking up more tasks from the waker. Therefore,
all the readers in the queue are woken up when the first waiter is a
reader, to improve reader throughput.

With a locking microbenchmark running on a 5.0-based kernel, the total
locking rates (in kops/s) of the benchmark on a 4-socket 56-core x86-64
system with equal numbers of readers and writers before all the reader
spinning patches, before this patch and after this patch were as follows:

  # of Threads    Pre-rspin    Pre-Patch    Post-patch
  ------------    ---------    ---------    ----------
        2             1,926        8,057         7,397
        4             1,391        7,680         6,161
        8               716        7,284         6,405
       16               618        6,542         6,768
       32               501        1,449         6,550
       64                61          480         5,548
      112                75          769         5,216

At low contention level, there is a slight drop in performance. At high
contention level, however, this patch gives a big performance boost.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3beb942..3cf2e84 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -180,16 +180,16 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
}

/*
- * Grant an infinite number of read locks to the readers at the front
- * of the queue. We know that woken will be at least 1 as we accounted
- * for above. Note we increment the 'active part' of the count by the
+ * Grant an infinite number of read locks to all the readers in the
+ * queue. We know that woken will be at least 1 as we accounted for
+ * above. Note we increment the 'active part' of the count by the
* number of readers before waking any processes up.
*/
list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
struct task_struct *tsk;

if (waiter->type == RWSEM_WAITING_FOR_WRITE)
- break;
+ continue;

woken++;
tsk = waiter->task;
--
1.8.3.1


2019-02-07 19:39:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH-tip 04/22] locking/rwsem: Remove arch specific rwsem files

On Thu, Feb 07, 2019 at 02:07:08PM -0500, Waiman Long wrote:

> +static inline int __down_read_trylock(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> + if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> + tmp + RWSEM_ACTIVE_READ_BIAS)) {
> + return 1;
> + }
> + }

Nah, you're supposed to write that like:

	for (;;) {
		val = atomic_long_cond_read_relaxed(&sem->count, VAL < 0);
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &val,
						    val + RWSEM_ACTIVE_READ_BIAS))
			break;
	}

> + return 0;
> +}


Anyway, yuck, you're keeping all that BIAS nonsense :/ I was so hoping
for a rwsem implementation without that impenetrable crap.

2019-02-07 19:43:38

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH-tip 04/22] locking/rwsem: Remove arch specific rwsem files

On 02/07/2019 02:36 PM, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 02:07:08PM -0500, Waiman Long wrote:
>
>> +static inline int __down_read_trylock(struct rw_semaphore *sem)
>> +{
>> + long tmp;
>> +
>> + while ((tmp = atomic_long_read(&sem->count)) >= 0) {
>> + if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
>> + tmp + RWSEM_ACTIVE_READ_BIAS)) {
>> + return 1;
>> + }
>> + }
> Nah, you're supposed to write that like:
>
> 	for (;;) {
> 		val = atomic_long_cond_read_relaxed(&sem->count, VAL < 0);
> 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &val,
> 						    val + RWSEM_ACTIVE_READ_BIAS))
> 			break;
> 	}
>

Ah, you are right. I just took it directly from the current generic
asm.h file and didn't pay much attention to it. I will fix that in the
next iteration.

>> + return 0;
>> +}
>
> Anyway, yuck, you're keeping all that BIAS nonsense :/ I was so hoping
> for a rwsem implementation without that impenetrable crap.

Well, I am going to take out all the BIAS handling in a later patch.
This one is just for removing the architecture specific files.

I can't change the rwsem count encoding before I take out all the arch
specific files.

Cheers,
Longman


2019-02-07 19:48:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH-tip 15/22] locking/rwsem: Merge owner into count on x86-64

On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
> On 32-bit architectures, there aren't enough bits to hold both.
> 64-bit architectures, however, can have enough bits to do that. For
> x86-64, the physical address can use up to 52 bits. That is 4PB of
> memory. That leaves 12 bits available for other use. The task structure
> pointer is also aligned to the L1 cache size. That means another 6 bits
> (64 bytes cacheline) will be available. Reserving 2 bits for status
> flags, we will have 16 bits for the reader count. That can supports
> up to (64k-1) readers.

*groan*...

So take qrwlock's idea for a queue, then make the count value (similar
to the new mutex); that is have a bit0 be a r/w bit, when w bits 6-N are
owner, when r they are reader-count. bit1 can be a pending bit, bit2 a
handoff bit etc..

That should fit and work on 32bit and 64bit without issue.
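
A rough sketch of the single-word layout being described, with invented
names and masks chosen purely for illustration (bits 3-5 left reserved;
this is not from any posted patch):

/* Illustration only: one word encodes everything. */
#define EX_RW_BIT       (1UL << 0)   /* 0 = reader mode, 1 = writer mode */
#define EX_PENDING_BIT  (1UL << 1)   /* a waiter is queued */
#define EX_HANDOFF_BIT  (1UL << 2)   /* lock is being handed off */
#define EX_VAL_SHIFT    6            /* bits 6-N: owner or reader count */
#define EX_VAL_MASK     (~((1UL << EX_VAL_SHIFT) - 1))

/* Writer mode: bits 6-N hold the (cacheline-aligned) owner pointer. */
static inline unsigned long ex_writer_word(unsigned long owner_ptr)
{
	return (owner_ptr & EX_VAL_MASK) | EX_RW_BIT;
}

/* Reader mode: bits 6-N hold the reader count instead. */
static inline unsigned long ex_reader_word(unsigned long nr_readers)
{
	return nr_readers << EX_VAL_SHIFT;
}

Since the owner pointer and the reader count never need to be stored in
the word at the same time, the same layout fits a 32-bit long as well.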

I have a half-arsed rwsem-atomic.c somewhere that does just that. I just
never got around to doing all the optimistic spin and steal crap that
makes our current rwsem fly.

And that nicely gets rid of that mind bending BIAS crud.

2019-02-07 19:49:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH-tip 04/22] locking/rwsem: Remove arch specific rwsem files

On Thu, Feb 07, 2019 at 08:36:56PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 02:07:08PM -0500, Waiman Long wrote:
>
> > +static inline int __down_read_trylock(struct rw_semaphore *sem)
> > +{
> > + long tmp;
> > +
> > + while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> > + if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> > + tmp + RWSEM_ACTIVE_READ_BIAS)) {
> > + return 1;
> > + }
> > + }
>
> Nah, you're supposed to write that like:
>
> 	for (;;) {
> 		val = atomic_long_cond_read_relaxed(&sem->count, VAL < 0);
> 		if (atomic_long_try_cmpxchg_acquire(&sem->count, &val,
> 						    val + RWSEM_ACTIVE_READ_BIAS))
> 			break;
> 	}

N/m that's not in fact the same...

> > + return 0;
> > +}
>
>
> Anyway, yuck, you're keeping all that BIAS nonsense :/ I was so hoping
> for a rwsem implementation without that impenetrable crap.

2019-02-07 19:51:53

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

On Thu, 07 Feb 2019, Waiman Long wrote:
> 30 files changed, 1197 insertions(+), 1594 deletions(-)

Performance numbers on numerous workloads, pretty please.

I'll go and throw this at my mmap_sem intensive workloads
I've collected.

Thanks,
Davidlohr

2019-02-07 19:56:17

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH-tip 15/22] locking/rwsem: Merge owner into count on x86-64

On 02/07/2019 02:45 PM, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
>> On 32-bit architectures, there aren't enough bits to hold both.
>> 64-bit architectures, however, can have enough bits to do that. For
>> x86-64, the physical address can use up to 52 bits. That is 4PB of
>> memory. That leaves 12 bits available for other use. The task structure
>> pointer is also aligned to the L1 cache size. That means another 6 bits
>> (64 bytes cacheline) will be available. Reserving 2 bits for status
>> flags, we will have 16 bits for the reader count. That can supports
>> up to (64k-1) readers.
> *groan*...
>
> So take qrwlock's idea for a queue, then make the count value (similar
> to the new mutex); that is have a bit0 be a r/w bit, when w bits 6-N are
> owner, when r they are reader-count. bit1 can be a pending bit, bit2 a
> handoff bit etc..
>
> That should fit and work on 32bit and 64bit without issue.
>
> I have a half-arsed rwsem-atomic.c somewhere that does just that. I just
> never got around to doing all the optimistic spin and steal crap that
> makes our current rwsem fly.
>
> And that nicely gets rid of that mind bending BIAS crud.

Well, the reason for this compromise is to keep using xadd for readers.
Your scheme will certainly work, but it means we have to use cmpxchg for
readers too. That will have a performance impact, especially with
multiple readers contending, which I am trying to avoid.
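
For reference, the difference between the two reader fast paths looks
roughly like this kernel-style sketch; EX_READER_BIAS and
EX_READ_FAILED_MASK are placeholders, not the series' actual
definitions:

#define EX_READER_BIAS          (1L << 8)       /* placeholder values */
#define EX_READ_FAILED_MASK     0xffL

/* xadd style: one unconditional atomic per reader, no retry loop. */
static inline bool ex_read_lock_xadd(atomic_long_t *count)
{
	return !(atomic_long_fetch_add_acquire(EX_READER_BIAS, count) &
		 EX_READ_FAILED_MASK);
}

/*
 * cmpxchg style: every concurrent reader update forces a retry, so
 * heavy reader contention turns into repeated cmpxchg failures.
 */
static inline bool ex_read_lock_cmpxchg(atomic_long_t *count)
{
	long old = atomic_long_read(count);

	do {
		if (old & EX_READ_FAILED_MASK)
			return false;
	} while (!atomic_long_try_cmpxchg_acquire(count, &old,
						  old + EX_READER_BIAS));
	return true;
}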

Cheers,
Longman



2019-02-07 20:01:28

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> On Thu, 07 Feb 2019, Waiman Long wrote:
>> 30 files changed, 1197 insertions(+), 1594 deletions(-)
>
> Performance numbers on numerous workloads, pretty please.
>
> I'll go and throw this at my mmap_sem intensive workloads
> I've collected.
>
> Thanks,
> Davidlohr

Thanks for getting some of the performance numbers. This is the initial
draft after more than a year of hibernation. I will also get other
performance numbers in subsequent revisions of the patch.

Cheers,
Longman


2019-02-07 20:09:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH-tip 15/22] locking/rwsem: Merge owner into count on x86-64

On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
> On 32-bit architectures, there aren't enough bits to hold both.
> 64-bit architectures, however, can have enough bits to do that. For
> x86-64, the physical address can use up to 52 bits. That is 4PB of
> memory. That leaves 12 bits available for other use. The task structure
> pointer is also aligned to the L1 cache size. That means another 6 bits
> (64 bytes cacheline) will be available. Reserving 2 bits for status
> flags, we will have 16 bits for the reader count. That can supports
> up to (64k-1) readers.

64k readers sounds like a number that is fairly 'easy' to reach, esp. on
64bit. These are preemptible locks after all, all we need to do is get
64k tasks nested on enough CPUs.

I'm sure there's some willing Java proglet around that spawns more than
64k threads just because it can. Run it on a big enough machine (ISTR
there's a number of >1k CPU systems out there) and voila.



2019-02-07 20:55:41

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH-tip 15/22] locking/rwsem: Merge owner into count on x86-64

On 02/07/2019 03:08 PM, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
>> On 32-bit architectures, there aren't enough bits to hold both.
>> 64-bit architectures, however, can have enough bits to do that. For
>> x86-64, the physical address can use up to 52 bits. That is 4PB of
>> memory. That leaves 12 bits available for other use. The task structure
>> pointer is also aligned to the L1 cache size. That means another 6 bits
>> (64 bytes cacheline) will be available. Reserving 2 bits for status
>> flags, we will have 16 bits for the reader count. That can supports
>> up to (64k-1) readers.
> 64k readers sounds like a number that is fairly 'easy' to reach, esp. on
> 64bit. These are preemptible locks after all, all we need to do is get
> 64k tasks nested on enough CPUs.
>
> I'm sure there's some willing Java proglet around that spawns more than
> 64k threads just because it can. Run it on a big enough machine (ISTR
> there's a number of >1k CPU systems out there) and voila.

Yes, that can be a problem.

One possible solution is to check if the count goes negative. If so,
fail the read lock and make the readers wait in the wait queue until the
count is in positive territory. That effectively reduces the reader
count to 15 bits, but it will avoid the overflow situation. I will try
to add that support into the next version.
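
A minimal sketch of that idea, assuming the x86-64 layout from patch 15
where the reader count sits in the top 16 bits so an overflow flips the
sign bit; the name and placement are illustrative only, not the actual
fix:

/*
 * Illustrative only: treat the sign bit as a guard bit.  If adding the
 * reader bias drives the 64-bit count negative, the 16-bit reader field
 * has overflowed, so undo the increment and fall back to the wait queue
 * instead of granting the read lock.
 */
static inline bool ex_down_read_guarded(struct rw_semaphore *sem)
{
	long cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
						 &sem->count);

	if (unlikely(cnt + RWSEM_READER_BIAS < 0)) {
		atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
		return false;	/* caller queues up as a waiter */
	}
	return true;		/* read lock granted (other checks omitted) */
}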

Cheers,
Longman





2019-02-08 14:21:02

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH-tip 15/22] locking/rwsem: Merge owner into count on x86-64

On 02/07/2019 03:54 PM, Waiman Long wrote:
> On 02/07/2019 03:08 PM, Peter Zijlstra wrote:
>> On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
>>> On 32-bit architectures, there aren't enough bits to hold both.
>>> 64-bit architectures, however, can have enough bits to do that. For
>>> x86-64, the physical address can use up to 52 bits. That is 4PB of
>>> memory. That leaves 12 bits available for other use. The task structure
>>> pointer is also aligned to the L1 cache size. That means another 6 bits
>>> (64 bytes cacheline) will be available. Reserving 2 bits for status
>>> flags, we will have 16 bits for the reader count. That can supports
>>> up to (64k-1) readers.
>> 64k readers sounds like a number that is fairly 'easy' to reach, esp. on
>> 64bit. These are preemptible locks after all, all we need to do is get
>> 64k tasks nested on enough CPUs.
>>
>> I'm sure there's some willing Java proglet around that spawns more than
>> 64k threads just because it can. Run it on a big enough machine (ISTR
>> there's a number of >1k CPU systems out there) and voila.
> Yes, that can be a problem.
>
> One possible solution is to check if the count goes negative. If so,
> fail the read lock and make the readers wait in the wait queue until the
> count is in positive territory. That effectively reduces the reader
> count to 15 bits, but it will avoid the overflow situation. I will try
> to add that support into the next version.
>
> Cheers,
> Longman

Something like the attached patch.

Cheers,
Longman


Attachments:
0023-locking-rwsem-Make-MSbit-of-count-as-guard-bit-to-fa.patch (8.79 kB)

2019-02-08 19:52:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

On Thu, Feb 7, 2019 at 11:08 AM Waiman Long <[email protected]> wrote:
>
> This patchset revamps the current rwsem-xadd implementation to make
> it saner and easier to work with. This patchset removes all the
> architecture specific assembly code and uses generic C code for all
> architectures. This eases maintenance and enables us to enhance the
> code more easily.
>
> This patchset also implements the following 3 new features:
>
> 1) Waiter lock handoff
> 2) Reader optimistic spinning
> 3) Store write-lock owner in the atomic count (x86-64 only)

The patches are kind of hard to read, with most of them just doing
prep-work that doesn't necessarily matter to the big picture.

What I'd really like to see is

(a) an overview of the new locking logic

(b) what's the new fastpath case

(c) some performance numbers

to explain the changes from a "this is the point of the whole
exercise" standpoint.

And yes, I realize that the lock handoff and optimistic spinning is a
big deal, since I've seen the same regression numbers that presumably
caused this effort to be resurrected. So it's not that I don't find
this intriguing and worthwhile, it's literally that I'd like a summary
not so much of the individual patches, but of the new model.

Please?

Linus

2019-02-08 20:34:52

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

On 02/08/2019 02:50 PM, Linus Torvalds wrote:
> On Thu, Feb 7, 2019 at 11:08 AM Waiman Long <[email protected]> wrote:
>> This patchset revamps the current rwsem-xadd implementation to make
>> it saner and easier to work with. This patchset removes all the
>> architecture specific assembly code and uses generic C code for all
>> architectures. This eases maintenance and enables us to enhance the
>> code more easily.
>>
>> This patchset also implements the following 3 new features:
>>
>> 1) Waiter lock handoff
>> 2) Reader optimistic spinning
>> 3) Store write-lock owner in the atomic count (x86-64 only)
> The patches are kind of hard to read, with most of them just doing
> prep-work that doesn't necessarily matter to the big picture.
>
> What I'd really like to see is
>
> (a) an overview of the new locking logic

The new locking logic is similar to qrwlock (see patch 11). Cmpxchg is
used to acquire the write lock, while xadd is still used for the read
lock. Some of the bits in the count are also reserved for special
purposes such as the has-waiters and lock-handoff flags. Patch 15 tries
to compress the write-lock owner task pointer and put it into the count
field for x86-64, at the expense of fewer bits available for the reader
count. I have sent out an additional patch this morning to make sure
that the reader count won't overflow.

In terms of performance, there isn't much change with respect to
read-lock performance. For the write lock, I saw a slight drop in some
cases, but nothing significant. The merging of the owner task pointer
into the count field does impose a slightly bigger drop than I would
have liked, which I am going to look into a bit more.

>
> (b) what's the new fastpath case

The only change in the fastpath is the use of cmpxchg for writer lock.
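
Roughly, the new writer fast path is a single compare-and-exchange on
the count. The sketch below is only an illustration of that shape; on
x86-64 the value installed also carries the compressed owner bits, and
the function name here is made up:

/*
 * Sketch of the cmpxchg-based writer fast path: only succeed when no
 * reader or writer bits are set in the count.
 */
static inline bool ex_down_write_trylock(struct rw_semaphore *sem)
{
	long old = atomic_long_read(&sem->count);

	if (old & RWSEM_LOCK_MASK)
		return false;	/* already read- or write-locked */

	return atomic_long_try_cmpxchg_acquire(&sem->count, &old,
					       old | RWSEM_WRITER_LOCKED);
}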

>
> (c) some performance numbers

There are performance data in patches 11, 12, 15, 19, 20 and 21. There
was performance data for patch 4 as well, for eliminating the arch
specific files. Apparently, I might have deleted it accidentally.
Anyway, no noticeable performance difference was observed when switching
to generic C code for x86, ppc and ARM64.

The major gain in performance is due to the reader optimistic spinning
patches. The microbenchmark that I used showed an order-of-magnitude
performance improvement for mixed reader-writer workloads. Of course, we
will see less performance gain with real-world benchmarks.

I am planning to run more performance tests and post the data sometime
next week. Davidlohr is also going to run some of his rwsem performance
tests on this patchset.

>
> to explain the changes from a "this is the point of the whole
> exercise" standpoint.
>
> And yes, I realize that the lock handoff and optimistic spinning is a
> big deal, since I've seen the same regression numbers that presumably
> caused this effort to be resurrected. So it's not that I don't find
> this intriguing and worthwhile, it's literally that I'd like a summary
> not so much of the individual patches, but of the new model.
>
> Please?

Maybe I should break this patchset into a few smaller ones to make it
easier to review. Any suggestion is welcome.

Cheers,
Longman


2019-02-09 00:09:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

On Fri, Feb 8, 2019 at 12:31 PM Waiman Long <[email protected]> wrote:
>
> > (b) what's the new fastpath case
>
> The only change in the fastpath is the use of cmpxchg for writer lock.

.. since a big deal here was about using the generic atomic accessor
functions, I really was looking forward to seeing the *actual* fast
path code generation.

In other words, right now I have very little visibility in how it
actually affects the code. Looking at the patches themselves doesn't
make it obvious. I was hoping for the overview to really explain the
whole "before and after" situation, and it didn't. Not at the high
level, and not at a low level. And no performance numbers in the
overview either.

And yes, I see the numbers in the patches, but what I really hoped for
was some real load numbers. In particular, I would have loved to see
numbers from th ekernel test robot "will-it-scale.per_thread_ops"
case, which is the one that had a 65% regression due to the lack of
reader spinning.

So I was kind of hoping to hear whether that regression is basically
entirely gone with this patch series, or if we still have a regression
due to the extra downgrade, or what?

Linus

2019-02-11 07:40:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features


* Waiman Long <[email protected]> wrote:

> On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> > On Thu, 07 Feb 2019, Waiman Long wrote:
> >> 30 files changed, 1197 insertions(+), 1594 deletions(-)
> >
> > Performance numbers on numerous workloads, pretty please.
> >
> > I'll go and throw this at my mmap_sem intensive workloads
> > I've collected.
> >
> > Thanks,
> > Davidlohr
>
> Thanks for getting some of the performance numbers. This is the initial
> draft after more than 1 years of hibernation. I will also get other
> performance numbers in subsequent revision of the patch.

If you could sort all the invariant preparatory patches to the head of
the series I can merge them to reduce overall complexity and simplify
performance testing and review of the rest.

Thanks,

Ingo

2019-02-13 13:30:11

by Chen, Rong A

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

Hi all,

Kernel test robot reported a will-it-scale.per_thread_ops -64.1% regression on IVB-desktop for v4.20-rc1.
The first bad commit is: 9bc8039e715da3b53dbac89525323a9f2f69b7b5, Yang Shi <[email protected]>: mm: brk: downgrade mmap_sem to read when shrinking
(https://lists.01.org/pipermail/lkp/2018-November/009335.html).

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-7/performance/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-ivb-d01/brk1/will-it-scale/0x20

commit:
85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
9bc8039e71 ("mm: brk: downgrade mmap_sem to read when shrinking")

85a06835f6f1ba79 9bc8039e715da3b53dbac89525
---------------- --------------------------
%stddev %change %stddev
\ | \
196250 ± 8% -64.1% 70494 will-it-scale.per_thread_ops
127330 ± 19% -98.0% 2525 ± 24% will-it-scale.time.involuntary_context_switches
727.50 ± 2% -77.0% 167.25 will-it-scale.time.percent_of_cpu_this_job_got
2141 ± 2% -77.6% 479.12 will-it-scale.time.system_time
50.48 ± 7% -48.5% 25.98 will-it-scale.time.user_time
34925294 ± 18% +270.3% 1.293e+08 ± 4% will-it-scale.time.voluntary_context_switches
1570007 ± 8% -64.1% 563958 will-it-scale.workload
6435 ± 2% -6.4% 6024 proc-vmstat.nr_shmem
1298 ± 16% -44.5% 721.00 ± 18% proc-vmstat.pgactivate
2341 +16.4% 2724 slabinfo.kmalloc-96.active_objs
2341 +16.4% 2724 slabinfo.kmalloc-96.num_objs
6346 ±150% -87.8% 776.25 ± 9% softirqs.NET_RX
160107 ± 8% +151.9% 403273 softirqs.SCHED
1097999 -13.0% 955526 softirqs.TIMER
5.50 ± 9% -81.8% 1.00 vmstat.procs.r
230700 ± 19% +269.9% 853292 ± 4% vmstat.system.cs
26706 ± 3% +15.7% 30910 ± 5% vmstat.system.in
11.24 ± 23% +72.2 83.39 mpstat.cpu.idle%
0.00 ±131% +0.0 0.04 ± 99% mpstat.cpu.iowait%
86.32 ± 2% -70.8 15.54 mpstat.cpu.sys%
2.44 ± 7% -1.4 1.04 ± 8% mpstat.cpu.usr%
20610709 ± 15% +2376.0% 5.103e+08 ± 34% cpuidle.C1.time
3233399 ± 8% +241.5% 11042785 ± 25% cpuidle.C1.usage
36172040 ± 6% +931.3% 3.73e+08 ± 15% cpuidle.C1E.time
783605 ± 4% +548.7% 5083041 ± 18% cpuidle.C1E.usage
28753819 ± 39% +1054.5% 3.319e+08 ± 49% cpuidle.C3.time
283912 ± 25% +688.4% 2238225 ± 34% cpuidle.C3.usage
1.507e+08 ± 47% +292.3% 5.913e+08 ± 28% cpuidle.C6.time
339861 ± 37% +549.7% 2208222 ± 24% cpuidle.C6.usage
2709719 ± 5% +824.2% 25043444 cpuidle.POLL.time
28602864 ± 18% +173.7% 78276116 ± 10% cpuidle.POLL.usage


We found that the patchset could fix the regression.

tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-brk1-ucode=0x20/lkp-ivb-d01

commit:
85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
fb835fe7f0 ("locking/rwsem: Ensure an RT task will not spin on reader")

85a06835f6f1ba79 fb835fe7f0adbd7c2c074b98ec
---------------- --------------------------
%stddev change %stddev
\ | \
120736 ± 22% 56% 188019 ± 6% will-it-scale.time.involuntary_context_switches
2126 ± 3% 4% 2215 will-it-scale.time.system_time
722 ± 3% 4% 752 will-it-scale.time.percent_of_cpu_this_job_got
36256485 ± 27% -35% 23682989 ± 3% will-it-scale.time.voluntary_context_switches
3151 ± 9% 11% 3504 turbostat.Avg_MHz
229285 ± 32% -30% 160660 ± 3% vmstat.system.cs
120736 ± 22% 56% 188019 ± 6% time.involuntary_context_switches
2126 ± 3% 4% 2215 time.system_time
722 ± 3% 4% 752 time.percent_of_cpu_this_job_got
36256485 ± 27% -35% 23682989 ± 3% time.voluntary_context_switches
23 643% 171 ± 3% proc-vmstat.nr_zone_inactive_file
23 643% 171 ± 3% proc-vmstat.nr_inactive_file
3664 12% 4121 proc-vmstat.nr_kernel_stack
6392 6% 6785 proc-vmstat.nr_slab_unreclaimable
9991 10176 proc-vmstat.nr_slab_reclaimable
63938 62394 proc-vmstat.nr_zone_active_anon
63938 62394 proc-vmstat.nr_active_anon
386388 ± 9% -6% 362272 proc-vmstat.pgfree
368296 ± 9% -10% 333074 proc-vmstat.numa_hit
368296 ± 9% -10% 333074 proc-vmstat.numa_local
5169 ± 13% -28% 3745 proc-vmstat.nr_shmem
1801 ± 21% -83% 309 proc-vmstat.pgactivate
0 1e+04 11441 latency_stats.avg.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
13165 ±222% -1e+04 0 latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
22499 ±151% -2e+04 657 ± 7% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
117414 ±181% -9e+04 24418 ± 44% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
666005 ±218% -7e+05 198 ±141% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
2600097 ±132% -3e+06 572 latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
34391390 ±150% -3e+07 21807 ±141% latency_stats.avg.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
34624774 ±149% -3e+07 37668 ± 58% latency_stats.avg.max
0 1e+04 11441 latency_stats.max.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
22499 ±151% -2e+04 657 ± 7% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
37845 ±222% -4e+04 0 latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
80096 ± 59% -8e+04 0 latency_stats.max.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
177149 ±195% -2e+05 24418 ± 44% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
689417 ±209% -7e+05 200 ±141% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
18679699 ±129% -2e+07 656 latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
83587334 ±129% -8e+07 43457 ±141% latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
84867236 ±126% -8e+07 59318 ± 86% latency_stats.max.max
0 1e+04 11441 latency_stats.sum.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
22499 ±151% -2e+04 657 ± 7% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
39431 ±222% -4e+04 0 latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
216448 ±200% -2e+05 24418 ± 44% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
691960 ±208% -7e+05 397 ±141% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
24239011 ±140% -2e+07 4768 ± 10% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
1.771e+08 ±122% -2e+08 43614 ±141% latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.939e+08 ± 36% -2e+08 0 latency_stats.sum.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.943e+08 ± 51% -2e+08 51929782 latency_stats.sum.max
407463 ± 10% -100% 0 perf-stat.total.page-faults
74225651 ± 26% -100% 0 perf-stat.total.context-switches
55293 ± 25% -100% 0 perf-stat.total.cpu-migrations
407463 ± 10% -100% 0 perf-stat.total.minor-faults


tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-brk1-ucode=0x20/lkp-ivb-d01
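
For context, brk1 in will-it-scale is essentially a per-thread loop that grows the
heap by one page and shrinks it back, so every iteration goes through the brk()
syscall path and therefore through mmap_sem. The sketch below is a simplified
reconstruction of that loop, not the verbatim testcase source; the 128 MiB offset,
the iteration count and the use of sysconf() for the page size are illustrative
choices.

/*
 * Simplified sketch of a will-it-scale brk1-style worker
 * (reconstructed, not the verbatim testcase source).  Each pass
 * extends the heap by one page and shrinks it back, so every
 * iteration exercises mmap_sem in the brk() path; the shrink side
 * is what commit 9bc8039e71 downgrades to a read lock.
 */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

#define HEAP_OFFSET	(128UL * 1024 * 1024)	/* stay clear of the live heap */
#define ITERATIONS	(10UL * 1000 * 1000)

int main(void)
{
	char *addr = (char *)sbrk(0) + HEAP_OFFSET;
	long page = sysconf(_SC_PAGESIZE);
	unsigned long i;

	for (i = 0; i < ITERATIONS; i++) {
		assert(brk(addr + page) == 0);	/* grow by one page */
		assert(brk(addr) == 0);		/* shrink back */
	}

	printf("%lu brk pairs done\n", ITERATIONS);
	return 0;
}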

commit:
9bc8039e71 ("mm: brk: downgrade mmap_sem to read when shrinking")
fb835fe7f0 ("locking/rwsem: Ensure an RT task will not spin on reader")

9bc8039e715da3b5 fb835fe7f0adbd7c2c074b98ec
---------------- --------------------------
%stddev change %stddev
\ | \
3500 ± 36% 5272% 188019 ± 6% will-it-scale.time.involuntary_context_switches
483 358% 2215 will-it-scale.time.system_time
168 346% 752 will-it-scale.time.percent_of_cpu_this_job_got
71190 180% 199232 ± 4% will-it-scale.per_thread_ops
569524 180% 1593862 ± 4% will-it-scale.workload
25.85 93% 49.95 ± 3% will-it-scale.time.user_time
1.314e+08 ± 3% -82% 23682989 ± 3% will-it-scale.time.voluntary_context_switches
30501 ± 9% -15% 25813 ± 4% vmstat.system.in
799593 ± 10% -80% 160660 ± 3% vmstat.system.cs
887 ± 11% 295% 3504 turbostat.Avg_MHz
23.60 ± 10% 68% 39.54 turbostat.CorWatt
28.38 ± 8% 57% 44.43 turbostat.PkgWatt
3500 ± 36% 5272% 188019 ± 6% time.involuntary_context_switches
483 358% 2215 time.system_time
168 346% 752 time.percent_of_cpu_this_job_got
25.85 93% 49.95 ± 3% time.user_time
1.314e+08 ± 3% -82% 23682989 ± 3% time.voluntary_context_switches
0 ± 44% 46220% 386 proc-vmstat.nr_zone_active_file
0 ± 44% 46220% 386 proc-vmstat.nr_active_file
23 643% 171 ± 3% proc-vmstat.nr_zone_inactive_file
23 643% 171 ± 3% proc-vmstat.nr_inactive_file
3690 12% 4121 proc-vmstat.nr_kernel_stack
6419 6% 6785 proc-vmstat.nr_slab_unreclaimable
9961 10176 proc-vmstat.nr_slab_reclaimable
229251 231278 proc-vmstat.nr_zone_unevictable
229251 231278 proc-vmstat.nr_unevictable
1008 1005 proc-vmstat.nr_page_table_pages
63178 62394 proc-vmstat.nr_zone_active_anon
63178 62394 proc-vmstat.nr_active_anon
432061 ± 12% -11% 385372 proc-vmstat.pgfault
408099 ± 10% -11% 362272 proc-vmstat.pgfree
422206 ± 9% -11% 373690 proc-vmstat.pgalloc_normal
382357 ± 11% -13% 333074 proc-vmstat.numa_hit
382357 ± 11% -13% 333074 proc-vmstat.numa_local
4428 ± 17% -15% 3745 proc-vmstat.nr_shmem
0 1e+04 11441 latency_stats.avg.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
11180 ±168% -1e+04 657 ± 7% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
19239 ±223% -2e+04 0 latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open
63702 ±169% -4e+04 24418 ± 44% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
77617 ±205% -8e+04 510 ± 11% latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
3043762 ±124% -3e+06 572 latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
11630441 ±139% -1e+07 21807 ±141% latency_stats.avg.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
12242832 ±129% -1e+07 37668 ± 58% latency_stats.avg.max
0 1e+04 11441 latency_stats.max.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
11180 ±168% -1e+04 657 ± 7% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
19239 ±223% -2e+04 0 latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open
29152 ± 11% -3e+04 0 latency_stats.max.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
65909 ±164% -4e+04 24418 ± 44% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
77617 ±205% -8e+04 510 ± 11% latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
17301268 ±125% -2e+07 656 latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
44248611 ±140% -4e+07 43457 ±141% latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
46380610 ±130% -5e+07 59318 ± 86% latency_stats.max.max
0 1e+04 11441 latency_stats.sum.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
11180 ±168% -1e+04 657 ± 7% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
19239 ±223% -2e+04 0 latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open
74047 ±148% -5e+04 24418 ± 44% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
77617 ±205% -8e+04 510 ± 11% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
26043088 ±130% -3e+07 4768 ± 10% latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
82480038 ±152% -8e+07 43614 ±141% latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.771e+09 -2e+09 51929782 latency_stats.sum.max
1.771e+09 -2e+09 0 latency_stats.sum.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
420016 ± 12% -100% 0 perf-stat.total.page-faults
2.648e+08 ± 3% -100% 0 perf-stat.total.context-switches
52212 ± 18% -100% 0 perf-stat.total.cpu-migrations
420016 ± 12% -100% 0 perf-stat.total.minor-faults

Best Regards,
Rong Chen

2019-02-14 08:09:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

Ok, those test robot reports are hard to read, but trying to distill them down:

On Wed, Feb 13, 2019 at 1:19 AM Chen Rong <[email protected]> wrote:
>
> %stddev %change %stddev
> \ | \
> 196250 ± 8% -64.1% 70494 will-it-scale.per_thread_ops

That's the original 64% regression..

And then with the patch set:

> %stddev change %stddev
> \ | \
> 71190 180% 199232 ± 4% will-it-scale.per_thread_ops

looks like it's back up where it used to be.

So I guess we have numbers for the regression now. Thanks.

And that closes my biggest question for the new model, and with the
new organization that gets rid of the arch-specific asm separately
first and makes it a bit more legible that way, I guess I'll just Ack
the whole series.

Linus

2019-02-14 23:27:39

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

On Fri, 08 Feb 2019, Waiman Long wrote:
>I am planning to run more performance tests and post the data sometime
>next week. Davidlohr is also going to run some of his rwsem performance
>tests on this patchset.

So I ran this series on a 40-core, 2-socket IB box with various workloads in
mmtests. Below are some of the interesting ones; full numbers and curves
at https://linux-scalability.org/rwsem-reader-spinner/

All workloads are with increasing number of threads.

-- pagefault timings: pft is an artificial page-fault benchmark (thus reader stress
on mmap_sem; see the sketch after the results below). The metric is faults/cpu and faults/sec.
v5.0-rc6 v5.0-rc6
dirty
Hmean faults/cpu-1 624224.9815 ( 0.00%) 618847.5201 * -0.86%*
Hmean faults/cpu-4 539550.3509 ( 0.00%) 547407.5738 * 1.46%*
Hmean faults/cpu-7 401470.3461 ( 0.00%) 381157.9830 * -5.06%*
Hmean faults/cpu-12 267617.0353 ( 0.00%) 271098.5441 * 1.30%*
Hmean faults/cpu-21 176194.4641 ( 0.00%) 175151.3256 * -0.59%*
Hmean faults/cpu-30 119927.3862 ( 0.00%) 120610.1348 * 0.57%*
Hmean faults/cpu-40 91203.6820 ( 0.00%) 91832.7489 * 0.69%*
Hmean faults/sec-1 623292.3467 ( 0.00%) 617992.0795 * -0.85%*
Hmean faults/sec-4 2113364.6898 ( 0.00%) 2140254.8238 * 1.27%*
Hmean faults/sec-7 2557378.4385 ( 0.00%) 2450945.7060 * -4.16%*
Hmean faults/sec-12 2696509.8975 ( 0.00%) 2747968.9819 * 1.91%*
Hmean faults/sec-21 2902892.5639 ( 0.00%) 2905923.3881 * 0.10%*
Hmean faults/sec-30 2956696.5793 ( 0.00%) 2990583.5147 * 1.15%*
Hmean faults/sec-40 3422806.4806 ( 0.00%) 3352970.3082 * -2.04%*
Stddev faults/cpu-1 2949.5159 ( 0.00%) 2802.2712 ( 4.99%)
Stddev faults/cpu-4 24165.9454 ( 0.00%) 15841.1232 ( 34.45%)
Stddev faults/cpu-7 20914.8351 ( 0.00%) 22744.3294 ( -8.75%)
Stddev faults/cpu-12 11274.3490 ( 0.00%) 14733.3152 ( -30.68%)
Stddev faults/cpu-21 2500.1950 ( 0.00%) 2200.9518 ( 11.97%)
Stddev faults/cpu-30 1599.5346 ( 0.00%) 1414.0339 ( 11.60%)
Stddev faults/cpu-40 1473.0181 ( 0.00%) 3004.1209 (-103.94%)
Stddev faults/sec-1 2655.2581 ( 0.00%) 2405.1625 ( 9.42%)
Stddev faults/sec-4 84042.7234 ( 0.00%) 57996.7158 ( 30.99%)
Stddev faults/sec-7 123656.7901 ( 0.00%) 135591.1087 ( -9.65%)
Stddev faults/sec-12 97135.6091 ( 0.00%) 127054.4926 ( -30.80%)
Stddev faults/sec-21 69564.6264 ( 0.00%) 65922.6381 ( 5.24%)
Stddev faults/sec-30 51524.4027 ( 0.00%) 56109.4159 ( -8.90%)
Stddev faults/sec-40 101927.5280 ( 0.00%) 160117.0093 ( -57.09%)

With the exception of the hiccup at 7 threads, things are pretty much in
the noise region for both metrics.
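
The reader stress here comes from concurrent threads faulting anonymous memory
in, so every first touch takes mmap_sem for read in the fault path while the
mmap()/munmap() calls take it for write. The snippet below is only an
illustration of that access pattern, not the mmtests/pft source; NR_THREADS,
MAP_SIZE and PASSES are arbitrary values chosen for the sketch.

/*
 * Minimal illustration of pft-style page-fault (mmap_sem reader) stress.
 * Not the actual mmtests/pft benchmark; NR_THREADS, MAP_SIZE and PASSES
 * are arbitrary values for the sketch.
 */
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define NR_THREADS	8
#define MAP_SIZE	(64UL << 20)	/* 64 MiB per pass */
#define PASSES		100

static void *fault_worker(void *arg)
{
	long page = sysconf(_SC_PAGESIZE);
	int pass;

	(void)arg;
	for (pass = 0; pass < PASSES; pass++) {
		char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		size_t off;

		if (p == MAP_FAILED)
			exit(1);

		/* every first touch faults a page in with mmap_sem held for read */
		for (off = 0; off < MAP_SIZE; off += page)
			p[off] = 1;

		munmap(p, MAP_SIZE);	/* mmap/munmap take mmap_sem for write */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, fault_worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

The faults/cpu and faults/sec columns above then reflect how well those
concurrent read-side fault paths scale as the thread count grows.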

-- git checkout

First metric is total runtime for runs with incremental threads.

v5.0-rc6 v5.0-rc6
dirty
User 218.95 219.07
System 149.29 146.82
Elapsed 1574.10 1427.08

In this case there's a non-trivial improvement (11%) in overall elapsed time.

-- reaim (which is always susceptible to rwsem changes for both mmap_sem and
i_mmap)
v5.0-rc6 v5.0-rc6
dirty
Hmean compute-1 6674.01 ( 0.00%) 6544.28 * -1.94%*
Hmean compute-21 85294.91 ( 0.00%) 85524.20 * 0.27%*
Hmean compute-41 149674.70 ( 0.00%) 149494.58 * -0.12%*
Hmean compute-61 177721.15 ( 0.00%) 170507.38 * -4.06%*
Hmean compute-81 181531.07 ( 0.00%) 180463.24 * -0.59%*
Hmean compute-101 189024.09 ( 0.00%) 187288.86 * -0.92%*
Hmean compute-121 200673.24 ( 0.00%) 195327.65 * -2.66%*
Hmean compute-141 213082.29 ( 0.00%) 211290.80 * -0.84%*
Hmean compute-161 207764.06 ( 0.00%) 204626.68 * -1.51%*

The 'compute' workload overall takes a small hit.

Hmean new_dbase-1 60.48 ( 0.00%) 60.63 * 0.25%*
Hmean new_dbase-21 6590.49 ( 0.00%) 6671.81 * 1.23%*
Hmean new_dbase-41 14202.91 ( 0.00%) 14470.59 * 1.88%*
Hmean new_dbase-61 21207.24 ( 0.00%) 21067.40 * -0.66%*
Hmean new_dbase-81 25542.40 ( 0.00%) 25542.40 * 0.00%*
Hmean new_dbase-101 30165.28 ( 0.00%) 30046.21 * -0.39%*
Hmean new_dbase-121 33638.33 ( 0.00%) 33219.90 * -1.24%*
Hmean new_dbase-141 36723.70 ( 0.00%) 37504.52 * 2.13%*
Hmean new_dbase-161 42242.51 ( 0.00%) 42117.34 * -0.30%*
Hmean shared-1 76.54 ( 0.00%) 76.09 * -0.59%*
Hmean shared-21 7535.51 ( 0.00%) 5518.75 * -26.76%*
Hmean shared-41 17207.81 ( 0.00%) 14651.94 * -14.85%*
Hmean shared-61 20716.98 ( 0.00%) 18667.52 * -9.89%*
Hmean shared-81 27603.83 ( 0.00%) 23466.45 * -14.99%*
Hmean shared-101 26008.59 ( 0.00%) 29536.96 * 13.57%*
Hmean shared-121 28354.76 ( 0.00%) 43139.39 * 52.14%*
Hmean shared-141 38509.25 ( 0.00%) 41619.35 * 8.08%*
Hmean shared-161 40496.07 ( 0.00%) 44303.46 * 9.40%*

Overall there is a small hit (in the noise level but consistent throughout
many workloads), except for git checkout, which does quite well.

Thanks,
Davidlohr

2019-02-15 00:46:29

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

On 02/14/2019 08:23 AM, Davidlohr Bueso wrote:
> On Fri, 08 Feb 2019, Waiman Long wrote:
>> I am planning to run more performance tests and post the data sometime
>> next week. Davidlohr is also going to run some of his rwsem performance
>> tests on this patchset.
>
> So I ran this series on a 40-core IB 2 socket with various workloads in
> mmtests. Below are some of the interesting ones; full numbers and curves
> at https://linux-scalability.org/rwsem-reader-spinner/
>
> All workloads are with increasing number of threads.
>
> -- pagefault timings: pft is an artificial pf benchmark (thus reader
> stress).
> metric is faults/cpu and faults/sec
>                                       v5.0-rc6                 v5.0-rc6
>                                                                    dirty
> Hmean     faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
> Hmean     faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
> Hmean     faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
> Hmean     faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
> Hmean     faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
> Hmean     faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
> Hmean     faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
> Hmean     faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
> Hmean     faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
> Hmean     faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
> Hmean     faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
> Hmean     faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
> Hmean     faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
> Hmean     faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
> Stddev    faults/cpu-1      2949.5159 (   0.00%)     2802.2712 (   4.99%)
> Stddev    faults/cpu-4     24165.9454 (   0.00%)    15841.1232 (  34.45%)
> Stddev    faults/cpu-7     20914.8351 (   0.00%)    22744.3294 (  -8.75%)
> Stddev    faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
> Stddev    faults/cpu-21     2500.1950 (   0.00%)     2200.9518 (  11.97%)
> Stddev    faults/cpu-30     1599.5346 (   0.00%)     1414.0339 (  11.60%)
> Stddev    faults/cpu-40     1473.0181 (   0.00%)     3004.1209 (-103.94%)
> Stddev    faults/sec-1      2655.2581 (   0.00%)     2405.1625 (   9.42%)
> Stddev    faults/sec-4     84042.7234 (   0.00%)    57996.7158 (  30.99%)
> Stddev    faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
> Stddev    faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
> Stddev    faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
> Stddev    faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
> Stddev    faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)
>
> With the exception of the hiccup at 7 threads, things are pretty much in
> the noise region for both metrics.
>
> -- git checkout
>
> First metric is total runtime for runs with incremental threads.
>
>           v5.0-rc6    v5.0-rc6
>                          dirty
> User         218.95      219.07
> System       149.29      146.82
> Elapsed     1574.10     1427.08
>
> In this case there's a non trivial improvement (11%) in overall
> elapsed time.
>
> -- reaim (which is always susceptible to rwsem changes for both
> mmap_sem and
> i_mmap)
>                                     v5.0-rc6               v5.0-rc6
>                                                                dirty
> Hmean     compute-1         6674.01 (   0.00%)     6544.28 *  -1.94%*
> Hmean     compute-21       85294.91 (   0.00%)    85524.20 *   0.27%*
> Hmean     compute-41      149674.70 (   0.00%)   149494.58 *  -0.12%*
> Hmean     compute-61      177721.15 (   0.00%)   170507.38 *  -4.06%*
> Hmean     compute-81      181531.07 (   0.00%)   180463.24 *  -0.59%*
> Hmean     compute-101     189024.09 (   0.00%)   187288.86 *  -0.92%*
> Hmean     compute-121     200673.24 (   0.00%)   195327.65 *  -2.66%*
> Hmean     compute-141     213082.29 (   0.00%)   211290.80 *  -0.84%*
> Hmean     compute-161     207764.06 (   0.00%)   204626.68 *  -1.51%*
>
> The 'compute' workload overall takes a small hit.
>
> Hmean     new_dbase-1         60.48 (   0.00%)       60.63 *   0.25%*
> Hmean     new_dbase-21      6590.49 (   0.00%)     6671.81 *   1.23%*
> Hmean     new_dbase-41     14202.91 (   0.00%)    14470.59 *   1.88%*
> Hmean     new_dbase-61     21207.24 (   0.00%)    21067.40 *  -0.66%*
> Hmean     new_dbase-81     25542.40 (   0.00%)    25542.40 *   0.00%*
> Hmean     new_dbase-101    30165.28 (   0.00%)    30046.21 *  -0.39%*
> Hmean     new_dbase-121    33638.33 (   0.00%)    33219.90 *  -1.24%*
> Hmean     new_dbase-141    36723.70 (   0.00%)    37504.52 *   2.13%*
> Hmean     new_dbase-161    42242.51 (   0.00%)    42117.34 *  -0.30%*
> Hmean     shared-1            76.54 (   0.00%)       76.09 *  -0.59%*
> Hmean     shared-21         7535.51 (   0.00%)     5518.75 * -26.76%*
> Hmean     shared-41        17207.81 (   0.00%)    14651.94 * -14.85%*
> Hmean     shared-61        20716.98 (   0.00%)    18667.52 *  -9.89%*
> Hmean     shared-81        27603.83 (   0.00%)    23466.45 * -14.99%*
> Hmean     shared-101       26008.59 (   0.00%)    29536.96 *  13.57%*
> Hmean     shared-121       28354.76 (   0.00%)    43139.39 *  52.14%*
> Hmean     shared-141       38509.25 (   0.00%)    41619.35 *   8.08%*
> Hmean     shared-161       40496.07 (   0.00%)    44303.46 *   9.40%*
>
> Overall there is a small hit (in the noise level but consistent
> throughout
> many workloads), except git-checkout which does quite well.
>
> Thanks,
> Davidlohr

Thanks for running the patch through your performance tests.

Cheers,
Longman


2019-04-10 08:16:52

by huang ying

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

Hi, Waiman,

What's the status of this patchset? And its merging plan?

Best Regards,
Huang, Ying

2019-04-10 18:59:02

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

On 04/10/2019 04:15 AM, huang ying wrote:
> Hi, Waiman,
>
> What's the status of this patchset? And its merging plan?
>
> Best Regards,
> Huang, Ying

I have broken the patchset into 3 parts (0/1/2) and rewritten some of the patches.
Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Cheers,
Longman

2019-04-12 00:51:02

by huang ying

[permalink] [raw]
Subject: Re: [PATCH-tip 00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

On Thu, Apr 11, 2019 at 12:08 AM Waiman Long <[email protected]> wrote:
>
> On 04/10/2019 04:15 AM, huang ying wrote:
> > Hi, Waiman,
> >
> > What's the status of this patchset? And its merging plan?
> >
> > Best Regards,
> > Huang, Ying
>
> I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
> Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Thanks! Please keep me updated!

Best Regards,
Huang, Ying

> Cheers,
> Longman
>