2013-08-06 03:13:08

by Waiman Long

Subject: [PATCH v7 0/4] Lockless update of reference count protected by spinlock

v6->v7:
- Substantially reduce the number of patches from 14 to 4 because a
lot of the minor filesystem related changes had been merged to
v3.11-rc1.
- Remove architecture specific customization (LOCKREF_WAIT_SHIFT &
LOCKREF_RETRY_COUNT).
- Tune single-thread performance of lockref_put/get to within 10%
of old lock->update->unlock code.

v5->v6:
- Add a new GENERIC_SPINLOCK_REFCOUNT config parameter for using the
generic implementation.
- Add two parameters LOCKREF_WAIT_SHIFT and LOCKREF_RETRY_COUNT which
can be specified differently for each architecture.
- Update various spinlock_refcount.* files to incorporate review
comments.
- Replace references to the d_refcount() macro in the Lustre filesystem
code in the staging tree with the new d_count() helper function.

v4->v5:
- Add a d_count() helper for readonly access of reference count and
change all references to d_count outside of dcache.c, dcache.h
and namei.c to use d_count().

v3->v4:
- Replace helper function access to d_lock and d_count by using
macros to redefine the old d_lock name to the spinlock and new
d_refcount name to the reference count. This greatly reduces the
size of this patchset from 25 to 12 and makes it easier to review.

v2->v3:
- Completely revamp the packaging by adding a new lockref data
structure that combines the spinlock with the reference
count. Helper functions are also added to manipulate the new data
structure. That results in modifying over 50 files, but the changes
were trivial in most of them.
- Change initial spinlock wait to use a timeout.
- Force 64-bit alignment of the spinlock & reference count structure.
- Add a new way to use the combo by using a new union and helper
functions.

v1->v2:
- Add one more layer of indirection to LOCK_WITH_REFCOUNT macro.
- Add __LINUX_SPINLOCK_REFCOUNT_H protection to spinlock_refcount.h.
- Add some generic get/put macros into spinlock_refcount.h.

This patchset supports a generic mechanism to atomically update
a reference count that is protected by a spinlock without actually
acquiring the lock itself. If the update doesn't succeed, the caller
will have to acquire the lock and update the reference count in the
old way. This will help in situations where there is a lot of
spinlock contention because of frequent reference count updates.
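
For illustration only, the intended caller-side pattern looks roughly
like this (a sketch, not code from the patches themselves; my_obj,
my_put and my_free are made-up names, while lockref_put_or_lock() and
the lock/refcnt fields are the helpers and layout introduced in patch 1):

#include <linux/slab.h>
#include <linux/spinlock_refcount.h>

struct my_obj {
	struct lockref ref;	/* combined spinlock + reference count */
	/* ... payload ... */
};

static void my_free(struct my_obj *obj)
{
	kfree(obj);
}

static void my_put(struct my_obj *obj)
{
	/* Fast path: lockless decrement as long as the count stays > 1 */
	if (lockref_put_or_lock(&obj->ref))
		return;

	/* Slow path: count was <= 1 and the spinlock is now held */
	if (--obj->ref.refcnt == 0) {
		/* assumes the object is no longer reachable at this point */
		spin_unlock(&obj->ref.lock);
		my_free(obj);
		return;
	}
	spin_unlock(&obj->ref.lock);
}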

The d_lock and d_count fields of the struct dentry in dcache.h were
modified to use the new lockref data structure, and the d_lock name
is now a macro that maps to the actual spinlock.

This patch set yields a significant performance improvement in the
short workload of the AIM7 benchmark on an 8-socket x86-64 machine
with 80 cores.

Thanks to Thomas Gleixner, Andi Kleen and Linus for their valuable
input in shaping this patchset.

Signed-off-by: Waiman Long <[email protected]>

Waiman Long (4):
spinlock: A new lockref structure for lockless update of refcount
spinlock: Enable x86 architecture to do lockless refcount update
dcache: replace d_lock/d_count by d_lockcnt
dcache: Enable lockless update of dentry's refcount

arch/x86/Kconfig | 3 +
fs/dcache.c | 78 +++++++------
fs/namei.c | 6 +-
include/asm-generic/spinlock_refcount.h | 46 +++++++
include/linux/dcache.h | 22 ++--
include/linux/spinlock_refcount.h | 126 ++++++++++++++++++++
kernel/Kconfig.locks | 15 +++
lib/Makefile | 2 +
lib/spinlock_refcount.c | 198 +++++++++++++++++++++++++++++++
9 files changed, 449 insertions(+), 47 deletions(-)
create mode 100644 include/asm-generic/spinlock_refcount.h
create mode 100644 include/linux/spinlock_refcount.h
create mode 100644 lib/spinlock_refcount.c


2013-08-06 03:13:10

by Waiman Long

Subject: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

This patch introduces a new set of spinlock_refcount.h header files to
be included by kernel code that wants to do a faster lockless update
of a reference count protected by a spinlock.

The new lockref structure consists of just the spinlock and the
reference count data. Helper functions are defined in the new
<linux/spinlock_refcount.h> header file to access the content of
the new structure. There is a generic structure defined for all
architectures, but each architecture can also optionally define its
own structure and use its own helper functions.

Three new config parameters are introduced:
1. SPINLOCK_REFCOUNT
2. GENERIC_SPINLOCK_REFCOUNT
3. ARCH_SPINLOCK_REFCOUNT

The first one is defined in kernel/Kconfig.locks and is used to enable
or disable the faster lockless reference count update optimization.
The second and third ones have to be defined in each architecture's
Kconfig file to enable the optimization for that architecture. Each
architecture therefore has to opt in to this optimization or it won't
get it. This allows each architecture plenty of time to test it out
before deciding to use it or replace it with a better
architecture-specific solution. An architecture should set only
GENERIC_SPINLOCK_REFCOUNT to use the generic implementation without
customization. By setting only ARCH_SPINLOCK_REFCOUNT, the
architecture has to provide its own implementation.

This optimization won't work on non-SMP systems or when spinlock
debugging is turned on, so it is disabled when either of those is
true. It also won't work with full preempt-RT, so it should be turned
off in that case as well.

To maximize the chance of doing lockless atomic update, the new code
will wait until the lock is free before trying to do the update.
The code will also attempt to do lockless atomic update a few times
before falling back to the old code path of acquiring a lock before
doing the update.

The table below shows the average JPM (jobs/minute) number (out of
3 runs) of the AIM7 short workload at 1500 users for different
configurations on an 8-socket 80-core DL980 with HT off, using a
kernel based on 3.11-rc3.

Configuration JPM
------------- ---
Wait till lock free, 1 update attempt 5899907
Wait till lock free, 2 update attempts 6534958
Wait till lock free, 3 update attempts 6868170
Wait till lock free, 4 update attempts 6905332
No wait, 2 update attempts 1091273
No wait, 4 update attempts 1281867
No wait, 8 update attempts 5095203
No wait, 16 update attempts 6392709
No wait, 32 update attempts 6438080

The "no wait, 8 update attempts" test showed high variability in the
results. One run can have 6M JPM whereas the other one is only 2M
JPM, for example. The "wait till lock free" tests, on the other hand,
are much more stable in their throughput numbers.

For this initial version, the code waits until the lock is free and
then makes up to 4 update attempts.

To evaluate the performance difference between doing a reference count
update the old way (lock->update->unlock) and with the new lockref
functions in the uncontended case, a 256K-iteration loop was run on a
2.4GHz Westmere x86-64 CPU. The following table shows the average time
(in ns) for a single update operation (including the looping and
timing overhead):

Update Type Time (ns)
----------- ---------
lock->update->unlock 14.7
lockref_get/lockref_put 16.0

In the uncontended case, the new lockref* functions are about 10%
slower than the plain lock->update->unlock sequence. Since a reference
count update is usually a very small part of a typical workload, the
actual performance impact of this change is negligible when there is
no contention.
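
The measurement loop was of this general form (a sketch only, not the
exact test code; it assumes the lockref was initialized with refcnt >= 2
so that lockref_put() always succeeds):

#include <linux/ktime.h>
#include <linux/spinlock_refcount.h>

#define NR_LOOPS	(256 * 1024)

/* Returns the average cost of one lockref update, in nanoseconds */
static u64 time_lockref_update(struct lockref *lr)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < NR_LOOPS; i++) {
		lockref_get(lr);	/* one update ... */
		lockref_put(lr);	/* ... and its inverse */
	}
	return ktime_to_ns(ktime_sub(ktime_get(), start)) / (2 * NR_LOOPS);
}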

Signed-off-by: Waiman Long <[email protected]>
---
include/asm-generic/spinlock_refcount.h | 46 +++++++
include/linux/spinlock_refcount.h | 126 ++++++++++++++++++++
kernel/Kconfig.locks | 15 +++
lib/Makefile | 2 +
lib/spinlock_refcount.c | 198 +++++++++++++++++++++++++++++++
5 files changed, 387 insertions(+), 0 deletions(-)
create mode 100644 include/asm-generic/spinlock_refcount.h
create mode 100644 include/linux/spinlock_refcount.h
create mode 100644 lib/spinlock_refcount.c

diff --git a/include/asm-generic/spinlock_refcount.h b/include/asm-generic/spinlock_refcount.h
new file mode 100644
index 0000000..d3a4119
--- /dev/null
+++ b/include/asm-generic/spinlock_refcount.h
@@ -0,0 +1,46 @@
+/*
+ * Spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (c) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+#define __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+
+/*
+ * The lockref structure defines a combined spinlock with reference count
+ * data structure to be embedded in a larger structure. The combined data
+ * structure is always 8-byte aligned. So proper placement of this structure
+ * in the larger embedding data structure is needed to ensure that there is
+ * no hole in it.
+ */
+struct __aligned(sizeof(u64)) lockref {
+ union {
+ u64 lock_count;
+ struct {
+ unsigned int refcnt; /* Reference count */
+ spinlock_t lock;
+ };
+ };
+};
+
+/*
+ * Struct lockref helper functions
+ */
+extern void lockref_get(struct lockref *lockcnt);
+extern int lockref_put(struct lockref *lockcnt);
+extern int lockref_get_not_zero(struct lockref *lockcnt);
+extern int lockref_put_or_lock(struct lockref *lockcnt);
+
+#endif /* __ASM_GENERIC_SPINLOCK_REFCOUNT_H */
diff --git a/include/linux/spinlock_refcount.h b/include/linux/spinlock_refcount.h
new file mode 100644
index 0000000..abadd87
--- /dev/null
+++ b/include/linux/spinlock_refcount.h
@@ -0,0 +1,126 @@
+/*
+ * Spinlock with reference count combo data structure
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (c) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __LINUX_SPINLOCK_REFCOUNT_H
+#define __LINUX_SPINLOCK_REFCOUNT_H
+
+#include <linux/spinlock.h>
+
+/*
+ * To enable lockless update of reference count, an architecture has to define
+ * either one of the following two config parameters in its Kconfig file:
+ * 1. GENERIC_SPINLOCK_REFCOUNT
+ * 2. ARCH_SPINLOCK_REFCOUNT
+ *
+ * By defining just the GENERIC_SPINLOCK_REFCOUNT parameter, the architecture
+ * will use the generic implementation. There is nothing else an architecture
+ * needs to do.
+ *
+ * On the other hand, defining the ARCH_SPINLOCK_REFCOUNT parameter indicates
+ * that the architecture is providing its own implementation. It has to provide
+ * an <asm/spinlock_refcount.h> header file.
+ */
+#ifdef CONFIG_SPINLOCK_REFCOUNT
+
+# ifdef CONFIG_ARCH_SPINLOCK_REFCOUNT
+# include <asm/spinlock_refcount.h>
+# else
+# include <asm-generic/spinlock_refcount.h>
+# endif
+
+#else
+/*
+ * If the spinlock & reference count optimization feature is disabled,
+ * the spinlock and the reference count are accessed separately.
+ */
+struct lockref {
+ unsigned int refcnt; /* Reference count */
+ spinlock_t lock;
+};
+
+/*
+ * Struct lockref helper functions
+ */
+/**
+ * lockref_get - Increments reference count unconditionally
+ * @lockcnt: pointer to lockref structure
+ */
+static __always_inline void lockref_get(struct lockref *lockcnt)
+{
+ spin_lock(&lockcnt->lock);
+ lockcnt->refcnt++;
+ spin_unlock(&lockcnt->lock);
+}
+
+/**
+ * lockref_get_not_zero - Increments count unless the count is 0
+ * @lockcnt: pointer to lockref structure
+ * Return: 1 if count updated successfully or 0 if count is 0
+ */
+static __always_inline int lockref_get_not_zero(struct lockref *lockcnt)
+{
+ int retval = 0;
+
+ spin_lock(&lockcnt->lock);
+ if (likely(lockcnt->refcnt)) {
+ lockcnt->refcnt++;
+ retval = 1;
+ }
+ spin_unlock(&lockcnt->lock);
+ return retval;
+}
+
+/**
+ * lockref_put - Decrements count unless count <= 1 before decrement
+ * @lockcnt: pointer to lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1
+ */
+static __always_inline int lockref_put(struct lockref *lockcnt)
+{
+ int retval = 0;
+
+ spin_lock(&lockcnt->lock);
+ if (likely(lockcnt->refcnt > 1)) {
+ lockcnt->refcnt--;
+ retval = 1;
+ }
+ spin_unlock(&lockcnt->lock);
+ return retval;
+}
+
+/**
+ * lockref_put_or_lock - decrements count unless count <= 1 before decrement
+ * @lockcnt: pointer to lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
+ *
+ * The only difference between lockref_put_or_lock and lockref_put is that
+ * the former function will hold the lock on return while the latter one
+ * will free it on return.
+ */
+static __always_inline int lockref_put_or_lock(struct lockref *lockcnt)
+{
+ spin_lock(&lockcnt->lock);
+ if (likely(lockcnt->refcnt > 1)) {
+ lockcnt->refcnt--;
+ spin_unlock(&lockcnt->lock);
+ return 1;
+ }
+ return 0;
+}
+
+#endif /* !CONFIG_SPINLOCK_REFCOUNT */
+#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..67ff90b 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,18 @@ endif
config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+#
+# Spinlock with reference count optimization
+#
+config GENERIC_SPINLOCK_REFCOUNT
+ bool
+
+config ARCH_SPINLOCK_REFCOUNT
+ bool
+
+config SPINLOCK_REFCOUNT
+ def_bool y
+ depends on ARCH_SPINLOCK_REFCOUNT || GENERIC_SPINLOCK_REFCOUNT
+ depends on SMP
+ depends on !GENERIC_LOCKBREAK && !DEBUG_SPINLOCK && !DEBUG_LOCK_ALLOC
diff --git a/lib/Makefile b/lib/Makefile
index 7baccfd..91de559 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -187,3 +187,5 @@ quiet_cmd_build_OID_registry = GEN $@
clean-files += oid_registry_data.c

obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+
+obj-$(CONFIG_GENERIC_SPINLOCK_REFCOUNT) += spinlock_refcount.o
diff --git a/lib/spinlock_refcount.c b/lib/spinlock_refcount.c
new file mode 100644
index 0000000..963ff07
--- /dev/null
+++ b/lib/spinlock_refcount.c
@@ -0,0 +1,198 @@
+/*
+ * Generic spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+
+#ifdef CONFIG_SPINLOCK_REFCOUNT
+#include <linux/spinlock.h>
+#include <linux/spinlock_refcount.h>
+
+/*
+ * The number of attempts to update the reference count locklessly before
+ * quitting (default = 4).
+ */
+#ifndef LOCKREF_RETRY_COUNT
+#define LOCKREF_RETRY_COUNT 4
+#endif
+
+/**
+ *
+ * add_unless - atomically add to count unless locked or threshold reached
+ *
+ * @lockcnt : pointer to the lockref structure
+ * @value : value to be added
+ * @threshold: threshold value for acquiring the lock
+ * Return : 1 if operation succeeds, -1 if threshold reached, 0 otherwise
+ *
+ * If the lock was not acquired, add_unless() atomically adds the given value
+ * to the reference count unless the given threshold is reached. If the lock
+ * was acquired or the threshold was reached, 0 is returned and the caller
+ * will have to acquire the lock and update the count accordingly (can be
+ * done in a non-atomic way).
+ */
+static __always_inline int
+add_unless(struct lockref *lockcnt, int value, int threshold)
+{
+ struct lockref old;
+ register struct lockref new;
+
+ old.lock_count = ACCESS_ONCE(lockcnt->lock_count);
+ if ((threshold >= 0) && (old.refcnt <= threshold))
+ return -1;
+ if (likely(!spin_is_locked(&old.lock))) {
+ new.lock_count = old.lock_count;
+ new.refcnt += value;
+ if (likely(cmpxchg64(&lockcnt->lock_count, old.lock_count,
+ new.lock_count) == old.lock_count))
+ return 1;
+ }
+ return 0;
+}
+
+/**
+ *
+ * add_unless_loop - call add_unless in a loop
+ *
+ * @lockcnt : pointer to the lockref structure
+ * @value : value to be added
+ * @threshold: threshold value for acquiring the lock
+ * @loopcnt : loop count
+ * Return : 1 if operation succeeds, 0 otherwise
+ */
+static noinline int
+add_unless_loop(struct lockref *lockcnt, int value, int threshold, int loopcnt)
+{
+ int ret;
+
+ if (threshold >= 0) {
+ for (; loopcnt > 0; loopcnt--) {
+ ret = add_unless(lockcnt, value, threshold);
+ if (ret > 0)
+ return 1;
+ else if (ret < 0)
+ return 0;
+ cpu_relax();
+ }
+ } else {
+ for (; loopcnt > 0; loopcnt--) {
+ if (add_unless(lockcnt, value, -1) > 0)
+ return 1;
+ cpu_relax();
+ }
+ }
+ return 0;
+}
+
+/**
+ *
+ * lockref_add_unless - atomically add to count unless locked or threshold reached
+ *
+ * @lockcnt : pointer to the lockref structure
+ * @value : value to be added
+ * @threshold: threshold value for acquiring the lock
+ * Return : 1 if operation succeeds, 0 otherwise
+ *
+ * The reason for separating out the first lockless update attempt from the
+ * rest is that the gcc compiler seems to be less able to optimize complex
+ * operations in a loop. So we try it once; if it doesn't work, we try the
+ * remaining attempts in a separate slowpath function.
+ */
+static __always_inline int
+lockref_add_unless(struct lockref *lockcnt, int value, int threshold)
+{
+ int ret;
+
+ /*
+ * Code doesn't work if raw spinlock is larger than 4 bytes
+ * or is empty.
+ */
+ BUILD_BUG_ON((sizeof(arch_spinlock_t) == 0) ||
+ (sizeof(arch_spinlock_t) > 4));
+
+ /*
+ * Wait until the lock is free before attempting to do a lockless
+ * reference count update.
+ */
+ while (spin_is_locked(&lockcnt->lock))
+ cpu_relax();
+
+ ret = add_unless(lockcnt, value, threshold);
+ if (likely(ret > 0))
+ return 1;
+ if (unlikely((ret == 0) && (LOCKREF_RETRY_COUNT > 1))) {
+ cpu_relax();
+ if (add_unless_loop(lockcnt, value, threshold,
+ LOCKREF_RETRY_COUNT - 1))
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * Struct lockref helper functions
+ */
+/**
+ * lockref_get - Increments reference count unconditionally
+ * @lockcnt: pointer to struct lockref structure
+ */
+void lockref_get(struct lockref *lockcnt)
+{
+ if (likely(lockref_add_unless(lockcnt, 1, -1)))
+ return;
+ spin_lock(&lockcnt->lock);
+ lockcnt->refcnt++;
+ spin_unlock(&lockcnt->lock);
+}
+EXPORT_SYMBOL(lockref_get);
+
+/**
+ * lockref_get_not_zero - Increments count unless the count is 0
+ * @lockcnt: pointer to struct lockref structure
+ * Return: 1 if count updated successfully or 0 if count is 0 and lock taken
+ */
+int lockref_get_not_zero(struct lockref *lockcnt)
+{
+ return lockref_add_unless(lockcnt, 1, 0);
+}
+EXPORT_SYMBOL(lockref_get_not_zero);
+
+/**
+ * lockref_put - Decrements count unless the count <= 1
+ * @lockcnt: pointer to struct lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1
+ */
+int lockref_put(struct lockref *lockcnt)
+{
+ return lockref_add_unless(lockcnt, -1, 1);
+}
+EXPORT_SYMBOL(lockref_put);
+
+/**
+ * lockref_put_or_lock - Decrements count unless the count is <= 1
+ * otherwise, the lock will be taken
+ * @lockcnt: pointer to struct lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
+ */
+int
+lockref_put_or_lock(struct lockref *lockcnt)
+{
+ if (likely(lockref_add_unless(lockcnt, -1, 1)))
+ return 1;
+ spin_lock(&lockcnt->lock);
+ return 0;
+}
+EXPORT_SYMBOL(lockref_put_or_lock);
+#endif /* CONFIG_SPINLOCK_REFCOUNT */
--
1.7.1

2013-08-06 03:13:23

by Waiman Long

Subject: [PATCH v7 4/4] dcache: Enable lockless update of dentry's refcount

The current code takes the dentry's d_lock spinlock whenever the refcnt
is being updated. In reality, nothing big really happens until refcnt
goes to 0 in dput(). So it is not necessary to take the lock if the
reference count won't go to 0. On the other hand, there are cases
where refcnt should not be updated, or was not expected to be updated,
while d_lock was acquired by another thread.

This patch changes the code in dput(), dget(), __dget() and
dget_parent() to use lockless reference count update function calls.

This patch has a particularly big impact on the short workload of the
AIM7 benchmark with a ramdisk filesystem. The table below shows the
improvement in JPM (jobs per minute) throughput due to this patch on
an 8-socket 80-core x86-64 system with a 3.11-rc3 kernel in 1/2/4/8
node configurations, using numactl to restrict the execution of the
workload to certain nodes.

+-----------------+----------------+-----------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | Rate w/o patch | Rate with patch | |
+-----------------+---------------------------------------------+
| | User Range 10 - 100 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1760523 | 4225737 | +140.0% |
| 4 nodes, HT off | 2020076 | 3206202 | +58.7% |
| 2 nodes, HT off | 2391359 | 2654701 | +11.0% |
| 1 node , HT off | 2302912 | 2302433 | 0.0% |
+-----------------+---------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1078421 | 7380760 | +584.4% |
| 4 nodes, HT off | 1371040 | 4212007 | +207.2% |
| 2 nodes, HT off | 2844720 | 2783442 | -2.2% |
| 1 node , HT off | 2433443 | 2415590 | -0.7% |
+-----------------+---------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1055626 | 7118985 | +574.4% |
| 4 nodes, HT off | 1352329 | 4512914 | +233.7% |
| 2 nodes, HT off | 2793037 | 2758652 | -1.2% |
| 1 node , HT off | 2458125 | 2445069 | -0.5% |
+-----------------+----------------+-----------------+----------+

With 4 nodes and above, there is a significant performance improvement
with this patch. With only 1 or 2 nodes, the performance is very close.
Because of the variability of the AIM7 benchmark, a difference of a few
percent may not indicate a real performance gain or loss.

A perf call-graph report of the short workload at 1500 users
without the patch on the same 8-node machine indicates that about
79% of the workload's total time was spent in the _raw_spin_lock()
function, almost all of which can be attributed to the following 2
kernel functions:
1. dget_parent (49.92%)
2. dput (49.84%)

The relevant perf report lines are:
+ 78.76% reaim [kernel.kallsyms] [k] _raw_spin_lock
+ 0.05% reaim [kernel.kallsyms] [k] dput
+ 0.01% reaim [kernel.kallsyms] [k] dget_parent

With this patch installed, the new perf report lines are:
+ 19.66% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
+ 2.46% reaim [kernel.kallsyms] [k] _raw_spin_lock
+ 2.23% reaim [kernel.kallsyms] [k] lockref_get_not_zero
+ 0.50% reaim [kernel.kallsyms] [k] dput
+ 0.32% reaim [kernel.kallsyms] [k] lockref_put_or_lock
+ 0.30% reaim [kernel.kallsyms] [k] lockref_get
+ 0.01% reaim [kernel.kallsyms] [k] dget_parent

- 2.46% reaim [kernel.kallsyms] [k] _raw_spin_lock
- _raw_spin_lock
+ 23.89% sys_getcwd
+ 23.60% d_path
+ 8.01% prepend_path
+ 5.18% complete_walk
+ 4.21% __rcu_process_callbacks
+ 3.08% inet_twsk_schedule
+ 2.36% do_anonymous_page
+ 2.24% unlazy_walk
+ 2.02% sem_lock
+ 1.82% process_backlog
+ 1.62% selinux_inode_free_security
+ 1.54% task_rq_lock
+ 1.45% unix_dgram_sendmsg
+ 1.18% enqueue_to_backlog
+ 1.06% unix_stream_sendmsg
+ 0.94% tcp_v4_rcv
+ 0.87% unix_create1
+ 0.71% scheduler_tick
+ 0.60% unix_release_sock
+ 0.59% do_wp_page
+ 0.59% unix_stream_recvmsg
+ 0.58% handle_pte_fault
+ 0.57% __do_fault
+ 0.53% unix_peer_get

The dput() and dget_parent() functions didn't show up in the
_raw_spin_lock callers at all.

The impact of this patch on other AIM7 workloads was much more
modest. Besides short, the other AIM7 workload that showed consistent
improvement is the high_systime workload. For the other workloads,
the changes were so minor that there is no significant difference
with and without the patch.

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| high_systime | +0.1% | +1.1% | +3.4% |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 26 ++++++++++++++++++--------
include/linux/dcache.h | 7 ++-----
2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3adb6aa..9a4cf30 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -513,9 +513,15 @@ void dput(struct dentry *dentry)
return;

repeat:
- if (d_count(dentry) == 1)
- might_sleep();
- spin_lock(&dentry->d_lock);
+ if (d_count(dentry) > 1) {
+ if (lockref_put_or_lock(&dentry->d_lockcnt))
+ return;
+ /* dentry's lock taken */
+ } else {
+ if (d_count(dentry) == 1)
+ might_sleep();
+ spin_lock(&dentry->d_lock);
+ }
BUG_ON(!d_count(dentry));
if (d_count(dentry) > 1) {
dentry->d_lockcnt.refcnt--;
@@ -611,26 +617,30 @@ static inline void __dget_dlock(struct dentry *dentry)

static inline void __dget(struct dentry *dentry)
{
- spin_lock(&dentry->d_lock);
- __dget_dlock(dentry);
- spin_unlock(&dentry->d_lock);
+ lockref_get(&dentry->d_lockcnt);
}

struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;

+ rcu_read_lock();
+ ret = rcu_dereference(dentry->d_parent);
+ if (lockref_get_not_zero(&ret->d_lockcnt)) {
+ rcu_read_unlock();
+ return ret;
+ }
repeat:
/*
* Don't need rcu_dereference because we re-check it was correct under
* the lock.
*/
- rcu_read_lock();
- ret = dentry->d_parent;
+ ret = ACCESS_ONCE(dentry->d_parent);
spin_lock(&ret->d_lock);
if (unlikely(ret != dentry->d_parent)) {
spin_unlock(&ret->d_lock);
rcu_read_unlock();
+ rcu_read_lock();
goto repeat;
}
rcu_read_unlock();
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 20e6f2e..ec9206e 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -367,11 +367,8 @@ static inline struct dentry *dget_dlock(struct dentry *dentry)

static inline struct dentry *dget(struct dentry *dentry)
{
- if (dentry) {
- spin_lock(&dentry->d_lock);
- dget_dlock(dentry);
- spin_unlock(&dentry->d_lock);
- }
+ if (dentry)
+ lockref_get(&dentry->d_lockcnt);
return dentry;
}

--
1.7.1

2013-08-06 03:13:41

by Waiman Long

Subject: [PATCH v7 3/4] dcache: replace d_lock/d_count by d_lockcnt

This patch replaces the d_lock and d_count fields of the dentry
data structure with the combined d_lockcnt structure. A d_lock macro
is defined to remap the old d_lock name to the new d_lockcnt.lock
name. This is needed because a lot of files use the d_lock spinlock.

Read accesses to d_count are replaced by the d_count() helper
function. Write accesses to d_count are replaced by the new
d_lockcnt.refcnt name. Other than that, there is no other functional
change in this patch.

The offset of the new d_lockcnt field is at byte 72 for 32-bit SMP
systems and byte 88 for 64-bit SMP systems. In both cases it is 8-byte
aligned, so combining the lock and the count into a single 8-byte word
does not introduce a hole that increases the size of the dentry
structure.

Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 54 ++++++++++++++++++++++++------------------------
fs/namei.c | 6 ++--
include/linux/dcache.h | 15 ++++++++----
3 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 87bdb53..3adb6aa 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -54,7 +54,7 @@
* - d_flags
* - d_name
* - d_lru
- * - d_count
+ * - d_lockcnt.refcnt
* - d_unhashed()
* - d_parent and d_subdirs
* - childrens' d_child and d_parent
@@ -229,7 +229,7 @@ static void __d_free(struct rcu_head *head)
*/
static void d_free(struct dentry *dentry)
{
- BUG_ON(dentry->d_count);
+ BUG_ON(d_count(dentry));
this_cpu_dec(nr_dentry);
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
@@ -467,7 +467,7 @@ relock:
}

if (ref)
- dentry->d_count--;
+ dentry->d_lockcnt.refcnt--;
/*
* inform the fs via d_prune that this dentry is about to be
* unhashed and destroyed.
@@ -513,12 +513,12 @@ void dput(struct dentry *dentry)
return;

repeat:
- if (dentry->d_count == 1)
+ if (d_count(dentry) == 1)
might_sleep();
spin_lock(&dentry->d_lock);
- BUG_ON(!dentry->d_count);
- if (dentry->d_count > 1) {
- dentry->d_count--;
+ BUG_ON(!d_count(dentry));
+ if (d_count(dentry) > 1) {
+ dentry->d_lockcnt.refcnt--;
spin_unlock(&dentry->d_lock);
return;
}
@@ -535,7 +535,7 @@ repeat:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);

- dentry->d_count--;
+ dentry->d_lockcnt.refcnt--;
spin_unlock(&dentry->d_lock);
return;

@@ -590,7 +590,7 @@ int d_invalidate(struct dentry * dentry)
* We also need to leave mountpoints alone,
* directory or not.
*/
- if (dentry->d_count > 1 && dentry->d_inode) {
+ if (d_count(dentry) > 1 && dentry->d_inode) {
if (S_ISDIR(dentry->d_inode->i_mode) || d_mountpoint(dentry)) {
spin_unlock(&dentry->d_lock);
return -EBUSY;
@@ -606,7 +606,7 @@ EXPORT_SYMBOL(d_invalidate);
/* This must be called with d_lock held */
static inline void __dget_dlock(struct dentry *dentry)
{
- dentry->d_count++;
+ dentry->d_lockcnt.refcnt++;
}

static inline void __dget(struct dentry *dentry)
@@ -634,8 +634,8 @@ repeat:
goto repeat;
}
rcu_read_unlock();
- BUG_ON(!ret->d_count);
- ret->d_count++;
+ BUG_ON(!d_count(ret));
+ ret->d_lockcnt.refcnt++;
spin_unlock(&ret->d_lock);
return ret;
}
@@ -718,7 +718,7 @@ restart:
spin_lock(&inode->i_lock);
hlist_for_each_entry(dentry, &inode->i_dentry, d_alias) {
spin_lock(&dentry->d_lock);
- if (!dentry->d_count) {
+ if (!d_count(dentry)) {
__dget_dlock(dentry);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
@@ -734,7 +734,7 @@ EXPORT_SYMBOL(d_prune_aliases);

/*
* Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
+ * Requires dentry->d_lock is held, and dentry->d_lockcnt.refcnt == 0.
* Releases dentry->d_lock.
*
* This may fail if locks cannot be acquired no problem, just try again.
@@ -764,8 +764,8 @@ static void try_prune_one_dentry(struct dentry *dentry)
dentry = parent;
while (dentry) {
spin_lock(&dentry->d_lock);
- if (dentry->d_count > 1) {
- dentry->d_count--;
+ if (d_count(dentry) > 1) {
+ dentry->d_lockcnt.refcnt--;
spin_unlock(&dentry->d_lock);
return;
}
@@ -793,7 +793,7 @@ static void shrink_dentry_list(struct list_head *list)
* the LRU because of laziness during lookup. Do not free
* it - just keep it off the LRU list.
*/
- if (dentry->d_count) {
+ if (d_count(dentry)) {
dentry_lru_del(dentry);
spin_unlock(&dentry->d_lock);
continue;
@@ -913,7 +913,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
dentry_lru_del(dentry);
__d_shrink(dentry);

- if (dentry->d_count != 0) {
+ if (d_count(dentry) != 0) {
printk(KERN_ERR
"BUG: Dentry %p{i=%lx,n=%s}"
" still in use (%d)"
@@ -922,7 +922,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
dentry->d_inode ?
dentry->d_inode->i_ino : 0UL,
dentry->d_name.name,
- dentry->d_count,
+ d_count(dentry),
dentry->d_sb->s_type->name,
dentry->d_sb->s_id);
BUG();
@@ -933,7 +933,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
list_del(&dentry->d_u.d_child);
} else {
parent = dentry->d_parent;
- parent->d_count--;
+ parent->d_lockcnt.refcnt--;
list_del(&dentry->d_u.d_child);
}

@@ -981,7 +981,7 @@ void shrink_dcache_for_umount(struct super_block *sb)

dentry = sb->s_root;
sb->s_root = NULL;
- dentry->d_count--;
+ dentry->d_lockcnt.refcnt--;
shrink_dcache_for_umount_subtree(dentry);

while (!hlist_bl_empty(&sb->s_anon)) {
@@ -1147,7 +1147,7 @@ resume:
* loop in shrink_dcache_parent() might not make any progress
* and loop forever.
*/
- if (dentry->d_count) {
+ if (d_count(dentry)) {
dentry_lru_del(dentry);
} else if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
dentry_lru_move_list(dentry, dispose);
@@ -1269,7 +1269,7 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
smp_wmb();
dentry->d_name.name = dname;

- dentry->d_count = 1;
+ dentry->d_lockcnt.refcnt = 1;
dentry->d_flags = 0;
spin_lock_init(&dentry->d_lock);
seqcount_init(&dentry->d_seq);
@@ -1970,7 +1970,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
goto next;
}

- dentry->d_count++;
+ dentry->d_lockcnt.refcnt++;
found = dentry;
spin_unlock(&dentry->d_lock);
break;
@@ -2069,7 +2069,7 @@ again:
spin_lock(&dentry->d_lock);
inode = dentry->d_inode;
isdir = S_ISDIR(inode->i_mode);
- if (dentry->d_count == 1) {
+ if (d_count(dentry) == 1) {
if (!spin_trylock(&inode->i_lock)) {
spin_unlock(&dentry->d_lock);
cpu_relax();
@@ -2937,7 +2937,7 @@ resume:
}
if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
dentry->d_flags |= DCACHE_GENOCIDE;
- dentry->d_count--;
+ dentry->d_lockcnt.refcnt--;
}
spin_unlock(&dentry->d_lock);
}
@@ -2945,7 +2945,7 @@ resume:
struct dentry *child = this_parent;
if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
this_parent->d_flags |= DCACHE_GENOCIDE;
- this_parent->d_count--;
+ this_parent->d_lockcnt.refcnt--;
}
this_parent = try_to_ascend(this_parent, locked, seq);
if (!this_parent)
diff --git a/fs/namei.c b/fs/namei.c
index 8b61d10..28e5152 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -536,8 +536,8 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry)
* a reference at this point.
*/
BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
- BUG_ON(!parent->d_count);
- parent->d_count++;
+ BUG_ON(!d_count(parent));
+ parent->d_lockcnt.refcnt++;
spin_unlock(&dentry->d_lock);
}
spin_unlock(&parent->d_lock);
@@ -3327,7 +3327,7 @@ void dentry_unhash(struct dentry *dentry)
{
shrink_dcache_parent(dentry);
spin_lock(&dentry->d_lock);
- if (dentry->d_count == 1)
+ if (d_count(dentry) == 1)
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index b90337c..20e6f2e 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -9,6 +9,7 @@
#include <linux/seqlock.h>
#include <linux/cache.h>
#include <linux/rcupdate.h>
+#include <linux/spinlock_refcount.h>

struct nameidata;
struct path;
@@ -112,8 +113,7 @@ struct dentry {
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */

/* Ref lookup also touches following */
- unsigned int d_count; /* protected by d_lock */
- spinlock_t d_lock; /* per dentry lock */
+ struct lockref d_lockcnt; /* per dentry lock & count */
const struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
unsigned long d_time; /* used by d_revalidate */
@@ -132,6 +132,11 @@ struct dentry {
};

/*
+ * Define macros to access the name-changed spinlock
+ */
+#define d_lock d_lockcnt.lock
+
+/*
* dentry->d_lock spinlock nesting subclasses:
*
* 0: normal
@@ -318,7 +323,7 @@ static inline int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)
assert_spin_locked(&dentry->d_lock);
if (!read_seqcount_retry(&dentry->d_seq, seq)) {
ret = 1;
- dentry->d_count++;
+ dentry->d_lockcnt.refcnt++;
}

return ret;
@@ -326,7 +331,7 @@ static inline int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)

static inline unsigned d_count(const struct dentry *dentry)
{
- return dentry->d_count;
+ return dentry->d_lockcnt.refcnt;
}

/* validate "insecure" dentry pointer */
@@ -356,7 +361,7 @@ extern char *dentry_path(struct dentry *, char *, int);
static inline struct dentry *dget_dlock(struct dentry *dentry)
{
if (dentry)
- dentry->d_count++;
+ dentry->d_lockcnt.refcnt++;
return dentry;
}

--
1.7.1

2013-08-06 03:14:11

by Waiman Long

Subject: [PATCH v7 2/4] spinlock: Enable x86 architecture to do lockless refcount update

This patch enables the x86 architecture to do lockless reference
count update using the generic lockref implementation with default
parameters. Only the x86/Kconfig file needs to be changed.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b32ebf9..79a9309 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -262,6 +262,9 @@ config ARCH_CPU_PROBE_RELEASE
config ARCH_SUPPORTS_UPROBES
def_bool y

+config GENERIC_SPINLOCK_REFCOUNT
+ def_bool y
+
source "init/Kconfig"
source "kernel/Kconfig.freezer"

--
1.7.1

2013-08-13 18:03:12

by Waiman Long

Subject: Re: [PATCH v7 0/4] Lockless update of reference count protected by spinlock

On 08/05/2013 11:12 PM, Waiman Long wrote:
> v6->v7:
> - Substantially reduce the number of patches from 14 to 4 because a
> lot of the minor filesystem related changes had been merged to
> v3.11-rc1.
> - Remove architecture specific customization (LOCKREF_WAIT_SHIFT &
> LOCKREF_RETRY_COUNT).
> - Tune single-thread performance of lockref_put/get to within 10%
> of old lock->update->unlock code.
>
> v5->v6:
> - Add a new GENERIC_SPINLOCK_REFCOUNT config parameter for using the
> generic implementation.
> - Add two parameters LOCKREF_WAIT_SHIFT and LOCKREF_RETRY_COUNT which
> can be specified differently for each architecture.
> - Update various spinlock_refcount.* files to incorporate review
> comments.
> - Replace references to the d_refcount() macro in the Lustre filesystem
> code in the staging tree with the new d_count() helper function.
>
> v4->v5:
> - Add a d_count() helper for readonly access of reference count and
> change all references to d_count outside of dcache.c, dcache.h
> and namei.c to use d_count().
>
> v3->v4:
> - Replace helper function access to d_lock and d_count by using
> macros to redefine the old d_lock name to the spinlock and new
> d_refcount name to the reference count. This greatly reduces the
> size of this patchset from 25 to 12 and makes it easier to review.
>
> v2->v3:
> - Completely revamp the packaging by adding a new lockref data
> structure that combines the spinlock with the reference
> count. Helper functions are also added to manipulate the new data
> structure. That results in modifying over 50 files, but the changes
> were trivial in most of them.
> - Change initial spinlock wait to use a timeout.
> - Force 64-bit alignment of the spinlock & reference count structure.
> - Add a new way to use the combo by using a new union and helper
> functions.
>
> v1->v2:
> - Add one more layer of indirection to LOCK_WITH_REFCOUNT macro.
> - Add __LINUX_SPINLOCK_REFCOUNT_H protection to spinlock_refcount.h.
> - Add some generic get/put macros into spinlock_refcount.h.
>
> This patchset supports a generic mechanism to atomically update
> a reference count that is protected by a spinlock without actually
> acquiring the lock itself. If the update doesn't succeed, the caller
> will have to acquire the lock and update the reference count in the
> old way. This will help in situations where there is a lot of
> spinlock contention because of frequent reference count updates.
>
> The d_lock and d_count fields of the struct dentry in dcache.h were
> modified to use the new lockref data structure, and the d_lock name
> is now a macro that maps to the actual spinlock.
>
> This patch set yields a significant performance improvement in the
> short workload of the AIM7 benchmark on an 8-socket x86-64 machine
> with 80 cores.
>
> Thanks to Thomas Gleixner, Andi Kleen and Linus for their valuable
> input in shaping this patchset.
>
> Signed-off-by: Waiman Long <[email protected]>
>
> Waiman Long (4):
> spinlock: A new lockref structure for lockless update of refcount
> spinlock: Enable x86 architecture to do lockless refcount update
> dcache: replace d_lock/d_count by d_lockcnt
> dcache: Enable lockless update of dentry's refcount
>
> arch/x86/Kconfig | 3 +
> fs/dcache.c | 78 +++++++------
> fs/namei.c | 6 +-
> include/asm-generic/spinlock_refcount.h | 46 +++++++
> include/linux/dcache.h | 22 ++--
> include/linux/spinlock_refcount.h | 126 ++++++++++++++++++++
> kernel/Kconfig.locks | 15 +++
> lib/Makefile | 2 +
> lib/spinlock_refcount.c | 198 +++++++++++++++++++++++++++++++
> 9 files changed, 449 insertions(+), 47 deletions(-)
> create mode 100644 include/asm-generic/spinlock_refcount.h
> create mode 100644 include/linux/spinlock_refcount.h
> create mode 100644 lib/spinlock_refcount.c

So far, I haven't heard back anything about whether further changes or
improvements are needed for this patch set. Any comments or feedback
will be appreciated.

Thanks in advance for your time.

Regards,
Longman

2013-08-29 01:40:08

by Linus Torvalds

Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

Just FYI: I've merged two preparatory patches in my tree for the whole
lockref thing. Instead of applying your four patches as-is during the
merge window, I ended up writing two patches that introduce the
concept and use it in the dentry code *without* introducing any of the
new semantics yet.

Waiman, I attributed the patches to you, even if they don't actually
look much like any of the patches you sent out. And because I was
trying very hard to make sure that no actual semantics changed, my
version doesn't have the dget_parent() lockless update code, for
example. I literally just did a search-and-replace of "->d_count" with
"->d_lockref.count" and then I fixed up a few things by hand (undid
one replacement in a comment, and used the helper functions where they
were semantically identical).

You don't have to rewrite your patches if you don't want to, I'm
planning on cherry-picking the actual code changes during the merge
window.

Linus

2013-08-29 04:46:13

by Benjamin Herrenschmidt

Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Wed, 2013-08-28 at 18:40 -0700, Linus Torvalds wrote:
> Just FYI: I've merged two preparatory patches in my tree for the whole
> lockref thing. Instead of applying your four patches as-is during the
> merge window, I ended up writing two patches that introduce the
> concept and use it in the dentry code *without* introducing any of the
> new semantics yet.
>
> Waiman, I attributed the patches to you, even if they don't actually
> look much like any of the patches you sent out. And because I was
> trying very hard to make sure that no actual semantics changed, my
> version doesn't have the dget_parent() lockless update code, for
> example. I literally just did a search-and-replace of "->d_count" with
> "->d_lockref.count" and then I fixed up a few things by hand (undid
> one replacement in a comment, and used the helper functions where they
> were semantically identical).
>
> You don't have to rewrite your patches if you don't want to, I'm
> planning on cherry-picking the actual code changes during the merge
> window.

I've somewhat lost track of this; will I need some arch support for
powerpc?

Cheers,
Ben.

2013-08-29 07:00:22

by Ingo Molnar

Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount


* Benjamin Herrenschmidt <[email protected]> wrote:

> On Wed, 2013-08-28 at 18:40 -0700, Linus Torvalds wrote:
> > Just FYI: I've merged two preparatory patches in my tree for the whole
> > lockref thing. Instead of applying your four patches as-is during the
> > merge window, I ended up writing two patches that introduce the
> > concept and use it in the dentry code *without* introducing any of the
> > new semantics yet.
> >
> > Waiman, I attributed the patches to you, even if they don't actually
> > look much like any of the patches you sent out. And because I was
> > trying very hard to make sure that no actual semantics changed, my
> > version doesn't have the dget_parent() lockless update code, for
> > example. I literally just did a search-and-replace of "->d_count" with
> > "->d_lockref.count" and then I fixed up a few things by hand (undid
> > one replacement in a comment, and used the helper functions where they
> > were semantically identical).
> >
> > You don't have to rewrite your patches if you don't want to, I'm
> > planning on cherry-picking the actual code changes during the merge
> > window.
>
> I've somewhat lost track of this, will I need some arch support for
> powerpc ?

Lockrefs are combined spinlock+count objects that fit into a
MESI-cacheline and can be accessed via the cmpxchg8b() primitives and
allow smart combined operations on the count field without necessarily
taking the lock.

So if an architecture meets the assumptions of the generic lockref code
(spinlock + a u32 fits in an aligned cacheline, has the cmpxchg8b()
primitive, lockdep is off, etc.) then it needs no changes.
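
In other words, the fast path boils down to something like this
(illustration only, reusing the struct lockref layout from Waiman's
patch 1; the helper name here is made up):

static inline int lockref_get_cmpxchg(struct lockref *lr)
{
	struct lockref old, new;

	old.lock_count = ACCESS_ONCE(lr->lock_count);

	/* Give up (and let the caller take the spinlock) if it is held */
	if (spin_is_locked(&old.lock))
		return 0;

	new.lock_count = old.lock_count;
	new.refcnt++;

	/* Succeeds only if neither the lock nor the count changed meanwhile */
	return cmpxchg64(&lr->lock_count, old.lock_count,
			 new.lock_count) == old.lock_count;
}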

You won't see these arch requirements from Linus's current patches yet,
but the followup changes that actually add the optimization should make
this clear.

Thanks,

Ingo

2013-08-29 15:20:55

by Waiman Long

Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/28/2013 09:40 PM, Linus Torvalds wrote:
> Just FYI: I've merged two preparatory patches in my tree for the whole
> lockref thing. Instead of applying your four patches as-is during the
> merge window, I ended up writing two patches that introduce the
> concept and use it in the dentry code *without* introducing any of the
> new semantics yet.
>
> Waiman, I attributed the patches to you, even if they don't actually
> look much like any of the patches you sent out. And because I was
> trying very hard to make sure that no actual semantics changed, my
> version doesn't have the dget_parent() lockless update code, for
> example. I literally just did a search-and-replace of "->d_count" with
> "->d_lockref.count" and then I fixed up a few things by hand (undid
> one replacement in a comment, and used the helper functions where they
> were semantically identical).
>
> You don't have to rewrite your patches if you don't want to, I'm
> planning on cherry-picking the actual code changes during the merge
> window.
>
> Linus

Thanks for merging the 2 preparatory patches for me. I will rebase my
patches with the latest linux git tree. A new v8 patch set will be sent
out sometime next week. I am looking forward to the v3.12 merge window
which I think is coming soon.

Cheers,
Longman

2013-08-29 16:43:10

by Linus Torvalds

Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, Aug 29, 2013 at 12:00 AM, Ingo Molnar <[email protected]> wrote:
>
> Lockrefs are combiend spinlock+count objects that fit into a
> MESI-cacheline and can be accessed via the cmpxchg8b() primitives and
> allow smart combined operations on the count field without necessarily
> taking the lock.

Side note: I'm going to finally build a new desktop, and made sure
that I got a CPU that supports TSX. I'm not convinced transactional
memory ends up being usable in general without hardware support for
predicting transaction success, but using transactions to do small
specialized things like this may well be worth it. Especially since
the cmpxchg approach has some issues.

We'll see. The real problem is that I'm not sure if I can even see the
scalability issue on any machine I actually personally want to use
(read: silent). On my current system I can only get up to 15%
_raw_spin_lock by just stat'ing the same file over and over and over
again from lots of threads.

Linus

2013-08-29 19:25:15

by Linus Torvalds

Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, Aug 29, 2013 at 9:43 AM, Linus Torvalds
<[email protected]> wrote:
>
> We'll see. The real problem is that I'm not sure if I can even see the
> scalability issue on any machine I actually personally want to use
> (read: silent). On my current system I can only get up to 15%
> _raw_spin_lock by just stat'ing the same file over and over and over
> again from lots of threads.

Hmm. I can see it, but it turns out that for normal pathname walking,
one of the main stumbling blocks is the RCU case of complete_walk(),
which cannot be done with the lockless lockref model.

Why? It needs to check the sequence count too and cannot touch the
refcount unless it matches under the spinlock. We could use
lockref_get_non_zero(), but for the final path component (which this
is) the zero refcount is actually a common case.

Waiman worked around this by having some rather complex code to retry
and wait for the dentry lock to be released in his lockref code. But
that has a lot of tuning implications, and I wanted to see what it is
*without* that kind of tuning. And that's when you hit the "lockless
case fails all the time because the lock is actually held" case.

I'm going to play around with changing the semantics of
"lockref_get_non_zero()" to match "lockref_put_or_lock()":
instead of failing when the count is zero, it gets the lock. That
won't generally see any contention, because if the count is zero,
there generally isn't anybody else playing with that dentry.
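
Ignoring the lockless fast path, the semantics I have in mind look
roughly like this (a sketch only - the function name is made up, and
the count field follows the d_lockref.count naming rather than
Waiman's refcnt):

static int lockref_get_or_lock_sketch(struct lockref *lockref)
{
	/* (a lockless cmpxchg attempt would go here first) */
	spin_lock(&lockref->lock);
	if (!lockref->count)
		return 0;	/* count was zero: return with the lock held */
	lockref->count++;
	spin_unlock(&lockref->lock);
	return 1;		/* got a reference, lock not held */
}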

Linus

2013-08-29 23:42:39

by Linus Torvalds

Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, Aug 29, 2013 at 12:25 PM, Linus Torvalds
<[email protected]> wrote:
>
> Hmm. I can see it, but it turns out that for normal pathname walking,
> one of the main stumbling blocks is the RCU case of complete_walk(),
> which cannot be done with the lockless lockref model.
> [... snip ...]

Ok, here's a patch for people to try out if they want to. It's tested,
and works for me, but it is - like the two preparatory patches I
already applied - not really based on Waiman's patches except from a
conceptual standpoint.

For architecture people (ie Ben, if you want to try this on ppc64),
the thing that it needs from an architecture:

- the raw_spinlock_t and the "unsigned long" needs to fit in a u64.

This is normally true anyway, but if your architecture code has a
big spinlock, you can't use it.

- the architecture needs to support "cmpxchg()" on that "u64" type
(so x86-32 does *not* use this, although we could teach it to do so by
teaching x86-32 to use "cmpxchg8b" for the 8-byte case)

A 64-bit architecture needs to support cmpxchg() on an u64 anyway,
so this is always true on 64-bit. It _can_ be true on 32-bit
architectures too.

- the architecture needs to implement a simple
"arch_spin_value_unlocked()" macro, which takes a raw_spinlock_t value
and says whether it is unlocked or not.

This is a new macro/function. It's generally a one-liner. For the
x86 ticket locks, for example, the test is simply "lock.tickets.head
== lock.tickets.tail".

- the architecture code needs to let the generic code know that it
supports all this by doing a "select ARCH_USE_CMPXCHG_LOCKREF"

Add it to your Kconfig file in the appropriate place. You do *not*
need to worry about LOCKDEP etc, you only need to worry about your own
architecture details above.

That's it. If it does that, the lockref code will use the cmpxchg
optimization when appropriate (ie spinlock debugging is not turned on
etc etc).
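
As a concrete sketch of the arch_spin_value_unlocked() point above, the
x86 ticket-lock version would be a one-liner along these lines (the
exact mainline code may differ in detail); the only other arch change
is the ARCH_USE_CMPXCHG_LOCKREF select in the Kconfig:

static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
	/* Free when the next ticket to serve equals the next ticket handed out */
	return lock.tickets.head == lock.tickets.tail;
}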

For Waiman: your patch had that adaptive waiting thing, and "wait for
unlocked" code, and I threw all that away. I didn't like it, and the
only reason for it existing was that the spinlock could be taken in a
hot path, which I felt was against the whole point of this "lockref"
thing.

So I fixed the VFS layer instead. With dput() and friends using
lockrefs, the only thing remaining in the hot RCU dentry lookup path
was the nasty __d_rcu_to_refcount() thing in complete_walk(). I
rewrote that to locklessly increment the refcount when it was nonzero,
and get the lock if it was zero, and that all seems to work fine.

And once the only case that is relevant for the fast-path is "d_lock
is unlocked", all your games with waiting for the spinlock to be
released are unnecessary. Making everything much simpler. If the
spinlock isn't unlocked, we always kick out to the fallback case (with
real locking).

NOTE! My test-case was very small and simple, so it may not trigger
other cases that might trigger d_lock in a hotpath. Anything that
kicks us out of rcu mode (like a symlink, for example) will trigger
"unlazy_walk()", and I didn't do the same thing there. So there's
still details like that to sort out, but I very much think this whole
"only an unlocked spinlock is a fastpath" is the right approach.

My simple testing shows that this has about the same best-case
performance, and the 15% _raw_spin_lock load I was able to trigger is
totally gone. That doesn't make things *faster* for me (because the
cost of the cmpxchg is pretty much comparable to the cost of the
spinlocks), but the big difference is the contended behavior where we
don't actually have to wait for the spinlock, we can just locklessly
increment the counter.

I can't trigger the CPU-eating contention case on my single-socket
system, which is why I'm sending out this patch for testing (despite
it not having that unlazy_walk() thing etc).

Also note that this is one single patch, not split up. Again, that's
because what I'm really hoping to get is just "does this fix the
contention-case on the 80-core monster machine that I don't have
access to?"

Side note: the whole cmpxchg() loop is written to basically generate
the perfect cmpxchg sequence on x86. The assembly results actually
look pretty good. I can't take advantage of the eflags setting of the
instruction, because gcc inline asms don't support that (even with
"asm goto" - I'd need to have output values for that, and "asm goto"
does not allow that). So there's one extra "cmpq" instruction, and gcc
makes a mess of "add 1 to structure member in high bytes of a 64-bit
structure", but it's actually fairly readable and short assembly code,
which was *not* true of the original patches.

Waiman? Mind looking at this and testing?

Linus


Attachments:
patch.diff (8.37 kB)

2013-08-30 00:28:23

by Benjamin Herrenschmidt

Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, 2013-08-29 at 16:42 -0700, Linus Torvalds wrote:

> For architecture people (ie Ben, if you want to try this on ppc64),
> the thing that it needs from an architecture:
>
> - the raw_spinlock_t and the "unsigned long" needs to fit in a u64.

I assume you mean unsigned int ? :-)

.../...

> - the architecture needs to implement a simple
> "arch_spin_value_unlocked()" macro, which takes a raw_spinlock_t value
> and says whether it is unlocked or not.

What's wrong with the existing arch_spin_is_locked() ?

BTW. Do you have your test case at hand ?

Cheers,
Ben.

2013-08-30 00:49:49

by Linus Torvalds

Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, Aug 29, 2013 at 5:26 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
>
> I assume you mean unsigned int ? :-)

Oops, yes.

> What's wrong with the existing arch_spin_is_locked() ?

It takes a memory location. And we very much want to test the value we
loaded into a register.

And yes, gcc can do the right thing. But at least on x86,
arch_spin_is_locked() actually uses ACCESS_ONCE() to load the value
from the memory location, and I actually think that is the right thing
to do (or at least not incorrect). So the end result is that
arch_spin_value_unlocked() is actually fairly fundamentally different
from arch_spin_is_locked().

So I could have re-used arch_spin_is_locked() after having changed the
semantics of it, but I really didn't want to possibly change totally
unrelated users for this particular feature.

> BTW. Do you have your test case at hand ?

My test-case is a joke. It's explicitly *trying* to get as much
contention as possible on a dentry, by just starting up a lot of
threads that look up one single pathname (the same one for everybody).
It defaults to using /tmp for this, but you can specify the filename.

Note that directories, regular files and symlinks have fundamentally
different dentry lookup behavior:

- directories tend to have an elevated reference count (because they
have children). This was my primary test-case (because while I suspect
that there are crazy loads (and AIM7 may be one of them) that open the
same _regular_ file all concurrently, I don't think it's a "normal"
load). But opening the same directory concurrently as part of pathname
lookup is certainly normal.

- regular files tend to have a dentry count of zero unless they are
actively open, and the patch I sent out will take the dentry spinlock
for them when doing the final RCU finishing touches if that's the
case. So this one *will* still use the per-dentry spinlock rather than
the lockless refcount increments, but as outlined above I don't think
that should be a scalability issue unless you're crazy.

- symlink traversal causes us to drop out of RCU lookup mode, and thus
cause various slow-paths to happen. Some of that we can improve on,
but I suspect it will cause the lockless refcount paths to take a hit
too.

Anyway, I'm attaching my completely mindless test program. It has
hacky things like "unsigned long count[MAXTHREADS][32]" which are
purely to just spread out the counts so that they aren't in the same
cacheline etc.
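
(For readers without the attachment, a rough re-creation of the kind of test
being described - the real t.c differs in details, and everything below is a
guess at its shape:)

#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAXTHREADS 16

static const char *path = "/tmp";
static volatile int stop;
/* [32] pads each counter out so the threads don't share a cacheline. */
static unsigned long count[MAXTHREADS][32];

static void *worker(void *arg)
{
	long idx = (long)arg;
	struct stat st;

	while (!stop) {
		stat(path, &st);	/* one pathname lookup on one dentry */
		count[idx][0]++;
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[MAXTHREADS];
	unsigned long total = 0;

	if (argc > 1)
		path = argv[1];
	for (long i = 0; i < MAXTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	sleep(10);
	stop = 1;
	for (long i = 0; i < MAXTHREADS; i++) {
		pthread_join(tid[i], NULL);
		total += count[i][0];
	}
	printf("Total loops: %lu\n", total);
	return 0;
}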

Also note that the performance numbers it spits out depend a lot on
things like how long the dcache hash chains etc are, so they are not
really reliable. Running the test-program right after reboot when the
dentries haven't been populated can result in much higher numbers -
without that having anything to do with contention or locking at all.

Linus


Attachments:
t.c (0.98 kB)

2013-08-30 02:06:24

by Michael Neuling

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

> Anyway, I'm attaching my completely mindless test program. It has
> hacky things like "unsigned long count[MAXTHREADS][32]" which are
> purely to just spread out the counts so that they aren't in the same
> cacheline etc.
>
> Also note that the performance numbers it spits out depend a lot on
> tings like how long the dcache hash chains etc are, so they are not
> really reliable. Running the test-program right after reboot when the
> dentries haven't been populated can result in much higher numbers -
> without that having anything to do with contention or locking at all.

Running on a POWER7 here with 32 threads (8 cores x 4 threads) I'm
getting some good improvements:

Without patch:
# ./t
Total loops: 3730618

With patch:
# ./t
Total loops: 16826271

The numbers move around about 10% from run to run. I didn't change your
program at all, so it's still running with MAXTHREADS 16.

powerpc patch below. I'm using arch_spin_is_locked() to implement
arch_spin_value_unlocked().

Mikey

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9cf59816d..4a3f86b 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -139,6 +139,7 @@ config PPC
select OLD_SIGSUSPEND
select OLD_SIGACTION if PPC32
select HAVE_DEBUG_STACKOVERFLOW
+ select ARCH_USE_CMPXCHG_LOCKREF

config EARLY_PRINTK
bool
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 5b23f91..65c25272 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -156,6 +156,11 @@ extern void arch_spin_unlock_wait(arch_spinlock_t *lock);
do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
#endif

+static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+ return !arch_spin_is_locked(&lock);
+}
+
/*
* Read-write spinlocks, allowing multiple readers
* but only one writer.

2013-08-30 02:31:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, Aug 29, 2013 at 7:06 PM, Michael Neuling <[email protected]> wrote:
>
> Running on a POWER7 here with 32 threads (8 cores x 4 threads) I'm
> getting some good improvements:

That's *much* better than I get. But I literally just have a single
socket with two cores (and HT, so four threads) in my test machine, so
I really have a hard time getting any real contention. And the main
advantage of the patch should be when you actually have CPU's spinning
on that dentry d_lock.

Also, on x86, there are no advantages to cmpxchg over a spinlock -
they are both exactly one equally serializing instruction. If
anything, cmpxchg is worse due to having a cache read before the
write, and a few cycles slower anyway. So I actually expect the x86
code to slow down a tiny bit for the single-threaded case, although
that should be hopefully unmeasurable.

On POWER, you may have much less serialization for the cmpxchg. That
may sadly be something we'll need to fix - the serialization between
getting a lockref and checking sequence counts etc may need some extra
work.

So it may be that you are seeing unrealistically good numbers, and
that we will need to add a memory barrier or two. On x86, due to the
locked instruction semantics, that just isn't an issue.

> The numbers move around about 10% from run to run.

Please note that the whole "dentry hash chains may be better for one
run vs another" thing is something that will _persist_ between
subsequent runs, so you may see "only 10% variability", but there may
be a bigger picture variability that you're not noticing because you
had to reboot in between.

To be really comparable, you should really run the stupid benchmark
after fairly equal boot up sequences. If the machine had been up for
several days for one set of numbers, and freshly rebooted for the
other, it can be a very unfair comparison.

(I long ago had a nice "L1 dentry cache" patch that helped with the
fact that the dentry chains *can* get long especially if you have tons
of memory, and that helped with this kind of variability a lot - and
improved performance too. It was slightly racy, though, which is why
it never got merged).

> powerpc patch below. I'm using arch_spin_is_locked() to implement
> arch_spin_value_unlocked().

Your "slock" is of type "volatile unsigned int slock", so it may well
cause those temporaries to be written to memory.

It probably doesn't matter, but you may want to check that the result
of "make lib/lockref.s" looks ok.

Linus

2013-08-30 02:32:33

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, 2013-08-30 at 12:06 +1000, Michael Neuling wrote:

> powerpc patch below. I'm using arch_spin_is_locked() to implement
> arch_spin_value_unlocked().

>
> +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
> +{
> + return !arch_spin_is_locked(&lock);
> +}
> +

Arguably, it should be done the other way around :-)

The semantics of arch_spin_value_unlocked() are to basically operate on an
already-read copy of the value, while arch_spin_is_locked() has ACCESS_ONCE()
semantics on *top* of that.

Or we can keep both completely separate like Linus does on x86.
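
(A minimal sketch of that layering - the type, the ACCESS_ONCE() stand-in and
the "0 means unlocked" encoding are assumptions for illustration, not the real
powerpc definitions:)

typedef struct { unsigned int slock; } arch_spinlock_t;

#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Primitive: operates on an already-read copy of the lock value. */
static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
	return lock.slock == 0;
}

/* Built on top of it, with the once-only load made explicit. */
static inline int arch_spin_is_locked(arch_spinlock_t *lock)
{
	arch_spinlock_t tmp;

	tmp.slock = ACCESS_ONCE(lock->slock);
	return !arch_spin_value_unlocked(tmp);
}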

Cheers,
Ben.

2013-08-30 02:35:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, Aug 29, 2013 at 7:30 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
>
> Or we can keep both completely separate like Linus does on x86.

I did it that way mainly to minimize the patch.

I agree with you that it probably makes sense to layer them the other
way around from what Michael's patch did, iow implement
arch_spin_is_locked() in terms of arch_spin_value_unlocked().

That said, on power, you have that "ACCESS_ONCE()" implicit in the
*type*, not in the code, so an "arch_spinlock_t" is fundamentally
volatile in itself. It's one of the reasons I despise "volatile":
things like volatility are _not_ attributes of a variable or a type,
but of the code in question. Something can be volatile in one context,
but not in another (one context might be locked, for example).

Linus

2013-08-30 02:44:26

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, 2013-08-29 at 19:31 -0700, Linus Torvalds wrote:

> Also, on x86, there are no advantages to cmpxchg over a spinlock -
> they are both exactly one equally serializing instruction. If
> anything, cmpxchg is worse due to having a cache read before the
> write, and a few cycles slower anyway. So I actually expect the x86
> code to slow down a tiny bit for the single-threaded case, although
> that should be hopefully unmeasurable.
>
> On POWER, you may have much less serialization for the cmpxchg. That
> may sadly be something we'll need to fix - the serialization between
> getting a lockref and checking sequence counts etc may need some extra
> work.

> So it may be that you are seeing unrealistically good numbers, and
> that we will need to add a memory barrier or two. On x86, due to the
> locked instruction semantics, that just isn't an issue.

Dunno, our cmpxchg has both acquire and release barriers. It basically
does release, xchg, then acquire. So it is equivalent to an unlock
followed by a lock.

> > The numbers move around about 10% from run to run.
>
> Please note that the whole "dentry hash chains may be better" for one
> run vs another, and that's something that will _persist_ between
> subsequent runs, so you may see "only 10% variability", but there may
> be a bigger picture variability that you're not noticing because you
> had to reboot in between.
>
> To be really comparable, you should really run the stupid benchmark
> after fairly equal boot up sequences. If the machine had been up for
> several days for one set of numbers, and freshly rebooted for the
> other, it can be a very unfair comparison.
>
> (I long ago had a nice "L1 dentry cache" patch that helped with the
> fact that the dentry chains *can* get long especially if you have tons
> of memory, and that helped with this kind of variability a lot - and
> improved performance too. It was slightly racy, though, which is why
> it never got merged).
>
> > powerpc patch below. I'm using arch_spin_is_locked() to implement
> > arch_spin_value_unlocked().
>
> Your "slock" is of type "volatile unsigned int slock", so it may well
> cause those temporaries to be written to memory.
>
> It probably doesn't matter, but you may want to check that the result
> of "make lib/lockref.s" looks ok.
>
> Linus

2013-08-30 02:46:04

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, 2013-08-29 at 19:35 -0700, Linus Torvalds wrote:
> That said, on power, you have that "ACCESS_ONCE()" implicit in the
> *type*, not in the code, so an "arch_spinlock_t" is fundamentally
> volatile in itself. It's one of the reasons I despise "volatile":
> things like volatility are _not_ attributes of a variable or a type,
> but of the code in question. Something can be volatile in one context,
> but not in another (one context might be locked, for example).

Right, we can probably change that to use ACCESS_ONCE()... volatile tends
to never quite do what you expect anyway.

Cheers,
Ben.

2013-08-30 03:13:05

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/29/2013 07:42 PM, Linus Torvalds wrote:
> Waiman? Mind looking at this and testing? Linus

Sure, I will try out the patch tomorrow morning and see how it works out
for my test case.

Regards,
Longman

2013-08-30 03:54:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, Aug 29, 2013 at 8:12 PM, Waiman Long <[email protected]> wrote:
> On 08/29/2013 07:42 PM, Linus Torvalds wrote:
>>
>> Waiman? Mind looking at this and testing? Linus
>
> Sure, I will try out the patch tomorrow morning and see how it works out for
> my test case.

Ok, thanks, please use this slightly updated patch attached here.

It improves on the previous version in actually handling the
"unlazy_walk()" case with native lockref handling, which means that
one other not entirely odd case (symlink traversal) avoids the d_lock
contention.

It also refactored the __d_rcu_to_refcount() to be more readable, and
adds a big comment about what the heck is going on. The old code was
clever, but I suspect not very many people could possibly understand
what it actually did. Plus it used nested spinlocks because it wanted
to avoid checking the sequence count twice. Which is stupid, since
nesting locks is how you get really bad contention, and the sequence
count check is really cheap anyway. Plus the nesting *really* didn't
work with the whole lockref model.
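
(The refactored helper itself is only in the attached patch, but the pattern
being described boils down to something like this user-space sketch - the
names and the sequence-count handling are made up, this is not the dcache
code: take the reference optimistically, then do one cheap sequence-count
check and back out if the object changed, instead of nesting two locks.)

#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	_Atomic unsigned int seq;	/* bumped on every modification */
	_Atomic unsigned int refcount;
};

/* 'seq_seen' was sampled earlier, during the lockless lookup. */
static bool grab_ref_if_unchanged(struct obj *o, unsigned int seq_seen)
{
	atomic_fetch_add(&o->refcount, 1);		/* optimistic get */
	if (atomic_load(&o->seq) != seq_seen) {		/* changed under us? */
		atomic_fetch_sub(&o->refcount, 1);	/* undo, caller retries */
		return false;
	}
	return true;
}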

With this, my stupid thread-lookup thing doesn't show any spinlock
contention even for the "look up symlink" case.

It also avoids the unnecessary aligned u64 for when we don't actually
use cmpxchg at all.

It's still one single patch, since I was working on lots of small
cleanups. I think it's pretty close to done now (assuming your testing
shows it performs fine - the powerpc numbers are promising, though),
so I'll split it up into proper chunks rather than random commit
points. But I'm done for today at least.

NOTE NOTE NOTE! My test coverage really has been pretty pitiful. You
may hit cases I didn't test. I think it should be *stable*, but maybe
there's some other d_lock case that your tuned waiting hid, and that
my "fastpath only for unlocked case" version ends up having problems
with.

Linus


Attachments:
patch.diff (13.27 kB)

2013-08-30 07:16:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount


* Linus Torvalds <[email protected]> wrote:

> > BTW. Do you have your test case at hand ?
>
> My test-case is a joke. It's explicitly *trying* to get as much
> contention as possible on a dentry, by just starting up a lot of threads
> that look up one single pathname (the same one for everybody). It
> defaults to using /tmp for this, but you can specify the filename.

Waiman's tests seemed to use sufficiently generic and varied workloads
(AIM7) and they showed pretty nice unconditional improvements with his
variant of this scheme, so I think testing with your simple testcase that
intentionally magnifies the scalability issue is 100% legit and may in
fact help tune the changes more accurately, because it has less inherent
noise.

And that was on an 80-core system. The speedup should be exponentially more
dramatic on silly large systems. A nicely parallel VFS isn't a bad thing
to have, especially on ridiculously loud hardware you want to run a
continent away from you.

Thanks,

Ingo

2013-08-30 07:55:29

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 5:54 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Aug 29, 2013 at 8:12 PM, Waiman Long <[email protected]> wrote:
>> On 08/29/2013 07:42 PM, Linus Torvalds wrote:
>>>
>>> Waiman? Mind looking at this and testing? Linus
>>
>> Sure, I will try out the patch tomorrow morning and see how it works out for
>> my test case.
>
> Ok, thanks, please use this slightly updated patch attached here.
>
> It improves on the previous version in actually handling the
> "unlazy_walk()" case with native lockref handling, which means that
> one other not entirely odd case (symlink traversal) avoids the d_lock
> contention.
>
> It also refactored the __d_rcu_to_refcount() to be more readable, and
> adds a big comment about what the heck is going on. The old code was
> clever, but I suspect not very many people could possibly understand
> what it actually did. Plus it used nested spinlocks because it wanted
> to avoid checking the sequence count twice. Which is stupid, since
> nesting locks is how you get really bad contention, and the sequence
> count check is really cheap anyway. Plus the nesting *really* didn't
> work with the whole lockref model.
>
> With this, my stupid thread-lookup thing doesn't show any spinlock
> contention even for the "look up symlink" case.
>
> It also avoids the unnecessary aligned u64 for when we don't actually
> use cmpxchg at all.
>
> It's still one single patch, since I was working on lots of small
> cleanups. I think it's pretty close to done now (assuming your testing
> shows it performs fine - the powerpc numbers are promising, though),
> so I'll split it up into proper chunks rather than random commit
> points. But I'm done for today at least.
>
> NOTE NOTE NOTE! My test coverage really has been pretty pitiful. You
> may hit cases I didn't test. I think it should be *stable*, but maybe
> there's some other d_lock case that your tuned waiting hid, and that
> my "fastpath only for unlocked case" version ends up having problems
> with.
>

Following this thread with half an eye... Was that "unsigned" issue
fixed (someone pointed it out)?
What is the subject line of that test patch?
I would like to test it on my SNB ultrabook with your test-case script.

- Sedat -

2013-08-30 08:10:39

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 9:55 AM, Sedat Dilek <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 5:54 AM, Linus Torvalds
> <[email protected]> wrote:
>> [...]
>
> Following this thread with half an eye... Was that "unsigned" stuff
> fixed (someone pointed to it).
> How do you call that test-patch (subject)?
> I would like to test it on my SNB ultrabook with your test-case script.
>

Can you explain why CONFIG_DEBUG_SPINLOCK=n (here: x86-64)?
( Will this be changed in further releases? )

# CONFIG_DEBUG_SPINLOCK is not set
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_CMPXCHG_LOCKREF=y

- Sedat -

2013-08-30 09:27:06

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 9:55 AM, Sedat Dilek <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 5:54 AM, Linus Torvalds
> <[email protected]> wrote:
>> [...]
>
> Following this thread with half an eye... Was that "unsigned" stuff
> fixed (someone pointed to it).
> How do you call that test-patch (subject)?
> I would like to test it on my SNB ultrabook with your test-case script.
>

Here on Ubuntu/precise v12.04.3 AMD64 I get these numbers for total loops:

lockref: w/o patch | w/ patch
======================
Run #1: 2.688.094 | 2.643.004
Run #2: 2.678.884 | 2.652.787
Run #3: 2.686.450 | 2.650.142
Run #4: 2.688.435 | 2.648.409
Run #5: 2.693.770 | 2.651.514

Average: 2687126,6 VS. 2649171,2 ( −37955,4 )

- Sedat -

2013-08-30 09:49:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount


* Sedat Dilek <[email protected]> wrote:

> On Fri, Aug 30, 2013 at 9:55 AM, Sedat Dilek <[email protected]> wrote:
> > On Fri, Aug 30, 2013 at 5:54 AM, Linus Torvalds
> > <[email protected]> wrote:
> >> [...]
> >
> > Following this thread with half an eye... Was that "unsigned" stuff
> > fixed (someone pointed to it).
> > How do you call that test-patch (subject)?
> > I would like to test it on my SNB ultrabook with your test-case script.
> >
>
> Here on Ubuntu/precise v12.04.3 AMD64 I get these numbers for total loops:
>
> lockref: w/o patch | w/ patch
> ======================
> Run #1: 2.688.094 | 2.643.004
> Run #2: 2.678.884 | 2.652.787
> Run #3: 2.686.450 | 2.650.142
> Run #4: 2.688.435 | 2.648.409
> Run #5: 2.693.770 | 2.651.514
>
> Average: 2687126,6 VS. 2649171,2 ( −37955,4 )

For precise stddev numbers you can run it like this:

perf stat --null --repeat 5 ./test

and it will measure time only and print the stddev in percentage:

Performance counter stats for './test' (5 runs):

1.001008928 seconds time elapsed ( +- 0.00% )

Thanks,

Ingo

2013-08-30 09:56:12

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 11:48 AM, Ingo Molnar <[email protected]> wrote:
>
> * Sedat Dilek <[email protected]> wrote:
>
>> On Fri, Aug 30, 2013 at 9:55 AM, Sedat Dilek <[email protected]> wrote:
>> > On Fri, Aug 30, 2013 at 5:54 AM, Linus Torvalds
>> > <[email protected]> wrote:
>> >> [...]
>> >
>> > Following this thread with half an eye... Was that "unsigned" stuff
>> > fixed (someone pointed to it).
>> > How do you call that test-patch (subject)?
>> > I would like to test it on my SNB ultrabook with your test-case script.
>> >
>>
>> Here on Ubuntu/precise v12.04.3 AMD64 I get these numbers for total loops:
>>
>> lockref: w/o patch | w/ patch
>> ======================
>> Run #1: 2.688.094 | 2.643.004
>> Run #2: 2.678.884 | 2.652.787
>> Run #3: 2.686.450 | 2.650.142
>> Run #4: 2.688.435 | 2.648.409
>> Run #5: 2.693.770 | 2.651.514
>>
>> Average: 2687126,6 VS. 2649171,2 ( −37955,4 )
>
> For precise stddev numbers you can run it like this:
>
> perf stat --null --repeat 5 ./test
>
> and it will measure time only and print the stddev in percentage:
>
> Performance counter stats for './test' (5 runs):
>
> 1.001008928 seconds time elapsed ( +- 0.00% )
>

Hi Ingo,

that sounds really good :-).

AFAICS 'make deb-pkg' has no support for building the linux-tools
Debian package, which is where perf is included.
Can I run an older version of perf, or do I have to try the one
shipped in the Linux v3.11-rc7+ sources?
How can I build perf standalone, out of my source tree?

- Sedat -

2013-08-30 09:58:26

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 11:56 AM, Sedat Dilek <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 11:48 AM, Ingo Molnar <[email protected]> wrote:
>>
>> * Sedat Dilek <[email protected]> wrote:
>>
>>> On Fri, Aug 30, 2013 at 9:55 AM, Sedat Dilek <[email protected]> wrote:
>>> > On Fri, Aug 30, 2013 at 5:54 AM, Linus Torvalds
>>> > <[email protected]> wrote:
>>> >> [...]
>>> >
>>> > Following this thread with half an eye... Was that "unsigned" stuff
>>> > fixed (someone pointed to it).
>>> > How do you call that test-patch (subject)?
>>> > I would like to test it on my SNB ultrabook with your test-case script.
>>> >
>>>
>>> Here on Ubuntu/precise v12.04.3 AMD64 I get these numbers for total loops:
>>>
>>> lockref: w/o patch | w/ patch
>>> ======================
>>> Run #1: 2.688.094 | 2.643.004
>>> Run #2: 2.678.884 | 2.652.787
>>> Run #3: 2.686.450 | 2.650.142
>>> Run #4: 2.688.435 | 2.648.409
>>> Run #5: 2.693.770 | 2.651.514
>>>
>>> Average: 2687126,6 VS. 2649171,2 ( −37955,4 )
>>
>> For precise stddev numbers you can run it like this:
>>
>> perf stat --null --repeat 5 ./test
>>
>> and it will measure time only and print the stddev in percentage:
>>
>> Performance counter stats for './test' (5 runs):
>>
>> 1.001008928 seconds time elapsed ( +- 0.00% )
>>
>
> Hi Ingo,
>
> that sounds really good :-).
>
> AFAICS 'make deb-pkg' does not have support to build linux-tools
> Debian package where perf is included.
> Can I run an older version of perf or should I / have to try with the
> one shipped in Linux v3.11-rc7+ sources?
> How can I build perf standalone, out of my sources?
>

Hmm, I installed linux-tools-common (3.2.0-53.81).

$ perf stat --null --repeat 5 ./t_lockref_from-linus
perf_3.11.0-rc7 not found
You may need to install linux-tools-3.11.0-rc7

- Sedat -

2013-08-30 10:29:38

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 11:58 AM, Sedat Dilek <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 11:56 AM, Sedat Dilek <[email protected]> wrote:
>> On Fri, Aug 30, 2013 at 11:48 AM, Ingo Molnar <[email protected]> wrote:
>>>
>>> * Sedat Dilek <[email protected]> wrote:
>>>
>>>> On Fri, Aug 30, 2013 at 9:55 AM, Sedat Dilek <[email protected]> wrote:
>>>> > On Fri, Aug 30, 2013 at 5:54 AM, Linus Torvalds
>>>> > <[email protected]> wrote:
>>>> >> [...]
>>>> >
>>>> > Following this thread with half an eye... Was that "unsigned" stuff
>>>> > fixed (someone pointed to it).
>>>> > How do you call that test-patch (subject)?
>>>> > I would like to test it on my SNB ultrabook with your test-case script.
>>>> >
>>>>
>>>> Here on Ubuntu/precise v12.04.3 AMD64 I get these numbers for total loops:
>>>>
>>>> lockref: w/o patch | w/ patch
>>>> ======================
>>>> Run #1: 2.688.094 | 2.643.004
>>>> Run #2: 2.678.884 | 2.652.787
>>>> Run #3: 2.686.450 | 2.650.142
>>>> Run #4: 2.688.435 | 2.648.409
>>>> Run #5: 2.693.770 | 2.651.514
>>>>
>>>> Average: 2687126,6 VS. 2649171,2 ( −37955,4 )
>>>
>>> For precise stddev numbers you can run it like this:
>>>
>>> perf stat --null --repeat 5 ./test
>>>
>>> and it will measure time only and print the stddev in percentage:
>>>
>>> Performance counter stats for './test' (5 runs):
>>>
>>> 1.001008928 seconds time elapsed ( +- 0.00% )
>>>
>>
>> Hi Ingo,
>>
>> that sounds really good :-).
>>
>> AFAICS 'make deb-pkg' does not have support to build linux-tools
>> Debian package where perf is included.
>> Can I run an older version of perf or should I / have to try with the
>> one shipped in Linux v3.11-rc7+ sources?
>> How can I build perf standalone, out of my sources?
>>
>
> Hmm, I installed linux-tools-common (3.2.0-53.81).
>
> $ perf stat --null --repeat 5 ./t_lockref_from-linus
> perf_3.11.0-rc7 not found
> You may need to install linux-tools-3.11.0-rc7
>

[ Sorry for being off-topic ]

Hey Ingo,

can you help, please?

I installed so far all missing -dev packages...

$ sudo apt-get install libelf-dev libdw-dev libunwind7-dev libslang2-dev

...and then want a perf-only build...

[ See tools/Makefile ]

$ LANG=C LC_ALL=C make -C tools/ perf_install 2>&1 | tee ../perf_install-log.txt

This ends up like this:
...
make[2]: Entering directory
`/home/wearefam/src/linux-kernel/linux/tools/lib/traceevent'
make[2]: Leaving directory
`/home/wearefam/src/linux-kernel/linux/tools/lib/traceevent'
LINK perf
gcc: error: /home/wearefam/src/linux-kernel/linux/tools/lib/lk/liblk.a:
No such file or directory
make[1]: *** [perf] Error 1
make[1]: Leaving directory `/home/wearefam/src/linux-kernel/linux/tools/perf'
make: *** [perf_install] Error 2

$ LANG=C LC_ALL=C ll tools/lib/lk/
total 20
drwxr-xr-x 2 wearefam wearefam 4096 Aug 30 12:11 ./
drwxr-xr-x 4 wearefam wearefam 4096 Jul 11 19:42 ../
-rw-r--r-- 1 wearefam wearefam 1430 Aug 30 09:56 Makefile
-rw-r--r-- 1 wearefam wearefam 2144 Jul 11 19:42 debugfs.c
-rw-r--r-- 1 wearefam wearefam 619 Jul 11 19:42 debugfs.h

Why is liblk not built?

- Sedat -

P.S.: To clean perf build, run...

$ LANG=C LC_ALL=C make -C tools/ perf_clean

2013-08-30 10:36:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:29:34PM +0200, Sedat Dilek wrote:
> [ Sorry for being off-topic ]
>
> Hey Ingo,
>
> can you help, please?
>
> I installed so far all missing -dev packages...
>
> $ sudo apt-get install libelf-dev libdw-dev libunwind7-dev libslang2-dev

It seems to me the easier way is:

$ apt-get build-dep linux-tools

> ...and then want a perf-only build...
>
> [ See tools/Makefile ]
>
> $ LANG=C LC_ALL=C make -C tools/ perf_install 2>&1 | tee ../perf_install-log.txt

The way I always build that stuff is simply:

$ cd tools/perf
$ make -j
$ cp perf `which perf`
$ cd -

No idea about liblk though, never had that issue but maybe something
like:

$ cd tools/lib/lk/
$ make clean
$ make

will get you a more useful error. Dunno, it's a fairly trivial little
library, only a single .c file.

2013-08-30 10:38:23

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:29 PM, Sedat Dilek <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 11:58 AM, Sedat Dilek <[email protected]> wrote:
>> On Fri, Aug 30, 2013 at 11:56 AM, Sedat Dilek <[email protected]> wrote:
>>> On Fri, Aug 30, 2013 at 11:48 AM, Ingo Molnar <[email protected]> wrote:
>>>>
>>>> * Sedat Dilek <[email protected]> wrote:
>>>>
>>>>> On Fri, Aug 30, 2013 at 9:55 AM, Sedat Dilek <[email protected]> wrote:
>>>>> > On Fri, Aug 30, 2013 at 5:54 AM, Linus Torvalds
>>>>> > <[email protected]> wrote:
>>>>> >> [...]
>>>>> >
>>>>> > Following this thread with half an eye... Was that "unsigned" stuff
>>>>> > fixed (someone pointed to it).
>>>>> > How do you call that test-patch (subject)?
>>>>> > I would like to test it on my SNB ultrabook with your test-case script.
>>>>> >
>>>>>
>>>>> Here on Ubuntu/precise v12.04.3 AMD64 I get these numbers for total loops:
>>>>>
>>>>> lockref: w/o patch | w/ patch
>>>>> ======================
>>>>> Run #1: 2.688.094 | 2.643.004
>>>>> Run #2: 2.678.884 | 2.652.787
>>>>> Run #3: 2.686.450 | 2.650.142
>>>>> Run #4: 2.688.435 | 2.648.409
>>>>> Run #5: 2.693.770 | 2.651.514
>>>>>
>>>>> Average: 2687126,6 VS. 2649171,2 ( −37955,4 )
>>>>
>>>> For precise stddev numbers you can run it like this:
>>>>
>>>> perf stat --null --repeat 5 ./test
>>>>
>>>> and it will measure time only and print the stddev in percentage:
>>>>
>>>> Performance counter stats for './test' (5 runs):
>>>>
>>>> 1.001008928 seconds time elapsed ( +- 0.00% )
>>>>
>>>
>>> Hi Ingo,
>>>
>>> that sounds really good :-).
>>>
>>> AFAICS 'make deb-pkg' does not have support to build linux-tools
>>> Debian package where perf is included.
>>> Can I run an older version of perf or should I / have to try with the
>>> one shipped in Linux v3.11-rc7+ sources?
>>> How can I build perf standalone, out of my sources?
>>>
>>
>> Hmm, I installed linux-tools-common (3.2.0-53.81).
>>
>> $ perf stat --null --repeat 5 ./t_lockref_from-linus
>> perf_3.11.0-rc7 not found
>> You may need to install linux-tools-3.11.0-rc7
>>
>
> [ Sorry for being off-topic ]
>
> Hey Ingo,
>
> can you help, please?
>
> I installed so far all missing -dev packages...
>
> $ sudo apt-get install libelf-dev libdw-dev libunwind7-dev libslang2-dev
>
> ...and then want a perf-only build...
>
> [ See tools/Makefile ]
>
> $ LANG=C LC_ALL=C make -C tools/ perf_install 2>&1 | tee ../perf_install-log.txt
>
> This ends up like this:
> ...
> make[2]: Entering directory
> `/home/wearefam/src/linux-kernel/linux/tools/lib/traceevent'
> make[2]: Leaving directory
> `/home/wearefam/src/linux-kernel/linux/tools/lib/traceevent'
> LINK perf
> gcc: error: /home/wearefam/src/linux-kernel/linux/tools/lib/lk/liblk.a:
> No such file or directory
> make[1]: *** [perf] Error 1
> make[1]: Leaving directory `/home/wearefam/src/linux-kernel/linux/tools/perf'
> make: *** [perf_install] Error 2
>
> $ LANG=C LC_ALL=C ll tools/lib/lk/
> total 20
> drwxr-xr-x 2 wearefam wearefam 4096 Aug 30 12:11 ./
> drwxr-xr-x 4 wearefam wearefam 4096 Jul 11 19:42 ../
> -rw-r--r-- 1 wearefam wearefam 1430 Aug 30 09:56 Makefile
> -rw-r--r-- 1 wearefam wearefam 2144 Jul 11 19:42 debugfs.c
> -rw-r--r-- 1 wearefam wearefam 619 Jul 11 19:42 debugfs.h
>
> Why is liblk not built?
>
> - Sedat -
>
> P.S.: To clean perf build, run...
>
> $ LANG=C LC_ALL=C make -C tools/ perf_clean

Sorry for flooding...

The tools/perf-only build seems to be BROKEN in v3.11-rc7.

WORKAROUND:

$ sudo apt-get install libelf-dev libdw-dev libunwind7-dev
libslang2-dev libnuma-dev

$ LANG=C LC_ALL=C make -C tools/ liblk

$ LANG=C LC_ALL=C make -C tools/ perf_install

This works here.

- Sedat -

2013-08-30 10:44:32

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:36 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 12:29:34PM +0200, Sedat Dilek wrote:
>> [ Sorry for being off-topic ]
>>
>> Hey Ingo,
>>
>> can you help, please?
>>
>> I installed so far all missing -dev packages...
>>
>> $ sudo apt-get install libelf-dev libdw-dev libunwind7-dev libslang2-dev
>
> It seems to me the easier way is:
>
> $ apt-get build-dep linux-tools
>

No, see the Build-Depends (B-Ds) for the Ubuntu v3.2 kernel:

Build-Depends: dpkg (>= 1.13.19), debhelper (>= 5), gawk

http://archive.ubuntu.com/ubuntu/pool/main/l/linux-meta/linux-meta_3.2.0.52.62.dsc

>> ...and then want a perf-only build...
>>
>> [ See tools/Makefile ]
>>
>> $ LANG=C LC_ALL=C make -C tools/ perf_install 2>&1 | tee ../perf_install-log.txt
>
> The way I always build that stuff is simply:
>
> $ cd tools/perf
> $ make -j
> $ cp perf `which perf`
> $ cd -
>

Please see the advice in 'tools/Makefile':

@echo 'You can do:'
@echo ' $$ make -C tools/ <tool>_install'
@echo ''
@echo ' from the kernel command line to build and install one of'
@echo ' the tools above'
@echo ''
@echo ' $$ make tools/install'
@echo ''
@echo ' installs all tools.'


> No idea about liblk though, never had that issue but maybe something
> like:
>
> $ cd tools/lib/lk/
> $ make clean
> $ make
>
> will get you a more useful error. dunno, its a fairly trivial little
> library, only a single .c file.

I looked quickly over the various Makefiles; one says liblk is FORCE'd, so
it should be built, but it is NOT!

Thanks anyway, Peter for all the hints!

If you tell me where you discuss perf issues I can describe the problem.

- Sedat -

2013-08-30 10:46:55

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:44 PM, Sedat Dilek <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 12:36 PM, Peter Zijlstra <[email protected]> wrote:
>> On Fri, Aug 30, 2013 at 12:29:34PM +0200, Sedat Dilek wrote:
>>> [ Sorry for being off-topic ]
>>>
>>> Hey Ingo,
>>>
>>> can you help, please?
>>>
>>> I installed so far all missing -dev packages...
>>>
>>> $ sudo apt-get install libelf-dev libdw-dev libunwind7-dev libslang2-dev
>>
>> It seems to me the easier way is:
>>
>> $ apt-get build-dep linux-tools
>>
>
> NO, se B-Ds for Ubuntu-kernel v3.2:
>
> Build-Depends: dpkg (>= 1.13.19), debhelper (>= 5), gawk
>
> http://archive.ubuntu.com/ubuntu/pool/main/l/linux-meta/linux-meta_3.2.0.52.62.dsc
>
>>> ...and then want a perf-only build...
>>>
>>> [ See tools/Makefile ]
>>>
>>> $ LANG=C LC_ALL=C make -C tools/ perf_install 2>&1 | tee ../perf_install-log.txt
>>
>> The way I always build that stuff is simply:
>>
>> $ cd tools/perf
>> $ make -j
>> $ cp perf `which perf`
>> $ cd -
>>
>
> Please, see advices in 'tools/Makefile':
>
> @echo 'You can do:'
> @echo ' $$ make -C tools/ <tool>_install'
> @echo ''
> @echo ' from the kernel command line to build and install one of'
> @echo ' the tools above'
> @echo ''
> @echo ' $$ make tools/install'
> @echo ''
> @echo ' installs all tools.'
>
>
>> No idea about liblk though, never had that issue but maybe something
>> like:
>>
>> $ cd tools/lib/lk/
>> $ make clean
>> $ make
>>
>> will get you a more useful error. dunno, its a fairly trivial little
>> library, only a single .c file.
>
> If looked quickly over diverse Makefile one was saying liblk FORCE, so
> it should be built, but is NOT!
>
> Thanks anyway, Peter for all the hints!
>
> If you tell me where you discuss perf issues I can describe the problem.
>

Just as a sidenote for cleaning up (that seems to be the "official" way):

$ LANG=C LC_ALL=C make -C tools/ liblk_clean

$ LANG=C LC_ALL=C make -C tools/ perf_clean

- Sedat -

2013-08-30 10:52:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:44:29PM +0200, Sedat Dilek wrote:
> Please, see advices in 'tools/Makefile':
>
> @echo 'You can do:'
> @echo ' $$ make -C tools/ <tool>_install'
> @echo ''
> @echo ' from the kernel command line to build and install one of'
> @echo ' the tools above'
> @echo ''
> @echo ' $$ make tools/install'
> @echo ''
> @echo ' installs all tools.'

I never follow advice :-) Muwhaha, also I'm generally not interested in
'all' tools anyway.

> If you tell me where you discuss perf issues I can describe the problem.

On this list, start a new thread CC ingo, acme and me. There's also an
IRC channel but I keep forgetting what the official channel is and I'm
likely not on it. I do tend to abuse irc.oftc.net/#linux-rt for it since
most people are on there anyway.

2013-08-30 10:57:43

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:52 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 12:44:29PM +0200, Sedat Dilek wrote:
>> Please, see advices in 'tools/Makefile':
>>
>> @echo 'You can do:'
>> @echo ' $$ make -C tools/ <tool>_install'
>> @echo ''
>> @echo ' from the kernel command line to build and install one of'
>> @echo ' the tools above'
>> @echo ''
>> @echo ' $$ make tools/install'
>> @echo ''
>> @echo ' installs all tools.'
>
> I never follow advice :-) Muwhaha, also I'm generally not interested in
> 'all' tools anywya.
>
>> If you tell me where you discuss perf issues I can describe the problem.
>
> On this list, start a new thread CC ingo, acme and me. There's also an
> IRC channel but I keep forgetting what the official channel is and I'm
> likely not on it. I do tend to abuse irc.oftc.net/#linux-rt for it since
> most people are on there anyway.

Hi rebel :-)!

I have been away from IRC for about a year.
I will write some words...

- Sedat -

P.S.: It worked...

$ ~/src/linux-kernel/linux/tools/perf/perf --version
perf version 3.11.rc7.ga7370

$ ~/src/linux-kernel/linux/tools/perf/perf stat --null --repeat 5
./t_lockref_from-linus
Total loops: 2652351
Total loops: 2604876
Total loops: 2649696
Total loops: 2651417
Total loops: 2644068

Performance counter stats for './t_lockref_from-linus' (5 runs):

10,002926693 seconds time elapsed ( +- 0,00% )

$ cat /proc/version
Linux version 3.11.0-rc7-1-lockref-small
([email protected]@fambox) (gcc version 4.6.3 (Ubuntu/Linaro
4.6.3-1ubuntu5) ) #1 SMP Fri Aug 30 10:23:19 CEST 2013

2013-08-30 11:19:47

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:36 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 12:29:34PM +0200, Sedat Dilek wrote:
>> [ Sorry for being off-topic ]
>>
>> Hey Ingo,
>>
>> can you help, please?
>>
>> I installed so far all missing -dev packages...
>>
>> $ sudo apt-get install libelf-dev libdw-dev libunwind7-dev libslang2-dev
>
> It seems to me the easier way is:
>
> $ apt-get build-dep linux-tools
>

The B-Ds for latest linux-tools (3.11~rc4-1~exp1) in
Debian/experimental look like this:

Build-Depends: debhelper (>> 7), python, asciidoc, binutils-dev,
bison, flex, libdw-dev, libelf-dev, libnewt-dev, libperl-dev,
python-dev, xmlto, autoconf, automake, libtool, libglib2.0-dev,
libsysfs-dev, libwrap0-dev

- Sedat -

[1] http://ftp.de.debian.org/debian/pool/main/l/linux-tools/linux-tools_3.11~rc4-1~exp1.dsc

2013-08-30 14:05:08

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:57 PM, Sedat Dilek <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 12:52 PM, Peter Zijlstra <[email protected]> wrote:
>> On Fri, Aug 30, 2013 at 12:44:29PM +0200, Sedat Dilek wrote:

[...]

>>> If you tell me where you discuss perf issues I can describe the problem.
>>
>> On this list, start a new thread CC ingo, acme and me. There's also an
>> IRC channel but I keep forgetting what the official channel is and I'm
>> likely not on it. I do tend to abuse irc.oftc.net/#linux-rt for it since
>> most people are on there anyway.
>
> Hi rebel :-)!
>
> I was away from IRC for quite a year.
> I will write some words...
>

Here we go.

- Sedat -

[1] http://marc.info/?l=linux-kernel&m=137786590626809&w=2

2013-08-30 15:28:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:16 AM, Ingo Molnar <[email protected]> wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
>> > BTW. Do you have your test case at hand ?
>>
>> My test-case is a joke. It's explicitly *trying* to get as much
>> contention as possible on a dentry, by just starting up a lot of threads
>> that look up one single pathname (the same one for everybody). It
>> defaults to using /tmp for this, but you can specify the filename.
>
> Waiman's tests seemed to use sufficiently generic and varied workloads
> (AIM7) and they showed pretty nice unconditional improvements with his
> variant of this scheme, so I think testing with your simple testcase that
> intentionally magnifies the scalability issue is 100% legit and may in
> fact help tune the changes more accurately, because it has less inherent
> noise.

Yes. However, what I am (not very) worried about is that people will
hit some particular codepath that ends up having bad behavior.

I think I covered all the normal hotpaths in pathname lookup, which is
why I'm not *that* worried, but it's still the case that my silly
test-case is very limited. It's limited for a good *reason* (to try to
show the worst-case scalability problem), but it's limited.

Linus

2013-08-30 15:34:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 2:27 AM, Sedat Dilek <[email protected]> wrote:
>
> Here on Ubuntu/precise v12.04.3 AMD64 I get these numbers for total loops:
>
> lockref: w/o patch | w/ patch
> ======================
> Run #1: 2.688.094 | 2.643.004
> Run #2: 2.678.884 | 2.652.787
> Run #3: 2.686.450 | 2.650.142
> Run #4: 2.688.435 | 2.648.409
> Run #5: 2.693.770 | 2.651.514

Yes, so this is pretty much expected.

If you don't have a very high core count (you don't mention your
system, but that's pretty low - I get ~65 million repetitions in 10
seconds on my i5-670), the cmpxchg will not help - because you don't
actually see the bad "wait on spinlock" behavior in the first place.

And a "cmpxchg" is slightly slower than the very optimized spinlocks,
and has that annoying "read original value" first issue too. So the
patch can make things a bit slower, although it will depend on the
microarchitecture (and as mentioned elsewhere, there are other things
that can make a bigger difference boot-to-boot - dentry allocation
details etc can have "sticky" performance impact).

So we may take a small hit in order to then *not* have horrible
scalability at the high end.

Linus

2013-08-30 15:38:52

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 5:34 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Aug 30, 2013 at 2:27 AM, Sedat Dilek <[email protected]> wrote:
>>
>> Here on Ubuntu/precise v12.04.3 AMD64 I get these numbers for total loops:
>>
>> lockref: w/o patch | w/ patch
>> ======================
>> Run #1: 2.688.094 | 2.643.004
>> Run #2: 2.678.884 | 2.652.787
>> Run #3: 2.686.450 | 2.650.142
>> Run #4: 2.688.435 | 2.648.409
>> Run #5: 2.693.770 | 2.651.514
>
> Yes, so this is pretty much expected.
>
> If you don't have a very high core count (you don't mention your
> system, but that's pretty - I get ~65 million repetitions in 10
> seconds on my i5-670), the cmpxchg will not help - because you don't
> actually see the bad "wait on spinlock" behavior in the first place.
>
> And a "cmpxchg" is slightly slower than the very optimized spinlocks,
> and has that annoying "read original value" first issue too. So the
> patch can make things a bit slower, although it will depend on the
> microarchitecture (and as mentioned elsewhere, there are other things
> that can make a bigger difference boot-to-boot - dentry allocation
> details etc can have "sticky" performance impact).
>
> So we may take a small hit in order to then *not* have horrible
> scalability at the high end.
>

A Samsung series-5 ultrabook.

$ grep "model name" /proc/cpuinfo | uniq
model name : Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz

- Sedat -

2013-08-30 16:12:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, 30 Aug 2013 17:38:47 +0200
Sedat Dilek <[email protected]> wrote:


> A Samsung series-5 ultrabook.
>
> $ grep "model name" /proc/cpuinfo | uniq
> model name : Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz

I believe the number of CPUs is more important. But as this is an
ultrabook, I doubt that is very high.

Now I know this isn't going to be popular, but I'll suggest it anyway.
What about only implementing the lockref locking when CPUs are greater
than 7; 7 or fewer will still use the normal optimized spinlocks.

-- Steve

2013-08-30 16:16:25

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 6:12 PM, Steven Rostedt <[email protected]> wrote:
> On Fri, 30 Aug 2013 17:38:47 +0200
> Sedat Dilek <[email protected]> wrote:
>
>
>> A Samsung series-5 ultrabook.
>>
>> $ grep "model name" /proc/cpuinfo | uniq
>> model name : Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz
>
> I believe the number of CPUs is more important. But as this is an
> ultrabook, I doubt that is very high.
>

Fantastic Four.

> Now I know this isn't going to be popular, but I'll suggest it anyway.
> What about only implementing the lockref locking when CPUs are greater
> than 7, 7 or less will still use the normal optimized spinlocks.
>

I have seen that the spinlock-lockref stuff is more important on those
monster machines.
It's good to see it does not break "smaller" systems.

- Sedat -

2013-08-30 16:32:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 8:38 AM, Sedat Dilek <[email protected]> wrote:
>
> A Samsung series-5 ultrabook.
>
> $ grep "model name" /proc/cpuinfo | uniq
> model name : Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz

Hmm. Do you have debugging options enabled? Because that CPU should
have the same core count as mine (two+HT), a slightly smaller cache
(3M vs 4M) and runs at a noticeably lower frequency (1.6GHz vs 3.5).
It probably also has slower memory etc, but that should still make it
maybe half speed of mine. Not 1/20th.

As mentioned, I get numbers in the 65M range. Yours are under 2.7M.
Even with some thermal throttling, I would expect better than that.

My pixel (1.8GHz i5-3427U) should be a *bit* faster than yours. And I
get 54M iterations on that.

I saw you mentioned CONFIG_CMPXCHG_LOCKREF=y in your .config, so you
don't have spinlock debugging enabled, but maybe you have some other
expensive debug option enabled. Like DEBUG_PAGEALLOC etc.

If you get "perf" compiled, mind doing a

perf record -f -e cycles:pp ./a.out
perf report

on it and look what that says?

Linus

2013-08-30 16:37:09

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 6:32 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Aug 30, 2013 at 8:38 AM, Sedat Dilek <[email protected]> wrote:
>>
>> A Samsung series-5 ultrabook.
>>
>> $ grep "model name" /proc/cpuinfo | uniq
>> model name : Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz
>
> Hmm. Do you have debugging options enabled? Because that CPU should
> have the same core count as mine (two+HT), a slightly smaller cache
> (3M vs 4M) and runs at a noticeably lower frequency (1.6GHz vs 3.5).
> It probably also has slower memory etc, but that should still make it
> maybe half speed of mine. Not 1/20th.
>
> As mentioned, I get numbers in the 65M range. Yours are under 2.7M.
> Even with some thermal throttling, I would expect better than that.
>
> My pixel (1.8GHz i5-3427U) should be a *bit* faster than yours. And I
> get 54M iterations on that.
>
> I saw you mentioned CONFIG_CMPXCHG_LOCKREF=y in your .config, so you
> don't have spinlock debugging enabled, but maybe you have some other
> expensive debug option enabled. Like DEBUG_PAGEALLOC etc.
>

Yeah, this was built with some debug options enabled (see my attached kernel-config).

> If you get "perf" compiled, mind doing a
>
> perf record -f -e cycles:pp ./a.out
> perf report
>
> on it and look what that says?
>

Where does this a.out file come from, or how do I generate it?

- Sedat -


Attachments:
config-3.11.0-rc7-1-lockref-small (112.30 kB)

2013-08-30 16:52:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 9:37 AM, Sedat Dilek <[email protected]> wrote:
>
> Where is this a.out file from or how to generate it?

Oh, that's just the silly threaded test-binary. I don't know what you
called it.

As to your config options, yesh, you have some expensive stuff.
DEBUG_OBJECTS and DEBUG_MUTEXES in particular tend to cause lots of
horrible performance issues. I didn't check if there might be other
things..

Linus

2013-08-30 17:11:30

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 6:52 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Aug 30, 2013 at 9:37 AM, Sedat Dilek <[email protected]> wrote:
>>
>> Where is this a.out file from or how to generate it?
>
> Oh, that's just the silly threaded test-binary. I don't know what you
> called it.
>
> As to your config options, yesh, you have some expensive stuff.
> DEBUG_OBJECTS and DEBUG_MUTEXES in particular tend to cause lots of
> horrible performance issues. I didn't check if there might be other
> things..
>

There is no -f option for record, only for report, so I swapped them:

$ sudo ~/src/linux-kernel/linux/tools/perf/perf record -e cycles:pp
./scripts/t_lockref_from-linus
Total loops: 2240273
[ perf record: Woken up 25 times to write data ]
[ perf record: Captured and wrote 6.080 MB perf.data (~265641 samples) ]

$ sudo ~/src/linux-kernel/linux/tools/perf/perf report -f

Samples: 159K of event 'cycles:pp', Event count (approx.): 76535356682
84,10% t_lockref_from- [kernel.kallsyms] [k] check_poison_obj
5,22% t_lockref_from- [kernel.kallsyms] [k] memset
1,15% t_lockref_from- [kernel.kallsyms] [k] irq_return
0,45% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_alloc
0,44% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_free
0,36% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
0,35% t_lockref_from- [kernel.kallsyms] [k] cache_free_debugcheck
0,35% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
0,34% t_lockref_from- [kernel.kallsyms] [k] user_exit
0,33% t_lockref_from- [kernel.kallsyms] [k] __d_lookup_rcu
0,26% t_lockref_from- libc-2.15.so [.] __xstat64
0,25% t_lockref_from- [kernel.kallsyms] [k] poison_obj
0,24% t_lockref_from- [kernel.kallsyms] [k] local_clock
0,19% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
0,19% t_lockref_from- [kernel.kallsyms] [k] link_path_walk
0,19% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_enter_common.isra.43
0,19% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_exit_common.isra.41
0,17% t_lockref_from- [kernel.kallsyms] [k] native_read_tsc
0,17% t_lockref_from- [kernel.kallsyms] [k] user_enter
0,16% t_lockref_from- [kernel.kallsyms] [k] sched_clock_cpu
0,16% t_lockref_from- [kernel.kallsyms] [k] path_lookupat
0,14% t_lockref_from- [kernel.kallsyms] [k] vfs_getattr
0,14% t_lockref_from- [kernel.kallsyms] [k] lockref_put_or_lock
0,14% t_lockref_from- [kernel.kallsyms] [k] path_init
0,13% t_lockref_from- [kernel.kallsyms] [k] tracesys
0,13% t_lockref_from- [kernel.kallsyms] [k] native_sched_clock
0,13% t_lockref_from- [kernel.kallsyms] [k] strncpy_from_user
0,12% t_lockref_from- [kernel.kallsyms] [k] cp_new_stat
0,12% t_lockref_from- [kernel.kallsyms] [k] cache_alloc_debugcheck_after.isra.61
0,12% t_lockref_from- [kernel.kallsyms] [k] account_system_time
0,12% t_lockref_from- [kernel.kallsyms] [k] copy_user_generic_unrolled
0,12% t_lockref_from- [kernel.kallsyms] [k] syscall_trace_enter
0,12% t_lockref_from- [kernel.kallsyms] [k] jiffies_to_timeval
0,11% t_lockref_from- [kernel.kallsyms] [k] get_vtime_delta
0,11% t_lockref_from- t_lockref_from-linus [.] __stat
0,10% t_lockref_from- [kernel.kallsyms] [k] check_irq_off
0,10% t_lockref_from- [kernel.kallsyms] [k] common_perm
0,10% t_lockref_from- [kernel.kallsyms] [k] lookup_fast
0,09% t_lockref_from- [kernel.kallsyms] [k] getname_flags
0,09% t_lockref_from- [kernel.kallsyms] [k] syscall_trace_leave
0,08% t_lockref_from- t_lockref_from-linus [.] start_routine
0,08% t_lockref_from- [kernel.kallsyms] [k] vfs_fstatat
0,08% t_lockref_from- [kernel.kallsyms] [k] system_call_after_swapgs
0,08% t_lockref_from- [kernel.kallsyms] [k] user_path_at_empty
0,08% t_lockref_from- [kernel.kallsyms] [k] account_user_time
0,07% t_lockref_from- [kernel.kallsyms] [k] generic_fillattr
0,07% t_lockref_from- [kernel.kallsyms] [k] complete_walk
0,06% t_lockref_from- [kernel.kallsyms] [k] security_inode_getattr
0,06% t_lockref_from- [kernel.kallsyms] [k] _raw_spin_lock
0,06% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_exit
0,06% t_lockref_from- [kernel.kallsyms] [k] vtime_account_user
0,06% t_lockref_from- [kernel.kallsyms] [k] dput
0,06% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_enter
0,06% t_lockref_from- [kernel.kallsyms] [k] __virt_addr_valid

- Sedat -

2013-08-30 17:17:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Thu, Aug 29, 2013 at 09:43:07AM -0700, Linus Torvalds wrote:

> We'll see. The real problem is that I'm not sure if I can even see the
> scalability issue on any machine I actually personally want to use
> (read: silent). On my current system I can only get up to 15%
> _raw_spin_lock by just stat'ing the same file over and over and over
> again from lots of threads.

Yeah, silent basically limits you to i7 single socket systems and sadly
Intel doesn't seem to want to make those with more than 4 cores on :/

I've got a i7-K part (SNB iirc) with a _huge_ scythe cooler and a high
efficiency fanless PSU for a system that's near noiseless -- as in my
Thinkpad actually makes more noise.

2013-08-30 17:26:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 10:11 AM, Sedat Dilek <[email protected]> wrote:
>
> There is no -f option for record but for report, so I swapped them:

That's odd. "perf record -f" very much works for me, and I use it all
the time to make sure that I don't mix perf data from older runs..

[ Time passes. I check current 'perf'. Dammit. This has changed. I
seldom rebuild "perf", so I never noticed. It was removed in commit
4a4d371a4dfb, apparently because the old "append" mode no longer even
exists ]

But never mind, that doesn't matter for your numbers:

> $ sudo ~/src/linux-kernel/linux/tools/perf/perf report -f
>
> Samples: 159K of event 'cycles:pp', Event count (approx.): 76535356682
> 84,10% t_lockref_from- [kernel.kallsyms] [k] check_poison_obj
> 5,22% t_lockref_from- [kernel.kallsyms] [k] memset
> 1,15% t_lockref_from- [kernel.kallsyms] [k] irq_return
> 0,45% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_alloc
> 0,44% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_free

You're wasting all the time in slab debugging, probably due to the
object debug option.

None of the rest is even remotely interesting, and the spinlock hits
you do have are more likely to come from the same object debug code
than from pathname lookup.

Linus

2013-08-30 17:29:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 10:17 AM, Peter Zijlstra <[email protected]> wrote:
>
> Yeah, silent basically limits you to i7 single socket systems and sadly
> Intel doesn't seem to want to make those with more than 4 cores on :/

Yup. And even if they had more cores in a single socket, the real
scalability issues won't happen until you start crossing sockets and
serialization slows down by a big amount due to cachelines moving
outside the die.

> I've got a i7-K part (SNB iirc) with a _huge_ scythe cooler and a high
> efficiency fanless PSU for a system that's near noiseless -- as in my
> Thinkpad actually makes more noise.

I've got a 4770S on order, it should arrive tomorrow. It's the 65W
part, and it has TSX. But no, I doubt I'll see any real scalability
issues with it, but at least I can test any TSX codepaths.

Linus

2013-08-30 17:33:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 10:28 AM, Linus Torvalds
<[email protected]> wrote:
>
> I've got a 4770S on order, it should arrive tomorrow. It's the 65W
> part, and it has TSX. But no, I doubt I'll see any real scalability
> issues with it, but at least I can test any TSX codepaths.

Side note: whatever marketing person inside Intel that decided that
the "K" parts shouldn't get TSX-NI support should be fired. Or at
least have a stern talking to (and by "stern" I obviously mean that
thumbscrews and hot pins under their nails should be involved).

If I wanted to build a peak performance machine (rather than a silent
one), I wouldn't have had TSX. What the hell is wrong with Intel
marketing? Their "lets fragment things" crap is making it harder to
actually support their new technologies.

Linus

2013-08-30 18:33:32

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/29/2013 11:54 PM, Linus Torvalds wrote:
> On Thu, Aug 29, 2013 at 8:12 PM, Waiman Long<[email protected]> wrote:
>> On 08/29/2013 07:42 PM, Linus Torvalds wrote:
>>> Waiman? Mind looking at this and testing? Linus
>> Sure, I will try out the patch tomorrow morning and see how it works out for
>> my test case.
> Ok, thanks, please use this slightly updated patch attached here.
>
>

I tested your patch on a 2-socket (12 cores, 24 threads) DL380 with
2.9GHz Westmere-EX CPUs, the test results of your test program (max
threads increased to 24 to match the thread count) were:

with patch = 68M
w/o patch = 12M

So it was almost a 6X improvement. I think that is really good.
Dual-socket machines, these days, shouldn't be considered "BIG"
machines. They are pretty common in different organizations.

I have reviewed the patch, and it looks good to me with the exception
that I added a cpu_relax() call at the end of the while loop in the
CMPXCHG_LOOP macro.
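
For readers following along, here is a rough sketch of what such a
cmpxchg fast path with the extra cpu_relax() back-off looks like. It is
an illustration only, not the actual code from the patch: the helper
name lockref_inc_fast() is made up, and the layout assumes a 64-bit
build with a 4-byte spinlock_t (i.e. no lock debugging).

struct lockref {
	union {
		u64 lock_count;			/* whole <lock, count> word */
		struct {
			spinlock_t lock;	/* assumes a 4-byte spinlock_t */
			unsigned int count;
		};
	};
};

/* Returns 1 if the refcount was bumped without touching the spinlock. */
static int lockref_inc_fast(struct lockref *lr)
{
	struct lockref old, new;

	old.lock_count = ACCESS_ONCE(lr->lock_count);
	while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
		u64 prev;

		new = old;
		new.count++;			/* the refcount half of the word */
		prev = cmpxchg(&lr->lock_count,	/* 64-bit compare-and-swap */
			       old.lock_count, new.lock_count);
		if (prev == old.lock_count)
			return 1;		/* lockless update won */
		old.lock_count = prev;		/* raced, retry from the new value */
		cpu_relax();			/* the back-off Waiman suggests */
	}
	return 0;				/* lock held: take the slow path */
}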

I also got the perf data of the test runs with and without the patch.

With patch:

29.24% a.out [kernel.kallsyms] [k] lockref_get_or_lock
19.65% a.out [kernel.kallsyms] [k] lockref_put_or_lock
14.11% a.out [kernel.kallsyms] [k] dput
5.37% a.out [kernel.kallsyms] [k] __d_lookup_rcu
5.29% a.out [kernel.kallsyms] [k] lg_local_lock
4.59% a.out [kernel.kallsyms] [k] d_rcu_to_refcount
:
0.13% a.out [kernel.kallsyms] [k] complete_walk
:
0.01% a.out [kernel.kallsyms] [k] _raw_spin_lock

Without patch:

93.50% a.out [kernel.kallsyms] [k] _raw_spin_lock
0.96% a.out [kernel.kallsyms] [k] dput
0.80% a.out [kernel.kallsyms] [k] kmem_cache_free
0.75% a.out [kernel.kallsyms] [k] lg_local_lock
0.48% a.out [kernel.kallsyms] [k] complete_walk
0.45% a.out [kernel.kallsyms] [k] __d_lookup_rcu

For the other test cases that I am interested in, like the AIM7
benchmark, your patch may not be as good as my original one. I got 1-3M
JPM (varied quite a lot in different runs) in the short workloads on a
80-core system. My original one got 6M JPM. However, the test was done
on 3.10 based kernel. So I need to do more test to see if that has an
effect on the JPM results.

Anyway, I think this patch is good performance-wise. I remember that a
while ago someone internally reported a lock contention problem in the
dentry code, probably involving complete_walk(). This patch will
certainly help for that case.

I will do more investigation to see how to make this patch work better
for my test cases.

Thanks for taking the effort to optimize the complete_walk() and
unlazy_walk() functions, which were not covered in my original patch.
That will make the patch work even better under more circumstances. I
really appreciate that.

Best regards,
Longman


2013-08-30 18:42:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 9:12 AM, Steven Rostedt <[email protected]> wrote:
>
> Now I know this isn't going to be popular, but I'll suggest it anyway.
> What about only implementing the lockref locking when CPUs are greater
> than 7, 7 or less will still use the normal optimized spinlocks.

I considered it. It's not hugely difficult to do, in that we could
make it a static key thing, but I'd actually rather make it depend on
some actual user-settable thing than on some arbitrary number of
cpu's.

See the CMPXCHG_LOOP() macro in lib/lockref.c: it would be easy to
just enclose the whole thing in a

if (static_key_enabled(&cmpxchg_lockref)) { .. }

and then it could be enabled/disabled at will with very little
performance downside. And I don't think it's necessarily a bad idea.
The code has a very natural "fall back to spinlock" model.
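
Spelled out a little more, the gate could look like the sketch below;
only the static_key_enabled() test comes from the paragraph above, while
the STATIC_KEY_INIT_TRUE default and the reuse of the hypothetical
lockref_inc_fast() helper from the earlier sketch are assumptions for
illustration.

static struct static_key cmpxchg_lockref = STATIC_KEY_INIT_TRUE;

void lockref_get(struct lockref *lockref)
{
	if (static_key_enabled(&cmpxchg_lockref)) {
		if (lockref_inc_fast(lockref))	/* cmpxchg loop, as sketched earlier */
			return;
	}
	/* key disabled, or the lock was held: old lock -> update -> unlock path */
	spin_lock(&lockref->lock);
	lockref->count++;
	spin_unlock(&lockref->lock);
}

Flipping the key at runtime would then switch between the two paths for
benchmarking the same boot, as described above.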

THAT SAID.

Even though uncontended spinlocks are faster than a cmpxchg, under any
real normal load I don't think you can necessarily measure the
difference. Remember: this particular benchmark does absolutely
*nothing* but pathname lookups, and even then it's pretty close to
noise. And the biggest disadvantage of cmpxchg - the fact that you
have to read the cache line before you do the r-m-w cycle, and thus
might have an extra cache coherency cycle - shouldn't be an issue for
the dentry use when you don't try to hit the same dentry over and over
again, because the code has already read the dentry hash etc.

So I'm not sure it's really worth it. It might be interesting to try
that static_key approach simply for benchmarking, though. That way you
could benchmark the exact same boot with pretty much the exact same
dentry population, just switch the static key around and run a few
path-intensive benchmarks.

If anybody is willing to write the patch and do the benchmarking (I
would suggest *not* using my idiotic test-program for this), and then
send it to me with numbers, that would be interesting...

Linus

2013-08-30 18:53:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 11:33 AM, Waiman Long <[email protected]> wrote:
>
> I tested your patch on a 2-socket (12 cores, 24 threads) DL380 with 2.9GHz
> Westmere-EX CPUs, the test results of your test program (max threads
> increased to 24 to match the thread count) were:
>
> with patch = 68M
> w/o patch = 12M

Ok, that's certainly noticeable.

> I have reviewed the patch, and it looks good to me with the exception that I
> added a cpu_relax() call at the end of while loop in the CMPXCHG_LOOP macro.

Yeah, that's probably a good idea.

> I also got the perf data of the test runs with and without the patch.

So the perf data would be *much* more interesting for a more varied
load. I know pretty much exactly what happens with my silly
test-program, and as you can see it never really gets to the actual
spinlock, because that test program will only ever hit the fast-path
case.

It would be much more interesting to see another load that may trigger
the d_lock actually being taken. So:

> For the other test cases that I am interested in, like the AIM7 benchmark,
> your patch may not be as good as my original one. I got 1-3M JPM (varied
> quite a lot in different runs) in the short workloads on a 80-core system.
> My original one got 6M JPM. However, the test was done on 3.10 based kernel.
> So I need to do more test to see if that has an effect on the JPM results.

I'd really like to see a perf profile of that, particularly with some
call chain data for the relevant functions (ie "what it is that causes
us to get to spinlocks"). Because it may well be that you're hitting
some of the cases that I didn't see, and thus didn't notice.

In particular, I suspect AIM7 actually creates/deletes files and/or
renames them too. Or maybe I screwed up the dget_parent() special case
thing, which mattered because AIM7 did a lot of getcwd() calls or
something odd like that.

Linus

2013-08-30 19:21:09

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/30/2013 02:53 PM, Linus Torvalds wrote:
> So the perf data would be *much* more interesting for a more varied
> load. I know pretty much exactly what happens with my silly
> test-program, and as you can see it never really gets to the actual
> spinlock, because that test program will only ever hit the fast-path
> case. It would be much more interesting to see another load that may
> trigger the d_lock actually being taken. So:
>> For the other test cases that I am interested in, like the AIM7 benchmark,
>> your patch may not be as good as my original one. I got 1-3M JPM (varied
>> quite a lot in different runs) in the short workloads on a 80-core system.
>> My original one got 6M JPM. However, the test was done on 3.10 based kernel.
>> So I need to do more test to see if that has an effect on the JPM results.
> I'd really like to see a perf profile of that, particularly with some
> call chain data for the relevant functions (ie "what it is that causes
> us to get to spinlocks"). Because it may well be that you're hitting
> some of the cases that I didn't see, and thus didn't notice.
>
> In particular, I suspect AIM7 actually creates/deletes files and/or
> renames them too. Or maybe I screwed up the dget_parent() special case
> thing, which mattered because AIM7 did a lot of getcwd() calls or
> someting odd like that.
>
> Linus

Below is the perf data of my short workloads run in an 80-core DL980:

13.60% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--48.79%-- tty_ldisc_try
|--48.58%-- tty_ldisc_deref
--2.63%-- [...]

11.31% swapper [kernel.kallsyms] [k] intel_idle
|--99.94%-- cpuidle_enter_state
--0.06%-- [...]

4.86% reaim [kernel.kallsyms] [k] lg_local_lock
|--59.41%-- mntput_no_expire
|--19.37%-- path_init
|--15.14%-- d_path
|--5.88%-- sys_getcwd
--0.21%-- [...]

3.00% reaim reaim [.] mul_short

2.41% reaim reaim [.] mul_long
|--87.21%-- 0xbc614e
--12.79%-- (nil)

2.29% reaim reaim [.] mul_int

2.20% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--12.81%-- prepend_path
|--9.90%-- lockref_put_or_lock
|--9.62%-- __rcu_process_callbacks
|--8.77%-- load_balance
|--6.40%-- lockref_get
|--5.55%-- __mutex_lock_slowpath
|--4.85%-- __mutex_unlock_slowpath
|--4.83%-- inet_twsk_schedule
|--4.27%-- lockref_get_or_lock
|--2.19%-- task_rq_lock
|--2.13%-- sem_lock
|--2.09%-- scheduler_tick
|--1.88%-- try_to_wake_up
|--1.53%-- kmem_cache_free
|--1.30%-- unix_create1
|--1.22%-- unix_release_sock
|--1.21%-- process_backlog
|--1.11%-- unix_stream_sendmsg
|--1.03%-- enqueue_to_backlog
|--0.85%-- rcu_accelerate_cbs
|--0.79%-- unix_dgram_sendmsg
|--0.76%-- do_anonymous_page
|--0.70%-- unix_stream_recvmsg
|--0.69%-- unix_stream_connect
|--0.64%-- net_rx_action
|--0.61%-- tcp_v4_rcv
|--0.59%-- __do_fault
|--0.54%-- new_inode_pseudo
|--0.52%-- __d_lookup
--10.62%-- [...]

1.19% reaim [kernel.kallsyms] [k] mspin_lock
|--99.82%-- __mutex_lock_slowpath
--0.18%-- [...]

1.01% reaim [kernel.kallsyms] [k] lg_global_lock
|--51.62%-- __shmdt
--48.38%-- __shmctl

There is more contention in the lglock than I remember from the run on
3.10. This is an area that I need to look at. In fact, lglock is
becoming a problem for really large machines with a lot of cores. We have
a prototype 16-socket machine with 240 cores under development. The cost
of doing a lg_global_lock will be very high on that type of machine,
given that it is already high on this 80-core machine. I have been
thinking that instead of per-cpu spinlocks, we could change the locking
to the per-node level. While there will be more contention for
lg_local_lock, the cost of doing a lg_global_lock will be much lower, and
contention within the local die should not be too bad. That will require
either a per-node variable infrastructure or simulating it with the
existing per-cpu subsystem.
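
For illustration, a very rough sketch of what such a per-node variant
could look like. The nglock name, the flat per-node array and the
helpers are made up; lockdep annotations, CPU hotplug and memory
footprint concerns are all ignored here.

struct nglock {
	spinlock_t node_lock[MAX_NUMNODES];	/* one lock per NUMA node */
};

/* Lock "our" node's lock; the caller passes the returned node to the unlock. */
static inline int ng_local_lock(struct nglock *ng)
{
	int node = numa_node_id();

	spin_lock(&ng->node_lock[node]);
	return node;
}

static inline void ng_local_unlock(struct nglock *ng, int node)
{
	spin_unlock(&ng->node_lock[node]);
}

/* The global side sweeps a handful of node locks instead of one per CPU. */
static inline void ng_global_lock(struct nglock *ng)
{
	int node;

	for_each_node(node)
		spin_lock(&ng->node_lock[node]);
}

static inline void ng_global_unlock(struct nglock *ng)
{
	int node;

	for_each_node(node)
		spin_unlock(&ng->node_lock[node]);
}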

I will also need to look at ways to reduce the need to take d_lock in
existing code. One area that I am looking at is whether we can take out
the lock/unlock pair in prepend_path(). This function can only be called
with the rename_lock taken. So no filename change or deletion will be
allowed. It will only be a problem if somehow the dentry itself got
killed or dropped while the name is being copied out. The first dentry
referenced by the path structure should have a non-zero reference count,
so that shouldn't happen. I am not so sure about the parents of that
dentry as I am not so familiar with that part of the filesystem code.

Regards,
Longman

2013-08-30 19:33:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 12:20 PM, Waiman Long <[email protected]> wrote:
>
> Below is the perf data of my short workloads run in an 80-core DL980:

Ok, that doesn't look much like d_lock any more. Sure, there's a small
amount of spinlocking going on with lockref being involved, but on the
whole even that looks more like getcwd and other random things.

I do agree that getcwd() can probably be hugely optimized. Nobody has
ever bothered, because it's never really performance-critical, and I
think AIM7 ends up just doing something really odd. I bet we could fix
it entirely if we cared enough.

I just wonder if it's even worth it (I assume AIM7 is something HP
uses internally, because I've never really heard of anybody else
caring)

But I'll look at getcwd anyway.

Linus

2013-08-30 19:41:19

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 03:20:48PM -0400, Waiman Long wrote:

> There are more contention in the lglock than I remember for the run
> in 3.10. This is an area that I need to look at. In fact, lglock is
> becoming a problem for really large machine with a lot of cores. We
> have a prototype 16-socket machine with 240 cores under development.
> The cost of doing a lg_global_lock will be very high in that type of
> machine given that it is already high in this 80-core machine. I
> have been thinking about instead of per-cpu spinlocks, we could
> change the locking to per-node level. While there will be more
> contention for lg_local_lock, the cost of doing a lg_global_lock
> will be much lower and contention within the local die should not be
> too bad. That will require either a per-node variable infrastructure
> or simulated with the existing per-cpu subsystem.

Speaking of lglock, there's a low-hanging fruit in that area: we have
no reason whatsoever to put anything but regular files with FMODE_WRITE
on the damn per-superblock list - the *only* thing it's used for is
mark_files_ro(), which will skip everything except those. And since
read opens normally outnumber the writes quite a bit... Could you
try the diff below and see if it changes the picture? files_lglock
situation ought to get better...

diff --git a/fs/file_table.c b/fs/file_table.c
index b44e4c5..322cd37 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -385,6 +385,10 @@ static inline void __file_sb_list_add(struct file *file, struct super_block *sb)
*/
void file_sb_list_add(struct file *file, struct super_block *sb)
{
+ if (likely(!(file->f_mode & FMODE_WRITE)))
+ return;
+ if (!S_ISREG(file_inode(file)->i_mode))
+ return;
lg_local_lock(&files_lglock);
__file_sb_list_add(file, sb);
lg_local_unlock(&files_lglock);
@@ -450,8 +454,6 @@ void mark_files_ro(struct super_block *sb)

lg_global_lock(&files_lglock);
do_file_list_for_each_entry(sb, f) {
- if (!S_ISREG(file_inode(f)->i_mode))
- continue;
if (!file_count(f))
continue;
if (!(f->f_mode & FMODE_WRITE))

2013-08-30 19:53:09

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/30/2013 03:40 PM, Al Viro wrote:
> On Fri, Aug 30, 2013 at 03:20:48PM -0400, Waiman Long wrote:
>
>> There are more contention in the lglock than I remember for the run
>> in 3.10. This is an area that I need to look at. In fact, lglock is
>> becoming a problem for really large machine with a lot of cores. We
>> have a prototype 16-socket machine with 240 cores under development.
>> The cost of doing a lg_global_lock will be very high in that type of
>> machine given that it is already high in this 80-core machine. I
>> have been thinking about instead of per-cpu spinlocks, we could
>> change the locking to per-node level. While there will be more
>> contention for lg_local_lock, the cost of doing a lg_global_lock
>> will be much lower and contention within the local die should not be
>> too bad. That will require either a per-node variable infrastructure
>> or simulated with the existing per-cpu subsystem.
> Speaking of lglock, there's a low-hanging fruit in that area: we have
> no reason whatsoever to put anything but regular files with FMODE_WRITE
> on the damn per-superblock list - the *only* thing it's used for is
> mark_files_ro(), which will skip everything except those. And since
> read opens normally outnumber the writes quite a bit... Could you
> try the diff below and see if it changes the picture? files_lglock
> situation ought to get better...
>
>

Sure. I will try that out, but it probably won't help too much in this
test case. The perf profile that I sent out in my previous mail is only
partial. The actual one for lg_global_lock was:

1.01% reaim [kernel.kallsyms] [k] lg_global_lock
|
--- lg_global_lock
mntput_no_expire
mntput
__fput
____fput
task_work_run
do_notify_resume
int_signal
|
|--51.62%-- __shmdt
|
--48.38%-- __shmctl

So it is the mntput_no_expire() function that is doing all the
lg_global_lock() calls.

Regards,
Longman

2013-08-30 20:15:32

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/30/2013 03:33 PM, Linus Torvalds wrote:
> On Fri, Aug 30, 2013 at 12:20 PM, Waiman Long<[email protected]> wrote:
>> Below is the perf data of my short workloads run in an 80-core DL980:
> Ok, that doesn't look much like d_lock any more. Sure, there's a small
> amount of spinlocking going on with lockref being involved, but on the
> whole even that looks more like getcwd and other random things.

Yes, d_lock contention isn't a major one in the perf profile. However,
sometimes a small improvement can lead to a noticeable improvement in
performance.
> I do agree that getcwd() can probably be hugely optimized. Nobody has
> ever bothered, because it's never really performance-critical, and I
> think AIM7 ends up just doing something really odd. I bet we could fix
> it entirely if we cared enough.
The prepend_path() isn't all due to getcwd. The correct profile should be:

|--12.81%-- prepend_path
|          |
|          |--67.35%-- d_path
|          |          |
|          |          |--60.72%-- proc_pid_readlink
|          |          |           sys_readlinkat
|          |          |           sys_readlink
|          |          |           system_call_fastpath
|          |          |           __GI___readlink
|          |          |           0x302f64662f666c
|          |          |
|          |           --39.28%-- perf_event_mmap_event
|          |
|           --32.65%-- sys_getcwd
|                      system_call_fastpath
|                      __getcwd

Yes, the perf subsystem itself can contribute a sizeable portion of the
spinlock contention. In fact, I have also applied my seqlock patch that
was sent a while ago to the test kernel in order to get a more accurate
perf profile. The seqlock patch will allow concurrent d_path() calls
without one blocking the others. On the 240-core prototype machine, it
was not possible to get an accurate perf profile for some workloads
because more than 50% of the time was spent in spinlock contention due
to the use of perf. An accurate perf profile can only be obtained in
those cases by applying my lockref and seqlock patches. I hope someone
will have the time to review my seqlock patch to see what additional
changes will be needed. I would really like to see it merged in some
form into 3.12.

> I just wonder if it's even worth it (I assume AIM7 is something HP
> uses internally, because I've never really heard of anybody else
> caring)

Our performance group is actually pretty new. It was formed 2 years ago
and we began actively participating in the Linux kernel development just
in the past year.

We use the AIM7 benchmark internally primarily because it is easy to run
and covers quite a lot of different areas in the kernel. We are also
using specJBB and SwingBench for performance benchmarking. We are
looking for more benchmarks to use in the future.

Regards,
Longman

2013-08-30 20:26:21

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 03:52:49PM -0400, Waiman Long wrote:

> So it is the mntput_no_expire() function that is doing all the
> lg_global_lock() calls.

Interesting... So you are getting a lot of mntput() with ->mnt_ns being
NULL? I wonder which type it is... Note that anything mounted will
have non-NULL ->mnt_ns until umount and anything obtained via
kern_mount/kern_mount_data will also have a non-NULL ->mnt_ns - until
kern_unmount().

Could you try to gather stats of that sort? Normally that path should
be only hit by fput() when we have a lazy-unmounted fs and close an opened
file on it...

I see one potential stupidity in that area (simple_pin_fs() ought to set
->mnt_ns, with simple_release_fs() clearing it), but there's not a lot
of fs types that would use the damn thing...

2013-08-30 20:36:07

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/30/2013 04:26 PM, Al Viro wrote:
> On Fri, Aug 30, 2013 at 03:52:49PM -0400, Waiman Long wrote:
>
> >> So it is the mntput_no_expire() function that is doing all the
>> lg_global_lock() calls.
> Interesting... So you are getting a lot of mntput() with ->mnt_ns being
> NULL? I wonder which type it is... Note that anything mounted will
> have non-NULL ->mnt_ns until umount and anything obtained via
> kern_mount/kern_mount_data will also have a non-NULL ->mnt_ns - until
> kern_unmount().
>
> Could you try to gather stats of that sort? Normally that path should
> be only hit by fput() when we have a lazy-unmounted fs and close an opened
> file on it...
>
> I see one potential stupidity in that area (simple_pin_fs() ought to set
> ->mnt_ns, with simple_release_fs() clearing it), but there's not a lot
> of fs types that would use the damn thing...

The AIM7 test was run on a set of 16 ramdisks formatted with the ext3
filesystem with the following mount options:
barrier=0,async,noatime,nodiratime. Maybe that is a factor.

Regards,
Longman

2013-08-30 20:43:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 1:15 PM, Waiman Long <[email protected]> wrote:
>
> The prepend_path() isn't all due to getcwd. The correct profile should be

Ugh. I really think that prepend_path() should just be rewritten to
run entirely under RCU.

Then we can remove *all* the stupid locking, and replace it with doing
a read-lock on the rename sequence count, and repeating if required.

That shouldn't even be hard to do, it just requires mindless massaging
and being careful.

Linus

2013-08-30 20:49:03

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 04:35:49PM -0400, Waiman Long wrote:

> The AIM7 test was run on a set of 16 ramdisk formated with ext3
> filesystem with the following mount options:
> barrier=0,async,noatime,nodiratime. Maybe that is a factor.

I would be really surprised if it was... Could you slap the following
into __fput():

	struct mount *m = real_mount(mnt);
	if (unlikely(!m->mnt_ns)) {
		printk(KERN_INFO "type = %s",
			mnt->mnt_sb->s_type->name);
		WARN_ON(1);
	}
and see what it catches? That'll need #include "fs/mount.h" in
fs/file_table.c to compile...

2013-08-30 20:54:12

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 01:43:11PM -0700, Linus Torvalds wrote:
> On Fri, Aug 30, 2013 at 1:15 PM, Waiman Long <[email protected]> wrote:
> >
> > The prepend_path() isn't all due to getcwd. The correct profile should be
>
> Ugh. I really think that prepend_path() should just be rewritten to
> run entirely under RCU.
>
> Then we can remove *all* the stupid locking, and replace it with doing
> a read-lock on the rename sequence count, and repeating if requited.
>
> That shouldn't even be hard to do, it just requires mindless massaging
> and being careful.

Not really. Sure, you'll retry it if you race with d_move(); that's not
the real problem - access past the end of the object containing ->d_name.name
would screw you and that's what ->d_lock is preventing there. Delayed freeing
of what ->d_name is pointing into is fine, but it's not the only way to get
hurt there...

2013-08-30 21:04:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 1:54 PM, Al Viro <[email protected]> wrote:
>
> Not really. Sure, you'll retry it if you race with d_move(); that's not
> the real problem - access past the end of the object containing ->d_name.name
> would screw you and that's what ->d_lock is preventing there. Delayed freeing
> of what ->d_name is pointing into is fine, but it's not the only way to get
> hurt there...

Umm? We follow d->d_name.name without d_lock under RCU all the time -
that's what the pathname lookup is all about, after all.

Yes, yes, you have to be careful and cannot just blindly trust the
length: you also have to check for NUL character as you are copying it
and stop if you hit it. But that's trivial.

Why would d_prepend be any different?

Linus
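
A bare-bones sketch of that kind of careful copy follows; the function
name is made up, and the %pd patch later in this thread does essentially
the same thing, plus field-width padding.

static size_t copy_dentry_name_rcu(char *dst, size_t dstlen,
				   const struct dentry *dentry)
{
	const char *s;
	size_t n = 0;
	char c;

	rcu_read_lock();
	s = ACCESS_ONCE(dentry->d_name.name);
	while (n + 1 < dstlen && (c = *s++) != 0)
		dst[n++] = c;		/* stop at NUL even if d_move() races */
	rcu_read_unlock();
	dst[n] = '\0';
	return n;
}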

2013-08-30 21:11:04

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/30/2013 04:54 PM, Al Viro wrote:
> On Fri, Aug 30, 2013 at 01:43:11PM -0700, Linus Torvalds wrote:
>> On Fri, Aug 30, 2013 at 1:15 PM, Waiman Long<[email protected]> wrote:
>>> The prepend_path() isn't all due to getcwd. The correct profile should be
>> Ugh. I really think that prepend_path() should just be rewritten to
>> run entirely under RCU.
>>
>> Then we can remove *all* the stupid locking, and replace it with doing
>> a read-lock on the rename sequence count, and repeating if requited.
>>
>> That shouldn't even be hard to do, it just requires mindless massaging
>> and being careful.
> Not really. Sure, you'll retry it if you race with d_move(); that's not
> the real problem - access past the end of the object containing ->d_name.name
> would screw you and that's what ->d_lock is preventing there. Delayed freeing
> of what ->d_name is pointing into is fine, but it's not the only way to get
> hurt there...

Actually, prepend_path() was called with rename_lock taken. So d_move()
couldn't be run at the same time. Am I right?

Regards,
Longman

2013-08-30 21:23:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 2:10 PM, Waiman Long <[email protected]> wrote:
>
> Actually, prepend_path() was called with rename_lock taken. So d_move()
> couldn't be run at the same time. Am I right?

Al was discussing the case I mentioned: getting rid of that lock
entirely, running it all just under RCU, and then just checking the
rename sequence count around it all and retrying if required.

It would have the advantage of not only not having to get the lock,
but by doing it as an RCU walk, we would avoid all the nasty reference
counting costs too. We wouldn't even need to get refcounts on the
root/pwd entries (which currently cost us quite a bit), since we could
just check the sequence number in "struct fs_struct" too. That also
gets rid of the necessity for the fs->lock spinlock.

You do have to be a bit careful when following the dentry pointers
under RCU (and you cannot just do a "memcpy()" on the name, as Al
points out), but it really doesn't look nasty. It just looks "you have
to be careful".

Linus
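
The overall shape Linus describes would be something like the sketch
below. The __prepend_path_rcu() helper is hypothetical and the actual
RCU walk of ->d_parent is elided; the fs_struct sequence count he
mentions is only hinted at in a comment.

static int get_path_rcu(const struct path *path, const struct path *root,
			char *buf, int buflen)
{
	unsigned seq;
	int error;

	rcu_read_lock();
	do {
		seq = read_seqbegin(&rename_lock);
		/*
		 * Walk ->d_parent and copy the names without taking any
		 * d_lock; a similar check against fs->seq could replace
		 * the refcounts on the root/pwd dentries.
		 */
		error = __prepend_path_rcu(path, root, buf, buflen);
	} while (read_seqretry(&rename_lock, seq));
	rcu_read_unlock();

	return error;
}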

2013-08-30 21:30:40

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 05:10:45PM -0400, Waiman Long wrote:
> On 08/30/2013 04:54 PM, Al Viro wrote:
> >On Fri, Aug 30, 2013 at 01:43:11PM -0700, Linus Torvalds wrote:
> >>On Fri, Aug 30, 2013 at 1:15 PM, Waiman Long<[email protected]> wrote:
> >>>The prepend_path() isn't all due to getcwd. The correct profile should be
> >>Ugh. I really think that prepend_path() should just be rewritten to
> >>run entirely under RCU.
> >>
> >>Then we can remove *all* the stupid locking, and replace it with doing
^^^^^^^^^^^^^^^^^^^^^
> >>a read-lock on the rename sequence count, and repeating if requited.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >>
> >>That shouldn't even be hard to do, it just requires mindless massaging
> >>and being careful.
> >Not really. Sure, you'll retry it if you race with d_move(); that's not
> >the real problem - access past the end of the object containing ->d_name.name
> >would screw you and that's what ->d_lock is preventing there. Delayed freeing
> >of what ->d_name is pointing into is fine, but it's not the only way to get
> >hurt there...
>
> Actually, prepend_path() was called with rename_lock taken. So
> d_move() couldn't be run at the same time. Am I right?

See above. You are right, but if Linus wants to turn that sucker into
reader (which is possible - see e.g. cifs build_path_from_dentry() and its
ilk), d_move() races will start to play.

2013-08-30 21:43:15

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/30/2013 05:30 PM, Al Viro wrote:
> On Fri, Aug 30, 2013 at 05:10:45PM -0400, Waiman Long wrote:
> See above. You are right, but if Linus wants to turn that sucker into
> reader (which is possible - see e.g. cifs build_path_from_dentry() and
> its ilk), d_move() races will start to play.

Thank for the clarification.

-Longman

2013-08-30 21:45:03

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 02:03:59PM -0700, Linus Torvalds wrote:

> Yes, yes, you have to be careful and cannot just blindly trust the
> length: you also have to check for NUL character as you are copying it
> and stop if you hit it. But that's trivial.

Point... Actually, I wonder if _that_ could be a solution for ->d_name.name
printk races as well. Remember that story? You objected against taking
spinlocks in printk, no matter how specialized and how narrow the area
over which those are taken, but rcu_read_lock/rcu_read_unlock should be
OK... Something like %pd expecting dentry pointer and producing dentry
name. Sure, we still get garbage if we race with d_move(), but at least
it's a contained garbage that way...

2013-08-30 22:30:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 2:44 PM, Al Viro <[email protected]> wrote:
>
> Point... Actually, I wonder if _that_ could be a solution for ->d_name.name
> printk races as well. Remember that story? You objected against taking
> spinlocks in printk, no matter how specialized and how narrow the area
> over which those are taken, but rcu_read_lock/rcu_read_unlock should be
> OK... Something like %pd expecting dentry pointer and producing dentry
> name. Sure, we still get garbage if we race with d_move(), but at least
> it's a contained garbage that way...

Yes, that sounds quite reasonable. For printk, we'd probably want to
limit the max size and depth to something fairly small (32 bytes, max
four deep or something), and we cannot take cwd/root into account
since it can happen from interrupts, but other than that it doesn't
sound horrible.

Linus
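
The conversion at call sites would then be trivial; roughly (the message
text here is made up):

	/* today: open-coded and racy against d_move() */
	printk(KERN_WARNING "failing dentry: %s\n", dentry->d_name.name);

	/* with the proposed %pd: the name is copied safely under RCU */
	printk(KERN_WARNING "failing dentry: %pd\n", dentry);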

2013-08-31 02:02:56

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/30/2013 04:48 PM, Al Viro wrote:
> On Fri, Aug 30, 2013 at 04:35:49PM -0400, Waiman Long wrote:
>
>> The AIM7 test was run on a set of 16 ramdisk formated with ext3
>> filesystem with the following mount options:
>> barrier=0,async,noatime,nodiratime. Maybe that is a factor.
> I would be really surprised if it was... Could you slap the following
> into __fput():
>
> struct mount *m = real_mount(mnt);
> if (unlikely(!m->mnt_ns)) {
> printk(KERN_INFO "type = %s",
> mnt->mnt_sb->s_type->name);
> WARN_ON(1);
> }
> and see what it catches? That'll need #include "fs/mount.h" in
> fs/file_table.c to compile...

I slapped in the code segment, and the following was logged:

[ 340.871590] type = tmpfs
[ 340.871596] ------------[ cut here ]------------
[ 340.871606] WARNING: CPU: 37 PID: 63276 at fs/file_table.c:239
__fput+0x23d/0x270()
[ 340.871607] Modules linked in: brd(F) ip6table_filter(F)
ip6_tables(F) iptable_filter(F) ip_tables(F) ebtable_nat(F) ebtables(F)
x_tables(F) edd(F) af_packet(F) bridge(F) stp(F) llc(F)
cpufreq_conservative(F) cpufreq_userspace(F) cpufreq_powersave(F)
pcc_cpufreq(F) microcode(F) fuse(F) loop(F) vhost_net(F) macvtap(F)
macvlan(F) vhost(F) tun(F) kvm_intel(F) ipv6(F) kvm(F) iTCO_wdt(F)
iTCO_vendor_support(F) joydev(F) igb(F) tpm_infineon(F) dca(F) ptp(F)
hid_generic(F) i7core_edac(F) sr_mod(F) tpm_tis(F) pps_core(F) qlcnic(F)
tpm(F) edac_core(F) be2net(F) netxen_nic(F) lpc_ich(F) ehci_pci(F)
mfd_core(F) hpwdt(F) hpilo(F) pcspkr(F) serio_raw(F) cdrom(F)
tpm_bios(F) sg(F) rtc_cmos(F) mperf(F) button(F) acpi_power_meter(F)
ext3(F) jbd(F) mbcache(F) dm_mirror(F) dm_region_hash(F) dm_log(F)
linear(F) radeon(F) ttm(F) drm_kms_helper(F) drm(F) i2c_algo_bit(F)
i2c_core(F) usbhid(F) hid(F) uhci_hcd(F) ehci_hcd(F) usbcore(F)
qla2xxx(F) sd_mod(F) usb_common(F) thermal(F) processor(F)
thermal_sys(F) hwmon(F) scsi_dh_emc(F) scsi_dh_rdac(F) scsi_dh_hp_sw(F)
scsi_dh_alua(F) scsi_dh(F) dm_snapshot(F) dm_mod(F) ata_generic(F)
ata_piix(F) libata(F) hpsa(F) lpfc(F) scsi_transport_fc(F) scsi_tgt(F)
crc_t10dif(F) cciss(F) scsi_mod(F)
[ 340.871663] CPU: 37 PID: 63276 Comm: reaim Tainted: GF W
3.11.0-rc7-lockref-0.11-default #4
[ 340.871665] Hardware name: HP ProLiant DL980 G7, BIOS P66 07/30/2012
[ 340.871667] 00000000000000ef ffff899f6b03dd88 ffffffff814992f5
ffff899f6b03ddc8
[ 340.871681] ffffffff8104a187 ffff899f6b03dda8 0000000000000000
ffff899f6b03aa40
[ 340.871686] ffff899f68fc7cc0 ffff899f69ee8ae0 ffff899f69ee8ae0
ffff899f6b03ddd8
[ 340.871692] Call Trace:
[ 340.871700] [<ffffffff814992f5>] dump_stack+0x6a/0x7d
[ 340.871705] [<ffffffff8104a187>] warn_slowpath_common+0x87/0xb0
[ 340.871709] [<ffffffff8104a1c5>] warn_slowpath_null+0x15/0x20
[ 340.871712] [<ffffffff8116c5bd>] __fput+0x23d/0x270
[ 340.871715] [<ffffffff8116c699>] ____fput+0x9/0x10
[ 340.871719] [<ffffffff81068ae1>] task_work_run+0xb1/0xe0
[ 340.871724] [<ffffffff81002990>] do_notify_resume+0x80/0x1b0
[ 340.871728] [<ffffffff811ed120>] ? ipc_lock+0x30/0x50
[ 340.871732] [<ffffffff8113a356>] ? remove_vma+0x56/0x60
[ 340.871736] [<ffffffff8113c1bf>] ? do_munmap+0x34f/0x380
[ 340.871741] [<ffffffff814a5fda>] int_signal+0x12/0x17
[ 340.871744] ---[ end trace aafa6c45f3388d65 ]---

Regards,
Longman

2013-08-31 02:35:44

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 10:02:36PM -0400, Waiman Long wrote:

> I slapped in the code segment, and the following was logged:
>
> [ 340.871590] type = tmpfs
> [ 340.871712] [<ffffffff8116c5bd>] __fput+0x23d/0x270
> [ 340.871715] [<ffffffff8116c699>] ____fput+0x9/0x10
> [ 340.871719] [<ffffffff81068ae1>] task_work_run+0xb1/0xe0
> [ 340.871724] [<ffffffff81002990>] do_notify_resume+0x80/0x1b0
> [ 340.871728] [<ffffffff811ed120>] ? ipc_lock+0x30/0x50
> [ 340.871732] [<ffffffff8113a356>] ? remove_vma+0x56/0x60
> [ 340.871736] [<ffffffff8113c1bf>] ? do_munmap+0x34f/0x380
> [ 340.871741] [<ffffffff814a5fda>] int_signal+0x12/0x17
> [ 340.871744] ---[ end trace aafa6c45f3388d65 ]---

Aha... OK, I see what's going on. We end up with shm_mnt *not* marked
as long-living vfsmount, even though it lives forever. See if the
following helps; if it does (and I very much expect it to), we want to
put it in -stable. As it is, you get slow path in mntput() each time
a file created by shmem_file_setup() gets closed. For no reason whatsoever...

Signed-off-by: Al Viro <[email protected]>
---
diff --git a/mm/shmem.c b/mm/shmem.c
index e43dc55..445162c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2615,7 +2615,7 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
* tmpfs instance, limiting inodes to one per page of lowmem;
* but the internal instance is left unlimited.
*/
- if (!(sb->s_flags & MS_NOUSER)) {
+ if (!(sb->s_flags & MS_KERNMOUNT)) {
sbinfo->max_blocks = shmem_default_max_blocks();
sbinfo->max_inodes = shmem_default_max_inodes();
if (shmem_parse_options(data, sbinfo, false)) {
@@ -2831,8 +2831,7 @@ int __init shmem_init(void)
goto out2;
}

- shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER,
- shmem_fs_type.name, NULL);
+ shm_mnt = kern_mount(&shmem_fs_type);
if (IS_ERR(shm_mnt)) {
error = PTR_ERR(shm_mnt);
printk(KERN_ERR "Could not kern_mount tmpfs\n");

2013-08-31 02:42:40

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sat, Aug 31, 2013 at 03:35:16AM +0100, Al Viro wrote:

> Aha... OK, I see what's going on. We end up with shm_mnt *not* marked
> as long-living vfsmount, even though it lives forever. See if the
> following helps; if it does (and I very much expect it to), we want to
> put it in -stable. As it is, you get slow path in mntput() each time
> a file created by shmem_file_setup() gets closed. For no reason whatsoever...

We still want MS_NOUSER on shm_mnt, so we'd better make sure that
shmem_fill_super() sets it on the internal instance... Fixed variant
follows:

Signed-off-by: Al Viro <[email protected]>
diff --git a/mm/shmem.c b/mm/shmem.c
index e43dc55..5261498 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2615,13 +2615,15 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
* tmpfs instance, limiting inodes to one per page of lowmem;
* but the internal instance is left unlimited.
*/
- if (!(sb->s_flags & MS_NOUSER)) {
+ if (!(sb->s_flags & MS_KERNMOUNT)) {
sbinfo->max_blocks = shmem_default_max_blocks();
sbinfo->max_inodes = shmem_default_max_inodes();
if (shmem_parse_options(data, sbinfo, false)) {
err = -EINVAL;
goto failed;
}
+ } else {
+ sb->s_flags |= MS_NOUSER;
}
sb->s_export_op = &shmem_export_ops;
sb->s_flags |= MS_NOSEC;
@@ -2831,8 +2833,7 @@ int __init shmem_init(void)
goto out2;
}

- shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER,
- shmem_fs_type.name, NULL);
+ shm_mnt = kern_mount(&shmem_fs_type);
if (IS_ERR(shm_mnt)) {
error = PTR_ERR(shm_mnt);
printk(KERN_ERR "Could not kern_mount tmpfs\n");

2013-08-31 03:06:39

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

Just noticing that you are adding several functions that return a boolean
value as an int. And a "gotref" local variable.

Is that just not wanting to bother with these newfangled C99 innovations,
or do you dislike the "bool" type for some reason?

Even if it doesn't change the code in the slightest, I like to declare
things with the bool type for documentation. I can see avoiding code
churn, but this is all new code, so I thought I'd ask.

(FWIW, stdbool.h was in gcc 3.2, which README says is the minimum
supported version, although that's probably outdated information.)

2013-08-31 17:16:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 8:06 PM, George Spelvin <[email protected]> wrote:
> Just noticing that you are adding several functions that return a boolean
> value as an int. And a "gotref" local variable.
>
> Is that just not wanting to bother with these newfangled C99 innovations,
> or do you dislike the "bool" type for some reason?

I don't use "bool" in code I write. I don't think it adds any actual
value, and I think the data type is badly designed and of dubious
value in C. It has very few actual advantages.

That said, it's not like I *hate* the type, and I won't remove bool
from code other people write. I just think it's superfluous and
stupid, and another case of C++ people thinking too much "this is a
cool feature" without then actually doing it well. The C people then
picked it up because it was less onerous than some other C++ features,
and all the compilers had the logic anyway.

If "bool" had real advantages (like having a dense array
representation, for example), that would be one thing. It doesn't.
Sure, now you can take an address of a bool (which you couldn't
generally do efficiently if it really was a bit array), but it also
means that in practice, "bool" is normally nothing but "char" with
some really odd and special implicit type casting rules.

I doubt most people really even understand how "bool" casting works.

And bool is actually really *dangerous* to use if you don't understand
it. There are people who use "bool", but then because they want to be
portable, they have a compatibility #ifdef or other configuration
thing that does something like

typedef int bool;
#define true 1
#define false 0

and it will actually work. Most of the time. And then the semantic
differences from a _real_ C compiler that supports the C99 _Bool/bool
type are really really subtle.

IOW, bool has very few real upsides, and it has a real downside: it's
subtle, and people really never even seem to _realize_ just how subtle
it is. I suspect that if you ask ten random average C programmers if
the above is equivalent to stdbool.h, nine of them will say "sure".

And they'd be *COMPLETELY* wrong.
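
[Editor's illustration, not part of the original mail: a minimal userspace
sketch of the subtlety being described. The value 2 survives assignment to
the "typedef int" kludge but is collapsed to 1 by a real C99 _Bool, so the
two versions disagree about whether the flag equals "true".]

#include <stdbool.h>
#include <stdio.h>

typedef int fake_bool;          /* the compatibility kludge from above */

int main(void)
{
        bool      real = 2;     /* C99 _Bool: any nonzero value becomes 1 */
        fake_bool fake = 2;     /* plain int: stays 2 */

        /* prints "1 0": real == true, but fake != true */
        printf("%d %d\n", real == true, fake == true);
        return 0;
}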

So no. I'm definitely not a fan of bool. I think there are better
types, and I think there are better ways to document things.

Linus

2013-08-31 21:24:12

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 03:30:14PM -0700, Linus Torvalds wrote:
> On Fri, Aug 30, 2013 at 2:44 PM, Al Viro <[email protected]> wrote:
> >
> > Point... Actually, I wonder if _that_ could be a solution for ->d_name.name
> > printk races as well. Remember that story? You objected against taking
> > spinlocks in printk, no matter how specialized and how narrow the area
> > over which those are taken, but rcu_read_lock/rcu_read_unlock should be
> > OK... Something like %pd expecting dentry pointer and producing dentry
> > name. Sure, we still get garbage if we race with d_move(), but at least
> > it's a contained garbage that way...
>
> Yes, that sounds quite reasonable. For printk, we'd probably want to
> limit the max size and depth to something fairly small (32 bytes, max
> four deep or something), and we cannot take cwd/root into account
> since it can happen from interrupts, but other than that it doesn't
> sound horrible.

Hmm... OK, most of these suckers are actually doing just one component;
we can look into 'print the ancestors as well' later, but the minimal
variant would be something like this and it already covers a lot of those
guys. Comments?

diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt
index 3e8cb73..259f8c3 100644
--- a/Documentation/printk-formats.txt
+++ b/Documentation/printk-formats.txt
@@ -168,6 +168,13 @@ UUID/GUID addresses:
Where no additional specifiers are used the default little endian
order with lower case hex characters will be printed.

+dentry names:
+ %pd
+
+ For printing dentry name; if we race with d_move(), the name might be
+ a mix of old and new ones, but it won't oops. %pd dentry is a safer
+ equivalent of %s dentry->d_name.name we used to use.
+
struct va_format:

%pV
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 739a3636..941509e 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -26,6 +26,7 @@
#include <linux/math64.h>
#include <linux/uaccess.h>
#include <linux/ioport.h>
+#include <linux/dcache.h>
#include <net/addrconf.h>

#include <asm/page.h> /* for PAGE_SIZE */
@@ -532,6 +533,56 @@ char *string(char *buf, char *end, const char *s, struct printf_spec spec)
return buf;
}

+static void widen(char *buf, char *end, unsigned len, unsigned spaces)
+{
+ size_t size;
+ if (buf >= end) /* nowhere to put anything */
+ return;
+ size = end - buf;
+ if (size <= spaces) {
+ memset(buf, ' ', size);
+ return;
+ }
+ if (len) {
+ if (len > size - spaces)
+ len = size - spaces;
+ memmove(buf + spaces, buf, len);
+ }
+ memset(buf, ' ', spaces);
+}
+
+static noinline_for_stack
+char *dentry_name(char *buf, char *end, const struct dentry *d, struct printf_spec spec)
+{
+ int n;
+ const char *s;
+ char *p = buf;
+ char c;
+
+ rcu_read_lock();
+ s = ACCESS_ONCE(d->d_name.name);
+ for (n = 0; n != spec.precision && (c = *s++) != 0; n++) {
+ if (buf < end)
+ *buf = c;
+ buf++;
+ }
+ rcu_read_unlock();
+ if (n < spec.field_width) {
+ /* we want to pad the sucker */
+ unsigned spaces = spec.field_width - n;
+ if (!(spec.flags & LEFT)) {
+ widen(p, end, n, spaces);
+ return buf + spaces;
+ }
+ while (spaces--) {
+ if (buf < end)
+ *buf = ' ';
+ ++buf;
+ }
+ }
+ return buf;
+}
+
static noinline_for_stack
char *symbol_string(char *buf, char *end, void *ptr,
struct printf_spec spec, const char *fmt)
@@ -1253,6 +1304,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
spec.base = 16;
return number(buf, end,
(unsigned long long) *((phys_addr_t *)ptr), spec);
+ case 'd':
+ return dentry_name(buf, end, ptr, spec);
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
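
[Editor's note: a hypothetical caller, for illustration only -- not a hunk
from the patch above. It shows the intended conversion: hand printk the
dentry itself and let %pd copy the name under rcu_read_lock() instead of
dereferencing ->d_name.name directly.]

static void report_stale(struct dentry *dentry)
{
        /* old style, reads ->d_name.name directly and races with d_move(): */
        printk(KERN_DEBUG "stale dentry %s\n", dentry->d_name.name);
        /* with the %pd format added above: */
        printk(KERN_DEBUG "stale dentry %pd\n", dentry);
}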

2013-08-31 22:49:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sat, Aug 31, 2013 at 2:23 PM, Al Viro <[email protected]> wrote:
>
> Hmm... OK, most of these suckers are actually doing just one component;
> we can look into 'print the ancestors as well' later, but the minimal
> variant would be something like this and it already covers a lot of those
> guys. Comments?

Doesn't look wrong, but remember the /proc debugging thing? We
definitely wanted more than just one pathname component, and I don't
think that's completely rare.

So I think it would be better to prepare for that, and simply print to
a local buffer, and then use the "string()" function on the end
result. Rather than do it directly from the dentry like you do, and
then having to do that widen() thing because you couldn't do the
strnlen() that that code wanted..

Hmm?

Linus

2013-08-31 23:28:15

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sat, Aug 31, 2013 at 03:49:31PM -0700, Linus Torvalds wrote:
> On Sat, Aug 31, 2013 at 2:23 PM, Al Viro <[email protected]> wrote:
> >
> > Hmm... OK, most of these suckers are actually doing just one component;
> > we can look into 'print the ancestors as well' later, but the minimal
> > variant would be something like this and it already covers a lot of those
> > guys. Comments?
>
> Doesn't look wrong, but remember the /proc debugging thing? We
> definitely wanted more than just one pathname component, and I don't
> think that's completely rare.
>
> So I think it would be better to prepare for that, and simply print to
> a local buffer, and then use the "string()" function on the end
> result. Rather than do it directly from the dentry like you do, and
> then having to do that widen() thing because you couldn't do the
> strnlen() that that code wanted..

Actually, right now I'm debugging a variant that avoids local buffers; use
is %pD3 for grandparent/parent/name, etc., up to %pD4. %pd is equivalent
to %pD1 (just the dentry name). Keep in mind that things like NFS use
a _lot_ of what would be %pD2 in debugging printks and the string can grow
fairly long, so I'd rather live with widen() than mess with local buffers
here. I'll send an updated variant when I'm more or less satisfied with
it...

2013-09-01 00:13:21

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 01, 2013 at 12:27:58AM +0100, Al Viro wrote:
> On Sat, Aug 31, 2013 at 03:49:31PM -0700, Linus Torvalds wrote:
> > On Sat, Aug 31, 2013 at 2:23 PM, Al Viro <[email protected]> wrote:
> > >
> > > Hmm... OK, most of these suckers are actually doing just one component;
> > > we can look into 'print the ancestors as well' later, but the minimal
> > > variant would be something like this and it already covers a lot of those
> > > guys. Comments?
> >
> > Doesn't look wrong, but remember the /proc debugging thing? We
> > definitely wanted more than just one pathname component, and I don't
> > think that's completely rare.
> >
> > So I think it would be better to prepare for that, and simply print to
> > a local buffer, and then use the "string()" function on the end
> > result. Rather than do it directly from the dentry like you do, and
> > then having to do that widen() thing because you couldn't do the
> > strnlen() that that code wanted..
>
> Actually, right now I'm debugging a variant that avoids local buffers; use
> is %pD3 for grandparent/parent/name, etc., up to %pD4. %pd is equivalent
> to %pD1 (just the dentry name). Keep in mind that things like NFS use
> a _lot_ of what would be %pD2 in debugging printks and the string can grow
> fairly long, so I'd rather live with widen() than mess with local buffers
> here. I'll send an updated variant when I'm more or less satisfied with
> it...

Seems to be working... This doesn't include the metric arseload of
conversions in fs/*/* - just the sprintf part.

diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt
index 3e8cb73..826147b 100644
--- a/Documentation/printk-formats.txt
+++ b/Documentation/printk-formats.txt
@@ -168,6 +168,17 @@ UUID/GUID addresses:
Where no additional specifiers are used the default little endian
order with lower case hex characters will be printed.

+dentry names:
+ %pd
+ %pD1
+ ...
+ %pD4
+
+ For printing dentry name; if we race with d_move(), the name might be
+ a mix of old and new ones, but it won't oops. %pd dentry is a safer
+ equivalent of %s dentry->d_name.name we used to use, %pD<n> prints
+ n last components (IOW, %pD1 is equivalent to %pd).
+
struct va_format:

%pV
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 739a3636..5db62bf 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -26,6 +26,7 @@
#include <linux/math64.h>
#include <linux/uaccess.h>
#include <linux/ioport.h>
+#include <linux/dcache.h>
#include <net/addrconf.h>

#include <asm/page.h> /* for PAGE_SIZE */
@@ -532,6 +533,88 @@ char *string(char *buf, char *end, const char *s, struct printf_spec spec)
return buf;
}

+static void widen(char *buf, char *end, unsigned len, unsigned spaces)
+{
+ size_t size;
+ if (buf >= end) /* nowhere to put anything */
+ return;
+ size = end - buf;
+ if (size <= spaces) {
+ memset(buf, ' ', size);
+ return;
+ }
+ if (len) {
+ if (len > size - spaces)
+ len = size - spaces;
+ memmove(buf + spaces, buf, len);
+ }
+ memset(buf, ' ', spaces);
+}
+
+static noinline_for_stack
+char *dentry_name(char *buf, char *end, const struct dentry *d, struct printf_spec spec,
+ int depth)
+{
+ int i, n = 0;
+ const char *s;
+ char *p = buf;
+ const struct dentry *array[4];
+ char c;
+
+ if (depth < 0) {
+ depth = 1;
+ WARN_ON(1);
+ }
+ if (depth > 4) {
+ depth = 4;
+ WARN_ON(1);
+ }
+
+ rcu_read_lock();
+ for (i = 0; i < depth; i++) {
+ struct dentry *p = ACCESS_ONCE(d->d_parent);
+ array[i] = d;
+ if (d == p)
+ break;
+ d = p;
+ }
+ if (!i) { /* root dentry has a bloody inconvenient name */
+ i++;
+ goto do_name;
+ }
+ if (i == depth)
+ goto do_name;
+ while (i && n != spec.precision) {
+ if (buf < end)
+ *buf = '/';
+ buf++;
+ n++;
+do_name:
+ s = ACCESS_ONCE(array[--i]->d_name.name);
+ while (n != spec.precision && (c = *s++) != '\0') {
+ if (buf < end)
+ *buf = c;
+ buf++;
+ n++;
+ }
+ }
+ rcu_read_unlock();
+ if (n < spec.field_width) {
+ /* we want to pad the sucker */
+ unsigned spaces = spec.field_width - n;
+ if (!(spec.flags & LEFT)) {
+ widen(p, end, n, spaces);
+ return buf + spaces;
+ }
+ while (spaces--) {
+ if (buf < end)
+ *buf = ' ';
+ ++buf;
+ }
+ }
+ return buf;
+}
+
static noinline_for_stack
char *symbol_string(char *buf, char *end, void *ptr,
struct printf_spec spec, const char *fmt)
@@ -1253,6 +1336,14 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
spec.base = 16;
return number(buf, end,
(unsigned long long) *((phys_addr_t *)ptr), spec);
+ case 'd':
+ return dentry_name(buf, end, ptr, spec, 1);
+ case 'D':
+ switch (fmt[1]) {
+ case '1': case '2': case '3': case '4':
+ return dentry_name(buf, end, ptr, spec, fmt[1] - '0');
+ }
+ break;
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
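
[Editor's note: a hypothetical caller of the variant above, for illustration
only. In this version %pD<n> still takes a struct dentry * and prints the
last n path components; a later mail in the thread proposes reserving %pD
for struct file, so treat the exact letters as provisional.]

static void debug_rename(struct dentry *old, struct dentry *new)
{
        /* e.g. "rename dir/old -> dir/new": parent + name for each side */
        printk(KERN_DEBUG "rename %pD2 -> %pD2\n", old, new);
}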

2013-09-01 08:50:55

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

> If "bool" had real advantages (like having a dense array
> representation, for example), that would be one thing. It doesn't.
> Sure, now you can take an address of a bool (which you couldn't
> generally do efficiently if it really was a bit array), but it also
> means that in practice, "bool" is normally nothing but "char" with
> some really odd and special implicit type casting rules.

Huh. For me, the big advantage is, to paraphrase Michael Palin, that
the number of possible values is two, no more, and no less. Two is
the number of possibilities, and the number of possibilities is two.
Three is Right Out.

I agree that as a *storage* type (in a data structure), it's of limited
usefulness. But as a *parameter* type, to pass to and return from
functions, it's wonderful.

There are lots of naming conventions (like the word "flag", or a function
name starting with "is") to indicate that the return value is a simple
true/false, but declaring it that way provides compile-time checking.

> I doubt most people really even understand how "bool" casting works.

There's a reason for that; you rarely need to cast to bool, and you
could forbid it outright with very little impact on bool's usefulness.
What fraction of C programmers remember off the top of their heads that
casting from float to integer rounds toward zero?

Heck, if you asked me which conversions were defined and which were
undefined, I'd have to look it up. (I'd guess it's the same as the
different overflow rules for signed and unsigned types, but I'd have
to check.)

The place bool is most useful is control flow flags and other things
where all you're doing is assigning literal values and testing it. And,
if you want to get fancy, assigning to bool variables from boolean-valued
expressions like "flag = value >= limit;"


That said, I agree that the fact that 2 != (bool)2 is a bit subtle, but
doesn't that basic problem apply to *any* narrowing cast? (And thus,
is not really an *additional* subtlety that a programmer needs to learn
about.) And in this case, I can't really think of a *better* choice
that could have been made.

I can only see three plausible alternatives:

1) Use the lsbit, (bool)x == x & 1
2) Use zero/nonzero, and make (bool)x == !!x
3) Forbid casting to bool entirely.

Option 1 matches all other narrowing casts in C, and would be how I'd
expect a "bit" type to work.

Option 2 is how C conditional statements already work, so for any value x,
"if (x)" is the same as "if ((bool)x)". It's a rule that C programmers
already need to know; they only need to know to *apply* it in this one
extra case.

In fact, it arguably makes things *simpler* to explain, since it at least
gives a name to the magic that C condition expressions are subject to.

Option 3 is attractive, but ends up breaking the analogy to conditional
expressions. I'd recommend it as a coding style, however.

> And bool is actually really *dangerous* to use if you don't understand
> it. There are people who use "bool", but then because they want to be
> portable, they have a compatibility #ifdef or other configuration
> thing that does something like
>
> typedef int bool;
> #define true 1
> #define false 0
>
> and it will actually work. Most of the time. And then the semantic
> differences from a _real_ C compiler that supports the C99 _Bool/bool
> type are really really subtle.

But *all* of the subtleties arise when casting other types to bool.
If you just avoid that one thing (which it's hopefully obvious won't be
faithfully emulated by a compatibility kludge anyway), it all goes away.

And this rule is itself just a special case of "be very careful with
narrowing casts", which is already common wisdom.

For that kludge, I think a better equivalent would be

typedef enum { false, true } bool;

which at least communicates the idea (again, this is an annoying subtlety
that C programmers *already* have to deal with, so no additional cognitive
effort) that "the compiler might not catch you assigning out-of-range
values, but if you do, demons might fly out of your nose."


Anyway, thanks for sharing your opinion. To me, it just looked like exactly
the sort of code where the bool type was a natural fit.

2013-09-01 10:01:18

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Fri, Aug 30, 2013 at 6:52 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Aug 30, 2013 at 9:37 AM, Sedat Dilek <[email protected]> wrote:
>>
>> Where is this a.out file from or how to generate it?
>
> Oh, that's just the silly threaded test-binary. I don't know what you
> called it.
>
> As to your config options, yesh, you have some expensive stuff.
> DEBUG_OBJECTS and DEBUG_MUTEXES in particular tend to cause lots of
> horrible performance issues. I didn't check if there might be other
> things..
>

I tried w/o DEBUG_OBJECTS and DEBUG_MUTEXES and disabled some
unnecessary debug-options, too (see attached diff).

This is what I get now...

[ TEST-CASE ]

$ ~/src/linux-kernel/linux/tools/perf/perf stat --null --repeat 5
./scripts/t_lockref_from-linus
Total loops: 26480075
Total loops: 27002388
Total loops: 25761463
Total loops: 26877615
Total loops: 27047644

Performance counter stats for './scripts/t_lockref_from-linus' (5 runs):

10,008617789 seconds time elapsed
( +- 0,07% )


Looks like this is now 10x faster: ~2.66Mloops (debug) VS.
~26.60Mloops (no-debug).

[ PERF-RECORD ]

$ sudo ~/src/linux-kernel/linux/tools/perf/perf record -e cycles:pp
./scripts/t_lockref_from-linus
Total loops: 26601346
[ perf record: Woken up 25 times to write data ]
[ perf record: Captured and wrote 6.100 MB perf.data (~266501 samples) ]

[ PERF-REPORT ]

$ sudo ~/src/linux-kernel/linux/tools/perf/perf report -f

Samples: 159K of event 'cycles:pp', Event count (approx.): 76968896763
12,79% t_lockref_from- [kernel.kallsyms] [k] irq_return
4,36% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
4,36% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
4,07% t_lockref_from- [kernel.kallsyms] [k] user_exit
3,12% t_lockref_from- [kernel.kallsyms] [k] local_clock
2,83% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
2,73% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_alloc
2,62% t_lockref_from- [kernel.kallsyms] [k] __d_lookup_rcu
2,53% t_lockref_from- libc-2.15.so [.] __xstat64
2,53% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_free
2,28% t_lockref_from- [kernel.kallsyms] [k] path_init
2,27% t_lockref_from- [kernel.kallsyms] [k] link_path_walk
1,86% t_lockref_from- [kernel.kallsyms] [k] user_enter
1,85% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_exit_common.isra.43
1,81% t_lockref_from- [kernel.kallsyms] [k] sched_clock_cpu
1,79% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_enter_common.isra.45
1,78% t_lockref_from- [kernel.kallsyms] [k] path_lookupat
1,67% t_lockref_from- [kernel.kallsyms] [k] native_read_tsc
1,63% t_lockref_from- [kernel.kallsyms] [k] cp_new_stat
1,61% t_lockref_from- [kernel.kallsyms] [k] lockref_put_or_lock
1,53% t_lockref_from- [kernel.kallsyms] [k] account_system_time
1,48% t_lockref_from- [kernel.kallsyms] [k] tracesys
1,47% t_lockref_from- [kernel.kallsyms] [k] copy_user_generic_unrolled
1,46% t_lockref_from- [kernel.kallsyms] [k] syscall_trace_enter
1,39% t_lockref_from- [kernel.kallsyms] [k] jiffies_to_timeval
1,33% t_lockref_from- [kernel.kallsyms] [k] native_sched_clock
1,27% t_lockref_from- [kernel.kallsyms] [k] getname_flags
1,27% t_lockref_from- [kernel.kallsyms] [k] lookup_fast
1,18% t_lockref_from- [kernel.kallsyms] [k] get_vtime_delta
1,05% t_lockref_from- [kernel.kallsyms] [k] syscall_trace_leave
1,03% t_lockref_from- [kernel.kallsyms] [k] generic_fillattr
1,02% t_lockref_from- [kernel.kallsyms] [k] strncpy_from_user
1,00% t_lockref_from- [kernel.kallsyms] [k] user_path_at_empty
0,97% t_lockref_from- [kernel.kallsyms] [k] account_user_time
0,95% t_lockref_from- [kernel.kallsyms] [k] vfs_fstatat
0,95% t_lockref_from- [kernel.kallsyms] [k] system_call_after_swapgs
0,92% t_lockref_from- [kernel.kallsyms] [k] generic_permission
0,91% t_lockref_from- [kernel.kallsyms] [k] filename_lookup
0,80% t_lockref_from- [kernel.kallsyms] [k] vfs_getattr
0,78% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_unlock
0,74% t_lockref_from- [kernel.kallsyms] [k] complete_walk
0,70% t_lockref_from- [kernel.kallsyms] [k] vtime_account_user
0,68% t_lockref_from- [kernel.kallsyms] [k] d_rcu_to_refcount
0,65% t_lockref_from- [kernel.kallsyms] [k] common_perm
0,62% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_enter
0,58% t_lockref_from- [kernel.kallsyms] [k] vtime_user_enter
0,57% t_lockref_from- [kernel.kallsyms] [k] __inode_permission
0,55% t_lockref_from- [kernel.kallsyms] [k] dput
0,52% t_lockref_from- [kernel.kallsyms] [k] apparmor_inode_getattr
0,52% t_lockref_from- [kernel.kallsyms] [k] SYSC_newstat
0,52% t_lockref_from- [kernel.kallsyms] [k] mntget
0,49% t_lockref_from- [kernel.kallsyms] [k] cpuacct_account_field
0,48% t_lockref_from- [kernel.kallsyms] [k] __vtime_account_system
0,46% t_lockref_from- t_lockref_from-linus [.] start_routine

Thanks for all the explanations and hints.

Regards,
- Sedat -

P.S.: Some words to "perf -f"...

$ sudo ~/src/linux-kernel/linux/tools/perf/perf record -f -e cycles:pp
./scripts/t_lockref_from-linus
[sudo] password for wearefam:
Error: unknown switch `f'

usage: perf record [<options>] [<command>]
or: perf record [<options>] -- <command> [<options>]

-e, --event <event> event selector. use 'perf list' to list
available events
--filter <filter>
event filter
-p, --pid <pid> record events on existing process id
-t, --tid <tid> record events on existing thread id
-r, --realtime <n> collect data with this RT SCHED_FIFO priority
-D, --no-delay collect data without buffering
-R, --raw-samples collect raw sample records from all opened counters
-a, --all-cpus system-wide collection from all CPUs
-C, --cpu <cpu> list of cpus to monitor
-c, --count <n> event period to sample
-o, --output <file> output file name
-i, --no-inherit child tasks do not inherit counters
-F, --freq <n> profile at this frequency
-m, --mmap-pages <n> number of mmap data pages
--group put the counters into a counter group
-g, --call-graph <mode[,dump_size]>
do call-graph (stack chain/backtrace)
recording: [fp] dwarf
-v, --verbose be more verbose (show counter open errors, etc)
-q, --quiet don't print any message
-s, --stat per thread counts
-d, --data Sample addresses
-T, --timestamp Sample timestamps
-P, --period Sample period
-n, --no-samples don't sample
-N, --no-buildid-cache
do not update the buildid cache
-B, --no-buildid do not collect buildids in perf.data
-G, --cgroup <name> monitor event in cgroup name only
-u, --uid <user> user to profile
-b, --branch-any sample any taken branches
-j, --branch-filter <branch filter mask>
branch stack filter modes
-W, --weight sample by weight (on special events only)

- EOT -


Attachments:
kernel-config.diff (2.77 kB)

2013-09-01 10:33:13

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 12:01 PM, Sedat Dilek <[email protected]> wrote:
> On Fri, Aug 30, 2013 at 6:52 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Fri, Aug 30, 2013 at 9:37 AM, Sedat Dilek <[email protected]> wrote:
>>>
>>> Where is this a.out file from or how to generate it?
>>
>> Oh, that's just the silly threaded test-binary. I don't know what you
>> called it.
>>
>> As to your config options, yesh, you have some expensive stuff.
>> DEBUG_OBJECTS and DEBUG_MUTEXES in particular tend to cause lots of
>> horrible performance issues. I didn't check if there might be other
>> things..
>>
>
> I tried w/o DEBUG_OBJECTS and DEBUG_MUTEXES and disabled some
> unnecessary debug-options, too (see attached diff).
>
> This is what I get now...
>
> [ TEST-CASE ]
>
> $ ~/src/linux-kernel/linux/tools/perf/perf stat --null --repeat 5
> ./scripts/t_lockref_from-linus
> Total loops: 26480075
> Total loops: 27002388
> Total loops: 25761463
> Total loops: 26877615
> Total loops: 27047644
>
> Performance counter stats for './scripts/t_lockref_from-linus' (5 runs):
>
> 10,008617789 seconds time elapsed
> ( +- 0,07% )
>
>
> Looks like this is now 10x faster: ~2.66Mloops (debug) VS.
> ~26.60Mloops (no-debug).
>
> [ PERF-RECORD ]
>
> $ sudo ~/src/linux-kernel/linux/tools/perf/perf record -e cycles:pp
> ./scripts/t_lockref_from-linus
> Total loops: 26601346
> [ perf record: Woken up 25 times to write data ]
> [ perf record: Captured and wrote 6.100 MB perf.data (~266501 samples) ]
>
> [ PERF-REPORT ]
>
> $ sudo ~/src/linux-kernel/linux/tools/perf/perf report -f
>
> Samples: 159K of event 'cycles:pp', Event count (approx.): 76968896763
> 12,79% t_lockref_from- [kernel.kallsyms] [k] irq_return
> 4,36% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
> 4,36% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
> 4,07% t_lockref_from- [kernel.kallsyms] [k] user_exit
> 3,12% t_lockref_from- [kernel.kallsyms] [k] local_clock
> 2,83% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
> 2,73% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_alloc
> 2,62% t_lockref_from- [kernel.kallsyms] [k] __d_lookup_rcu
> 2,53% t_lockref_from- libc-2.15.so [.] __xstat64
> 2,53% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_free
> 2,28% t_lockref_from- [kernel.kallsyms] [k] path_init
> 2,27% t_lockref_from- [kernel.kallsyms] [k] link_path_walk
> 1,86% t_lockref_from- [kernel.kallsyms] [k] user_enter
> 1,85% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_exit_common.isra.43
> 1,81% t_lockref_from- [kernel.kallsyms] [k] sched_clock_cpu
> 1,79% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_enter_common.isra.45
> 1,78% t_lockref_from- [kernel.kallsyms] [k] path_lookupat
> 1,67% t_lockref_from- [kernel.kallsyms] [k] native_read_tsc
> 1,63% t_lockref_from- [kernel.kallsyms] [k] cp_new_stat
> 1,61% t_lockref_from- [kernel.kallsyms] [k] lockref_put_or_lock
> 1,53% t_lockref_from- [kernel.kallsyms] [k] account_system_time
> 1,48% t_lockref_from- [kernel.kallsyms] [k] tracesys
> 1,47% t_lockref_from- [kernel.kallsyms] [k] copy_user_generic_unrolled
> 1,46% t_lockref_from- [kernel.kallsyms] [k] syscall_trace_enter
> 1,39% t_lockref_from- [kernel.kallsyms] [k] jiffies_to_timeval
> 1,33% t_lockref_from- [kernel.kallsyms] [k] native_sched_clock
> 1,27% t_lockref_from- [kernel.kallsyms] [k] getname_flags
> 1,27% t_lockref_from- [kernel.kallsyms] [k] lookup_fast
> 1,18% t_lockref_from- [kernel.kallsyms] [k] get_vtime_delta
> 1,05% t_lockref_from- [kernel.kallsyms] [k] syscall_trace_leave
> 1,03% t_lockref_from- [kernel.kallsyms] [k] generic_fillattr
> 1,02% t_lockref_from- [kernel.kallsyms] [k] strncpy_from_user
> 1,00% t_lockref_from- [kernel.kallsyms] [k] user_path_at_empty
> 0,97% t_lockref_from- [kernel.kallsyms] [k] account_user_time
> 0,95% t_lockref_from- [kernel.kallsyms] [k] vfs_fstatat
> 0,95% t_lockref_from- [kernel.kallsyms] [k] system_call_after_swapgs
> 0,92% t_lockref_from- [kernel.kallsyms] [k] generic_permission
> 0,91% t_lockref_from- [kernel.kallsyms] [k] filename_lookup
> 0,80% t_lockref_from- [kernel.kallsyms] [k] vfs_getattr
> 0,78% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_unlock
> 0,74% t_lockref_from- [kernel.kallsyms] [k] complete_walk
> 0,70% t_lockref_from- [kernel.kallsyms] [k] vtime_account_user
> 0,68% t_lockref_from- [kernel.kallsyms] [k] d_rcu_to_refcount
> 0,65% t_lockref_from- [kernel.kallsyms] [k] common_perm
> 0,62% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_enter
> 0,58% t_lockref_from- [kernel.kallsyms] [k] vtime_user_enter
> 0,57% t_lockref_from- [kernel.kallsyms] [k] __inode_permission
> 0,55% t_lockref_from- [kernel.kallsyms] [k] dput
> 0,52% t_lockref_from- [kernel.kallsyms] [k] apparmor_inode_getattr
> 0,52% t_lockref_from- [kernel.kallsyms] [k] SYSC_newstat
> 0,52% t_lockref_from- [kernel.kallsyms] [k] mntget
> 0,49% t_lockref_from- [kernel.kallsyms] [k] cpuacct_account_field
> 0,48% t_lockref_from- [kernel.kallsyms] [k] __vtime_account_system
> 0,46% t_lockref_from- t_lockref_from-linus [.] start_routine
>
> Thanks for all the explanations and hints.
>
> Regards,
> - Sedat -
>
> P.S.: Some words to "perf -f"...
>
> $ sudo ~/src/linux-kernel/linux/tools/perf/perf record -f -e cycles:pp
> ./scripts/t_lockref_from-linus
> [sudo] password for wearefam:
> Error: unknown switch `f'
>
> usage: perf record [<options>] [<command>]
> or: perf record [<options>] -- <command> [<options>]
>
> -e, --event <event> event selector. use 'perf list' to list
> available events
> --filter <filter>
> event filter
> -p, --pid <pid> record events on existing process id
> -t, --tid <tid> record events on existing thread id
> -r, --realtime <n> collect data with this RT SCHED_FIFO priority
> -D, --no-delay collect data without buffering
> -R, --raw-samples collect raw sample records from all opened counters
> -a, --all-cpus system-wide collection from all CPUs
> -C, --cpu <cpu> list of cpus to monitor
> -c, --count <n> event period to sample
> -o, --output <file> output file name
> -i, --no-inherit child tasks do not inherit counters
> -F, --freq <n> profile at this frequency
> -m, --mmap-pages <n> number of mmap data pages
> --group put the counters into a counter group
> -g, --call-graph <mode[,dump_size]>
> do call-graph (stack chain/backtrace)
> recording: [fp] dwarf
> -v, --verbose be more verbose (show counter open errors, etc)
> -q, --quiet don't print any message
> -s, --stat per thread counts
> -d, --data Sample addresses
> -T, --timestamp Sample timestamps
> -P, --period Sample period
> -n, --no-samples don't sample
> -N, --no-buildid-cache
> do not update the buildid cache
> -B, --no-buildid do not collect buildids in perf.data
> -G, --cgroup <name> monitor event in cgroup name only
> -u, --uid <user> user to profile
> -b, --branch-any sample any taken branches
> -j, --branch-filter <branch filter mask>
> branch stack filter modes
> -W, --weight sample by weight (on special events only)
>
> - EOT -

Attached are the results without the patch from Linus.

- Sedat -


Attachments:
RESULTS_SPINLOCK-LOCKREF-TESTING_WITHOUT-PATCH_3.11.0-rc7-3-iniza-small.txt (5.18 kB)

2013-09-01 11:10:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

Speaking of bool (and I'm not a fan of it either), is this warning
just noise (which is bad enough since it masks real warnings), or is
this going to cause serious problems?

CHECK /usr/projects/linux/ext4/kernel/trace/trace.c
/usr/projects/linux/ext4/kernel/trace/trace.c:559:6: warning: symbol 'free_snapshot' was not declared. Should it be static?
/usr/projects/linux/ext4/kernel/trace/trace.c:1489:14: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1489:14: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1489:14: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1489:14: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1492:9: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1492:9: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1492:9: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1492:9: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1539:9: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1539:9: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1539:9: warning: expression using sizeof bool
/usr/projects/linux/ext4/kernel/trace/trace.c:1539:9: warning: expression using sizeof bool

(I.e., is a C compiler allowed to pack multiple bools stored in a data
structure into a single byte, with potentially hilarious results for C
code trying to do magic utilizing sizeof and offsetof, which some of
our kernel macros do use, IIRC?)

- Ted

P.S. BTW, this is not the only set of sparse warnings that are
emitted by files in kernel/trace/*; maybe it would be good if the
ftrace maintainers worked on nuking them all? There are also a bunch
of warnings like the following, which I've only recently learned
indicate potential RCU bugs:

/usr/projects/linux/ext4/kernel/trace/trace.c:1885:17: warning: incorrect type in assignment (different address spaces)
/usr/projects/linux/ext4/kernel/trace/trace.c:1885:17: expected struct trace_buffer_struct *buffers
/usr/projects/linux/ext4/kernel/trace/trace.c:1885:17: got struct trace_buffer_struct [noderef] <asn:3>*<noident>

2013-09-01 15:32:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 3:01 AM, Sedat Dilek <[email protected]> wrote:
>
> Looks like this is now 10x faster: ~2.66Mloops (debug) VS.
> ~26.60Mloops (no-debug).

Ok, that's getting to be in the right ballpark.

But your profile is still odd.

> Samples: 159K of event 'cycles:pp', Event count (approx.): 76968896763
> 12,79% t_lockref_from- [kernel.kallsyms] [k] irq_return
> 4,36% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock

If you do the profile with "-g", what are the top callers of this? You
shouldn't see any spinlock load from the path lookup, but you have all
these other things going on..

> 4,36% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
> 4,07% t_lockref_from- [kernel.kallsyms] [k] user_exit
> 3,12% t_lockref_from- [kernel.kallsyms] [k] local_clock
> 2,83% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
> 2,73% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_alloc
> 2,62% t_lockref_from- [kernel.kallsyms] [k] __d_lookup_rcu

You're spending more time on the task stats than on the actual lookup.
Maybe you should turn off CONFIG_TASKSTATS..But why that whole
irq_return thing? Odd.

Linus

2013-09-01 15:45:09

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 5:32 PM, Linus Torvalds
<[email protected]> wrote:
> On Sun, Sep 1, 2013 at 3:01 AM, Sedat Dilek <[email protected]> wrote:
>>
>> Looks like this is now 10x faster: ~2.66Mloops (debug) VS.
>> ~26.60Mloops (no-debug).
>
> Ok, that's getting to be in the right ballpark.
>
> But your profile is still odd.
>
>> Samples: 159K of event 'cycles:pp', Event count (approx.): 76968896763
>> 12,79% t_lockref_from- [kernel.kallsyms] [k] irq_return
>> 4,36% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
>
> If you do the profile with "-g", what are the top callers of this? You
> shouldn't see any spinlock load from the path lookup, but you have all
> these other things going on..
>

$ sudo ~/src/linux-kernel/linux/tools/perf/perf record -g -e cycles:pp
./scripts/t_lockref_from-linus
Total loops: 26205085
[ perf record: Woken up 77 times to write data ]
[ perf record: Captured and wrote 19.778 MB perf.data (~864092 samples) ]


$ sudo ~/src/linux-kernel/linux/tools/perf/perf report <--- last time I
used the -f option here; dropped it this time.

Samples: 160K of event 'cycles:pp', Event count (approx.): 77003901089
+ 12,46% t_lockref_from- [kernel.kallsyms] [k] irq_return
+ 4,86% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
+ 4,42% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
+ 4,28% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
+ 3,97% t_lockref_from- [kernel.kallsyms] [k] user_exit
+ 3,04% t_lockref_from- [kernel.kallsyms] [k] local_clock
+ 2,71% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_alloc
+ 2,50% t_lockref_from- [kernel.kallsyms] [k] link_path_walk
+ 2,46% t_lockref_from- libc-2.15.so [.] __xstat64
+ 2,38% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_free
+ 1,96% t_lockref_from- [kernel.kallsyms] [k] path_lookupat
+ 1,88% t_lockref_from- [kernel.kallsyms] [k] __d_lookup_rcu
+ 1,87% t_lockref_from- [kernel.kallsyms] [k] tracesys
+ 1,84% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_exit_common.isra.43
+ 1,81% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_enter_common.isra.45
+ 1,80% t_lockref_from- [kernel.kallsyms] [k] user_enter
+ 1,79% t_lockref_from- [kernel.kallsyms] [k] sched_clock_cpu
+ 1,61% t_lockref_from- [kernel.kallsyms] [k] native_read_tsc
+ 1,56% t_lockref_from- [kernel.kallsyms] [k] cp_new_stat
+ 1,52% t_lockref_from- [kernel.kallsyms] [k] lockref_put_or_lock
+ 1,51% t_lockref_from- [kernel.kallsyms] [k] account_system_time
+ 1,46% t_lockref_from- [kernel.kallsyms] [k] path_init
+ 1,46% t_lockref_from- [kernel.kallsyms] [k] copy_user_generic_unrolled
+ 1,42% t_lockref_from- [kernel.kallsyms] [k] syscall_trace_enter
+ 1,38% t_lockref_from- [kernel.kallsyms] [k] jiffies_to_timeval
+ 1,32% t_lockref_from- [kernel.kallsyms] [k] lookup_fast
+ 1,31% t_lockref_from- [kernel.kallsyms] [k] native_sched_clock
+ 1,24% t_lockref_from- [kernel.kallsyms] [k] getname_flags
+ 1,17% t_lockref_from- [kernel.kallsyms] [k] vfs_getattr
+ 1,15% t_lockref_from- [kernel.kallsyms] [k] get_vtime_delta
+ 1,03% t_lockref_from- [kernel.kallsyms] [k] syscall_trace_leave
+ 0,95% t_lockref_from- [kernel.kallsyms] [k] generic_fillattr
+ 0,94% t_lockref_from- [kernel.kallsyms] [k] user_path_at_empty
+ 0,93% t_lockref_from- [kernel.kallsyms] [k] system_call_after_swapgs
+ 0,93% t_lockref_from- [kernel.kallsyms] [k] account_user_time
+ 0,89% t_lockref_from- [kernel.kallsyms] [k] strncpy_from_user
+ 0,86% t_lockref_from- [kernel.kallsyms] [k] complete_walk
+ 0,80% t_lockref_from- [kernel.kallsyms] [k] filename_lookup
+ 0,80% t_lockref_from- [kernel.kallsyms] [k] vfs_fstatat
+ 0,78% t_lockref_from- [kernel.kallsyms] [k] generic_permission
+ 0,77% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_unlock
+ 0,73% t_lockref_from- [kernel.kallsyms] [k] __inode_permission
+ 0,69% t_lockref_from- [kernel.kallsyms] [k] vtime_account_user
+ 0,66% t_lockref_from- [kernel.kallsyms] [k] d_rcu_to_refcount
+ 0,61% t_lockref_from- [kernel.kallsyms] [k] common_perm
+ 0,60% t_lockref_from- [kernel.kallsyms] [k] rcu_eqs_enter
+ 0,59% t_lockref_from- [kernel.kallsyms] [k] dput
+ 0,54% t_lockref_from- [kernel.kallsyms] [k] vtime_user_enter
+ 0,51% t_lockref_from- [kernel.kallsyms] [k] cpuacct_account_field
+ 0,50% t_lockref_from- [kernel.kallsyms] [k] mntput
+ 0,48% t_lockref_from- [kernel.kallsyms] [k] lg_local_lock
+ 0,48% t_lockref_from- [kernel.kallsyms] [k] apparmor_inode_getattr
+ 0,45% t_lockref_from- t_lockref_from-linus [.] start_routine
+ 0,45% t_lockref_from- [kernel.kallsyms] [k] __vtime_account_system
Press '?' for help on key bindings

- Sedat -

>> 4,36% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
>> 4,07% t_lockref_from- [kernel.kallsyms] [k] user_exit
>> 3,12% t_lockref_from- [kernel.kallsyms] [k] local_clock
>> 2,83% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
>> 2,73% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_alloc
>> 2,62% t_lockref_from- [kernel.kallsyms] [k] __d_lookup_rcu
>
> You're spending more time on the task stats than on the actual lookup.
> Maybe you should turn off CONFIG_TASKSTATS..But why that whole
> irq_return thing? Odd.
>

Yes, I have CONFIG_TASKSTATS=y.
I can try a -4 build w/o it.

- Sedat -

2013-09-01 15:49:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 4:10 AM, Theodore Ts'o <[email protected]> wrote:
> Speaking of bool (and I'm not a fan of it either), is this warning
> just noise (which is bad enough since it masks real warnings), or is
> this going to cause serious problems?
>
> CHECK /usr/projects/linux/ext4/kernel/trace/trace.c
> /usr/projects/linux/ext4/kernel/trace/trace.c:559:6: warning: symbol 'free_snapshot' was not declared. Should it be static?
> /usr/projects/linux/ext4/kernel/trace/trace.c:1489:14: warning: expression using sizeof bool

It's just because sparse is being a bit odd. Internally, sparse thinks
that "bool" has a size of one bit. So it then the sizeof code has a
special case to make it one byte, and when that special case was
added, people added the warning too.

I suspect sparse should just make the size of "bool" be 8 bits
internally, and we should drop the warning. As it is, sparse will
actually do odd and wrong things due to that "bool is one bit" notion
if you put "bool" types in structures or unions, I think.

Nobody used to care, because we used to not use that broken type in the kernel.

Linus

2013-09-01 15:55:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 8:45 AM, Sedat Dilek <[email protected]> wrote:
>
> Samples: 160K of event 'cycles:pp', Event count (approx.): 77003901089
> + 12,46% t_lockref_from- [kernel.kallsyms] [k] irq_return
> + 4,86% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
> + 4,42% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
> + 4,28% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals

You need to go into __ticket_spin_lock to see who the callers are.

Just go down to it and press enter to expand it (and then you need to
go and expand that entry too to get the callers)

I still don't know how you get to irq_return. It should use sysret. Odd.

Linus

2013-09-01 17:49:13

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 01, 2013 at 01:13:06AM +0100, Al Viro wrote:
> > Actually, right now I'm debugging a variant that avoids local buffers; use
> > is %pD3 for grandparent/parent/name, etc., up to %pD4. %pd is equivalent
> > to %pD1 (just the dentry name). Keep in mind that things like NFS use
> > a _lot_ of what would be %pD2 in debugging printks and the string can grow
> > fairly long, so I'd rather live with widen() than mess with local buffers
> > here. I'll send an updated variant when I'm more or less satisfied with
> > it...
>
> Seems to be working... This doesn't include the metric arseload of
> conversions in fs/*/* - just the sprintf part.

FWIW, now that I've looked at more users (and we do have a shitload
of those), it seems that we need the following set:
- dentry name
- dentry path 2--4 levels deep (most of the users want 2 right now,
  but that's mostly a matter of arguments being too painful to type for
  deeper ones)
- same for file - sure, we can just pass file->f_path.dentry, but
  there's a lot of such guys and I'd like to reduce the amount of
  places where we have ->f_path.dentry mentioned in source.

Suggestions regarding formats to use would be welcome. For now I'm using
pd/pd<n>/pD/pD<n>, the latter two being for struct file, but I'd gladly
take anything prettier than that.

2013-09-01 18:11:44

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, 1 Sep 2013 08:49:48 -0700
Linus Torvalds <[email protected]> wrote:


> Nobody used to care, because we used to not use that broken type in the kernel.

I've been told that gcc works better with 'bool' than with an int.
Should I replace those bools with bit fields in the structure?

-- Steve

2013-09-01 20:03:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 11:11 AM, Steven Rostedt <[email protected]> wrote:
>
> I've been told that gcc works better with 'bool' than with an int.
> Should I replace those bools with bit fields in the structure?

I think bitfields are a better idea in a struct, yes. They take less
space, and there's a possibility to generate better code with a
bitfield test than with a bool, especially if you have multiple values
and you test them together.

If it's a single bool in a structure, it really doesn't matter. It's
going to take up at least a char (and usually more due to padding of
other fields) regardless of bool-vs-bitfield issues. But I'd much
rather see bitfields in structs because they *can* work better, and
they are more flexible than "bool" anyway.

Bools generally should generate the same code as "char", and yes, that
can be better than "int". On x86, for example, the "setcc" instruction
always sets a char, so if you use an "int" for boolean values, the
compiler usually generates something like "xor %eax,%eax; test ..;
setne %al" or similar. A bool or a char will skip the "xor", because
the "setne" will set the low bits that are sufficient. That said, it's
not actually noticeable in practice, and most routines that return
true/false will just do plain "return 0" or whatever, so there's no
use for "setcc" to begin with.

Linus

2013-09-01 20:59:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 8:32 AM, Linus Torvalds
<[email protected]> wrote:
> On Sun, Sep 1, 2013 at 3:01 AM, Sedat Dilek <[email protected]> wrote:
>>
>> Looks like this is now 10x faster: ~2.66Mloops (debug) VS.
>> ~26.60Mloops (no-debug).
>
> Ok, that's getting to be in the right ballpark.

So I installed my new i7-4770S yesterday - somewhat lower frequency
than my previous CPU, but it has four cores plus HT, and boy does that
show the scalability problems better.

My test-program used to get maybe 15% time in spinlock. On the 4770S,
with current -git (so no lockref) I get this:

[torvalds@i5 test-lookup]$ for i in 1 2 3 4 5; do ./a.out ; done
Total loops: 26656873
Total loops: 26701572
Total loops: 26698526
Total loops: 26752993
Total loops: 26710556

with a profile that looks roughly like:

84.14% a.out _raw_spin_lock
3.04% a.out lg_local_lock
2.16% a.out vfs_getattr
1.16% a.out dput.part.15
0.67% a.out copy_user_enhanced_fast_string
0.55% a.out complete_walk

[ Side note: Al, that lg_local_lock really is annoying: it's
br_read_lock(mntput_no_expire), with two thirds of the calls coming
from mntput_no_expire, and the rest from path_init -> lock_rcu_walk.

I really really wonder if we could get rid of the
br_read_lock(&vfsmount_lock) for rcu_walk_init(), and use just the RCU
read accesses for the mount-namespaces too. What is that lock really
protecting against during lookup anyway? ]

With the last lockref patch I sent out, it looks like this:

[torvalds@i5 test-lookup]$ for i in 1 2 3 4 5; do ./a.out ; done
Total loops: 54740529
Total loops: 54568346
Total loops: 54715686
Total loops: 54715854
Total loops: 54790592

28.55% a.out lockref_get_or_lock
20.65% a.out lockref_put_or_lock
9.06% a.out dput
6.37% a.out lg_local_lock
5.45% a.out lookup_fast
3.77% a.out d_rcu_to_refcount
2.03% a.out vfs_getattr
1.75% a.out copy_user_enhanced_fast_string
1.16% a.out link_path_walk
1.15% a.out avc_has_perm_noaudit
1.14% a.out __lookup_mnt

so performance more than doubled (on that admittedly stupid
benchmark), and you can see that the cacheline bouncing for that
reference count is still a big deal, but at least it gets some real
work done now because we're not spinning waiting for it.

So you can see the bad case with even just a single socket when the
benchmark is just targeted enough. But two cores just wasn't enough to
show any performance advantage.

Linus

2013-09-01 21:24:16

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 01, 2013 at 01:59:22PM -0700, Linus Torvalds wrote:

> [ Side note: Al, that lg_local_lock really is annoying: it's
> br_read_lock(mntput_no_expire), with two thirds of the calls coming
> from mntput_no_expire, and the rest from path_init -> lock_rcu_walk.

How much of that is due to br_write_lock() taken in mntput_no_expire()
for no good reason? IOW, could you try the shmem.c patch I sent yesterday
and see how much effect it has? [1] Basically, we get it grabbed
exclusive on each final fput() of a struct file created by shmem_file_setup(),
which is _not_ a rare event. And the only reason for that is not having
shm_mnt marked long-living, even though its refcount never hits 0...

> I really really wonder if we could get rid of the
> br_read_lock(&vfsmount_lock) for rcu_walk_init(), and use just the RCU
> read accesses for the mount-namespaces too. What is that lock really
> protecting against during lookup anyway? ]

A lot of things, I'm afraid. It's not as simple as just the access
to vfsmount hash... ;-/ I'll need to do some digging to put together
a full analysis, but there had been quite a few subtle issues where
it played...

[1] sits in the local queue, will push tonight:

commit e7db6c4c1d01032f53262f03b5f38899f9db8add
Author: Al Viro <[email protected]>
Date: Sat Aug 31 12:57:10 2013 -0400

shm_mnt is as longterm as it gets, TYVM...

Signed-off-by: Al Viro <[email protected]>

diff --git a/mm/shmem.c b/mm/shmem.c
index e43dc55..5261498 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2615,13 +2615,15 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
* tmpfs instance, limiting inodes to one per page of lowmem;
* but the internal instance is left unlimited.
*/
- if (!(sb->s_flags & MS_NOUSER)) {
+ if (!(sb->s_flags & MS_KERNMOUNT)) {
sbinfo->max_blocks = shmem_default_max_blocks();
sbinfo->max_inodes = shmem_default_max_inodes();
if (shmem_parse_options(data, sbinfo, false)) {
err = -EINVAL;
goto failed;
}
+ } else {
+ sb->s_flags |= MS_NOUSER;
}
sb->s_export_op = &shmem_export_ops;
sb->s_flags |= MS_NOSEC;
@@ -2831,8 +2833,7 @@ int __init shmem_init(void)
goto out2;
}

- shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER,
- shmem_fs_type.name, NULL);
+ shm_mnt = kern_mount(&shmem_fs_type);
if (IS_ERR(shm_mnt)) {
error = PTR_ERR(shm_mnt);
printk(KERN_ERR "Could not kern_mount tmpfs\n");

2013-09-01 22:16:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 2:23 PM, Al Viro <[email protected]> wrote:
>
> How much of that is due to br_write_lock() taken in mntput_no_expire()
> for no good reason? IOW, could you try shmem.c patch I've sent yesterday
> and see how much effect does it have?[1] Basically, we get it grabbed
> exclusive on each final fput() of a struct file created by shmem_file_setup(),
> which is _not_ a rare event. And the only reason for that is not having
> shm_mnt marked long-living, even though its refcount never hits 0...

Does not seem to matter. Still 66% mntput_no_expire, 31% path_init.
And that lg_local_lock() takes 5-6% of CPU, pretty much all of which
is that single xadd instruction that implements the spinlock.

This is on /tmp, which is tmpfs. But I don't see how any of that could
matter. "mntput()" does an unconditional call to mntput_no_expire(),
and mntput_no_expire() does that br_read_lock() unconditionally too.

Note that I'm talking about that "cheap" *read* lock being expensive.
It's the local one, not the global one. So it's not what Waiman saw
with the global lock. This is a local per-cpu thing.

That read-lock is supposed to be very cheap - it's just a per-cpu
spinlock. But it ends up being very expensive for some reason. I'm not
quite sure why - I don't see any lg_global_lock() calls at all, so...

I wonder if there is some false sharing going on. But I don't see that
either, this is the percpu offset map afaik:

000000000000f560 d files_lglock_lock
000000000000f564 d nr_dentry
000000000000f568 d last_ino
000000000000f56c d nr_unused
000000000000f570 d nr_inodes
000000000000f574 d vfsmount_lock_lock
000000000000f580 d bh_accounting

and I don't see anything there that would get cross-cpu accesses, so
there shouldn't be any cacheline bouncing. That's the whole point of
percpu variables, after all.

Odd. What am I missing?

Linus
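
[Editor's sketch of the false-sharing scenario being checked for above, as
a self-contained userspace fragment; the 64-byte cacheline size is an
assumption. Two per-thread counters packed into one cacheline bounce that
line between CPUs even though each thread only writes its own counter;
aligning each one to its own cacheline removes the bouncing, which is the
property the per-cpu layout is supposed to give the per-cpu locks.]

#include <stdio.h>

struct shared {                 /* both counters share one cacheline */
        unsigned long a;
        unsigned long b;
};

struct padded {                 /* each counter gets its own cacheline */
        unsigned long a __attribute__((aligned(64)));
        unsigned long b __attribute__((aligned(64)));
};

int main(void)
{
        /* typically prints "16 128" with gcc on x86-64 */
        printf("%zu %zu\n", sizeof(struct shared), sizeof(struct padded));
        return 0;
}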

2013-09-01 22:35:37

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 01, 2013 at 03:16:24PM -0700, Linus Torvalds wrote:

> Does not seem to matter. Still 66% mntput_no_expire, 31% path_init.
> And that lg_local_lock() takes 5-6% of CPU, pretty much all of which
> is that single xadd instruction that implements the spinlock.
>
> This is on /tmp, which is tmpfs. But I don't see how any of that could
> matter. "mntput()" does an unconditional call to mntput_no_expire(),
> and mntput_no_expire() does that br_read_lock() unconditionally too.
>
> Note that I'm talking about that "cheap" *read* lock being expensive.
> It's the local one, not the global one. So it's not what Waiman saw
> with the global lock. This is a local per-cpu thing.
>
> That read-lock is supposed to be very cheap - it's just a per-cpu
> spinlock. But it ends up being very expensive for some reason. I'm not
> quite sure why - I don't see any lg_global_lock() calls at all, so...
>
> I wonder if there is some false sharing going on. But I don't see that
> either, this is the percpu offset map afaik:
>
> 000000000000f560 d files_lglock_lock
> 000000000000f564 d nr_dentry
> 000000000000f568 d last_ino
> 000000000000f56c d nr_unused
> 000000000000f570 d nr_inodes
> 000000000000f574 d vfsmount_lock_lock
> 000000000000f580 d bh_accounting
>
> and I don't see anything there that would get cross-cpu accesses, so
> there shouldn't be any cacheline bouncing. That's the whole point of
> percpu variables, after all.

Hell knows... Are you sure you don't see br_write_lock() at all? I don't
see anything else that would cause cross-cpu traffic with that layout...

2013-09-01 22:45:11

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 01, 2013 at 11:35:21PM +0100, Al Viro wrote:
> > I wonder if there is some false sharing going on. But I don't see that
> > either, this is the percpu offset map afaik:
> >
> > 000000000000f560 d files_lglock_lock
> > 000000000000f564 d nr_dentry
> > 000000000000f568 d last_ino
> > 000000000000f56c d nr_unused
> > 000000000000f570 d nr_inodes
> > 000000000000f574 d vfsmount_lock_lock
> > 000000000000f580 d bh_accounting
> >
> > and I don't see anything there that would get cross-cpu accesses, so
> > there shouldn't be any cacheline bouncing. That's the whole point of
> > percpu variables, after all.
>
> Hell knows... Are you sure you don't see br_write_lock() at all? I don't
> see anything else that would cause cross-cpu traffic with that layout...

GRRR... I see something else:
void file_sb_list_del(struct file *file)
{
if (!list_empty(&file->f_u.fu_list)) {
lg_local_lock_cpu(&files_lglock, file_list_cpu(file));
list_del_init(&file->f_u.fu_list);
lg_local_unlock_cpu(&files_lglock, file_list_cpu(file));
}
}
will cheerfully cause cross-CPU traffic. If that's what is going on, the
earlier patch I've sent (not putting non-regulars and files opened r/o
on ->s_list) should reduce the cacheline bouncing on that cacheline.

2013-09-01 22:48:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 3:16 PM, Linus Torvalds
<[email protected]> wrote:
>
> I wonder if there is some false sharing going on. But I don't see that
> either, this is the percpu offset map afaik:
>
> 000000000000f560 d files_lglock_lock
> 000000000000f564 d nr_dentry
> 000000000000f568 d last_ino
> 000000000000f56c d nr_unused
> 000000000000f570 d nr_inodes
> 000000000000f574 d vfsmount_lock_lock
> 000000000000f580 d bh_accounting

I made DEFINE_LGLOCK use DEFINE_PER_CPU_SHARED_ALIGNED for the
spinlock, so that each local lock gets its own cacheline, and the
total loops jumped to 62M (from 52-54M before). So when I looked at
the numbers, I thought "oh, that helped".

But then I looked closer, and realized that I just see a fair amount
of boot-to-boot variation anyway (probably a lot to do with cache
placement and how dentries got allocated etc). And it didn't actually
help at all, the problem is still there, and lg_local_lock is still
really really high on the profile, at 8% cpu time:

- 8.00% lg_local_lock
- lg_local_lock
+ 64.83% mntput_no_expire
+ 33.81% path_init
+ 0.78% mntput
+ 0.58% path_lookupat

which just looks insane. And no, no lg_global_lock visible anywhere..

So it's not false sharing. But something is bouncing *that* particular
lock around.

Linus

---
34.60% lockref_get_or_lock
23.35% lockref_put_or_lock
10.57% dput
8.00% lg_local_lock
1.79% copy_user_enhanced_fast_string
1.15% link_path_walk
1.04% path_lookupat
1.03% sysret_check
1.01% kmem_cache_alloc
1.00% selinux_inode_permission
0.97% __d_lookup_rcu
0.95% kmem_cache_free
0.90% 0x00007f03e0800ee3
0.88% avc_has_perm_noaudit
0.79% cp_new_stat
0.76% avc_has_perm_flags
0.69% path_init
0.68% getname_flags
0.66% system_call
0.58% generic_permission
0.55% lookup_fast
0.54% user_path_at_empty
0.51% vfs_fstatat
0.49% vfs_getattr
0.49% filename_lookup
0.49% strncpy_from_user
0.44% generic_fillattr
0.40% inode_has_perm.isra.32.constprop.61
0.38% ext4_getattr
0.34% complete_walk
0.34% lg_local_unlock
0.27% d_rcu_to_refcount
0.25% __inode_permission
0.23% _copy_to_user
0.23% security_inode_getattr
0.22% mntget
0.22% selinux_inode_getattr
0.21% SYSC_newstat
0.21% mntput_no_expire
0.20% putname
0.17% path_put
0.16% security_inode_permission
0.16% start_routine
0.14% mntput
0.14% final_putname
0.14% _cond_resched
0.12% inode_permission
0.10% user_path_at
0.09% __xstat64
0.07% sys_newstat
0.03% __xstat@plt
0.03% update_cfs_rq_blocked_load
0.02% task_tick_fair
0.01% common_interrupt
0.01% ktime_get
0.01% lapic_next_deadline
0.01% run_timer_softirq
0.01% hsw_unclaimed_reg_check.isra.6
0.01% sched_clock_cpu
0.01% rcu_check_callbacks
0.01% update_cfs_shares
0.01% _raw_spin_lock
0.01% irqtime_account_irq
0.01% __do_softirq
0.01% ret_from_sys_call
0.01% i915_read32
0.01% hrtimer_interrupt
0.01% update_curr
0.01% profile_tick
0.00% intel_pmu_disable_all
0.00% intel_pmu_enable_all
0.00% tg_load_down
0.00% native_sched_clock
0.00% native_apic_msr_eoi_write
0.00% irqtime_account_process_tick.isra.2
0.00% perf_event_task_tick
0.00% clockevents_program_event
0.00% __acct_update_integrals
0.00% rcu_irq_exit

2013-09-01 22:58:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 3:44 PM, Al Viro <[email protected]> wrote:
>
> GRRR... I see something else:
> void file_sb_list_del(struct file *file)
> {
> 	if (!list_empty(&file->f_u.fu_list)) {
> 		lg_local_lock_cpu(&files_lglock, file_list_cpu(file));
> 		list_del_init(&file->f_u.fu_list);
> 		lg_local_unlock_cpu(&files_lglock, file_list_cpu(file));
> 	}
> }
> will cheerfully cause cross-CPU traffic. If that's what is going on, the
> earlier patch I've sent (not putting non-regulars and files opened r/o
> on ->s_list) should reduce the cacheline bouncing on that cacheline.

Hmm. That might indeed be a bad sources of cross-cpu bouncing on some
loads, but the load I test doesn't actually open any files. It just
does "stat()" on a filename.

So no "struct file *" anywhere for me.. It really seems to be
vfsmount_lock itself somehow.

Linus

2013-09-01 23:30:20

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 01, 2013 at 03:48:01PM -0700, Linus Torvalds wrote:
> I made DEFINE_LGLOCK use DEFINE_PER_CPU_SHARED_ALIGNED for the
> spinlock, so that each local lock gets its own cacheline, and the
> total loops jumped to 62M (from 52-54M before). So when I looked at
> the numbers, I thought "oh, that helped".
>
> But then I looked closer, and realized that I just see a fair amount
> of boot-to-boot variation anyway (probably a lot to do with cache
> placement and how dentries got allocated etc). And it didn't actually
> help at all, the problem is still there, and lg_local_lock is still
> really really high on the profile, at 8% cpu time:
>
> - 8.00% lg_local_lock
> - lg_local_lock
> + 64.83% mntput_no_expire
> + 33.81% path_init
> + 0.78% mntput
> + 0.58% path_lookupat
>
> which just looks insane. And no, no lg_global_lock visible anywhere..
>
> So it's not false sharing. But something is bouncing *that* particular
> lock around.

Hrm... It excludes sharing between the locks, all right. AFAICS, that
won't exclude sharing with plain per-cpu vars, will it? Could you
tell what vfsmount_lock is sharing with on that build? The stuff between
it and files_lock doesn't have any cross-CPU writers, but with that
change it's the stuff after it that becomes interesting...

2013-09-02 00:13:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 4:30 PM, Al Viro <[email protected]> wrote:
>
> Hrm... It excludes sharing between the locks, all right. AFAICS, that
> won't exclude sharing with plain per-cpu vars, will it?

Yes it will. DEFINE_PER_CPU_SHARED_ALIGNED not only aligns the data,
it also puts it in a separate section with only other aligned data
entries. So now the percpu address map around it looks like this:

...
0000000000013a80 d call_single_queue
0000000000013ac0 d cfd_data
0000000000013b00 d files_lglock_lock
0000000000013b40 d vfsmount_lock_lock
0000000000013b80 d file_lock_lglock_lock
0000000000013bc0 D softnet_data
0000000000013d40 D __per_cpu_end
..

So there shouldn't be anything to share falsely with.
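
(For illustration, a minimal user-space analogue of what that aligned
placement buys: each slot padded onto its own cache line so neighbouring
slots cannot false-share. The 64-byte line size and all names below are
assumptions for the sketch, not kernel code.)

#include <stdalign.h>
#include <stdint.h>

/* One slot per CPU; alignas(64) pads each slot out to a full cache
 * line, so an update of slot N never dirties the line holding N+1. */
struct percpu_lock_slot {
        alignas(64) uint16_t ticket_lock;
};

static struct percpu_lock_slot slots[64];       /* 64 stands in for NR_CPUS */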

I'd like to say that the profile is bad, but this is *so* consistent,
and the profile data really looks perfectly fine in every other way.
I'm using "-e cycles:pp", so it's using hardware profiling and all the
other functions really look correct.

It *is* one of the few locked accesses remaining, and it's clearly
getting called a lot (three calls per system call: two mntput's - one
for the root path, one for the result path, and one from path_init ->
rcu_walk_init), but with up to 8% CPU time for basically that one
"lock xadd" instruction is damn odd. I can't see how that could happen
without seriously nasty cacheline bouncing, but I can't see how *that*
can happen when all the accesses seem to be from the current CPU.

This is a new Haswell-based machine that I put together yesterday, and
I haven't used it for profiling before. So maybe it _is_ something odd
with the profiling after all, and atomic serializing instructions get
incorrect profile counts.

Linus

2013-09-02 00:50:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 5:12 PM, Linus Torvalds
<[email protected]> wrote:
>
> It *is* one of the few locked accesses remaining, and it's clearly
> getting called a lot (three calls per system call: two mntput's - one
> for the root path, one for the result path, and one from path_init ->
> rcu_walk_init), but with up to 8% CPU time for basically that one
> "lock xadd" instruction is damn odd. I can't see how that could happen
> without seriously nasty cacheline bouncing, but I can't see how *that*
> can happen when all the accesses seem to be from the current CPU.

So, I wanted to double-check that "it can only be that expensive if
there's cacheline bouncing" statement. Thinking "maybe it's just
really expensive. Even when running just a single thread".

So I set MAX_THREADS to 1 in my stupid benchmark, just to see what happens..

And almost everything changes as expected: now we don't have any
cacheline bouncing any more, so lockref_put_or_lock() and
lockref_get_or_lock() no longer dominate - instead of being 20%+ each,
they are now just 3%.

What _didn't_ change? Right. lg_local_lock() is still 6.40%. Even when
single-threaded. It's now the #1 function in my profile:

6.40% lg_local_lock
5.42% copy_user_enhanced_fast_string
5.14% sysret_check
4.79% link_path_walk
4.41% 0x00007ff861834ee3
4.33% avc_has_perm_flags
4.19% __lookup_mnt
3.83% lookup_fast

(that "copy_user_enhanced_fast_string" is when we copy the "struct
stat" from kernel space to user space)

The instruction-level profile just looking like

│ ffffffff81078e70 <lg_local_lock>:
2.06 │ push %rbp
1.06 │ mov %rsp,%rbp
0.11 │ mov (%rdi),%rdx
2.13 │ add %gs:0xcd48,%rdx
0.92 │ mov $0x100,%eax
85.87 │ lock xadd %ax,(%rdx)
0.04 │ movzbl %ah,%ecx
│ cmp %al,%cl
3.60 │ ↓ je 31
│ nop
│28: pause
│ movzbl (%rdx),%eax
│ cmp %cl,%al
│ ↑ jne 28
│31: pop %rbp
4.22 │ ← retq

so that instruction sequence is just expensive, and it is expensive
without any cacheline bouncing. The expense seems to be 100% simply
due to the fact that it's an atomic serializing instruction, and it
just gets called way too much.

So lockref_[get|put]_or_lock() are each called once per pathname
lookup (because the RCU accesses to the dentries get turned into a
refcount, and then that refcount gets dropped). But lg_local_lock()
gets called twice: once for path_init(), and once for mntput() - I
think I was wrong about mntput getting called twice.
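
(Side note for readers following along: a rough user-space sketch of the
idea behind lockref_get_or_lock() -- the spinlock and the count share one
64-bit word, and the count is bumped with a cmpxchg as long as the lock
half looks free. The layout and names below are illustrative only, not
the kernel's actual code.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct lockref_sketch {
        _Atomic uint64_t lock_count;    /* low 32 bits: lock, high 32 bits: count */
};

static bool lockref_get_if_unlocked(struct lockref_sketch *lr)
{
        uint64_t old = atomic_load(&lr->lock_count);

        for (int tries = 0; tries < 16; tries++) {
                if ((uint32_t)old != 0)         /* lock half held: caller falls back */
                        return false;
                uint64_t new = old + ((uint64_t)1 << 32);       /* bump the count half */
                if (atomic_compare_exchange_weak(&lr->lock_count, &old, new))
                        return true;            /* got the reference without the lock */
        }
        return false;                           /* too much churn: take the lock instead */
}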

So it doesn't seem to be cacheline bouncing at all. It's just
"serializing instructions are really expensive" together with calling
that function too much. And we've optimized pathname lookup so much
that even a single locked instruction shows up like a sort thumb.

I guess we should be proud.

Linus

2013-09-02 07:05:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount


* Linus Torvalds <[email protected]> wrote:

> On Sun, Sep 1, 2013 at 5:12 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > It *is* one of the few locked accesses remaining, and it's clearly
> > getting called a lot (three calls per system call: two mntput's - one
> > for the root path, one for the result path, and one from path_init ->
> > rcu_walk_init), but with up to 8% CPU time for basically that one
> > "lock xadd" instruction is damn odd. I can't see how that could happen
> > without seriously nasty cacheline bouncing, but I can't see how *that*
> > can happen when all the accesses seem to be from the current CPU.
>
> So, I wanted to double-check that "it can only be that expensive if
> there's cacheline bouncing" statement. Thinking "maybe it's just
> really expensive. Even when running just a single thread".
>
> So I set MAX_THREADS to 1 in my stupid benchmark, just to see what happens..
>
> And almost everything changes as expected: now we don't have any
> cacheline bouncing any more, so lockref_put_or_lock() and
> lockref_get_or_lock() no longer dominate - instead of being 20%+ each,
> they are now just 3%.
>
> What _didn't_ change? Right. lg_local_lock() is still 6.40%. Even when
> single-threaded. It's now the #1 function in my profile:
>
> 6.40% lg_local_lock
> 5.42% copy_user_enhanced_fast_string
> 5.14% sysret_check
> 4.79% link_path_walk
> 4.41% 0x00007ff861834ee3
> 4.33% avc_has_perm_flags
> 4.19% __lookup_mnt
> 3.83% lookup_fast
>
> (that "copy_user_enhanced_fast_string" is when we copy the "struct
> stat" from kernel space to user space)
>
> The instruction-level profile just looking like
>
> │ ffffffff81078e70 <lg_local_lock>:
> 2.06 │ push %rbp
> 1.06 │ mov %rsp,%rbp
> 0.11 │ mov (%rdi),%rdx
> 2.13 │ add %gs:0xcd48,%rdx
> 0.92 │ mov $0x100,%eax
> 85.87 │ lock xadd %ax,(%rdx)
> 0.04 │ movzbl %ah,%ecx
> │ cmp %al,%cl
> 3.60 │ ↓ je 31
> │ nop
> │28: pause
> │ movzbl (%rdx),%eax
> │ cmp %cl,%al
> │ ↑ jne 28
> │31: pop %rbp
> 4.22 │ ← retq

The Haswell perf code isn't very widely tested yet as it took quite some
time to get it ready for upstream and thus got merged late, but on its
face this looks like a pretty good profile.

With one detail:

> so that instruction sequence is just expensive, and it is expensive
> without any cacheline bouncing. The expense seems to be 100% simply due
> to the fact that it's an atomic serializing instruction, and it just
> gets called way too much.
>
> So lockref_[get|put]_or_lock() are each called once per pathname lookup
> (because the RCU accesses to the dentries get turned into a refcount,
> and then that refcount gets dropped). But lg_local_lock() gets called
> twice: once for path_init(), and once for mntput() - I think I was wrong
> about mntput getting called twice.
>
> So it doesn't seem to be cacheline bouncing at all. It's just
> "serializing instructions are really expensive" together with calling
> that function too much. And we've optimized pathname lookup so much that
> even a single locked instruction shows up like a sore thumb.
>
> I guess we should be proud.

It still looks anomalous to me, on fresh Intel hardware. One suggestion:
could you, just for pure testing purposes, turn HT off and do a quick
profile that way?

The XADD, even if it's all in the fast path, could be a pretty natural
point to 'yield' an SMT context on a given core, giving it artificially
high overhead.

Note that to test HT off an intrusive reboot is probably not needed, if
the HT siblings are right after each other in the CPU enumeration sequence
then you can turn HT "off" effectively by running the workload only on 4
cores:

taskset 0x55 ./my-test

and reducing the # of your workload threads to 4 or so.

Thanks,

Ingo

2013-09-02 10:30:57

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 5:55 PM, Linus Torvalds
<[email protected]> wrote:
> On Sun, Sep 1, 2013 at 8:45 AM, Sedat Dilek <[email protected]> wrote:
>>
>> Samples: 160K of event 'cycles:pp', Event count (approx.): 77003901089
>> + 12,46% t_lockref_from- [kernel.kallsyms] [k] irq_return
>> + 4,86% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
>> + 4,42% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
>> + 4,28% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
>
> You need to go into __ticket_spin_lock to see who the callers are.
>
> Just go down to it and press enter to expand it (and then you need to
> go and expand that entry too to get the callers)
>

I am new to perf usage.

4,60% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock

Which entry to select?

Annotate __ticket_spin_lock
Zoom into t_lockref_from-(3962) thread
Zoom into the Kernel DSO
Browse map details
Run scripts for samples of thread [t_lockref_from-]
Run scripts for samples of symbol [__ticket_spin_lock]
Run scripts for all samples
Switch to another data file in PWD
Exit

> I still don't know how you get to irq_return. It should use sysret. Odd.

- Sedat -

2013-09-02 16:09:45

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 9/2/13 4:30 AM, Sedat Dilek wrote:
> On Sun, Sep 1, 2013 at 5:55 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Sun, Sep 1, 2013 at 8:45 AM, Sedat Dilek <[email protected]> wrote:
>>>
>>> Samples: 160K of event 'cycles:pp', Event count (approx.): 77003901089
>>> + 12,46% t_lockref_from- [kernel.kallsyms] [k] irq_return
>>> + 4,86% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
>>> + 4,42% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
>>> + 4,28% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
>>
>> You need to go into __ticket_spin_lock to see who the callers are.
>>
>> Just go down to it and press enter to expand it (and then you need to
>> go and expand that entry too to get the callers)
>>
>
> I am new to perf usage.

One option for you is 'perf report --stdio ..'. That bypasses the tui.

David

2013-09-02 16:44:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Mon, Sep 2, 2013 at 12:05 AM, Ingo Molnar <[email protected]> wrote:
>
> The Haswell perf code isn't very widely tested yet as it took quite some
> time to get it ready for upstream and thus got merged late, but on its
> face this looks like a pretty good profile.

Yes. And everything else looks fine too. Profiles without locked
instructions all look very reasonable, and have the expected patterns.

> It still looks anomalous to me, on fresh Intel hardware. One suggestion:
> could you, just for pure testing purposes, turn HT off and do a quick
> profile that way?
>
> The XADD, even if it's all in the fast path, could be a pretty natural
> point to 'yield' an SMT context on a given core, giving it artificially
> high overhead.
>
> Note that to test HT off an intrusive reboot is probably not needed, if
> the HT siblings are right after each other in the CPU enumeration sequence
> then you can turn HT "off" effectively by running the workload only on 4
> cores:
>
> taskset 0x55 ./my-test
>
> and reducing the # of your workload threads to 4 or so.

Remember: I see the exact same profile for single-thread behavior.
Other things change (iow, lockref_get_or_lock() is either ~3% or ~30%
- the latter case is for when there are bouncing cachelines), but
lg_local_lock() stays pretty constant.

So it's not a HT artifact or anything like that.

I've timed "lock xadd" separately, and it's not a slow instruction. I
also tried (in user space, using thread-local storage) to see if it's
the combination of creating the address through a segment load and
that somehow causing a micro-exception or something (the P4 used to
have things like that), and that doesn't seem to account for it
either.
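
(For reference, a rough user-space timing loop of that flavour -- just a
plain global, without the thread-local-storage addressing mentioned
above; purely illustrative, not the actual test that was run.)

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint16_t ticket;

int main(void)
{
        const long loops = 100 * 1000 * 1000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < loops; i++) {
                uint16_t inc = 0x100;   /* same increment the ticket lock uses */
                asm volatile("lock xaddw %0, %1"
                             : "+r" (inc), "+m" (ticket) : : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns per lock xadd\n", ns / loops);
        return 0;
}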

It is entirely possible that it is just a "cycles:pp" oddity - because
the "lock xadd" is serializing, it can't retire until everything
around it has been sorted out, and maybe it just shows up in profiles
more than is really "fair" to the instruction itself, because it ends
up being that stable point for potentially hundreds of instructions
around it.

Linus

2013-09-02 19:26:20

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 08/30/2013 10:42 PM, Al Viro wrote:
> On Sat, Aug 31, 2013 at 03:35:16AM +0100, Al Viro wrote:
>
>> Aha... OK, I see what's going on. We end up with shm_mnt *not* marked
>> as long-living vfsmount, even though it lives forever. See if the
>> following helps; if it does (and I very much expect it to), we want to
>> put it in -stable. As it is, you get slow path in mntput() each time
>> a file created by shmem_file_setup() gets closed. For no reason whatsoever...
> We still want MS_NOUSER on shm_mnt, so we'd better make sure that
> shmem_fill_super() sets it on the internal instance... Fixed variant
> follows:
>
> Signed-off-by: Al Viro <[email protected]>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e43dc55..5261498 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2615,13 +2615,15 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
>  	 * tmpfs instance, limiting inodes to one per page of lowmem;
>  	 * but the internal instance is left unlimited.
>  	 */
> -	if (!(sb->s_flags & MS_NOUSER)) {
> +	if (!(sb->s_flags & MS_KERNMOUNT)) {
>  		sbinfo->max_blocks = shmem_default_max_blocks();
>  		sbinfo->max_inodes = shmem_default_max_inodes();
>  		if (shmem_parse_options(data, sbinfo, false)) {
>  			err = -EINVAL;
>  			goto failed;
>  		}
> +	} else {
> +		sb->s_flags |= MS_NOUSER;
>  	}
>  	sb->s_export_op = &shmem_export_ops;
>  	sb->s_flags |= MS_NOSEC;
> @@ -2831,8 +2833,7 @@ int __init shmem_init(void)
>  		goto out2;
>  	}
> 
> -	shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER,
> -				 shmem_fs_type.name, NULL);
> +	shm_mnt = kern_mount(&shmem_fs_type);
>  	if (IS_ERR(shm_mnt)) {
>  		error = PTR_ERR(shm_mnt);
>  		printk(KERN_ERR "Could not kern_mount tmpfs\n");

Yes, that patch worked. It eliminated the lglock as a bottleneck in the
AIM7 workload. The lg_global_lock did not show up in the perf profile,
whereas the lg_local_lock was only 0.07%.

Regards,
Longman

2013-09-03 06:01:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount


* Waiman Long <[email protected]> wrote:

> On 08/30/2013 10:42 PM, Al Viro wrote:
> >On Sat, Aug 31, 2013 at 03:35:16AM +0100, Al Viro wrote:
> >
> >>Aha... OK, I see what's going on. We end up with shm_mnt *not* marked
> >>as long-living vfsmount, even though it lives forever. See if the
> >>following helps; if it does (and I very much expect it to), we want to
> >>put it in -stable. As it is, you get slow path in mntput() each time
> >>a file created by shmem_file_setup() gets closed. For no reason whatsoever...
> >We still want MS_NOUSER on shm_mnt, so we'd better make sure that
> >shmem_fill_super() sets it on the internal instance... Fixed variant
> >follows:
> >
> >Signed-off-by: Al Viro<[email protected]>
> >diff --git a/mm/shmem.c b/mm/shmem.c
> >index e43dc55..5261498 100644
> >--- a/mm/shmem.c
> >+++ b/mm/shmem.c
> >@@ -2615,13 +2615,15 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
> > * tmpfs instance, limiting inodes to one per page of lowmem;
> > * but the internal instance is left unlimited.
> > */
> >- if (!(sb->s_flags& MS_NOUSER)) {
> >+ if (!(sb->s_flags& MS_KERNMOUNT)) {
> > sbinfo->max_blocks = shmem_default_max_blocks();
> > sbinfo->max_inodes = shmem_default_max_inodes();
> > if (shmem_parse_options(data, sbinfo, false)) {
> > err = -EINVAL;
> > goto failed;
> > }
> >+ } else {
> >+ sb->s_flags |= MS_NOUSER;
> > }
> > sb->s_export_op =&shmem_export_ops;
> > sb->s_flags |= MS_NOSEC;
> >@@ -2831,8 +2833,7 @@ int __init shmem_init(void)
> > goto out2;
> > }
> >
> >- shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER,
> >- shmem_fs_type.name, NULL);
> >+ shm_mnt = kern_mount(&shmem_fs_type);
> > if (IS_ERR(shm_mnt)) {
> > error = PTR_ERR(shm_mnt);
> > printk(KERN_ERR "Could not kern_mount tmpfs\n");
>
> Yes, that patch worked. It eliminated the lglock as a bottleneck in
> the AIM7 workload. The lg_global_lock did not show up in the perf
> profile, whereas the lg_local_lock was only 0.07%.

Just curious: what's the worst bottleneck now in the optimized kernel? :-)

Thanks,

Ingo

2013-09-03 07:24:09

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 8:01 AM, Ingo Molnar <[email protected]> wrote:
>
> * Waiman Long <[email protected]> wrote:
>
>> On 08/30/2013 10:42 PM, Al Viro wrote:
>> >On Sat, Aug 31, 2013 at 03:35:16AM +0100, Al Viro wrote:
>> >
>> >>Aha... OK, I see what's going on. We end up with shm_mnt *not* marked
>> >>as long-living vfsmount, even though it lives forever. See if the
>> >>following helps; if it does (and I very much expect it to), we want to
>> >>put it in -stable. As it is, you get slow path in mntput() each time
>> >>a file created by shmem_file_setup() gets closed. For no reason whatsoever...
>> >We still want MS_NOUSER on shm_mnt, so we'd better make sure that
>> >shmem_fill_super() sets it on the internal instance... Fixed variant
>> >follows:
>> >
>> >Signed-off-by: Al Viro<[email protected]>
>> >diff --git a/mm/shmem.c b/mm/shmem.c
>> >index e43dc55..5261498 100644
>> >--- a/mm/shmem.c
>> >+++ b/mm/shmem.c
>> >@@ -2615,13 +2615,15 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
>> > * tmpfs instance, limiting inodes to one per page of lowmem;
>> > * but the internal instance is left unlimited.
>> > */
>> >- if (!(sb->s_flags& MS_NOUSER)) {
>> >+ if (!(sb->s_flags& MS_KERNMOUNT)) {
>> > sbinfo->max_blocks = shmem_default_max_blocks();
>> > sbinfo->max_inodes = shmem_default_max_inodes();
>> > if (shmem_parse_options(data, sbinfo, false)) {
>> > err = -EINVAL;
>> > goto failed;
>> > }
>> >+ } else {
>> >+ sb->s_flags |= MS_NOUSER;
>> > }
>> > sb->s_export_op =&shmem_export_ops;
>> > sb->s_flags |= MS_NOSEC;
>> >@@ -2831,8 +2833,7 @@ int __init shmem_init(void)
>> > goto out2;
>> > }
>> >
>> >- shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER,
>> >- shmem_fs_type.name, NULL);
>> >+ shm_mnt = kern_mount(&shmem_fs_type);
>> > if (IS_ERR(shm_mnt)) {
>> > error = PTR_ERR(shm_mnt);
>> > printk(KERN_ERR "Could not kern_mount tmpfs\n");
>>
>> Yes, that patch worked. It eliminated the lglock as a bottleneck in
>> the AIM7 workload. The lg_global_lock did not show up in the perf
>> profile, whereas the lg_local_lock was only 0.07%.
>
> Just curious: what's the worst bottleneck now in the optimized kernel? :-)
>
> Thanks,
>
> Ingo

Can someone summarize this thread (70+ postings)?
Which patches are needed? And fixing what?
( Can people provide separate patches with a proper changelog? )
Improvements?


Thanks.

- Sedat (still struggling with perf) -

2013-09-03 10:15:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount


* Linus Torvalds <[email protected]> wrote:

> On Mon, Sep 2, 2013 at 12:05 AM, Ingo Molnar <[email protected]> wrote:
> >
> > The Haswell perf code isn't very widely tested yet as it took quite some
> > time to get it ready for upstream and thus got merged late, but on its
> > face this looks like a pretty good profile.
>
> Yes. And everything else looks fine too. Profiles without locked
> instructions all look very reasonable, and have the expected patterns.
>
> > It still looks anomalous to me, on fresh Intel hardware. One suggestion:
> > could you, just for pure testing purposes, turn HT off and do a quick
> > profile that way?
> >
> > The XADD, even if it's all in the fast path, could be a pretty natural
> > point to 'yield' an SMT context on a given core, giving it artificially
> > high overhead.
> >
> > Note that to test HT off an intrusive reboot is probably not needed, if
> > the HT siblings are right after each other in the CPU enumeration sequence
> > then you can turn HT "off" effectively by running the workload only on 4
> > cores:
> >
> > taskset 0x55 ./my-test
> >
> > and reducing the # of your workload threads to 4 or so.
>
> Remember: I see the exact same profile for single-thread behavior.

Oh, indeed.

> Other things change (iow, lockref_get_or_lock() is either ~3% or ~30% -
> the latter case is for when there are bouncing cachelines), but
> lg_local_lock() stays pretty constant.
>
> So it's not a HT artifact or anything like that.
>
> I've timed "lock xadd" separately, and it's not a slow instruction. I
> also tried (in user space, using thread-local storage) to see if it's
> the combination of creating the address through a segment load and that
> somehow causing a micro-exception or something (the P4 used to have
> things like that), and that doesn't seem to account for it either.
>
> It is entirely possible that it is just a "cycles:pp" oddity - because
> the "lock xadd" is serializing, it can't retire until everything around
> it has been sorted out, and maybe it just shows up in profiles more than
> is really "fair" to the instruction itself, because it ends up being
> that stable point for potentially hundreds of instructions around it.

One more thing to try would be a regular '-e cycles' non-PEBS run and see
whether there's still largish overhead visible around that instruction.

That reintroduces skid, but it eliminates any PEBS and LBR funnies, as our
cycles:pp event is a really tricky/complex beast internally.

Thanks,

Ingo

2013-09-03 14:08:12

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

Hi!

> It is entirely possible that it is just a "cycles:pp" oddity - because
> the "lock xadd" is serializing, it can't retire until everything
> around it has been sorted out, and maybe it just shows up in profiles
> more than is really "fair" to the instruction itself, because it ends
> up being that stable point for potentially hundreds of instructions
> around it.

Hmm, turn "lock xadd" into plain add, and see if it improves loops per
second? Preferably on scratch filesystem, because you may corrupt it...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2013-09-03 15:15:13

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 09/03/2013 02:01 AM, Ingo Molnar wrote:
> * Waiman Long<[email protected]> wrote:
>
>> Yes, that patch worked. It eliminated the lglock as a bottleneck in
>> the AIM7 workload. The lg_global_lock did not show up in the perf
>> profile, whereas the lg_local_lock was only 0.07%.
> Just curious: what's the worst bottleneck now in the optimized kernel? :-)
>
> Thanks,
>
> Ingo
With the following patches on v3.11:
1. Linus's version of lockref patch
2. Al's lglock patch
3. My preliminary patch to convert prepend_path under RCU

The perf profile of the kernel portion of the short workload in a
80-core system became like this:

29.87% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--50.00%-- tty_ldisc_deref
|--49.01%-- tty_ldisc_try
--0.99%-- [...]

7.55% swapper [kernel.kallsyms] [k] intel_idle
1.03% reaim [kernel.kallsyms] [k] copy_user_generic_string
0.91% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--15.88%-- __rcu_process_callbacks
|--6.55%-- load_balance
|--6.02%-- sem_lock
|--4.77%-- enqueue_to_backlog
|--4.21%-- task_rq_lock
|--3.97%-- process_backlog
|--3.35%-- unix_dgram_sendmsg
|--3.28%-- kmem_cache_free
|--3.16%-- tcp_v4_rcv
|--2.77%-- unix_stream_sendmsg
|--2.36%-- rcu_accelerate_cbs
|--2.02%-- do_wp_page
|--2.02%-- unix_create1
|--1.83%-- unix_peer_get
|--1.67%-- udp_lib_get_port
|--1.66%-- unix_stream_recvmsg
|--1.63%-- handle_pte_fault
|--1.63%-- udp_queue_rcv_skb
|--1.54%-- unix_release_sock
|--1.48%-- try_to_wake_up
|--1.37%-- do_anonymous_page
|--1.37%-- new_inode_pseudo
|--1.33%-- __d_lookup
|--1.20%-- free_one_page
|--1.11%-- __do_fault
|--1.06%-- scheduler_tick
|--0.90%-- __drain_alien_cache
|--0.81%-- inet_csk_get_port
|--0.76%-- sock_alloc
|--0.76%-- shmem_lock
|--0.75%-- __d_instantiate
|--0.70%-- __inet_hash_connect
|--0.69%-- __inet_hash_nolisten
|--0.68%-- ip_local_deliver_finish
|--0.64%-- inet_hash
|--0.64%-- kfree
|--0.60%-- d_path
|--0.58%-- __close_fd
|--0.51%-- evict
--11.76%-- [...]

0.51% reaim [ip_tables] [k] ipt_do_table
0.46% reaim [kernel.kallsyms] [k] __alloc_skb
0.38% reaim [kernel.kallsyms] [k] kfree
0.36% reaim [kernel.kallsyms] [k] kmem_cache_free
0.34% reaim [kernel.kallsyms] [k] system_call_after_swapg
0.32% reaim [kernel.kallsyms] [k] fsnotify
0.32% reaim [kernel.kallsyms] [k] ip_finish_output
0.27% reaim [kernel.kallsyms] [k] system_call

Other than the global tty_ldisc_lock, there is no other major
bottleneck. I am not that worried about the tty_ldisc_lock bottleneck
as real world applications probably won't have that many calls to
set the tty driver.

Regards,
Longman

2013-09-03 15:34:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 8:14 AM, Waiman Long <[email protected]> wrote:
>
> Other than the global tty_ldisc_lock, there is no other major
> bottleneck. I am not that worried about the tty_ldisc_lock bottleneck
> as real world applications probably won't have that many calls to
> set the tty driver.

I suspect the tty_ldisc_lock() could be made to go away if we care.
Making the ldisc be rcu-free'd, coupled with just optimistically
updating the count using "atomic_inc_not_zero()" and re-checking the
ldisc pointer afterwards should probably do it. But I agree that it
probably isn't worth it for any actual real load.
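
(A rough standalone sketch of that optimistic pattern -- take the
reference only if the count is already non-zero, then re-check that the
ldisc pointer didn't change underneath us. C11 atomics stand in for RCU
and atomic_inc_not_zero() here, and every name below is made up.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct ldisc_sketch {
        _Atomic int users;
};

struct tty_sketch {
        struct ldisc_sketch *_Atomic ldisc;
};

static bool get_if_nonzero(_Atomic int *count)
{
        int old = atomic_load(count);

        while (old != 0) {
                if (atomic_compare_exchange_weak(count, &old, old + 1))
                        return true;    /* bumped a live count */
        }
        return false;                   /* already zero: object is going away */
}

static struct ldisc_sketch *ldisc_ref_sketch(struct tty_sketch *tty)
{
        struct ldisc_sketch *ld = atomic_load(&tty->ldisc);

        if (ld && get_if_nonzero(&ld->users)) {
                /* Re-check: was the ldisc swapped out while we took the ref? */
                if (atomic_load(&tty->ldisc) == ld)
                        return ld;
                atomic_fetch_sub(&ld->users, 1);        /* lost the race, drop it */
        }
        return NULL;
}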

Linus

2013-09-03 15:38:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 12:24 AM, Sedat Dilek <[email protected]> wrote:
>
> Can someone summarize this thread (70+ postings)?
> Which patches are needed? And fixing what?
> ( Can people provide separate patches with a proper changelog? )
> Improvements?

The core lockref part is now merged and available in my git tree (and
yes, as individual patches with proper logs).

Al's patch to avoid the lg_lock is still pending, I'm assuming I'll
get it through his VFS tree.

Linus

2013-09-03 15:41:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 3:15 AM, Ingo Molnar <[email protected]> wrote:
>
> One more thing to try would be a regular '-e cycles' non-PEBS run and see
> whether there's still largish overhead visible around that instruction.

I've done that, and it matches the PEBS runs, except obviously with
the instruction skew (so then depending on run it's 95% the
instruction after the xadd). So the PEBS profiles are entirely
consistent with other data.

Linus

2013-09-03 18:34:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 8:41 AM, Linus Torvalds
<[email protected]> wrote:
>
> I've done that, and it matches the PEBS runs, except obviously with
> the instruction skew (so then depending on run it's 95% the
> instruction after the xadd). So the PEBS profiles are entirely
> consistent with other data.

So one thing that strikes me about our lg-locks is that they are
designed to be cheap, but they force this insane 3-deep memory access
chain to lock them.

That may be a large part of why lg_local_lock shows up so clearly on
my profiles: the single "lock xadd" instruction ends up not just being
serializing, but it is what actually consumes the previous memory
reads.

The core of the lg_local_lock sequence ends up being this
four-instruction sequence:

mov (%rdi),%rdx
add %gs:0xcd48,%rdx
mov $0x100,%eax
lock xadd %ax,(%rdx)

and that's a nasty chain of dependent memory loads. First we load the
percpu address, then we add the percpu offset to that, and then we do
the xadd on the result.

It's kind of sad, because in *theory* we could get rid of that whole
thing entirely, and just do it as one single

mov $0x100,%eax
lock xadd %ax,%gs:vfsmount_lock

that only has one single memory access, not three dependent ones.

But the two extra memory accesses come from:

- the lglock data structure isn't a percpu data structure, it's this
stupid global data structure that has a percpu pointer in it. So that
first "mov (%rdi),%rdx" is purely to load what is effectively a
constant address (per lglock).

And that's not because it wants to be, but because we associate
global lockdep data with it. Ugh. If it wasn't for that, we could just
make them percpu.

- we don't have a percpu spinlock accessor, so we always need to turn
the percpu address into a global address by adding the percpu base
(and that's the "add %gs:...,%rdx" part).
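
(As a rough illustration of the difference, with an ordinary array
standing in for per-cpu storage and made-up names: the current shape
needs a dependent pointer load before the per-cpu offset can be applied,
while a truly per-cpu lock needs only the offset.)

#include <stdint.h>

#define NR_CPUS_SKETCH 64

/* Current shape: a global struct holding a pointer to the per-cpu
 * lock storage -- the pointer has to be loaded before we can even
 * start computing the per-cpu address. */
struct lglock_sketch {
        uint16_t *percpu_locks;
};

static uint16_t lock_storage[NR_CPUS_SKETCH];
static struct lglock_sketch files_lglock_sketch = { lock_storage };

static uint16_t *lock_addr_indirect(int cpu)
{
        return &files_lglock_sketch.percpu_locks[cpu];  /* load pointer, then index */
}

/* The shape being wished for: the lock itself is the per-cpu object,
 * so only the "add the per-cpu offset" step remains. */
static uint16_t direct_locks[NR_CPUS_SKETCH];

static uint16_t *lock_addr_direct(int cpu)
{
        return &direct_locks[cpu];                      /* index directly */
}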

Oh well. This whole "lg_local_lock" is really noticeable on my
test-case mainly because my test-case only stat's a pathname with a
single path component, so the whole lookup really is dominated by all
the "setup/teardown" code. Real loads tend to look up much longer
pathnames, so the setup/teardown isn't so dominant, and actually
looking up the dentries from the hash chain is where most of the time
goes. But it's annoying to have that one big spike in the profile and
not being able to do anything about it.

Linus

2013-09-03 19:09:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 8:34 AM, Linus Torvalds
<[email protected]> wrote:
>
> I suspect the tty_ldisc_lock() could be made to go away if we care.

Heh. I just pulled the tty patches from Greg, and the locking has
changed completely.

It may actually fix your AIM7 test-case, because while the global
spinlock remains (it got renamed to "tty_ldiscs_lock" - there's an
added "s"), the common operations now take the per-tty lock to get the
ldisc for that tty, rather than that global spinlock (which just
protects the actual ldisc array now).

That said, I don't know what AIM7 really ends up doing, but your
profile seems to have every access through tty_ldisc_[de]ref() that
now uses only the per-tty lock. Of course, how much that helps ends up
depending on whether AIM7 uses lots of tty's or just one shared one.

Anyway, it might be worth testing my current -git tree.

Linus

2013-09-03 19:19:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount


* Linus Torvalds <[email protected]> wrote:

> On Tue, Sep 3, 2013 at 8:41 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > I've done that, and it matches the PEBS runs, except obviously with
> > the instruction skew (so then depending on run it's 95% the
> > instruction after the xadd). So the PEBS profiles are entirely
> > consistent with other data.
>
> So one thing that strikes me about our lg-locks is that they are
> designed to be cheap, but they force this insane 3-deep memory access
> chain to lock them.
>
> That may be a large part of why lg_local_lock shows up so clearly on my
> profiles: the single "lock xadd" instruction ends up not just being
> serializing, but it is what actually consumes the previous memory reads.
>
> The core of the lg_local_lock sequence ends up being this
> four-instruction sequence:
>
> mov (%rdi),%rdx
> add %gs:0xcd48,%rdx
> mov $0x100,%eax
> lock xadd %ax,(%rdx)
>
> and that's a nasty chain of dependent memory loads. First we load the
> percpu address, then we add the percpu offset to that, and then we do
> the xadd on the result.
>
> It's kind of sad, because in *theory* we could get rid of that whole
> thing entirely, and just do it as one single
>
> mov $0x100,%eax
> lock xadd %ax,%gs:vfsmount_lock
>
> that only has one single memory access, not three dependent ones.
>
> But the two extra memory accesses come from:
>
> - the lglock data structure isn't a percpu data structure, it's this
> stupid global data structure that has a percpu pointer in it. So that
> first "mov (%rdi),%rdx" is purely to load what is effectively a constant
> address (per lglock).
>
> And that's not because it wants to be, but because we associate
> global lockdep data with it. Ugh. If it wasn't for that, we could just
> make them percpu.

I don't think that's fundamental - the per CPU lock was percpu before:

#define DEFINE_LGLOCK(name) \
- \
- DEFINE_SPINLOCK(name##_cpu_lock); \
- DEFINE_PER_CPU(arch_spinlock_t, name##_lock); \
- DEFINE_LGLOCK_LOCKDEP(name); \


but AFAICS got converted to a pointer via this commit:

commit eea62f831b8030b0eeea8314eed73b6132d1de26
Author: Andi Kleen <[email protected]>
Date: Tue May 8 13:32:24 2012 +0930

brlocks/lglocks: turn into functions

lglocks and brlocks are currently generated with some complicated
macros in lglock.h. But there's no reason to not just use common
utility functions and put all the data into a common data structure.

Since there are at least two users it makes sense to share this code
in a library. This is also easier maintainable than a macro forest.

This will also make it later possible to dynamically allocate lglocks
and also use them in modules (this would both still need some
additional, but now straightforward, code)

Which was a rather misguided premise IMHO.

Thanks,

Ingo

2013-09-03 21:01:23

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 09/03/2013 03:09 PM, Linus Torvalds wrote:
> On Tue, Sep 3, 2013 at 8:34 AM, Linus Torvalds
> <[email protected]> wrote:
>> I suspect the tty_ldisc_lock() could be made to go away if we care.
> Heh. I just pulled the tty patches from Greg, and the locking has
> changed completely.
>
> It may actually fix your AIM7 test-case, because while the global
> spinlock remains (it got renamed to "tty_ldiscs_lock" - there's an
> added "s"), the common operations now take the per-tty lock to get the
> ldisc for that tty, rather than that global spinlock (which just
> protects the actual ldisc array now).
>
> That said, I don't know what AIM7 really ends up doing, but your
> profile seems to have every access through tty_ldisc_[de]ref() that
> now uses only the per-tty lock. Of course, how much that helps ends up
> depending on whether AIM7 uses lots of tty's or just one shared one.
>
> Anyway, it might be worth testing my current -git tree.
>
> Linus

Thanks for the news. I will fetch your latest git tree and try it out.

-Longman

2013-09-03 21:05:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 12:19 PM, Ingo Molnar <[email protected]> wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
>>
>> - the lglock data structure isn't a percpu data structure, it's this
>> stupid global data structure that has a percpu pointer in it. So that
>> first "mov (%rdi),%rdx" is purely to load what is effectively a constant
>> address (per lglock).
>>
>> And that's not because it wants to be, but because we associate
>> global lockdep data with it. Ugh. If it wasn't for that, we could just
>> make them percpu.
>
> I don't think that's fundamental - the per CPU lock was percpu before:
[...]
> but AFAICS got converted to a pointer via this commit:
>
> commit eea62f831b8030b0eeea8314eed73b6132d1de26
> Author: Andi Kleen <[email protected]>
> Date: Tue May 8 13:32:24 2012 +0930
>
> brlocks/lglocks: turn into functions

So instead of reverting that entirely, how about making "struct
lglock" always entirely per-cpu, and replacing the percpu pointer
with the lock itself.

Then, we say "the lockdep map is always on CPU#0".

TOTALLY UNTESTED PATCH ATTACHED. It compiles in at least a couple of
configurations, and I checked that this removes _one_ of the
indirections (the other one is because we don't have a native per-cpu
spinlock helper function, so we need to do that percpu base addition),
but I haven't actually dared to try to boot it.

Comments?

I'll try booting it and seeing if it actually works (and if it makes
any difference), but it seems to be a reasonable approach. I think it
actually cleans things up a bit, but maybe that's just because I
touched the code now.

Linus


Attachments:
patch.diff (6.27 kB)

2013-09-03 21:13:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 2:05 PM, Linus Torvalds
<[email protected]> wrote:
>
> TOTALLY UNTESTED PATCH ATTACHED.

Actually, that was the previous (broken) version of that patch - I
hadn't regenerated it after fixing some stupid compile errors, and it
had the DECLARE parts wrong.

This is the one that actually compiles. Whether it *works* is still a
total mystery.

Linus


Attachments:
patch.diff (6.27 kB)

2013-09-03 21:34:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 2:13 PM, Linus Torvalds
<[email protected]> wrote:
>
> This is the one that actually compiles. Whether it *works* is still a
> total mystery.

It generates ok code, and it booted, so it seems to work at least for my config.

However, it seems to make no performance-difference what-so-ever, and
lg_local_lock is still using about 7% cpu per the profiles.

The code generation is slightly better, but the profile looks the same:

│ ffffffff81078e70 <lg_local_lock>:
0.62 │ push %rbp
0.28 │ mov %rsp,%rbp
0.22 │ add %gs:0xcd48,%rdi
0.27 │ mov $0x100,%eax
97.22 │ lock xadd %ax,(%rdi)
0.01 │ movzbl %ah,%edx
│ cmp %al,%dl
0.56 │ ↓ je 29
│ xchg %ax,%ax
0.00 │20: pause
0.00 │ movzbl (%rdi),%eax
│ cmp %dl,%al
│ ↑ jne 20
│29: pop %rbp
0.81 │ ← retq

but it still obviously doesn't do the "lock xadd %ax,%gs:(%rdi)"
(without the preceding 'add') that would be the optimal code.

I'll try to hack that up too, but it's looking like it really is just
the "lock xadd", not the memory dependency chain..

Linus

2013-09-03 21:39:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 2:34 PM, Linus Torvalds
<[email protected]> wrote:
>
> I'll try to hack that up too, but it's looking like it really is just
> the "lock xadd", not the memory dependency chain..

Yeah, no difference: Better code generation with my quick hack for a
percpu spinlock:

│ ffffffff81078e70 <lg_local_lock>:
0.59 │ push %rbp
0.25 │ mov %rsp,%rbp
0.07 │ mov $0x100,%eax
97.55 │ lock xadd %ax,%gs:(%rdi)
0.01 │ movzbl %ah,%edx
│ cmp %al,%dl
0.68 │ ↓ je 29
│ nop
│20: pause
│ mov %gs:(%rdi),%al
│ cmp %dl,%al
│ ↑ jne 20
│29: pop %rbp
0.84 │ ← retq

but the actual cost is pretty much the same:

6.81% lg_local_lock

so it doesn't seem to be some odd weakness of the microarchitecture.

Linus

2013-09-03 22:37:29

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 1, 2013 at 5:32 PM, Linus Torvalds
<[email protected]> wrote:
> On Sun, Sep 1, 2013 at 3:01 AM, Sedat Dilek <[email protected]> wrote:
>>
>> Looks like this is now 10x faster: ~2.66Mloops (debug) VS.
>> ~26.60Mloops (no-debug).
>
> Ok, that's getting to be in the right ballpark.
>
> But your profile is still odd.
>
>> Samples: 159K of event 'cycles:pp', Event count (approx.): 76968896763
>> 12,79% t_lockref_from- [kernel.kallsyms] [k] irq_return
>> 4,36% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
>
> If you do the profile with "-g", what are the top callers of this? You
> shouldn't see any spinlock load from the path lookup, but you have all
> these other things going on..
>
>> 4,36% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
>> 4,07% t_lockref_from- [kernel.kallsyms] [k] user_exit
>> 3,12% t_lockref_from- [kernel.kallsyms] [k] local_clock
>> 2,83% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
>> 2,73% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_alloc
>> 2,62% t_lockref_from- [kernel.kallsyms] [k] __d_lookup_rcu
>
> You're spending more time on the task stats than on the actual lookup.
> Maybe you should turn off CONFIG_TASKSTATS..But why that whole
> irq_return thing? Odd.
>

[ init/Kconfig ]
...
config TASKSTATS
	bool "Export task/process statistics through netlink"
	depends on NET      <--- Difficult to disable it?!
	default n
	help
	  Export selected statistics for tasks/processes through the
	  generic netlink interface. Unlike BSD process accounting, the
	  statistics are available during the lifetime of tasks/processes as
	  responses to commands. Like BSD accounting, they are sent to user
	  space on task exit.

	  Say N if unsure.
...

- Sedat -

2013-09-03 22:41:35

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Tue, Sep 3, 2013 at 5:14 PM, Waiman Long <[email protected]> wrote:
> On 09/03/2013 02:01 AM, Ingo Molnar wrote:
>>
>> * Waiman Long<[email protected]> wrote:
>>
>>> Yes, that patch worked. It eliminated the lglock as a bottleneck in the
>>> AIM7 workload. The lg_global_lock did not show up in the perf profile,
>>> whereas the lg_local_lock was only 0.07%.
>>
>> Just curious: what's the worst bottleneck now in the optimized kernel? :-)
>>
>> Thanks,
>>
>> Ingo
>
> With the following patches on v3.11:
> 1. Linus's version of lockref patch
> 2. Al's lglock patch
> 3. My preliminary patch to convert prepend_path under RCU
>

With no reference to where to get those patches, it's a bit hard to follow.

I will try some perf benchmarking with the attached patch against
Linux "WfW" edition.

- Sedat -


Attachments:
3.11.0-1-lockref-small.patch (15.47 kB)

2013-09-03 22:56:36

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Wed, Sep 04, 2013 at 12:37:25AM +0200, Sedat Dilek wrote:

> > You're spending more time on the task stats than on the actual lookup.
> > Maybe you should turn off CONFIG_TASKSTATS..But why that whole
> > irq_return thing? Odd.
> >
>
> [ init/Kconfig ]
> ...
> config TASKSTATS
> bool "Export task/process statistics through netlink"
> depends on NET <--- Difficult to disable it?!

More likely you're getting bitten by the fact that CONFIG_KVM has
a 'select TASKSTATS'

Dave

2013-09-03 23:05:41

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Wed, Sep 4, 2013 at 12:55 AM, Dave Jones <[email protected]> wrote:
> On Wed, Sep 04, 2013 at 12:37:25AM +0200, Sedat Dilek wrote:
>
> > > You're spending more time on the task stats than on the actual lookup.
> > > Maybe you should turn off CONFIG_TASKSTATS..But why that whole
> > > irq_return thing? Odd.
> > >
> >
> > [ init/Kconfig ]
> > ...
> > config TASKSTATS
> > bool "Export task/process statistics through netlink"
> > depends on NET <--- Difficult to disable it?!
>
> More likely you're getting bitten by the fact that CONFIG_KVM has
> a 'select TASKSTATS'
>

With CONFIG_KVM=n ...

$ egrep 'CONFIG_NET=|CONFIG_KVM=|CONFIG_TASKSTATS' .config
CONFIG_TASKSTATS=y
CONFIG_NET=y

- Sedat -

2013-09-03 23:11:45

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Wed, Sep 4, 2013 at 12:41 AM, Sedat Dilek <[email protected]> wrote:
> On Tue, Sep 3, 2013 at 5:14 PM, Waiman Long <[email protected]> wrote:
>> On 09/03/2013 02:01 AM, Ingo Molnar wrote:
>>>
>>> * Waiman Long<[email protected]> wrote:
>>>
>>>> Yes, that patch worked. It eliminated the lglock as a bottleneck in the
>>>> AIM7 workload. The lg_global_lock did not show up in the perf profile,
>>>> whereas the lg_local_lock was only 0.07%.
>>>
>>> Just curious: what's the worst bottleneck now in the optimized kernel? :-)
>>>
>>> Thanks,
>>>
>>> Ingo
>>
>> With the following patches on v3.11:
>> 1. Linus's version of lockref patch
>> 2. Al's lglock patch
>> 3. My preliminary patch to convert prepend_path under RCU
>>
>
> With no reference where to get those patches, it's a bit hard to follow.
>
> I will try some perf benchmarking with the attached patch against
> Linux "WfW" edition.
>

Eat thiz...

$ cat /proc/version
Linux version 3.11.0-1-lockref-small ([email protected]@fambox)
(gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #1 SMP Wed Sep 4
00:53:25 CEST 2013

$ ~/src/linux-kernel/linux/tools/perf/perf stat --null --repeat 5
../scripts/t_lockref_from-linus
Total loops: 26786226
Total loops: 26970142
Total loops: 26593312
Total loops: 26885806
Total loops: 26944076

Performance counter stats for '../scripts/t_lockref_from-linus' (5 runs):

10,011755076 seconds time elapsed
( +- 0,10% )

$ sudo ~/src/linux-kernel/linux/tools/perf/perf record -e cycles:pp
../scripts/t_lockref_from-linus
Total loops: 26267751
[ perf record: Woken up 25 times to write data ]
[ perf record: Captured and wrote 6.112 MB perf.data (~267015 samples) ]

$ sudo ~/src/linux-kernel/linux/tools/perf/perf report -tui

Samples: 159K of event 'cycles:pp', Event count (approx.): 77088218721
 12,52%  t_lockref_from-  [kernel.kallsyms]  [k] irq_return
  4,37%  t_lockref_from-  [kernel.kallsyms]  [k] __ticket_spin_lock
  4,18%  t_lockref_from-  [kernel.kallsyms]  [k] __acct_update_integrals
  3,90%  t_lockref_from-  [kernel.kallsyms]  [k] user_exit
  3,17%  t_lockref_from-  [kernel.kallsyms]  [k] __d_lookup_rcu
  3,14%  t_lockref_from-  [kernel.kallsyms]  [k] lockref_get_or_lock
  3,01%  t_lockref_from-  [kernel.kallsyms]  [k] local_clock
  2,72%  t_lockref_from-  [kernel.kallsyms]  [k] kmem_cache_alloc
  2,54%  t_lockref_from-  libc-2.15.so       [.] __xstat64
  2,45%  t_lockref_from-  [kernel.kallsyms]  [k] link_path_walk
  2,23%  t_lockref_from-  [kernel.kallsyms]  [k] kmem_cache_free
  1,90%  t_lockref_from-  [kernel.kallsyms]  [k] rcu_eqs_exit_common.isra.43
  1,88%  t_lockref_from-  [kernel.kallsyms]  [k] tracesys
  1,82%  t_lockref_from-  [kernel.kallsyms]  [k] rcu_eqs_enter_common.isra.45
  1,77%  t_lockref_from-  [kernel.kallsyms]  [k] sched_clock_cpu
  1,76%  t_lockref_from-  [kernel.kallsyms]  [k] user_enter
  1,73%  t_lockref_from-  [kernel.kallsyms]  [k] lockref_put_or_lock
  1,70%  t_lockref_from-  [kernel.kallsyms]  [k] path_lookupat
  1,53%  t_lockref_from-  [kernel.kallsyms]  [k] native_read_tsc
  1,52%  t_lockref_from-  [kernel.kallsyms]  [k] native_sched_clock
  1,51%  t_lockref_from-  [kernel.kallsyms]  [k] cp_new_stat
  1,51%  t_lockref_from-  [kernel.kallsyms]  [k] syscall_trace_enter
  1,46%  t_lockref_from-  [kernel.kallsyms]  [k] account_system_time
  1,42%  t_lockref_from-  [kernel.kallsyms]  [k] path_init
  1,42%  t_lockref_from-  [kernel.kallsyms]  [k] copy_user_generic_unrolled
  1,39%  t_lockref_from-  [kernel.kallsyms]  [k] jiffies_to_timeval
  1,39%  t_lockref_from-  [kernel.kallsyms]  [k] getname_flags
  1,37%  t_lockref_from-  [kernel.kallsyms]  [k] vfs_getattr
  1,25%  t_lockref_from-  [kernel.kallsyms]  [k] common_perm
  1,14%  t_lockref_from-  [kernel.kallsyms]  [k] get_vtime_delta
  1,13%  t_lockref_from-  [kernel.kallsyms]  [k] lookup_fast
  1,12%  t_lockref_from-  [kernel.kallsyms]  [k] syscall_trace_leave
  1,05%  t_lockref_from-  [kernel.kallsyms]  [k] system_call
  0,99%  t_lockref_from-  [kernel.kallsyms]  [k] generic_fillattr
  0,94%  t_lockref_from-  [kernel.kallsyms]  [k] user_path_at_empty
  0,91%  t_lockref_from-  [kernel.kallsyms]  [k] account_user_time
  0,90%  t_lockref_from-  [kernel.kallsyms]  [k] __ticket_spin_unlock
  0,87%  t_lockref_from-  [kernel.kallsyms]  [k] strncpy_from_user
  0,83%  t_lockref_from-  [kernel.kallsyms]  [k] filename_lookup
  0,82%  t_lockref_from-  [kernel.kallsyms]  [k] generic_permission
  0,78%  t_lockref_from-  [kernel.kallsyms]  [k] complete_walk
  0,75%  t_lockref_from-  [kernel.kallsyms]  [k] vfs_fstatat
  0,74%  t_lockref_from-  [kernel.kallsyms]  [k] lg_local_lock
  0,72%  t_lockref_from-  [kernel.kallsyms]  [k] vtime_account_user
  0,67%  t_lockref_from-  [kernel.kallsyms]  [k] dput
  0,66%  t_lockref_from-  [kernel.kallsyms]  [k] __inode_permission
  0,62%  t_lockref_from-  [kernel.kallsyms]  [k] rcu_eqs_enter
  0,58%  t_lockref_from-  [kernel.kallsyms]  [k] lg_local_unlock
  0,56%  t_lockref_from-  [kernel.kallsyms]  [k] vtime_user_enter
  0,50%  t_lockref_from-  [kernel.kallsyms]  [k] cpuacct_account_field
  0,48%  t_lockref_from-  [kernel.kallsyms]  [k] security_inode_permission
  0,48%  t_lockref_from-  t_lockref_from-linus  [.] start_routine
  0,47%  t_lockref_from-  [kernel.kallsyms]  [k] security_inode_getattr
  0,47%  t_lockref_from-  [kernel.kallsyms]  [k] acct_account_cputime
Press '?' for help on key bindings

Here the annotated entries for the first two entries:

irq_return



│ Disassembly of section .text:

│ ffffffff816d4f2c <irq_return>:
100,00 │ ↓ jmpq 120
│ data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)


__ticket_spin_lock



│ Disassembly of section .text:

│ ffffffff8104ff10 <__ticket_spin_lock>:
2,55 │ push %rbp
1,19 │ mov $0x10000,%eax
2,16 │ mov %rsp,%rbp
84,70 │ lock xadd %eax,(%rdi)
0,14 │ mov %eax,%edx
│ shr $0x10,%edx
4,33 │ cmp %ax,%dx
0,03 │ ↓ je 2a
│ nop
│20: pause
0,03 │ movzwl (%rdi),%eax
│ cmp %dx,%ax
│ ↑ jne 20
0,03 │2a: pop %rbp
4,84 │ ← retq

- Sedat -

2013-09-03 23:17:03

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Wed, Sep 04, 2013 at 01:05:38AM +0200, Sedat Dilek wrote:
> On Wed, Sep 4, 2013 at 12:55 AM, Dave Jones <[email protected]> wrote:
> > On Wed, Sep 04, 2013 at 12:37:25AM +0200, Sedat Dilek wrote:
> >
> > > > You're spending more time on the task stats than on the actual lookup.
> > > > Maybe you should turn off CONFIG_TASKSTATS..But why that whole
> > > > irq_return thing? Odd.
> > > >
> > >
> > > [ init/Kconfig ]
> > > ...
> > > config TASKSTATS
> > > bool "Export task/process statistics through netlink"
> > > depends on NET <--- Difficult to disable it?!
> >
> > More likely you're getting bitten by the fact that CONFIG_KVM has
> > a 'select TASKSTATS'
> >
>
> With CONFIG_KVM=n ...
>
> $ egrep 'CONFIG_NET=|CONFIG_KVM=|CONFIG_TASKSTATS' .config
> CONFIG_TASKSTATS=y
> CONFIG_NET=y

Weird.

sed -i '/TASKSTATS/d' .config
sed -i '/KVM/d' .config
make oldconfig
...
egrep 'CONFIG_NET=|CONFIG_KVM=|CONFIG_TASKSTATS' .config
# CONFIG_TASKSTATS is not set
CONFIG_NET=y


The NET dependency shouldn't matter at all.
I don't see any other 'select TASKSTATS' in the tree.

Dave

2013-09-03 23:20:18

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Wed, Sep 4, 2013 at 1:15 AM, Dave Jones <[email protected]> wrote:
> On Wed, Sep 04, 2013 at 01:05:38AM +0200, Sedat Dilek wrote:
> > On Wed, Sep 4, 2013 at 12:55 AM, Dave Jones <[email protected]> wrote:
> > > On Wed, Sep 04, 2013 at 12:37:25AM +0200, Sedat Dilek wrote:
> > >
> > > > > You're spending more time on the task stats than on the actual lookup.
> > > > > Maybe you should turn off CONFIG_TASKSTATS..But why that whole
> > > > > irq_return thing? Odd.
> > > > >
> > > >
> > > > [ init/Kconfig ]
> > > > ...
> > > > config TASKSTATS
> > > > bool "Export task/process statistics through netlink"
> > > > depends on NET <--- Difficult to disable it?!
> > >
> > > More likely you're getting bitten by the fact that CONFIG_KVM has
> > > a 'select TASKSTATS'
> > >
> >
> > With CONFIG_KVM=n ...
> >
> > $ egrep 'CONFIG_NET=|CONFIG_KVM=|CONFIG_TASKSTATS' .config
> > CONFIG_TASKSTATS=y
> > CONFIG_NET=y
>
> Weird.
>
> sed -i '/TASKSTATS/d' .config
> sed -i '/KVM/d' .config
> make oldconfig
> ...
> egrep 'CONFIG_NET=|CONFIG_KVM=|CONFIG_TASKSTATS' .config
> # CONFIG_TASKSTATS is not set
> CONFIG_NET=y
>
>
> The NET dependancy shouldn't matter at all.
> I don't see any other 'select TASKSTATS' in the tree.
>

Ah, cool.

With CONFIG_KVM=n and CONFIG_TASKSTATS=n plus...

$ yes "" | make oldconfig && make silentoldconfig </dev/null

...I get now:

$ egrep 'CONFIG_NET=|CONFIG_KVM=|CONFIG_TASKSTATS' .config
# CONFIG_TASKSTATS is not set
CONFIG_NET=y

Thanks, Dave!

- Sedat -

2013-09-03 23:45:10

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Wed, Sep 4, 2013 at 12:37 AM, Sedat Dilek <[email protected]> wrote:
> On Sun, Sep 1, 2013 at 5:32 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Sun, Sep 1, 2013 at 3:01 AM, Sedat Dilek <[email protected]> wrote:
>>>
>>> Looks like this is now 10x faster: ~2.66Mloops (debug) VS.
>>> ~26.60Mloops (no-debug).
>>
>> Ok, that's getting to be in the right ballpark.
>>
>> But your profile is still odd.
>>
>>> Samples: 159K of event 'cycles:pp', Event count (approx.): 76968896763
>>> 12,79% t_lockref_from- [kernel.kallsyms] [k] irq_return
>>> 4,36% t_lockref_from- [kernel.kallsyms] [k] __ticket_spin_lock
>>
>> If you do the profile with "-g", what are the top callers of this? You
>> shouldn't see any spinlock load from the path lookup, but you have all
>> these other things going on..
>>
>>> 4,36% t_lockref_from- [kernel.kallsyms] [k] __acct_update_integrals
>>> 4,07% t_lockref_from- [kernel.kallsyms] [k] user_exit
>>> 3,12% t_lockref_from- [kernel.kallsyms] [k] local_clock
>>> 2,83% t_lockref_from- [kernel.kallsyms] [k] lockref_get_or_lock
>>> 2,73% t_lockref_from- [kernel.kallsyms] [k] kmem_cache_alloc
>>> 2,62% t_lockref_from- [kernel.kallsyms] [k] __d_lookup_rcu
>>
>> You're spending more time on the task stats than on the actual lookup.
>> Maybe you should turn off CONFIG_TASKSTATS..But why that whole
>> irq_return thing? Odd.
>>
>
> [ init/Kconfig ]
> ...
> config TASKSTATS
> bool "Export task/process statistics through netlink"
> depends on NET <--- Difficult to disable it?!
> default n
> help
> Export selected statistics for tasks/processes through the
> generic netlink interface. Unlike BSD process accounting, the
> statistics are available during the lifetime of tasks/processes as
> responses to commands. Like BSD accounting, they are sent to user
> space on task exit.
>
> Say N if unsure.
> ...
>

So with Dave J.'s help I disabled CONFIG_TASKSTATS.

But I still see that odd irq_ret* thing.

My kernel-config and patch (on top of Linux v3.11) are attached.

- Sedat -

$ sudo ~/src/linux-kernel/linux/tools/perf/perf report -tui

Samples: 161K of event 'cycles:pp', Event count (approx.): 76595555357
13,30%  t_lockref_from-  [kernel.kallsyms]  [k] irq_return
 5,25%  t_lockref_from-  [kernel.kallsyms]  [k] lockref_get_or_lock
 4,82%  t_lockref_from-  [kernel.kallsyms]  [k] __ticket_spin_lock
 4,23%  t_lockref_from-  [kernel.kallsyms]  [k] user_exit
 3,17%  t_lockref_from-  [kernel.kallsyms]  [k] local_clock
 2,98%  t_lockref_from-  [kernel.kallsyms]  [k] kmem_cache_alloc
 2,61%  t_lockref_from-  libc-2.15.so       [.] __xstat64
 2,55%  t_lockref_from-  [kernel.kallsyms]  [k] link_path_walk
 2,49%  t_lockref_from-  [kernel.kallsyms]  [k] kmem_cache_free
 2,03%  t_lockref_from-  [kernel.kallsyms]  [k] tracesys
 2,01%  t_lockref_from-  [kernel.kallsyms]  [k] path_lookupat
 1,99%  t_lockref_from-  [kernel.kallsyms]  [k] rcu_eqs_exit_common.isra.43
 1,94%  t_lockref_from-  [kernel.kallsyms]  [k] user_enter
 1,92%  t_lockref_from-  [kernel.kallsyms]  [k] rcu_eqs_enter_common.isra.45
 1,86%  t_lockref_from-  [kernel.kallsyms]  [k] sched_clock_cpu
 1,72%  t_lockref_from-  [kernel.kallsyms]  [k] __d_lookup_rcu
 1,71%  t_lockref_from-  [kernel.kallsyms]  [k] native_read_tsc
 1,68%  t_lockref_from-  [kernel.kallsyms]  [k] cp_new_stat
 1,65%  t_lockref_from-  [kernel.kallsyms]  [k] lockref_put_or_lock
 1,64%  t_lockref_from-  [kernel.kallsyms]  [k] lookup_fast
 1,59%  t_lockref_from-  [kernel.kallsyms]  [k] path_init
 1,58%  t_lockref_from-  [kernel.kallsyms]  [k] native_sched_clock
 1,57%  t_lockref_from-  [kernel.kallsyms]  [k] copy_user_generic_unrolled
 1,56%  t_lockref_from-  [kernel.kallsyms]  [k] syscall_trace_enter
 1,53%  t_lockref_from-  [kernel.kallsyms]  [k] account_system_time
 1,34%  t_lockref_from-  [kernel.kallsyms]  [k] getname_flags
 1,28%  t_lockref_from-  [kernel.kallsyms]  [k] get_vtime_delta
 1,26%  t_lockref_from-  [kernel.kallsyms]  [k] vfs_getattr
 1,14%  t_lockref_from-  [kernel.kallsyms]  [k] syscall_trace_leave
 1,11%  t_lockref_from-  [kernel.kallsyms]  [k] system_call
 1,06%  t_lockref_from-  [kernel.kallsyms]  [k] strncpy_from_user
 1,01%  t_lockref_from-  [kernel.kallsyms]  [k] generic_fillattr
 0,97%  t_lockref_from-  [kernel.kallsyms]  [k] account_user_time
 0,96%  t_lockref_from-  [kernel.kallsyms]  [k] user_path_at_empty
 0,92%  t_lockref_from-  [kernel.kallsyms]  [k] filename_lookup
 0,88%  t_lockref_from-  [kernel.kallsyms]  [k] __ticket_spin_unlock
 0,87%  t_lockref_from-  [kernel.kallsyms]  [k] complete_walk
 0,86%  t_lockref_from-  [kernel.kallsyms]  [k] generic_permission
 0,86%  t_lockref_from-  [kernel.kallsyms]  [k] common_perm
 0,82%  t_lockref_from-  [kernel.kallsyms]  [k] vfs_fstatat
 0,77%  t_lockref_from-  [kernel.kallsyms]  [k] rcu_eqs_enter
 0,75%  t_lockref_from-  [kernel.kallsyms]  [k] vtime_account_user
 0,72%  t_lockref_from-  [kernel.kallsyms]  [k] __inode_permission
 0,65%  t_lockref_from-  [kernel.kallsyms]  [k] dput
 0,62%  t_lockref_from-  [kernel.kallsyms]  [k] vtime_user_enter
 0,59%  t_lockref_from-  [kernel.kallsyms]  [k] apparmor_inode_getattr
 0,55%  t_lockref_from-  [kernel.kallsyms]  [k] lg_local_lock
 0,53%  t_lockref_from-  [kernel.kallsyms]  [k] __vtime_account_system
 0,51%  t_lockref_from-  [kernel.kallsyms]  [k] security_inode_permission
 0,51%  t_lockref_from-  [kernel.kallsyms]  [k] mntput
 0,47%  t_lockref_from-  t_lockref_from-linus  [.] start_routine
 0,45%  t_lockref_from-  [kernel.kallsyms]  [k] cpuacct_account_field
 0,45%  t_lockref_from-  [kernel.kallsyms]  [k] int_with_check
 0,44%  t_lockref_from-  [kernel.kallsyms]  [k] rcu_eqs_exit
Press '?' for help on key bindings


Attachments:
config-3.11.0-2-lockref-small (111.48 kB)
3.11.0-2-lockref-small.patch (16.76 kB)
Download all attachments

2013-09-04 14:53:22

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 09/03/2013 03:09 PM, Linus Torvalds wrote:
> On Tue, Sep 3, 2013 at 8:34 AM, Linus Torvalds
> <[email protected]> wrote:
>> I suspect the tty_ldisc_lock() could be made to go away if we care.
> Heh. I just pulled the tty patches from Greg, and the locking has
> changed completely.
>
> It may actually fix your AIM7 test-case, because while the global
> spinlock remains (it got renamed to "tty_ldiscs_lock" - there's an
> added "s"), the common operations now take the per-tty lock to get the
> ldisc for that tty, rather than that global spinlock (which just
> protects the actual ldisk array now).
>
> That said, I don't know what AIM7 really ends up doing, but your
> profile seems to have every access through tty_ldisc_[de]ref() that
> now uses only the per-tty lock. Of course, how much that helps ends up
> depending on whether AIM7 uses lots of tty's or just one shared one.
>
> Anyway, it might be worth testing my current -git tree.
>
> Linus

The latest tty patches did work. The tty related spinlock contention is
now completely gone. The short workload can now reach over 8M JPM which
is the highest I have ever seen.

The perf profile was:

5.85% reaim reaim [.] mul_short
4.87% reaim [kernel.kallsyms] [k] ebitmap_get_bit
4.72% reaim reaim [.] mul_int
4.71% reaim reaim [.] mul_long
2.67% reaim libc-2.12.so [.] __random_r
2.64% reaim [kernel.kallsyms] [k] lockref_get_not_zero
1.58% reaim [kernel.kallsyms] [k] copy_user_generic_string
1.48% reaim [kernel.kallsyms] [k] mls_level_isvalid
1.35% reaim [kernel.kallsyms] [k] find_next_bit
1.23% reaim [kernel.kallsyms] [k] system_call
1.21% reaim libc-2.12.so [.] memcpy
1.19% reaim [kernel.kallsyms] [k] _raw_spin_lock
1.06% reaim [kernel.kallsyms] [k] avc_has_perm_flags
1.04% reaim libc-2.12.so [.] __srandom_r
1.02% reaim reaim [.] newton_raphson
1.01% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
0.98% reaim [kernel.kallsyms] [k] fsnotify
0.94% reaim [kernel.kallsyms] [k] avtab_search_node
0.91% reaim libm-2.12.so [.] __sincos

I have a patch in linux-next that should eliminate ebitmap_get_bit,
mls_level_isvalid and find_next_bit from the top of the list once it is merged.

Regards,
Longman

2013-09-04 15:14:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Wed, Sep 4, 2013 at 7:52 AM, Waiman Long <[email protected]> wrote:
>
> The latest tty patches did work. The tty related spinlock contention is now
> completely gone. The short workload can now reach over 8M JPM which is the
> highest I have ever seen.

Good. And this was with the 80-core machine, so there aren't any
scalability issues hiding?

Linus

2013-09-04 19:25:55

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 09/04/2013 11:14 AM, Linus Torvalds wrote:
> On Wed, Sep 4, 2013 at 7:52 AM, Waiman Long<[email protected]> wrote:
>> The latest tty patches did work. The tty related spinlock contention is now
>> completely gone. The short workload can now reach over 8M JPM which is the
>> highest I have ever seen.
> Good. And this was with the 80-core machine, so there aren't any
> scalability issues hiding?
>
> Linus

Yes, the perf profile was taken from an 80-core machine. There isn't
any scalability issue hiding for the short workload on an 80-core machine.

However, I am certain that more will pop up when running on an even
larger machine, like the prototype 240-core machine that our team has
been testing on.

-Longman

2013-09-04 21:34:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Wed, Sep 4, 2013 at 12:25 PM, Waiman Long <[email protected]> wrote:
>
> Yes, the perf profile was taking from an 80-core machine. There isn't any
> scalability issue hiding for the short workload on an 80-core machine.
>
> However, I am certain that more may pop up when running in an even larger
> machine like the prototype 240-core machine that our team has been testing
> on.

Sure. Please let us know, I think it's going to be interesting to see
what that shows.

SGI certainly did much larger machines, but their primary target
tended to be all user space, so they had things like "tons of
concurrent page faults in the same process" rather than filename
lookup or the tty layer.

Linus

2013-09-05 02:35:59

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 09/04/2013 05:34 PM, Linus Torvalds wrote:
> On Wed, Sep 4, 2013 at 12:25 PM, Waiman Long<[email protected]> wrote:
>> Yes, the perf profile was taking from an 80-core machine. There isn't any
>> scalability issue hiding for the short workload on an 80-core machine.
>>
>> However, I am certain that more may pop up when running in an even larger
>> machine like the prototype 240-core machine that our team has been testing
>> on.
> Sure. Please let us know, I think it's going to be interesting to see
> what that shows.
>
> SGI certainly did much larger machines, but their primary target
> tended to be all user space, so they had things like "tons of
> concurrent page faults in the same process" rather than filename
> lookup or the tty layer.
>
> Linus

I think SGI is more focused on compute-intensive workloads. HP is more
focused on high-end commercial workloads like SAP HANA. Below is a sample
perf profile of the high-systime workload on a 240-core prototype
machine (HT off) running a 3.10-rc1 kernel with my lockref and seqlock patches:

   9.61%  3382925  swapper  [kernel.kallsyms]  [k] _raw_spin_lock
              |--59.90%-- rcu_process_callbacks
              |--19.41%-- load_balance
              |--9.58%-- rcu_accelerate_cbs
              |--6.70%-- tick_do_update_jiffies64
              |--1.46%-- scheduler_tick
              |--1.17%-- sched_rt_period_timer
              |--0.56%-- perf_adjust_freq_unthr_context
               --1.21%-- [...]

   6.34%       99  reaim    [kernel.kallsyms]  [k] _raw_spin_lock
              |--73.96%-- load_balance
              |--11.98%-- rcu_process_callbacks
              |--2.21%-- __mutex_lock_slowpath
              |--2.02%-- rcu_accelerate_cbs
              |--1.95%-- wake_up_new_task
              |--1.70%-- scheduler_tick
              |--1.67%-- xfs_alloc_log_agf
              |--1.24%-- task_rq_lock
              |--1.15%-- try_to_wake_up
               --2.12%-- [...]

   5.39%        2  reaim    [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
              |--95.08%-- rwsem_wake
              |--1.80%-- rcu_process_callbacks
              |--1.03%-- prepare_to_wait
              |--0.59%-- __wake_up
               --1.50%-- [...]

   2.28%        1  reaim    [kernel.kallsyms]  [k] _raw_spin_lock_irq
              |--90.56%-- rwsem_down_write_failed
              |--9.25%-- __schedule
               --0.19%-- [...]

Longman

2013-09-05 13:31:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount


* Waiman Long <[email protected]> wrote:

> On 09/03/2013 03:09 PM, Linus Torvalds wrote:
> >On Tue, Sep 3, 2013 at 8:34 AM, Linus Torvalds
> ><[email protected]> wrote:
> >>I suspect the tty_ldisc_lock() could be made to go away if we care.
> >Heh. I just pulled the tty patches from Greg, and the locking has
> >changed completely.
> >
> >It may actually fix your AIM7 test-case, because while the global
> >spinlock remains (it got renamed to "tty_ldiscs_lock" - there's an
> >added "s"), the common operations now take the per-tty lock to get the
> >ldisc for that tty, rather than that global spinlock (which just
> >protects the actual ldisk array now).
> >
> >That said, I don't know what AIM7 really ends up doing, but your
> >profile seems to have every access through tty_ldisc_[de]ref() that
> >now uses only the per-tty lock. Of course, how much that helps ends up
> >depending on whether AIM7 uses lots of tty's or just one shared one.
> >
> >Anyway, it might be worth testing my current -git tree.
> >
> > Linus
>
> The latest tty patches did work. The tty related spinlock contention
> is now completely gone. The short workload can now reach over 8M JPM
> which is the highest I have ever seen.
>
> The perf profile was:
>
> 5.85% reaim reaim [.] mul_short
> 4.87% reaim [kernel.kallsyms] [k] ebitmap_get_bit
> 4.72% reaim reaim [.] mul_int
> 4.71% reaim reaim [.] mul_long
> 2.67% reaim libc-2.12.so [.] __random_r
> 2.64% reaim [kernel.kallsyms] [k] lockref_get_not_zero
> 1.58% reaim [kernel.kallsyms] [k] copy_user_generic_string
> 1.48% reaim [kernel.kallsyms] [k] mls_level_isvalid
> 1.35% reaim [kernel.kallsyms] [k] find_next_bit

6%+ spent in ebitmap_get_bit() and mls_level_isvalid() looks like
something worth optimizing.

Is that called very often, or is it perhaps cache-bouncing for some
reason?

Btw., you ought to be able to see instructions where the CPU is in some
sort of stall (either it ran out of work, or it is cache-missing, or it is
executing something complex), via:

perf top -e stalled-cycles-frontend -e stalled-cycles-backend

run it for a while and pick the one which has more entries and have a
look. Both profiles will keep updating in the background.

(Note: on Haswell CPUs stalled-cycles events are not yet available.)

Another performance analysis trick is to run this while your workload is
executing:

perf stat -a -ddd sleep 60

and have a look at the output - it will color-code suspicious looking
counts.

For example, this is on a 32-way box running a kernel build:

vega:~> perf stat -addd sleep 10

Performance counter stats for 'sleep 10':

320753.639566 task-clock # 32.068 CPUs utilized [100.00%]
187,962 context-switches # 0.586 K/sec [100.00%]
22,989 cpu-migrations # 0.072 K/sec [100.00%]
6,622,424 page-faults # 0.021 M/sec
817,576,186,285 cycles # 2.549 GHz [27.82%]
214,366,744,930 stalled-cycles-frontend # 26.22% frontend cycles idle [16.75%]
45,454,323,703 stalled-cycles-backend # 5.56% backend cycles idle [16.72%]
474,770,833,376 instructions # 0.58 insns per cycle
# 0.45 stalled cycles per insn [22.27%]
105,860,676,229 branches # 330.037 M/sec [33.37%]
5,964,088,457 branch-misses # 5.63% of all branches [33.36%]
244,982,563,232 L1-dcache-loads # 763.772 M/sec [33.35%]
7,503,377,286 L1-dcache-load-misses # 3.06% of all L1-dcache hits [33.36%]
19,606,194,180 LLC-loads # 61.125 M/sec [22.26%]
1,232,340,603 LLC-load-misses # 6.29% of all LL-cache hits [16.69%]
261,749,251,526 L1-icache-loads # 816.045 M/sec [16.69%]
11,821,747,974 L1-icache-load-misses # 4.52% of all L1-icache hits [16.67%]
244,253,746,431 dTLB-loads # 761.500 M/sec [27.78%]
126,546,407 dTLB-load-misses # 0.05% of all dTLB cache hits [33.30%]
260,909,042,891 iTLB-loads # 813.425 M/sec [33.31%]
73,911,000 iTLB-load-misses # 0.03% of all iTLB cache hits [33.31%]
7,989,072,388 L1-dcache-prefetches # 24.907 M/sec [27.76%]
0 L1-dcache-prefetch-misses # 0.000 K/sec [33.32%]

10.002245831 seconds time elapsed

the system is nicely saturated, caches are more or less normally utilized,
but about a quarter of all frontend cycles are idle.

So then I ran:

perf top -e stalled-cycles-frontend

which gave me this profile:

2.21% cc1 cc1 [.] ht_lookup_with_hash
1.86% cc1 libc-2.15.so [.] _int_malloc
1.66% cc1 [kernel.kallsyms] [k] page_fault
1.48% cc1 cc1 [.] _cpp_lex_direct
1.33% cc1 cc1 [.] grokdeclarator
1.26% cc1 cc1 [.] ggc_internal_alloc_stat
1.19% cc1 cc1 [.] ggc_internal_cleared_alloc_stat
1.12% cc1 libc-2.15.so [.] _int_free
1.10% cc1 libc-2.15.so [.] malloc
0.95% cc1 cc1 [.] c_lex_with_flags
0.95% cc1 cc1 [.] cpp_get_token_1
0.92% cc1 cc1 [.] c_parser_declspecs

where gcc's ht_lookup_with_hash() function is visibly getting stalls from a hash walk:

0.79 : a0f303: addl $0x1,0x80(%rdi)
0.01 : a0f30a: mov %rdi,%rbx
0.18 : a0f30d: mov %rsi,%r15
0.00 : a0f310: lea -0x1(%r14),%r13d
0.13 : a0f314: and %r13d,%r10d
0.05 : a0f317: mov %r10d,%eax
0.34 : a0f31a: mov (%rcx,%rax,8),%r9
24.87 : a0f31e: lea 0x0(,%rax,8),%rdx
0.02 : a0f326: test %r9,%r9
0.00 : a0f329: je a0f5bd <ht_lookup_with_hash+0x2ed>
0.31 : a0f32f: cmp $0xffffffffffffffff,%r9
0.18 : a0f333: je a0f6e1 <ht_lookup_with_hash+0x411>
0.37 : a0f339: cmp %r12d,0xc(%r9)
24.41 : a0f33d: jne a0f3a0 <ht_lookup_with_hash+0xd0>

So giving that function some attention would probably give the most bang
for the buck on this particular workload.

Thanks,

Ingo

2013-09-05 17:33:40

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On 09/05/2013 09:31 AM, Ingo Molnar wrote:
> * Waiman Long<[email protected]> wrote:
>
>
>> The latest tty patches did work. The tty related spinlock contention
>> is now completely gone. The short workload can now reach over 8M JPM
>> which is the highest I have ever seen.
>>
>> The perf profile was:
>>
>> 5.85% reaim reaim [.] mul_short
>> 4.87% reaim [kernel.kallsyms] [k] ebitmap_get_bit
>> 4.72% reaim reaim [.] mul_int
>> 4.71% reaim reaim [.] mul_long
>> 2.67% reaim libc-2.12.so [.] __random_r
>> 2.64% reaim [kernel.kallsyms] [k] lockref_get_not_zero
>> 1.58% reaim [kernel.kallsyms] [k] copy_user_generic_string
>> 1.48% reaim [kernel.kallsyms] [k] mls_level_isvalid
>> 1.35% reaim [kernel.kallsyms] [k] find_next_bit
> 6%+ spent in ebitmap_get_bit() and mls_level_isvalid() looks like
> something worth optimizing.
>
> Is that called very often, or is it perhaps cache-bouncing for some
> reason?

The high cycle count is due more to an inefficient algorithm in the
mls_level_isvalid() function than to cacheline contention in that code. The
attached patch should address this problem. It is in linux-next and
will hopefully be merged in 3.12.
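
For readers without the attachment handy, the general idea is to test
containment a word at a time instead of calling ebitmap_get_bit() once per
category bit. The sketch below is illustrative only - the names are
hypothetical and this is not the attached patch:

static bool bitmap_word_contains(const unsigned long *superset,
                                 const unsigned long *subset,
                                 unsigned int nbits)
{
        unsigned int i, words = BITS_TO_LONGS(nbits);

        for (i = 0; i < words; i++) {
                /* every bit set in subset must also be set in superset */
                if (subset[i] & ~superset[i])
                        return false;
        }
        return true;
}

One pass over the bitmap words replaces thousands of per-bit lookups, which
is where the cycles in the profile above were going.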

-Longman


Attachments:
0001-SELinux-Reduce-overhead-of-mls_level_isvalid-functio.patch (7.67 kB)

2013-09-05 17:40:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount


* Waiman Long <[email protected]> wrote:

> On 09/05/2013 09:31 AM, Ingo Molnar wrote:
> >* Waiman Long<[email protected]> wrote:
> >
> >
> >>The latest tty patches did work. The tty related spinlock contention
> >>is now completely gone. The short workload can now reach over 8M JPM
> >>which is the highest I have ever seen.
> >>
> >>The perf profile was:
> >>
> >>5.85% reaim reaim [.] mul_short
> >>4.87% reaim [kernel.kallsyms] [k] ebitmap_get_bit
> >>4.72% reaim reaim [.] mul_int
> >>4.71% reaim reaim [.] mul_long
> >>2.67% reaim libc-2.12.so [.] __random_r
> >>2.64% reaim [kernel.kallsyms] [k] lockref_get_not_zero
> >>1.58% reaim [kernel.kallsyms] [k] copy_user_generic_string
> >>1.48% reaim [kernel.kallsyms] [k] mls_level_isvalid
> >>1.35% reaim [kernel.kallsyms] [k] find_next_bit
> >6%+ spent in ebitmap_get_bit() and mls_level_isvalid() looks like
> >something worth optimizing.
> >
> >Is that called very often, or is it perhaps cache-bouncing for some
> >reason?
>
> The high cycle count is due more to inefficient algorithm in the
> mls_level_isvalid() function than cacheline contention in the code. The
> attached patch should address this problem. It is in linux-next and
> hopefully will be merged in 3.12.

Great!

If/when you happen to boot the latest & greatest kernel that has all these
scalability patches applied it would be nice if you could send an updated
profile into this thread.

Thanks,

Ingo

2013-09-08 21:45:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

Al, this is mainly a rambling question for you, but others left on the
Cc just FYI..

On Thu, Aug 29, 2013 at 4:42 PM, Linus Torvalds
<[email protected]> wrote:
>
> So I fixed the VFS layer instead. With dput() and friends using
> lockrefs, the only thing remaining in the hot RCU dentry lookup path
> was the nasty __d_rcu_to_refcount() thing in complete_walk(). I
> rewrote that to locklessly increment the refcount when it was nonzero,
> and get the lock if it was zero, and that all seems to work fine.
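
For context, the get-or-lock operation described in that quote works roughly
as follows. This is an illustrative sketch, not the kernel's lockref code: it
assumes the spinlock and the count share a single 64-bit word, and the real
implementation uses arch_spin_value_unlocked() on the snapshot rather than
spin_is_locked():

struct lockref_sketch {
        union {
                u64 lock_count;
                struct {
                        spinlock_t lock;
                        unsigned int count;
                };
        };
};

/* Returns true with a reference taken, or false with ->lock held. */
static bool lockref_sketch_get_or_lock(struct lockref_sketch *lr)
{
        struct lockref_sketch old, new;

        old.lock_count = ACCESS_ONCE(lr->lock_count);
        while (!spin_is_locked(&old.lock) && old.count > 0) {
                new.lock_count = old.lock_count;
                new.count++;
                /* one cmpxchg bumps the count without touching the lock */
                if (cmpxchg64(&lr->lock_count, old.lock_count,
                              new.lock_count) == old.lock_count)
                        return true;
                old.lock_count = ACCESS_ONCE(lr->lock_count);
        }
        /* count was zero or someone holds the lock: fall back to locking */
        spin_lock(&lr->lock);
        return false;
}

The slow path hands the caller the spinlock, which is exactly the "get the
lock if it was zero" fallback described above.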

Today I realized that my new simplified d_rcu_to_refcount() was
fundamentally buggy in a really annoying way.

I had worried mainly about the races with the dentry being
invalidated, and how we must not touch the dentry reference count if
it might be dead, because once the dentry has been killed, it cannot
be killed again.

My lockref_get_or_lock() avoided all that by making sure that it
always got the reference on a dentry that was still live, and this
everything was fine. If it turns out that the sequence count says that
the lookup wasn't valid, we could just dput() that dentry, and
everything was fine.

And you know what? That's even true. It's true from a dentry standpoint.

But today I was - for totally unrelated reasons, just as a result of
looking at performance profiles - looking at dput(), and realized that
dput() really *can* sleep. Not for any actual dentry reasons, but
because the inode associated with the dentry is being released. And
the reason I noticed this from the performance profile was that the
dput code did

        if (dentry->d_lockref.count == 1)
                might_sleep();

where reading that lockref count showed up in the profile.

But that test never triggered any of my thinking, and it's racy
anyway, and the condition we care about is another race, which means
that it not only didn't trigger my thinking, it doesn't trigger in any
normal testing either. The fact is, dput() can always sleep,
regardless of count, because of the race (ok, the exception being that
you actually have multiple references to the same dentry _yourself_,
but who does that? Nobody).

Anyway, the reason the new d_rcu_to_refcount() is buggy isn't because
it does bad things to dentries - the dentry reference counts and
lifetimes are perfectly fine. No, the reason it does bad things is
that it does an otherwise perfectly fine dput() in a context where it
definitely is _not_ fine to sleep. We are in RCU lookup, so we are in
a RCU-locked region, we hold the percpu vfsmount_lock spinlock, and in
fact in unlazy_walk() we also potentially hold the "fs->lock"
spinlock.

Ugh, ugh, ugh.

I really like the much simplified d_rcu_to_refcount() conceptually,
though. So my current plan is to keep it, but to just delay the
"dput()" it does until after we've done the "unlock_rcu_walk()".

"complete_walk()" is actually quite simple to fix, because it does the
unlock_rcu_walk() itself - so it's fairly trivial to just move the
dput() to be after it. In preparation for this, I already ended up
committing my "dead lockref" dentry code, because that simplifies
d_rcu_to_refcount() to the degree that splitting the code up into the
callers and then moving the dput() looks like the right thing to do.
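
Concretely, the RCU leg of complete_walk() ends up shaped roughly like the
sketch below. This is a paraphrase of the approach described here, not the
committed patch, and it leaves out the root/vfsmount bookkeeping:

static int complete_walk_rcu_sketch(struct nameidata *nd)
{
        struct dentry *dentry = nd->path.dentry;

        if (!(nd->flags & LOOKUP_RCU))
                return 0;

        nd->flags &= ~LOOKUP_RCU;
        if (unlikely(!lockref_get_not_dead(&dentry->d_lockref))) {
                unlock_rcu_walk();
                return -ECHILD;
        }
        if (read_seqcount_retry(&dentry->d_seq, nd->seq)) {
                unlock_rcu_walk();
                dput(dentry);   /* safe: no longer under rcu_read_lock() */
                return -ECHILD;
        }
        unlock_rcu_walk();
        return 0;
}

The point is the ordering: the reference is taken while the dentry is known
live, but the dput() on the sequence-count failure happens only after we have
left the RCU-locked region.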

The problem I see right now is unlazy_walk(). It does *not* do the
unlock_rcu_walk() itself for the error case (and it's the error case
that needs this), but instead just returns -ECHILD and expects the
caller to do it. Even then it happens fairly obscurely in
"terminate_walk()".

So Al, any ideas? I currently have two:

- always do that unlock_rcu_walk() in unlazy_walk(), and exit RCU
mode even for the error case (similar to how complete_walk() does it).

That makes solving this for unlazy_walk() as straightforward as it
is for complete_walk(), and this is, I think, the right thing to do.

- add a couple of "to be freed" dentry pointers to the 'nd' and let
terminate_walk() do the dput(). This leaves the unlazy_walk()
semantics alone.

The second one is hacky, but it doesn't change the semantics of
unlazy_walk(). I think the current semantics (drop out of RCU mode on
success, but not on failure) are nasty, but I wonder if there is some
really subtle reason for it. So the hacky second solution has got that
whole "no change to odd/subtle semantics" thing going for it..

Comments?

Linus

2013-09-09 00:03:21

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 08, 2013 at 02:45:32PM -0700, Linus Torvalds wrote:

> But that test never triggered any of my thinking, and it's racy
> anyway, and the condition we care about is another race, which means
> that it not only didn't trigger my thinking, it doesn't trigger in any
> normal testing either. The fact is, dput() can always sleep,
> regardless of count, because of the race (ok, the exception being that
> you actually have multiple references to the same dentry _yourself_,
> but who does that? Nobody).

There's one exception - basically, we decide to put duplicates of
reference(s) we hold into (a bunch of) structures being created. If
we decide that we'd failed and need to roll back, the structures
need to be taken out of whatever lists, etc. they'd been already
put on and references held in them - dropped. That removal gets done
under a spinlock. Sure, we can string those structures on some kind
of temp list, drop the spinlock and do dput() on everything in there,
but it's much more convenient to just free them as we are evicting
them, doing dput() as we go. Which is safe, since we still have
the references used to create these buggers pinned down.

> Anyway, the reason the new d_rcu_to_refcount() is buggy isn't because
> it does bad things to dentries - the dentry reference counts and
> lifetimes are perfectly fine. No, the reason it does bad things is
> that it does an otherwise perfectly fine dput() in a context where it
> definitely is _not_ fine to sleep. We are in RCU lookup, so we are in
> a RCU-locked region, we hold the percpu vfsmount_lock spinlock, and in
> fact in unlazy_walk() we also potentially hold the "fs->lock"
> spinlock.
>
> Ugh, ugh, ugh.
>
> I really like the much simplified d_rcu_to_refcount() conceptually,
> though. So my current plan is to keep it, but to just delay the
> "dput()" it does until after we've done the "unlock_rcu_walk()".

> "complete_walk()" is actually quite simple to fix, because it does the
> unlock_rcu_walk() itself - so it's fairly trivial to just move the
> dput() to be after it. In preparation for this, I already ended up
> committing my "dead lockref" dentry code, because that simplifies
> d_rcu_to_refcount() to the degree that splitting the code up into the
> callers and then moving the dput() looks like the right thing to do.
>
> The problem I see right now is unlazy_walk(). It does *not* do the
> unlock_rcu_walk() itself for the error case (and it's the error case
> that needs this), but instead just returns -ECHILD and expects the
> caller to do it. Even then it happens fairly obscurely in
> "terminate_walk()".
>
> So Al, any ideas? I currently have two:
>
> - always do that unlock_rcu_walk() in unlazy_walk(), and exit RCU
> mode even for the error case (similar to how complete_walk() does it).
>
> That makes solving this for unlazy_walk() as straightforward as it
> is for complete_walk(), and this is, I think, the right thing to do.
>
> - add a couple of "to be freed" dentry pointers to the 'nd' and let
> terminate_walk() do the dput(). This leaves the unlazy_walk()
> semantics alone.
>
> The second one is hacky, but it doesn't change the semantics of
> unlzay_walk(). I think the current semantics (drop out of RCU mode on
> success, but not on failure) are nasty, but I wonder if there is some
> really subtle reason for it. So the hacky second solution has got that
> whole "no change to odd/subtle semantics" thing going for it..

Well... unlazy_walk() is always followed by terminate_walk() very shortly,
but there's a minor problem - terminate_walk() uses "are we in RCU
mode?" for two things:
a) do we need to do path_put() here?
b) do we need to unlock?
If you introduce the third case ("no need to do unlock and no need to
do path_put()"), we'd better decide how to check for that case...

I suspect that minimal variant would be along the lines of
* have unlazy_walk() slap NULL into ->path.mnt on error, clear
LOOKUP_RCU and unlock
* have terminate_walk() check ->path.mnt before doing path_put()
in !RCU case
* in do_last() replace bool got_write with struct vfsmount *got_write,
storing the reference to vfsmount we'd fed to mnt_want_write().
And use its value when we call mnt_put_write() in there...

I'll put together a commit like that on top of what I was going to push
into public queues tonight; give me about half an hour, OK?

2013-09-09 00:25:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 8, 2013 at 5:03 PM, Al Viro <[email protected]> wrote:
>
> Well... unlazy_walk() is always followed by terminate_walk() very shortly,
> but there's a minor problem - terminate_walk() uses "are we in RCU
> mode?" for two things:
> a) do we need to do path_put() here?
> b) do we need to unlock?
> If you introduce the third case ("no need to do unlock and no need to
> do path_put()"), we'd better decide how to check for that case...

Actually, I decided to take advantage of those two cases instead, and
I have a patch that I think does the right thing. Basically, I start
off unlazy_walk() with just doing that lockref_get_not_dead() on the
parent dentry, and if that fails I just return an error in RCU mode
(so terminate_walk() does what it always used to do, and we haven't
done anything else to any refcounts).

Now, if the lockref_get_not_dead() succeeded, that means that we have
a reference on the nd->path.dentry, and we can now just do
"mntget(nd->path.mnt);". Ta-Daa! We now have everything done for the
non-RCU case for terminate_walk().

So after that point, we clear LOOKUP_RCU, and make the rule be that
any return (error or success) has to do unlock_rcu_walk(). And then
all the other refcounts are easy, we can just "dput(dentry);" after
that.
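
The control flow being described boils down to something like the sketch
below. This is not the actual patch - the child-dentry and sequence-count
revalidation that the real unlazy_walk() does between these steps is omitted,
and it assumes the usual fs/namei.c context (struct nameidata,
unlock_rcu_walk(), and so on):

static int unlazy_walk_sketch(struct nameidata *nd)
{
        struct dentry *parent = nd->path.dentry;

        BUG_ON(!(nd->flags & LOOKUP_RCU));

        /* 1. Pin the parent.  On failure nothing has been taken, so we
         *    return in RCU mode and terminate_walk() behaves as before. */
        if (unlikely(!lockref_get_not_dead(&parent->d_lockref)))
                return -ECHILD;

        /* 2. With the dentry pinned, take the vfsmount reference too;
         *    nd->path now carries normal ref-walk references. */
        mntget(nd->path.mnt);

        /* 3. Committed: leave RCU-walk mode.  Any later failure can use
         *    plain dput()/path_put(), since we are no longer under
         *    rcu_read_lock() or the vfsmount br-lock. */
        nd->flags &= ~LOOKUP_RCU;
        unlock_rcu_walk();
        return 0;
}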

I haven't tested it yet, I was going to reboot into it just now. But
I'm attaching the patch here. Maybe I missed some detail, but it all
seems simpler.

Note that this patch requires the "lockref_get_not_dead()" cleanup at
the top of my current -git.

Linus


Attachments:
patch.diff (4.51 kB)

2013-09-09 00:30:36

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Mon, Sep 09, 2013 at 01:03:00AM +0100, Al Viro wrote:

> Well... unlazy_walk() is always followed by terminate_walk() very shortly,
> but there's a minor problem - terminate_walk() uses "are we in RCU
> mode?" for two things:
> a) do we need to do path_put() here?
> b) do we need to unlock?
> If you introduce the third case ("no need to do unlock and no need to
> do path_put()"), we'd better decide how to check for that case...
>
> I suspect that minimal variant would be along the lines of
> * have unlazy_walk() slap NULL into ->path.mnt on error, clear
> LOOKUP_RCU and unlock
> * have terminate_walk() check ->path.mnt before doing path_put()
> in !RCU case
> * in do_last() replace bool got_write with struct vfsmount *got_write,
> storing the reference to vfsmount we'd fed to mnt_want_write().
> And use its value when we call mnt_put_write() in there...
>
> I'll put together a commit like that on top of what I was going to push
> into public queues tonight; give me about half an hour, OK?

See the last commit in vfs.git#for-next (38373e1).

2013-09-09 00:35:26

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 08, 2013 at 05:25:40PM -0700, Linus Torvalds wrote:
> On Sun, Sep 8, 2013 at 5:03 PM, Al Viro <[email protected]> wrote:
> >
> > Well... unlazy_walk() is always followed by terminate_walk() very shortly,
> > but there's a minor problem - terminate_walk() uses "are we in RCU
> > mode?" for two things:
> > a) do we need to do path_put() here?
> > b) do we need to unlock?
> > If you introduce the third case ("no need to do unlock and no need to
> > do path_put()"), we'd better decide how to check for that case...
>
> Actually, I decided to take advantage of those two cases instead, and
> I have a patch that I think does the right thing. Basically, I start
> off unlazy_walk() with just doing that lockref_get_not_dead() on the
> parent dentry, and if that fails I just return an error in RCU mode
> (so terminate_walk() does what it always used to do, and we haven't
> done anything else to any refcounts).
>
> Now, if the lockref_get_not_dead() succeeded, that means that we have
> a reference on the nd->path.dentry, and we can now just do
> "mntget(nd->path.mnt);". Ta-Daa! We now have everything done for the
> non-RCU case for terminate_walk().
>
> So after that point, we clear LOOKUP_RCU, and make the rule be that
> any return (error or success) has to do unlock_rcu_walk(). And then
> all the other refcounts are easy, we can just "dput(dentry);" after
> that.
>
> I haven't tested it yet, I was going to reboot into it just now. But
> I'm attaching the patch here. Maybe I missed some detail, but it all
> seems simpler.
>
> Note that this patch requires the "lockref_get_not_dead()" cleanup at
> the top of my current -git.

That should also work, replacing the current tip of #for-next. Do you
prefer to merge those two diffs of yours into a single commit?

2013-09-09 00:38:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 8, 2013 at 5:35 PM, Al Viro <[email protected]> wrote:
>
> That should also work, replacing the current tip of #for-next. Do you
> prefer to merge those two diffs of yours into a single commit?

If you're ok with my patch (it's now also tested, I'm running with it
and it looks fine), I'll commit that one as-is.

When you say "those two diffs of yours", which two are you talking
about? I already committed the "dead lockref" part separately - it may
be "preparatory", but it was preparatory cleanup that didn't change
semantics, so it's better to be separate anyway. The last patch I sent
out a few moments ago is the one that actually fixes things so that
"dput()" isn't done under the RCU lock etc.

Linus

2013-09-09 00:57:45

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 08, 2013 at 05:38:46PM -0700, Linus Torvalds wrote:
> On Sun, Sep 8, 2013 at 5:35 PM, Al Viro <[email protected]> wrote:
> >
> > That should also work, replacing the current tip of #for-next. Do you
> > prefer to merge those two diffs of yours into a single commit?
>
> If you're ok with my patch (it's now also tested, I'm running with it
> and it looks fine), I'll commit that one as-is.

OK... I'm dropping 38373e1ee72bf57b6f9b1ab0979a60887352c7e6, then...

2013-09-09 02:10:14

by Ramkumar Ramachandra

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

Linus Torvalds wrote:
> On Sun, Sep 8, 2013 at 5:03 PM, Al Viro <[email protected]> wrote:
>>
>> Well... unlazy_walk() is always followed by terminate_walk() very shortly,
>> but there's a minor problem - terminate_walk() uses "are we in RCU
>> mode?" for two things:
>> a) do we need to do path_put() here?
>> b) do we need to unlock?
>> If you introduce the third case ("no need to do unlock and no need to
>> do path_put()"), we'd better decide how to check for that case...

Yeah, I think b is a crucial step: it depends on how you unlock. Does
the unlocking result in data corruption and chaos, or does it result
in ordered data being unlocked? If it's the former case, people are
sure to take advantage of you; otherwise, you can use people.

> Actually, I decided to take advantage of those two cases instead, and
> I have a patch that I think does the right thing. Basically, I start
> off unlazy_walk() with just doing that lockref_get_not_dead() on the
> parent dentry, and if that fails I just return an error in RCU mode
> (so terminate_walk() does what it always used to do, and we haven't
> done anything else to any refcounts).

LRU cache requires that a single processor is accessing memory at a
time: if there are multiple processors, L1/ L2 cache concurrent access
is a problem.

> Now, if the lockref_get_not_dead() succeeded, that means that we have
> a reference on the nd->path.dentry, and we can now just do
> "mntget(nd->path.mnt);". Ta-Daa! We now have everything done for the
> non-RCU case for terminate_walk().

How efficient is the mounting? Has data been corrupted while taking
the portable filesystem around?

> So after that point, we clear LOOKUP_RCU, and make the rule be that
> any return (error or success) has to do unlock_rcu_walk(). And then
> all the other refcounts are easy, we can just "dput(dentry);" after
> that.

Drag ropes.

> I haven't tested it yet, I was going to reboot into it just now. But
> I'm attaching the patch here. Maybe I missed some detail, but it all
> seems simpler.

Some amount of garbage collection is healthy every once in a while,
but the project requires more yak shavers than anything else at this
point.

> Note that this patch requires the "lockref_get_not_dead()" cleanup at
> the top of my current -git.

Why is the current gitster not up to date? We should discuss automated
version control.

2013-09-09 03:32:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 8, 2013 at 5:03 PM, Al Viro <[email protected]> wrote:
>
> There's one exception - basically, we decide to put duplicates of
> reference(s) we hold into (a bunch of) structures being created. If
> we decide that we'd failed and need to roll back, the structures
> need to be taken out of whatever lists, etc. they'd been already
> put on and references held in them - dropped. That removal gets done
> under a spinlock. Sure, we can string those structures on some kind
> of temp list, drop the spinlock and do dput() on everything in there,
> but it's much more convenient to just free them as we are evicting
> them, doing dput() as we go. Which is safe, since we are still have
> the references used to create these buggers pinned down.

Hmm. Which codepath does this? Because I got curious and added back
the __might_sleep() unconditionally to dput() just to see (now that I
think that the dput() bugs are gone), and at least under normal load
it doesn't trigger. I even wrote a thing that just constantly creates
and renames the target file concurrently with looking it up, so that
I've stress-tested the RCU sequence number failure path (and verified
with a profile that yes, it does trigger the "oops, need to retry"
case). I didn't test anything odd at all (just my dentry stress tests
and a regular graphical desktop), though.

And I have too much memory to sanely stress any out-of-memory situations.

#firstworldkerneldeveloperproblems

Linus

2013-09-09 04:07:21

by Ramkumar Ramachandra

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

Linus Torvalds wrote:
> On Sun, Sep 8, 2013 at 5:03 PM, Al Viro <[email protected]> wrote:
>>
>> There's one exception - basically, we decide to put duplicates of
>> reference(s) we hold into (a bunch of) structures being created. If
>> we decide that we'd failed and need to roll back, the structures
>> need to be taken out of whatever lists, etc. they'd been already
>> put on and references held in them - dropped. That removal gets done
>> under a spinlock. Sure, we can string those structures on some kind
>> of temp list, drop the spinlock and do dput() on everything in there,
>> but it's much more convenient to just free them as we are evicting
>> them, doing dput() as we go. Which is safe, since we are still have
>> the references used to create these buggers pinned down.

Dropping the spinlocks means more cores; unfortunately, a quad-core
seems to be the limit. Users must divide their time between reading
history and contributing to the present: some amount of persistent
data is a must on every user's machine. Pixel seems to be heading in
the wrong direction: that's what is stressing us out.

Ram

2013-09-09 05:44:24

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 08, 2013 at 08:32:03PM -0700, Linus Torvalds wrote:
> On Sun, Sep 8, 2013 at 5:03 PM, Al Viro <[email protected]> wrote:
> >
> > There's one exception - basically, we decide to put duplicates of
> > reference(s) we hold into (a bunch of) structures being created. If
> > we decide that we'd failed and need to roll back, the structures
> > need to be taken out of whatever lists, etc. they'd been already
> > put on and references held in them - dropped. That removal gets done
> > under a spinlock. Sure, we can string those structures on some kind
> > of temp list, drop the spinlock and do dput() on everything in there,
> > but it's much more convenient to just free them as we are evicting
> > them, doing dput() as we go. Which is safe, since we are still have
> > the references used to create these buggers pinned down.
>
> Hmm. Which codepath does this? Because I got curious and added back
> the __might_sleep() unconditionally to dput() just to see (now that I
> think that the dput() bugs are gone), and at least under normal load
> it doesn't trigger. I even wrote a thing that just constantly creates
> and renames the target file concurrently with looking it up, so that
> I've stress-tested the RCU sequence number failure path (and verified
> with a profile that yes, it does trigger the "oops, need to retry"
> case). I didn't test anything odd at all (just my dentry stress tests
> and a regular graphical desktop), though.

Not sure if we have anything of that sort in the current tree; I remember
that kind of stuff in shared subtree code (basically, if we decided that
operation should fail halfway through the process, we could just evict
all created vfsmounts from the lists and mntput them, spinlocks or no
spinlocks - they all had been copied from existing ones protected by
the system-wide serialization on namespace modifications, so resulting
dput() calls wouldn't have d_count on anything reach zero). But I'm not
sure if it had been about vfsmount_lock or namespace_sem (we really don't
want any IO under the latter, since one stuck fs can make _any_ umount
impossible afterwards) and all remnants of that got killed off by
reorganizations of locking in there - all pending dput()/mntput() calls are
delayed until namespace_unlock() now.

Anyway, that wouldn't break even now - I'm not saying that it's a good
pattern to use, but it's legitimate. If you are holding a reference
already, something like
        p = alloc_foo();
        spin_lock(&lock);
        ...
        p->dentry = dget(dentry);
        ...
        if (error) {
                ...
                free_foo(p);
                ...
                spin_unlock(&lock);
        }
with free_foo(p) including dput(p->dentry) is safe. The whole thing was just
a comment on your "who does that? Nobody" - that kind of stuff has uses and
it did happen at least once. And yes, it is safe *and* anybody writing
anything of that sort needs to look hard at whether it can be reorganized into
something less subtle...
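
Fleshed out slightly, the pattern looks like this. All names here are
hypothetical, and the dput() under the spinlock is safe only because the
caller's own reference keeps the count from reaching zero:

struct foo {
        struct list_head list;
        struct dentry *dentry;
};

static DEFINE_SPINLOCK(foo_lock);
static LIST_HEAD(foo_list);

static void free_foo(struct foo *p)
{
        dput(p->dentry);        /* drops a duplicate ref; caller still holds its own */
        kfree(p);
}

/* Caller holds a reference to @dentry for the whole call. */
static int add_foo(struct dentry *dentry, bool error)
{
        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

        if (!p)
                return -ENOMEM;

        spin_lock(&foo_lock);
        p->dentry = dget(dentry);
        list_add(&p->list, &foo_list);
        /* ... more setup under the lock that may discover a failure ... */
        if (error) {
                list_del(&p->list);
                free_foo(p);    /* dput() under foo_lock: safe, see above */
                spin_unlock(&foo_lock);
                return -EINVAL;
        }
        spin_unlock(&foo_lock);
        return 0;
}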

> And I have too much memory to sanely stress any out-of-memory situations.

KVM image with -m <size> or mem=<size> in kernel command line ;-)

2013-09-09 08:31:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless update of refcount

On Sun, Sep 01, 2013 at 01:13:06AM +0100, Al Viro wrote:

> +static noinline_for_stack
> +char *dentry_name(char *buf, char *end, const struct dentry *d, struct printf_spec spec,
> +                  int depth)
> +{
> +        int i, n = 0;
> +        const char *s;
> +        char *p = buf;
> +        const struct dentry *array[4];
> +        char c;
> +
> +        if (depth < 0) {
> +                depth = 1;
> +                WARN_ON(1);
> +        }
> +        if (depth > 4) {
> +                depth = 4;
> +                WARN_ON(1);
> +        }
> +
> +        rcu_read_lock();
> +        for (i = 0; i < depth; i++) {
> +                struct dentry *p = ACCESS_ONCE(d->d_parent);
> +                array[i] = d;
> +                if (d == p)
> +                        break;
> +                d = p;
> +        }
> +        if (!i) {        /* root dentry has a bloody inconvenient name */
> +                i++;
> +                goto do_name;
> +        }
> +        if (i == depth)
> +                goto do_name;
> +        while (i && n != spec.precision) {
> +                if (buf < end)
> +                        *buf = '/';
> +                buf++;
> +                n++;
> +do_name:
> +                s = ACCESS_ONCE(array[--i]->d_name.name);
> +                while (n != spec.precision && (c = *s++) != '\0') {
> +                        if (buf < end)
> +                                *buf = c;
> +                        buf++;
> +                        n++;
> +                }
> +        }
> +        rcu_read_unlock();

Should one or both of those ACCESS_ONCE()s be an rcu_dereference()?
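
For comparison, the parent walk written with rcu_dereference() would look
like the sketch below. This assumes ->d_parent were __rcu-annotated (it is
not in mainline, so sparse would complain and rcu_dereference_raw() would be
the closest clean equivalent); the generated code matches ACCESS_ONCE() on
everything except Alpha, the difference being the added lockdep/sparse
checking:

/* Hypothetical helper, for illustration only. */
static int depth_to_root(const struct dentry *dentry, int max_depth)
{
        int i;

        rcu_read_lock();
        for (i = 0; i < max_depth; i++) {
                const struct dentry *p = rcu_dereference(dentry->d_parent);

                if (dentry == p)        /* the root dentry points to itself */
                        break;
                dentry = p;
        }
        rcu_read_unlock();
        return i;
}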