2013-05-23 01:47:28

by Waiman Long

Subject: [PATCH 0/3 v3] dcache: make it more scalable on large system

Change log:

v2->v3
- Fix the RCU lock problem found by Al Viro.
- Rebase the code to the latest v3.10-rc1 linux mainline.
- Remove patch 4 which may be problematic if the dentry is deleted.
- Rerun performance measurement using 3.10-rc1 kernel.

v1->v2
- Include performance improvement in the AIM7 benchmark results because
of this patch.
- Modify dget_parent() to avoid taking the lock, if possible, to further
improve AIM7 benchmark results.

During some perf-record sessions of the kernel running the high_systime
workload of the AIM7 benchmark, it was found that quite a large portion
of the spinlock contention was due to the perf_event_mmap_event()
function itself. This perf kernel function calls d_path() which,
in turn, calls path_get() and dput() indirectly. These 3 functions
were the hottest functions shown in the perf-report output of
the _raw_spin_lock() function in an 8-socket system with 80 cores
(hyperthreading off) with a 3.10-rc1 kernel:

- 13.91% reaim [kernel.kallsyms] [k] _raw_spin_lock
- _raw_spin_lock
+ 35.54% path_get
+ 34.85% dput
+ 19.49% d_path

In fact, the output of the "perf record -s -a" (without call-graph)
showed:

13.37% reaim [kernel.kallsyms] [k] _raw_spin_lock
7.61% ls [kernel.kallsyms] [k] _raw_spin_lock
3.54% true [kernel.kallsyms] [k] _raw_spin_lock

Without the overhead of the perf monitoring itself, the actual
execution profile would be quite different. In fact, with this patch
set applied, the output of the same "perf record -s -a" command became:

2.82% reaim [kernel.kallsyms] [k] _raw_spin_lock
1.11% ls [kernel.kallsyms] [k] _raw_spin_lock
0.26% true [kernel.kallsyms] [k] _raw_spin_lock

So the time spent in the _raw_spin_lock() function went down from 24.52%
to 4.19%. It can be seen that the performance data collected by the
perf-record command can be heavily skewed in some cases on a system
with a large number of CPUs. This set of patches enables the perf
command to give a more accurate and reliable picture of what is really
happening in the system. At the same time, they can also improve the
general performance of systems, especially those with a large number
of CPUs.

The d_path() function takes the following two locks:

1. dentry->d_lock [spinlock] from dget()/dget_parent()/dput()
2. rename_lock [seqlock] from d_path()

This set of patches was designed to minimize the locking overhead
of these code paths.

The current kernel takes the dentry->d_lock lock whenever it wants to
increment or decrement the d_count reference count. However, nothing
significant really happens until the reference count goes all the way
down to 1 or 0. We therefore don't need to take the lock when the
reference count is greater than 1. Instead, an atomic cmpxchg() can
be used to increment or decrement the count in these situations. For
safety, the other reference count update operations have to be changed
to use atomic instructions as well.

The rename_lock is a sequence lock. The d_path() function takes the
writer lock because it needs to traverse different dentries through
pointers to get the full path name, and hence it can't tolerate changes
in those pointers. But taking the writer lock also prevents multiple
d_path() calls from proceeding concurrently.

A solution is to introduce a new lock type where there will be a
second type of reader which can block the writers - the sequence
read/write lock (seqrwlock). The d_path() and related functions will
then be changed to take the reader lock instead of the writer lock.
This will allow multiple d_path() operations to proceed concurrently.
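
The reader side will then look roughly like this (the function name is
just for illustration; the real conversion is in the third patch):

    /* Sketch: reverse name lookup under the new reader lock */
    static char *reverse_path(struct dentry *dentry, char *buf, int buflen)
    {
            char *retval;

            /*
             * Writers (d_move() and friends) are blocked, so following
             * the d_parent pointers is safe, while other readers may
             * run in parallel.
             */
            read_seqrwlock(&rename_lock);
            retval = __dentry_path(dentry, buf, buflen);
            read_seqrwunlock(&rename_lock);
            return retval;
    }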

Additional performance testing was conducted using the AIM7
benchmark. It is mainly the first patch that has an impact on the AIM7
benchmark. Please see the description of the first patch for more
information about the benchmark results.

Incidentally, these patches also have a favorable impact on Oracle
database performance when measured by the Oracle SLOB benchmark.

The following tests with multiple threads were also run on kernels
with and without the patch on an 8-socket 80-core system and a PC
with 4-core i5 processor:

1. find $HOME -size 0b
2. cat /proc/*/maps /proc/*/numa_maps
3. git diff

For both the find-size and cat-maps tests, the performance difference
with hot cache was within a few percentage points and hence within
the margin of error. Single-thread performance was slightly worse,
but multithread performance was generally a bit better. Apparently,
reference count updates aren't a significant factor in those tests. Their
perf traces indicate that there was less spinlock contention in
functions like dput(), but the function itself ran a little bit longer
on average.

The git-diff test showed no difference in overall performance; a
slight increase in system time was compensated by a slight decrease in
user time.

Of the 3 patches, patch 3 is dependent on patch 2. The first patch
is independent and can be applied individually.

Signed-off-by: Waiman Long <[email protected]>

Waiman Long (3):
dcache: Don't take unnecessary lock in d_count update
dcache: introduce a new sequence read/write lock type
dcache: change rename_lock to a sequence read/write lock

fs/autofs4/waitq.c | 6 +-
fs/ceph/mds_client.c | 4 +-
fs/cifs/dir.c | 4 +-
fs/dcache.c | 120 ++++++++++++++++++++--------------------
fs/namei.c | 2 +-
fs/nfs/namespace.c | 6 +-
include/linux/dcache.h | 105 +++++++++++++++++++++++++++++++++--
include/linux/seqrwlock.h | 137 +++++++++++++++++++++++++++++++++++++++++++++
kernel/auditsc.c | 4 +-
9 files changed, 310 insertions(+), 78 deletions(-)
create mode 100644 include/linux/seqrwlock.h



2013-05-29 21:01:03

by Andi Kleen

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Wed, May 29, 2013 at 04:43:23PM -0400, J. Bruce Fields wrote:
> On Wed, May 29, 2013 at 10:37:00PM +0200, Andi Kleen wrote:
> > > As Dave said before, is the last path component sufficient? Or how
> > > about an inode number?
> >
> > Neither works, the profiler needs to find the file and read it.
> >
> > inode searching would be incredible expensive,
>
> How fast does it need to be? Actually analyzing the profile is
> something you only do once after everything's over, right?

"perf top" does it live.

-Andi

--
[email protected] -- Speaking for myself only.

2013-05-29 22:48:39

by Jörn Engel

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Wed, 29 May 2013 22:37:00 +0200, Andi Kleen wrote:
>
> > As Dave said before, is the last path component sufficient? Or how
> > about an inode number?
>
> Neither works, the profiler needs to find the file and read it.

Ignoring all the complexity this would cause downstream, you could do
the path lookup just once, attach some cookie to it and return the
cookie ever-after. Maybe some combination of i_sb and i_ino would be
good enough as a cookie.
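
Something like this, just as a rough sketch with made-up names:

    /* Hypothetical cookie: superblock identity plus inode number */
    struct path_cookie {
            dev_t           dev;    /* from inode->i_sb->s_dev */
            unsigned long   ino;    /* from inode->i_ino */
    };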

Jörn

--
Data dominates. If you've chosen the right data structures and organized
things well, the algorithms will almost always be self-evident. Data
structures, not algorithms, are central to programming.
-- Rob Pike

2013-05-29 18:47:03

by J. Bruce Fields

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Wed, May 29, 2013 at 11:55:14AM -0400, Waiman Long wrote:
> On 05/26/2013 10:09 PM, Dave Chinner wrote:
> >On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote:
> >>On 05/23/2013 05:42 AM, Dave Chinner wrote:
> >>>
> >>>What was it I said about this patchset when you posted it to speed
> >>>up an Oracle benchmark back in february? I'll repeat:
> >>>
> >>>"Nobody should be doing reverse dentry-to-name lookups in a quantity
> >>>sufficient for it to become a performance limiting factor."
> >>Thank for the comment, but my point is that it is the d_lock
> >>contention is skewing the data about how much spin lock contention
> >>had actually happened in the workload and it makes it harder to
> >>pinpoint problem areas to look at. This is not about performance, it
> >>is about accurate representation of performance data. Ideally, we
> >>want the overhead of turning on perf instrumentation to be as low as
> >>possible.
> >Right. But d_path will never be "low overhead", and as such it
> >shouldn't be used by perf.
>
> The d_path() is called by perf_event_mmap_event() which translates
> VMA to its file path for memory segments backed by files. As perf is
> not just for sampling data within the kernel, it can also be used
> for checking access pattern in the user space. As a result, it needs
> to map VMAs back to the backing files to access their symbols
> information. If d_path() is not the right function to call for this
> purpose, what other alternatives do we have?

As Dave said before, is the last path component sufficient? Or how
about an inode number?

--b.

2013-05-23 01:38:37

by Waiman Long

Subject: [PATCH 1/3 v3] dcache: Don't take unnecessary lock in d_count update

The current code takes the dentry's d_lock lock whenever the d_count
reference count is being updated. In reality, nothing big really
happens until d_count goes to 0 in dput(). So it is not necessary to
take the lock if the reference count won't go to 0.

Without using a lock, multiple threads may update d_count
simultaneously. Therefore, atomic instructions must be used to
ensure consistency, except in shrink_dcache_for_umount*() where the
whole superblock is being unmounted and locking is not needed.

The worst case scenarios are:

1. d_lock taken in dput with d_count = 2 in one thread and another
thread comes in to atomically decrement d_count without taking
the lock. This may result in a d_count of 0 with no deleting
action taken.

2. d_lock taken in dput with d_count = 1 in one thread and another
thread comes in to atomically increment d_count without taking
the lock. This may result in the dentry being in the deleted state while
having a d_count of 1.

Without taking a lock, we need to make sure that the increment or
decrement is not performed while other threads are updating d_count
simultaneously. This can be done by using the atomic cmpxchg
instruction, which will fail if the underlying value has changed. If
the lock is taken, it is safe to use a simpler atomic increment or
decrement instruction.

To make sure that the above worst case scenarios will not happen,
the dget() function must take the lock if d_count <= 1. Similarly,
the dput() function must take the lock if d_count <= 2. The cmpxchg()
call to update d_count will be tried twice before falling back to
using the lock as there is a fairly good chance that the cmpxchg()
may fail in a busy situation.

Finally, the CPU must have a hardware cmpxchg instruction, or the
emulated cmpxchg() function may be too expensive to use. Therefore,
the above-mentioned changes will only be applied if the
__HAVE_ARCH_CMPXCHG flag is set. Most of the major architectures
supported by Linux have this flag set, with the notable exception
of ARM.

As for the performance of the updated reference counting code, it
all depends on whether the cmpxchg instruction is used or not. The
original code has 2 atomic instructions to lock and unlock the
spinlock. The new code path has either 1 atomic cmpxchg instruction
or 3 atomic instructions if the lock has to be taken. Depending on
how frequently the cmpxchg instruction is used (d_count > 1 or 2),
the new code can be faster or slower than the original one.

This patch has a particularly big impact on the short workload of the
AIM7 benchmark with a ramdisk filesystem. The table below shows the
improvement in JPM (jobs per minute) throughput due to this patch on
an 8-socket 80-core x86-64 system with a 3.10-rc1 kernel in 1/2/4/8
node configurations, using numactl to restrict the execution of the
workload to certain nodes.

+-----------------+----------------+-----------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | Rate w/o patch | Rate with patch | |
+-----------------+---------------------------------------------+
| | User Range 10 - 100 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1546813 | 5080126 | +228.4% |
| 4 nodes, HT off | 1663439 | 4703083 | +182.7% |
| 2 nodes, HT off | 2545389 | 3790995 | +48.9% |
| 1 node , HT off | 2374488 | 2394846 | +0.9% |
+-----------------+---------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1064184 | 7391382 | +594.6% |
| 4 nodes, HT off | 1362447 | 7325475 | +437.7% |
| 2 nodes, HT off | 1851198 | 4539511 | +145.2% |
| 1 node , HT off | 2477537 | 2457513 | -0.8% |
+-----------------+---------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1056243 | 7067282 | +569.1% |
| 4 nodes, HT off | 1354505 | 6930437 | +411.7% |
| 2 nodes, HT off | 1735881 | 4581847 | +164.0% |
| 1 node , HT off | 2511547 | 2496594 | -0.6% |
+-----------------+----------------+-----------------+----------+

It can be seen that with 20 CPUs (2 nodes) or more, this patch can
significantly improve the short workload performance. With only 1
node, the performance is similar with or without the patch. The short
workload also scales pretty well up to 4 nodes with this patch.

A perf call-graph report of the short workload at 1500 users without
the patch on 8 nodes indicates that about 79% of the workload's total
time was spent in the _raw_spin_lock() function, almost all of which
can be attributed to the following 2 kernel functions:
1. dget_parent (49.91%)
2. dput (49.82%)

With this patch applied, the time spent in _raw_spin_lock() is only
1.38%, which is a huge improvement. Of that 1.38%, only a small portion
is related to the dcache lock. However, the _raw_spin_lock_irqsave()
call on the tty_ldisc_lock now dominates, with 24.5% of the total time.

The impact of this patch on other AIM7 workloads was much more
modest. The table below shows the mean % change due to this patch on
the same 8-socket system with a 3.10-rc1 kernel.

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | -0.2% | -0.5% | -2.7% |
| five_sec | -1.0% | +2.6% | +2.4% |
| fserver | +1.3% | +1.9% | +1.0% |
| high_systime | +0.1% | +1.3% | +4.8% |
| new_fserver | -1.5% | +0.8% | +1.1% |
| shared | +0.5% | +0.4% | -0.1% |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 37 ++++++++---------
fs/namei.c | 2 +-
include/linux/dcache.h | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 117 insertions(+), 23 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f09b908..470b06f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -467,7 +467,7 @@ relock:
}

if (ref)
- dentry->d_count--;
+ dcount_dec(dentry);
/*
* inform the fs via d_prune that this dentry is about to be
* unhashed and destroyed.
@@ -515,10 +515,13 @@ void dput(struct dentry *dentry)
repeat:
if (dentry->d_count == 1)
might_sleep();
+ if (dcount_dec_cmpxchg(dentry))
+ return;
+
spin_lock(&dentry->d_lock);
BUG_ON(!dentry->d_count);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -535,7 +538,7 @@ repeat:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);

- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;

@@ -606,11 +609,13 @@ EXPORT_SYMBOL(d_invalidate);
/* This must be called with d_lock held */
static inline void __dget_dlock(struct dentry *dentry)
{
- dentry->d_count++;
+ dcount_inc(dentry);
}

static inline void __dget(struct dentry *dentry)
{
+ if (dcount_inc_cmpxchg(dentry))
+ return;
spin_lock(&dentry->d_lock);
__dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
@@ -620,22 +625,16 @@ struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;

-repeat:
- /*
- * Don't need rcu_dereference because we re-check it was correct under
- * the lock.
- */
rcu_read_lock();
- ret = dentry->d_parent;
- spin_lock(&ret->d_lock);
- if (unlikely(ret != dentry->d_parent)) {
- spin_unlock(&ret->d_lock);
+ ret = rcu_dereference(dentry->d_parent);
+ if (dcount_inc_cmpxchg(ret)) {
rcu_read_unlock();
- goto repeat;
+ return ret;
}
+ spin_lock(&ret->d_lock);
rcu_read_unlock();
BUG_ON(!ret->d_count);
- ret->d_count++;
+ dcount_inc(ret);
spin_unlock(&ret->d_lock);
return ret;
}
@@ -765,7 +764,7 @@ static void try_prune_one_dentry(struct dentry *dentry)
while (dentry) {
spin_lock(&dentry->d_lock);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -1970,7 +1969,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
goto next;
}

- dentry->d_count++;
+ dcount_inc(dentry);
found = dentry;
spin_unlock(&dentry->d_lock);
break;
@@ -2937,7 +2936,7 @@ resume:
}
if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
dentry->d_flags |= DCACHE_GENOCIDE;
- dentry->d_count--;
+ dcount_dec(dentry);
}
spin_unlock(&dentry->d_lock);
}
@@ -2945,7 +2944,7 @@ resume:
struct dentry *child = this_parent;
if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
this_parent->d_flags |= DCACHE_GENOCIDE;
- this_parent->d_count--;
+ dcount_dec(this_parent);
}
this_parent = try_to_ascend(this_parent, locked, seq);
if (!this_parent)
diff --git a/fs/namei.c b/fs/namei.c
index 85e40d1..03dcaed 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -537,7 +537,7 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry)
*/
BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
BUG_ON(!parent->d_count);
- parent->d_count++;
+ dcount_inc(parent);
spin_unlock(&dentry->d_lock);
}
spin_unlock(&parent->d_lock);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1a6bb81..99da5e2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -258,6 +258,99 @@ extern int have_submounts(struct dentry *);
extern void d_rehash(struct dentry *);

/**
+ * dcount_inc, dcount_dec - increment or decrement the dentry reference
+ * count
+ * @dentry: dentry to get a reference to
+ */
+static inline void dcount_inc(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_inc((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count++;
+#endif
+}
+
+static inline void dcount_dec(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_dec((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count--;
+#endif
+}
+
+/**
+ * dcount_inc_cmpxchg - increment dentry reference count using cmpxchg
+ * @dentry: dentry to get a reference to
+ * Return : 0 on failure, else 1
+ *
+ * N.B. For architectures that do not have a cmpxchg instruction, the
+ * substitute code may not perform better than taking the lock.
+ * So this cmpxchg code path is disabled for those cases.
+ */
+static inline int dcount_inc_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ if ((oldc = dentry->d_count) > 1) {
+ /*
+ * If reference count > 1, we can safely increment its
+ * value using atomic cmpxchg without locking. If the
+ * operation fails, fall back to using the lock.
+ */
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 1)) {
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
+ * dcount_dec_cmpxchg - decrement dentry reference count using cmpxchg
+ * @dentry: dentry to get a reference to
+ * Return : 0 on failure, else 1
+ */
+static inline int dcount_dec_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ /*
+ * If d_count is bigger than 2, we can safely decrement the
+ * reference count using atomic cmpxchg instruction without locking.
+ * Even if d_lock is taken by a thread running dput() with a parallel
+ * atomic_dec(), the reference count won't go to 0 without anyone
+ * noticing.
+ */
+ if ((oldc = dentry->d_count) > 2) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 2)) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
* d_add - add dentry to hash queues
* @entry: dentry to add
* @inode: The inode to attach to this dentry
@@ -319,7 +412,7 @@ static inline int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)
assert_spin_locked(&dentry->d_lock);
if (!read_seqcount_retry(&dentry->d_seq, seq)) {
ret = 1;
- dentry->d_count++;
+ dcount_inc(dentry);
}

return ret;
@@ -340,7 +433,6 @@ extern char *dentry_path_raw(struct dentry *, char *, int);
extern char *dentry_path(struct dentry *, char *, int);

/* Allocation counts.. */
-
/**
* dget, dget_dlock - get a reference to a dentry
* @dentry: dentry to get a reference to
@@ -352,13 +444,16 @@ extern char *dentry_path(struct dentry *, char *, int);
static inline struct dentry *dget_dlock(struct dentry *dentry)
{
if (dentry)
- dentry->d_count++;
+ dcount_inc(dentry);
return dentry;
}

static inline struct dentry *dget(struct dentry *dentry)
{
if (dentry) {
+ if (dcount_inc_cmpxchg(dentry))
+ return dentry;
+
spin_lock(&dentry->d_lock);
dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
--
1.7.1


2013-05-29 20:33:02

by Waiman Long

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On 05/29/2013 12:18 PM, Simo Sorce wrote:
> On Wed, 2013-05-29 at 11:55 -0400, Waiman Long wrote:
>
>> My patch set consists of 2 different changes. The first one is to avoid
>> taking the d_lock lock when updating the reference count in the
>> dentries. This particular change also benefit some other workloads that
>> are filesystem intensive. One particular example is the short workload
>> in the AIM7 benchmark. One of the job type in the short workload is
>> "misc_rtns_1" which calls security functions like getpwnam(),
>> getpwuid(), getgrgid() a couple of times. These functions open the
>> /etc/passwd or /etc/group files, read their content and close the files.
>> It is the intensive open/read/close sequence from multiple threads that
>> is causing 80%+ contention in the d_lock on a system with large number
>> of cores.
> To be honest a workload base on /etc/passwd or /etc/group is completely
> artificial, in actual usage, if you really have such access you use
> nscd or sssd with their shared memory caches to completely remove most
> of the file access.
> I have no beef on the rest but repeated access to Nsswitch information
> is not something you need to optimize at the file system layer and
> should not be brought up as a point in favor.

The misc_rtns_1 workload that I described here is just part of a larger
workload involving other activities. It represents just 1/17 of the
total jobs that were spawned. This particular job type, however,
dominates the time because of the lock contention that it created. I
agree that it is an artificial workload, as most benchmarks are. It is
certainly an exaggeration of what a real workload may be, but that
doesn't mean that similar contention will not happen in the real world,
especially when the trend is to pack more and more CPU cores into
the same machine.

Regards,
Longman

2013-05-30 16:40:47

by Jörn Engel

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Thu, 30 May 2013 11:48:50 -0400, Waiman Long wrote:
> On 05/29/2013 05:19 PM, Jörn Engel wrote:
> >On Wed, 29 May 2013 22:37:00 +0200, Andi Kleen wrote:
> >>>As Dave said before, is the last path component sufficient? Or how
> >>>about an inode number?
> >>Neither works, the profiler needs to find the file and read it.
> >Ignoring all the complexity this would cause downstream, you could do
> >the path lookup just once, attach some cookie to it and return the
> >cookie ever-after. Maybe some combination of i_sb and i_ino would be
> >good enough as a cookie.
>
> Still, it is just shifting the complexity from the d_path code to
> the perf kernel subsystem as it needs to keep track of what paths
> have been sent up before.

That sounds like a good thing to have. Every single linux user
depends on the dcache. Only a relatively small subset cares about
perf. Having dcache pay the cost for perf's special needs is a
classical externality.

> It also have complications in case the
> tracked files are being deleted or moved around in the filesystem.
> Some kind of notification mechanism has to be implemented in the
> dentry layer to notify the perf subsystem.

Agreed. The whole approach is based on getting the 99% case right and
occasionally being wrong. For perf this may be acceptable, not sure.

Jörn

--
Without a major sea change, nothing that is under copyright today will
ever come out from under it and fall into the public domain.
-- Jake Edge

2013-05-30 15:49:07

by Waiman Long

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On 05/29/2013 05:19 PM, Jörn Engel wrote:
> On Wed, 29 May 2013 22:37:00 +0200, Andi Kleen wrote:
>>> As Dave said before, is the last path component sufficient? Or how
>>> about an inode number?
>> Neither works, the profiler needs to find the file and read it.
> Ignoring all the complexity this would cause downstream, you could do
> the path lookup just once, attach some cookie to it and return the
> cookie ever-after. Maybe some combination of i_sb and i_ino would be
> good enough as a cookie.

Still, it is just shifting the complexity from the d_path code to the
perf kernel subsystem as it needs to keep track of what paths have been
sent up before. It also has complications in case the tracked files are
being deleted or moved around in the filesystem. Some kind of
notification mechanism has to be implemented in the dentry layer to
notify the perf subsystem.

Regards,
Longman

2013-05-29 16:56:05

by Andi Kleen

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Wed, May 29, 2013 at 12:18:09PM -0400, Simo Sorce wrote:
> On Wed, 2013-05-29 at 11:55 -0400, Waiman Long wrote:
>
> > My patch set consists of 2 different changes. The first one is to avoid
> > taking the d_lock lock when updating the reference count in the
> > dentries. This particular change also benefit some other workloads that
> > are filesystem intensive. One particular example is the short workload
> > in the AIM7 benchmark. One of the job type in the short workload is
> > "misc_rtns_1" which calls security functions like getpwnam(),
> > getpwuid(), getgrgid() a couple of times. These functions open the
> > /etc/passwd or /etc/group files, read their content and close the files.
> > It is the intensive open/read/close sequence from multiple threads that
> > is causing 80%+ contention in the d_lock on a system with large number
> > of cores.
>
> To be honest a workload base on /etc/passwd or /etc/group is completely
> artificial, in actual usage, if you really have such access you use
> nscd or sssd with their shared memory caches to completely remove most
> of the file access.

I don't fully agree at this point. A lot of things can be tuned away,
but in practice we want things to perform well out of the box without
needing all kinds of magic tuning that only

Also this is just normal file access, nothing special about it.
It simply has to scale. For all kinds of workloads.

And it does, just d_path messes it up.

> I have no beef on the rest but repeated access to Nsswitch information
> is not something you need to optimize at the file system layer and
> should not be brought up as a point in favor.

This is about repeated access to arbitrary files.

-Andi

2013-05-27 02:09:54

by Dave Chinner

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote:
> On 05/23/2013 05:42 AM, Dave Chinner wrote:
> >On Wed, May 22, 2013 at 09:37:25PM -0400, Waiman Long wrote:
> >>Change log:
> >>
> >>v2->v3
> >> - Fix the RCU lock problem found by Al Viro.
> >> - Rebase the code to the latest v3.10-rc1 linux mainline.
> >> - Remove patch 4 which may be problematic if the dentry is deleted.
> >> - Rerun performance measurement using 3.10-rc1 kernel.
> >>
> >>v1->v2
> >> - Include performance improvement in the AIM7 benchmark results because
> >> of this patch.
> >> - Modify dget_parent() to avoid taking the lock, if possible, to further
> >> improve AIM7 benchmark results.
> >>
> >>During some perf-record sessions of the kernel running the high_systime
> >>workload of the AIM7 benchmark, it was found that quite a large portion
> >>of the spinlock contention was due to the perf_event_mmap_event()
> >>function itself. This perf kernel function calls d_path() which,
> >>in turn, call path_get() and dput() indirectly. These 3 functions
> >>were the hottest functions shown in the perf-report output of
> >>the _raw_spin_lock() function in an 8-socket system with 80 cores
> >>(hyperthreading off) with a 3.10-rc1 kernel:
> >What was it I said about this patchset when you posted it to speed
> >up an Oracle benchmark back in february? I'll repeat:
> >
> >"Nobody should be doing reverse dentry-to-name lookups in a quantity
> >sufficient for it to become a performance limiting factor."
>
> Thank for the comment, but my point is that it is the d_lock
> contention is skewing the data about how much spin lock contention
> had actually happened in the workload and it makes it harder to
> pinpoint problem areas to look at. This is not about performance, it
> is about accurate representation of performance data. Ideally, we
> want the overhead of turning on perf instrumentation to be as low as
> possible.

Right. But d_path will never be "low overhead", and as such it
shouldn't be used by perf.

> >And that makes whatever that tracepoint is doing utterly stupid.
> >Dumping out full paths in a tracepoint is hardly "low overhead", and
> >that tracepoint should be stomped on from a great height. Sure,
> >print the filename straight out of the dentry into a tracepoint,
> >but repeated calculating the full path (probably for the same set of
> >dentries) is just a dumb thing to do.
> >
> >Anyway, your numbers show that a single AIM7 microbenchmark goes
> >better under tracing the specific mmap event that uses d_path(), but
> >the others are on average a little bit slower. That doesn't convince
> >me that this is worth doing. Indeed, what happens to performance
> >when you aren't tracing?
> >
> >Indeed, have you analysed what makes that
> >microbenchmark contend so much on the dentry lock while reverse
> >lookups are occuring? Dentry lock contention does not necessarily
> >indicate a problem with the dentry locks, and without knowing why
> >there is contention occuring (esp. compared to the other benchmarks)
> >we can't really determine if this is a good solution or not...
>
> What made it contend so much was the large number of CPUs available
> in my test system which is a 8-socket Westmere EX machines with 80
> cores. As perf was collecting data from every core, the threads will
> unavoidably bump into each other to translate dentries back to the
> full paths. The current code only allows one CPU in the d_path()
> critical path. My patch will allow all of them to be in the critical
> path concurrently.

Yes, I know *how* contention occurs and what your patch does. I'm
asking you to explain *why* we need to fix this specific workload,
and why the tracepoint that calls d_path() actually needs to do
that.

Your numbers indicate that your patchset decreases performance in the
common, non-d_path intensive workloads, so why should we add all
this complexity to optimise a slow path at the expense of
significant complexity and lower performance in the fast
path?

> The upcoming Ivy-Bridge EX can have up to 15 cores per socket. So
> even a 4-socket machine will have up to 60 cores or 120 virtual CPUs
> if hyperthreading is turned on.

I don't see why that matters. I've been dealing with systems much
larger than that since early 2002, and the process for dealing with
lock contention hasn't changed much in the last 10 years. First we
need to determine whether what you are doing is sane, whether there's
a simpler fix than changing locking, and
whether it's actually any faster in the common case we care
about....

> >IOWs, you need more than one microbenchmark that interacts with
> >some naive monitoring code to justify the complexity these locking
> >changes introduce....
> The first patch can also independently improve the performance of a
> number of AIM7 workloads including almost 7X improvement in the
> short workload. More detailed information of these types of
> performance benefit was discussed in the patch description of the
> first patch. I will try to collect more performance improvement data
> on other workloads too.
>
> Thank for the review.

I haven't reviewed anything. All I've made comments about you
needing to justifying the complexity of the change.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-05-29 20:43:47

by J. Bruce Fields

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Wed, May 29, 2013 at 10:37:00PM +0200, Andi Kleen wrote:
> > As Dave said before, is the last path component sufficient? Or how
> > about an inode number?
>
> Neither works, the profiler needs to find the file and read it.
>
> inode searching would be incredible expensive,

How fast does it need to be? Actually analyzing the profile is
something you only do once after everything's over, right?

> unless the file system provided a "open-by-inode" primitive

Well, I suppose there is open_by_handle_at.

--b.

> In normal operation you only have that event when starting up the
> program and when each shared library gets mapped in.
>
> -Andi

2013-05-23 01:47:28

by Waiman Long

Subject: [PATCH 3/3 v3] dcache: change rename_lock to a sequence read/write lock

The d_path() and related kernel functions currently take a writer
lock on rename_lock because they need to follow pointers. By changing
rename_lock to be the new sequence read/write lock, a reader lock
can be taken and multiple d_path() threads can proceed concurrently
without blocking each other.

It is unlikely that the frequency of filesystem changes and d_path()
name lookups will be high enough to cause writer starvation, so the
current limitation of the read/write lock should be acceptable here.

All the sites where rename_lock is referenced were modified to use the
sequence read/write lock declaration and access functions.

This patch will have merge conflicts when applied to kernel versions
earlier than 3.10.

Signed-off-by: Waiman Long <[email protected]>
---
fs/autofs4/waitq.c | 6 ++--
fs/ceph/mds_client.c | 4 +-
fs/cifs/dir.c | 4 +-
fs/dcache.c | 83 ++++++++++++++++++++++++-----------------------
fs/nfs/namespace.c | 6 ++--
include/linux/dcache.h | 4 +-
kernel/auditsc.c | 4 +-
7 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 3db70da..3afc4db 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -197,7 +197,7 @@ rename_retry:
buf = *name;
len = 0;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
spin_lock(&sbi->fs_lock);
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
@@ -206,7 +206,7 @@ rename_retry:
if (!len || --len > NAME_MAX) {
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return 0;
}
@@ -222,7 +222,7 @@ rename_retry:
}
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;

return len;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..b0c266f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1489,7 +1489,7 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base,

retry:
len = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = dentry; !IS_ROOT(temp);) {
struct inode *inode = temp->d_inode;
@@ -1539,7 +1539,7 @@ retry:
temp = temp->d_parent;
}
rcu_read_unlock();
- if (pos != 0 || read_seqretry(&rename_lock, seq)) {
+ if (pos != 0 || read_seqrwretry(&rename_lock, seq)) {
pr_err("build_path did not end path lookup where "
"expected, namelen is %d, pos is %d\n", len, pos);
/* presumably this is only possible if racing with a
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 5699b50..b672c02 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -96,7 +96,7 @@ build_path_from_dentry(struct dentry *direntry)
dfsplen = 0;
cifs_bp_rename_retry:
namelen = dfsplen;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = direntry; !IS_ROOT(temp);) {
namelen += (1 + temp->d_name.len);
@@ -136,7 +136,7 @@ cifs_bp_rename_retry:
}
}
rcu_read_unlock();
- if (namelen != dfsplen || read_seqretry(&rename_lock, seq)) {
+ if (namelen != dfsplen || read_seqrwretry(&rename_lock, seq)) {
cifs_dbg(FYI, "did not end path lookup where expected. namelen=%ddfsplen=%d\n",
namelen, dfsplen);
/* presumably this is only possible if racing with a rename
diff --git a/fs/dcache.c b/fs/dcache.c
index 470b06f..c96bdb1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -29,6 +29,7 @@
#include <asm/uaccess.h>
#include <linux/security.h>
#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
#include <linux/fs_struct.h>
@@ -82,7 +83,7 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
+__cacheline_aligned_in_smp DEFINE_SEQRWLOCK(rename_lock);

EXPORT_SYMBOL(rename_lock);

@@ -1009,7 +1010,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
*/
if (new != old->d_parent ||
(old->d_flags & DCACHE_DENTRY_KILLED) ||
- (!locked && read_seqretry(&rename_lock, seq))) {
+ (!locked && read_seqrwretry(&rename_lock, seq))) {
spin_unlock(&new->d_lock);
new = NULL;
}
@@ -1038,7 +1039,7 @@ int have_submounts(struct dentry *parent)
unsigned seq;
int locked = 0;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;

@@ -1081,23 +1082,23 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 0; /* No mount points found in tree */
positive:
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 1;

rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
EXPORT_SYMBOL(have_submounts);
@@ -1124,7 +1125,7 @@ static int select_parent(struct dentry *parent, struct list_head *dispose)
int found = 0;
int locked = 0;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;
spin_lock(&this_parent->d_lock);
@@ -1189,10 +1190,10 @@ resume:
}
out:
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return found;

rename_retry:
@@ -1201,7 +1202,7 @@ rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}

@@ -1816,7 +1817,7 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -1884,11 +1885,11 @@ struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
unsigned seq;

do {
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
dentry = __d_lookup(parent, name);
if (dentry)
break;
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));
return dentry;
}
EXPORT_SYMBOL(d_lookup);
@@ -1902,7 +1903,7 @@ EXPORT_SYMBOL(d_lookup);
* __d_lookup is like d_lookup, however it may (rarely) return a
* false-negative result due to unrelated rename activity.
*
- * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
+ * __d_lookup is slightly faster by avoiding rename_lock read seqrwlock,
* however it must be used carefully, eg. with a following d_lookup in
* the case of failure.
*
@@ -1934,7 +1935,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -2309,9 +2310,9 @@ static void __d_move(struct dentry * dentry, struct dentry * target)
*/
void d_move(struct dentry *dentry, struct dentry *target)
{
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
__d_move(dentry, target);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
}
EXPORT_SYMBOL(d_move);

@@ -2439,7 +2440,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
alias = __d_find_alias(inode, 0);
if (alias) {
actual = alias;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);

if (d_ancestor(alias, dentry)) {
/* Check for loops */
@@ -2449,7 +2450,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
/* Is this an anonymous mountpoint that we
* could splice into our tree? */
__d_materialise_dentry(dentry, alias);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
__d_drop(alias);
goto found;
} else {
@@ -2457,7 +2458,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
* aliasing. This drops inode->i_lock */
actual = __d_unalias(inode, dentry, alias);
}
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
if (IS_ERR(actual)) {
if (PTR_ERR(actual) == -ELOOP)
pr_warn_ratelimited(
@@ -2602,9 +2603,9 @@ char *__d_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2623,9 +2624,9 @@ char *d_absolute_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error > 1)
@@ -2691,9 +2692,9 @@ char *d_path(const struct path *path, char *buf, int buflen)

get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2761,9 +2762,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
char *retval;

- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);

return retval;
}
@@ -2774,7 +2775,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;

- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2782,7 +2783,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2821,7 +2822,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2829,7 +2830,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2850,7 +2851,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}

@@ -2887,7 +2888,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)

do {
/* for restarting inner loop in case of seq retry */
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
/*
* Need rcu_readlock to protect against the d_parent trashing
* due to d_move
@@ -2898,7 +2899,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
else
result = 0;
rcu_read_unlock();
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));

return result;
}
@@ -2910,7 +2911,7 @@ void d_genocide(struct dentry *root)
unsigned seq;
int locked = 0;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = root;
spin_lock(&this_parent->d_lock);
@@ -2953,17 +2954,17 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return;

rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}

diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index fc8dc20..0eca871 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -60,7 +60,7 @@ rename_retry:
*--end = '\0';
buflen--;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
while (1) {
spin_lock(&dentry->d_lock);
@@ -76,7 +76,7 @@ rename_retry:
spin_unlock(&dentry->d_lock);
dentry = dentry->d_parent;
}
- if (read_seqretry(&rename_lock, seq)) {
+ if (read_seqrwretry(&rename_lock, seq)) {
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
goto rename_retry;
@@ -117,7 +117,7 @@ rename_retry:
Elong_unlock:
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
Elong:
return ERR_PTR(-ENAMETOOLONG);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 99da5e2..5f05815 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -6,7 +6,7 @@
#include <linux/rculist.h>
#include <linux/rculist_bl.h>
#include <linux/spinlock.h>
-#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/cache.h>
#include <linux/rcupdate.h>

@@ -210,7 +210,7 @@ struct dentry_operations {

#define DCACHE_DENTRY_KILLED 0x100000

-extern seqlock_t rename_lock;
+extern seqrwlock_t rename_lock;

static inline int dname_external(struct dentry *dentry)
{
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 3c8a601..d464b67 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1591,7 +1591,7 @@ retry:
drop = NULL;
d = dentry;
rcu_read_lock();
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
for(;;) {
struct inode *inode = d->d_inode;
if (inode && unlikely(!hlist_empty(&inode->i_fsnotify_marks))) {
@@ -1609,7 +1609,7 @@ retry:
break;
d = parent;
}
- if (unlikely(read_seqretry(&rename_lock, seq) || drop)) { /* in this order */
+ if (unlikely(read_seqrwretry(&rename_lock, seq) || drop)) { /* in this order */
rcu_read_unlock();
if (!drop) {
/* just a race with rename */
--
1.7.1


2013-05-29 16:18:35

by Simo Sorce

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Wed, 2013-05-29 at 11:55 -0400, Waiman Long wrote:

> My patch set consists of 2 different changes. The first one is to avoid
> taking the d_lock lock when updating the reference count in the
> dentries. This particular change also benefit some other workloads that
> are filesystem intensive. One particular example is the short workload
> in the AIM7 benchmark. One of the job type in the short workload is
> "misc_rtns_1" which calls security functions like getpwnam(),
> getpwuid(), getgrgid() a couple of times. These functions open the
> /etc/passwd or /etc/group files, read their content and close the files.
> It is the intensive open/read/close sequence from multiple threads that
> is causing 80%+ contention in the d_lock on a system with large number
> of cores.

To be honest, a workload based on /etc/passwd or /etc/group is completely
artificial, in actual usage, if you really have such access you use
nscd or sssd with their shared memory caches to completely remove most
of the file access.
I have no beef on the rest but repeated access to Nsswitch information
is not something you need to optimize at the file system layer and
should not be brought up as a point in favor.

Simo.

--
Simo Sorce * Red Hat, Inc * New York


2013-05-23 01:38:40

by Waiman Long

Subject: [PATCH 2/3 v3] dcache: introduce a new sequence read/write lock type

The current sequence lock supports 2 types of lock users:

1. A reader who wants a consistent set of information and is willing
to retry if the information changes. The information that the
reader needs cannot contain pointers, because any writer could
invalidate a pointer that a reader was following. This reader
never blocks, but it may have to retry if a writer is in
progress.
2. A writer who may need to modify the content of a data structure. A
writer blocks only if another writer is in progress.

This type of lock is suitable for cases where there are a large number
of readers and far fewer writers. However, it has the limitation that a
reader who wants to follow pointers, or cannot tolerate unexpected
changes in the protected data structure, must take the writer lock
even if it doesn't need to make any changes.

To support this type of reader more efficiently, a new lock type is
introduced by this patch: the sequence read/write lock. Two types of readers
are supported by this new lock:

1. A reader who has the same behavior as a sequence lock reader.
2. A reader who may need to follow pointers. This reader will block if
a writer is in progress and, in turn, blocks a writer while it is in
progress. Multiple readers of this type can proceed concurrently.
Taking this reader lock won't update the sequence number.

This new lock type is a combination of the sequence lock and the
read/write lock. Hence it has the same limitation as a read/write lock:
writers may be starved if there is a lot of contention.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/seqrwlock.h | 137 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 137 insertions(+), 0 deletions(-)
create mode 100644 include/linux/seqrwlock.h

diff --git a/include/linux/seqrwlock.h b/include/linux/seqrwlock.h
new file mode 100644
index 0000000..c6145ff
--- /dev/null
+++ b/include/linux/seqrwlock.h
@@ -0,0 +1,137 @@
+#ifndef __LINUX_SEQRWLOCK_H
+#define __LINUX_SEQRWLOCK_H
+/*
+ * Sequence Read/Write Lock
+ * ------------------------
+ * This new lock type is a combination of the sequence lock and read/write
+ * lock. Three types of lock users are supported:
+ * 1. Reader who wants a consistent set of information and is willing to
+ * retry if the information changes. The information that the reader
+ * need cannot contain pointers, because any writer could invalidate
+ * a pointer that a reader was following. This reader never block but
+ * they may have to retry if a writer is in progress.
+ * 2. Reader who may need to follow pointers. This reader will block if
+ * a writer is in progress.
+ * 3. Writer who may need to modify content of a data structure. Writer
+ * blocks if another writer or the 2nd type of reader is in progress.
+ *
+ * The current implementation layered on top of the regular read/write
+ * lock. There is a chance that the writers may be starved by the readers.
+ * So be careful when you decided to use this lock.
+ *
+ * Expected 1st type reader usage:
+ * do {
+ * seq = read_seqrwbegin(&foo);
+ * ...
+ * } while (read_seqrwretry(&foo, seq));
+ *
+ * Expected 2nd type reader usage:
+ * read_seqrwlock(&foo)
+ * ...
+ * read_seqrwunlock(&foo)
+ *
+ * Expected writer usage:
+ * write_seqrwlock(&foo)
+ * ...
+ * write_seqrwunlock(&foo)
+ *
+ * Based on the seqlock.h file
+ * by Waiman Long
+ */
+
+#include <linux/rwlock.h>
+#include <linux/preempt.h>
+#include <asm/processor.h>
+
+typedef struct {
+ unsigned sequence;
+ rwlock_t lock;
+} seqrwlock_t;
+
+#define __SEQRWLOCK_UNLOCKED(lockname) \
+ { 0, __RW_LOCK_UNLOCKED(lockname) }
+
+#define seqrwlock_init(x) \
+ do { \
+ (x)->sequence = 0; \
+ rwlock_init(&(x)->lock); \
+ } while (0)
+
+#define DEFINE_SEQRWLOCK(x) \
+ seqrwlock_t x = __SEQRWLOCK_UNLOCKED(x)
+
+/* For writer:
+ * Lock out other writers and 2nd type of readers and update the sequence
+ * number. Don't need preempt_disable() because that is in the read_lock and
+ * write_lock already.
+ */
+static inline void write_seqrwlock(seqrwlock_t *sl)
+{
+ write_lock(&sl->lock);
+ ++sl->sequence;
+ smp_wmb();
+}
+
+static inline void write_seqrwunlock(seqrwlock_t *sl)
+{
+ smp_wmb();
+ sl->sequence++;
+ write_unlock(&sl->lock);
+}
+
+static inline int write_tryseqrwlock(seqrwlock_t *sl)
+{
+ int ret = write_trylock(&sl->lock);
+
+ if (ret) {
+ ++sl->sequence;
+ smp_wmb();
+ }
+ return ret;
+}
+
+/* For 2nd type of reader:
+ * Lock out other writers, but don't update the sequence number
+ */
+static inline void read_seqrwlock(seqrwlock_t *sl)
+{
+ read_lock(&sl->lock);
+}
+
+static inline void read_seqrwunlock(seqrwlock_t *sl)
+{
+ read_unlock(&sl->lock);
+}
+
+static inline int read_tryseqrwlock(seqrwlock_t *sl)
+{
+ return read_trylock(&sl->lock);
+}
+
+/* Start of read calculation -- fetch last complete writer token */
+static __always_inline unsigned read_seqrwbegin(const seqrwlock_t *sl)
+{
+ unsigned ret;
+
+repeat:
+ ret = ACCESS_ONCE(sl->sequence);
+ if (unlikely(ret & 1)) {
+ cpu_relax();
+ goto repeat;
+ }
+ smp_rmb();
+ return ret;
+}
+
+/*
+ * Test if reader processed invalid data.
+ *
+ * If sequence value changed then writer changed data while in section.
+ */
+static __always_inline int read_seqrwretry(const seqrwlock_t *sl, unsigned start)
+{
+ smp_rmb();
+ return unlikely(sl->sequence != start);
+}
+
+#endif /* __LINUX_SEQLOCK_H */
--
1.7.1


2013-05-29 15:55:30

by Waiman Long

Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On 05/26/2013 10:09 PM, Dave Chinner wrote:
> On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote:
>> On 05/23/2013 05:42 AM, Dave Chinner wrote:
>>>
>>> What was it I said about this patchset when you posted it to speed
>>> up an Oracle benchmark back in february? I'll repeat:
>>>
>>> "Nobody should be doing reverse dentry-to-name lookups in a quantity
>>> sufficient for it to become a performance limiting factor."
>> Thank for the comment, but my point is that it is the d_lock
>> contention is skewing the data about how much spin lock contention
>> had actually happened in the workload and it makes it harder to
>> pinpoint problem areas to look at. This is not about performance, it
>> is about accurate representation of performance data. Ideally, we
>> want the overhead of turning on perf instrumentation to be as low as
>> possible.
> Right. But d_path will never be "low overhead", and as such it
> shouldn't be used by perf.

d_path() is called by perf_event_mmap_event(), which translates a VMA
to its file path for memory segments backed by files. Perf is not just
for sampling data within the kernel; it can also be used for checking
access patterns in user space. As a result, it needs to map VMAs back
to their backing files to access their symbol information. If
d_path() is not the right function to call for this purpose, what other
alternatives do we have?

There may be ways to reduce calls to d_path() by doing some kind of
caching, but that will certainly increase the complexity of the perf code.
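
To illustrate the kind of caching I have in mind (this is just a
userspace toy sketch, not perf code; all the names in it are made up),
something like this could remember the last path computed for a file
object and reuse it until a rename bumps a generation counter:

/*
 * Toy userspace sketch of the caching idea above.  The names
 * path_cache_ent, rename_gen and compute_path are invented for
 * illustration only.
 */
#include <stdlib.h>
#include <string.h>

struct path_cache_ent {
	const void *key;	/* e.g. the vma->vm_file pointer */
	unsigned long gen;	/* "rename generation" when cached */
	char *path;
};

#define CACHE_SLOTS 256
static struct path_cache_ent cache[CACHE_SLOTS];
static unsigned long rename_gen;	/* would be bumped on every rename */

/* Stand-in for the expensive d_path()-style reverse lookup. */
static char *compute_path(const void *key)
{
	(void)key;
	return strdup("/lib/x86_64-linux-gnu/libc.so.6");
}

char *cached_path(const void *key)
{
	unsigned slot = ((unsigned long)key >> 4) % CACHE_SLOTS;
	struct path_cache_ent *e = &cache[slot];

	if (e->key == key && e->gen == rename_gen)
		return e->path;		/* hit: skip the expensive walk */

	free(e->path);
	e->path = compute_path(key);
	e->key = key;
	e->gen = rename_gen;
	return e->path;
}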

>>> And that makes whatever that tracepoint is doing utterly stupid.
>>> Dumping out full paths in a tracepoint is hardly "low overhead", and
>>> that tracepoint should be stomped on from a great height. Sure,
>>> print the filename straight out of the dentry into a tracepoint,
>>> but repeated calculating the full path (probably for the same set of
>>> dentries) is just a dumb thing to do.
>>>
>>> Anyway, your numbers show that a single AIM7 microbenchmark goes
>>> better under tracing the specific mmap event that uses d_path(), but
>>> the others are on average a little bit slower. That doesn't convince
>>> me that this is worth doing. Indeed, what happens to performance
>>> when you aren't tracing?
>>>
>>> Indeed, have you analysed what makes that
>>> microbenchmark contend so much on the dentry lock while reverse
>>> lookups are occuring? Dentry lock contention does not necessarily
>>> indicate a problem with the dentry locks, and without knowing why
>>> there is contention occuring (esp. compared to the other benchmarks)
>>> we can't really determine if this is a good solution or not...
>> What made it contend so much was the large number of CPUs available
>> in my test system which is a 8-socket Westmere EX machines with 80
>> cores. As perf was collecting data from every core, the threads will
>> unavoidably bump into each other to translate dentries back to the
>> full paths. The current code only allows one CPU in the d_path()
>> critical path. My patch will allow all of them to be in the critical
>> path concurrently.
> Yes, I know *how* contention occurs and what your patch does. I'm
> asking you to explain *why* we need to fix this specific workload,
> and why the tracepoint that calls d_path() actually needs to do
> that.

My patch set consists of 2 different changes. The first one is to avoid
taking the d_lock lock when updating the reference count in the
dentries. This particular change also benefits some other workloads that
are filesystem intensive. One particular example is the short workload
in the AIM7 benchmark. One of the job types in the short workload is
"misc_rtns_1", which calls security functions like getpwnam(),
getpwuid() and getgrgid() a couple of times. These functions open the
/etc/passwd or /etc/group files, read their content and close the files.
It is the intensive open/read/close sequence from multiple threads that
is causing 80%+ contention in the d_lock on a system with a large number
of cores. MIT's MOSBench paper also outlined dentry reference
counting as a scalability roadblock for its Apache and Exim tests. So
that change will certainly help workloads that are dcache reference
counting intensive.
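
For reference, a minimal userspace loop that generates the kind of
access pattern described above (assuming the plain "files" NSS backend
with no nscd/sssd in front, so each call ends up doing an
open/read/close of /etc/passwd or /etc/group); running one copy per
core makes the dget()/dput() traffic visible:

/* gcc -O2 -o nssloop nssloop.c ; run one instance per core */
#include <pwd.h>
#include <grp.h>
#include <unistd.h>

int main(void)
{
	int i;

	for (i = 0; i < 100000; i++) {
		/* With the "files" NSS backend each of these opens,
		 * reads and closes /etc/passwd or /etc/group. */
		getpwnam("root");
		getpwuid(getuid());
		getgrgid(getgid());
	}
	return 0;
}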

The second rename_lock change is mainly for reducing the d_path lock
contention for perf and some other applications that may need to access
/proc system files like maps or numa_maps frequently. If you think the
rename_lock change is not worth the effort just to benefit perf, I would
still like to see the first one go in, as it can certainly benefit other
workloads.

> Your numbers indicate that your patchset decreases peformance in the
> common, non-d_path intensive workloads, so why should we add all
> this complexity to optimise a slow path at the expense of
> significant complexity and lower performance in the fast
> path?

My numbers didn't actually indicate a performance regression in the other
common workloads. They just indicate that there wasn't a significant
change in performance; plus or minus a few percent here and there is
within the margin of error for the tests that I used.

>> The upcoming Ivy-Bridge EX can have up to 15 cores per socket. So
>> even a 4-socket machine will have up to 60 cores or 120 virtual CPUs
>> if hyperthreading is turned on.
> I don't see why that matters. I've been dealing with systems much
> larger than that since early 2002, and the process for dealing with
> lock contention hasn't changed much in the last 10 years. First we
> need to determine what you are doing is something that is sane,
> determining whether there's a simpler fix than changing locking, and
> whether it's actually any faster in the common case we care
> about....

I certainly agree with that. My testing indicated no change in
performance for the common case, but a significant speed-up in some
scenarios with a large number of CPUs.

Regards,
Longman

2013-05-29 20:37:26

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On 05/29/2013 12:56 PM, Andi Kleen wrote:
> On Wed, May 29, 2013 at 12:18:09PM -0400, Simo Sorce wrote:
>> To be honest a workload base on /etc/passwd or /etc/group is completely
>> artificial, in actual usage, if you really have such access you use
>> nscd or sssd with their shared memory caches to completely remove most
>> of the file access.
> I don't fully agree at this point. A lot of things can be tuned away,
> but in practice we want things to perform well out of the box without
> needing all kinds of magic tuning that only
>
> Also this is just normal file access, nothing special about it.
> It simply has to scale. For all kinds of workloads.
>
> And it does, just d_path messes it up.

Just for clarification, the AIM7 workloads are not affected by the
current d_path() code; they are speed-limited by the lock contention in
the dcache reference counting code. However, both the d_path() change
and the dentry reference counting change are needed to eliminate the
overhead introduced by the use of the perf-record command.

Regards,
Longman

2013-05-23 21:34:30

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On 05/23/2013 05:42 AM, Dave Chinner wrote:
> On Wed, May 22, 2013 at 09:37:25PM -0400, Waiman Long wrote:
>> Change log:
>>
>> v2->v3
>> - Fix the RCU lock problem found by Al Viro.
>> - Rebase the code to the latest v3.10-rc1 linux mainline.
>> - Remove patch 4 which may be problematic if the dentry is deleted.
>> - Rerun performance measurement using 3.10-rc1 kernel.
>>
>> v1->v2
>> - Include performance improvement in the AIM7 benchmark results because
>> of this patch.
>> - Modify dget_parent() to avoid taking the lock, if possible, to further
>> improve AIM7 benchmark results.
>>
>> During some perf-record sessions of the kernel running the high_systime
>> workload of the AIM7 benchmark, it was found that quite a large portion
>> of the spinlock contention was due to the perf_event_mmap_event()
>> function itself. This perf kernel function calls d_path() which,
>> in turn, call path_get() and dput() indirectly. These 3 functions
>> were the hottest functions shown in the perf-report output of
>> the _raw_spin_lock() function in an 8-socket system with 80 cores
>> (hyperthreading off) with a 3.10-rc1 kernel:
> What was it I said about this patchset when you posted it to speed
> up an Oracle benchmark back in february? I'll repeat:
>
> "Nobody should be doing reverse dentry-to-name lookups in a quantity
> sufficient for it to become a performance limiting factor."

Thanks for the comment, but my point is that the d_lock contention
is skewing the data about how much spinlock contention actually
happened in the workload, and it makes it harder to pinpoint problem
areas to look at. This is not about performance; it is about an accurate
representation of performance data. Ideally, we want the overhead of
turning on perf instrumentation to be as low as possible.


>
> And that makes whatever that tracepoint is doing utterly stupid.
> Dumping out full paths in a tracepoint is hardly "low overhead", and
> that tracepoint should be stomped on from a great height. Sure,
> print the filename straight out of the dentry into a tracepoint,
> but repeated calculating the full path (probably for the same set of
> dentries) is just a dumb thing to do.
>
> Anyway, your numbers show that a single AIM7 microbenchmark goes
> better under tracing the specific mmap event that uses d_path(), but
> the others are on average a little bit slower. That doesn't convince
> me that this is worth doing. Indeed, what happens to performance
> when you aren't tracing?
>
> Indeed, have you analysed what makes that
> microbenchmark contend so much on the dentry lock while reverse
> lookups are occuring? Dentry lock contention does not necessarily
> indicate a problem with the dentry locks, and without knowing why
> there is contention occuring (esp. compared to the other benchmarks)
> we can't really determine if this is a good solution or not...

What made it contend so much was the large number of CPUs available in
my test system, which is an 8-socket Westmere EX machine with 80 cores.
As perf was collecting data from every core, the threads unavoidably
bumped into each other translating dentries back to their full paths.
The current code only allows one CPU in the d_path() critical path. My
patch will allow all of them to be in the critical path concurrently.

The upcoming Ivy-Bridge EX can have up to 15 cores per socket. So even a
4-socket machine will have up to 60 cores or 120 virtual CPUs if
hyperthreading is turned on.
> IOWs, you need more than one microbenchmark that interacts with
> some naive monitoring code to justify the complexity these locking
> changes introduce....
The first patch can also independently improve the performance of a
number of AIM7 workloads, including an almost 7X improvement in the
short workload. More detailed information on these performance benefits
is given in the description of the first patch. I will try to collect
more performance data on other workloads too.

Thanks for the review.
Longman

2013-05-29 17:03:37

by Simo Sorce

[permalink] [raw]
Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Wed, 2013-05-29 at 18:56 +0200, Andi Kleen wrote:
> On Wed, May 29, 2013 at 12:18:09PM -0400, Simo Sorce wrote:
> > On Wed, 2013-05-29 at 11:55 -0400, Waiman Long wrote:
> >
> > > My patch set consists of 2 different changes. The first one is to avoid
> > > taking the d_lock lock when updating the reference count in the
> > > dentries. This particular change also benefit some other workloads that
> > > are filesystem intensive. One particular example is the short workload
> > > in the AIM7 benchmark. One of the job type in the short workload is
> > > "misc_rtns_1" which calls security functions like getpwnam(),
> > > getpwuid(), getgrgid() a couple of times. These functions open the
> > > /etc/passwd or /etc/group files, read their content and close the files.
> > > It is the intensive open/read/close sequence from multiple threads that
> > > is causing 80%+ contention in the d_lock on a system with large number
> > > of cores.
> >
> > To be honest a workload base on /etc/passwd or /etc/group is completely
> > artificial, in actual usage, if you really have such access you use
> > nscd or sssd with their shared memory caches to completely remove most
> > of the file access.
>
> I don't fully agree at this point. A lot of things can be tuned away,
> but in practice we want things to perform well out of the box without
> needing all kinds of magic tuning that only

Phrase seems cut mid-sentence?

> Also this is just normal file access, nothing special about it.
> It simply has to scale. For all kinds of workloads.
>
> And it does, just d_path messes it up.

Well, there are reasonable workloads and artificial ones; I am just
warning not to use 'that' specific test as a good indicator. If you have
other reasonable workloads that show a similar flow, feel free to bring
them up.

> > I have no beef on the rest but repeated access to Nsswitch information
> > is not something you need to optimize at the file system layer and
> > should not be brought up as a point in favor.
>
> This is about repeated access to arbitrary files.

Ok, I do not want to start a discussion on this; I just pointed out that
the specific point was not really a good example. Hopefully there are
others that justify the patchset.

Simo.

--
Simo Sorce * Red Hat, Inc * New York


2013-05-29 16:14:06

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

> The d_path() is called by perf_event_mmap_event() which translates
> VMA to its file path for memory segments backed by files. As perf is
> not just for sampling data within the kernel, it can also be used
> for checking access pattern in the user space. As a result, it needs
> to map VMAs back to the backing files to access their symbols
> information. If d_path() is not the right function to call for this
> purpose, what other alternatives do we have?

In principle it should only be called for new file mappings
getting mapped. Do you really have that many new file mappings all
the time? Or is this related to program startup?

>
> My patch set consists of 2 different changes. The first one is to
> avoid taking the d_lock lock when updating the reference count in
> the dentries. This particular change also benefit some other
> workloads that are filesystem intensive. One particular example is
> the short workload in the AIM7 benchmark. One of the job type in the
> short workload is "misc_rtns_1" which calls security functions like
> getpwnam(), getpwuid(), getgrgid() a couple of times. These
> functions open the /etc/passwd or /etc/group files, read their
> content and close the files. It is the intensive open/read/close
> sequence from multiple threads that is causing 80%+ contention in
> the d_lock on a system with large number of cores. The MIT's
> MOSBench paper also outlined dentry reference counting as a

The paper was before Nick Piggin's RCU (and our) work on this.
Modern kernels do not have dcache problems with mosbench, unless
you run weird security modules like SMACK that effectively
disable dcache RCU.

BTW lock elision may fix these problems anyways, in a much
simpler way.

-Andi

[email protected] -- Speaking for myself only.

2013-05-29 20:40:57

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On 05/29/2013 02:46 PM, J. Bruce Fields wrote:
> On Wed, May 29, 2013 at 11:55:14AM -0400, Waiman Long wrote:
>> On 05/26/2013 10:09 PM, Dave Chinner wrote:
>>> On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote:
>>>> On 05/23/2013 05:42 AM, Dave Chinner wrote:
>>>>> What was it I said about this patchset when you posted it to speed
>>>>> up an Oracle benchmark back in february? I'll repeat:
>>>>>
>>>>> "Nobody should be doing reverse dentry-to-name lookups in a quantity
>>>>> sufficient for it to become a performance limiting factor."
>>>> Thank for the comment, but my point is that it is the d_lock
>>>> contention is skewing the data about how much spin lock contention
>>>> had actually happened in the workload and it makes it harder to
>>>> pinpoint problem areas to look at. This is not about performance, it
>>>> is about accurate representation of performance data. Ideally, we
>>>> want the overhead of turning on perf instrumentation to be as low as
>>>> possible.
>>> Right. But d_path will never be "low overhead", and as such it
>>> shouldn't be used by perf.
>> The d_path() is called by perf_event_mmap_event() which translates
>> VMA to its file path for memory segments backed by files. As perf is
>> not just for sampling data within the kernel, it can also be used
>> for checking access pattern in the user space. As a result, it needs
>> to map VMAs back to the backing files to access their symbols
>> information. If d_path() is not the right function to call for this
>> purpose, what other alternatives do we have?
> As Dave said before, is the last path component sufficient? Or how
> about an inode number?

I don't think so. The user-space perf command will need the full
pathname to locate the binaries and read their debug information. Just
returning the last path component or the inode number will not cut it.

Regards,
Longman

2013-05-29 20:23:10

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On 05/29/2013 12:13 PM, Andi Kleen wrote:
>> The d_path() is called by perf_event_mmap_event() which translates
>> VMA to its file path for memory segments backed by files. As perf is
>> not just for sampling data within the kernel, it can also be used
>> for checking access pattern in the user space. As a result, it needs
>> to map VMAs back to the backing files to access their symbols
>> information. If d_path() is not the right function to call for this
>> purpose, what other alternatives do we have?
> In principle it should be only called for new file mappings
> getting maped. Do you really have that many new file mappings all
> the time? Or is this related to program startup?

The AIM7 benchmark that I used runs a large number of relatively short
jobs. I think each time a new job is spawned, the file mappings have to
be redone. It is probably not a big problem for long-running
processes.

>> My patch set consists of 2 different changes. The first one is to
>> avoid taking the d_lock lock when updating the reference count in
>> the dentries. This particular change also benefit some other
>> workloads that are filesystem intensive. One particular example is
>> the short workload in the AIM7 benchmark. One of the job type in the
>> short workload is "misc_rtns_1" which calls security functions like
>> getpwnam(), getpwuid(), getgrgid() a couple of times. These
>> functions open the /etc/passwd or /etc/group files, read their
>> content and close the files. It is the intensive open/read/close
>> sequence from multiple threads that is causing 80%+ contention in
>> the d_lock on a system with large number of cores. The MIT's
>> MOSBench paper also outlined dentry reference counting as a
> The paper was before Nick Piggin's RCU (and our) work on this.
> Modern kernels do not have dcache problems with mosbench, unless
> you run weird security modules like SMACK that effectively
> disable dcache RCU.

I have tried, but have not yet been able to run MOSBench myself. Thanks
for letting me know that the dcache problem wrt MOSBench was fixed.

> BTW lock elision may fix these problems anyways, in a much
> simpler way.

I certainly hope so. However, there will still be a lot of
computers out there running pre-Haswell Intel chips. For them, locking
is still a problem that needs to be solved.

Regards,
Longman

2013-05-23 09:47:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Wed, May 22, 2013 at 09:37:25PM -0400, Waiman Long wrote:
> Change log:
>
> v2->v3
> - Fix the RCU lock problem found by Al Viro.
> - Rebase the code to the latest v3.10-rc1 linux mainline.
> - Remove patch 4 which may be problematic if the dentry is deleted.
> - Rerun performance measurement using 3.10-rc1 kernel.
>
> v1->v2
> - Include performance improvement in the AIM7 benchmark results because
> of this patch.
> - Modify dget_parent() to avoid taking the lock, if possible, to further
> improve AIM7 benchmark results.
>
> During some perf-record sessions of the kernel running the high_systime
> workload of the AIM7 benchmark, it was found that quite a large portion
> of the spinlock contention was due to the perf_event_mmap_event()
> function itself. This perf kernel function calls d_path() which,
> in turn, call path_get() and dput() indirectly. These 3 functions
> were the hottest functions shown in the perf-report output of
> the _raw_spin_lock() function in an 8-socket system with 80 cores
> (hyperthreading off) with a 3.10-rc1 kernel:

<sigh>

What was it I said about this patchset when you posted it to speed
up an Oracle benchmark back in february? I'll repeat:

"Nobody should be doing reverse dentry-to-name lookups in a quantity
sufficient for it to become a performance limiting factor."

And that makes whatever that tracepoint is doing utterly stupid.
Dumping out full paths in a tracepoint is hardly "low overhead", and
that tracepoint should be stomped on from a great height. Sure,
print the filename straight out of the dentry into a tracepoint,
but repeated calculating the full path (probably for the same set of
dentries) is just a dumb thing to do.

Anyway, your numbers show that a single AIM7 microbenchmark goes
better under tracing the specific mmap event that uses d_path(), but
the others are on average a little bit slower. That doesn't convince
me that this is worth doing. Indeed, what happens to performance
when you aren't tracing?

Indeed, have you analysed what makes that
microbenchmark contend so much on the dentry lock while reverse
lookups are occurring? Dentry lock contention does not necessarily
indicate a problem with the dentry locks, and without knowing why
there is contention occurring (esp. compared to the other benchmarks)
we can't really determine if this is a good solution or not...

IOWs, you need more than one microbenchmark that interacts with
some naive monitoring code to justify the complexity these locking
changes introduce....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-05-29 20:37:03

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

> As Dave said before, is the last path component sufficient? Or how
> about an inode number?

Neither works; the profiler needs to find the file and read it.

Inode searching would be incredibly expensive, unless the file system
provided an "open-by-inode" primitive.

In normal operation you only have that event when starting up the
program and when each shared library gets mapped in.

-Andi

2013-05-23 01:37:27

by Waiman Long

[permalink] [raw]
Subject: [PATCH 2/3 v3] dcache: introduce a new sequence read/write lock type

The current sequence lock supports 2 types of lock users:

1. A reader who wants a consistent set of information and is willing
to retry if the information changes. The information that the
reader needs cannot contain pointers, because any writer could
invalidate a pointer that a reader was following. This reader
never blocks but may have to retry if a writer is in
progress.
2. A writer who may need to modify content of a data structure. Writer
blocks only if another writer is in progress.

This type of lock is suitable for cases where there are a large number
of readers and far fewer writers. However, it has the limitation that a
reader who may want to follow pointers or cannot tolerate unexpected
changes in the protected data structure must take the writer lock
even if it doesn't need to make any changes.

To more efficiently support this type of readers, a new lock type is
introduced by this patch: sequence read/write lock. Two types of readers
are supported by this new lock:

1. Reader who has the same behavior as a sequence lock reader.
2. Reader who may need to follow pointers. This reader will block if
a writer is in progress and, in turn, blocks writers while it holds
the lock. Multiple readers of this type can proceed concurrently.
Taking this reader lock won't update the sequence number.

This new lock type is a combination of the sequence lock and read/write
lock. Hence it has the same limitation as a read/write lock: writers
may be starved if there is a lot of read contention.
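
To make the three usage patterns concrete, here is a sketch (not part
of the patch) of how a hypothetical pointer-linked structure could use
the API added below:

#include <linux/seqrwlock.h>

struct foo {
	struct foo *next;
	unsigned long value;
};

static DEFINE_SEQRWLOCK(foo_lock);
static struct foo *foo_head;
static unsigned long foo_count;

/* Type 1 reader: copies plain data only, retries if a writer ran. */
static unsigned long foo_get_count(void)
{
	unsigned seq;
	unsigned long count;

	do {
		seq = read_seqrwbegin(&foo_lock);
		count = foo_count;
	} while (read_seqrwretry(&foo_lock, seq));
	return count;
}

/* Type 2 reader: follows pointers, so it blocks out writers. */
static void foo_walk(void (*fn)(struct foo *))
{
	struct foo *p;

	read_seqrwlock(&foo_lock);
	for (p = foo_head; p; p = p->next)
		fn(p);
	read_seqrwunlock(&foo_lock);
}

/* Writer: excludes other writers and type 2 readers, and bumps the
 * sequence count so that type 1 readers retry. */
static void foo_add(struct foo *entry)
{
	write_seqrwlock(&foo_lock);
	entry->next = foo_head;
	foo_head = entry;
	foo_count++;
	write_seqrwunlock(&foo_lock);
}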

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/seqrwlock.h | 137 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 137 insertions(+), 0 deletions(-)
create mode 100644 include/linux/seqrwlock.h

diff --git a/include/linux/seqrwlock.h b/include/linux/seqrwlock.h
new file mode 100644
index 0000000..c6145ff
--- /dev/null
+++ b/include/linux/seqrwlock.h
@@ -0,0 +1,137 @@
+#ifndef __LINUX_SEQRWLOCK_H
+#define __LINUX_SEQRWLOCK_H
+/*
+ * Sequence Read/Write Lock
+ * ------------------------
+ * This new lock type is a combination of the sequence lock and read/write
+ * lock. Three types of lock users are supported:
+ * 1. Reader who wants a consistent set of information and is willing to
+ * retry if the information changes. The information that the reader
+ * needs cannot contain pointers, because any writer could invalidate
+ * a pointer that a reader was following. This reader never blocks
+ * but may have to retry if a writer is in progress.
+ * 2. Reader who may need to follow pointers. This reader will block if
+ * a writer is in progress.
+ * 3. Writer who may need to modify content of a data structure. Writer
+ * blocks if another writer or the 2nd type of reader is in progress.
+ *
+ * The current implementation is layered on top of the regular read/write
+ * lock. There is a chance that the writers may be starved by the readers,
+ * so be careful when you decide to use this lock.
+ *
+ * Expected 1st type reader usage:
+ * do {
+ * seq = read_seqrwbegin(&foo);
+ * ...
+ * } while (read_seqrwretry(&foo, seq));
+ *
+ * Expected 2nd type reader usage:
+ * read_seqrwlock(&foo)
+ * ...
+ * read_seqrwunlock(&foo)
+ *
+ * Expected writer usage:
+ * write_seqrwlock(&foo)
+ * ...
+ * write_seqrwunlock(&foo)
+ *
+ * Based on the seqlock.h file
+ * by Waiman Long
+ */
+
+#include <linux/rwlock.h>
+#include <linux/preempt.h>
+#include <asm/processor.h>
+
+typedef struct {
+ unsigned sequence;
+ rwlock_t lock;
+} seqrwlock_t;
+
+#define __SEQRWLOCK_UNLOCKED(lockname) \
+ { 0, __RW_LOCK_UNLOCKED(lockname) }
+
+#define seqrwlock_init(x) \
+ do { \
+ (x)->sequence = 0; \
+ rwlock_init(&(x)->lock); \
+ } while (0)
+
+#define DEFINE_SEQRWLOCK(x) \
+ seqrwlock_t x = __SEQRWLOCK_UNLOCKED(x)
+
+/* For writer:
+ * Lock out other writers and 2nd type of readers and update the sequence
+ * number. Don't need preempt_disable() because that is in the read_lock and
+ * write_lock already.
+ */
+static inline void write_seqrwlock(seqrwlock_t *sl)
+{
+ write_lock(&sl->lock);
+ ++sl->sequence;
+ smp_wmb();
+}
+
+static inline void write_seqrwunlock(seqrwlock_t *sl)
+{
+ smp_wmb();
+ sl->sequence++;
+ write_unlock(&sl->lock);
+}
+
+static inline int write_tryseqrwlock(seqrwlock_t *sl)
+{
+ int ret = write_trylock(&sl->lock);
+
+ if (ret) {
+ ++sl->sequence;
+ smp_wmb();
+ }
+ return ret;
+}
+
+/* For 2nd type of reader:
+ * Lock out other writers, but don't update the sequence number
+ */
+static inline void read_seqrwlock(seqrwlock_t *sl)
+{
+ read_lock(&sl->lock);
+}
+
+static inline void read_seqrwunlock(seqrwlock_t *sl)
+{
+ read_unlock(&sl->lock);
+}
+
+static inline int read_tryseqrwlock(seqrwlock_t *sl)
+{
+ return read_trylock(&sl->lock);
+}
+
+/* Start of read calculation -- fetch last complete writer token */
+static __always_inline unsigned read_seqrwbegin(const seqrwlock_t *sl)
+{
+ unsigned ret;
+
+repeat:
+ ret = ACCESS_ONCE(sl->sequence);
+ if (unlikely(ret & 1)) {
+ cpu_relax();
+ goto repeat;
+ }
+ smp_rmb();
+ return ret;
+}
+
+/*
+ * Test if reader processed invalid data.
+ *
+ * If sequence value changed then writer changed data while in section.
+ */
+static __always_inline int read_seqrwretry(const seqrwlock_t *sl, unsigned start)
+{
+ smp_rmb();
+ return unlikely(sl->sequence != start);
+}
+
+#endif /* __LINUX_SEQRWLOCK_H */
--
1.7.1

2013-05-23 01:37:26

by Waiman Long

[permalink] [raw]
Subject: [PATCH 1/3 v3] dcache: Don't take unnecessary lock in d_count update

The current code takes the dentry's d_lock lock whenever the d_count
reference count is being updated. In reality, nothing big really
happens until d_count goes to 0 in dput(). So it is not necessary to
take the lock if the reference count won't go to 0.

Without using a lock, multiple threads may update d_count
simultaneously. Therefore, atomic instructions must be used to
ensure consistency except in shrink_dcache_for_umount*() where the
whole superblock is being dismounted and locking is not needed.

The worst case scenarios are:

1. d_lock taken in dput with d_count = 2 in one thread and another
thread comes in to atomically decrement d_count without taking
the lock. This may result in a d_count of 0 with no deleting
action taken.

2. d_lock taken in dput with d_count = 1 in one thread and another
thread comes in to atomically increment d_count without taking
the lock. This may result in the dentry in the deleted state while
having a d_count of 1.

Without taking a lock, we need to make sure that the decrement or
increment is not done while other threads are
updating d_count simultaneously. This can be done by using the
atomic cmpxchg instruction which will fail if the underlying value
is changed. If the lock is taken, it should be safe to use a simpler
atomic increment or decrement instruction.

To make sure that the above worst case scenarios will not happen,
the dget() function must take the lock if d_count <= 1. Similarly,
the dput() function must take the lock if d_count <= 2. The cmpxchg()
call to update d_count will be tried twice before falling back to
using the lock as there is a fairly good chance that the cmpxchg()
may fail in a busy situation.
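
In other words, the fast path is the usual opportunistic cmpxchg
pattern. A simplified sketch of what dget() becomes (the real helpers
in the diff below, dcount_inc_cmpxchg() and dcount_dec_cmpxchg(), also
retry the cmpxchg a second time before giving up):

/* Simplified sketch of the dget() fast path with this patch applied. */
static inline void dget_sketch(struct dentry *dentry)
{
	int old = dentry->d_count;

	/* d_count > 1: no 0/1 transition possible, bump it locklessly. */
	if (old > 1 && cmpxchg(&dentry->d_count, old, old + 1) == old)
		return;

	/* Otherwise fall back to the d_lock-protected slow path. */
	spin_lock(&dentry->d_lock);
	dcount_inc(dentry);	/* atomic_inc() when cmpxchg is available */
	spin_unlock(&dentry->d_lock);
}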

Finally, the CPU must have an instruction-level cmpxchg or the
emulated cmpxchg() function may be too expensive to
use. Therefore, the above-mentioned changes will only be applied if
the __HAVE_ARCH_CMPXCHG flag is set. Most of the major architectures
supported by Linux have this flag set, with the notable exception
of ARM.

As for the performance of the updated reference counting code, it
all depends on whether the cmpxchg instruction is used or not. The
original code has 2 atomic instructions to lock and unlock the
spinlock. The new code path has either 1 atomic cmpxchg instruction
or 3 atomic instructions if the lock has to be taken. Depending on
how frequently the cmpxchg instruction is used (d_count > 1 or 2),
the new code can be faster or slower than the original one.

This patch has a particularly big impact on the short workload of the
AIM7 benchmark with a ramdisk filesystem. The table below shows the
improvement in JPM (jobs per minute) throughput due
to this patch on an 8-socket 80-core x86-64 system with a 3.10-rc1
kernel in 1/2/4/8 node configurations, using numactl to restrict
the execution of the workload to certain nodes.

+-----------------+----------------+-----------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | Rate w/o patch | Rate with patch | |
+-----------------+---------------------------------------------+
| | User Range 10 - 100 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1546813 | 5080126 | +228.4% |
| 4 nodes, HT off | 1663439 | 4703083 | +182.7% |
| 2 nodes, HT off | 2545389 | 3790995 | +48.9% |
| 1 node , HT off | 2374488 | 2394846 | +0.9% |
+-----------------+---------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1064184 | 7391382 | +594.6% |
| 4 nodes, HT off | 1362447 | 7325475 | +437.7% |
| 2 nodes, HT off | 1851198 | 4539511 | +145.2% |
| 1 node , HT off | 2477537 | 2457513 | -0.8% |
+-----------------+---------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1056243 | 7067282 | +569.1% |
| 4 nodes, HT off | 1354505 | 6930437 | +411.7% |
| 2 nodes, HT off | 1735881 | 4581847 | +164.0% |
| 1 node , HT off | 2511547 | 2496594 | -0.6% |
+-----------------+----------------+-----------------+----------+

It can be seen that with 20 CPUs (2 nodes) or more, this patch can
significantly improve the short workload performance. With only 1
node, the performance is similar with or without the patch. The short
workload also scales pretty well up to 4 nodes with this patch.

A perf call-graph report of the short workload at 1500 users without
the patch on 8 nodes indicates that about 79% of the workload's total
time was spent in the _raw_spin_lock() function. Almost all of it
can be attributed to the following 2 kernel functions:
1. dget_parent (49.91%)
2. dput (49.82%)

With this patch on, the time spent on _raw_spin_lock() is only 1.38%,
which is a huge improvement. Of that 1.38%, only a small portion
is related to the dcache lock. However, the _raw_spin_lock_irqsave()
of the tty_ldisc_lock now dominates, with 24.5% of the total time.

The impact of this patch on other AIM7 workloads was much more
modest. The table below shows the mean % change due to this patch on
the same 8-socket system with a 3.10-rc1 kernel.

+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | -0.2% | -0.5% | -2.7% |
| five_sec | -1.0% | +2.6% | +2.4% |
| fserver | +1.3% | +1.9% | +1.0% |
| high_systime | +0.1% | +1.3% | +4.8% |
| new_fserver | -1.5% | +0.8% | +1.1% |
| shared | +0.5% | +0.4% | -0.1% |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 37 ++++++++---------
fs/namei.c | 2 +-
include/linux/dcache.h | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 117 insertions(+), 23 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f09b908..470b06f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -467,7 +467,7 @@ relock:
}

if (ref)
- dentry->d_count--;
+ dcount_dec(dentry);
/*
* inform the fs via d_prune that this dentry is about to be
* unhashed and destroyed.
@@ -515,10 +515,13 @@ void dput(struct dentry *dentry)
repeat:
if (dentry->d_count == 1)
might_sleep();
+ if (dcount_dec_cmpxchg(dentry))
+ return;
+
spin_lock(&dentry->d_lock);
BUG_ON(!dentry->d_count);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -535,7 +538,7 @@ repeat:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);

- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;

@@ -606,11 +609,13 @@ EXPORT_SYMBOL(d_invalidate);
/* This must be called with d_lock held */
static inline void __dget_dlock(struct dentry *dentry)
{
- dentry->d_count++;
+ dcount_inc(dentry);
}

static inline void __dget(struct dentry *dentry)
{
+ if (dcount_inc_cmpxchg(dentry))
+ return;
spin_lock(&dentry->d_lock);
__dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
@@ -620,22 +625,16 @@ struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;

-repeat:
- /*
- * Don't need rcu_dereference because we re-check it was correct under
- * the lock.
- */
rcu_read_lock();
- ret = dentry->d_parent;
- spin_lock(&ret->d_lock);
- if (unlikely(ret != dentry->d_parent)) {
- spin_unlock(&ret->d_lock);
+ ret = rcu_dereference(dentry->d_parent);
+ if (dcount_inc_cmpxchg(ret)) {
rcu_read_unlock();
- goto repeat;
+ return ret;
}
+ spin_lock(&ret->d_lock);
rcu_read_unlock();
BUG_ON(!ret->d_count);
- ret->d_count++;
+ dcount_inc(ret);
spin_unlock(&ret->d_lock);
return ret;
}
@@ -765,7 +764,7 @@ static void try_prune_one_dentry(struct dentry *dentry)
while (dentry) {
spin_lock(&dentry->d_lock);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -1970,7 +1969,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
goto next;
}

- dentry->d_count++;
+ dcount_inc(dentry);
found = dentry;
spin_unlock(&dentry->d_lock);
break;
@@ -2937,7 +2936,7 @@ resume:
}
if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
dentry->d_flags |= DCACHE_GENOCIDE;
- dentry->d_count--;
+ dcount_dec(dentry);
}
spin_unlock(&dentry->d_lock);
}
@@ -2945,7 +2944,7 @@ resume:
struct dentry *child = this_parent;
if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
this_parent->d_flags |= DCACHE_GENOCIDE;
- this_parent->d_count--;
+ dcount_dec(this_parent);
}
this_parent = try_to_ascend(this_parent, locked, seq);
if (!this_parent)
diff --git a/fs/namei.c b/fs/namei.c
index 85e40d1..03dcaed 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -537,7 +537,7 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry)
*/
BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
BUG_ON(!parent->d_count);
- parent->d_count++;
+ dcount_inc(parent);
spin_unlock(&dentry->d_lock);
}
spin_unlock(&parent->d_lock);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1a6bb81..99da5e2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -258,6 +258,99 @@ extern int have_submounts(struct dentry *);
extern void d_rehash(struct dentry *);

/**
+ * dcount_inc, dcount_dec - increment or decrement the dentry reference
+ * count
+ * @dentry: dentry whose d_count is to be updated
+ */
+static inline void dcount_inc(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_inc((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count++;
+#endif
+}
+
+static inline void dcount_dec(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_dec((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count--;
+#endif
+}
+
+/**
+ * dcount_inc_cmpxchg - increment dentry reference count using cmpxchg
+ * @dentry: dentry to get a reference to
+ * Return : 0 on failure, else 1
+ *
+ * N.B. For architectures that do not have a cmpxchg instruction, the
+ * substitute code may not perform better than taking the lock.
+ * So this cmpxchg code path is disabled for those cases.
+ */
+static inline int dcount_inc_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ if ((oldc = dentry->d_count) > 1) {
+ /*
+ * If reference count > 1, we can safely increment its
+ * value using atomic cmpxchg without locking. If the
+ * operation fails, fall back to using the lock.
+ */
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 1)) {
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
+ * dcount_dec_cmpxchg - decrement dentry reference count using cmpxchg
+ * @dentry: dentry to drop a reference from
+ * Return : 0 on failure, else 1
+ */
+static inline int dcount_dec_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ /*
+ * If d_count is bigger than 2, we can safely decrement the
+ * reference count using atomic cmpxchg instruction without locking.
+ * Even if d_lock is taken by a thread running dput() with a parallel
+ * atomic_dec(), the reference count won't go to 0 without anyone
+ * noticing.
+ */
+ if ((oldc = dentry->d_count) > 2) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 2)) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
* d_add - add dentry to hash queues
* @entry: dentry to add
* @inode: The inode to attach to this dentry
@@ -319,7 +412,7 @@ static inline int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)
assert_spin_locked(&dentry->d_lock);
if (!read_seqcount_retry(&dentry->d_seq, seq)) {
ret = 1;
- dentry->d_count++;
+ dcount_inc(dentry);
}

return ret;
@@ -340,7 +433,6 @@ extern char *dentry_path_raw(struct dentry *, char *, int);
extern char *dentry_path(struct dentry *, char *, int);

/* Allocation counts.. */
-
/**
* dget, dget_dlock - get a reference to a dentry
* @dentry: dentry to get a reference to
@@ -352,13 +444,16 @@ extern char *dentry_path(struct dentry *, char *, int);
static inline struct dentry *dget_dlock(struct dentry *dentry)
{
if (dentry)
- dentry->d_count++;
+ dcount_inc(dentry);
return dentry;
}

static inline struct dentry *dget(struct dentry *dentry)
{
if (dentry) {
+ if (dcount_inc_cmpxchg(dentry))
+ return dentry;
+
spin_lock(&dentry->d_lock);
dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
--
1.7.1

2013-05-23 01:37:28

by Waiman Long

[permalink] [raw]
Subject: [PATCH 3/3 v3] dcache: change rename_lock to a sequence read/write lock

The d_path() and related kernel functions currently take a writer
lock on rename_lock because they need to follow pointers. By changing
rename_lock to be the new sequence read/write lock, a reader lock
can be taken and multiple d_path() threads can proceed concurrently
without blocking each other.

It is unlikely that the frequency of filesystem changes and d_path()
name lookups will be high enough to cause writer starvation, so the
current limitation of the read/write lock should be acceptable.

All the sites where rename_lock is referenced were modified to use the
sequence read/write lock declaration and access functions.

This patch will have merge conflicts when applied to kernel versions
earlier than 3.10.
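
In sketch form, the change turns the name-generation paths into shared
readers while renames stay exclusive writers (the actual hunks follow):

/* d_path() and friends: now a shared (type 2) reader, so multiple
 * path lookups can run concurrently. */
	read_seqrwlock(&rename_lock);
	error = prepend_path(path, root, &res, &buflen);
	read_seqrwunlock(&rename_lock);

/* d_move(): still an exclusive writer, and it still bumps the sequence
 * count so that lockless readers like d_lookup() retry. */
	write_seqrwlock(&rename_lock);
	__d_move(dentry, target);
	write_seqrwunlock(&rename_lock);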

Signed-off-by: Waiman Long <[email protected]>
---
fs/autofs4/waitq.c | 6 ++--
fs/ceph/mds_client.c | 4 +-
fs/cifs/dir.c | 4 +-
fs/dcache.c | 83 ++++++++++++++++++++++++-----------------------
fs/nfs/namespace.c | 6 ++--
include/linux/dcache.h | 4 +-
kernel/auditsc.c | 4 +-
7 files changed, 56 insertions(+), 55 deletions(-)

diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 3db70da..3afc4db 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -197,7 +197,7 @@ rename_retry:
buf = *name;
len = 0;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
spin_lock(&sbi->fs_lock);
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
@@ -206,7 +206,7 @@ rename_retry:
if (!len || --len > NAME_MAX) {
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return 0;
}
@@ -222,7 +222,7 @@ rename_retry:
}
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;

return len;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..b0c266f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1489,7 +1489,7 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base,

retry:
len = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = dentry; !IS_ROOT(temp);) {
struct inode *inode = temp->d_inode;
@@ -1539,7 +1539,7 @@ retry:
temp = temp->d_parent;
}
rcu_read_unlock();
- if (pos != 0 || read_seqretry(&rename_lock, seq)) {
+ if (pos != 0 || read_seqrwretry(&rename_lock, seq)) {
pr_err("build_path did not end path lookup where "
"expected, namelen is %d, pos is %d\n", len, pos);
/* presumably this is only possible if racing with a
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 5699b50..b672c02 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -96,7 +96,7 @@ build_path_from_dentry(struct dentry *direntry)
dfsplen = 0;
cifs_bp_rename_retry:
namelen = dfsplen;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = direntry; !IS_ROOT(temp);) {
namelen += (1 + temp->d_name.len);
@@ -136,7 +136,7 @@ cifs_bp_rename_retry:
}
}
rcu_read_unlock();
- if (namelen != dfsplen || read_seqretry(&rename_lock, seq)) {
+ if (namelen != dfsplen || read_seqrwretry(&rename_lock, seq)) {
cifs_dbg(FYI, "did not end path lookup where expected. namelen=%ddfsplen=%d\n",
namelen, dfsplen);
/* presumably this is only possible if racing with a rename
diff --git a/fs/dcache.c b/fs/dcache.c
index 470b06f..c96bdb1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -29,6 +29,7 @@
#include <asm/uaccess.h>
#include <linux/security.h>
#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
#include <linux/fs_struct.h>
@@ -82,7 +83,7 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
+__cacheline_aligned_in_smp DEFINE_SEQRWLOCK(rename_lock);

EXPORT_SYMBOL(rename_lock);

@@ -1009,7 +1010,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
*/
if (new != old->d_parent ||
(old->d_flags & DCACHE_DENTRY_KILLED) ||
- (!locked && read_seqretry(&rename_lock, seq))) {
+ (!locked && read_seqrwretry(&rename_lock, seq))) {
spin_unlock(&new->d_lock);
new = NULL;
}
@@ -1038,7 +1039,7 @@ int have_submounts(struct dentry *parent)
unsigned seq;
int locked = 0;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;

@@ -1081,23 +1082,23 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 0; /* No mount points found in tree */
positive:
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 1;

rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
EXPORT_SYMBOL(have_submounts);
@@ -1124,7 +1125,7 @@ static int select_parent(struct dentry *parent, struct list_head *dispose)
int found = 0;
int locked = 0;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;
spin_lock(&this_parent->d_lock);
@@ -1189,10 +1190,10 @@ resume:
}
out:
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return found;

rename_retry:
@@ -1201,7 +1202,7 @@ rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}

@@ -1816,7 +1817,7 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -1884,11 +1885,11 @@ struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
unsigned seq;

do {
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
dentry = __d_lookup(parent, name);
if (dentry)
break;
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));
return dentry;
}
EXPORT_SYMBOL(d_lookup);
@@ -1902,7 +1903,7 @@ EXPORT_SYMBOL(d_lookup);
* __d_lookup is like d_lookup, however it may (rarely) return a
* false-negative result due to unrelated rename activity.
*
- * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
+ * __d_lookup is slightly faster by avoiding rename_lock read seqrwlock,
* however it must be used carefully, eg. with a following d_lookup in
* the case of failure.
*
@@ -1934,7 +1935,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -2309,9 +2310,9 @@ static void __d_move(struct dentry * dentry, struct dentry * target)
*/
void d_move(struct dentry *dentry, struct dentry *target)
{
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
__d_move(dentry, target);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
}
EXPORT_SYMBOL(d_move);

@@ -2439,7 +2440,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
alias = __d_find_alias(inode, 0);
if (alias) {
actual = alias;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);

if (d_ancestor(alias, dentry)) {
/* Check for loops */
@@ -2449,7 +2450,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
/* Is this an anonymous mountpoint that we
* could splice into our tree? */
__d_materialise_dentry(dentry, alias);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
__d_drop(alias);
goto found;
} else {
@@ -2457,7 +2458,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
* aliasing. This drops inode->i_lock */
actual = __d_unalias(inode, dentry, alias);
}
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
if (IS_ERR(actual)) {
if (PTR_ERR(actual) == -ELOOP)
pr_warn_ratelimited(
@@ -2602,9 +2603,9 @@ char *__d_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2623,9 +2624,9 @@ char *d_absolute_path(const struct path *path,

prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error > 1)
@@ -2691,9 +2692,9 @@ char *d_path(const struct path *path, char *buf, int buflen)

get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2761,9 +2762,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
char *retval;

- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);

return retval;
}
@@ -2774,7 +2775,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;

- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2782,7 +2783,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2821,7 +2822,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2829,7 +2830,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)

prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);

if (error < 0)
@@ -2850,7 +2851,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}

@@ -2887,7 +2888,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)

do {
/* for restarting inner loop in case of seq retry */
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
/*
* Need rcu_readlock to protect against the d_parent trashing
* due to d_move
@@ -2898,7 +2899,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
else
result = 0;
rcu_read_unlock();
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));

return result;
}
@@ -2910,7 +2911,7 @@ void d_genocide(struct dentry *root)
unsigned seq;
int locked = 0;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = root;
spin_lock(&this_parent->d_lock);
@@ -2953,17 +2954,17 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return;

rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}

diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index fc8dc20..0eca871 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -60,7 +60,7 @@ rename_retry:
*--end = '\0';
buflen--;

- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
while (1) {
spin_lock(&dentry->d_lock);
@@ -76,7 +76,7 @@ rename_retry:
spin_unlock(&dentry->d_lock);
dentry = dentry->d_parent;
}
- if (read_seqretry(&rename_lock, seq)) {
+ if (read_seqrwretry(&rename_lock, seq)) {
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
goto rename_retry;
@@ -117,7 +117,7 @@ rename_retry:
Elong_unlock:
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
Elong:
return ERR_PTR(-ENAMETOOLONG);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 99da5e2..5f05815 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -6,7 +6,7 @@
#include <linux/rculist.h>
#include <linux/rculist_bl.h>
#include <linux/spinlock.h>
-#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/cache.h>
#include <linux/rcupdate.h>

@@ -210,7 +210,7 @@ struct dentry_operations {

#define DCACHE_DENTRY_KILLED 0x100000

-extern seqlock_t rename_lock;
+extern seqrwlock_t rename_lock;

static inline int dname_external(struct dentry *dentry)
{
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 3c8a601..d464b67 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1591,7 +1591,7 @@ retry:
drop = NULL;
d = dentry;
rcu_read_lock();
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
for(;;) {
struct inode *inode = d->d_inode;
if (inode && unlikely(!hlist_empty(&inode->i_fsnotify_marks))) {
@@ -1609,7 +1609,7 @@ retry:
break;
d = parent;
}
- if (unlikely(read_seqretry(&rename_lock, seq) || drop)) { /* in this order */
+ if (unlikely(read_seqrwretry(&rename_lock, seq) || drop)) { /* in this order */
rcu_read_unlock();
if (!drop) {
/* just a race with rename */
--
1.7.1

2013-06-06 03:48:56

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/3 v3] dcache: make it more scalable on large system

On Wed, May 29, 2013 at 10:37:00PM +0200, Andi Kleen wrote:
> > As Dave said before, is the last path component sufficient? Or how
> > about an inode number?
>
> Neither works, the profiler needs to find the file and read it.
>
> inode searching would be incredible expensive, unless the file system
> provided a "open-by-inode" primitive

That's effectively what fs/fhandle.c gives you.
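
For reference, fs/fhandle.c is exposed to userspace as the
name_to_handle_at()/open_by_handle_at() syscalls; a minimal sketch of
that interface (the reopen needs CAP_DAC_READ_SEARCH):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	struct file_handle *fh;
	int mount_id, mount_fd, fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <mount-point>\n", argv[0]);
		return 1;
	}

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	fh->handle_bytes = MAX_HANDLE_SZ;

	/* Encode a persistent handle for the file... */
	if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
		perror("name_to_handle_at");
		return 1;
	}

	/* ...and reopen it later without a path walk. */
	mount_fd = open(argv[2], O_RDONLY | O_DIRECTORY);
	fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
	if (fd < 0) {
		perror("open_by_handle_at");
		return 1;
	}
	return 0;
}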

Cheers,

Dave.
--
Dave Chinner
[email protected]