Change History
v3->v4:
- Extract the begin and end blocks of the rename_lock sequence number
check into helper functions to make the code easier to read.
v2->v3:
- Simplify prepend_name() to blindly copy the dname over until it
reaches a null byte or the specified length, leaving the sequence
check to handle the error cases.
v1->v2:
- Check for internal vs external dname, taking d_lock only for
external dname for safety.
- Replace memchr() with a byte-by-byte checking for-loop.
- Try lockless dentry to pathname conversion 3 times before falling
back to taking the rename_lock to prevent live-lock.
- Apply the code refactoring suggested by George Spelvin.
Waiman Long (1):
dcache: Translating dentry into pathname without taking rename_lock
fs/dcache.c | 196 ++++++++++++++++++++++++++++++++++++++++-------------------
1 files changed, 133 insertions(+), 63 deletions(-)
When running AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, there was still some left:
8.46% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--42.21%-- d_path
| proc_pid_readlink
| SyS_readlinkat
| SyS_readlink
| system_call
| __GI___readlink
|
|--40.97%-- sys_getcwd
| system_call
| __getcwd
The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch eliminates the need to take
the rename_lock while translating dentries into full pathnames.
The rename_lock is taken to make sure that no rename operation can be
ongoing while the translation is in progress. However, only one thread
at a time can hold the rename_lock, blocking all the other threads that
need it even though the translation process makes no change to the
dentries.
This patch replaces the writer's write_seqlock/write_sequnlock
sequence of the rename_lock in the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these two functions. As a result, the code has to
retry if one or more rename operations have been performed in the
meantime. In addition, the RCU read lock is taken during the translation
process to make sure that no dentries go away. To prevent live-lock,
the code switches back to taking the rename_lock if read_seqretry()
fails three times.
To further reduce spinlock contention, this patch does not take the
dentry's d_lock when copying the filename from the dentries. Instead,
it treats the name pointer and length as unreliable and just copies
the string byte-by-byte until it hits a null byte or the end of the
string as specified by the length. This avoids stepping into an
invalid memory address. The error cases are left to be handled by
the sequence number check.
The following code refactorings are also made:
1. Move prepend('/') into prepend_name() to remove one conditional
check.
2. Move the global root check in prepend_path() back to the top of
the while loop.
With this patch, _raw_spin_lock accounts for only 1.2% of the total
CPU cycles for the short workload. The patch also reduces the
distortion that running perf introduces into its own profile, since
the perf command itself can be a heavy user of the d_path() function
depending on the complexity of the workload.
When taking the perf profile of the high-systime workload, the
spinlock contention contributed by running perf without this patch
was about 16%. With this patch, that contention goes away and we get
a more accurate perf profile.
Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 196 ++++++++++++++++++++++++++++++++++++++++-------------------
1 files changed, 133 insertions(+), 63 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index ca8e9cd..8186ff9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -88,6 +88,44 @@ EXPORT_SYMBOL(rename_lock);
static struct kmem_cache *dentry_cache __read_mostly;
+/**
+ * read_seqbegin_or_lock - begin a sequence number check or locking block
+ * lock: sequence lock
+ * seq : sequence number to be checked
+ *
+ * First try it once optimistically without taking the lock. If that fails,
+ * take the lock. The sequence number is also used as a marker for deciding
+ * whether to be a reader (even) or writer (odd).
+ * N.B. seq must be initialized to an even number to begin with.
+ */
+static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
+{
+ if (!(*seq & 1)) { /* Even */
+ *seq = read_seqbegin(lock);
+ rcu_read_lock();
+ } else /* Odd */
+ write_seqlock(lock);
+}
+
+/**
+ * read_seqretry_or_unlock - end a seqretry or lock block & return retry status
+ * lock : sequence lock
+ * seq : sequence number
+ * Return: 1 to retry the operation again, 0 to continue
+ */
+static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
+{
+ if (!(*seq & 1)) { /* Even */
+ rcu_read_unlock();
+ if (read_seqretry(lock, *seq)) {
+ (*seq)++; /* Take writer lock */
+ return 1;
+ }
+ } else /* Odd */
+ write_sequnlock(lock);
+ return 0;
+}
+
/*
* This is the single most critical data structure when it comes
* to the dcache: the hashtable for lookups. Somebody should try
@@ -2647,9 +2685,39 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)
return 0;
}
+/**
+ * prepend_name - prepend a pathname in front of current buffer pointer
+ * buffer: buffer pointer
+ * buflen: allocated length of the buffer
+ * name: name string and length qstr structure
+ *
+ * With RCU path tracing, it may race with d_move(). Use ACCESS_ONCE() to
+ * make sure that either the old or the new name pointer and length are
+ * fetched. However, there may be a mismatch between length and pointer.
+ * The length cannot be trusted; we need to copy the name byte-by-byte until
+ * the length is reached or a null byte is found. It also prepends "/" at
+ * the beginning of the name. The sequence number check at the caller will
+ * retry it again when a d_move() does happen. So any garbage in the buffer
+ * due to mismatched pointer and length will be discarded.
+ */
static int prepend_name(char **buffer, int *buflen, struct qstr *name)
{
- return prepend(buffer, buflen, name->name, name->len);
+ const char *dname = ACCESS_ONCE(name->name);
+ u32 dlen = ACCESS_ONCE(name->len);
+ char *p;
+
+ if (*buflen < dlen + 1)
+ return -ENAMETOOLONG;
+ *buflen -= dlen + 1;
+ p = *buffer -= dlen + 1;
+ *p++ = '/';
+ while (dlen--) {
+ char c = *dname++;
+ if (!c)
+ break;
+ *p++ = c;
+ }
+ return 0;
}
/**
@@ -2659,7 +2727,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
* @buffer: pointer to the end of the buffer
* @buflen: pointer to buffer length
*
- * Caller holds the rename_lock.
+ * The function tries to write out the pathname without taking any lock other
+ * than the RCU read lock to make sure that dentries won't go away. It only
+ * checks the sequence number of the global rename_lock as any change in the
+ * dentry's d_seq will be preceded by changes in the rename_lock sequence
+ * number. If the sequence number has been changed, it will restart the whole
+ * pathname back-tracing sequence again. It performs a total of 3 trials of
+ * lockless back-tracing sequences before falling back to take the
+ * rename_lock.
*/
static int prepend_path(const struct path *path,
const struct path *root,
@@ -2668,54 +2743,60 @@ static int prepend_path(const struct path *path,
struct dentry *dentry = path->dentry;
struct vfsmount *vfsmnt = path->mnt;
struct mount *mnt = real_mount(vfsmnt);
- bool slash = false;
int error = 0;
+ unsigned seq = 0;
+ char *bptr;
+ int blen;
+restart:
+ bptr = *buffer;
+ blen = *buflen;
+ read_seqbegin_or_lock(&rename_lock, &seq);
while (dentry != root->dentry || vfsmnt != root->mnt) {
struct dentry * parent;
if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
/* Global root? */
- if (!mnt_has_parent(mnt))
- goto global_root;
- dentry = mnt->mnt_mountpoint;
- mnt = mnt->mnt_parent;
- vfsmnt = &mnt->mnt;
- continue;
+ if (mnt_has_parent(mnt)) {
+ dentry = mnt->mnt_mountpoint;
+ mnt = mnt->mnt_parent;
+ vfsmnt = &mnt->mnt;
+ continue;
+ }
+ /*
+ * Filesystems needing to implement special "root names"
+ * should do so with ->d_dname()
+ */
+ if (IS_ROOT(dentry) &&
+ (dentry->d_name.len != 1 ||
+ dentry->d_name.name[0] != '/')) {
+ WARN(1, "Root dentry has weird name <%.*s>\n",
+ (int) dentry->d_name.len,
+ dentry->d_name.name);
+ }
+ if (!error)
+ error = is_mounted(vfsmnt) ? 1 : 2;
+ break;
}
parent = dentry->d_parent;
prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(buffer, buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (!error)
- error = prepend(buffer, buflen, "/", 1);
+ error = prepend_name(&bptr, &blen, &dentry->d_name);
if (error)
break;
- slash = true;
dentry = parent;
}
+ if (read_seqretry_or_unlock(&rename_lock, &seq))
+ goto restart;
- if (!error && !slash)
- error = prepend(buffer, buflen, "/", 1);
-
- return error;
-
-global_root:
- /*
- * Filesystems needing to implement special "root names"
- * should do so with ->d_dname()
- */
- if (IS_ROOT(dentry) &&
- (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) {
- WARN(1, "Root dentry has weird name <%.*s>\n",
- (int) dentry->d_name.len, dentry->d_name.name);
- }
- if (!slash)
- error = prepend(buffer, buflen, "/", 1);
- if (!error)
- error = is_mounted(vfsmnt) ? 1 : 2;
+ if (error >= 0 && bptr == *buffer) {
+ if (--blen < 0)
+ error = -ENAMETOOLONG;
+ else
+ *--bptr = '/';
+ }
+ *buffer = bptr;
+ *buflen = blen;
return error;
}
@@ -2744,9 +2825,7 @@ char *__d_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -2765,9 +2844,7 @@ char *d_absolute_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error > 1)
@@ -2833,9 +2910,7 @@ char *d_path(const struct path *path, char *buf, int buflen)
get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2870,10 +2945,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
char *end = buffer + buflen;
/* these dentries are never renamed, so d_lock is not needed */
if (prepend(&end, &buflen, " (deleted)", 11) ||
- prepend_name(&end, &buflen, &dentry->d_name) ||
+ prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) ||
prepend(&end, &buflen, "/", 1))
end = ERR_PTR(-ENAMETOOLONG);
- return end;
+ return end;
}
/*
@@ -2881,30 +2956,36 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
*/
static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
{
- char *end = buf + buflen;
- char *retval;
+ char *end, *retval;
+ int len, seq = 0;
+ int error = 0;
- prepend(&end, &buflen, "\0", 1);
+restart:
+ end = buf + buflen;
+ len = buflen;
+ prepend(&end, &len, "\0", 1);
if (buflen < 1)
goto Elong;
/* Get '/' right */
retval = end-1;
*retval = '/';
-
+ read_seqbegin_or_lock(&rename_lock, &seq);
while (!IS_ROOT(dentry)) {
struct dentry *parent = dentry->d_parent;
-		int error;
prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(&end, &buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (error != 0 || prepend(&end, &buflen, "/", 1) != 0)
- goto Elong;
+ error = prepend_name(&end, &len, &dentry->d_name);
+ if (error)
+ break;
retval = end;
dentry = parent;
}
+ if (read_seqretry_or_unlock(&rename_lock, &seq))
+ goto restart;
+ if (error)
+ goto Elong;
return retval;
Elong:
return ERR_PTR(-ENAMETOOLONG);
@@ -2912,13 +2993,7 @@ Elong:
char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
- char *retval;
-
- write_seqlock(&rename_lock);
- retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
-
- return retval;
+ return __dentry_path(dentry, buf, buflen);
}
EXPORT_SYMBOL(dentry_path_raw);
@@ -2927,7 +3002,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;
- write_seqlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2935,7 +3009,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2974,7 +3047,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2982,7 +3054,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -3003,7 +3074,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}
--
1.7.1
On Mon, Sep 09, 2013 at 12:18:13PM -0400, Waiman Long wrote:
> +/**
> + * read_seqbegin_or_lock - begin a sequence number check or locking block
> + * lock: sequence lock
> + * seq : sequence number to be checked
> + *
> + * First try it once optimistically without taking the lock. If that fails,
> + * take the lock. The sequence number is also used as a marker for deciding
> + * whether to be a reader (even) or writer (odd).
> + * N.B. seq must be initialized to an even number to begin with.
> + */
> +static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
> +{
> + if (!(*seq & 1)) { /* Even */
> + *seq = read_seqbegin(lock);
> + rcu_read_lock();
> + } else /* Odd */
> + write_seqlock(lock);
> +}
> +static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
> +{
> + if (!(*seq & 1)) { /* Even */
> + rcu_read_unlock();
> + if (read_seqretry(lock, *seq)) {
> + (*seq)++; /* Take writer lock */
> + return 1;
> + }
> + } else /* Odd */
> + write_sequnlock(lock);
> + return 0;
> +}
I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
can do that themselves just fine (it needs to be taken when seq is even),
and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
this variant.
On Mon, Sep 9, 2013 at 10:29 AM, Al Viro <[email protected]> wrote:
>
> I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
> can do that themselves just fine (it needs to be taken when seq is even),
> and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
> this variant.
Hmm.. I think you need the RCU read lock even when you get the write_seqlock().
Yes, getting the seqlock for write implies that you get a spinlock and
in many normal circumstances that basically is equivalent to being
rcu-locked, but afaik in some configurations that is *not* sufficient
protection against an RCU grace period on another CPU. You need to do
a real rcu_read_lock that increments that whole rcu_read_lock_nesting
level, which a spinlock won't do.
And while the rename sequence lock protects against _renames_, it does
not protect against just plain dentries getting free'd under memory
pressure.
So I think the RCU-readlockness really needs to be independent of the
sequence lock.
Linus
On 09/09/2013 01:29 PM, Al Viro wrote:
> On Mon, Sep 09, 2013 at 12:18:13PM -0400, Waiman Long wrote:
>> +/**
>> + * read_seqbegin_or_lock - begin a sequence number check or locking block
>> + * lock: sequence lock
>> + * seq : sequence number to be checked
>> + *
>> + * First try it once optimistically without taking the lock. If that fails,
>> + * take the lock. The sequence number is also used as a marker for deciding
>> + * whether to be a reader (even) or writer (odd).
>> + * N.B. seq must be initialized to an even number to begin with.
>> + */
>> +static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
>> +{
>> + if (!(*seq& 1)) { /* Even */
>> + *seq = read_seqbegin(lock);
>> + rcu_read_lock();
>> + } else /* Odd */
>> + write_seqlock(lock);
>> +}
>> +static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
>> +{
>> + if (!(*seq& 1)) { /* Even */
>> + rcu_read_unlock();
>> + if (read_seqretry(lock, *seq)) {
>> + (*seq)++; /* Take writer lock */
>> + return 1;
>> + }
>> + } else /* Odd */
>> + write_sequnlock(lock);
>> + return 0;
>> +}
> I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
> can do that themselves just fine (it needs to be taken when seq is even),
> and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
> this variant.
I think rcu_read_lock() is needed to make sure that the dentry won't be
freed as we don't take d_lock now.
-Longman
On 09/09/2013 01:45 PM, Linus Torvalds wrote:
> On Mon, Sep 9, 2013 at 10:29 AM, Al Viro<[email protected]> wrote:
>> I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
>> can do that themselves just fine (it needs to be taken when seq is even),
>> and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
>> this variant.
> Hmm.. I think you need the RCU read lock even when you get the write_seqlock().
>
> Yes, getting the seqlock for write implies that you get a spinlock and
> in many normal circumstances that basically is equivalent to being
> rcu-locked, but afaik in some configurations that is *not* sufficient
> protection against an RCU grace period on another CPU. You need to do
> a real rcu_read_lock that increments that whole rcu_read_lock_nesting
> level, which a spinlock won't do.
>
> And while the rename sequence lock protects against _renames_, it does
> not protect against just plain dentries getting free'd under memory
> pressure.
>
> So I think the RCU-readlockness really needs to be independent of the
> sequence lock.
>
> Linus
Yes, you are right. It will be safer to take rcu_read_lock() even if we
are taking the rename_lock. It wasn't needed before as d_lock was taken.
Will update the patch to take rcu_read_lock() out of the helper functions
to reflect that.
Regards,
Longman
On Mon, Sep 09, 2013 at 10:45:38AM -0700, Linus Torvalds wrote:
> On Mon, Sep 9, 2013 at 10:29 AM, Al Viro <[email protected]> wrote:
> >
> > I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
> > can do that themselves just fine (it needs to be taken when seq is even),
> > and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
> > this variant.
>
> Hmm.. I think you need the RCU read lock even when you get the write_seqlock().
>
> Yes, getting the seqlock for write implies that you get a spinlock and
> in many normal circumstances that basically is equvalent to being
> rcu-locked, but afaik in some configurations that is *not* sufficient
> protection against an RCU grace period on another CPU. You need to do
> a real rcu_read_lock that increments that whole rcu_read_lock_nesting
> level, which a spinlock won't do.
>
> And while the rename sequence lock protects against _renames_, it does
> not protect against just plain dentries getting free'd under memory
> pressure.
It protects the chain of ->d_parent, so they'd better not get freed at
all...
> So I think the RCU-readlockness really needs to be independent of the
> sequence lock.
Actually, now that I've tried to convert d_walk() to those guys, I think
I like my proposal for the set of primitives better:
static inline bool seqretry_and_lock(seqlock_t *lock, unsigned *seq)
{
if ((*seq & 1) || !read_seqretry(lock, *seq))
return true;
*seq |= 1;
write_seqlock(lock);
return false;
}
static inline void seqretry_done(seqlock_t *lock, unsigned seq)
{
if (seq & 1)
write_sequnlock(lock);
}
with the prepend_path() and friends becoming
rcu_read_lock();
seq = read_seqbegin(&rename_lock);
again:
....
if (!seqretry_and_lock(&rename_lock, &seq))
goto again; /* now as writer */
seqretry_done(&rename_lock, seq);
rcu_read_unlock();
The thing is, d_walk() does essentially
seq = read_seqbegin(&rename_lock);
again:
....
spin_lock(&d->d_lock);
if (!seqretry_and_lock(&rename_lock, &seq)) {
spin_unlock(&d->d_lock);
goto again; /* now as writer */
}
/* now we are holding ->d_lock on it and we know
* that d has not gone stale until that point.
*/
do stuff with d
spin_unlock(&d->d_lock);
seqretry_done(&rename_lock, seq);
OTOH, it's not impossible to handle with Waiman's primitives, just more
massage to do that...
On Mon, Sep 09, 2013 at 01:55:06PM -0400, Waiman Long wrote:
> >I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends
> >can do that themselves just fine (it needs to be taken when seq is even),
> >and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with
> >this variant.
>
> I think rcu_read_lock() is needed to make sure that the dentry won't
> be freed as we don't take d_lock now.
Sure, you do need that; the question is whether you need to take it in
the primitives you are introducing.
On Mon, Sep 9, 2013 at 11:06 AM, Al Viro <[email protected]> wrote:
>>
>> And while the rename sequence lock protects against _renames_, it does
>> not protect against just plain dentries getting free'd under memory
>> pressure.
>
> It protects the chain of ->d_parent, so they'd better not get freed at
> all...
Yeah, I think you're right because of the other constraints.. Holding
just the rename-lock for writing is actually sufficient, because
(a) you are guaranteed to already hold on to the end-point of the
walk (it's where you begin your path lookup), and so the memory
pressure issue is actually irrelevant.
(b) the only dentry lookup thing that remains is the parenthood chain
which is recursively reference-count-protected from the end, and yes,
in the absence of normal memory pressure, renames are the only thing
that will mess with that.
So even without holding d_lock, I guess we're actually safe even
without a real rcu read lock if we hold the rename lock for writing.
That actually argues for doing the rcu_read_lock() inside the helper
function. HOWEVER, I'd like a comment to that effect, to show why we
can access all those dentries without any other synchronization.
Linus
On Mon, Sep 09, 2013 at 07:06:04PM +0100, Al Viro wrote:
> I like my proposal for the set of primitives better:
>
> static inline bool seqretry_and_lock(seqlock_t *lock, unsigned *seq)
> {
> if ((*seq & 1) || !read_seqretry(lock, *seq))
> return true;
> *seq |= 1;
> write_seqlock(lock);
> return false;
> }
>
> static inline void seqretry_done(seqlock_t *lock, unsigned seq)
> {
> if (seq & 1)
> write_sequnlock(lock);
> }
>
> with the prepend_path() and friends becoming
>
> rcu_read_lock();
> seq = read_seqbegin(&rename_lock);
> again:
> ....
> if (!seqretry_and_lock(&rename_lock, &seq))
> goto again; /* now as writer */
> seqretry_done(&rename_lock, seq);
> rcu_read_unlock();
Actually, it's better for prepend_path() as well, because it's actually
rcu_read_lock();
seq = read_seqbegin(&rename_lock);
again:
....
if (error)
goto done;
....
> if (!seqretry_and_lock(&rename_lock, &seq))
goto again; /* now as writer */
done:
seqretry_done(&rename_lock, seq);
rcu_read_unlock();
Posted variant will sometimes hit the following path:
* read_seqbegin()
* start generating the output
* hit an error
[another process has taken and released rename_lock for some reason]
* hit read_seqretry_or_unlock(), which returns 1.
* retry everything with write_seqlock(), despite the error.
It's not too horrible (we won't be looping indefinitely, ignoring the error
all along), but it's certainly subtle enough...
On Mon, Sep 09, 2013 at 07:21:11PM +0100, Al Viro wrote:
> Actually, it's better for prepend_path() as well, because it's actually
>
> rcu_read_lock();
> seq = read_seqbegin(&rename_lock);
> again:
> ....
> if (error)
> goto done;
> ....
> if (!seqretry_and_lock(&rename_lock, &seq))
> goto again; /* now as writer */
> done:
> seqretry_done(&rename_lock, seq);
> rcu_read_unlock();
>
> Posted variant will sometimes hit the following path:
> * read_seqbegin()
> * start generating the output
> * hit an error
> [another process has taken and released rename_lock for some reason]
> * hit read_seqretry_or_unlock(), which returns 1.
> * retry everything with write_seqlock(), despite the error.
>
> It's not too horrible (we won't be looping indefinitely, ignoring the error
> all along), but it's certainly subtle enough...
FWIW, what I propose is this (just the d_path-related parts):
diff --git a/fs/dcache.c b/fs/dcache.c
index 761e31b..b963605 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -88,6 +88,21 @@ EXPORT_SYMBOL(rename_lock);
static struct kmem_cache *dentry_cache __read_mostly;
+static inline bool seqretry_and_lock(seqlock_t *lock, unsigned *seq)
+{
+ if ((*seq & 1) || !read_seqretry(lock, *seq))
+ return true;
+ *seq |= 1;
+ write_seqlock(lock);
+ return false;
+}
+
+static inline void seqretry_done(seqlock_t *lock, unsigned seq)
+{
+ if (seq & 1)
+ write_sequnlock(lock);
+}
+
/*
* This is the single most critical data structure when it comes
* to the dcache: the hashtable for lookups. Somebody should try
@@ -2644,9 +2659,39 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)
return 0;
}
+/**
+ * prepend_name - prepend a pathname in front of current buffer pointer
+ * buffer: buffer pointer
+ * buflen: allocated length of the buffer
+ * name: name string and length qstr structure
+ *
+ * With RCU path tracing, it may race with d_move(). Use ACCESS_ONCE() to
+ * make sure that either the old or the new name pointer and length are
+ * fetched. However, there may be a mismatch between length and pointer.
+ * The length cannot be trusted; we need to copy the name byte-by-byte until
+ * the length is reached or a null byte is found. It also prepends "/" at
+ * the beginning of the name. The sequence number check at the caller will
+ * retry it again when a d_move() does happen. So any garbage in the buffer
+ * due to mismatched pointer and length will be discarded.
+ */
static int prepend_name(char **buffer, int *buflen, struct qstr *name)
{
- return prepend(buffer, buflen, name->name, name->len);
+ const char *dname = ACCESS_ONCE(name->name);
+ u32 dlen = ACCESS_ONCE(name->len);
+ char *p;
+
+ if (*buflen < dlen + 1)
+ return -ENAMETOOLONG;
+ *buflen -= dlen + 1;
+ p = *buffer -= dlen + 1;
+ *p++ = '/';
+ while (dlen--) {
+ char c = *dname++;
+ if (!c)
+ break;
+ *p++ = c;
+ }
+ return 0;
}
/**
@@ -2656,7 +2701,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
* @buffer: pointer to the end of the buffer
* @buflen: pointer to buffer length
*
- * Caller holds the rename_lock.
+ * The function tries to write out the pathname without taking any lock other
+ * than the RCU read lock to make sure that dentries won't go away. It only
+ * checks the sequence number of the global rename_lock as any change in the
+ * dentry's d_seq will be preceded by changes in the rename_lock sequence
+ * number. If the sequence number has been changed, it will restart the whole
+ * pathname back-tracing sequence again. It performs a total of 3 trials of
+ * lockless back-tracing sequences before falling back to take the
+ * rename_lock.
*/
static int prepend_path(const struct path *path,
const struct path *root,
@@ -2665,54 +2717,64 @@ static int prepend_path(const struct path *path,
struct dentry *dentry = path->dentry;
struct vfsmount *vfsmnt = path->mnt;
struct mount *mnt = real_mount(vfsmnt);
- bool slash = false;
int error = 0;
+ unsigned seq;
+ char *bptr;
+ int blen;
+ rcu_read_lock();
+ seq = read_seqbegin(&rename_lock);
+restart:
+ bptr = *buffer;
+ blen = *buflen;
while (dentry != root->dentry || vfsmnt != root->mnt) {
struct dentry * parent;
if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
/* Global root? */
- if (!mnt_has_parent(mnt))
- goto global_root;
- dentry = mnt->mnt_mountpoint;
- mnt = mnt->mnt_parent;
- vfsmnt = &mnt->mnt;
- continue;
+ if (mnt_has_parent(mnt)) {
+ dentry = mnt->mnt_mountpoint;
+ mnt = mnt->mnt_parent;
+ vfsmnt = &mnt->mnt;
+ continue;
+ }
+ /*
+ * Filesystems needing to implement special "root names"
+ * should do so with ->d_dname()
+ */
+ if (IS_ROOT(dentry) &&
+ (dentry->d_name.len != 1 ||
+ dentry->d_name.name[0] != '/')) {
+ WARN(1, "Root dentry has weird name <%.*s>\n",
+ (int) dentry->d_name.len,
+ dentry->d_name.name);
+ }
+ if (!error)
+ error = is_mounted(vfsmnt) ? 1 : 2;
+ break;
}
parent = dentry->d_parent;
prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(buffer, buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (!error)
- error = prepend(buffer, buflen, "/", 1);
+ error = prepend_name(&bptr, &blen, &dentry->d_name);
if (error)
break;
- slash = true;
dentry = parent;
}
+ if (error >= 0 && !seqretry_and_lock(&rename_lock, &seq))
+ goto restart;
- if (!error && !slash)
- error = prepend(buffer, buflen, "/", 1);
-
- return error;
+ seqretry_done(&rename_lock, seq);
+ rcu_read_unlock();
-global_root:
- /*
- * Filesystems needing to implement special "root names"
- * should do so with ->d_dname()
- */
- if (IS_ROOT(dentry) &&
- (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) {
- WARN(1, "Root dentry has weird name <%.*s>\n",
- (int) dentry->d_name.len, dentry->d_name.name);
- }
- if (!slash)
- error = prepend(buffer, buflen, "/", 1);
- if (!error)
- error = is_mounted(vfsmnt) ? 1 : 2;
+ if (error >= 0 && bptr == *buffer) {
+ if (--blen < 0)
+ error = -ENAMETOOLONG;
+ else
+ *--bptr = '/';
+ }
+ *buffer = bptr;
+ *buflen = blen;
return error;
}
@@ -2741,9 +2803,7 @@ char *__d_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -2762,9 +2822,7 @@ char *d_absolute_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error > 1)
@@ -2830,9 +2888,7 @@ char *d_path(const struct path *path, char *buf, int buflen)
get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2867,10 +2923,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
char *end = buffer + buflen;
/* these dentries are never renamed, so d_lock is not needed */
if (prepend(&end, &buflen, " (deleted)", 11) ||
- prepend_name(&end, &buflen, &dentry->d_name) ||
+ prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) ||
prepend(&end, &buflen, "/", 1))
end = ERR_PTR(-ENAMETOOLONG);
- return end;
+ return end;
}
/*
@@ -2878,30 +2934,40 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
*/
static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
{
- char *end = buf + buflen;
- char *retval;
+ char *end, *retval;
+ int len, seq;
+ int error = 0;
- prepend(&end, &buflen, "\0", 1);
+ rcu_read_lock();
+ seq = read_seqbegin(&rename_lock);
+restart:
+ end = buf + buflen;
+ len = buflen;
+ prepend(&end, &len, "\0", 1);
if (buflen < 1)
goto Elong;
/* Get '/' right */
retval = end-1;
*retval = '/';
-
while (!IS_ROOT(dentry)) {
struct dentry *parent = dentry->d_parent;
-		int error;
prefetch(parent);
- spin_lock(&dentry->d_lock);
- error = prepend_name(&end, &buflen, &dentry->d_name);
- spin_unlock(&dentry->d_lock);
- if (error != 0 || prepend(&end, &buflen, "/", 1) != 0)
- goto Elong;
+ error = prepend_name(&end, &len, &dentry->d_name);
+ if (error)
+ goto done;
retval = end;
dentry = parent;
}
+ if (!seqretry_and_lock(&rename_lock, &seq))
+ goto restart;
+done:
+ seqretry_done(&rename_lock, seq);
+ rcu_read_unlock();
+ if (error)
+ goto Elong;
return retval;
Elong:
return ERR_PTR(-ENAMETOOLONG);
@@ -2909,13 +2975,7 @@ Elong:
char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
- char *retval;
-
- write_seqlock(&rename_lock);
- retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
-
- return retval;
+ return __dentry_path(dentry, buf, buflen);
}
EXPORT_SYMBOL(dentry_path_raw);
@@ -2924,7 +2984,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;
- write_seqlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2932,7 +2991,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2971,7 +3029,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2979,7 +3036,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -3000,7 +3056,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}
On Mon, Sep 09, 2013 at 07:36:08PM +0100, Al Viro wrote:
> FWIW, what I propose is this (just the d_path-related parts):
Grrr... It's still the wrong set of primitives ;-/ Nevermind that one...
On 09/09/2013 02:36 PM, Al Viro wrote:
> On Mon, Sep 09, 2013 at 07:21:11PM +0100, Al Viro wrote:
>> Actually, it's better for prepend_path() as well, because it's actually
>>
>> rcu_read_lock();
>> seq = read_seqbegin(&rename_lock);
>> again:
>> ....
>> if (error)
>> goto done;
>> ....
>> if (!seqretry_and_lock(&rename_lock, seq))
>> goto again; /* now as writer */
>> done:
>> seqretry_done(&rename_lock, seq);
>> rcu_read_unlock();
>>
>> Posted variant will sometimes hit the following path:
>> * seq_readlock()
>> * start generating the output
>> * hit an error
>> [another process has taken and released rename_lock for some reason]
>> * hit read_seqretry_and_unlock(), which returns 1.
>> * retry everything with seq_writelock(), despite the error.
>>
>> It's not too horrible (we won't be looping indefinitely, ignoring error
>> all along), but it's certainly subtle enough...
> FWIW, what I propose is this (just the d_path-related parts):
>
>
I am fine with your proposed change as long as it gets the job done. It
doesn't really matter if you do it or I do it.
Thanks for taking the effort to make it better for us all.
-Longman
On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote:
> I am fine with your proposed change as long as it gets the job done.
I suspect that the real problem is the unlock part of read_seqretry_or_unlock();
for d_walk() we want to be able to check if we need retry and continue walking
if we do not. Let's do it that way: I've applied your patch as is, with the
next step being
* split read_seqretry_or_unlock():
need_seqretry() (return (!(seq & 1) && read_seqretry(lock, seq))
done_seqretry() (if (seq & 1) write_sequnlock(lock, seq)),
your if (read_seqretry_or_unlock(&rename_lock, &seq))
goto restart;
becoming
if (need_seqretry(&rename_lock, seq)) {
seq = 1;
goto restart;
}
done_seqretry(&rename_lock, seq);
Then d_walk() is trivially massaged to use of read_seqbegin_or_lock(),
need_seqretry() and done_seqretry(). Give me a few, I'll post it...
On Mon, Sep 09, 2013 at 08:10:29PM +0100, Al Viro wrote:
> On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote:
>
> > I am fine with your proposed change as long as it gets the job done.
>
> I suspect that the real problem is the unlock part of read_seqretry_or_unlock();
> for d_walk() we want to be able to check if we need retry and continue walking
> if we do not. Let's do it that way: I've applied your patch as is, with the
> next step being
> * split read_seqretry_or_unlock():
> need_seqretry() (return (!(seq & 1) && read_seqretry(lock, seq))
> done_seqretry() (if (seq & 1) write_sequnlock(lock, seq)),
> your if (read_seqretry_or_unlock(&rename_lock, &seq))
> goto restart;
> becoming
> if (need_seqretry(&rename_lock, seq)) {
> seq = 1;
> goto restart;
> }
> done_seqretry(&rename_lock, seq);
>
> Then d_walk() is trivially massaged to use of read_seqbegin_or_lock(),
> need_seqretry() and done_seqretry(). Give me a few, I'll post it...
OK, how about this? It splits read_seqretry_or_unlock(), takes
rcu_read_{lock,unlock} in the callers and converts d_walk() to those
primitives. I've pushed that and your commit into vfs.git#experimental
(head at 48f5ec2, should propagate in a few); guys, please give it a look
and comment.
diff --git a/fs/dcache.c b/fs/dcache.c
index 38b1b09..b9caf47 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -100,30 +100,21 @@ static struct kmem_cache *dentry_cache __read_mostly;
*/
static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
{
- if (!(*seq & 1)) { /* Even */
+ if (!(*seq & 1)) /* Even */
*seq = read_seqbegin(lock);
- rcu_read_lock();
- } else /* Odd */
+ else /* Odd */
write_seqlock(lock);
}
-/**
- * read_seqretry_or_unlock - end a seqretry or lock block & return retry status
- * lock : sequence lock
- * seq : sequence number
- * Return: 1 to retry operation again, 0 to continue
- */
-static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq)
+static inline int need_seqretry(seqlock_t *lock, int seq)
{
- if (!(*seq & 1)) { /* Even */
- rcu_read_unlock();
- if (read_seqretry(lock, *seq)) {
- (*seq)++; /* Take writer lock */
- return 1;
- }
- } else /* Odd */
+ return !(seq & 1) && read_seqretry(lock, seq);
+}
+
+static inline void done_seqretry(seqlock_t *lock, int seq)
+{
+ if (seq & 1)
write_sequnlock(lock);
- return 0;
}
/*
@@ -1047,7 +1038,7 @@ void shrink_dcache_for_umount(struct super_block *sb)
* the parenthood after dropping the lock and check
* that the sequence number still matches.
*/
-static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq)
+static struct dentry *try_to_ascend(struct dentry *old, unsigned seq)
{
struct dentry *new = old->d_parent;
@@ -1061,7 +1052,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
*/
if (new != old->d_parent ||
(old->d_flags & DCACHE_DENTRY_KILLED) ||
- (!locked && read_seqretry(&rename_lock, seq))) {
+ need_seqretry(&rename_lock, seq)) {
spin_unlock(&new->d_lock);
new = NULL;
}
@@ -1098,13 +1089,12 @@ static void d_walk(struct dentry *parent, void *data,
{
struct dentry *this_parent;
struct list_head *next;
- unsigned seq;
- int locked = 0;
+ unsigned seq = 0;
enum d_walk_ret ret;
bool retry = true;
- seq = read_seqbegin(&rename_lock);
again:
+ read_seqbegin_or_lock(&rename_lock, &seq);
this_parent = parent;
spin_lock(&this_parent->d_lock);
@@ -1158,13 +1148,13 @@ resume:
*/
if (this_parent != parent) {
struct dentry *child = this_parent;
- this_parent = try_to_ascend(this_parent, locked, seq);
+ this_parent = try_to_ascend(this_parent, seq);
if (!this_parent)
goto rename_retry;
next = child->d_u.d_child.next;
goto resume;
}
- if (!locked && read_seqretry(&rename_lock, seq)) {
+ if (need_seqretry(&rename_lock, seq)) {
spin_unlock(&this_parent->d_lock);
goto rename_retry;
}
@@ -1173,17 +1163,13 @@ resume:
out_unlock:
spin_unlock(&this_parent->d_lock);
- if (locked)
- write_sequnlock(&rename_lock);
+ done_seqretry(&rename_lock, seq);
return;
rename_retry:
if (!retry)
return;
- if (locked)
- goto again;
- locked = 1;
- write_seqlock(&rename_lock);
+ seq = 1;
goto again;
}
@@ -2745,6 +2731,7 @@ static int prepend_path(const struct path *path,
char *bptr;
int blen;
+ rcu_read_lock();
restart:
bptr = *buffer;
blen = *buflen;
@@ -2783,8 +2770,13 @@ restart:
dentry = parent;
}
- if (read_seqretry_or_unlock(&rename_lock, &seq))
+ if (!(seq & 1))
+ rcu_read_unlock();
+ if (need_seqretry(&rename_lock, seq)) {
+ seq = 1;
goto restart;
+ }
+ done_seqretry(&rename_lock, seq);
if (error >= 0 && bptr == *buffer) {
if (--blen < 0)
@@ -2957,6 +2949,7 @@ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
int len, seq = 0;
int error = 0;
+ rcu_read_lock();
restart:
end = buf + buflen;
len = buflen;
@@ -2979,8 +2972,13 @@ restart:
retval = end;
dentry = parent;
}
- if (read_seqretry_or_unlock(&rename_lock, &seq))
+ if (!(seq & 1))
+ rcu_read_unlock();
+ if (need_seqretry(&rename_lock, seq)) {
+ seq = 1;
goto restart;
+ }
+ done_seqretry(&rename_lock, seq);
if (error)
goto Elong;
return retval;
On 09/09/2013 03:28 PM, Al Viro wrote:
> On Mon, Sep 09, 2013 at 08:10:29PM +0100, Al Viro wrote:
>> On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote:
>>
>>> I am fine with your proposed change as long as it gets the job done.
>> I suspect that the real problem is the unlock part of read_seqretry_or_unlock();
>> for d_walk() we want to be able to check if we need retry and continue walking
>> if we do not. Let's do it that way: I've applied your patch as is, with the
>> next step being
>> * split read_seqretry_or_unlock():
>> need_seqretry() (return (!(seq & 1) && read_seqretry(lock, seq))
>> done_seqretry() (if (seq & 1) write_sequnlock(lock, seq)),
>> your if (read_seqretry_or_unlock(&rename_lock, &seq))
>> goto restart;
>> becoming
>> if (need_seqretry(&rename_lock, seq)) {
>> seq = 1;
>> goto restart;
>> }
>> done_seqretry(&rename_lock, seq);
>>
>> Then d_walk() is trivially massaged to use of read_seqbegin_or_lock(),
>> need_seqretry() and done_seqretry(). Give me a few, I'll post it...
> OK, how about this? It splits read_seqretry_or_unlock(), takes
> rcu_read_{lock,unlock} in the callers and converts d_walk() to those
> primitives. I've pushed that and your commit into vfs.git#experimental
> (head at 48f5ec2, should propagate in a few); guys, please give it a look
> and comment.
The changes look good to me. I was planning to take rcu_read_lock() out
and do something similar, but your change is good. BTW, I think Linus
wants to add some comments on why the RCU lock is needed without the
rename_lock, but I can put that in with a follow-up patch once the
current change is merged.
Thanks for your help and inspiration on this patch.
-Longman
I'm really wondering about only trying once before taking the write lock.
Yes, using the lsbit is a cute hack, but are we using it for its cuteness
rather than its effectiveness?
Renames happen occasionally. If that causes all the current pathname
translations to fall back to the write lock, that is fairly heavy.
Worse, all of those translations will (unnecessarily) bump the write
seqcount, triggering *other* translations to fall back to the write-lock
path.
One patch to fix this would be to have the fallback read algorithm take
sl->lock but *not* touch sl->seqcount, so it wouldn't break concurrent
readers.
But another is to simply retry at least once (two attempts) on the
non-exclusive path before falling back to the exclusive one. This means
that the count lsbit is no longer enough space for a retry counter, but
oh, well.
(If you really want to use one word, perhaps a better heuristic
as to how to retry would be to examine the *number* of writes
to the seqlock during the read. If there was only one, there's a fair
chance that another read will succeed. If there was more than one
(i.e. the seqlock has incremented by 3 or more), then forcing the
writers to stop is probably necessary.)
On Mon, Sep 09, 2013 at 08:40:20PM -0400, George Spelvin wrote:
> I'm really wondering about only trying once before taking the write lock.
> Yes, using the lsbit is a cute hack, but are we using it for its cuteness
> rather than its effectiveness?
>
> Renames happen occasionally. If that causes all the current pathname
> translations to fall back to the write lock, that is fairly heavy.
> Worse, all of those translations will (unnecessarily) bump the write
> seqcount, triggering *other* translations to fail back to the write-lock
> path.
_What_ "pathname translations"? Pathname resolution doesn't fall back to
seq_writelock() at all.
Al Viro wrote:
> _What_ "pathname translations"? Pathname resolution doesn't fall back to
> seq_writelock() at all.
Maybe it should then?
On Mon, Sep 9, 2013 at 6:15 PM, Ramkumar Ramachandra <[email protected]> wrote:
>
> Maybe it should then?
It doesn't need to. The RCU lookup looks at individual dentry sequence
numbers and doesn't care about the bigger rename sequence number at
all.
The fallback (if you hit one of the very very rare races, or if you
hit a symlink) ends up doing per-path-component lookups under the
rename sequence lock, but for it, read-locking it until it succeeds is
the right thing to do.
Linus
On Mon, Sep 09, 2013 at 06:34:16PM -0700, Linus Torvalds wrote:
> On Mon, Sep 9, 2013 at 6:15 PM, Ramkumar Ramachandra <[email protected]> wrote:
> >
> > Maybe it should then?
>
> It doesn't need to. The RCU lookup looks at individual dentry sequence
> numbers and doesn't care about the bigger rename sequence number at
> all.
One name: Mark V. Shaney...
On Mon, Sep 9, 2013 at 7:25 PM, Al Viro <[email protected]> wrote:
>
> One name: Mark V. Shaney...
Heh, yes. I had ignored the earlier emails, and that last one looked
more reasonable than the earlier ones ;)
Linus
Linus Torvalds wrote:
> It doesn't need to. The RCU lookup looks at individual dentry sequence
> numbers and doesn't care about the bigger rename sequence number at
> all.
Right; it's sequential.
> The fallback (if you hit one of the very very rare races, or if you
> hit a symlink) ends up doing per-path-component lookups under the
> rename sequence lock, but for it, read-locking it until it succeeds is
> the right thing to do.
No, it's write-locking.
On 09/09/2013 08:40 PM, George Spelvin wrote:
> I'm really wondering about only trying once before taking the write lock.
> Yes, using the lsbit is a cute hack, but are we using it for its cuteness
> rather than its effectiveness?
>
> Renames happen occasionally. If that causes all the current pathname
> translations to fall back to the write lock, that is fairly heavy.
> Worse, all of those translations will (unnecessarily) bump the write
> seqcount, triggering *other* translations to fail back to the write-lock
> path.
>
> One patch to fix this would be to have the fallback read algorithm take
> sl->lock but *not* touch sl->seqcount, so it wouldn't break concurrent
> readers.
Actually, a follow-up patch I am planning will introduce a
read_seqlock() primitive in seqlock.h that does exactly that. Then the
write_seqlock() in this patch will be changed to read_seqlock().
-Longman
> _What_ "pathname translations"? Pathname resolution doesn't fall back to
> seq_writelock() at all.
I meant the translation of dentry to pathname that this thread is about.
Apologies for my confusing abbreviation. Since the reverse translation
(pathname to dentry) is done a level at a time and people understand
that it's not atomic, I agree with Linus that a livelock there is too
remote a possibility to worry about.