2010-02-01 22:25:15

by Al Viro

[permalink] [raw]
Subject: [PATCH][RFC] %pd - for printing dentry name

There's a lot of places doing printks with ->d_name of various
dentries. Unfortunately, as often as not they are b0rken due to races
with rename().

I propose to add a new format - %pd. It would print dentry name.
However, unlike everything else in vsnprintf, it would NOT be locking-agnostic.
It would grab and release dentry->d_lock. And yes, I hate that as much as
anyone. I don't see any sane alternative.

Patch below implements it and fixes some obvious races in ocfs2
and ubifs by switch to that sucker. There are many more places of
that kind and a _lot_ of those are racy. Adding ->d_lock in every caller
is not a good solution, IMO...

Rules are:
* don't use %pd under dentry->d_lock, use dentry->d_name.name instead; in
that case it *is* safe. Incidentally, ->d_lock isn't held a lot.
* if we don't hold dentry->d_lock, feel free to use %pd....,dentry

Comments?

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 06ccf6a..64769c6 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -104,8 +104,7 @@ static int ocfs2_file_open(struct inode *inode, struct file *file)
int mode = file->f_flags;
struct ocfs2_inode_info *oi = OCFS2_I(inode);

- mlog_entry("(0x%p, 0x%p, '%.*s')\n", inode, file,
- file->f_path.dentry->d_name.len, file->f_path.dentry->d_name.name);
+ mlog_entry("(0x%p, 0x%p, '%pd')\n", inode, file, file->f_path.dentry);

spin_lock(&oi->ip_lock);

@@ -145,9 +144,7 @@ static int ocfs2_file_release(struct inode *inode, struct file *file)
{
struct ocfs2_inode_info *oi = OCFS2_I(inode);

- mlog_entry("(0x%p, 0x%p, '%.*s')\n", inode, file,
- file->f_path.dentry->d_name.len,
- file->f_path.dentry->d_name.name);
+ mlog_entry("(0x%p, 0x%p, '%pd')\n", inode, file, file->f_path.dentry);

spin_lock(&oi->ip_lock);
if (!--oi->ip_open_count)
@@ -181,8 +178,7 @@ static int ocfs2_sync_file(struct file *file,
struct inode *inode = dentry->d_inode;
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);

- mlog_entry("(0x%p, 0x%p, %d, '%.*s')\n", file, dentry, datasync,
- dentry->d_name.len, dentry->d_name.name);
+ mlog_entry("(0x%p, 0x%p, '%pd')\n", inode, file, file->f_path.dentry);

err = ocfs2_sync_inode(dentry->d_inode);
if (err)
@@ -947,8 +943,7 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
struct dquot *transfer_from[MAXQUOTAS] = { };
struct dquot *transfer_to[MAXQUOTAS] = { };

- mlog_entry("(0x%p, '%.*s')\n", dentry,
- dentry->d_name.len, dentry->d_name.name);
+ mlog_entry("(0x%p, '%pd')\n", dentry, dentry);

/* ensuring we don't even attempt to truncate a symlink */
if (S_ISLNK(inode->i_mode))
@@ -1916,10 +1911,9 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
struct inode *inode = file->f_path.dentry->d_inode;
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);

- mlog_entry("(0x%p, %u, '%.*s')\n", file,
+ mlog_entry("(0x%p, %u, '%pd')\n", file,
(unsigned int)nr_segs,
- file->f_path.dentry->d_name.len,
- file->f_path.dentry->d_name.name);
+ file->f_path.dentry);

if (iocb->ki_left == 0)
return 0;
@@ -2096,10 +2090,9 @@ static ssize_t ocfs2_file_splice_write(struct pipe_inode_info *pipe,
.u.file = out,
};

- mlog_entry("(0x%p, 0x%p, %u, '%.*s')\n", out, pipe,
+ mlog_entry("(0x%p, 0x%p, %u, '%pd')\n", out, pipe,
(unsigned int)len,
- out->f_path.dentry->d_name.len,
- out->f_path.dentry->d_name.name);
+ out->f_path.dentry);

if (pipe->inode)
mutex_lock_nested(&pipe->inode->i_mutex, I_MUTEX_PARENT);
@@ -2156,10 +2149,9 @@ static ssize_t ocfs2_file_splice_read(struct file *in,
int ret = 0, lock_level = 0;
struct inode *inode = in->f_path.dentry->d_inode;

- mlog_entry("(0x%p, 0x%p, %u, '%.*s')\n", in, pipe,
+ mlog_entry("(0x%p, 0x%p, %u, '%pd')\n", in, pipe,
(unsigned int)len,
- in->f_path.dentry->d_name.len,
- in->f_path.dentry->d_name.name);
+ in->f_path.dentry);

/*
* See the comment in ocfs2_file_aio_read()
@@ -2187,10 +2179,9 @@ static ssize_t ocfs2_file_aio_read(struct kiocb *iocb,
struct file *filp = iocb->ki_filp;
struct inode *inode = filp->f_path.dentry->d_inode;

- mlog_entry("(0x%p, %u, '%.*s')\n", filp,
+ mlog_entry("(0x%p, %u, '%pd')\n", filp,
(unsigned int)nr_segs,
- filp->f_path.dentry->d_name.len,
- filp->f_path.dentry->d_name.name);
+ filp->f_path.dentry);

if (!inode) {
ret = -EINVAL;
diff --git a/fs/ubifs/xattr.c b/fs/ubifs/xattr.c
index 195830f..fac5dbc 100644
--- a/fs/ubifs/xattr.c
+++ b/fs/ubifs/xattr.c
@@ -304,8 +304,8 @@ int ubifs_setxattr(struct dentry *dentry, const char *name,
union ubifs_key key;
int err, type;

- dbg_gen("xattr '%s', host ino %lu ('%.*s'), size %zd", name,
- host->i_ino, dentry->d_name.len, dentry->d_name.name, size);
+ dbg_gen("xattr '%s', host ino %lu ('%pd'), size %zd", name,
+ host->i_ino, dentry, size);
ubifs_assert(mutex_is_locked(&host->i_mutex));

if (size > UBIFS_MAX_INO_DATA)
@@ -368,8 +368,8 @@ ssize_t ubifs_getxattr(struct dentry *dentry, const char *name, void *buf,
union ubifs_key key;
int err;

- dbg_gen("xattr '%s', ino %lu ('%.*s'), buf size %zd", name,
- host->i_ino, dentry->d_name.len, dentry->d_name.name, size);
+ dbg_gen("xattr '%s', ino %lu ('%pd'), buf size %zd", name,
+ host->i_ino, dentry, dentry, size);

err = check_namespace(&nm);
if (err < 0)
@@ -427,8 +427,8 @@ ssize_t ubifs_listxattr(struct dentry *dentry, char *buffer, size_t size)
int err, len, written = 0;
struct qstr nm = { .name = NULL };

- dbg_gen("ino %lu ('%.*s'), buffer size %zd", host->i_ino,
- dentry->d_name.len, dentry->d_name.name, size);
+ dbg_gen("ino %lu ('%pd'), buffer size %zd", host->i_ino,
+ dentry, size);

len = host_ui->xattr_names + host_ui->xattr_cnt;
if (!buffer)
@@ -530,8 +530,7 @@ int ubifs_removexattr(struct dentry *dentry, const char *name)
union ubifs_key key;
int err;

- dbg_gen("xattr '%s', ino %lu ('%.*s')", name,
- host->i_ino, dentry->d_name.len, dentry->d_name.name);
+ dbg_gen("xattr '%s', ino %lu ('%pd')", name, host->i_ino, dentry);
ubifs_assert(mutex_is_locked(&host->i_mutex));

err = check_namespace(&nm);
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 3b8aeec..52a09dc 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -880,6 +880,18 @@ static char *uuid_string(char *buf, char *end, const u8 *addr,
return string(buf, end, uuid, spec);
}

+static char *dname_string(char *buf, char *end, struct dentry *d,
+ struct printf_spec spec)
+{
+ char *res;
+ spin_lock(&d->d_lock);
+ if (spec.precision > d->d_name.len)
+ spec.precision = d->d_name.len;
+ res = string(buf, end, d->d_name.name, spec);
+ spin_unlock(&d->d_lock);
+ return res;
+}
+
/*
* Show a '%p' thing. A kernel extension is that the '%p' is followed
* by an extra set of alphanumeric characters that are extended format
@@ -915,6 +927,8 @@ static char *uuid_string(char *buf, char *end, const u8 *addr,
* [0][1][2][3]-[4][5]-[6][7]-[8][9]-[10][11][12][13][14][15]
* little endian output byte order is:
* [3][2][1][0]-[5][4]-[7][6]-[8][9]-[10][11][12][13][14][15]
+ * - 'd' For dentry name. NOTE: don't use under dentry->d_lock, there
+ * you can safely use ->d_name.name instead.
*
* Note: The difference between 'S' and 'F' is that on ia64 and ppc64
* function pointers are really function descriptors, which contain a
@@ -958,6 +972,8 @@ static char *pointer(const char *fmt, char *buf, char *end, void *ptr,
break;
case 'U':
return uuid_string(buf, end, ptr, spec, fmt);
+ case 'd':
+ return dname_string(buf, end, ptr, spec);
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
@@ -1191,6 +1207,7 @@ qualifier:
* http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-00
* %pU[bBlL] print a UUID/GUID in big or little endian using lower or upper
* case.
+ * %pd print dentry name
* %n is ignored
*
* The return value is the number of characters which would


2010-02-01 22:34:13

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Mon, Feb 01, 2010 at 10:25:11PM +0000, Al Viro wrote:
> There's a lot of places doing printks with ->d_name of various
> dentries. Unfortunately, as often as not they are b0rken due to races
> with rename().
>
> I propose to add a new format - %pd. It would print dentry name.
> However, unlike everything else in vsnprintf, it would NOT be locking-agnostic.
> It would grab and release dentry->d_lock. And yes, I hate that as much as
> anyone. I don't see any sane alternative.
>
> Patch below implements it and fixes some obvious races in ocfs2
> and ubifs by switch to that sucker. There are many more places of
> that kind and a _lot_ of those are racy. Adding ->d_lock in every caller
> is not a good solution, IMO...
>
> Rules are:
> * don't use %pd under dentry->d_lock, use dentry->d_name.name instead; in
> that case it *is* safe. Incidentally, ->d_lock isn't held a lot.
> * if we don't hold dentry->d_lock, feel free to use %pd....,dentry
>
> Comments?

PS: if it ends up going in, the usual S-o-b applies

2010-02-01 22:37:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name



On Mon, 1 Feb 2010, Al Viro wrote:
>
> I propose to add a new format - %pd. It would print dentry name.
> However, unlike everything else in vsnprintf, it would NOT be locking-agnostic.
> It would grab and release dentry->d_lock. And yes, I hate that as much as
> anyone. I don't see any sane alternative.

So I do hate it a lot too, and one of the main reasons I hate it is that
that format is then subtly but horribly broken inside sections that
already hold the lock.

And I can hear you say "What *subtle*? It's an instant deadlock!", but the
most likely reason for such a printk would be some debug statement etc
that would seldom/ever trigger - like a BUG_ON or whatever. And that "we
take the lock while printing" would thus become a rather nasty "machine
died silently" issue.

So I _really_ hate taking locks in the printk paths. We've had serious
problems with that in the past. It has made for some very annoying issues.

> * don't use %pd under dentry->d_lock, use dentry->d_name.name instead; in
> that case it *is* safe. Incidentally, ->d_lock isn't held a lot.

I realize we can just call it a rule, and yes, d_lock is held much less
than something like console_lock etc that we've had ABBA issues with, but
still..

> Comments?

Quite frankly, I'd _much_ rather see something like just always freeing
the dentry names (when they aren't inlined) using RCU. The VFS layer quite
possibly would want to do that anyway at some point (eg Nick's VFS
scalability patches), and then we could make it just a RCU read-lock or
whatever (interrupt disable, what-not) instead.

And I'm much happier with printk doing that kind of thing, and wouldn't
have issues with that kind of much weaker locking.

Linus

2010-02-01 22:45:29

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Mon, 2010-02-01 at 22:25 +0000, Al Viro wrote:
> There's a lot of places doing printks with ->d_name of various
> dentries. Unfortunately, as often as not they are b0rken due to races
> with rename().
>
> I propose to add a new format - %pd. It would print dentry name.
> However, unlike everything else in vsnprintf, it would NOT be locking-agnostic.
> It would grab and release dentry->d_lock. And yes, I hate that as much as
> anyone. I don't see any sane alternative.
[]
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 3b8aeec..52a09dc 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -880,6 +880,18 @@ static char *uuid_string(char *buf, char *end, const u8 *addr,
> return string(buf, end, uuid, spec);
> }
>
> +static char *dname_string(char *buf, char *end, struct dentry *d,
> + struct printf_spec spec)

Maybe there's value in passing fmt here to allow
printing of the various dentry fields?

"%pd<foo>"

for things like:

pr_debug("ds_read(socket %d)\n", iminor(file->f_path.dentry->d_inode));
dprintk("%s: dentry = %s/%s, cookie = %Lu\n", __func__,
dentry->d_parent->d_name.name,
dentry->d_name.name,


2010-02-01 23:18:52

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Mon, Feb 01, 2010 at 02:37:32PM -0800, Linus Torvalds wrote:

> > * don't use %pd under dentry->d_lock, use dentry->d_name.name instead; in
> > that case it *is* safe. Incidentally, ->d_lock isn't held a lot.
>
> I realize we can just call it a rule, and yes, d_lock is held much less
> than something like console_lock etc that we've had ABBA issues with, but
> still..

> Quite frankly, I'd _much_ rather see something like just always freeing
> the dentry names (when they aren't inlined) using RCU. The VFS layer quite
> possibly would want to do that anyway at some point (eg Nick's VFS
> scalability patches), and then we could make it just a RCU read-lock or
> whatever (interrupt disable, what-not) instead.
>
> And I'm much happier with printk doing that kind of thing, and wouldn't
> have issues with that kind of much weaker locking.

Ehh... RCU will save you from stepping on freed memory, but it still will
leave the joy of half-updated string with length out of sync with it, etc.
We probably can get away with that, but we'll have to be a lot more careful
with the order of updating these suckers in d_move_locked et.al.

I don't know... Note that if we end up adding something extra to struct
dentry, we might as well just add *another* spinlock, taken only under
->d_lock and only in two places in dcache.c that change d_name. That kind
of thing is trivial to enforce (just grep over the tree once in a while)
and if it shares the cacheline with d_lock, we shouldn't get any real overhead
in d_move()/d_materialise_unique(). I'm not particulary fond of that variant,
but it's at least guaranteed to be devoid of subtleties.

If RCU folks can come up with a sane suggestions that would be robust and
wouldn't bloat dentry - sure, I'm all for it. If not...

2010-02-02 01:06:55

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Mon, Feb 01, 2010 at 11:18:47PM +0000, Al Viro wrote:

> Ehh... RCU will save you from stepping on freed memory, but it still will
> leave the joy of half-updated string with length out of sync with it, etc.
> We probably can get away with that, but we'll have to be a lot more careful
> with the order of updating these suckers in d_move_locked et.al.
>
> I don't know... Note that if we end up adding something extra to struct
> dentry, we might as well just add *another* spinlock, taken only under
> ->d_lock and only in two places in dcache.c that change d_name. That kind
> of thing is trivial to enforce (just grep over the tree once in a while)
> and if it shares the cacheline with d_lock, we shouldn't get any real overhead
> in d_move()/d_materialise_unique(). I'm not particulary fond of that variant,
> but it's at least guaranteed to be devoid of subtleties.
>
> If RCU folks can come up with a sane suggestions that would be robust and
> wouldn't bloat dentry - sure, I'm all for it. If not...

As the matter of fact, there's just *one* place that has any business [*]
changing ->d_name contents of dentry that might be visible to somebody
else. fs/dcache.c::switch_names().

So a very brute-force approach would be to add a new spinlock to dentry and
have switch_names() grab it on dentry and target and drop when we are done,
dname_string() grab it around the call of string() and pull the guts out
through the nose to anyone who as much as mentions that lock outside of
fs/dcache.c:switch_names() and lib/vsprintf.c:dname_string().

Again, I'd love to see something more elegant; this variant won't add any
contention and if we place the lock next to d_lock we won't get any cacheline
bouncing either (we'd just taken ->d_lock on both dentries), but it's
rather ugly way to deal with the problem. I mean, a spinlock just for the
needs of debugging printks? Yuck.

BTW, speaking of ->d_lock, dget_parent() is abused in a bunch of places.
I'm going through review of ->d_parent and ->d_name uses; will post
the results when it's done...

[*] there's also !@$#!@#!@# {ncp,smb}_fill_cache() that does change of
letters' case in ->d_name; no locking whatsoever in there, luckily for
that crap the callers hold i_mutex on parent, so they get exclusion with
potential callers of d_move(). Bad Idea All Around(tm), but irrelevant
for our purposes.

2010-02-02 04:22:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name



On Mon, 1 Feb 2010, Al Viro wrote:
>
> Ehh... RCU will save you from stepping on freed memory, but it still will
> leave the joy of half-updated string with length out of sync with it, etc.

Sure. But do we care?

Every printk _should_ be on a dentry that we have a reference to, anyway.
Otherwise, how did we find it?

So yeah, then there is the race with rename(), but (a) it's not going to
happen, (b) if we race with rename there is no "one" correct solution
anyway, so (c) what we really want to protect against is the printk
causing problems (like an oops from DEBUG_PAGEALLOC and freeing the old
name).

> We probably can get away with that, but we'll have to be a lot more careful
> with the order of updating these suckers in d_move_locked et.al.

I wouldn't worry about it too much. So what if we get a screwed up name?
If we use "%.*s" to print the name, we know that we won't overstep the old
name even if the NUL termination somehow went away (because we're busy
copying a new, longer, name over it or whatever).

So I would not worry too much about sometimes showing a nonsensical name
that may be the result of a concurrent rename. I'd only worry about it not
oopsing or stepping on memory that hasn't had _either_ the old or the new
name in it.

That's why I think an RCU solution should be sufficient..

Linus

2010-02-02 05:00:40

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Mon, Feb 01, 2010 at 08:22:24PM -0800, Linus Torvalds wrote:

> > We probably can get away with that, but we'll have to be a lot more careful
> > with the order of updating these suckers in d_move_locked et.al.
>
> I wouldn't worry about it too much. So what if we get a screwed up name?
> If we use "%.*s" to print the name, we know that we won't overstep the old
> name even if the NUL termination somehow went away (because we're busy
> copying a new, longer, name over it or whatever).

Actually, I'm not sure. We can get hit by a switch to inline name with
length still being that from a long earlier name. And inline name is
in the end of struct dentry, so we could end up stepping off the end
of page. Note that existing d_alloc() does put NUL in d_iname for a short
name, but it won't clean the end of array, so overwrite during memcpy()
can open up a whole lot of PITA.

And yes, it's theoretical and ought to be hard to hit - the sky isn't falling.
OTOH, something like rename() vs. close() race as in ocfs2 might make it
not all that theoretical.

We probably can get away with being careful with barriers and order of
->len vs. ->name updates (and being a bit more careful about cleaning the
crap in d_alloc()), but it'll take an accurate analysis. I'd really like
to hear something from RCU folks, BTW; I still hope that it's one of the
more or less standard problems and "memory barriers" and "reinventing
the wheel" in the same sentence is something I'd rather avoid.

FWIW, speaking of fun printf extensions, there might be a completely
different way to deal with all that crap. %s modification doing kfree().
I.e. "get char * from argument list, print it as %s would, then kfree() it".
With something along the lines of
printk("... %<something>...", build_some_string(...));
as intended use, build_some_string() allocating a string and filling it.
Might or might not be a good idea, but it's interesting to consider. And
yes, of course it's a deadlock if you do that under any kind of a spinlock,
but that's the damn caller's responsibility - after all, they explicitly
call a function that does allocation. The real danger with that is that
somebody will use it with %s and get a leak from hell...

2010-02-02 05:55:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

Al Viro <[email protected]> writes:

> On Mon, Feb 01, 2010 at 11:18:47PM +0000, Al Viro wrote:
>
>> Ehh... RCU will save you from stepping on freed memory, but it still will
>> leave the joy of half-updated string with length out of sync with it, etc.
>> We probably can get away with that, but we'll have to be a lot more careful
>> with the order of updating these suckers in d_move_locked et.al.
>>
>> I don't know... Note that if we end up adding something extra to struct
>> dentry, we might as well just add *another* spinlock, taken only under
>> ->d_lock and only in two places in dcache.c that change d_name. That kind
>> of thing is trivial to enforce (just grep over the tree once in a while)
>> and if it shares the cacheline with d_lock, we shouldn't get any real overhead
>> in d_move()/d_materialise_unique(). I'm not particulary fond of that variant,
>> but it's at least guaranteed to be devoid of subtleties.
>>
>> If RCU folks can come up with a sane suggestions that would be robust and
>> wouldn't bloat dentry - sure, I'm all for it. If not...
>
> As the matter of fact, there's just *one* place that has any business [*]
> changing ->d_name contents of dentry that might be visible to somebody
> else. fs/dcache.c::switch_names().
>
> So a very brute-force approach would be to add a new spinlock to dentry and
> have switch_names() grab it on dentry and target and drop when we are done,
> dname_string() grab it around the call of string() and pull the guts out
> through the nose to anyone who as much as mentions that lock outside of
> fs/dcache.c:switch_names() and lib/vsprintf.c:dname_string().

We already have rename_lock, which is only take for write in d_move_locked.

I wonder if there is an instruction sequence that could guarantee that the
string copy is done atomically from the perspective of another cpu,
d_iname fits nicely on a single cache line so it should be possible.

That is a stronger guarantee than we need. All we really need is the
guarantee that a reader will see the string null terminator. dentries already
have rcu safe lifetimes.

Hmm.

We should be able to do:
struct qstr *name;
int len;
char buf[MAX_LEN + 1];
long seq;

do {
seq = read_seqbegin(&rename_lock);
rcu_read_lock();
name = rcu_dereference(dentry->d_name.name);
len = dentry->d_name.len;
if (read_seqretry(&rename_lock, seq)
continue;
if (len > MAX_LEN)
len = MAX_LEN;
memcpy(buf, name, len);
buf[len] = '\0';
rcu_read_unlock();
} while (read_seqretry(&rename_lock, seq));

I don't know if rcu bits are actually necessary as we are using the seqlock for
all of the heavy lifting. In particular name and len are guaranteed to be consistent
with each other because of the seqlock, and the seqlock also guarantees that we will
not have an inconsistent state.

Eric

2010-02-02 06:42:37

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Tue, Feb 02, 2010 at 05:00:37AM +0000, Al Viro wrote:
> On Mon, Feb 01, 2010 at 08:22:24PM -0800, Linus Torvalds wrote:
>
> > > We probably can get away with that, but we'll have to be a lot more careful
> > > with the order of updating these suckers in d_move_locked et.al.
> >
> > I wouldn't worry about it too much. So what if we get a screwed up name?
> > If we use "%.*s" to print the name, we know that we won't overstep the old
> > name even if the NUL termination somehow went away (because we're busy
> > copying a new, longer, name over it or whatever).
>
> Actually, I'm not sure. We can get hit by a switch to inline name with
> length still being that from a long earlier name. And inline name is
> in the end of struct dentry, so we could end up stepping off the end
> of page. Note that existing d_alloc() does put NUL in d_iname for a short
> name, but it won't clean the end of array, so overwrite during memcpy()
> can open up a whole lot of PITA.
>
> And yes, it's theoretical and ought to be hard to hit - the sky isn't falling.
> OTOH, something like rename() vs. close() race as in ocfs2 might make it
> not all that theoretical.
>
> We probably can get away with being careful with barriers and order of
> ->len vs. ->name updates (and being a bit more careful about cleaning the
> crap in d_alloc()), but it'll take an accurate analysis. I'd really like
> to hear something from RCU folks, BTW; I still hope that it's one of the
> more or less standard problems and "memory barriers" and "reinventing
> the wheel" in the same sentence is something I'd rather avoid.

I ended up having to use a seqlock to do name comparison without locks
(and without holding references for that matter, just RCU). However
name comparison is obviously a lot more critical because you can't
ignore races, so you might be able to do something simpler.

But I can't see a good way to do it completely just with RCU and memory
ordering. You could get multiple renames in there, so no matter the
ordering, I think you can get a mismatched len,name ptr tuple.

Could do something really awful by checking ksize or DNAME_INLINE_LEN
of your name pointer. This ensures you don't hit random memory. Would
require more work to avoid leaking kernel memory. Seems really nasty
though.

Seqlock should work nicely, but it requires some additions to core code
so I don't think it is justified unless it comes with other core
improvements (like scalability work).


> FWIW, speaking of fun printf extensions, there might be a completely
> different way to deal with all that crap. %s modification doing kfree().
> I.e. "get char * from argument list, print it as %s would, then kfree() it".
> With something along the lines of
> printk("... %<something>...", build_some_string(...));
> as intended use, build_some_string() allocating a string and filling it.
> Might or might not be a good idea, but it's interesting to consider. And
> yes, of course it's a deadlock if you do that under any kind of a spinlock,
> but that's the damn caller's responsibility - after all, they explicitly
> call a function that does allocation. The real danger with that is that
> somebody will use it with %s and get a leak from hell...

Might not be a bad idea to have a kstrdup type helper that does the
right locking. But does it warrant a fancy new printk conversion?

2010-02-02 06:53:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Mon, Feb 01, 2010 at 11:18:47PM +0000, Al Viro wrote:
> On Mon, Feb 01, 2010 at 02:37:32PM -0800, Linus Torvalds wrote:
>
> > > * don't use %pd under dentry->d_lock, use dentry->d_name.name instead; in
> > > that case it *is* safe. Incidentally, ->d_lock isn't held a lot.
> >
> > I realize we can just call it a rule, and yes, d_lock is held much less
> > than something like console_lock etc that we've had ABBA issues with, but
> > still..
>
> > Quite frankly, I'd _much_ rather see something like just always freeing
> > the dentry names (when they aren't inlined) using RCU. The VFS layer quite
> > possibly would want to do that anyway at some point (eg Nick's VFS
> > scalability patches), and then we could make it just a RCU read-lock or
> > whatever (interrupt disable, what-not) instead.
> >
> > And I'm much happier with printk doing that kind of thing, and wouldn't
> > have issues with that kind of much weaker locking.
>
> Ehh... RCU will save you from stepping on freed memory, but it still will
> leave the joy of half-updated string with length out of sync with it, etc.
> We probably can get away with that, but we'll have to be a lot more careful
> with the order of updating these suckers in d_move_locked et.al.
>
> I don't know... Note that if we end up adding something extra to struct
> dentry, we might as well just add *another* spinlock, taken only under
> ->d_lock and only in two places in dcache.c that change d_name. That kind
> of thing is trivial to enforce (just grep over the tree once in a while)
> and if it shares the cacheline with d_lock, we shouldn't get any real overhead
> in d_move()/d_materialise_unique(). I'm not particulary fond of that variant,
> but it's at least guaranteed to be devoid of subtleties.
>
> If RCU folks can come up with a sane suggestions that would be robust and
> wouldn't bloat dentry - sure, I'm all for it. If not...

Here is an approximation that might inspire someone to come up with a
real solution.

One approach would be to store the name length with the name, so that
struct qstr loses the "len" field, and so that its "name" field points
to a struct that has a "len" field followed by an array of const
unsigned char. That way, the name and length are closely associated.
When you pick up a struct qstr's "name" pointer, you are guaranteed to
get a length that matches the name.

Unfortunately:

o In theory, this leaves the length of the dentry unchanged, but
alignment is a problem on 64-bit systems. Also, the long names
gain an extra four bytes.

o If you get a pointer to the d_iname small-name field, rename
might still change the name out from under you. This could in
theory be fixed by refusing to re-use the d_iname field until
an RCU grace period had elapsed (using an external structure
instead). In practice, not sure if this is really a reasonable
approach.

Thoughts?

Thanx, Paul

2010-02-02 07:09:17

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Mon, Feb 01, 2010 at 10:53:41PM -0800, Paul E. McKenney wrote:

> Here is an approximation that might inspire someone to come up with a
> real solution.
>
> One approach would be to store the name length with the name, so that
> struct qstr loses the "len" field, and so that its "name" field points
> to a struct that has a "len" field followed by an array of const
> unsigned char. That way, the name and length are closely associated.
> When you pick up a struct qstr's "name" pointer, you are guaranteed to
> get a length that matches the name.
>
> Unfortunately:
>
> o In theory, this leaves the length of the dentry unchanged, but
> alignment is a problem on 64-bit systems. Also, the long names
> gain an extra four bytes.

That one is not a big deal.

> o If you get a pointer to the d_iname small-name field, rename
> might still change the name out from under you. This could in
> theory be fixed by refusing to re-use the d_iname field until
> an RCU grace period had elapsed (using an external structure
> instead). In practice, not sure if this is really a reasonable
> approach.

That, OTOH, is - most of dentries use inline name and external one is
really a rarely used fallback. Making it a common case isn't nice.

There's another practical problem - a lot of code uses qstr fields and
patch will be painful; I couldn't care less about the out-of-tree code,
but it's a flagday change and in-tree patch size is not something to
sneeze at - I've been crawling through all that code for the last couple
of days and there's a lot of it.

Trying to play with seqlock-based solutions sounds more promising; I've
missed it completely and I'm half-asleep right now, so I'll try to take
a look at that after I get some sleep.

2010-02-02 13:32:33

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Tue, Feb 02, 2010 at 07:09:08AM +0000, Al Viro wrote:
> On Mon, Feb 01, 2010 at 10:53:41PM -0800, Paul E. McKenney wrote:
>
> > Here is an approximation that might inspire someone to come up with a
> > real solution.
> >
> > One approach would be to store the name length with the name, so that
> > struct qstr loses the "len" field, and so that its "name" field points
> > to a struct that has a "len" field followed by an array of const
> > unsigned char. That way, the name and length are closely associated.
> > When you pick up a struct qstr's "name" pointer, you are guaranteed to
> > get a length that matches the name.
> >
> > Unfortunately:
> >
> > o In theory, this leaves the length of the dentry unchanged, but
> > alignment is a problem on 64-bit systems. Also, the long names
> > gain an extra four bytes.
>
> That one is not a big deal.
>
> > o If you get a pointer to the d_iname small-name field, rename
> > might still change the name out from under you. This could in
> > theory be fixed by refusing to re-use the d_iname field until
> > an RCU grace period had elapsed (using an external structure
> > instead). In practice, not sure if this is really a reasonable
> > approach.
>
> That, OTOH, is - most of dentries use inline name and external one is
> really a rarely used fallback. Making it a common case isn't nice.
>
> There's another practical problem - a lot of code uses qstr fields and
> patch will be painful; I couldn't care less about the out-of-tree code,
> but it's a flagday change and in-tree patch size is not something to
> sneeze at - I've been crawling through all that code for the last couple
> of days and there's a lot of it.

How about doing this:

struct qstr {
- const unsigned char *name;
+ const unsigned char name[0];
}

struct dentry {
- struct qstr d_name;
+ struct qstr *d_name;
- unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
+ union {
+ struct qstr d_iname;
+ char pad[DNAME_INLINE_LEN_MIN];
+ };
}

Doesn't increase the size of struct dentry, and puts the hash and len
with the name. Increases long name allocations by 8 bytes each.

I think reusing the d_iname is OK. As long as we always limit the
number of characters printed to the 'len' element, we should never get
an overrun. At worst, we get a mixture of the previous name and the
next name ... and that's a significant canary in itself.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2010-02-02 15:57:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name



On Tue, 2 Feb 2010, Matthew Wilcox wrote:
>
> How about doing this:
>
> struct qstr {
> - const unsigned char *name;
> + const unsigned char name[0];
> }
>
> struct dentry {
> - struct qstr d_name;
> + struct qstr *d_name;
> - unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
> + union {
> + struct qstr d_iname;
> + char pad[DNAME_INLINE_LEN_MIN];
> + };
> }
>
> Doesn't increase the size of struct dentry, and puts the hash and len
> with the name. Increases long name allocations by 8 bytes each.

Conceptually nice, but in practice that's absolutely horrible.

Why? Because now the dentry lookup logic has to follow an additional
pointer just to verify the hash and the length of the name. That's some of
the hottest code we have, and the _last_ thing we want is another pointer
dereference and cache access in the path that looks up the dentry hash
chains.

Or am I missing something? I didn't look at the code.

Linus

2010-02-02 16:13:36

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Tue, Feb 02, 2010 at 07:56:36AM -0800, Linus Torvalds wrote:
> On Tue, 2 Feb 2010, Matthew Wilcox wrote:
> >
> > How about doing this:
> >
> > struct qstr {
> > - const unsigned char *name;
> > + const unsigned char name[0];
> > }
> >
> > struct dentry {
> > - struct qstr d_name;
> > + struct qstr *d_name;
> > - unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
> > + union {
> > + struct qstr d_iname;
> > + char pad[DNAME_INLINE_LEN_MIN];
> > + };
> > }
> >
> > Doesn't increase the size of struct dentry, and puts the hash and len
> > with the name. Increases long name allocations by 8 bytes each.
>
> Conceptually nice, but in practice that's absolutely horrible.
>
> Why? Because now the dentry lookup logic has to follow an additional
> pointer just to verify the hash and the length of the name. That's some of
> the hottest code we have, and the _last_ thing we want is another pointer
> dereference and cache access in the path that looks up the dentry hash
> chains.

I'd thought it was in the same cacheline ... but that's not generally
true since dentries are so large. And the d_name is at the other end
of the dentry, so they're guaranteed to be in different cachelines. Bum.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2010-02-02 16:43:55

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Tue, Feb 02, 2010 at 06:32:31AM -0700, Matthew Wilcox wrote:
> How about doing this:
>
> struct qstr {
> - const unsigned char *name;
> + const unsigned char name[0];
> }
>
> struct dentry {
> - struct qstr d_name;
> + struct qstr *d_name;
> - unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
> + union {
> + struct qstr d_iname;
> + char pad[DNAME_INLINE_LEN_MIN];
> + };
> }
>
> Doesn't increase the size of struct dentry, and puts the hash and len
> with the name. Increases long name allocations by 8 bytes each.
>
> I think reusing the d_iname is OK. As long as we always limit the
> number of characters printed to the 'len' element, we should never get
> an overrun. At worst, we get a mixture of the previous name and the
> next name ... and that's a significant canary in itself.

You are creating an extra deref in normal case. Inline names are common.
Putting len and hash with the name probably not a win - most of the time
you don't look at actual characters and rely on mismatches in other
components to skip the candidate during search.

2010-02-02 17:02:10

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Mon, Feb 01, 2010 at 09:55:42PM -0800, Eric W. Biederman wrote:

> We already have rename_lock, which is only take for write in d_move_locked.
>
> I wonder if there is an instruction sequence that could guarantee that the
> string copy is done atomically from the perspective of another cpu,
> d_iname fits nicely on a single cache line so it should be possible.
>
> That is a stronger guarantee than we need. All we really need is the
> guarantee that a reader will see the string null terminator. dentries already
> have rcu safe lifetimes.
>
> Hmm.
>
> We should be able to do:
> struct qstr *name;
> int len;
> char buf[MAX_LEN + 1];
> long seq;
>
> do {
> seq = read_seqbegin(&rename_lock);
> rcu_read_lock();
> name = rcu_dereference(dentry->d_name.name);
> len = dentry->d_name.len;
> if (read_seqretry(&rename_lock, seq)
> continue;
> if (len > MAX_LEN)
> len = MAX_LEN;
> memcpy(buf, name, len);
> buf[len] = '\0';
> rcu_read_unlock();
> } while (read_seqretry(&rename_lock, seq));

Actually, WTF do we bother? seqlock writer grabs embedded spinlock.
So can we, since *that* is far more narrow than ->d_lock and it's
static, so it's not like somebody outside of dcache.c can decide
to grab.

We could just inline string() into dname_string() and take the entire
thing to fs/dcache.c. And protect it either with direct locking
of rename_lock.lock or another seq_writelock; contention is not
an issue, since if printk guts are your hotpath you are already
FUBAR.

Something like that (completely untested):

diff --git a/fs/dcache.c b/fs/dcache.c
index 953173a..a4d30bc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1703,6 +1703,48 @@ void d_move(struct dentry * dentry, struct dentry * target)
spin_unlock(&dcache_lock);
}

+/*
+ * for vsprintf use only.
+ */
+char *dname_string(char *buf, char *end, struct dentry *d,
+ int precision, int width, int left_align)
+{
+ int len, i;
+ const unsigned char *s;
+
+ write_seqlock(&rename_lock);
+ if (precision > d->d_name.len)
+ precision = d->d_name.len;
+
+ s = d->d_name.name;
+ if ((unsigned long)s < PAGE_SIZE)
+ s = "(null)";
+
+ len = strnlen(s, precision);
+
+ if (!left_align) {
+ while (len < width--) {
+ if (buf < end)
+ *buf = ' ';
+ ++buf;
+ }
+ }
+ for (i = 0; i < len; ++i) {
+ if (buf < end)
+ *buf = *s;
+ ++buf; ++s;
+ }
+
+ while (len < width--) {
+ if (buf < end)
+ *buf = ' ';
+ ++buf;
+ }
+
+ write_sequnlock(&rename_lock);
+ return buf;
+}
+
/**
* d_ancestor - search for an ancestor
* @p1: ancestor dentry
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 30b93b2..2dc286a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -379,5 +379,6 @@ extern struct vfsmount *lookup_mnt(struct path *);
extern struct dentry *lookup_create(struct nameidata *nd, int is_dir);

extern int sysctl_vfs_cache_pressure;
+extern char *dname_string(char *, char *, struct dentry *, int, int, int);

#endif /* __LINUX_DCACHE_H */
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 3b8aeec..ab0e40a 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -915,6 +915,8 @@ static char *uuid_string(char *buf, char *end, const u8 *addr,
* [0][1][2][3]-[4][5]-[6][7]-[8][9]-[10][11][12][13][14][15]
* little endian output byte order is:
* [3][2][1][0]-[5][4]-[7][6]-[8][9]-[10][11][12][13][14][15]
+ * - 'd' For dentry name. NOTE: don't use under dentry->d_lock, there
+ * you can safely use ->d_name.name instead.
*
* Note: The difference between 'S' and 'F' is that on ia64 and ppc64
* function pointers are really function descriptors, which contain a
@@ -958,6 +960,9 @@ static char *pointer(const char *fmt, char *buf, char *end, void *ptr,
break;
case 'U':
return uuid_string(buf, end, ptr, spec, fmt);
+ case 'd':
+ return dname_string(buf, end, ptr, spec.precision,
+ spec.field_width, spec.flags & LEFT);
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
@@ -1191,6 +1196,7 @@ qualifier:
* http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-00
* %pU[bBlL] print a UUID/GUID in big or little endian using lower or upper
* case.
+ * %pd print dentry name
* %n is ignored
*
* The return value is the number of characters which would

2010-02-02 18:10:13

by Olivier Galibert

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Tue, Feb 02, 2010 at 05:01:57PM +0000, Al Viro wrote:
> + * - 'd' For dentry name. NOTE: don't use under dentry->d_lock, there
> + * you can safely use ->d_name.name instead.

You probably can get rid of the caveat, then, right? "Don't use
d_name.name in printk" is easier to grep for :-)

OG.

2010-02-02 19:20:08

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

Al Viro <[email protected]> writes:

> On Mon, Feb 01, 2010 at 09:55:42PM -0800, Eric W. Biederman wrote:
>
>> We already have rename_lock, which is only take for write in d_move_locked.
>>
>> I wonder if there is an instruction sequence that could guarantee that the
>> string copy is done atomically from the perspective of another cpu,
>> d_iname fits nicely on a single cache line so it should be possible.
>>
>> That is a stronger guarantee than we need. All we really need is the
>> guarantee that a reader will see the string null terminator. dentries already
>> have rcu safe lifetimes.
>>
>> Hmm.
>>
>> We should be able to do:
>> struct qstr *name;
>> int len;
>> char buf[MAX_LEN + 1];
>> long seq;
>>
>> do {
>> seq = read_seqbegin(&rename_lock);
>> rcu_read_lock();
>> name = rcu_dereference(dentry->d_name.name);
>> len = dentry->d_name.len;
>> if (read_seqretry(&rename_lock, seq)
>> continue;
>> if (len > MAX_LEN)
>> len = MAX_LEN;
>> memcpy(buf, name, len);
>> buf[len] = '\0';
>> rcu_read_unlock();
>> } while (read_seqretry(&rename_lock, seq));
>
> Actually, WTF do we bother? seqlock writer grabs embedded spinlock.
> So can we, since *that* is far more narrow than ->d_lock and it's
> static, so it's not like somebody outside of dcache.c can decide
> to grab.
>
> We could just inline string() into dname_string() and take the entire
> thing to fs/dcache.c. And protect it either with direct locking
> of rename_lock.lock or another seq_writelock; contention is not
> an issue, since if printk guts are your hotpath you are already
> FUBAR.
>
> Something like that (completely untested):

If we are going to take a lock this seems as sane as any.

Do we want to honor oops_in_progress aka bust_spinlocks here?

Perhaps just:
if (oops_in_progress)
return buf;

To guarantee we get the rest of a panic message out of the kernel.

Eric


>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 953173a..a4d30bc 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1703,6 +1703,48 @@ void d_move(struct dentry * dentry, struct dentry * target)
> spin_unlock(&dcache_lock);
> }
>
> +/*
> + * for vsprintf use only.
> + */
> +char *dname_string(char *buf, char *end, struct dentry *d,
> + int precision, int width, int left_align)
> +{
> + int len, i;
> + const unsigned char *s;
> +
> + write_seqlock(&rename_lock);
> + if (precision > d->d_name.len)
> + precision = d->d_name.len;
> +
> + s = d->d_name.name;
> + if ((unsigned long)s < PAGE_SIZE)
> + s = "(null)";
> +
> + len = strnlen(s, precision);
> +
> + if (!left_align) {
> + while (len < width--) {
> + if (buf < end)
> + *buf = ' ';
> + ++buf;
> + }
> + }
> + for (i = 0; i < len; ++i) {
> + if (buf < end)
> + *buf = *s;
> + ++buf; ++s;
> + }
> +
> + while (len < width--) {
> + if (buf < end)
> + *buf = ' ';
> + ++buf;
> + }
> +
> + write_sequnlock(&rename_lock);
> + return buf;
> +}
> +
> /**
> * d_ancestor - search for an ancestor
> * @p1: ancestor dentry
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 30b93b2..2dc286a 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -379,5 +379,6 @@ extern struct vfsmount *lookup_mnt(struct path *);
> extern struct dentry *lookup_create(struct nameidata *nd, int is_dir);
>
> extern int sysctl_vfs_cache_pressure;
> +extern char *dname_string(char *, char *, struct dentry *, int, int, int);
>
> #endif /* __LINUX_DCACHE_H */
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 3b8aeec..ab0e40a 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -915,6 +915,8 @@ static char *uuid_string(char *buf, char *end, const u8 *addr,
> * [0][1][2][3]-[4][5]-[6][7]-[8][9]-[10][11][12][13][14][15]
> * little endian output byte order is:
> * [3][2][1][0]-[5][4]-[7][6]-[8][9]-[10][11][12][13][14][15]
> + * - 'd' For dentry name. NOTE: don't use under dentry->d_lock, there
> + * you can safely use ->d_name.name instead.
> *
> * Note: The difference between 'S' and 'F' is that on ia64 and ppc64
> * function pointers are really function descriptors, which contain a
> @@ -958,6 +960,9 @@ static char *pointer(const char *fmt, char *buf, char *end, void *ptr,
> break;
> case 'U':
> return uuid_string(buf, end, ptr, spec, fmt);
> + case 'd':
> + return dname_string(buf, end, ptr, spec.precision,
> + spec.field_width, spec.flags & LEFT);
> }
> spec.flags |= SMALL;
> if (spec.field_width == -1) {
> @@ -1191,6 +1196,7 @@ qualifier:
> * http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-00
> * %pU[bBlL] print a UUID/GUID in big or little endian using lower or upper
> * case.
> + * %pd print dentry name
> * %n is ignored
> *
> * The return value is the number of characters which would

2010-02-03 03:04:22

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Tue, Feb 02, 2010 at 11:19:52AM -0800, Eric W. Biederman wrote:

> If we are going to take a lock this seems as sane as any.
>
> Do we want to honor oops_in_progress aka bust_spinlocks here?
>
> Perhaps just:
> if (oops_in_progress)
> return buf;
>
> To guarantee we get the rest of a panic message out of the kernel.

Hmm... There's another fun issue - we would want local_irq_disable() /
local_irq_enable() in d_move_locked and local_irq_save/local_irq_restore()
in dname_string(), AFAICT.

OK, here's what I've got from moving in that direction. Folks, how does
that one look to you? I'm not too happy about explicit manipulations
with irq flags in there, so any suggestions would be welcome.

AFAICS, that should be enough to make the damn thing safe to use in any
context outside of d_move_locked() itself...

diff --git a/fs/dcache.c b/fs/dcache.c
index 953173a..cd9fc0b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1636,6 +1636,7 @@ static void d_move_locked(struct dentry * dentry, struct dentry * target)
if (!dentry->d_inode)
printk(KERN_WARNING "VFS: moving negative dcache entry\n");

+ local_irq_disable();
write_seqlock(&rename_lock);
/*
* XXXX: do we really need to take target->d_lock?
@@ -1685,6 +1686,7 @@ already_unhashed:
fsnotify_d_move(dentry);
spin_unlock(&dentry->d_lock);
write_sequnlock(&rename_lock);
+ local_irq_enable();
}

/**
@@ -1703,6 +1705,57 @@ void d_move(struct dentry * dentry, struct dentry * target)
spin_unlock(&dcache_lock);
}

+/*
+ * for vsprintf use only.
+ */
+char *dname_string(char *buf, char *end, struct dentry *d,
+ int precision, int width, int left_align)
+{
+ int len, i;
+ const unsigned char *s;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ if (!write_tryseqlock(&rename_lock)) {
+ if (oops_in_progress) {
+ local_irq_restore(flags);
+ return buf;
+ }
+ write_seqlock(&rename_lock);
+ }
+ if (precision > d->d_name.len)
+ precision = d->d_name.len;
+
+ s = d->d_name.name;
+ if ((unsigned long)s < PAGE_SIZE)
+ s = "(null)";
+
+ len = strnlen(s, precision);
+
+ if (!left_align) {
+ while (len < width--) {
+ if (buf < end)
+ *buf = ' ';
+ ++buf;
+ }
+ }
+ for (i = 0; i < len; ++i) {
+ if (buf < end)
+ *buf = *s;
+ ++buf; ++s;
+ }
+
+ while (len < width--) {
+ if (buf < end)
+ *buf = ' ';
+ ++buf;
+ }
+
+ write_sequnlock(&rename_lock);
+ local_irq_restore(flags);
+ return buf;
+}
+
/**
* d_ancestor - search for an ancestor
* @p1: ancestor dentry
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 30b93b2..2dc286a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -379,5 +379,6 @@ extern struct vfsmount *lookup_mnt(struct path *);
extern struct dentry *lookup_create(struct nameidata *nd, int is_dir);

extern int sysctl_vfs_cache_pressure;
+extern char *dname_string(char *, char *, struct dentry *, int, int, int);

#endif /* __LINUX_DCACHE_H */
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 3b8aeec..918edea 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -915,6 +915,7 @@ static char *uuid_string(char *buf, char *end, const u8 *addr,
* [0][1][2][3]-[4][5]-[6][7]-[8][9]-[10][11][12][13][14][15]
* little endian output byte order is:
* [3][2][1][0]-[5][4]-[7][6]-[8][9]-[10][11][12][13][14][15]
+ * - 'd' For dentry name.
*
* Note: The difference between 'S' and 'F' is that on ia64 and ppc64
* function pointers are really function descriptors, which contain a
@@ -958,6 +959,9 @@ static char *pointer(const char *fmt, char *buf, char *end, void *ptr,
break;
case 'U':
return uuid_string(buf, end, ptr, spec, fmt);
+ case 'd':
+ return dname_string(buf, end, ptr, spec.precision,
+ spec.field_width, spec.flags & LEFT);
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
@@ -1191,6 +1195,7 @@ qualifier:
* http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-00
* %pU[bBlL] print a UUID/GUID in big or little endian using lower or upper
* case.
+ * %pd print dentry name
* %n is ignored
*
* The return value is the number of characters which would

2010-02-04 04:54:01

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Wed, Feb 03, 2010 at 03:04:19AM +0000, Al Viro wrote:
> On Tue, Feb 02, 2010 at 11:19:52AM -0800, Eric W. Biederman wrote:
>
> > If we are going to take a lock this seems as sane as any.
> >
> > Do we want to honor oops_in_progress aka bust_spinlocks here?
> >
> > Perhaps just:
> > if (oops_in_progress)
> > return buf;
> >
> > To guarantee we get the rest of a panic message out of the kernel.
>
> Hmm... There's another fun issue - we would want local_irq_disable() /
> local_irq_enable() in d_move_locked and local_irq_save/local_irq_restore()
> in dname_string(), AFAICT.
>
> OK, here's what I've got from moving in that direction. Folks, how does
> that one look to you? I'm not too happy about explicit manipulations
> with irq flags in there, so any suggestions would be welcome.

Argh. No, it's not at all better. Moreover, even read_seqbegin variant
is b0rken if we ever do that under ->d_lock.

CPU1:A: grabs dentry->d_lock
CPU2:B: calls d_move_locked()
CPU2:B: grabs rename_lock
CPU2:B: spins on dentry->d_lock
CPU1:A: calls printk with %pd dentry
CPU1:A: spins waiting for rename_lock writer to release it

So much for that approach ;-/

2010-02-04 06:02:43

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Tue, Feb 02, 2010 at 05:36:31PM +1100, Nick Piggin wrote:

> I ended up having to use a seqlock to do name comparison without locks
> (and without holding references for that matter, just RCU). However
> name comparison is obviously a lot more critical because you can't
> ignore races, so you might be able to do something simpler.

Umm... Your ->d_seq, you mean? IIRC, that stuff landed too close to
do_filp_open() queue back then *and* had potential headache from hell
with vfsmount side of that business.

I've reread it now. Comments:
* mntput() is blocking. When refcount goes to 0, we get
dput() on root, possibly followed by actual fs shutdown. Very
much not an RCU fodder, even though most of the calls will be OK.
We can do a variant that would do atomic_dec_and_bail() instead ;-)
I.e. decrement atomically if greater than 1, bail out otherwise.
* why do you bail out on LOOKUP_PARENT? For that matter, why
do you do that so late?
* how does that interact with d_materialise_unique()? Sure,
you bail out on NFS anyway, but it's still not nice to leave relying just
on that.
* why can we access dentry->d_inode->i_op contents? Or dentry->d_op
one, for that matter...

BTW, I'd love to take the entire "last component" part of link_path_walk()
out into a separate function and away from link_path_walk(), leaving
basically just the LOOKUP_PARENT case in there. Price: trailing symlinks
need to be handled by an iterative loop in do_follow_link(). And
that actually ends up an improvement both in stack depth and in overall
code cleanup. Nothing like the horrors in do_filp_open(), TYVM (said
horrors had mostly gone away in #untested, BTW, and I'm going to move
that series to for-next shortly). However, we are still several prereqs
away from link_path_walk() split, so that'll have to wait a bit. In
any case, LOOKUP_PARENT is very much worth the first-class treatment...

2010-02-04 07:41:08

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Thu, Feb 04, 2010 at 06:02:37AM +0000, Al Viro wrote:
> On Tue, Feb 02, 2010 at 05:36:31PM +1100, Nick Piggin wrote:
>
> > I ended up having to use a seqlock to do name comparison without locks
> > (and without holding references for that matter, just RCU). However
> > name comparison is obviously a lot more critical because you can't
> > ignore races, so you might be able to do something simpler.
>
> Umm... Your ->d_seq, you mean? IIRC, that stuff landed too close to
> do_filp_open() queue back then *and* had potential headache from hell
> with vfsmount side of that business.
>
> I've reread it now. Comments:

OK. It's at the end of a big patch series and the path walk stuff
was proof of concept quality at best. In particular I was hoping
for a better way to share code.

That said, we can move pieces around if you are interested to
cherry pick things. I should rediff and resend the series.

> * mntput() is blocking. When refcount goes to 0, we get
> dput() on root, possibly followed by actual fs shutdown. Very
> much not an RCU fodder, even though most of the calls will be OK.
> We can do a variant that would do atomic_dec_and_bail() instead ;-)
> I.e. decrement atomically if greater than 1, bail out otherwise.

Good point. atomic_dec_and_bail seems like it would work nicely.


> * why do you bail out on LOOKUP_PARENT? For that matter, why
> do you do that so late?

Ah no good reason. Probably I was just trying to get some basic
testable cases working. I indeed hope to handle more cases and
bail out more gracefully (ie. attempt to resume walking from
last successful lookup rather than start the entire path again).


> * how does that interact with d_materialise_unique()? Sure,
> you bail out on NFS anyway, but it's still not nice to leave relying just
> on that.

Yeah it does need some more thinking about these cases. In fact,
it's even missing a lot of basic rcu_assign_pointer / rcu_deref
cases that are needed if we are to make lookups totally lock free.

For harder cases, I'm hoping we can get away with using ->d_seq
to avoid thinking too hard about ordering.

Like I said, it's very much "known buggy" / proof of concept stage.
But if you're happy with the basic idea, then I'm keen to get to
work on it.

Do you have opinions on the rest of the vfs scalability series?


> * why can we access dentry->d_inode->i_op contents? Or dentry->d_op
> one, for that matter...

Good point. Require an rcu_barrier in unregister_filesystem I guess.


> BTW, I'd love to take the entire "last component" part of link_path_walk()
> out into a separate function and away from link_path_walk(), leaving
> basically just the LOOKUP_PARENT case in there. Price: trailing symlinks
> need to be handled by an iterative loop in do_follow_link(). And
> that actually ends up an improvement both in stack depth and in overall
> code cleanup. Nothing like the horrors in do_filp_open(), TYVM (said
> horrors had mostly gone away in #untested, BTW, and I'm going to move
> that series to for-next shortly). However, we are still several prereqs
> away from link_path_walk() split, so that'll have to wait a bit. In
> any case, LOOKUP_PARENT is very much worth the first-class treatment...

I'll have to take a look at what you're doing.

Thanks,
Nick

2010-02-04 10:03:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Tue, Feb 02, 2010 at 07:09:08AM +0000, Al Viro wrote:
> On Mon, Feb 01, 2010 at 10:53:41PM -0800, Paul E. McKenney wrote:
>
> > Here is an approximation that might inspire someone to come up with a
> > real solution.
> >
> > One approach would be to store the name length with the name, so that
> > struct qstr loses the "len" field, and so that its "name" field points
> > to a struct that has a "len" field followed by an array of const
> > unsigned char. That way, the name and length are closely associated.
> > When you pick up a struct qstr's "name" pointer, you are guaranteed to
> > get a length that matches the name.
> >
> > Unfortunately:
> >
> > o In theory, this leaves the length of the dentry unchanged, but
> > alignment is a problem on 64-bit systems. Also, the long names
> > gain an extra four bytes.
>
> That one is not a big deal.

K.

> > o If you get a pointer to the d_iname small-name field, rename
> > might still change the name out from under you. This could in
> > theory be fixed by refusing to re-use the d_iname field until
> > an RCU grace period had elapsed (using an external structure
> > instead). In practice, not sure if this is really a reasonable
> > approach.
>
> That, OTOH, is - most of dentries use inline name and external one is
> really a rarely used fallback. Making it a common case isn't nice.

It is possible to move it back inline after a grace period, so that the
external name would be in use for only a few milliseconds after the
rename, but that of course adds more complexity.

> There's another practical problem - a lot of code uses qstr fields and
> patch will be painful; I couldn't care less about the out-of-tree code,
> but it's a flagday change and in-tree patch size is not something to
> sneeze at - I've been crawling through all that code for the last couple
> of days and there's a lot of it.

Indeed!!!

> Trying to play with seqlock-based solutions sounds more promising; I've
> missed it completely and I'm half-asleep right now, so I'll try to take
> a look at that after I get some sleep.

Certainly sounds worth a try.

Thanx, Paul

2010-02-04 10:03:02

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Mon, Feb 01, 2010 at 10:53:41PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 01, 2010 at 11:18:47PM +0000, Al Viro wrote:
> > On Mon, Feb 01, 2010 at 02:37:32PM -0800, Linus Torvalds wrote:
> >
> > > > * don't use %pd under dentry->d_lock, use dentry->d_name.name instead; in
> > > > that case it *is* safe. Incidentally, ->d_lock isn't held a lot.
> > >
> > > I realize we can just call it a rule, and yes, d_lock is held much less
> > > than something like console_lock etc that we've had ABBA issues with, but
> > > still..
> >
> > > Quite frankly, I'd _much_ rather see something like just always freeing
> > > the dentry names (when they aren't inlined) using RCU. The VFS layer quite
> > > possibly would want to do that anyway at some point (eg Nick's VFS
> > > scalability patches), and then we could make it just a RCU read-lock or
> > > whatever (interrupt disable, what-not) instead.
> > >
> > > And I'm much happier with printk doing that kind of thing, and wouldn't
> > > have issues with that kind of much weaker locking.
> >
> > Ehh... RCU will save you from stepping on freed memory, but it still will
> > leave the joy of half-updated string with length out of sync with it, etc.
> > We probably can get away with that, but we'll have to be a lot more careful
> > with the order of updating these suckers in d_move_locked et.al.
> >
> > I don't know... Note that if we end up adding something extra to struct
> > dentry, we might as well just add *another* spinlock, taken only under
> > ->d_lock and only in two places in dcache.c that change d_name. That kind
> > of thing is trivial to enforce (just grep over the tree once in a while)
> > and if it shares the cacheline with d_lock, we shouldn't get any real overhead
> > in d_move()/d_materialise_unique(). I'm not particulary fond of that variant,
> > but it's at least guaranteed to be devoid of subtleties.
> >
> > If RCU folks can come up with a sane suggestions that would be robust and
> > wouldn't bloat dentry - sure, I'm all for it. If not...
>
> Here is an approximation that might inspire someone to come up with a
> real solution.
>
> One approach would be to store the name length with the name, so that
> struct qstr loses the "len" field, and so that its "name" field points
> to a struct that has a "len" field followed by an array of const
> unsigned char. That way, the name and length are closely associated.
> When you pick up a struct qstr's "name" pointer, you are guaranteed to
> get a length that matches the name.
>
> Unfortunately:
>
> o In theory, this leaves the length of the dentry unchanged, but
> alignment is a problem on 64-bit systems. Also, the long names
> gain an extra four bytes.

And one fix for the alignment issue is to move the "hash" as well as the
"len" field to the struct containing the name. The size of the dentry
remains unchanged, as these two fields smiply move from the d_name qstr
to the small-name d_iname array (which becomes a structure containing
a hash, length, and char array). This approach guarantees that the
length, hash, and name are consistent.

One stupid question: why are the hash and length ints rather than shorts?
Doesn't the maximum filename length fit into a 16-bit short? In fact,
doesn't the maximum length of a full pathname fit into a 16-bit short?

Thanx, Paul

> o If you get a pointer to the d_iname small-name field, rename
> might still change the name out from under you. This could in
> theory be fixed by refusing to re-use the d_iname field until
> an RCU grace period had elapsed (using an external structure
> instead). In practice, not sure if this is really a reasonable
> approach.
>
> Thoughts?
>
> Thanx, Paul

2010-02-04 15:29:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name



On Tue, 2 Feb 2010, Paul E. McKenney wrote:
>
> One stupid question: why are the hash and length ints rather than shorts?
> Doesn't the maximum filename length fit into a 16-bit short? In fact,
> doesn't the maximum length of a full pathname fit into a 16-bit short?

Yes, the name length could easily be just 16 bits.

The hash, though, is a different matter. We actually want lots of bits to
spread out the dentries and 16 bits for hashing would be too small (on my
machine, the dentry cache hash table has half a million entries and takes
4MB of space - space I'll happily give it to keep the hash chains short).

So we need at least 20 bits (and probably more on big machines).

Now, we could decide that having just 16 bits for the name hash is enough,
because we do mix in the address of the 'parent' dentry, and we might
decide that that is worth a few bits (taking the number of total bits up
to enough to look up half a million entries)

We could also use bitfields, and give the name length say 10 bits, and 22
bits to the hash, which togethr with the extra bits from the parent
pointer might well work out fine.

It might be worth trying. But is playing that kind of game worth four
extra characters in the inline name? If it were to make the difference
between "core dentry fields fit in a cacheline" vs "needs two cachelines",
then maybe it would be worth it. But I don't think that's the case.

Linus

2010-02-04 16:02:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Thu, Feb 04, 2010 at 07:29:25AM -0800, Linus Torvalds wrote:
>
>
> On Tue, 2 Feb 2010, Paul E. McKenney wrote:
> >
> > One stupid question: why are the hash and length ints rather than shorts?
> > Doesn't the maximum filename length fit into a 16-bit short? In fact,
> > doesn't the maximum length of a full pathname fit into a 16-bit short?
>
> Yes, the name length could easily be just 16 bits.
>
> The hash, though, is a different matter. We actually want lots of bits to
> spread out the dentries and 16 bits for hashing would be too small (on my
> machine, the dentry cache hash table has half a million entries and takes
> 4MB of space - space I'll happily give it to keep the hash chains short).
>
> So we need at least 20 bits (and probably more on big machines).
>
> Now, we could decide that having just 16 bits for the name hash is enough,
> because we do mix in the address of the 'parent' dentry, and we might
> decide that that is worth a few bits (taking the number of total bits up
> to enough to look up half a million entries)
>
> We could also use bitfields, and give the name length say 10 bits, and 22
> bits to the hash, which togethr with the extra bits from the parent
> pointer might well work out fine.
>
> It might be worth trying. But is playing that kind of game worth four
> extra characters in the inline name? If it were to make the difference
> between "core dentry fields fit in a cacheline" vs "needs two cachelines",
> then maybe it would be worth it. But I don't think that's the case.

Ah, good point on the hash size. And given that DNAME_INLINE_LEN_MIN
is 40 characters on 32-bit systems and 32 characters on 64-bit systems,
I agree that while a four-character increase might be nice, it cannot be
said to be an emergency.

Thanx, Paul

2010-02-04 17:13:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name



On Thu, 4 Feb 2010, Paul E. McKenney wrote:
>
> Ah, good point on the hash size. And given that DNAME_INLINE_LEN_MIN
> is 40 characters on 32-bit systems and 32 characters on 64-bit systems,
> I agree that while a four-character increase might be nice, it cannot be
> said to be an emergency.

Well, what we _could_ do is to make the 'length' field be part of the name
itself, and just keep the hash separate. The hash (and parenthood) is what
we check most in the hot inner loop, and don't want to have any extra
indirection (or cache misses) for. The name length we check only later,
after we've done all other checks (and after we've gotten the spinlock,
which is the big thing).

So qstr->len is _not_ performance critical in the same way that qstr->hash
is.

So we might not save 4 bytes, but we _could_ try to move the length field
into the name with something like

struct qstr_name {
unsigned short len;
char name[];
};

struct qstr {
unsigned int hash;
const struct qstr_name *name;
};

but the problem then is one of alignment (ie that pointer generally wants
8-byte alignment on 64-bit architectures, so none of this _helps_ unless
we also move some other 4-byte field into the qstr (we could look at
making 'd_time' be a 32-bit entry, for example, and move it in there).

Or we could do really crazy things, and put the four first characters of
d_iname inside the qstr etc etc. It's just that it all gets totally
unnatural very quickly.

So 'struct dentry' is one of the most important data structures we have
(just because you easily end up with millions of them), but since we
already control its size by varying the inline name length, I doubt that
playing really fancy games is worth it.

Linus

2010-02-04 17:36:25

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Thu, Feb 04, 2010 at 09:13:18AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 4 Feb 2010, Paul E. McKenney wrote:
> >
> > Ah, good point on the hash size. And given that DNAME_INLINE_LEN_MIN
> > is 40 characters on 32-bit systems and 32 characters on 64-bit systems,
> > I agree that while a four-character increase might be nice, it cannot be
> > said to be an emergency.
>
> Well, what we _could_ do is to make the 'length' field be part of the name
> itself, and just keep the hash separate. The hash (and parenthood) is what
> we check most in the hot inner loop, and don't want to have any extra
> indirection (or cache misses) for. The name length we check only later,
> after we've done all other checks (and after we've gotten the spinlock,
> which is the big thing).
>
> So qstr->len is _not_ performance critical in the same way that qstr->hash
> is.

We could also try to put the hash chain in that sucker, copy d_parent in
there *and* put a pointer back to struct dentry in it. Then the walk
itself would go through those and we'd actually looked at the dentry
only once - in the end of it. Normally that thing would be just embedded
into dentry, with ability to allocate separately.

That might deal with lockless lookups if we did it right, but delayed
copying back into dentry and freeing of out-of-line copy (after d_move())
would still cause all sorts of fun.

The thing is, we have places where ->d_name.name uses rely on "I hold
i_mutex on parent, so this thing won't change or go away under me" and
that's actually the majority of code using ->d_name. All directory
operations.

How about doing that delayed work just before dropping i_mutex on parent?
There we definitely can sleep, etc., so if we have d_move mark dentry as
"got out-of-line hash chain+name+hash+len+d_parent_copy, want to collapse
it back into dentry" and do d_collapse_that_stuff(dentry) before the
matching drop of i_mutex...

It would be one hell of a patch size, probably, but it seems that the rest
of problems wouldn't be there... All such out-of-line structs would be
freed via RCU and never modified. And inline ones would be modified only
when
a) everyone who looks at hash chains already sees out-of-line one
b) i_mutex on parent is still held
They'd get out-of-line one copied into them, replace it in hash chains
and schedule freeing of out-of-line sucker.

The reason why I'm talking about copy of d_parent and not just taking the
field over there: we avoid messing with dentry refcounting, etc. that way,
assuming that this copy is never dereferenced (used only for comparisons
during dcache lookups) and doesn't contribute to d_count. Freeing dentries
themselves would be also RCU-delayed, of course.

Comments?

2010-02-07 16:34:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH][RFC] %pd - for printing dentry name

On Thu, Feb 04, 2010 at 05:36:09PM +0000, Al Viro wrote:
> On Thu, Feb 04, 2010 at 09:13:18AM -0800, Linus Torvalds wrote:
> >
> >
> > On Thu, 4 Feb 2010, Paul E. McKenney wrote:
> > >
> > > Ah, good point on the hash size. And given that DNAME_INLINE_LEN_MIN
> > > is 40 characters on 32-bit systems and 32 characters on 64-bit systems,
> > > I agree that while a four-character increase might be nice, it cannot be
> > > said to be an emergency.
> >
> > Well, what we _could_ do is to make the 'length' field be part of the name
> > itself, and just keep the hash separate. The hash (and parenthood) is what
> > we check most in the hot inner loop, and don't want to have any extra
> > indirection (or cache misses) for. The name length we check only later,
> > after we've done all other checks (and after we've gotten the spinlock,
> > which is the big thing).
> >
> > So qstr->len is _not_ performance critical in the same way that qstr->hash
> > is.
>
> We could also try to put the hash chain in that sucker, copy d_parent in
> there *and* put a pointer back to struct dentry in it. Then the walk
> itself would go through those and we'd actually looked at the dentry
> only once - in the end of it. Normally that thing would be just embedded
> into dentry, with ability to allocate separately.

Good point!!!

But wouldn't this also require that the permission bits
be in qstr as well, along with a flag indicating ACLs?

> That might deal with lockless lookups if we did it right, but delayed
> copying back into dentry and freeing of out-of-line copy (after d_move())
> would still cause all sorts of fun.
>
> The thing is, we have places where ->d_name.name uses rely on "I hold
> i_mutex on parent, so this thing won't change or go away under me" and
> that's actually the majority of code using ->d_name. All directory
> operations.
>
> How about doing that delayed work just before dropping i_mutex on parent?
> There we definitely can sleep, etc., so if we have d_move mark dentry as
> "got out-of-line hash chain+name+hash+len+d_parent_copy, want to collapse
> it back into dentry" and do d_collapse_that_stuff(dentry) before the
> matching drop of i_mutex...

This sounds like a good way to solve the problem of successive renames
of the same file -- the second rename would be unable to acquire i_mutex
until after the d_collapse_that_stuff() completed, right?

> It would be one hell of a patch size, probably, but it seems that the rest
> of problems wouldn't be there... All such out-of-line structs would be
> freed via RCU and never modified. And inline ones would be modified only
> when
> a) everyone who looks at hash chains already sees out-of-line one
> b) i_mutex on parent is still held
> They'd get out-of-line one copied into them, replace it in hash chains
> and schedule freeing of out-of-line sucker.

And during the time that the dentry is switching from out-of-line to
inline, it can safely be referenced by both, so no need for fancy
hash-chain traversal tactics.

> The reason why I'm talking about copy of d_parent and not just taking the
> field over there: we avoid messing with dentry refcounting, etc. that way,
> assuming that this copy is never dereferenced (used only for comparisons
> during dcache lookups) and doesn't contribute to d_count. Freeing dentries
> themselves would be also RCU-delayed, of course.
>
> Comments?

Looks pretty good at first glance!

Thanx, Paul