2015-07-31 17:47:50

by Hugh Dickins

[permalink] [raw]
Subject: v4.2-rc dcache regression, probably 75a6f82a0d10

I think there's something not quite right with the fs/dcache.c
commit 75a6f82a0d10 ("freeing unlinked file indefinitely delayed").

When running my old tmpfs swapping load (two repetitive make -j20
kernel builds, one on tmpfs, one on ext4 over loop over tmpfs file,
in limited memory with plenty of swapping; rm -rf of both trees
in between the builds): one of the builds, always the ext4 so far,
fails after several hours, one or another header file "No such file
or directory", but the file's there when I check the tree afterwards.

Sounds like a dcache problem, and 75a6f82a0d10 seemed the only
likely candidate, so I experimented with reverting it yesterday,
and ran successfully for 24 hours. That's a little too soon to
be sure (I've set another run going this morning), but I'd say
90% certain that is to blame, and thought I'd better alert you
sooner than later - you'll probably guess what's the matter
long before I get back to check today's run.

(I saw exactly the same symptom two months ago; but that was just
before you put in 2159184ea01e "d_walk() might skip too much",
which fixed it back then.)

Hugh


2015-07-31 17:59:51

by Dominique Martinet

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

Hi,

Hugh Dickins wrote on Fri, Jul 31, 2015:
> I think there's something not quite right with the fs/dcache.c
> commit 75a6f82a0d10 ("freeing unlinked file indefinitely delayed").
>
> When running my old tmpfs swapping load (two repetitive make -j20
> kernel builds, one on tmpfs, one on ext4 over loop over tmpfs file,
> in limited memory with plenty of swapping; rm -rf of both trees
> in between the builds): one of the builds, always the ext4 so far,
> fails after several hours, one or another header file "No such file
> or directory", but the file's there when I check the tree afterwards.
>
> Sounds like a dcache problem, and 75a6f82a0d10 seemed the only
> likely candidate, so I experimented with reverting it yesterday,
> and ran successfully for 24 hours. That's a little too soon to
> be sure (I've set another run going this morning), but I'd say
> 90% certain that is to blame, and thought I'd better alert you
> sooner than later - you'll probably guess what's the matter
> long before I get back to check today's run.

Actually sounds like my thread "Race condition introduced in 4bf46a27
VFS: Impose ordering on accesses of d_inode and d_flags" (in fsdevel
only)

It's WAY easier to reproduce over a 9P/virtio mount and with multiple
level of caches (memory barrier), I was able to track it down to
4bf46a272647d ("VFS: Impose ordering on accesses of d_inode and
d_flags") and debug a bit, but this isn't code I'm familiar with so
would appreciate if you could tell me what you think.

--
Dominique Martinet

2015-07-31 18:00:42

by J. Bruce Fields

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Fri, Jul 31, 2015 at 10:46:51AM -0700, Hugh Dickins wrote:
> I think there's something not quite right with the fs/dcache.c
> commit 75a6f82a0d10 ("freeing unlinked file indefinitely delayed").
>
> When running my old tmpfs swapping load (two repetitive make -j20
> kernel builds, one on tmpfs, one on ext4 over loop over tmpfs file,
> in limited memory with plenty of swapping; rm -rf of both trees
> in between the builds): one of the builds, always the ext4 so far,
> fails after several hours, one or another header file "No such file
> or directory", but the file's there when I check the tree afterwards.

That's weird.

In theory, if you're not running knfsd, and if nobody's doing
open_by_handle_at() calls, then you shouldn't have any
DCACHE_DISCONNECTED dentries on your system, so this patch shouldn't
affect behavior.

--b.

>
> Sounds like a dcache problem, and 75a6f82a0d10 seemed the only
> likely candidate, so I experimented with reverting it yesterday,
> and ran successfully for 24 hours. That's a little too soon to
> be sure (I've set another run going this morning), but I'd say
> 90% certain that is to blame, and thought I'd better alert you
> sooner than later - you'll probably guess what's the matter
> long before I get back to check today's run.
>
> (I saw exactly the same symptom two months ago; but that was just
> before you put in 2159184ea01e "d_walk() might skip too much",
> which fixed it back then.)
>
> Hugh
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-07-31 18:42:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Fri, Jul 31, 2015 at 10:46 AM, Hugh Dickins <[email protected]> wrote:
>
> Sounds like a dcache problem, and 75a6f82a0d10 seemed the only
> likely candidate, so I experimented with reverting it yesterday,
> and ran successfully for 24 hours.

Hmm. Sounds odd. Are you running nfsd? That would explain why it
happens on ext4 but not tmpfs: ext4 has a get_parent method that can
get a disconnected entry, while tmpfs does not.

That said, your load doesn't sound like it would actually ever trigger
this, unless you just didn't mention that you also end up using that
filesystem over nfs on another machine.

So leave it running a while longer, but maybe it's 4bf46a272647 like
Dominique suspects. Although I don't see how that could trigger
anything either..

Linus

2015-07-31 19:43:28

by Hugh Dickins

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Fri, 31 Jul 2015, Linus Torvalds wrote:
> On Fri, Jul 31, 2015 at 10:46 AM, Hugh Dickins <[email protected]> wrote:
> >
> > Sounds like a dcache problem, and 75a6f82a0d10 seemed the only
> > likely candidate, so I experimented with reverting it yesterday,
> > and ran successfully for 24 hours.
>
> Hmm. Sounds odd. Are you running nfsd? That would explain why it
> happens on ext4 but not tmpfs: ext4 has a get_parent method that can
> get a disconnected entry, while tmpfs does not.
>
> That said, your load doesn't sound like it would actually ever trigger
> this, unless you just didn't mention that you also end up using that
> filesystem over nfs on another machine.

No, no nfsd nor any kind of networking filesystem stuff going on.
Right, I never looked to see what DCACHE_DISCONNECTED is actually
about, just rushed ahead and tried running with the revert.

>
> So leave it running a while longer, but maybe it's 4bf46a272647 like
> Dominique suspects. Although I don't see how that could trigger
> anything either..

I restarted with a slightly different version of the load this
morning, which has sometimes shown the issue more easily - I thought
it better to restart with a variant than persist with a run that
might have settled into a protected pattern. We'll see what that
shows later on.

It will indeed be weird and odd if it confirms that DCACHE_DISCONNECTED
revert is good. I agree that Dominique's 4bf46a272647 seems now more
likely, if still unlikely; but that was included in v4.1, and I saw
no problem with v4.1 once the rmap_walk() skip was fixed.

There may be some completely unrelated commit which alters the
timing enough to expose or mask whatever is the guilty commit.
Or something corrupting dentry->d_flags occasionally.

Hugh

2015-07-31 20:50:58

by Dominique Martinet

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

Hugh Dickins wrote on Fri, Jul 31, 2015:
> It will indeed be weird and odd if it confirms that DCACHE_DISCONNECTED
> revert is good. I agree that Dominique's 4bf46a272647 seems now more
> likely, if still unlikely; but that was included in v4.1, and I saw
> no problem with v4.1 once the rmap_walk() skip was fixed.

I think it could, actually, and that neither commits are actually bad --
just that they affect timing enough to raise an issue between d_delete
(I guess?) and link_path_walk (see last mail in other thread[1])

It's probably an old race that was very hard to hit because of cache
coherency.
Basically, before the wmb/rmb, the dentry was always updated closely to
its flags, so the other CPU would "usually" get both updates at the same
time; the barriers make it so the updates are split and it's possible to
get it, and would explain why I could pick 4bf46a2726 as "the one"


I'm not sure why the problem wouldn't arise on tmpfs though.

Hugh, could you try the reproducer I gave in the other thread[2] on both
filesystems maybe?
I need to let the thing run for a while, might need to tune params as
well. I was trying to fine tune cpu affinity with less threads but it's
not getting anywhere.

I'll also check if it's getting even easier to reproduce with
75a6f82a0d10 (or a recent kernel), who knows... How fast do you hit the
bug with the commit?


Thanks,
--
Dominique

[1] https://marc.info/?l=linux-fsdevel&m=143835651005259&w=2
[2] https://marc.info/?l=linux-fsdevel&m=143825706609188&w=2

2015-07-31 22:52:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Fri, Jul 31, 2015 at 1:50 PM, Dominique Martinet
<[email protected]> wrote:
>
> It's probably an old race that was very hard to hit because of cache
> coherency.
> Basically, before the wmb/rmb, the dentry was always updated closely to
> its flags, so the other CPU would "usually" get both updates at the same
> time; the barriers make it so the updates are split and it's possible to
> get it, and would explain why I could pick 4bf46a2726 as "the one"

I doubt that's it. wmb/rmb is actually a no-op on x86, and only
affects instruction scheduling. So yes, timing could be affected, but
it's such a _tiny_ effect that I don't really believe it's an issue.

I'd be more suspicious about other effects. For example, iot's not at
all obvious that the commit in question just changes the order of the
flags/inode field accesses, there are potentialy bigger changes there.
For example, this part (in __d_obtain_alias):

- tmp->d_inode = inode;
- tmp->d_flags |= add_flags;
+ __d_set_inode_and_type(tmp, inode, add_flags);

looks a bit off, because it *used* to just add those flags, but now,
through __d_set_inode_and_type, it does

+ dentry->d_inode = inode;
+ smp_wmb();
+ flags = READ_ONCE(dentry->d_flags);
+ flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
+ flags |= type_flags;
+ WRITE_ONCE(dentry->d_flags, flags);

so it clears DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU.

Is that correct? Maybe, I haven't checked. And maybe it's a big bad
bug. Regardless, it sure as hell isn't just changing the order of the
access to those fields. That "DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU"
clearing came from __d_instantiate(), but now it hits __d_obtain_alias
too.

There may be other changes like that for all I know. I didn't look
more closely. But I'd blame changes like that rather than any timing
by rmb/wmb.

(Side note: the instruction scheduling changes can be big - especially
together with the changes to use READ_ONCE/WRITE_ONCE, it basically
means that gcc won't be using r-m-w instructions etc to set flags, so
I'm not saying rmb/wmb is necessarily a no-op in general, but it's
definitely generally not a hugely noticeable thing).

Added DavidH to the cc list, since it's his commit. But Dominique, it
would probably be a good idea for you to try to double-check that that
commit really is what matters for you and causes problems.

Linus

2015-08-01 00:10:20

by Hugh Dickins

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Fri, 31 Jul 2015, Dominique Martinet wrote:
> Hugh Dickins wrote on Fri, Jul 31, 2015:
> > It will indeed be weird and odd if it confirms that DCACHE_DISCONNECTED
> > revert is good. I agree that Dominique's 4bf46a272647 seems now more
> > likely, if still unlikely; but that was included in v4.1, and I saw
> > no problem with v4.1 once the rmap_walk() skip was fixed.
>
> I think it could, actually, and that neither commits are actually bad --
> just that they affect timing enough to raise an issue between d_delete
> (I guess?) and link_path_walk (see last mail in other thread[1])
>
> It's probably an old race that was very hard to hit because of cache
> coherency.
> Basically, before the wmb/rmb, the dentry was always updated closely to
> its flags, so the other CPU would "usually" get both updates at the same
> time; the barriers make it so the updates are split and it's possible to
> get it, and would explain why I could pick 4bf46a2726 as "the one"
>
>
> I'm not sure why the problem wouldn't arise on tmpfs though.
>
> Hugh, could you try the reproducer I gave in the other thread[2] on both
> filesystems maybe?

Sorry, I probably won't get around to that, to be honest:
it shouldn't need me to run it anyway.

> I need to let the thing run for a while, might need to tune params as
> well. I was trying to fine tune cpu affinity with less threads but it's
> not getting anywhere.
>
> I'll also check if it's getting even easier to reproduce with
> 75a6f82a0d10 (or a recent kernel), who knows... How fast do you hit the
> bug with the commit?

"A number of hours". I don't have my records in front of me at the
moment, but I think when I was lucky it happened within two hours,
but more commonly around ten or twelve hours.

I just leave it going and get on with other things: yours may be a
_much_ better reproducer. Though once there's a potential fix, we shall
both need to try it, to report back if our separate cases are fixed.

Hugh

>
>
> Thanks,
> --
> Dominique
>
> [1] https://marc.info/?l=linux-fsdevel&m=143835651005259&w=2
> [2] https://marc.info/?l=linux-fsdevel&m=143825706609188&w=2

2015-08-01 00:21:50

by Hugh Dickins

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Fri, 31 Jul 2015, Linus Torvalds wrote:
>
> I'd be more suspicious about other effects. For example, iot's not at
> all obvious that the commit in question just changes the order of the
> flags/inode field accesses, there are potentialy bigger changes there.
> For example, this part (in __d_obtain_alias):
>
> - tmp->d_inode = inode;
> - tmp->d_flags |= add_flags;
> + __d_set_inode_and_type(tmp, inode, add_flags);
>
> looks a bit off, because it *used* to just add those flags, but now,
> through __d_set_inode_and_type, it does
>
> + dentry->d_inode = inode;
> + smp_wmb();
> + flags = READ_ONCE(dentry->d_flags);
> + flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
> + flags |= type_flags;
> + WRITE_ONCE(dentry->d_flags, flags);
>
> so it clears DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU.
>
> Is that correct? Maybe, I haven't checked. And maybe it's a big bad
> bug. Regardless, it sure as hell isn't just changing the order of the
> access to those fields. That "DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU"
> clearing came from __d_instantiate(), but now it hits __d_obtain_alias
> too.
>
> There may be other changes like that for all I know. I didn't look

Yes, the one which grabbed my attention is:

@@ -311,7 +346,7 @@ static void dentry_iput(struct dentry * dentry)
{
struct inode *inode = dentry->d_inode;
if (inode) {
- dentry->d_inode = NULL;
+ __d_clear_type_and_inode(dentry);
hlist_del_init(&dentry->d_u.d_alias);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);

which I think clears the DCACHE_ENTRY_TYPE i.e. makes it DCACHE_MISS_TYPE,
when it was left as is before. While there might be an RCU lookup in
progress, suddenly finding this to be a negative dentry. Perhaps -
this is not an area I've visited for years, and I've not followed up
the sequence count protection.

Hugh

2015-08-01 04:21:25

by Hugh Dickins

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Fri, 31 Jul 2015, Hugh Dickins wrote:
> On Fri, 31 Jul 2015, Linus Torvalds wrote:
> >
> > So leave it running a while longer, but maybe it's 4bf46a272647 like
> > Dominique suspects. Although I don't see how that could trigger
> > anything either..
>
> I restarted with a slightly different version of the load this
> morning, which has sometimes shown the issue more easily - I thought
> it better to restart with a variant than persist with a run that
> might have settled into a protected pattern. We'll see what that
> shows later on.

It showed nothing useful to this discussion: after an hour and a half
it had hung on some almost-certainly-unrelated issue that I've never
seen before - looked as if a jbd2 transaction never completed, some
tasks waiting for that, some waiting for f_pos_lock held by those
(which I hit when I tried to tail the output log). Worry about
that another time, if it ever shows up again.

I think I'll try reinstating Al's commit, and hacking out that change
of David's in dentry_iput() that worried me (though the "especially
problematic" remark in his change description suggests that it is
an intentional and necessary change to suit unionmount). See how
that goes.

Hugh

2015-08-01 05:59:05

by Dominique Martinet

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

Hugh Dickins wrote on Fri, Jul 31, 2015:
> On Fri, 31 Jul 2015, Linus Torvalds wrote:
>> I'd be more suspicious about other effects. For example, iot's not at
>> all obvious that the commit in question just changes the order of the
>> flags/inode field accesses, there are potentialy bigger changes there.
>> For example, this part (in __d_obtain_alias):
>>
>> - tmp->d_inode = inode;
>> - tmp->d_flags |= add_flags;
>> + __d_set_inode_and_type(tmp, inode, add_flags);
>>
>> looks a bit off, because it *used* to just add those flags, but now,
>> through __d_set_inode_and_type, it does
>>
>> + dentry->d_inode = inode;
>> + smp_wmb();
>> + flags = READ_ONCE(dentry->d_flags);
>> + flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
>> + flags |= type_flags;
>> + WRITE_ONCE(dentry->d_flags, flags);
>>
>> so it clears DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU.
>>
>> Is that correct? Maybe, I haven't checked. And maybe it's a big bad
>> bug. Regardless, it sure as hell isn't just changing the order of the
>> access to those fields. That "DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU"
>> clearing came from __d_instantiate(), but now it hits __d_obtain_alias
>> too.

Yeah, had spotted this and tried to fix just this bit, but it didn't
seem to change much for my problem.
Not saying it isn't a bug, but I have no idea what __d_obtain_alias
does and nobody seemed to care about this bit in my previous thread.

> Yes, the one which grabbed my attention is:
>
> @@ -311,7 +346,7 @@ static void dentry_iput(struct dentry * dentry)
> {
> struct inode *inode = dentry->d_inode;
> if (inode) {
> - dentry->d_inode = NULL;
> + __d_clear_type_and_inode(dentry);
> hlist_del_init(&dentry->d_u.d_alias);
> spin_unlock(&dentry->d_lock);
> spin_unlock(&inode->i_lock);

Oh, missed that one... Running tests with just that over the weekend,
it's a good candidate and the first 10 minutes of tests sound quite
positive!

I think it's wrong to call it a "fix" even if it stops the bug from
reproducing though, the way I understand the intent of the patch, they
want that checking d_flags be enough to take decisions without having to
check d_inode as well - so now things rely on that, it's still going to
be wrong on HEAD... I think?
(I'm running tests at this commit, so I don't have the patches that rely
on that yet)

As I understand things, the fact that lookup managed to get a
being-removed entry from rcu/wherever isn't changed, it's just that it
won't fail as fast and maybe something later will notice the lack of
inode and fallback graciously instead of that ENOTDIR I've been
tracking -- so that commit makes it possible to hit the bug, but there's
another issue about taking the dentry in the first place?


I really need to spend some time to understand vfs/dcache better as a
whole at some point, I've been looking at a small part of it without
context for too long...


Cheers,
--
Dominique

2015-08-01 07:26:10

by Al Viro

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Fri, Jul 31, 2015 at 03:52:38PM -0700, Linus Torvalds wrote:

> Is that correct? Maybe, I haven't checked. And maybe it's a big bad
> bug. Regardless, it sure as hell isn't just changing the order of the
> access to those fields. That "DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU"
> clearing came from __d_instantiate(), but now it hits __d_obtain_alias
> too.

Actually, the shit had hit the fan earlier. Look: in
commit b18825a7c8e37a7cf6abb97a12a6ad71af160de7
Author: David Howells <[email protected]>
Date: Thu Sep 12 19:22:53 2013 +0100

VFS: Put a small type field into struct dentry::d_flags

we have this:
@@ -1823,7 +1794,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
if (err)
return err;
}
- if (!can_lookup(nd->inode)) {
+ if (!d_is_directory(nd->path.dentry)) {
err = -ENOTDIR;
break;
}

And that has turned the check done to an inode that *was* ours at some
point (i.e. fetching it had been followed by checking that ->d_seq had
been still valid) into something completely unprotected. Suppose we
are in lazy mode and somebody had evicted nd->path.dentry after we'd looked
it up and before that check. Sure, its ->d_seq had been bumped by that,
and we would've failed anyway. With ECHILD. Which, unlike ENOTDIR, is
"repeat in non-lazy mode".

AFAICS, that's where the problem is. It affects only RCU mode and only
the places where dentry isn't pinned. That place in link_path_walk() is
trivial - we just need to do
if (unlikely(!d_can_lookup(nd->path.dentry))) {
if (nd->flags & LOOKUP_RCU) {
if (unlazy_walk(nd, NULL, 0))
return -ECHILD;
}
return -ENOTDIR;
}
there. AFAICS, other places of that sort are not a problem anymore.

Folks, could you check if this fixes the problems you are seeing?

diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c1..b16c3a7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1954,7 +1954,11 @@ OK:
continue;
}
}
- if (unlikely(!d_can_lookup(nd->path.dentry)))
+ if (unlikely(!d_can_lookup(nd->path.dentry))) {
+ if (nd->flags & LOOKUP_RCU) {
+ if (unlazy_walk(nd, NULL, 0))
+ return -ECHILD;
+ }
return -ENOTDIR;
}
}

2015-08-01 10:19:32

by Dominique Martinet

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

Al Viro wrote on Sat, Aug 01, 2015:
> And that has turned the check done to an inode that *was* ours at some
> point (i.e. fetching it had been followed by checking that ->d_seq had
> been still valid) into something completely unprotected. Suppose we
> are in lazy mode and somebody had evicted nd->path.dentry after we'd looked
> it up and before that check. Sure, its ->d_seq had been bumped by that,
> and we would've failed anyway. With ECHILD. Which, unlike ENOTDIR, is
> "repeat in non-lazy mode".

That sounds like a good find, I was looking at how to claim/protect the
entry somehow as well but I just have no idea...

> Folks, could you check if this fixes the problems you are seeing?
>
> diff --git a/fs/namei.c b/fs/namei.c
> index ae4e4c1..b16c3a7 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1954,7 +1954,11 @@ OK:
> continue;
> }
> }
> - if (unlikely(!d_can_lookup(nd->path.dentry)))
> + if (unlikely(!d_can_lookup(nd->path.dentry))) {
> + if (nd->flags & LOOKUP_RCU) {
> + if (unlazy_walk(nd, NULL, 0))
> + return -ECHILD;
> + }
> return -ENOTDIR;
> }
> }

Unfortunately, still happens for me.

I had to adapt a bit because using an old kernel (4bf46a272), will try
again with a recent master to doublecheck, but I had a break on
the "if (nd->flags & LOOKUP_RCU)" check:
- sometimes fails without ever hitting the check. I think this fixes
the "ENOTDIR" I had described, but there's at least another way to
fail?
- When we do hit it, we're into LOOKUP_RCU at this point alright,
unlazy_walk fails and we try again without RCU -- can confirm the
recovery process goes OK (well, that it went OK at least once)


Thanks,
--
Dominique

2015-08-01 10:50:38

by Dominique Martinet

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

Dominique Martinet wrote on Sat, Aug 01, 2015:
> I had to adapt a bit because using an old kernel (4bf46a272), will try
> again with a recent master to doublecheck

There have been more changes than what I thought, can't seem to
reproduce in a while on linus' HEAD with that fix (it fell in that
check quite a few times already).

I'll let test run all weekend to be safe, but it looks good to me!

--
Dominique

2015-08-01 16:09:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Sat, Aug 1, 2015 at 12:26 AM, Al Viro <[email protected]> wrote:
>
> Actually, the shit had hit the fan earlier. Look: in
> commit b18825a7c8e37a7cf6abb97a12a6ad71af160de7
> Author: David Howells <[email protected]>
> Date: Thu Sep 12 19:22:53 2013 +0100
>
> VFS: Put a small type field into struct dentry::d_flags
>
> we have this:
>
> - if (!can_lookup(nd->inode)) {
> + if (!d_is_directory(nd->path.dentry)) {

Ahh. That's subtle, yes. Inodes are stable in ways that dentries
aren't. And the reason why Dominique bisected it to 4bf46a272647 would
seem to be that while dentries aren't really stable, the dentry flags
generally don't change. But dentry_iput() changed to actually clear
the type when clearing the inode, so that probably added a few cases
where it went from "stable in practice" to be more easily triggered.

Your patch looks obviously correct, with the slight worry that there
might be other cases of this.

> there. AFAICS, other places of that sort are not a problem anymore.

Al, do you plan a pull request? It would be good to get this into rc5
(tomorrow) regardless of whether there might be other issues lurking
too.

Linus

2015-08-01 17:09:48

by Al Viro

[permalink] [raw]
Subject: Re: v4.2-rc dcache regression, probably 75a6f82a0d10

On Sat, Aug 01, 2015 at 09:09:24AM -0700, Linus Torvalds wrote:
> On Sat, Aug 1, 2015 at 12:26 AM, Al Viro <[email protected]> wrote:
> >
> > Actually, the shit had hit the fan earlier. Look: in
> > commit b18825a7c8e37a7cf6abb97a12a6ad71af160de7
> > Author: David Howells <[email protected]>
> > Date: Thu Sep 12 19:22:53 2013 +0100
> >
> > VFS: Put a small type field into struct dentry::d_flags
> >
> > we have this:
> >
> > - if (!can_lookup(nd->inode)) {
> > + if (!d_is_directory(nd->path.dentry)) {
>
> Ahh. That's subtle, yes. Inodes are stable in ways that dentries
> aren't. And the reason why Dominique bisected it to 4bf46a272647 would
> seem to be that while dentries aren't really stable, the dentry flags
> generally don't change. But dentry_iput() changed to actually clear
> the type when clearing the inode, so that probably added a few cases
> where it went from "stable in practice" to be more easily triggered.
>
> Your patch looks obviously correct, with the slight worry that there
> might be other cases of this.
>
> > there. AFAICS, other places of that sort are not a problem anymore.
>
> Al, do you plan a pull request? It would be good to get this into rc5
> (tomorrow) regardless of whether there might be other issues lurking
> too.

Will do later today. -stable branches will be interesting, though ;-/

Al, still digging through the loads of fun stuff in l-k mailbox...

2015-08-02 00:14:12

by Al Viro

[permalink] [raw]
Subject: [git pull] vfs.git spurious ENOTDIR fix

On Sat, Aug 01, 2015 at 09:09:24AM -0700, Linus Torvalds wrote:

> Al, do you plan a pull request? It would be good to get this into rc5
> (tomorrow) regardless of whether there might be other issues lurking
> too.

Please, pull from the usual place:
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:
Al Viro (1):
link_path_walk(): be careful when failing with ENOTDIR

Diffstat:
fs/namei.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

2015-08-02 00:23:24

by Al Viro

[permalink] [raw]
Subject: Re: [git pull] vfs.git spurious ENOTDIR fix

On Sun, Aug 02, 2015 at 01:14:02AM +0100, Al Viro wrote:
> On Sat, Aug 01, 2015 at 09:09:24AM -0700, Linus Torvalds wrote:
>
> > Al, do you plan a pull request? It would be good to get this into rc5
> > (tomorrow) regardless of whether there might be other issues lurking
> > too.
>
> Please, pull from the usual place:
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus
>
> Shortlog:
> Al Viro (1):
> link_path_walk(): be careful when failing with ENOTDIR
>
> Diffstat:
> fs/namei.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)

Diffstat:
fs/namei.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

actually. Note to self: git diff HEAD *before* pushing, to verify that
no uncommitted typo fixes are sitting around...

My apologies. Branch head should be at 97242f9, just to make sure you get
the right one...

BTW, is there any convenient way to have git itself check things like
that?

2015-08-02 00:42:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] vfs.git spurious ENOTDIR fix

On Sat, Aug 1, 2015 at 5:23 PM, Al Viro <[email protected]> wrote:
>
> BTW, is there any convenient way to have git itself check things like
> that?

I guess you could just do a "pre-push" hook, and have it exit with an
error if the tree isn't clean (to fail the push) or at least warn.

See "man githooks" for details.

Linus

2015-08-02 00:57:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] vfs.git spurious ENOTDIR fix

On Sat, Aug 1, 2015 at 5:23 PM, Al Viro <[email protected]> wrote:
>
> Branch head should be at 97242f9, just to make sure you get
> the right one...

Ok, merged in my tree.

However, looking at this, I'm struck by how all the callers of
"link_path_walk()" tend to have very similar patterns wrt error
handling.

And I'm wondering - wouldn't it be nicer to extend that pattern a bit
more, and make the *callers* of link_path_walk() all do

if (error) {
if (nd->flags & LOOKUP_RCU) {
if (unlazy_walk(nd, NULL, 0))
error = -ECHILD;
}
}

and maybe even make that part of terminate_walk() that everybody calls
after getting here.

Because it's not just that "!d_can_lookup()" case that triggers it,
you also have that pattern in the RCU error case for may_lookup(), and
get_link().

So why don't we make the rule that *every* single error we get during
an RCU walk should do that unlazy_walk() and turn the error into
ECHILD on failure. Hmm? We're almost there as-is.

Linus

2015-08-02 01:41:46

by Al Viro

[permalink] [raw]
Subject: Re: [git pull] vfs.git spurious ENOTDIR fix

On Sat, Aug 01, 2015 at 05:57:44PM -0700, Linus Torvalds wrote:

> Because it's not just that "!d_can_lookup()" case that triggers it,
> you also have that pattern in the RCU error case for may_lookup(), and
> get_link().

It feels like it might make sense to handle that in caller, but...
that goes only for cases when we are *NOT* going to continue after
successful transition to non-lazy mode. And these two are not of
that sort - we do want to continue rather than restart everything
from scratch.

BTW, unlazy_walk() has too many arguments, all for the sake of one caller
(everything except lookup_fast() calls it with (nd, NULL, 0) as arguments)
and it might make sense to split the damn thing in two. I have that in
a pending pile since the last cycle, but back then you have asked to stop
piling them up and let it settle, so I'd postponed that one along with other
cleanups...

2015-08-02 02:39:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] vfs.git spurious ENOTDIR fix

On Sat, Aug 1, 2015 at 6:41 PM, Al Viro <[email protected]> wrote:
>
> It feels like it might make sense to handle that in caller, but...
> that goes only for cases when we are *NOT* going to continue after
> successful transition to non-lazy mode. And these two are not of
> that sort - we do want to continue rather than restart everything
> from scratch.

Ahh. Yes. I didn't notice that they actually don't return an error at
all if the unlazy_walk succeeds, but just continue.

Ok, so they really are very different.

> BTW, unlazy_walk() has too many arguments, all for the sake of one caller
> (everything except lookup_fast() calls it with (nd, NULL, 0) as arguments)
> and it might make sense to split the damn thing in two. I have that in
> a pending pile since the last cycle, but back then you have asked to stop
> piling them up and let it settle, so I'd postponed that one along with other
> cleanups...

Well, I'd not be against continuing cleanups for 4.3... Well, as long
as we can make sure 4.2 is solid first, of course. I'd still like to
have Hugh verify that the current -git tree works for his load, but
obviously that wasn't easily reproducible, so that will presumably
take a few days. Dominique seems to at least not see it any more with
that patch.

Linus

2015-08-02 04:07:52

by Hugh Dickins

[permalink] [raw]
Subject: Re: [git pull] vfs.git spurious ENOTDIR fix

On Sat, 1 Aug 2015, Linus Torvalds wrote:
>
> Well, I'd not be against continuing cleanups for 4.3... Well, as long
> as we can make sure 4.2 is solid first, of course. I'd still like to
> have Hugh verify that the current -git tree works for his load, but
> obviously that wasn't easily reproducible, so that will presumably
> take a few days. Dominique seems to at least not see it any more with
> that patch.

I've had Al's patch under load since this morning, and it's going fine.
But I've not actually managed to reproduce the issue for several days
now, however bogus a "fix" I've been testing. So certainly don't wait
to hear from me: but of course I'll keep on with the loads, and shout
if it does go wrong again.

(I don't actually understand why the clearing of DCACHE_ENTRY_TYPE in
dentry_iput() is not of continuing concern; but don't worry, there's
plenty I don't understand - so long as you're both satisfied that
it's not a concern, no need to persuade me.)

Do we have any idea why a bug introduced in v3.13 should only now
stand out, both for Dominique and for me? Has the RCU lookup somehow
become much more effective recently? (I don't think symlinks were a
big deal for either of us, though the kernel build does use some).

Hugh

2015-08-02 04:39:55

by Al Viro

[permalink] [raw]
Subject: Re: [git pull] vfs.git spurious ENOTDIR fix

On Sat, Aug 01, 2015 at 09:06:55PM -0700, Hugh Dickins wrote:

> (I don't actually understand why the clearing of DCACHE_ENTRY_TYPE in
> dentry_iput() is not of continuing concern; but don't worry, there's
> plenty I don't understand - so long as you're both satisfied that
> it's not a concern, no need to persuade me.)

Because before we even get to dentry_iput(), we evict the fucker from hash.
And that will do dentry_rcuwalk_invalidate(dentry), which will bump ->d_seq
*AFTER* having it unhashed. Now look at __d_lookup_rcu() - there we fetch
->d_seq, then verify that it's still hashed.

So having hit dentry_iput() means that everyone who'd found it via RCU
lookup will be guaranteed a ->d_seq mismatch. The same goes for things
like d_drop() and d_instantiate(). Look for dentry_rcuwalk_invalidate()
callers in there...

Clearing DCACHE_ENTRY_TYPE there is fine - dentry *is* made negative there,
after all. What we want is to have ->d_inode stable at least as long as
->d_seq remains so.

2015-08-02 04:42:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] vfs.git spurious ENOTDIR fix

On Sat, Aug 1, 2015 at 9:06 PM, Hugh Dickins <[email protected]> wrote:
>
> (I don't actually understand why the clearing of DCACHE_ENTRY_TYPE in
> dentry_iput() is not of continuing concern; but don't worry, there's
> plenty I don't understand - so long as you're both satisfied that
> it's not a concern, no need to persuade me.)

So dentry_iput() is only called as the dentry is being thrown away,
and is stale.

Yes, such a stale dentry can be seen by an RCU lookup, but the RCU
lookups should always revalidate things after the lookup, so it
shouldn't matter. The problem here was that there was a missing
revalidate of the RCU lookup for an error case, so the error that
_should_ have been a harmless race that got handled later by the
proper validation instead turned into a real user-visible error.

But we didn't use to clear the flags in dentry_iput, so before things
generally "happened to work" anyway, because this rare error case
didn't actually ever trigger in the first place.

(And I still don't think we necessarily *should* clear the flags in
dentry_iput(), but it really shouldn't be a correctness issue)

> Do we have any idea why a bug introduced in v3.13 should only now
> stand out, both for Dominique and for me? Has the RCU lookup somehow
> become much more effective recently?

So I do think that the clearing of the dentry flags exposed a
situation that was harder to hit before.

The fact that we now do RCU lookups even over symlinks probably does
end up widening the possibilities for this happening too, although as
you say, that shouldn't be very common during a kernel build.

Linus

2015-08-02 18:54:12

by Hugh Dickins

[permalink] [raw]
Subject: Re: [git pull] vfs.git spurious ENOTDIR fix

On Sat, 1 Aug 2015, Linus Torvalds wrote:
> On Sat, Aug 1, 2015 at 9:06 PM, Hugh Dickins <[email protected]> wrote:
> >
> > (I don't actually understand why the clearing of DCACHE_ENTRY_TYPE in
> > dentry_iput() is not of continuing concern; but don't worry, there's
> > plenty I don't understand - so long as you're both satisfied that
> > it's not a concern, no need to persuade me.)
>
> So dentry_iput() is only called as the dentry is being thrown away,
> and is stale.
>
> Yes, such a stale dentry can be seen by an RCU lookup, but the RCU
> lookups should always revalidate things after the lookup, so it
> shouldn't matter. The problem here was that there was a missing
> revalidate of the RCU lookup for an error case, so the error that
> _should_ have been a harmless race that got handled later by the
> proper validation instead turned into a real user-visible error.

Thank you both for leading me through that: I really should have
rechecked the sequence count invalidation in the source for myself
(I had a wrong picture of it in my head), before inserting that
parenthesis and taking your time over it; but had been in a hurry
to get a response back.

>
> But we didn't use to clear the flags in dentry_iput, so before things
> generally "happened to work" anyway, because this rare error case
> didn't actually ever trigger in the first place.
>
> (And I still don't think we necessarily *should* clear the flags in
> dentry_iput(), but it really shouldn't be a correctness issue)
>
> > Do we have any idea why a bug introduced in v3.13 should only now
> > stand out, both for Dominique and for me? Has the RCU lookup somehow
> > become much more effective recently?
>
> So I do think that the clearing of the dentry flags exposed a
> situation that was harder to hit before.

Right, that does indeed make sense of why it appeared now.

I cannot actually report success from yesterday's testing, since
it hung after 20 hours for, I believe, the same unrelated reason
that I ran into before.

I mentioned jbd2 last time, but I doubt that's at fault: it's almost
certainly an issue with recent vmscan changes and/or recent loop
changes - the business of page reclaim waiting on page writeback
has always been tricky and fragile and deadlock-prone, the more
so when loop is involved: probably the balance has got shifted
slightly by recent changes, I'll look into it (but definitely
not rc5 material).

Thanks,
Hugh