LinuxLists.cc - Re: Linux 2.6.26-rc4

2008-06-03 09:49:59

by Jesper Krogh

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

Hi.

I'm getting this one. The mount.nfs call is being called by the
autofs-daemon.

It is not directly reproducible, but happens around once a day on a single
node on a 48 node cluster.

I reported it to the NFS-guys on a .22 kernel, and was directed to the
fsdevel list(CC'ed again) back then. Now the problem is reproduces on
.26-rc4

Jesper

Jun 3 09:46:58 node40 kernel: [17191699.952564] PGD 9f6f5067 PUD baf4f067
PMD 0
Jun 3 09:46:58 node40 kernel: [17191699.952564] CPU 1
Jun 3 09:46:58 node40 kernel: [17191699.952564] Modules linked in: nfs
lockd sunrpc autofs4 ipv6 af_packet usbhid hid uhci_hcd ehci_hcd usbkbd
parport_pc lp parport amd_rng i2c_amd756 container psmouse serio_raw
button pcspkr evdev i2c_core k8temp shpchp pci_hotplug ext3 jbd mbcache sg
sd_mod ide_cd_mod cdrom amd74xx ide_core floppy mptspi mptscsih mptbase
scsi_transport_spi tg3 ata_generic libata scsi_mod dock ohci_hcd usbcore
thermal processor fan thermal_sys fuse
Jun 3 09:46:58 node40 kernel: [17191699.952564] Pid: 15602, comm:
mount.nfs Not tainted 2.6.26-rc4 #1
Jun 3 09:46:58 node40 kernel: [17191700.727822] RIP:
0010:[graft_tree+77/288] [graft_tree+77/288] graft_tree+0x4d/0x120
Jun 3 09:46:58 node40 kernel: [17191700.727822] RSP:
0000:ffff810068683e08 EFLAGS: 00010246
Jun 3 09:46:58 node40 kernel: [17191700.727822] RAX: ffff8100bf836a90
RBX: 00000000ffffffec RCX: 0000000000000000
Jun 3 09:46:58 node40 kernel: [17191700.983576] RDX: ffff8100fa378500
RSI: ffff810068683e68 RDI: ffff810029252b00
Jun 3 09:46:58 node40 kernel: [17191700.983576] RBP: ffff810029252b00
R08: 0000000000000000 R09: 0000000000000001
Jun 3 09:46:58 node40 kernel: [17191700.983576] R10: 0000000000000001
R11: ffffffff803011c0 R12: ffff810068683e68
Jun 3 09:46:58 node40 kernel: [17191700.983576] R13: 0000000000000000
R14: 000000000000000b R15: 000000000000000b
Jun 3 09:46:58 node40 kernel: [17191700.983576] FS:
00007f525595b6e0(0000) GS:ffff8100fbb12280(0000) knlGS:00000000cb307b90
Jun 3 09:46:58 node40 kernel: [17191701.435324] CS: 0010 DS: 0000 ES:
0000 CR0: 000000008005003b
Jun 3 09:46:58 node40 kernel: [17191701.435324] CR2: 00000000000000b2
CR3: 000000005ccbc000 CR4: 00000000000006e0
Jun 3 09:46:58 node40 kernel: [17191701.435324] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
Jun 3 09:46:58 node40 kernel: [17191701.435324] DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 3 09:46:58 node40 kernel: [17191701.435324] Process mount.nfs (pid:
15602, threadinfo ffff810068682000, task ffff8100bad61700)
Jun 3 09:46:58 node40 kernel: [17191701.879311] Stack: ffff810068683e68
ffff810068683e70 ffff810029252b00 ffffffff802b2622
Jun 3 09:46:58 node40 kernel: [17191701.879311] 0000000000000006
0000000000000000 ffff8100c1b72000 ffff8100f68aa000
Jun 3 09:46:58 node40 kernel: [17191701.879311] ffff8100f6540000
ffffffff802b49e9 000000004844f763 0000000000000000
Jun 3 09:46:58 node40 kernel: [17191701.879311] Call Trace:
Jun 3 09:46:58 node40 kernel: [17191701.879311] [do_add_mount+162/320] ?
do_add_mount+0xa2/0x140
Jun 3 09:46:58 node40 kernel: [17191701.879311] [do_mount+505/592] ?
do_mount+0x1f9/0x250
Jun 3 09:46:58 node40 kernel: [17191701.879311]
[copy_mount_options+269/384] ? copy_mount_options+0x10d/0x180
Jun 3 09:46:58 node40 kernel: [17191701.879311] [sys_mount+155/256] ?
sys_mount+0x9b/0x100
Jun 3 09:46:58 node40 kernel: [17191701.879311]
[system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80
Jun 3 09:46:58 node40 kernel: [17191701.879311]
Jun 3 09:46:58 node40 kernel: [17191701.879311]
Jun 3 09:46:58 node40 kernel: [17191701.879311] RSP <ffff810068683e08>
Jun 3 09:46:58 node40 kernel: [17191704.861207] ---[ end trace
bc4c286fe026e348 ]---

--
Jesper Krogh

2008-06-03 09:57:27

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:

> I reported it to the NFS-guys on a .22 kernel, and was directed to the
> fsdevel list(CC'ed again) back then. Now the problem is reproduces on
> .26-rc4

Lovely... Do you have the full oops trace, with the actual code
dump?

2008-06-03 10:05:18

by Jesper Krogh

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

> On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:
>
>
>> I reported it to the NFS-guys on a .22 kernel, and was directed to the
>> fsdevel list(CC'ed again) back then. Now the problem is reproduced on
>> .26-rc4
>>
>
> Lovely... Do you have the full oops trace, with the actual code
> dump? --

This is all I got from the logs. I'll try go get more from the serial
console.
But since it is not directly reproducible, it is a bit hard. Suggestions are
welcome?

Jesper
--
Jesper Krogh

2008-06-03 10:14:23

by Miklos Szeredi

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

> > On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:
> >
> >
> >> I reported it to the NFS-guys on a .22 kernel, and was directed to the
> >> fsdevel list(CC'ed again) back then. Now the problem is reproduced on
> >> .26-rc4
> >>
> >
> > Lovely... Do you have the full oops trace, with the actual code
> > dump? --
>
> This is all I got from the logs. I'll try go get more from the serial

Probably the same as this one:

http://www.kerneloops.org/raw.php?rawid=12419&msgid=

Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
which would be due to NFS not properly filling in its root dentry?

Miklos

2008-06-03 10:35:58

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, Jun 03, 2008 at 10:57:13AM +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:
>
> > I reported it to the NFS-guys on a .22 kernel, and was directed to the
> > fsdevel list(CC'ed again) back then. Now the problem is reproduces on
> > .26-rc4
>
> Lovely... Do you have the full oops trace, with the actual code
> dump?

FWIW, searching for graft_tree on arjan's site:
#12419:
negative dentry path->dentry
#17367:
ditto
#17463:
ditto
#13042:
probably the same (is that earlier oops you've mentioned?)
#18932:
WTF is that one doing there? (graft_tree is never mentioned in it)

Nuts... I really don't see how that could happen, unless it's NFS
revalidation playing silly buggers with dentries and ->d_revalidate()
there ends up turning dentry passed to it into negative one...

Where does the mountpoint in question live and what gets passed to
sys_mount()?

2008-06-03 10:38:19

by Miklos Szeredi

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

> > > On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:
> > >
> > >
> > >> I reported it to the NFS-guys on a .22 kernel, and was directed to the
> > >> fsdevel list(CC'ed again) back then. Now the problem is reproduced on
> > >> .26-rc4
> > >>
> > >
> > > Lovely... Do you have the full oops trace, with the actual code
> > > dump? --
> >
> > This is all I got from the logs. I'll try go get more from the serial
>
> Probably the same as this one:
>
> http://www.kerneloops.org/raw.php?rawid=12419&msgid=
>
> Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> which would be due to NFS not properly filling in its root dentry?

On second thought it's S_ISDIR(path->dentry->d_inode->i_mode), which
means it's an autofs thing.

CC-ing Ian.

Miklos

2008-06-03 10:40:50

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, Jun 03, 2008 at 12:13:54PM +0200, Miklos Szeredi wrote:
>
> http://www.kerneloops.org/raw.php?rawid=12419&msgid=
>
> Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> which would be due to NFS not properly filling in its root dentry?

Look more carefully. It's path->dentry; aside of the fact that dentry
pointer is fetched at offset 8 from one of the arguments (fits path->dentry,
too low for mnt->mnt_root), do_add_mount() itself has just done S_ISLNK
on the very same thing, so it'd die before getting to graft_tree().

No, it's either path_lookup() somehow returning a negative dentry in
do_mount() (which shouldn't be possible, unless it's some crap around
return_reval in __link_path_walk()) or it's follow_down() giving us
a negative dentry. Which almost certainly would've exploded prior to
that...

2008-06-03 10:45:47

by Miklos Szeredi

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

> >
> > http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> >
> > Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> > which would be due to NFS not properly filling in its root dentry?
>
> Look more carefully. It's path->dentry; aside of the fact that dentry
> pointer is fetched at offset 8 from one of the arguments (fits path->dentry,
> too low for mnt->mnt_root),

Yup, realized this after posting.

> do_add_mount() itself has just done S_ISLNK
> on the very same thing, so it'd die before getting to graft_tree().
>
> No, it's either path_lookup() somehow returning a negative dentry in
> do_mount() (which shouldn't be possible, unless it's some crap around
> return_reval in __link_path_walk()) or it's follow_down() giving us
> a negative dentry. Which almost certainly would've exploded prior to
> that...

I think it must be autofs4 doing something weird. Like this in
autofs4_lookup_unhashed():

/*
* Make the rehashed dentry negative so the VFS
* behaves as it should.
*/
if (inode) {
dentry->d_inode = NULL;

Miklos

2008-06-03 10:48:27

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, Jun 03, 2008 at 12:37:59PM +0200, Miklos Szeredi wrote:
> > http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> >
> > Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> > which would be due to NFS not properly filling in its root dentry?
>
> On second thought it's S_ISDIR(path->dentry->d_inode->i_mode), which
> means it's an autofs thing.

It is path->dentry, all right, but the question is how'd it get that way.
Look: we got that nd.path.dentry out of path_lookup() with LOOKUP_FOLLOW
as flags. Then we'd passed it through do_new_mount() to do_add_mount()
without changes. And went through
/* Something was mounted here while we slept */
while (d_mountpoint(nd->path.dentry) &&
follow_down(&nd->path.mnt, &nd->path.dentry))
;

2008-06-03 10:53:17

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, Jun 03, 2008 at 12:45:33PM +0200, Miklos Szeredi wrote:

> I think it must be autofs4 doing something weird. Like this in
> autofs4_lookup_unhashed():
>
> /*
> * Make the rehashed dentry negative so the VFS
> * behaves as it should.
> */
> if (inode) {
> dentry->d_inode = NULL;

Lovely. If we ever step into that with somebody else (no matter who)
holding a reference to that dentry, we are certainly well and truly
buggered. It's not just mount(2) - everything in the tree assumes that
holding a reference to positive dentry guarantees that it remains
positive.

Ian?

2008-06-03 13:35:42

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 21:31 +0800, Ian Kent wrote:
> On Tue, 2008-06-03 at 11:48 +0100, Al Viro wrote:
> > On Tue, Jun 03, 2008 at 12:37:59PM +0200, Miklos Szeredi wrote:
> > > > http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> > > >
> > > > Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> > > > which would be due to NFS not properly filling in its root dentry?
> > >
> > > On second thought it's S_ISDIR(path->dentry->d_inode->i_mode), which
> > > means it's an autofs thing.
> >
> > It is path->dentry, all right, but the question is how'd it get that way.
> > Look: we got that nd.path.dentry out of path_lookup() with LOOKUP_FOLLOW
> > as flags. Then we'd passed it through do_new_mount() to do_add_mount()
> > without changes. And went through
> > /* Something was mounted here while we slept */
> > while (d_mountpoint(nd->path.dentry) &&
> > follow_down(&nd->path.mnt, &nd->path.dentry))
> > ;
>
> And this relates to previous in that a mount isn't done by autofs until
> until after the directory is created, at which time the (->mkdir())
> dentry is hashed.

Oh .. and made positive at the same time.

>
> Ian
>

2008-06-03 13:46:31

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 11:52 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 12:45:33PM +0200, Miklos Szeredi wrote:
>
> > I think it must be autofs4 doing something weird. Like this in
> > autofs4_lookup_unhashed():
> >
> > /*
> > * Make the rehashed dentry negative so the VFS
> > * behaves as it should.
> > */
> > if (inode) {
> > dentry->d_inode = NULL;
>
> Lovely. If we ever step into that with somebody else (no matter who)
> holding a reference to that dentry, we are certainly well and truly
> buggered. It's not just mount(2) - everything in the tree assumes that
> holding a reference to positive dentry guarantees that it remains
> positive.

The intent here is that, the dentry above is unhashed at this point, and
if hasn't been reclaimed by the VFS, it is made negative and replaces
the unhashed negative dentry passed to ->lookup(). The reference count
is incremented to account for the reference held by the path walk.

What am I doing wrong here?

Ian

2008-06-03 13:56:09

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 11:48 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 12:37:59PM +0200, Miklos Szeredi wrote:
> > > http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> > >
> > > Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> > > which would be due to NFS not properly filling in its root dentry?
> >
> > On second thought it's S_ISDIR(path->dentry->d_inode->i_mode), which
> > means it's an autofs thing.
>
> It is path->dentry, all right, but the question is how'd it get that way.
> Look: we got that nd.path.dentry out of path_lookup() with LOOKUP_FOLLOW
> as flags. Then we'd passed it through do_new_mount() to do_add_mount()
> without changes. And went through
> /* Something was mounted here while we slept */
> while (d_mountpoint(nd->path.dentry) &&
> follow_down(&nd->path.mnt, &nd->path.dentry))
> ;

And this relates to previous in that a mount isn't done by autofs until
until after the directory is created, at which time the (->mkdir())
dentry is hashed.

Ian

2008-06-03 15:02:51

by Linus Torvalds

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 3 Jun 2008, Ian Kent wrote:
> >
> > > I think it must be autofs4 doing something weird. Like this in
> > > autofs4_lookup_unhashed():
> > >
> > > /*
> > > * Make the rehashed dentry negative so the VFS
> > > * behaves as it should.
> > > */
> > > if (inode) {
> > > dentry->d_inode = NULL;

Uhhuh. Yeah, that's not allowed.

A dentry inode can start _out_ as NULL, but it can never later become NULL
again until it is totally unused.

> > Lovely. If we ever step into that with somebody else (no matter who)
> > holding a reference to that dentry, we are certainly well and truly
> > buggered. It's not just mount(2) - everything in the tree assumes that
> > holding a reference to positive dentry guarantees that it remains
> > positive.

Indeed. Things like regular file ops won't even test the inode, since they
know that "open()" will only open a dentry with a positive entry, so they
know that the dentry->inode is non-NULL.

[ Although some code-paths do test - but that is just because people are
so used to testign that pointers are non-NULL. ]

> The intent here is that, the dentry above is unhashed at this point, and
> if hasn't been reclaimed by the VFS, it is made negative and replaces
> the unhashed negative dentry passed to ->lookup(). The reference count
> is incremented to account for the reference held by the path walk.
>
> What am I doing wrong here?

What's wrong is that you can't do that "dentry->d_inode = NULL". EVER.

Why would you want to? If the dentry is already unhashed, then no _new_
lookups will ever find it anyway, so it's effectively unfindable anyway.
Except by people who *have* to find it, ie the people who already hold it
open (because, for example, they opened it earlier, or because they
chdir()'ed into a subdirectory).

So why don't you just return a NULL dentry instead, for a unhashed dentry?
Or do the "goto next" thing?

Linus

2008-06-03 16:11:16

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 08:01 -0700, Linus Torvalds wrote:
>
> On Tue, 3 Jun 2008, Ian Kent wrote:
> > >
> > > > I think it must be autofs4 doing something weird. Like this in
> > > > autofs4_lookup_unhashed():
> > > >
> > > > /*
> > > > * Make the rehashed dentry negative so the VFS
> > > > * behaves as it should.
> > > > */
> > > > if (inode) {
> > > > dentry->d_inode = NULL;
>
> Uhhuh. Yeah, that's not allowed.
>
> A dentry inode can start _out_ as NULL, but it can never later become NULL
> again until it is totally unused.
>
> > > Lovely. If we ever step into that with somebody else (no matter who)
> > > holding a reference to that dentry, we are certainly well and truly
> > > buggered. It's not just mount(2) - everything in the tree assumes that
> > > holding a reference to positive dentry guarantees that it remains
> > > positive.
>
> Indeed. Things like regular file ops won't even test the inode, since they
> know that "open()" will only open a dentry with a positive entry, so they
> know that the dentry->inode is non-NULL.
>
> [ Although some code-paths do test - but that is just because people are
> so used to testign that pointers are non-NULL. ]
>
> > The intent here is that, the dentry above is unhashed at this point, and
> > if hasn't been reclaimed by the VFS, it is made negative and replaces
> > the unhashed negative dentry passed to ->lookup(). The reference count
> > is incremented to account for the reference held by the path walk.
> >
> > What am I doing wrong here?
>
> What's wrong is that you can't do that "dentry->d_inode = NULL". EVER.

OK.

>
> Why would you want to? If the dentry is already unhashed, then no _new_
> lookups will ever find it anyway, so it's effectively unfindable anyway.
> Except by people who *have* to find it, ie the people who already hold it
> open (because, for example, they opened it earlier, or because they
> chdir()'ed into a subdirectory).

The code we're talking about deals with a race between expiring and
mounting an autofs mount point at the same time.

I'll have a closer look and see if I can make it work without turning
the dentry negative.

>
> So why don't you just return a NULL dentry instead, for a unhashed dentry?
> Or do the "goto next" thing?

That just won't work for the case this is meant to deal with.

Ian

2008-06-03 16:36:22

by Linus Torvalds

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Wed, 4 Jun 2008, Ian Kent wrote:
>
> The code we're talking about deals with a race between expiring and
> mounting an autofs mount point at the same time.
>
> I'll have a closer look and see if I can make it work without turning
> the dentry negative.

Hmm.

Can you walk me through this?

If the dentry is unhashed, it means that it _either_

- has already been deleted (rmdir'ed) or d_invalidate()'d. Right?

I don't see why you should ever return the dentry in this case..

- or it has not yet been hashed at all

But then d_inode should be NULL too, no?

Anyway, as far as I can tell, you should handle the race between expiring
and re-mounting not by unhashing at expire time (which causes these kinds
of problems), but by setting a bit in the dentry and using the dentry
"revalidate()" callback to wait for the revalidate.

But I don't know autofs4, so you probably have some reason. Could you
explain it?

Linus

2008-06-03 16:41:20

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, Jun 03, 2008 at 09:35:47AM -0700, Linus Torvalds wrote:
>
>
> On Wed, 4 Jun 2008, Ian Kent wrote:
> >
> > The code we're talking about deals with a race between expiring and
> > mounting an autofs mount point at the same time.
> >
> > I'll have a closer look and see if I can make it work without turning
> > the dentry negative.
>
> Hmm.
>
> Can you walk me through this?
>
> If the dentry is unhashed, it means that it _either_
>
> - has already been deleted (rmdir'ed) or d_invalidate()'d. Right?
>
> I don't see why you should ever return the dentry in this case..

>From my reading of that code looks like it's been rmdir'ed. And no, I
don't understand what the hell is that code trying to do.

Ian, could you describe the race you are talking about?

2008-06-03 16:51:11

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, Jun 03, 2008 at 05:41:03PM +0100, Al Viro wrote:

> >From my reading of that code looks like it's been rmdir'ed. And no, I
> don't understand what the hell is that code trying to do.
>
> Ian, could you describe the race you are talking about?

BTW, this stuff is definitely broken regardless of mount - if something
had the directory in question opened before that rmdir and we'd hit
your lookup_unhashed while another CPU had been in the middle of
getdents(2) on that opened descriptor, we'll get

vfs_readdir() grabs i_mutex
vfs_readdir() checks that it's dead
autofs4_lookup_unhashed() calls iput()
inode is freed
vfs_readdir() releases i_mutex - in already freed struct inode.

Hell, just getdents() right *after* dentry->d_inode = NULL will oops,
plain and simple.

2008-06-03 17:00:41

by Linus Torvalds

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 3 Jun 2008, Al Viro wrote:
> >
> > If the dentry is unhashed, it means that it _either_
> >
> > - has already been deleted (rmdir'ed) or d_invalidate()'d. Right?
> >
> > I don't see why you should ever return the dentry in this case..
>
> From my reading of that code looks like it's been rmdir'ed. And no, I
> don't understand what the hell is that code trying to do.

Hmm. Looking closer, I think that code is meant to handle the
d_invalidate() that it did in autofs4_tree_busy().

However, that should never trigger for a directory entry that can be
reached some other way, because that code has done a "dget()" on the
dentry, and d_invalidate() does

if (atomic_read(&dentry->d_count) > 1) {
if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
..unlock..
return -EBUSY;
}
}

so I dunno. I still think the expire code shouldn't even use
d_invalidate() at all, and just revalidate() at lookup.

Linus

2008-06-03 17:16:42

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 09:35 -0700, Linus Torvalds wrote:
>
> On Wed, 4 Jun 2008, Ian Kent wrote:
> >
> > The code we're talking about deals with a race between expiring and
> > mounting an autofs mount point at the same time.
> >
> > I'll have a closer look and see if I can make it work without turning
> > the dentry negative.
>
> Hmm.
>
> Can you walk me through this?
>
> If the dentry is unhashed, it means that it _either_
>
> - has already been deleted (rmdir'ed) or d_invalidate()'d. Right?

In the current code the only way a dentry gets onto this list is by one
of the two operations ->unlink() or ->rmdir(), it is d_drop'ed and left
positive by both of these operations (a carry over from 2.4 when
d_lookup() returned unhashed dentrys, I missed that detail for quite a
while).

>
> I don't see why you should ever return the dentry in this case..

It's been a while now but the original patch description should help.

"What happens is that during an expire the situation can arise
that a directory is removed and another lookup is done before
the expire issues a completion status to the kernel module.
In this case, since the the lookup gets a new dentry, it doesn't
know that there is an expire in progress and when it posts its
mount request, matches the existing expire request and waits
for its completion. ENOENT is then returned to user space
from lookup (as the dentry passed in is now unhashed) without
having performed the mount request.

The solution used here is to keep track of dentrys in this
unhashed state and reuse them, if possible, in order to
preserve the flags. Additionally, this infrastructure will
provide the framework for the reintroduction of caching
of mount fails removed earlier in development."

I wasn't able to do an acceptable re-implementation of the negative
caching we had in 2.4 with this framework, so just ignore the last
sentence in the above description.

>
> - or it has not yet been hashed at all

It has been previously hashed, yes.

>
> But then d_inode should be NULL too, no?

Unfortunately no, but I thought that once the dentry became unhashed
(aka ->rmdir() or ->unlink()) it was invisible to the dcache. But, of
course there may be descriptors open on the dentry, which I think is the
problem that's being pointed out.

>
> Anyway, as far as I can tell, you should handle the race between expiring
> and re-mounting not by unhashing at expire time (which causes these kinds
> of problems), but by setting a bit in the dentry and using the dentry
> "revalidate()" callback to wait for the revalidate.

Yes, that would be ideal but the reason we arrived here is that, because
we must release the directory mutex before calling back to the daemon
(the heart of the problem, actually having to drop the mutex) to perform
the mount, we can get a deadlock. The cause of the problem was that for
"create" like operations the mutex is held for ->lookup() and
->revalidate() but for a "path walks" the mutex is only held for
->lookup(), so if the mutex is held when we're in ->revalidate(), we
could never be sure that we where the code path that acquired it.

Sorry, this last bit is unclear.
I'll need to work a bit harder on the explanation if you're interested
in checking further.

Anyway, I'm sure I made the dentry negative upon getting a rehash hit
for a reason so I'll need to revisit it. Perhaps I was misguided in the
first place or perhaps just plain lazy.

Ian

2008-06-03 17:30:47

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Wed, Jun 04, 2008 at 01:13:08AM +0800, Ian Kent wrote:

> "What happens is that during an expire the situation can arise
> that a directory is removed and another lookup is done before
> the expire issues a completion status to the kernel module.
> In this case, since the the lookup gets a new dentry, it doesn't
> know that there is an expire in progress and when it posts its
> mount request, matches the existing expire request and waits
> for its completion. ENOENT is then returned to user space
> from lookup (as the dentry passed in is now unhashed) without
> having performed the mount request.
>
> The solution used here is to keep track of dentrys in this
> unhashed state and reuse them, if possible, in order to
> preserve the flags. Additionally, this infrastructure will
> provide the framework for the reintroduction of caching
> of mount fails removed earlier in development."
>
> I wasn't able to do an acceptable re-implementation of the negative
> caching we had in 2.4 with this framework, so just ignore the last
> sentence in the above description.

> Unfortunately no, but I thought that once the dentry became unhashed
> (aka ->rmdir() or ->unlink()) it was invisible to the dcache. But, of
> course there may be descriptors open on the dentry, which I think is the
> problem that's being pointed out.

... or we could have had a pending mount(2) sitting there with a reference
to mountpoint-to-be...

> Yes, that would be ideal but the reason we arrived here is that, because
> we must release the directory mutex before calling back to the daemon
> (the heart of the problem, actually having to drop the mutex) to perform
> the mount, we can get a deadlock. The cause of the problem was that for
> "create" like operations the mutex is held for ->lookup() and
> ->revalidate() but for a "path walks" the mutex is only held for
> ->lookup(), so if the mutex is held when we're in ->revalidate(), we
> could never be sure that we where the code path that acquired it.
>
> Sorry, this last bit is unclear.
> I'll need to work a bit harder on the explanation if you're interested
> in checking further.

I am.

Oh, well... Looks like RTFS time for me for now... Additional parts of
braindump would be appreciated - the last time I've seriously looked at
autofs4 internal had been ~2005 or so ;-/

2008-06-03 17:31:52

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 17:50 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 05:41:03PM +0100, Al Viro wrote:
>
> > >From my reading of that code looks like it's been rmdir'ed. And no, I
> > don't understand what the hell is that code trying to do.
> >
> > Ian, could you describe the race you are talking about?
>
> BTW, this stuff is definitely broken regardless of mount - if something
> had the directory in question opened before that rmdir and we'd hit
> your lookup_unhashed while another CPU had been in the middle of
> getdents(2) on that opened descriptor, we'll get
>
> vfs_readdir() grabs i_mutex
> vfs_readdir() checks that it's dead
> autofs4_lookup_unhashed() calls iput()

Can this really happen, since autofs4_lookup_unhashed() is only called
with the i_mutex held.

> inode is freed
> vfs_readdir() releases i_mutex - in already freed struct inode.

But it could happen later. So it's academic I guess.

>
> Hell, just getdents() right *after* dentry->d_inode = NULL will oops,
> plain and simple.

Yeah, I'll look into why I believed I needed to turn the dentry
negative. I'll need to keep the dentry positive through out this
process.

Ian

2008-06-03 17:34:19

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 09:59 -0700, Linus Torvalds wrote:
>
> On Tue, 3 Jun 2008, Al Viro wrote:
> > >
> > > If the dentry is unhashed, it means that it _either_
> > >
> > > - has already been deleted (rmdir'ed) or d_invalidate()'d. Right?
> > >
> > > I don't see why you should ever return the dentry in this case..
> >
> > From my reading of that code looks like it's been rmdir'ed. And no, I
> > don't understand what the hell is that code trying to do.
>
> Hmm. Looking closer, I think that code is meant to handle the
> d_invalidate() that it did in autofs4_tree_busy().
>
> However, that should never trigger for a directory entry that can be
> reached some other way, because that code has done a "dget()" on the
> dentry, and d_invalidate() does
>
> if (atomic_read(&dentry->d_count) > 1) {
> if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
> ..unlock..
> return -EBUSY;
> }
> }
>
> so I dunno. I still think the expire code shouldn't even use
> d_invalidate() at all, and just revalidate() at lookup.

Yes, perhaps not.
A job for another day.

Ian

2008-06-03 17:41:42

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Wed, Jun 04, 2008 at 01:28:23AM +0800, Ian Kent wrote:
>
> On Tue, 2008-06-03 at 17:50 +0100, Al Viro wrote:
> > On Tue, Jun 03, 2008 at 05:41:03PM +0100, Al Viro wrote:
> >
> > > >From my reading of that code looks like it's been rmdir'ed. And no, I
> > > don't understand what the hell is that code trying to do.
> > >
> > > Ian, could you describe the race you are talking about?
> >
> > BTW, this stuff is definitely broken regardless of mount - if something
> > had the directory in question opened before that rmdir and we'd hit
> > your lookup_unhashed while another CPU had been in the middle of
> > getdents(2) on that opened descriptor, we'll get
> >
> > vfs_readdir() grabs i_mutex
> > vfs_readdir() checks that it's dead
> > autofs4_lookup_unhashed() calls iput()
>
> Can this really happen, since autofs4_lookup_unhashed() is only called
> with the i_mutex held.

i_mutex on a different inode (obviously - it frees the inode in question,
so if caller held i_mutex on it, you would be in trouble every time you
hit that codepath).

2008-06-03 17:41:56

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 18:30 +0100, Al Viro wrote:
> On Wed, Jun 04, 2008 at 01:13:08AM +0800, Ian Kent wrote:
>
> > "What happens is that during an expire the situation can arise
> > that a directory is removed and another lookup is done before
> > the expire issues a completion status to the kernel module.
> > In this case, since the the lookup gets a new dentry, it doesn't
> > know that there is an expire in progress and when it posts its
> > mount request, matches the existing expire request and waits
> > for its completion. ENOENT is then returned to user space
> > from lookup (as the dentry passed in is now unhashed) without
> > having performed the mount request.
> >
> > The solution used here is to keep track of dentrys in this
> > unhashed state and reuse them, if possible, in order to
> > preserve the flags. Additionally, this infrastructure will
> > provide the framework for the reintroduction of caching
> > of mount fails removed earlier in development."
> >
> > I wasn't able to do an acceptable re-implementation of the negative
> > caching we had in 2.4 with this framework, so just ignore the last
> > sentence in the above description.
>
> > Unfortunately no, but I thought that once the dentry became unhashed
> > (aka ->rmdir() or ->unlink()) it was invisible to the dcache. But, of
> > course there may be descriptors open on the dentry, which I think is the
> > problem that's being pointed out.
>
> ... or we could have had a pending mount(2) sitting there with a reference
> to mountpoint-to-be...
>
> > Yes, that would be ideal but the reason we arrived here is that, because
> > we must release the directory mutex before calling back to the daemon
> > (the heart of the problem, actually having to drop the mutex) to perform
> > the mount, we can get a deadlock. The cause of the problem was that for
> > "create" like operations the mutex is held for ->lookup() and
> > ->revalidate() but for a "path walks" the mutex is only held for
> > ->lookup(), so if the mutex is held when we're in ->revalidate(), we
> > could never be sure that we where the code path that acquired it.
> >
> > Sorry, this last bit is unclear.
> > I'll need to work a bit harder on the explanation if you're interested
> > in checking further.
>
> I am.
>
> Oh, well... Looks like RTFS time for me for now... Additional parts of
> braindump would be appreciated - the last time I've seriously looked at
> autofs4 internal had been ~2005 or so ;-/

You will find other problems.

The other bit to this is the patch to resolve the deadlock issue I spoke
about just above. This is likely where most of the current problems
started and the fact that we have always had to drop the mutex to call
back the daemon.

I can post the patches as well if that helps.

The description accompanying that patch was (the inconsistent locking
referred to here is what was described above):

"Due to inconsistent locking in the VFS between calls to lookup and
revalidate deadlock can occur in the automounter.

The inconsistency is that the directory inode mutex is held for both
lookup and revalidate calls when called via lookup_hash whereas it is
held only for lookup during a path walk. Consequently, if the mutex
is held during a call to revalidate autofs4 can't release the mutex
to callback the daemon as it can't know whether it owns the mutex.

This situation happens when a process tries to create a directory
within an automount and a second process also tries to create the
same directory between the lookup and the mkdir. Since the first
process has dropped the mutex for the daemon callback, the second
process takes it during revalidate leading to deadlock between the
autofs daemon and the second process when the daemon tries to create
the mount point directory.

After spending quite a bit of time trying to resolve this on more than
one occassion, using rather complex and ulgy approaches, it turns out
that just delaying the hashing of the dentry until the create operation
work fine."

Ian

2008-06-03 17:44:58

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 18:41 +0100, Al Viro wrote:
> On Wed, Jun 04, 2008 at 01:28:23AM +0800, Ian Kent wrote:
> >
> > On Tue, 2008-06-03 at 17:50 +0100, Al Viro wrote:
> > > On Tue, Jun 03, 2008 at 05:41:03PM +0100, Al Viro wrote:
> > >
> > > > >From my reading of that code looks like it's been rmdir'ed. And no, I
> > > > don't understand what the hell is that code trying to do.
> > > >
> > > > Ian, could you describe the race you are talking about?
> > >
> > > BTW, this stuff is definitely broken regardless of mount - if something
> > > had the directory in question opened before that rmdir and we'd hit
> > > your lookup_unhashed while another CPU had been in the middle of
> > > getdents(2) on that opened descriptor, we'll get
> > >
> > > vfs_readdir() grabs i_mutex
> > > vfs_readdir() checks that it's dead
> > > autofs4_lookup_unhashed() calls iput()
> >
> > Can this really happen, since autofs4_lookup_unhashed() is only called
> > with the i_mutex held.
>
> i_mutex on a different inode (obviously - it frees the inode in question,
> so if caller held i_mutex on it, you would be in trouble every time you
> hit that codepath).

OK, I'll need to look at vfs_readdir().
I thought vfs_readdir() would take the containing directory mutex as
does ->lookup().

2008-06-03 17:47:39

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

Al Viro <[email protected]> writes:

> On Wed, Jun 04, 2008 at 01:13:08AM +0800, Ian Kent wrote:
>
>> "What happens is that during an expire the situation can arise
>> that a directory is removed and another lookup is done before
>> the expire issues a completion status to the kernel module.
>> In this case, since the the lookup gets a new dentry, it doesn't
>> know that there is an expire in progress and when it posts its
>> mount request, matches the existing expire request and waits
>> for its completion. ENOENT is then returned to user space
>> from lookup (as the dentry passed in is now unhashed) without
>> having performed the mount request.
>>
>> The solution used here is to keep track of dentrys in this
>> unhashed state and reuse them, if possible, in order to
>> preserve the flags. Additionally, this infrastructure will
>> provide the framework for the reintroduction of caching
>> of mount fails removed earlier in development."
>>
>> I wasn't able to do an acceptable re-implementation of the negative
>> caching we had in 2.4 with this framework, so just ignore the last
>> sentence in the above description.
>
>> Unfortunately no, but I thought that once the dentry became unhashed
>> (aka ->rmdir() or ->unlink()) it was invisible to the dcache. But, of
>> course there may be descriptors open on the dentry, which I think is the
>> problem that's being pointed out.
>
> ... or we could have had a pending mount(2) sitting there with a reference
> to mountpoint-to-be...
>
>> Yes, that would be ideal but the reason we arrived here is that, because
>> we must release the directory mutex before calling back to the daemon
>> (the heart of the problem, actually having to drop the mutex) to perform
>> the mount, we can get a deadlock. The cause of the problem was that for
>> "create" like operations the mutex is held for ->lookup() and
>> ->revalidate() but for a "path walks" the mutex is only held for
>> ->lookup(), so if the mutex is held when we're in ->revalidate(), we
>> could never be sure that we where the code path that acquired it.
>>
>> Sorry, this last bit is unclear.
>> I'll need to work a bit harder on the explanation if you're interested
>> in checking further.
>
> I am.

commit 1864f7bd58351732593def024e73eca1f75bc352
Author: Ian Kent <[email protected]>
Date: Wed Aug 22 14:01:54 2007 -0700

autofs4: deadlock during create

Due to inconsistent locking in the VFS between calls to lookup and
revalidate deadlock can occur in the automounter.

The inconsistency is that the directory inode mutex is held for both lookup
and revalidate calls when called via lookup_hash whereas it is held only
for lookup during a path walk. Consequently, if the mutex is held during a
call to revalidate autofs4 can't release the mutex to callback the daemon
as it can't know whether it owns the mutex.

This situation happens when a process tries to create a directory within an
automount and a second process also tries to create the same directory
between the lookup and the mkdir. Since the first process has dropped the
mutex for the daemon callback, the second process takes it during
revalidate leading to deadlock between the autofs daemon and the second
process when the daemon tries to create the mount point directory.

After spending quite a bit of time trying to resolve this on more than one
occassion, using rather complex and ulgy approaches, it turns out that just
delaying the hashing of the dentry until the create operation works fine.

> Oh, well... Looks like RTFS time for me for now... Additional parts of
> braindump would be appreciated - the last time I've seriously looked at
> autofs4 internal had been ~2005 or so ;-/

Well, let me know what level of dump you'd like. I can give the 50,000
foot view, or I can give you the history of things that happened to get
us to where we are today, or anything inbetween. The more specific
your request, the quicker I can respond. A full brain-dump would take
some time!

Cheers,

Jeff

2008-06-03 17:51:16

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Wed, Jun 04, 2008 at 01:41:32AM +0800, Ian Kent wrote:

> OK, I'll need to look at vfs_readdir().
> I thought vfs_readdir() would take the containing directory mutex as
> does ->lookup().

vfs_readdir() takes i_mutex on directory it reads. I.e. on the victim
in this case. lookup has i_mutex on directory it does lookup in, i.e.
root in this case...

2008-06-03 17:52:48

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 18:50 +0100, Al Viro wrote:
> On Wed, Jun 04, 2008 at 01:41:32AM +0800, Ian Kent wrote:
>
> > OK, I'll need to look at vfs_readdir().
> > I thought vfs_readdir() would take the containing directory mutex as
> > does ->lookup().
>
> vfs_readdir() takes i_mutex on directory it reads. I.e. on the victim
> in this case. lookup has i_mutex on directory it does lookup in, i.e.
> root in this case...

Hahaha, yes, I'll need to go back and re-read that bit of code, sorry.

2008-06-03 19:19:26

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, Jun 03, 2008 at 01:46:41PM -0400, Jeff Moyer wrote:

> Well, let me know what level of dump you'd like. I can give the 50,000
> foot view, or I can give you the history of things that happened to get
> us to where we are today, or anything inbetween. The more specific
> your request, the quicker I can respond. A full brain-dump would take
> some time!

a) what the hell is going on in autofs4_free_ino()? It checks for
ino->dentry, when the only caller has just set it to NULL.

b) while we are at it, what's ino->inode doing there? AFAICS, it's
a write-only field...

c) what are possible states of autofs4 dentry and what's the supposed
life cycle of these beasts?

d)
/* For dentries of directories in the root dir */
static struct dentry_operations autofs4_root_dentry_operations = {
.d_revalidate = autofs4_revalidate,
.d_release = autofs4_dentry_release,
};

/* For other dentries */
static struct dentry_operations autofs4_dentry_operations = {
.d_revalidate = autofs4_revalidate,
.d_release = autofs4_dentry_release,
};

Just what is the difference?

e) in autofs4_tree_busy() we do atomic_read() on ino->count and dentry->d_count
What's going to keep these suckers consistent with each other in any useful
way?

2008-06-03 19:54:14

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

Al Viro <[email protected]> writes:

> On Tue, Jun 03, 2008 at 01:46:41PM -0400, Jeff Moyer wrote:
>
>> Well, let me know what level of dump you'd like. I can give the 50,000
>> foot view, or I can give you the history of things that happened to get
>> us to where we are today, or anything inbetween. The more specific
>> your request, the quicker I can respond. A full brain-dump would take
>> some time!
>
> a) what the hell is going on in autofs4_free_ino()? It checks for
> ino->dentry, when the only caller has just set it to NULL.

Probably historic. Ian?

> b) while we are at it, what's ino->inode doing there? AFAICS, it's
> a write-only field...

Another good point.

> c) what are possible states of autofs4 dentry and what's the supposed
> life cycle of these beasts?

The life cycle of a dentry for an indirect, non-browsable mount goes
something like this:

autofs4_lookup is called on behalf a process trying to walk into an
automounted directory. That dentry's d_flags is set to
DCACHE_AUTOFS_PENDING but not hashed. A waitqueue entry is created,
indexed off of the name of the dentry. A callout is made to the
automount daemon (via autofs4_wait).

The daemon looks up the directory name in its configuration. If it
finds a valid map entry, it will then create the directory using
sys_mkdir. The autofs4_lookup call on behalf of the daemon (oz_mode ==
1) will return NULL, and then the mkdir call will be made. The
autofs4_mkdir function then instantiates the dentry which, by the way,
is different from the original dentry passed to autofs4_lookup. (This
dentry also does not get the PENDING flag set, which is a bug addressed
by a patch set that Ian and I have been working on; specifically, the
idea is to reuse the dentry from the original lookup, but I digress).

The daemon then mounts the share on the given directory and issues an
ioctl to wakeup the waiter. When awakened, the waiter clears the
DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
dcache and returns that dentry if found.

Later, the dentry gets expired via another ioctl. That path sets
the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
It then calls out to the daemon to perform the unmount and rmdir. The
rmdir unhashes the dentry (and places it on the rehash list).

The dentry is removed from the rehash list if there was a racing expire
and mount or if the dentry is released.

This description is valid for the tree as it stands today. Ian and I
have been working on fixing some other race conditions which will change
the dentry life cycle (for the better, I hope).

> d)
> /* For dentries of directories in the root dir */
> static struct dentry_operations autofs4_root_dentry_operations = {
> .d_revalidate = autofs4_revalidate,
> .d_release = autofs4_dentry_release,
> };
>
> /* For other dentries */
> static struct dentry_operations autofs4_dentry_operations = {
> .d_revalidate = autofs4_revalidate,
> .d_release = autofs4_dentry_release,
> };
>
> Just what is the difference?

Nothing. There used to be, and I'm guessing Ian kept this around for,
umm, clarity?

> e) in autofs4_tree_busy() we do atomic_read() on ino->count and dentry->d_count
> What's going to keep these suckers consistent with each other in any useful
> way?

I'm afraid I'm not familiar enough with that part of the code to give
you a good answer. Ian?

Cheers,

Jeff

2008-06-03 23:00:59

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:

> autofs4_lookup is called on behalf a process trying to walk into an
> automounted directory. That dentry's d_flags is set to
> DCACHE_AUTOFS_PENDING but not hashed. A waitqueue entry is created,
> indexed off of the name of the dentry. A callout is made to the
> automount daemon (via autofs4_wait).
>
> The daemon looks up the directory name in its configuration. If it
> finds a valid map entry, it will then create the directory using
> sys_mkdir. The autofs4_lookup call on behalf of the daemon (oz_mode ==
> 1) will return NULL, and then the mkdir call will be made. The
> autofs4_mkdir function then instantiates the dentry which, by the way,
> is different from the original dentry passed to autofs4_lookup. (This
> dentry also does not get the PENDING flag set, which is a bug addressed
> by a patch set that Ian and I have been working on; specifically, the
> idea is to reuse the dentry from the original lookup, but I digress).
>
> The daemon then mounts the share on the given directory and issues an
> ioctl to wakeup the waiter. When awakened, the waiter clears the
> DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> dcache and returns that dentry if found.
> Later, the dentry gets expired via another ioctl. That path sets
> the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> It then calls out to the daemon to perform the unmount and rmdir. The
> rmdir unhashes the dentry (and places it on the rehash list).
>
> The dentry is removed from the rehash list if there was a racing expire
> and mount or if the dentry is released.
>
> This description is valid for the tree as it stands today. Ian and I
> have been working on fixing some other race conditions which will change
> the dentry life cycle (for the better, I hope).

So what happens if new lookup hits between umount and rmdir?

Another thing: would be nice to write down the expected state of dentry
(positive/negative, flags, has/hasn't ->d_fsdata, flags on ->d_fsdata)
for all stages. I'll go through the code and do that once I get some sleep,
but if you'll have time to do it before that...

FWIW, I wonder if it would be better to leave the directory alone and just
have the daemon mount the sucker elsewhere and let the kernel side move
the damn thing in place itself, along with making dentry positive and
waking the sleepers up. Then we might get away with not unlocking anything
at all... That obviously doesn't help the current systems with existing
daemon, but it might be interesting for the next autofs version...
Note that we don't even have to mount it anywhere - mount2() is close to
the top of the pile for the next couple of cycles and it'd separate
"activate fs" from "attach fully set up fs to given place", with the
former resulting in a descriptor and the latter being
mount2(Attach, dir_fd, fs_fd);
Kernel side of autofs might receive the file descriptor in question and
do the rest itself...

2008-06-04 01:40:27

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 20:18 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 01:46:41PM -0400, Jeff Moyer wrote:
>
> > Well, let me know what level of dump you'd like. I can give the 50,000
> > foot view, or I can give you the history of things that happened to get
> > us to where we are today, or anything inbetween. The more specific
> > your request, the quicker I can respond. A full brain-dump would take
> > some time!
>
> a) what the hell is going on in autofs4_free_ino()? It checks for
> ino->dentry, when the only caller has just set it to NULL.

I know.
I need to clean that up.

>
> b) while we are at it, what's ino->inode doing there? AFAICS, it's
> a write-only field...

I know.
And I think it has never been used anywhere either but I haven't removed
it from the info structure.

>
> c) what are possible states of autofs4 dentry and what's the supposed
> life cycle of these beasts?
>
> d)
> /* For dentries of directories in the root dir */
> static struct dentry_operations autofs4_root_dentry_operations = {
> .d_revalidate = autofs4_revalidate,
> .d_release = autofs4_dentry_release,
> };
>
> /* For other dentries */
> static struct dentry_operations autofs4_dentry_operations = {
> .d_revalidate = autofs4_revalidate,
> .d_release = autofs4_dentry_release,
> };
>
> Just what is the difference?

There isn't any difference.
There's no real reason to keep them different except that there are two
distinct sets of operations. I don't see any harm in retaining this.

>
> e) in autofs4_tree_busy() we do atomic_read() on ino->count and dentry->d_count
> What's going to keep these suckers consistent with each other in any useful
> way?

The only time ino->count is changed is in ->mkdir(), ->rmdir and
->symlink() and ->unlink(). So it is supposed to represent the minimal
reference count. The code in autofs4_free_ino() should go but that may
be a bug, I need to check.

2008-06-04 02:45:32

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
>
> > autofs4_lookup is called on behalf a process trying to walk into an
> > automounted directory. That dentry's d_flags is set to
> > DCACHE_AUTOFS_PENDING but not hashed. A waitqueue entry is created,
> > indexed off of the name of the dentry. A callout is made to the
> > automount daemon (via autofs4_wait).
> >
> > The daemon looks up the directory name in its configuration. If it
> > finds a valid map entry, it will then create the directory using
> > sys_mkdir. The autofs4_lookup call on behalf of the daemon (oz_mode ==
> > 1) will return NULL, and then the mkdir call will be made. The
> > autofs4_mkdir function then instantiates the dentry which, by the way,
> > is different from the original dentry passed to autofs4_lookup. (This
> > dentry also does not get the PENDING flag set, which is a bug addressed
> > by a patch set that Ian and I have been working on; specifically, the
> > idea is to reuse the dentry from the original lookup, but I digress).
> >
> > The daemon then mounts the share on the given directory and issues an
> > ioctl to wakeup the waiter. When awakened, the waiter clears the
> > DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> > dcache and returns that dentry if found.
> > Later, the dentry gets expired via another ioctl. That path sets
> > the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> > It then calls out to the daemon to perform the unmount and rmdir. The
> > rmdir unhashes the dentry (and places it on the rehash list).
> >
> > The dentry is removed from the rehash list if there was a racing expire
> > and mount or if the dentry is released.
> >
> > This description is valid for the tree as it stands today. Ian and I
> > have been working on fixing some other race conditions which will change
> > the dentry life cycle (for the better, I hope).
>
> So what happens if new lookup hits between umount and rmdir?

It will wait for the expire to complete and then wait for a mount
request to the daemon.

This is an example of how I've broken the lookup by delaying the hashing
of the dentry without providing a way for ->lookup() to pickup the same
unhashed dentry prior the directory dentry being hashed. Currently only
the first lookup after the d_drop will get this dentry.

Keeping track of the dentry between the first lookup and it's subsequent
hashing (or release) is what I want to do. But, as you point out, I also
need to keep the dentry positive.

>
> Another thing: would be nice to write down the expected state of dentry
> (positive/negative, flags, has/hasn't ->d_fsdata, flags on ->d_fsdata)
> for all stages. I'll go through the code and do that once I get some sleep,
> but if you'll have time to do it before that...

A dentry gets an info struct when it gets an inode and it should retain
it until the dentry is released.

When a dentry is selected for umount the AUTOFS_INF_EXPIRING
(ino->flags) is set and cleared upon return (synchronous expire).

The DCACHE_AUTOFS_PENDING (dentry->d_flags) flag should be set when a
mount request is to be issued to the daemon and cleared when the request
completes. I've introduced some inconsistency in setting and clearing
this flag which has compounded the delayed hashing issue.

>
> FWIW, I wonder if it would be better to leave the directory alone and just
> have the daemon mount the sucker elsewhere and let the kernel side move
> the damn thing in place itself, along with making dentry positive and
> waking the sleepers up. Then we might get away with not unlocking anything
> at all... That obviously doesn't help the current systems with existing
> daemon, but it might be interesting for the next autofs version...
> Note that we don't even have to mount it anywhere - mount2() is close to
> the top of the pile for the next couple of cycles and it'd separate
> "activate fs" from "attach fully set up fs to given place", with the
> former resulting in a descriptor and the latter being
> mount2(Attach, dir_fd, fs_fd);
> Kernel side of autofs might receive the file descriptor in question and
> do the rest itself...

Perhaps, if we didn't use /etc/mtab anywhere.
It would make a difference if we could "mount" /proc/mounts onto a file
such as /etc/mtab and everyone always did that.

2008-06-04 05:35:39

by Miklos Szeredi

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

> Perhaps, if we didn't use /etc/mtab anywhere.
> It would make a difference if we could "mount" /proc/mounts onto a file
> such as /etc/mtab and everyone always did that.

That's actually remarkably close to being possible. Loopback mounts
have already been fixed not to rely on /etc/mtab. The only major
piece missing is "user" mounts, and there's already a patchset for
that waiting for review by the VFS maintainers. After that, it's just
a "simple" issue of fixing up all the userspace pieces. Could be
finished in the next decade, possibly ;)

Miklos

2008-06-04 05:45:31

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Wed, 2008-06-04 at 07:34 +0200, Miklos Szeredi wrote:
> > Perhaps, if we didn't use /etc/mtab anywhere.
> > It would make a difference if we could "mount" /proc/mounts onto a file
> > such as /etc/mtab and everyone always did that.
>
> That's actually remarkably close to being possible. Loopback mounts
> have already been fixed not to rely on /etc/mtab. The only major
> piece missing is "user" mounts, and there's already a patchset for
> that waiting for review by the VFS maintainers. After that, it's just
> a "simple" issue of fixing up all the userspace pieces. Could be
> finished in the next decade, possibly ;)

;)

2008-06-05 07:35:22

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-03 at 08:01 -0700, Linus Torvalds wrote:
>
> On Tue, 3 Jun 2008, Ian Kent wrote:
> > >
> > > > I think it must be autofs4 doing something weird. Like this in
> > > > autofs4_lookup_unhashed():
> > > >
> > > > /*
> > > > * Make the rehashed dentry negative so the VFS
> > > > * behaves as it should.
> > > > */
> > > > if (inode) {
> > > > dentry->d_inode = NULL;
>
> Uhhuh. Yeah, that's not allowed.
>
> A dentry inode can start _out_ as NULL, but it can never later become NULL
> again until it is totally unused.

Here is a patch for autofs4 to, hopefully, resolve this.

Keep in mind this doesn't address any other autofs4 issues but I it
should allow us to identify if this was in fact the root cause of the
problem Jesper reported.

autofs4 - leave rehashed dentry positive

From: Ian Kent <[email protected]>

Correct the error of making a positive dentry negative after it has been
instantiated.

This involves removing the code in autofs4_lookup_unhashed() that
makes the dentry negative and updating autofs4_dir_symlink() and
autofs4_dir_mkdir() to recognise they have been given a postive
dentry (previously the dentry was always negative) and deal with
it. In addition the dentry info struct initialization, autofs4_init_ino(),
and the symlink free function, ino_lnkfree(), have been made aware
of this possible re-use. This is needed because the current case
re-uses a dentry in order to preserve it's flags as described in
commit f50b6f8691cae2e0064c499dd3ef3f31142987f0.

Signed-off-by: Ian Kent <[email protected]>
---

fs/autofs4/inode.c | 23 +++++++------
fs/autofs4/root.c | 95 ++++++++++++++++++++++++++++++----------------------
2 files changed, 67 insertions(+), 51 deletions(-)

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 2fdcf5e..ec9a641 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -24,8 +24,10 @@

static void ino_lnkfree(struct autofs_info *ino)
{
- kfree(ino->u.symlink);
- ino->u.symlink = NULL;
+ if (ino->u.symlink) {
+ kfree(ino->u.symlink);
+ ino->u.symlink = NULL;
+ }
}

struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
@@ -41,16 +43,17 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
if (ino == NULL)
return NULL;

- ino->flags = 0;
- ino->mode = mode;
- ino->inode = NULL;
- ino->dentry = NULL;
- ino->size = 0;
-
- INIT_LIST_HEAD(&ino->rehash);
+ if (!reinit) {
+ ino->flags = 0;
+ ino->mode = mode;
+ ino->inode = NULL;
+ ino->dentry = NULL;
+ ino->size = 0;
+ INIT_LIST_HEAD(&ino->rehash);
+ atomic_set(&ino->count, 0);
+ }

ino->last_used = jiffies;
- atomic_set(&ino->count, 0);

ino->sbi = sbi;

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index edf5b6b..6ce603b 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -555,24 +555,8 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
goto next;

if (d_unhashed(dentry)) {
- struct inode *inode = dentry->d_inode;
-
- ino = autofs4_dentry_ino(dentry);
list_del_init(&ino->rehash);
dget(dentry);
- /*
- * Make the rehashed dentry negative so the VFS
- * behaves as it should.
- */
- if (inode) {
- dentry->d_inode = NULL;
- list_del_init(&dentry->d_alias);
- spin_unlock(&dentry->d_lock);
- spin_unlock(&sbi->rehash_lock);
- spin_unlock(&dcache_lock);
- iput(inode);
- return dentry;
- }
spin_unlock(&dentry->d_lock);
spin_unlock(&sbi->rehash_lock);
spin_unlock(&dcache_lock);
@@ -728,35 +712,50 @@ static int autofs4_dir_symlink(struct inode *dir,
return -EACCES;

ino = autofs4_init_ino(ino, sbi, S_IFLNK | 0555);
- if (ino == NULL)
+ if (!ino)
return -ENOSPC;

- ino->size = strlen(symname);
- ino->u.symlink = cp = kmalloc(ino->size + 1, GFP_KERNEL);
-
- if (cp == NULL) {
- kfree(ino);
+ cp = kmalloc(ino->size + 1, GFP_KERNEL);
+ if (!cp) {
+ if (!dentry->d_fsdata)
+ kfree(ino);
return -ENOSPC;
}

strcpy(cp, symname);

- inode = autofs4_get_inode(dir->i_sb, ino);
- d_add(dentry, inode);
+ inode = dentry->d_inode;
+ if (inode)
+ d_rehash(dentry);
+ else {
+ inode = autofs4_get_inode(dir->i_sb, ino);
+ if (!inode) {
+ kfree(cp);
+ if (!dentry->d_fsdata)
+ kfree(ino);
+ return -ENOSPC;
+ }
+
+ d_add(dentry, inode);

- if (dir == dir->i_sb->s_root->d_inode)
- dentry->d_op = &autofs4_root_dentry_operations;
- else
- dentry->d_op = &autofs4_dentry_operations;
+ if (dir == dir->i_sb->s_root->d_inode)
+ dentry->d_op = &autofs4_root_dentry_operations;
+ else
+ dentry->d_op = &autofs4_dentry_operations;
+
+ dentry->d_fsdata = ino;
+ ino->dentry = dentry;
+ ino->inode = inode;
+ }
+ dget(dentry);

- dentry->d_fsdata = ino;
- ino->dentry = dget(dentry);
atomic_inc(&ino->count);
p_ino = autofs4_dentry_ino(dentry->d_parent);
if (p_ino && dentry->d_parent != dentry)
atomic_inc(&p_ino->count);
- ino->inode = inode;

+ ino->u.symlink = cp;
+ ino->size = strlen(symname);
dir->i_mtime = CURRENT_TIME;

return 0;
@@ -866,24 +865,38 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
dentry, dentry->d_name.len, dentry->d_name.name);

ino = autofs4_init_ino(ino, sbi, S_IFDIR | 0555);
- if (ino == NULL)
+ if (!ino)
return -ENOSPC;

- inode = autofs4_get_inode(dir->i_sb, ino);
- d_add(dentry, inode);
+ inode = dentry->d_inode;
+ if (inode)
+ d_rehash(dentry);
+ else {
+ inode = autofs4_get_inode(dir->i_sb, ino);
+ if (!inode) {
+ if (!dentry->d_fsdata)
+ kfree(ino);
+ return -ENOSPC;
+ }

- if (dir == dir->i_sb->s_root->d_inode)
- dentry->d_op = &autofs4_root_dentry_operations;
- else
- dentry->d_op = &autofs4_dentry_operations;
+ d_add(dentry, inode);
+
+ if (dir == dir->i_sb->s_root->d_inode)
+ dentry->d_op = &autofs4_root_dentry_operations;
+ else
+ dentry->d_op = &autofs4_dentry_operations;
+
+ dentry->d_fsdata = ino;
+ ino->dentry = dentry;
+ ino->inode = inode;
+ }
+ dget(dentry);

- dentry->d_fsdata = ino;
- ino->dentry = dget(dentry);
atomic_inc(&ino->count);
p_ino = autofs4_dentry_ino(dentry->d_parent);
if (p_ino && dentry->d_parent != dentry)
atomic_inc(&p_ino->count);
- ino->inode = inode;
+
inc_nlink(dir);
dir->i_mtime = CURRENT_TIME;

2008-06-05 21:30:37

by Linus Torvalds

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Thu, 5 Jun 2008, Ian Kent wrote:
>
> Keep in mind this doesn't address any other autofs4 issues but I it
> should allow us to identify if this was in fact the root cause of the
> problem Jesper reported.

Jesper, can you test this? I think you said you could trigger this in ~24
hours or so? I'd love to have some testing of some heavy autofs user. Even
if it's not a guarantee of a fix (due to reproducing the bug not being
entirely trivial), at least I'd like to know that it doesn't introduce any
obvious new problems either..

Linus

2008-06-05 21:34:57

by Jesper Krogh

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

Linus Torvalds wrote:
>
> On Thu, 5 Jun 2008, Ian Kent wrote:
>> Keep in mind this doesn't address any other autofs4 issues but I it
>> should allow us to identify if this was in fact the root cause of the
>> problem Jesper reported.
>
> Jesper, can you test this? I think you said you could trigger this in ~24
> hours or so? I'd love to have some testing of some heavy autofs user. Even
> if it's not a guarantee of a fix (due to reproducing the bug not being
> entirely trivial), at least I'd like to know that it doesn't introduce any
> obvious new problems either..

Yes, I'll apply it an report back. Dont expect anything before early
next week.

Jesper
--
Jesper

2008-06-05 22:31:24

by Andrew Morton

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Thu, 05 Jun 2008 15:31:37 +0800
Ian Kent <[email protected]> wrote:

>
> On Tue, 2008-06-03 at 08:01 -0700, Linus Torvalds wrote:
> >
> > On Tue, 3 Jun 2008, Ian Kent wrote:
> > > >
> > > > > I think it must be autofs4 doing something weird. Like this in
> > > > > autofs4_lookup_unhashed():
> > > > >
> > > > > /*
> > > > > * Make the rehashed dentry negative so the VFS
> > > > > * behaves as it should.
> > > > > */
> > > > > if (inode) {
> > > > > dentry->d_inode = NULL;
> >
> > Uhhuh. Yeah, that's not allowed.
> >
> > A dentry inode can start _out_ as NULL, but it can never later become NULL
> > again until it is totally unused.
>
> Here is a patch for autofs4 to, hopefully, resolve this.
>
> Keep in mind this doesn't address any other autofs4 issues but I it
> should allow us to identify if this was in fact the root cause of the
> problem Jesper reported.
>
> autofs4 - leave rehashed dentry positive
>
> From: Ian Kent <[email protected]>
>
> Correct the error of making a positive dentry negative after it has been
> instantiated.
>
> This involves removing the code in autofs4_lookup_unhashed() that
> makes the dentry negative and updating autofs4_dir_symlink() and
> autofs4_dir_mkdir() to recognise they have been given a postive
> dentry (previously the dentry was always negative) and deal with
> it. In addition the dentry info struct initialization, autofs4_init_ino(),
> and the symlink free function, ino_lnkfree(), have been made aware
> of this possible re-use. This is needed because the current case
> re-uses a dentry in order to preserve it's flags as described in
> commit f50b6f8691cae2e0064c499dd3ef3f31142987f0.
>
> ...
>
> ...
>
> diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
> index edf5b6b..6ce603b 100644
> --- a/fs/autofs4/root.c
> +++ b/fs/autofs4/root.c
> @@ -555,24 +555,8 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
> goto next;
>
> if (d_unhashed(dentry)) {
> - struct inode *inode = dentry->d_inode;
> -
> - ino = autofs4_dentry_ino(dentry);
> list_del_init(&ino->rehash);
> dget(dentry);
> - /*
> - * Make the rehashed dentry negative so the VFS
> - * behaves as it should.
> - */
> - if (inode) {
> - dentry->d_inode = NULL;
> - list_del_init(&dentry->d_alias);
> - spin_unlock(&dentry->d_lock);
> - spin_unlock(&sbi->rehash_lock);
> - spin_unlock(&dcache_lock);
> - iput(inode);
> - return dentry;
> - }
> spin_unlock(&dentry->d_lock);
> spin_unlock(&sbi->rehash_lock);
> spin_unlock(&dcache_lock);
> @@ -728,35 +712,50 @@ static int autofs4_dir_symlink(struct inode *dir,
> return -EACCES;
>
> ino = autofs4_init_ino(ino, sbi, S_IFLNK | 0555);
> - if (ino == NULL)
> + if (!ino)
> return -ENOSPC;

Should have been ENOMEM, I guess.

> - ino->size = strlen(symname);
> - ino->u.symlink = cp = kmalloc(ino->size + 1, GFP_KERNEL);
> -
> - if (cp == NULL) {
> - kfree(ino);
> + cp = kmalloc(ino->size + 1, GFP_KERNEL);
> + if (!cp) {
> + if (!dentry->d_fsdata)
> + kfree(ino);

OK, so here we work out that autofs4_init_ino() had to allocate a new
autofs_info and if so, free it here. It took me a moment..

> return -ENOSPC;

ENOMEM?

> }
>
> strcpy(cp, symname);
>
> - inode = autofs4_get_inode(dir->i_sb, ino);
> - d_add(dentry, inode);
> + inode = dentry->d_inode;
> + if (inode)
> + d_rehash(dentry);
> + else {
> + inode = autofs4_get_inode(dir->i_sb, ino);
> + if (!inode) {
> + kfree(cp);
> + if (!dentry->d_fsdata)
> + kfree(ino);
> + return -ENOSPC;
> + }
> +
> + d_add(dentry, inode);
>
> - if (dir == dir->i_sb->s_root->d_inode)
> - dentry->d_op = &autofs4_root_dentry_operations;
> - else
> - dentry->d_op = &autofs4_dentry_operations;
> + if (dir == dir->i_sb->s_root->d_inode)
> + dentry->d_op = &autofs4_root_dentry_operations;
> + else
> + dentry->d_op = &autofs4_dentry_operations;
> +
> + dentry->d_fsdata = ino;
> + ino->dentry = dentry;
> + ino->inode = inode;
> + }
> + dget(dentry);
>
> - dentry->d_fsdata = ino;
> - ino->dentry = dget(dentry);
> atomic_inc(&ino->count);
> p_ino = autofs4_dentry_ino(dentry->d_parent);
> if (p_ino && dentry->d_parent != dentry)
> atomic_inc(&p_ino->count);
> - ino->inode = inode;
>
> + ino->u.symlink = cp;
> + ino->size = strlen(symname);
> dir->i_mtime = CURRENT_TIME;

This all seems a bit ungainly. I assume that on entry to
autofs4_dir_symlink(), ino->size is equal to strlen(symname)? If it's
not, that strcpy() will overrun.

But if ino->size _is_ equal to strlen(symname) then why did we just
recalculate the same thing?

I'm suspecting we can zap a lump of code and just do

cp = kstrdup(symname, GFP_KERNEL);

Anyway, please check that.

> return 0;
> @@ -866,24 +865,38 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> dentry, dentry->d_name.len, dentry->d_name.name);
>
> ino = autofs4_init_ino(ino, sbi, S_IFDIR | 0555);
> - if (ino == NULL)
> + if (!ino)
> return -ENOSPC;

ENOMEM?

> - inode = autofs4_get_inode(dir->i_sb, ino);
> - d_add(dentry, inode);
> + inode = dentry->d_inode;
> + if (inode)
> + d_rehash(dentry);
> + else {
> + inode = autofs4_get_inode(dir->i_sb, ino);
> + if (!inode) {
> + if (!dentry->d_fsdata)
> + kfree(ino);
> + return -ENOSPC;
> + }
>
> - if (dir == dir->i_sb->s_root->d_inode)
> - dentry->d_op = &autofs4_root_dentry_operations;
> - else
> - dentry->d_op = &autofs4_dentry_operations;
> + d_add(dentry, inode);
> +
> + if (dir == dir->i_sb->s_root->d_inode)
> + dentry->d_op = &autofs4_root_dentry_operations;
> + else
> + dentry->d_op = &autofs4_dentry_operations;
> +
> + dentry->d_fsdata = ino;
> + ino->dentry = dentry;
> + ino->inode = inode;
> + }
> + dget(dentry);

This all looks very similar to the code in autofs4_dir_symlink(). Some
refactoring might be needed at some stage?

> - dentry->d_fsdata = ino;
> - ino->dentry = dget(dentry);
> atomic_inc(&ino->count);
> p_ino = autofs4_dentry_ino(dentry->d_parent);
> if (p_ino && dentry->d_parent != dentry)
> atomic_inc(&p_ino->count);
> - ino->inode = inode;
> +
> inc_nlink(dir);
> dir->i_mtime = CURRENT_TIME;
>
>

2008-06-06 02:43:03

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Thu, 2008-06-05 at 14:29 -0700, Linus Torvalds wrote:
>
> On Thu, 5 Jun 2008, Ian Kent wrote:
> >
> > Keep in mind this doesn't address any other autofs4 issues but I it
> > should allow us to identify if this was in fact the root cause of the
> > problem Jesper reported.
>
> Jesper, can you test this? I think you said you could trigger this in ~24
> hours or so? I'd love to have some testing of some heavy autofs user. Even
> if it's not a guarantee of a fix (due to reproducing the bug not being
> entirely trivial), at least I'd like to know that it doesn't introduce any
> obvious new problems either..

I'm continuing to test this also.
My initial testing was fine but late last night, with some of my other
patches applied as well, I started seeing some strange problems.

Ian

2008-06-06 02:51:21

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Thu, 2008-06-05 at 15:30 -0700, Andrew Morton wrote:
> On Thu, 05 Jun 2008 15:31:37 +0800
> Ian Kent <[email protected]> wrote:
>
> >
> > On Tue, 2008-06-03 at 08:01 -0700, Linus Torvalds wrote:
> > >
> > > On Tue, 3 Jun 2008, Ian Kent wrote:
> > > > >
> > > > > > I think it must be autofs4 doing something weird. Like this in
> > > > > > autofs4_lookup_unhashed():
> > > > > >
> > > > > > /*
> > > > > > * Make the rehashed dentry negative so the VFS
> > > > > > * behaves as it should.
> > > > > > */
> > > > > > if (inode) {
> > > > > > dentry->d_inode = NULL;
> > >
> > > Uhhuh. Yeah, that's not allowed.
> > >
> > > A dentry inode can start _out_ as NULL, but it can never later become NULL
> > > again until it is totally unused.
> >
> > Here is a patch for autofs4 to, hopefully, resolve this.
> >
> > Keep in mind this doesn't address any other autofs4 issues but I it
> > should allow us to identify if this was in fact the root cause of the
> > problem Jesper reported.
> >
> > autofs4 - leave rehashed dentry positive
> >
> > From: Ian Kent <[email protected]>
> >
> > Correct the error of making a positive dentry negative after it has been
> > instantiated.
> >
> > This involves removing the code in autofs4_lookup_unhashed() that
> > makes the dentry negative and updating autofs4_dir_symlink() and
> > autofs4_dir_mkdir() to recognise they have been given a postive
> > dentry (previously the dentry was always negative) and deal with
> > it. In addition the dentry info struct initialization, autofs4_init_ino(),
> > and the symlink free function, ino_lnkfree(), have been made aware
> > of this possible re-use. This is needed because the current case
> > re-uses a dentry in order to preserve it's flags as described in
> > commit f50b6f8691cae2e0064c499dd3ef3f31142987f0.
> >
> > ...
> >
> > ...
> >
> > diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
> > index edf5b6b..6ce603b 100644
> > --- a/fs/autofs4/root.c
> > +++ b/fs/autofs4/root.c
> > @@ -555,24 +555,8 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
> > goto next;
> >
> > if (d_unhashed(dentry)) {
> > - struct inode *inode = dentry->d_inode;
> > -
> > - ino = autofs4_dentry_ino(dentry);
> > list_del_init(&ino->rehash);
> > dget(dentry);
> > - /*
> > - * Make the rehashed dentry negative so the VFS
> > - * behaves as it should.
> > - */
> > - if (inode) {
> > - dentry->d_inode = NULL;
> > - list_del_init(&dentry->d_alias);
> > - spin_unlock(&dentry->d_lock);
> > - spin_unlock(&sbi->rehash_lock);
> > - spin_unlock(&dcache_lock);
> > - iput(inode);
> > - return dentry;
> > - }
> > spin_unlock(&dentry->d_lock);
> > spin_unlock(&sbi->rehash_lock);
> > spin_unlock(&dcache_lock);
> > @@ -728,35 +712,50 @@ static int autofs4_dir_symlink(struct inode *dir,
> > return -EACCES;
> >
> > ino = autofs4_init_ino(ino, sbi, S_IFLNK | 0555);
> > - if (ino == NULL)
> > + if (!ino)
> > return -ENOSPC;
>
> Should have been ENOMEM, I guess.

Ha, yeah, I think it should be.
I just used the errno that has been there all along without thinking.

>
> > - ino->size = strlen(symname);
> > - ino->u.symlink = cp = kmalloc(ino->size + 1, GFP_KERNEL);
> > -
> > - if (cp == NULL) {
> > - kfree(ino);
> > + cp = kmalloc(ino->size + 1, GFP_KERNEL);
> > + if (!cp) {
> > + if (!dentry->d_fsdata)
> > + kfree(ino);
>
> OK, so here we work out that autofs4_init_ino() had to allocate a new
> autofs_info and if so, free it here. It took me a moment..

That is right, but as it is now, this will always be a new allocation.
If all goes well (yeah right) I will be allocating the info struct in
->lookup() in a subsequent patch.

>
> > return -ENOSPC;
>
> ENOMEM?
>
> > }
> >
> > strcpy(cp, symname);
> >
> > - inode = autofs4_get_inode(dir->i_sb, ino);
> > - d_add(dentry, inode);
> > + inode = dentry->d_inode;
> > + if (inode)
> > + d_rehash(dentry);
> > + else {
> > + inode = autofs4_get_inode(dir->i_sb, ino);
> > + if (!inode) {
> > + kfree(cp);
> > + if (!dentry->d_fsdata)
> > + kfree(ino);
> > + return -ENOSPC;
> > + }
> > +
> > + d_add(dentry, inode);
> >
> > - if (dir == dir->i_sb->s_root->d_inode)
> > - dentry->d_op = &autofs4_root_dentry_operations;
> > - else
> > - dentry->d_op = &autofs4_dentry_operations;
> > + if (dir == dir->i_sb->s_root->d_inode)
> > + dentry->d_op = &autofs4_root_dentry_operations;
> > + else
> > + dentry->d_op = &autofs4_dentry_operations;
> > +
> > + dentry->d_fsdata = ino;
> > + ino->dentry = dentry;
> > + ino->inode = inode;
> > + }
> > + dget(dentry);
> >
> > - dentry->d_fsdata = ino;
> > - ino->dentry = dget(dentry);
> > atomic_inc(&ino->count);
> > p_ino = autofs4_dentry_ino(dentry->d_parent);
> > if (p_ino && dentry->d_parent != dentry)
> > atomic_inc(&p_ino->count);
> > - ino->inode = inode;
> >
> > + ino->u.symlink = cp;
> > + ino->size = strlen(symname);
> > dir->i_mtime = CURRENT_TIME;
>
> This all seems a bit ungainly. I assume that on entry to
> autofs4_dir_symlink(), ino->size is equal to strlen(symname)? If it's
> not, that strcpy() will overrun.
>
> But if ino->size _is_ equal to strlen(symname) then why did we just
> recalculate the same thing?
>
> I'm suspecting we can zap a lump of code and just do
>
> cp = kstrdup(symname, GFP_KERNEL);
>
> Anyway, please check that.
>
> > return 0;
> > @@ -866,24 +865,38 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> > dentry, dentry->d_name.len, dentry->d_name.name);
> >
> > ino = autofs4_init_ino(ino, sbi, S_IFDIR | 0555);
> > - if (ino == NULL)
> > + if (!ino)
> > return -ENOSPC;
>
> ENOMEM?
>
> > - inode = autofs4_get_inode(dir->i_sb, ino);
> > - d_add(dentry, inode);
> > + inode = dentry->d_inode;
> > + if (inode)
> > + d_rehash(dentry);
> > + else {
> > + inode = autofs4_get_inode(dir->i_sb, ino);
> > + if (!inode) {
> > + if (!dentry->d_fsdata)
> > + kfree(ino);
> > + return -ENOSPC;
> > + }
> >
> > - if (dir == dir->i_sb->s_root->d_inode)
> > - dentry->d_op = &autofs4_root_dentry_operations;
> > - else
> > - dentry->d_op = &autofs4_dentry_operations;
> > + d_add(dentry, inode);
> > +
> > + if (dir == dir->i_sb->s_root->d_inode)
> > + dentry->d_op = &autofs4_root_dentry_operations;
> > + else
> > + dentry->d_op = &autofs4_dentry_operations;
> > +
> > + dentry->d_fsdata = ino;
> > + ino->dentry = dentry;
> > + ino->inode = inode;
> > + }
> > + dget(dentry);
>
> This all looks very similar to the code in autofs4_dir_symlink(). Some
> refactoring might be needed at some stage?
>
> > - dentry->d_fsdata = ino;
> > - ino->dentry = dget(dentry);
> > atomic_inc(&ino->count);
> > p_ino = autofs4_dentry_ino(dentry->d_parent);
> > if (p_ino && dentry->d_parent != dentry)
> > atomic_inc(&p_ino->count);
> > - ino->inode = inode;
> > +
> > inc_nlink(dir);
> > dir->i_mtime = CURRENT_TIME;
> >
> >

2008-06-06 06:24:00

by Jesper Krogh

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

Hi.

This isn't a test of the proposed patch. I just got another variatioin
of the problem in the log. (I've tried running the automount daemon both
with and without the --ghost option) that is the only change I can see.
Still 2.6.26-rc4..

Jun 5 16:13:15 node37 kernel: [17388710.169561] BUG: unable to handle
kernel NULL pointer dereference at 00000000000000b2
Jun 5 16:13:15 node37 automount[28691]: mount(nfs): nfs: mount failure
hest.nzcorp.net:/z/fx1200 on /nfs/fx1200
Jun 5 16:13:15 node37 automount[28691]: failed to mount /nfs/fx1200
Jun 5 16:13:15 node37 kernel: [17388710.217273] IP: [graft_tree+77/288]
graft_tree+0x4d/0x120
Jun 5 16:13:15 node37 kernel: [17388710.217273] PGD f9e75067 PUD
f681e067 PMD 0
Jun 5 16:13:15 node37 kernel: [17388710.217273] Oops: 0000 [1] SMP
Jun 5 16:13:15 node37 kernel: [17388710.217273] CPU 1
Jun 5 16:13:15 node37 kernel: [17388710.217273] Modules linked in: nfs
lockd sunrpc autofs4 ipv6 af_packet usbhid hid uhci_hcd ehci_hcd usbkbd
fuse parport_pc lp parport i2c_amd756 serio_raw psmouse pcspkr container
i2c_core shpchp k8temp button amd_rng evdev pci_hotplug ext3 jbd mbcache
sg sd_mod ide_cd_mod cdrom floppy mptspi mptscsih mptbase
scsi_transport_spi ohci_hcd tg3 usbcore amd74xx ide_core ata_generic
libata scsi_mod dock thermal processor fan thermal_sys
Jun 5 16:13:15 node37 kernel: [17388710.217273] Pid: 28693, comm:
mount.nfs Not tainted 2.6.26-rc4 #1
Jun 5 16:13:15 node37 kernel: [17388710.993688] RIP:
0010:[graft_tree+77/288] [graft_tree+77/288] graft_tree+0x4d/0x120
Jun 5 16:13:15 node37 kernel: [17388710.993688] RSP:
0000:ffff8100f9c85e08 EFLAGS: 00010246
Jun 5 16:13:15 node37 kernel: [17388710.993688] RAX: ffff8100bfbc0270
RBX: 00000000ffffffec RCX: 0000000000000000
Jun 5 16:13:15 node37 kernel: [17388711.245666] RDX: ffff8100f9ec5900
RSI: ffff8100f9c85e68 RDI: ffff8100bae1f800
Jun 5 16:13:15 node37 kernel: [17388711.245666] RBP: ffff8100bae1f800
R08: 0000000000000000 R09: 0000000000000001
Jun 5 16:13:15 node37 kernel: [17388711.245666] R10: 0000000000000001
R11: ffffffff803011c0 R12: ffff8100f9c85e68
Jun 5 16:13:15 node37 kernel: [17388711.513641] R13: 0000000000000000
R14: 000000000000000b R15: 000000000000000b
Jun 5 16:13:15 node37 kernel: [17388711.513641] FS:
00007fd02f2cf6e0(0000) GS:ffff8100fbb0e280(0000) knlGS:00000000557fc6b0
Jun 5 16:13:15 node37 kernel: [17388711.701623] CS: 0010 DS: 0000 ES:
0000 CR0: 000000008005003b
Jun 5 16:13:15 node37 kernel: [17388711.701623] CR2: 00000000000000b2
CR3: 00000000f9f49000 CR4: 00000000000006e0
Jun 5 16:13:15 node37 kernel: [17388711.701623] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
Jun 5 16:13:15 node37 kernel: [17388711.701623] DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 5 16:13:15 node37 kernel: [17388711.701623] Process mount.nfs (pid:
28693, threadinfo ffff8100f9c84000, task ffff8100f6815640)
Jun 5 16:13:15 node37 kernel: [17388712.145575] Stack:
ffff8100f9c85e68 ffff8100f9c85e70 ffff8100bae1f800 ffffffff802b2622
Jun 5 16:13:15 node37 kernel: [17388712.145575] 0000000000000006
0000000000000000 ffff8100f695f000 ffff8100f695e000
Jun 5 16:13:15 node37 kernel: [17388712.145575] ffff8100f695d000
ffffffff802b49e9 000000004847f4ef 0000000000000000
Jun 5 16:13:15 node37 kernel: [17388712.145575] Call Trace:
Jun 5 16:13:15 node37 kernel: [17388712.145575] [do_add_mount+162/320]
? do_add_mount+0xa2/0x140
Jun 5 16:13:15 node37 kernel: [17388712.145575] [do_mount+505/592] ?
do_mount+0x1f9/0x250
Jun 5 16:13:15 node37 kernel: [17388712.145575]
[copy_mount_options+269/384] ? copy_mount_options+0x10d/0x180
Jun 5 16:13:15 node37 kernel: [17388712.145575] [sys_mount+155/256] ?
sys_mount+0x9b/0x100
Jun 5 16:13:15 node37 kernel: [17388712.145575]
[system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80
Jun 5 16:13:15 node37 kernel: [17388712.145575]
Jun 5 16:13:15 node37 kernel: [17388712.145575]
Jun 5 16:13:15 node37 kernel: [17388712.145575] Code: f7 40 58 00 00 00
80 74 15 89 d8 48 8b 6c 24 08 48 8b 1c 24 4c 8b 64 24 10 48 83 c4 18 c3
48 8b 46 08 bb ec ff ff ff 48 8b 48 10 <0f> b7 81 b2 00 00 00 25 00 f0
00 00 3d 00 40 00 00 48 8b 47 20
Jun 5 16:13:15 node37 kernel: [17388712.145575] RIP
[graft_tree+77/288] graft_tree+0x4d/0x120
Jun 5 16:13:15 node37 kernel: [17388712.145575] RSP <ffff8100f9c85e08>
Jun 5 16:13:15 node37 kernel: [17388712.145575] CR2: 00000000000000b2
Jun 5 16:13:15 node37 kernel: [17388715.129847] ---[ end trace
f3c4579f529c23bf ]---

I'll apply the patch today and get some nodes booted up on it.

--
Jesper

2008-06-06 08:25:28

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Fri, 2008-06-06 at 08:23 +0200, Jesper Krogh wrote:
> Hi.
>
> This isn't a test of the proposed patch. I just got another variatioin
> of the problem in the log. (I've tried running the automount daemon both
> with and without the --ghost option) that is the only change I can see.
> Still 2.6.26-rc4..

Right.

Whether that would make a difference depends largely on your map
configuration. If you have simple indirect or direct maps then then
using the --ghost option (or just adding the "browse" option if you're
using version 5) should prevent the code that turns the dentry negative
from being executed at all. If you're using submounts in your map, or
the "hosts" map or you have multi-mount entries in the maps then that
code could still be executed.

>
> Jun 5 16:13:15 node37 kernel: [17388710.169561] BUG: unable to handle
> kernel NULL pointer dereference at 00000000000000b2
> Jun 5 16:13:15 node37 automount[28691]: mount(nfs): nfs: mount failure
> hest.nzcorp.net:/z/fx1200 on /nfs/fx1200
> Jun 5 16:13:15 node37 automount[28691]: failed to mount /nfs/fx1200
> Jun 5 16:13:15 node37 kernel: [17388710.217273] IP: [graft_tree+77/288]
> graft_tree+0x4d/0x120
> Jun 5 16:13:15 node37 kernel: [17388710.217273] PGD f9e75067 PUD
> f681e067 PMD 0
> Jun 5 16:13:15 node37 kernel: [17388710.217273] Oops: 0000 [1] SMP
> Jun 5 16:13:15 node37 kernel: [17388710.217273] CPU 1
> Jun 5 16:13:15 node37 kernel: [17388710.217273] Modules linked in: nfs
> lockd sunrpc autofs4 ipv6 af_packet usbhid hid uhci_hcd ehci_hcd usbkbd
> fuse parport_pc lp parport i2c_amd756 serio_raw psmouse pcspkr container
> i2c_core shpchp k8temp button amd_rng evdev pci_hotplug ext3 jbd mbcache
> sg sd_mod ide_cd_mod cdrom floppy mptspi mptscsih mptbase
> scsi_transport_spi ohci_hcd tg3 usbcore amd74xx ide_core ata_generic
> libata scsi_mod dock thermal processor fan thermal_sys
> Jun 5 16:13:15 node37 kernel: [17388710.217273] Pid: 28693, comm:
> mount.nfs Not tainted 2.6.26-rc4 #1
> Jun 5 16:13:15 node37 kernel: [17388710.993688] RIP:
> 0010:[graft_tree+77/288] [graft_tree+77/288] graft_tree+0x4d/0x120
> Jun 5 16:13:15 node37 kernel: [17388710.993688] RSP:
> 0000:ffff8100f9c85e08 EFLAGS: 00010246
> Jun 5 16:13:15 node37 kernel: [17388710.993688] RAX: ffff8100bfbc0270
> RBX: 00000000ffffffec RCX: 0000000000000000
> Jun 5 16:13:15 node37 kernel: [17388711.245666] RDX: ffff8100f9ec5900
> RSI: ffff8100f9c85e68 RDI: ffff8100bae1f800
> Jun 5 16:13:15 node37 kernel: [17388711.245666] RBP: ffff8100bae1f800
> R08: 0000000000000000 R09: 0000000000000001
> Jun 5 16:13:15 node37 kernel: [17388711.245666] R10: 0000000000000001
> R11: ffffffff803011c0 R12: ffff8100f9c85e68
> Jun 5 16:13:15 node37 kernel: [17388711.513641] R13: 0000000000000000
> R14: 000000000000000b R15: 000000000000000b
> Jun 5 16:13:15 node37 kernel: [17388711.513641] FS:
> 00007fd02f2cf6e0(0000) GS:ffff8100fbb0e280(0000) knlGS:00000000557fc6b0
> Jun 5 16:13:15 node37 kernel: [17388711.701623] CS: 0010 DS: 0000 ES:
> 0000 CR0: 000000008005003b
> Jun 5 16:13:15 node37 kernel: [17388711.701623] CR2: 00000000000000b2
> CR3: 00000000f9f49000 CR4: 00000000000006e0
> Jun 5 16:13:15 node37 kernel: [17388711.701623] DR0: 0000000000000000
> DR1: 0000000000000000 DR2: 0000000000000000
> Jun 5 16:13:15 node37 kernel: [17388711.701623] DR3: 0000000000000000
> DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Jun 5 16:13:15 node37 kernel: [17388711.701623] Process mount.nfs (pid:
> 28693, threadinfo ffff8100f9c84000, task ffff8100f6815640)
> Jun 5 16:13:15 node37 kernel: [17388712.145575] Stack:
> ffff8100f9c85e68 ffff8100f9c85e70 ffff8100bae1f800 ffffffff802b2622
> Jun 5 16:13:15 node37 kernel: [17388712.145575] 0000000000000006
> 0000000000000000 ffff8100f695f000 ffff8100f695e000
> Jun 5 16:13:15 node37 kernel: [17388712.145575] ffff8100f695d000
> ffffffff802b49e9 000000004847f4ef 0000000000000000
> Jun 5 16:13:15 node37 kernel: [17388712.145575] Call Trace:
> Jun 5 16:13:15 node37 kernel: [17388712.145575] [do_add_mount+162/320]
> ? do_add_mount+0xa2/0x140
> Jun 5 16:13:15 node37 kernel: [17388712.145575] [do_mount+505/592] ?
> do_mount+0x1f9/0x250
> Jun 5 16:13:15 node37 kernel: [17388712.145575]
> [copy_mount_options+269/384] ? copy_mount_options+0x10d/0x180
> Jun 5 16:13:15 node37 kernel: [17388712.145575] [sys_mount+155/256] ?
> sys_mount+0x9b/0x100
> Jun 5 16:13:15 node37 kernel: [17388712.145575]
> [system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80
> Jun 5 16:13:15 node37 kernel: [17388712.145575]
> Jun 5 16:13:15 node37 kernel: [17388712.145575]
> Jun 5 16:13:15 node37 kernel: [17388712.145575] Code: f7 40 58 00 00 00
> 80 74 15 89 d8 48 8b 6c 24 08 48 8b 1c 24 4c 8b 64 24 10 48 83 c4 18 c3
> 48 8b 46 08 bb ec ff ff ff 48 8b 48 10 <0f> b7 81 b2 00 00 00 25 00 f0
> 00 00 3d 00 40 00 00 48 8b 47 20
> Jun 5 16:13:15 node37 kernel: [17388712.145575] RIP
> [graft_tree+77/288] graft_tree+0x4d/0x120
> Jun 5 16:13:15 node37 kernel: [17388712.145575] RSP <ffff8100f9c85e08>
> Jun 5 16:13:15 node37 kernel: [17388712.145575] CR2: 00000000000000b2
> Jun 5 16:13:15 node37 kernel: [17388715.129847] ---[ end trace
> f3c4579f529c23bf ]---
>
> I'll apply the patch today and get some nodes booted up on it.
>

2008-06-06 08:29:17

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Fri, 2008-06-06 at 16:21 +0800, Ian Kent wrote:
> On Fri, 2008-06-06 at 08:23 +0200, Jesper Krogh wrote:
> > Hi.
> >
> > This isn't a test of the proposed patch. I just got another variatioin
> > of the problem in the log. (I've tried running the automount daemon both
> > with and without the --ghost option) that is the only change I can see.
> > Still 2.6.26-rc4..
>
> Right.
>
> Whether that would make a difference depends largely on your map
> configuration. If you have simple indirect or direct maps then then
> using the --ghost option (or just adding the "browse" option if you're
> using version 5) should prevent the code that turns the dentry negative
> from being executed at all. If you're using submounts in your map, or
> the "hosts" map or you have multi-mount entries in the maps then that
> code could still be executed.
>

btw, I'm currently testing with these additional changes.
I don't think they will result in functional differences but it's best
we keep in sync.

autofs4 - leave rehash dentry positive fix

From: Ian Kent <[email protected]>

Change ENOSPC to ENOMEM.
Make autofs4_init_ino() always set mode field.

Signed-off-by: Ian Kent <[email protected]>
---

fs/autofs4/inode.c | 2 +-
fs/autofs4/root.c | 10 +++++-----
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index ec9a641..3221506 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -45,7 +45,6 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,

if (!reinit) {
ino->flags = 0;
- ino->mode = mode;
ino->inode = NULL;
ino->dentry = NULL;
ino->size = 0;
@@ -53,6 +52,7 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
atomic_set(&ino->count, 0);
}

+ ino->mode = mode;
ino->last_used = jiffies;

ino->sbi = sbi;
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index 6ce603b..f438e6b 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -713,13 +713,13 @@ static int autofs4_dir_symlink(struct inode *dir,

ino = autofs4_init_ino(ino, sbi, S_IFLNK | 0555);
if (!ino)
- return -ENOSPC;
+ return -ENOMEM;

cp = kmalloc(ino->size + 1, GFP_KERNEL);
if (!cp) {
if (!dentry->d_fsdata)
kfree(ino);
- return -ENOSPC;
+ return -ENOMEM;
}

strcpy(cp, symname);
@@ -733,7 +733,7 @@ static int autofs4_dir_symlink(struct inode *dir,
kfree(cp);
if (!dentry->d_fsdata)
kfree(ino);
- return -ENOSPC;
+ return -ENOMEM;
}

d_add(dentry, inode);
@@ -866,7 +866,7 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)

ino = autofs4_init_ino(ino, sbi, S_IFDIR | 0555);
if (!ino)
- return -ENOSPC;
+ return -ENOMEM;

inode = dentry->d_inode;
if (inode)
@@ -876,7 +876,7 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
if (!inode) {
if (!dentry->d_fsdata)
kfree(ino);
- return -ENOSPC;
+ return -ENOMEM;
}

d_add(dentry, inode);

2008-06-10 05:01:21

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Wed, 2008-06-04 at 10:42 +0800, Ian Kent wrote:
> On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
> > On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
> >
> > > autofs4_lookup is called on behalf a process trying to walk into an
> > > automounted directory. That dentry's d_flags is set to
> > > DCACHE_AUTOFS_PENDING but not hashed. A waitqueue entry is created,
> > > indexed off of the name of the dentry. A callout is made to the
> > > automount daemon (via autofs4_wait).
> > >
> > > The daemon looks up the directory name in its configuration. If it
> > > finds a valid map entry, it will then create the directory using
> > > sys_mkdir. The autofs4_lookup call on behalf of the daemon (oz_mode ==
> > > 1) will return NULL, and then the mkdir call will be made. The
> > > autofs4_mkdir function then instantiates the dentry which, by the way,
> > > is different from the original dentry passed to autofs4_lookup. (This
> > > dentry also does not get the PENDING flag set, which is a bug addressed
> > > by a patch set that Ian and I have been working on; specifically, the
> > > idea is to reuse the dentry from the original lookup, but I digress).
> > >
> > > The daemon then mounts the share on the given directory and issues an
> > > ioctl to wakeup the waiter. When awakened, the waiter clears the
> > > DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> > > dcache and returns that dentry if found.
> > > Later, the dentry gets expired via another ioctl. That path sets
> > > the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> > > It then calls out to the daemon to perform the unmount and rmdir. The
> > > rmdir unhashes the dentry (and places it on the rehash list).
> > >
> > > The dentry is removed from the rehash list if there was a racing expire
> > > and mount or if the dentry is released.
> > >
> > > This description is valid for the tree as it stands today. Ian and I
> > > have been working on fixing some other race conditions which will change
> > > the dentry life cycle (for the better, I hope).
> >
> > So what happens if new lookup hits between umount and rmdir?
>
> It will wait for the expire to complete and then wait for a mount
> request to the daemon.

Actually, that explanation is a bit simple minded.

It should wait for the expire in ->revalidate().
Following the expire completion d_invalidate() should return 0, since
the dentry is now unhashed, which causes ->revalidate() to return 0.
do_lookup() should see this and call a ->lookup().

But maybe I've missed something as I'm seeing a problem now.

Ian

2008-06-10 06:29:16

by Jesper Krogh

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

Ian Kent wrote:
> On Wed, 2008-06-04 at 10:42 +0800, Ian Kent wrote:
>> On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
>>> On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
>>>
>>>> autofs4_lookup is called on behalf a process trying to walk into an
>>>> automounted directory. That dentry's d_flags is set to
>>>> DCACHE_AUTOFS_PENDING but not hashed. A waitqueue entry is created,
>>>> indexed off of the name of the dentry. A callout is made to the
>>>> automount daemon (via autofs4_wait).
>>>>
>>>> The daemon looks up the directory name in its configuration. If it
>>>> finds a valid map entry, it will then create the directory using
>>>> sys_mkdir. The autofs4_lookup call on behalf of the daemon (oz_mode ==
>>>> 1) will return NULL, and then the mkdir call will be made. The
>>>> autofs4_mkdir function then instantiates the dentry which, by the way,
>>>> is different from the original dentry passed to autofs4_lookup. (This
>>>> dentry also does not get the PENDING flag set, which is a bug addressed
>>>> by a patch set that Ian and I have been working on; specifically, the
>>>> idea is to reuse the dentry from the original lookup, but I digress).
>>>>
>>>> The daemon then mounts the share on the given directory and issues an
>>>> ioctl to wakeup the waiter. When awakened, the waiter clears the
>>>> DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
>>>> dcache and returns that dentry if found.
>>>> Later, the dentry gets expired via another ioctl. That path sets
>>>> the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
>>>> It then calls out to the daemon to perform the unmount and rmdir. The
>>>> rmdir unhashes the dentry (and places it on the rehash list).
>>>>
>>>> The dentry is removed from the rehash list if there was a racing expire
>>>> and mount or if the dentry is released.
>>>>
>>>> This description is valid for the tree as it stands today. Ian and I
>>>> have been working on fixing some other race conditions which will change
>>>> the dentry life cycle (for the better, I hope).
>>> So what happens if new lookup hits between umount and rmdir?
>> It will wait for the expire to complete and then wait for a mount
>> request to the daemon.
>
> Actually, that explanation is a bit simple minded.
>
> It should wait for the expire in ->revalidate().
> Following the expire completion d_invalidate() should return 0, since
> the dentry is now unhashed, which causes ->revalidate() to return 0.
> do_lookup() should see this and call a ->lookup().
>
> But maybe I've missed something as I'm seeing a problem now.

Ok. Ive been running on the patch for a few days now .. and didn't see
any problems. But that being said, I also turned off the --ghost option
to autofs so if it actually is the patch or the different codepaths
being used, I dont know. Since this is a production system, I'm a bit
reluctant to just change a working setup to test it out.

Jesper
--
Jesper

2008-06-10 06:43:52

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-10 at 08:28 +0200, Jesper Krogh wrote:
> Ian Kent wrote:
> > On Wed, 2008-06-04 at 10:42 +0800, Ian Kent wrote:
> >> On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
> >>> On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
> >>>
> >>>> autofs4_lookup is called on behalf a process trying to walk into an
> >>>> automounted directory. That dentry's d_flags is set to
> >>>> DCACHE_AUTOFS_PENDING but not hashed. A waitqueue entry is created,
> >>>> indexed off of the name of the dentry. A callout is made to the
> >>>> automount daemon (via autofs4_wait).
> >>>>
> >>>> The daemon looks up the directory name in its configuration. If it
> >>>> finds a valid map entry, it will then create the directory using
> >>>> sys_mkdir. The autofs4_lookup call on behalf of the daemon (oz_mode ==
> >>>> 1) will return NULL, and then the mkdir call will be made. The
> >>>> autofs4_mkdir function then instantiates the dentry which, by the way,
> >>>> is different from the original dentry passed to autofs4_lookup. (This
> >>>> dentry also does not get the PENDING flag set, which is a bug addressed
> >>>> by a patch set that Ian and I have been working on; specifically, the
> >>>> idea is to reuse the dentry from the original lookup, but I digress).
> >>>>
> >>>> The daemon then mounts the share on the given directory and issues an
> >>>> ioctl to wakeup the waiter. When awakened, the waiter clears the
> >>>> DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> >>>> dcache and returns that dentry if found.
> >>>> Later, the dentry gets expired via another ioctl. That path sets
> >>>> the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> >>>> It then calls out to the daemon to perform the unmount and rmdir. The
> >>>> rmdir unhashes the dentry (and places it on the rehash list).
> >>>>
> >>>> The dentry is removed from the rehash list if there was a racing expire
> >>>> and mount or if the dentry is released.
> >>>>
> >>>> This description is valid for the tree as it stands today. Ian and I
> >>>> have been working on fixing some other race conditions which will change
> >>>> the dentry life cycle (for the better, I hope).
> >>> So what happens if new lookup hits between umount and rmdir?
> >> It will wait for the expire to complete and then wait for a mount
> >> request to the daemon.
> >
> > Actually, that explanation is a bit simple minded.
> >
> > It should wait for the expire in ->revalidate().
> > Following the expire completion d_invalidate() should return 0, since
> > the dentry is now unhashed, which causes ->revalidate() to return 0.
> > do_lookup() should see this and call a ->lookup().
> >
> > But maybe I've missed something as I'm seeing a problem now.
>
> Ok. Ive been running on the patch for a few days now .. and didn't see
> any problems. But that being said, I also turned off the --ghost option
> to autofs so if it actually is the patch or the different codepaths
> being used, I dont know. Since this is a production system, I'm a bit
> reluctant to just change a working setup to test it out.

No need to change anything.

My comment above relates to difficulties I'm having with patches that
I'm working on that follow this one and the specific question that Al
Viro asked "what happens if new lookup hits between umount and rmdir".

But, clearly we need to know if I (autofs4) caused the specific problem
you reported and if the patch resolves it. And that sounds promising
from what you've seen so far.

Ian

2008-06-10 09:13:23

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> On Tue, 2008-06-10 at 08:28 +0200, Jesper Krogh wrote:
> > Ian Kent wrote:
> > > On Wed, 2008-06-04 at 10:42 +0800, Ian Kent wrote:
> > >> On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
> > >>> On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
> > >>>
> > >>>> autofs4_lookup is called on behalf a process trying to walk into an
> > >>>> automounted directory. That dentry's d_flags is set to
> > >>>> DCACHE_AUTOFS_PENDING but not hashed. A waitqueue entry is created,
> > >>>> indexed off of the name of the dentry. A callout is made to the
> > >>>> automount daemon (via autofs4_wait).
> > >>>>
> > >>>> The daemon looks up the directory name in its configuration. If it
> > >>>> finds a valid map entry, it will then create the directory using
> > >>>> sys_mkdir. The autofs4_lookup call on behalf of the daemon (oz_mode ==
> > >>>> 1) will return NULL, and then the mkdir call will be made. The
> > >>>> autofs4_mkdir function then instantiates the dentry which, by the way,
> > >>>> is different from the original dentry passed to autofs4_lookup. (This
> > >>>> dentry also does not get the PENDING flag set, which is a bug addressed
> > >>>> by a patch set that Ian and I have been working on; specifically, the
> > >>>> idea is to reuse the dentry from the original lookup, but I digress).
> > >>>>
> > >>>> The daemon then mounts the share on the given directory and issues an
> > >>>> ioctl to wakeup the waiter. When awakened, the waiter clears the
> > >>>> DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> > >>>> dcache and returns that dentry if found.
> > >>>> Later, the dentry gets expired via another ioctl. That path sets
> > >>>> the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> > >>>> It then calls out to the daemon to perform the unmount and rmdir. The
> > >>>> rmdir unhashes the dentry (and places it on the rehash list).
> > >>>>
> > >>>> The dentry is removed from the rehash list if there was a racing expire
> > >>>> and mount or if the dentry is released.
> > >>>>
> > >>>> This description is valid for the tree as it stands today. Ian and I
> > >>>> have been working on fixing some other race conditions which will change
> > >>>> the dentry life cycle (for the better, I hope).
> > >>> So what happens if new lookup hits between umount and rmdir?
> > >> It will wait for the expire to complete and then wait for a mount
> > >> request to the daemon.
> > >
> > > Actually, that explanation is a bit simple minded.
> > >
> > > It should wait for the expire in ->revalidate().
> > > Following the expire completion d_invalidate() should return 0, since
> > > the dentry is now unhashed, which causes ->revalidate() to return 0.
> > > do_lookup() should see this and call a ->lookup().
> > >
> > > But maybe I've missed something as I'm seeing a problem now.
> >
> > Ok. Ive been running on the patch for a few days now .. and didn't see
> > any problems. But that being said, I also turned off the --ghost option
> > to autofs so if it actually is the patch or the different codepaths
> > being used, I dont know. Since this is a production system, I'm a bit
> > reluctant to just change a working setup to test it out.
>
> No need to change anything.

Mmmm .. that comment might not be accurate either.

It's beginning to look like my original approach, a post from back in
Feb 2007, to fix a deadlock bug, wasn't right at all. But we don't
really have time to determine that for sure now as it can take several
days for the bug to trigger.

Ian

2008-06-12 03:08:15

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> > >>> So what happens if new lookup hits between umount and rmdir?
> > >> It will wait for the expire to complete and then wait for a mount
> > >> request to the daemon.
> > >
> > > Actually, that explanation is a bit simple minded.
> > >
> > > It should wait for the expire in ->revalidate().
> > > Following the expire completion d_invalidate() should return 0, since
> > > the dentry is now unhashed, which causes ->revalidate() to return 0.
> > > do_lookup() should see this and call a ->lookup().
> > >
> > > But maybe I've missed something as I'm seeing a problem now.
> >
> > Ok. Ive been running on the patch for a few days now .. and didn't see
> > any problems. But that being said, I also turned off the --ghost option
> > to autofs so if it actually is the patch or the different codepaths
> > being used, I dont know. Since this is a production system, I'm a bit
> > reluctant to just change a working setup to test it out.
>
> No need to change anything.

There is a problem with the patch I posted.
It will allow an incorrect ENOENT return in some cases.

The patch below is sufficiently different to the original patch I posted
to warrant a replacement rather than a correction.

If you can find a way to test this out that would be great.

autofs4 - don't make expiring dentry negative

From: Ian Kent <[email protected]>

Correct the error of making a positive dentry negative after it has been
instantiated.

This involves removing the code in autofs4_lookup_unhashed() that makes
the dentry negative and updating autofs4_lookup() to check for an
unfinished expire and wait if needed. The dentry used for the lookup
must be negative for mounts to trigger in the required cases so the
dentry can't be re-used (which is probably for the better anyway).

Signed-off-by: Ian Kent <[email protected]>
---

fs/autofs4/autofs_i.h | 6 +--
fs/autofs4/inode.c | 6 +--
fs/autofs4/root.c | 115 ++++++++++++++++++-------------------------------
3 files changed, 49 insertions(+), 78 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index c3d352d..69b1497 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -52,7 +52,7 @@ struct autofs_info {

int flags;

- struct list_head rehash;
+ struct list_head expiring;

struct autofs_sb_info *sbi;
unsigned long last_used;
@@ -112,8 +112,8 @@ struct autofs_sb_info {
struct mutex wq_mutex;
spinlock_t fs_lock;
struct autofs_wait_queue *queues; /* Wait queue pointer */
- spinlock_t rehash_lock;
- struct list_head rehash_list;
+ spinlock_t lookup_lock;
+ struct list_head expiring_list;
};

static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 2fdcf5e..94bfc15 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -47,7 +47,7 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
ino->dentry = NULL;
ino->size = 0;

- INIT_LIST_HEAD(&ino->rehash);
+ INIT_LIST_HEAD(&ino->expiring);

ino->last_used = jiffies;
atomic_set(&ino->count, 0);
@@ -338,8 +338,8 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
mutex_init(&sbi->wq_mutex);
spin_lock_init(&sbi->fs_lock);
sbi->queues = NULL;
- spin_lock_init(&sbi->rehash_lock);
- INIT_LIST_HEAD(&sbi->rehash_list);
+ spin_lock_init(&sbi->lookup_lock);
+ INIT_LIST_HEAD(&sbi->expiring_list);
s->s_blocksize = 1024;
s->s_blocksize_bits = 10;
s->s_magic = AUTOFS_SUPER_MAGIC;
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index edf5b6b..2e8959c 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -493,10 +493,10 @@ void autofs4_dentry_release(struct dentry *de)
struct autofs_sb_info *sbi = autofs4_sbi(de->d_sb);

if (sbi) {
- spin_lock(&sbi->rehash_lock);
- if (!list_empty(&inf->rehash))
- list_del(&inf->rehash);
- spin_unlock(&sbi->rehash_lock);
+ spin_lock(&sbi->lookup_lock);
+ if (!list_empty(&inf->expiring))
+ list_del(&inf->expiring);
+ spin_unlock(&sbi->lookup_lock);
}

inf->dentry = NULL;
@@ -518,7 +518,7 @@ static struct dentry_operations autofs4_dentry_operations = {
.d_release = autofs4_dentry_release,
};

-static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct dentry *parent, struct qstr *name)
+static struct dentry *autofs4_lookup_expiring(struct autofs_sb_info *sbi, struct dentry *parent, struct qstr *name)
{
unsigned int len = name->len;
unsigned int hash = name->hash;
@@ -526,14 +526,14 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
struct list_head *p, *head;

spin_lock(&dcache_lock);
- spin_lock(&sbi->rehash_lock);
- head = &sbi->rehash_list;
+ spin_lock(&sbi->lookup_lock);
+ head = &sbi->expiring_list;
list_for_each(p, head) {
struct autofs_info *ino;
struct dentry *dentry;
struct qstr *qstr;

- ino = list_entry(p, struct autofs_info, rehash);
+ ino = list_entry(p, struct autofs_info, expiring);
dentry = ino->dentry;

spin_lock(&dentry->d_lock);
@@ -555,33 +555,17 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
goto next;

if (d_unhashed(dentry)) {
- struct inode *inode = dentry->d_inode;
-
- ino = autofs4_dentry_ino(dentry);
- list_del_init(&ino->rehash);
+ list_del_init(&ino->expiring);
dget(dentry);
- /*
- * Make the rehashed dentry negative so the VFS
- * behaves as it should.
- */
- if (inode) {
- dentry->d_inode = NULL;
- list_del_init(&dentry->d_alias);
- spin_unlock(&dentry->d_lock);
- spin_unlock(&sbi->rehash_lock);
- spin_unlock(&dcache_lock);
- iput(inode);
- return dentry;
- }
spin_unlock(&dentry->d_lock);
- spin_unlock(&sbi->rehash_lock);
+ spin_unlock(&sbi->lookup_lock);
spin_unlock(&dcache_lock);
return dentry;
}
next:
spin_unlock(&dentry->d_lock);
}
- spin_unlock(&sbi->rehash_lock);
+ spin_unlock(&sbi->lookup_lock);
spin_unlock(&dcache_lock);

return NULL;
@@ -591,7 +575,7 @@ next:
static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
{
struct autofs_sb_info *sbi;
- struct dentry *unhashed;
+ struct dentry *expiring;
int oz_mode;

DPRINTK("name = %.*s",
@@ -607,44 +591,40 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
DPRINTK("pid = %u, pgrp = %u, catatonic = %d, oz_mode = %d",
current->pid, task_pgrp_nr(current), sbi->catatonic, oz_mode);

- unhashed = autofs4_lookup_unhashed(sbi, dentry->d_parent, &dentry->d_name);
- if (!unhashed) {
- /*
- * Mark the dentry incomplete but don't hash it. We do this
- * to serialize our inode creation operations (symlink and
- * mkdir) which prevents deadlock during the callback to
- * the daemon. Subsequent user space lookups for the same
- * dentry are placed on the wait queue while the daemon
- * itself is allowed passage unresticted so the create
- * operation itself can then hash the dentry. Finally,
- * we check for the hashed dentry and return the newly
- * hashed dentry.
- */
- dentry->d_op = &autofs4_root_dentry_operations;
-
- dentry->d_fsdata = NULL;
- d_instantiate(dentry, NULL);
- } else {
- struct autofs_info *ino = autofs4_dentry_ino(unhashed);
- DPRINTK("rehash %p with %p", dentry, unhashed);
+ expiring = autofs4_lookup_expiring(sbi, dentry->d_parent, &dentry->d_name);
+ if (expiring) {
+ struct autofs_info *ino = autofs4_dentry_ino(expiring);
/*
* If we are racing with expire the request might not
* be quite complete but the directory has been removed
* so it must have been successful, so just wait for it.
- * We need to ensure the AUTOFS_INF_EXPIRING flag is clear
- * before continuing as revalidate may fail when calling
- * try_to_fill_dentry (returning EAGAIN) if we don't.
*/
while (ino && (ino->flags & AUTOFS_INF_EXPIRING)) {
DPRINTK("wait for incomplete expire %p name=%.*s",
- unhashed, unhashed->d_name.len,
- unhashed->d_name.name);
- autofs4_wait(sbi, unhashed, NFY_NONE);
+ expiring, expiring->d_name.len,
+ expiring->d_name.name);
+ autofs4_wait(sbi, expiring, NFY_NONE);
DPRINTK("request completed");
}
- dentry = unhashed;
+ dput(expiring);
}

+ /*
+ * Mark the dentry incomplete but don't hash it. We do this
+ * to serialize our inode creation operations (symlink and
+ * mkdir) which prevents deadlock during the callback to
+ * the daemon. Subsequent user space lookups for the same
+ * dentry are placed on the wait queue while the daemon
+ * itself is allowed passage unresticted so the create
+ * operation itself can then hash the dentry. Finally,
+ * we check for the hashed dentry and return the newly
+ * hashed dentry.
+ */
+ dentry->d_op = &autofs4_root_dentry_operations;
+
+ dentry->d_fsdata = NULL;
+ d_instantiate(dentry, NULL);
+
if (!oz_mode) {
spin_lock(&dentry->d_lock);
dentry->d_flags |= DCACHE_AUTOFS_PENDING;
@@ -668,8 +648,6 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
if (sigismember (sigset, SIGKILL) ||
sigismember (sigset, SIGQUIT) ||
sigismember (sigset, SIGINT)) {
- if (unhashed)
- dput(unhashed);
return ERR_PTR(-ERESTARTNOINTR);
}
}
@@ -699,15 +677,9 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
else
dentry = ERR_PTR(-ENOENT);

- if (unhashed)
- dput(unhashed);
-
return dentry;
}

- if (unhashed)
- return dentry;
-
return NULL;
}

@@ -769,9 +741,8 @@ static int autofs4_dir_symlink(struct inode *dir,
* that the file no longer exists. However, doing that means that the
* VFS layer can turn the dentry into a negative dentry. We don't want
* this, because the unlink is probably the result of an expire.
- * We simply d_drop it and add it to a rehash candidates list in the
- * super block, which allows the dentry lookup to reuse it retaining
- * the flags, such as expire in progress, in case we're racing with expire.
+ * We simply d_drop it and add it to a expiring list in the super block,
+ * which allows the dentry lookup to check for an incomplete expire.
*
* If a process is blocked on the dentry waiting for the expire to finish,
* it will invalidate the dentry and try to mount with a new one.
@@ -801,9 +772,9 @@ static int autofs4_dir_unlink(struct inode *dir, struct dentry *dentry)
dir->i_mtime = CURRENT_TIME;

spin_lock(&dcache_lock);
- spin_lock(&sbi->rehash_lock);
- list_add(&ino->rehash, &sbi->rehash_list);
- spin_unlock(&sbi->rehash_lock);
+ spin_lock(&sbi->lookup_lock);
+ list_add(&ino->expiring, &sbi->expiring_list);
+ spin_unlock(&sbi->lookup_lock);
spin_lock(&dentry->d_lock);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
@@ -829,9 +800,9 @@ static int autofs4_dir_rmdir(struct inode *dir, struct dentry *dentry)
spin_unlock(&dcache_lock);
return -ENOTEMPTY;
}
- spin_lock(&sbi->rehash_lock);
- list_add(&ino->rehash, &sbi->rehash_list);
- spin_unlock(&sbi->rehash_lock);
+ spin_lock(&sbi->lookup_lock);
+ list_add(&ino->expiring, &sbi->expiring_list);
+ spin_unlock(&sbi->lookup_lock);
spin_lock(&dentry->d_lock);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);

2008-06-12 07:02:52

by Jesper Krogh

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

> On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> The patch below is sufficiently different to the original patch I posted
> to warrant a replacement rather than a correction.

I'll patch that in now and see what it gives. (Is it preferred
to use this --ghost option or not?) What would you suggest?

Jesper
--
Jesper Krogh

2008-06-12 11:23:22

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Thu, 2008-06-12 at 11:03 +0800, Ian Kent wrote:
> On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> > > >>> So what happens if new lookup hits between umount and rmdir?
> > > >> It will wait for the expire to complete and then wait for a mount
> > > >> request to the daemon.
> > > >
> > > > Actually, that explanation is a bit simple minded.
> > > >
> > > > It should wait for the expire in ->revalidate().
> > > > Following the expire completion d_invalidate() should return 0, since
> > > > the dentry is now unhashed, which causes ->revalidate() to return 0.
> > > > do_lookup() should see this and call a ->lookup().
> > > >
> > > > But maybe I've missed something as I'm seeing a problem now.
> > >
> > > Ok. Ive been running on the patch for a few days now .. and didn't see
> > > any problems. But that being said, I also turned off the --ghost option
> > > to autofs so if it actually is the patch or the different codepaths
> > > being used, I dont know. Since this is a production system, I'm a bit
> > > reluctant to just change a working setup to test it out.
> >
> > No need to change anything.
>
> There is a problem with the patch I posted.
> It will allow an incorrect ENOENT return in some cases.
>
> The patch below is sufficiently different to the original patch I posted
> to warrant a replacement rather than a correction.
>
> If you can find a way to test this out that would be great.

Oops, I must have not set the Preformat option in Evolution, let me try
again.

autofs4 - don't make expiring dentry negative

From: Ian Kent <[email protected]>

Correct the error of making a positive dentry negative after it has been
instantiated.

This involves removing the code in autofs4_lookup_unhashed() that makes
the dentry negative and updating autofs4_lookup() to check for an
unfinished expire and wait if needed. The dentry used for the lookup
must be negative for mounts to trigger in the required cases so the
dentry can't be re-used (which is probably for the better anyway).

Signed-off-by: Ian Kent <[email protected]>
---

fs/autofs4/autofs_i.h | 6 +--
fs/autofs4/inode.c | 6 +--
fs/autofs4/root.c | 115 ++++++++++++++++++-------------------------------
3 files changed, 49 insertions(+), 78 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index c3d352d..69b1497 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -52,7 +52,7 @@ struct autofs_info {

int flags;

- struct list_head rehash;
+ struct list_head expiring;

struct autofs_sb_info *sbi;
unsigned long last_used;
@@ -112,8 +112,8 @@ struct autofs_sb_info {
struct mutex wq_mutex;
spinlock_t fs_lock;
struct autofs_wait_queue *queues; /* Wait queue pointer */
- spinlock_t rehash_lock;
- struct list_head rehash_list;
+ spinlock_t lookup_lock;
+ struct list_head expiring_list;
};

static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 2fdcf5e..94bfc15 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -47,7 +47,7 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
ino->dentry = NULL;
ino->size = 0;

- INIT_LIST_HEAD(&ino->rehash);
+ INIT_LIST_HEAD(&ino->expiring);

ino->last_used = jiffies;
atomic_set(&ino->count, 0);
@@ -338,8 +338,8 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
mutex_init(&sbi->wq_mutex);
spin_lock_init(&sbi->fs_lock);
sbi->queues = NULL;
- spin_lock_init(&sbi->rehash_lock);
- INIT_LIST_HEAD(&sbi->rehash_list);
+ spin_lock_init(&sbi->lookup_lock);
+ INIT_LIST_HEAD(&sbi->expiring_list);
s->s_blocksize = 1024;
s->s_blocksize_bits = 10;
s->s_magic = AUTOFS_SUPER_MAGIC;
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index edf5b6b..2e8959c 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -493,10 +493,10 @@ void autofs4_dentry_release(struct dentry *de)
struct autofs_sb_info *sbi = autofs4_sbi(de->d_sb);

if (sbi) {
- spin_lock(&sbi->rehash_lock);
- if (!list_empty(&inf->rehash))
- list_del(&inf->rehash);
- spin_unlock(&sbi->rehash_lock);
+ spin_lock(&sbi->lookup_lock);
+ if (!list_empty(&inf->expiring))
+ list_del(&inf->expiring);
+ spin_unlock(&sbi->lookup_lock);
}

inf->dentry = NULL;
@@ -518,7 +518,7 @@ static struct dentry_operations autofs4_dentry_operations = {
.d_release = autofs4_dentry_release,
};

-static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct dentry *parent, struct qstr *name)
+static struct dentry *autofs4_lookup_expiring(struct autofs_sb_info *sbi, struct dentry *parent, struct qstr *name)
{
unsigned int len = name->len;
unsigned int hash = name->hash;
@@ -526,14 +526,14 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
struct list_head *p, *head;

spin_lock(&dcache_lock);
- spin_lock(&sbi->rehash_lock);
- head = &sbi->rehash_list;
+ spin_lock(&sbi->lookup_lock);
+ head = &sbi->expiring_list;
list_for_each(p, head) {
struct autofs_info *ino;
struct dentry *dentry;
struct qstr *qstr;

- ino = list_entry(p, struct autofs_info, rehash);
+ ino = list_entry(p, struct autofs_info, expiring);
dentry = ino->dentry;

spin_lock(&dentry->d_lock);
@@ -555,33 +555,17 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
goto next;

if (d_unhashed(dentry)) {
- struct inode *inode = dentry->d_inode;
-
- ino = autofs4_dentry_ino(dentry);
- list_del_init(&ino->rehash);
+ list_del_init(&ino->expiring);
dget(dentry);
- /*
- * Make the rehashed dentry negative so the VFS
- * behaves as it should.
- */
- if (inode) {
- dentry->d_inode = NULL;
- list_del_init(&dentry->d_alias);
- spin_unlock(&dentry->d_lock);
- spin_unlock(&sbi->rehash_lock);
- spin_unlock(&dcache_lock);
- iput(inode);
- return dentry;
- }
spin_unlock(&dentry->d_lock);
- spin_unlock(&sbi->rehash_lock);
+ spin_unlock(&sbi->lookup_lock);
spin_unlock(&dcache_lock);
return dentry;
}
next:
spin_unlock(&dentry->d_lock);
}
- spin_unlock(&sbi->rehash_lock);
+ spin_unlock(&sbi->lookup_lock);
spin_unlock(&dcache_lock);

return NULL;
@@ -591,7 +575,7 @@ next:
static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
{
struct autofs_sb_info *sbi;
- struct dentry *unhashed;
+ struct dentry *expiring;
int oz_mode;

DPRINTK("name = %.*s",
@@ -607,44 +591,40 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
DPRINTK("pid = %u, pgrp = %u, catatonic = %d, oz_mode = %d",
current->pid, task_pgrp_nr(current), sbi->catatonic, oz_mode);

- unhashed = autofs4_lookup_unhashed(sbi, dentry->d_parent, &dentry->d_name);
- if (!unhashed) {
- /*
- * Mark the dentry incomplete but don't hash it. We do this
- * to serialize our inode creation operations (symlink and
- * mkdir) which prevents deadlock during the callback to
- * the daemon. Subsequent user space lookups for the same
- * dentry are placed on the wait queue while the daemon
- * itself is allowed passage unresticted so the create
- * operation itself can then hash the dentry. Finally,
- * we check for the hashed dentry and return the newly
- * hashed dentry.
- */
- dentry->d_op = &autofs4_root_dentry_operations;
-
- dentry->d_fsdata = NULL;
- d_instantiate(dentry, NULL);
- } else {
- struct autofs_info *ino = autofs4_dentry_ino(unhashed);
- DPRINTK("rehash %p with %p", dentry, unhashed);
+ expiring = autofs4_lookup_expiring(sbi, dentry->d_parent, &dentry->d_name);
+ if (expiring) {
+ struct autofs_info *ino = autofs4_dentry_ino(expiring);
/*
* If we are racing with expire the request might not
* be quite complete but the directory has been removed
* so it must have been successful, so just wait for it.
- * We need to ensure the AUTOFS_INF_EXPIRING flag is clear
- * before continuing as revalidate may fail when calling
- * try_to_fill_dentry (returning EAGAIN) if we don't.
*/
while (ino && (ino->flags & AUTOFS_INF_EXPIRING)) {
DPRINTK("wait for incomplete expire %p name=%.*s",
- unhashed, unhashed->d_name.len,
- unhashed->d_name.name);
- autofs4_wait(sbi, unhashed, NFY_NONE);
+ expiring, expiring->d_name.len,
+ expiring->d_name.name);
+ autofs4_wait(sbi, expiring, NFY_NONE);
DPRINTK("request completed");
}
- dentry = unhashed;
+ dput(expiring);
}

+ /*
+ * Mark the dentry incomplete but don't hash it. We do this
+ * to serialize our inode creation operations (symlink and
+ * mkdir) which prevents deadlock during the callback to
+ * the daemon. Subsequent user space lookups for the same
+ * dentry are placed on the wait queue while the daemon
+ * itself is allowed passage unresticted so the create
+ * operation itself can then hash the dentry. Finally,
+ * we check for the hashed dentry and return the newly
+ * hashed dentry.
+ */
+ dentry->d_op = &autofs4_root_dentry_operations;
+
+ dentry->d_fsdata = NULL;
+ d_instantiate(dentry, NULL);
+
if (!oz_mode) {
spin_lock(&dentry->d_lock);
dentry->d_flags |= DCACHE_AUTOFS_PENDING;
@@ -668,8 +648,6 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
if (sigismember (sigset, SIGKILL) ||
sigismember (sigset, SIGQUIT) ||
sigismember (sigset, SIGINT)) {
- if (unhashed)
- dput(unhashed);
return ERR_PTR(-ERESTARTNOINTR);
}
}
@@ -699,15 +677,9 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
else
dentry = ERR_PTR(-ENOENT);

- if (unhashed)
- dput(unhashed);
-
return dentry;
}

- if (unhashed)
- return dentry;
-
return NULL;
}

@@ -769,9 +741,8 @@ static int autofs4_dir_symlink(struct inode *dir,
* that the file no longer exists. However, doing that means that the
* VFS layer can turn the dentry into a negative dentry. We don't want
* this, because the unlink is probably the result of an expire.
- * We simply d_drop it and add it to a rehash candidates list in the
- * super block, which allows the dentry lookup to reuse it retaining
- * the flags, such as expire in progress, in case we're racing with expire.
+ * We simply d_drop it and add it to a expiring list in the super block,
+ * which allows the dentry lookup to check for an incomplete expire.
*
* If a process is blocked on the dentry waiting for the expire to finish,
* it will invalidate the dentry and try to mount with a new one.
@@ -801,9 +772,9 @@ static int autofs4_dir_unlink(struct inode *dir, struct dentry *dentry)
dir->i_mtime = CURRENT_TIME;

spin_lock(&dcache_lock);
- spin_lock(&sbi->rehash_lock);
- list_add(&ino->rehash, &sbi->rehash_list);
- spin_unlock(&sbi->rehash_lock);
+ spin_lock(&sbi->lookup_lock);
+ list_add(&ino->expiring, &sbi->expiring_list);
+ spin_unlock(&sbi->lookup_lock);
spin_lock(&dentry->d_lock);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
@@ -829,9 +800,9 @@ static int autofs4_dir_rmdir(struct inode *dir, struct dentry *dentry)
spin_unlock(&dcache_lock);
return -ENOTEMPTY;
}
- spin_lock(&sbi->rehash_lock);
- list_add(&ino->rehash, &sbi->rehash_list);
- spin_unlock(&sbi->rehash_lock);
+ spin_lock(&sbi->lookup_lock);
+ list_add(&ino->expiring, &sbi->expiring_list);
+ spin_unlock(&sbi->lookup_lock);
spin_lock(&dentry->d_lock);
__d_drop(dentry);
spin_unlock(&dentry->d_lock);

2008-06-12 11:25:43

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Thu, 2008-06-12 at 09:02 +0200, Jesper Krogh wrote:
> > On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> > The patch below is sufficiently different to the original patch I posted
> > to warrant a replacement rather than a correction.
>
> I'll patch that in now and see what it gives. (Is it preferred
> to use this --ghost option or not?) What would you suggest?

It is a matter of choice mostly.
If you have a large number of entries in your map then don't use it. It
will result in a performance hit especially during expires.

Ian

2008-06-27 04:21:22

[permalink] [raw]

Subject: Re: Linux 2.6.26-rc4

On Thu, 2008-06-05 at 15:30 -0700, Andrew Morton wrote:
> >
> > + ino->u.symlink = cp;
> > + ino->size = strlen(symname);
> > dir->i_mtime = CURRENT_TIME;
>
> This all seems a bit ungainly. I assume that on entry to
> autofs4_dir_symlink(), ino->size is equal to strlen(symname)? If it's
> not, that strcpy() will overrun.
>
> But if ino->size _is_ equal to strlen(symname) then why did we just
> recalculate the same thing?

Oops.
I've fixed that in my git tree just now.

>
> I'm suspecting we can zap a lump of code and just do
>
> cp = kstrdup(symname, GFP_KERNEL);
>
> Anyway, please check that.

Yep, but fix now re-factor later.

Ian