2009-11-23 17:41:40

by Jeff Layton

Subject: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

There are a few situations where a lookup can end up returning a dentry
without revalidating it, and without checking whether the calling
process has permissions to access it. Two situations identified so far
are:

1) LAST_BIND symlinks (such as those under /proc/<pid>)

2) file bind mounts

This patchset is intended to fix this by forcing revalidation of the
returned dentries at appropriate locations.

In the case of LAST_BIND symlinks it also adds a check to verify that
the target of the symlink is accessible by the current process by
walking mounts and dentries back up to the root and checking permission
on each inode.

This set fixes the reproducers I have (including the reproducer that
Pavel provided for the permissions bypass). It's still pretty rough
though and I expect that it'll need revision. At this point, I'm mainly
looking to get these questions answered:

1) what should we do if these dentries are found to be invalid? Is it ok
to d_invalidate them? Or is that likely to break something (particularly
in the case of file bind mounts)?

2) I'm using FS_REVAL_DOT as an indicator of whether to force a
d_revalidate. I think that it's appropriate to key off of that flag, but
we may want to rename it (maybe FS_FORCE_D_REVAL ?).

3) is check_path_accessible racy? It seems to work, but something
doesn't seem quite right with this approach. Is this defeatable somehow?
Could a rename of one of the intermediate path components cause
problems?

4) do we need to hold the dcache_lock across the check_path_accessible
call?

This isn't my usual area of expertise, so I'm definitely open to
suggestions on this.

Jeff Layton (3):
  vfs: force reval of target when following LAST_BIND symlinks
  vfs: force reval on dentry of bind mounted files on FS_REVAL_DOT
    filesystems
  vfs: check path permissions on target of LAST_BIND symlinks

fs/namei.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 102 insertions(+), 2 deletions(-)


2009-11-23 17:41:36

by Jeff Layton

Subject: [PATCH 1/3] vfs: force reval of target when following LAST_BIND symlinks

procfs-style symlinks return a last_type of LAST_BIND without an actual
path string. This causes __follow_link to skip calling __vfs_follow_link
and so the dentry isn't revalidated.

This is a problem when the link target sits on NFSv4 as it depends on
the VFS to revalidate the dentry before using it on an open call. Ensure
that this occurs by forcing a revalidation of the target dentry of
LAST_BIND symlinks.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/namei.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 3374917..2c10519 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -414,6 +414,44 @@ do_revalidate(struct dentry *dentry, struct nameidata *nd)
}

/*
+ * force_reval_path - force revalidation of a dentry
+ *
+ * In some situations the path walking code will trust dentries without
+ * revalidating them. This causes problems for filesystems that depend on
+ * d_revalidate to handle file opens (e.g. NFSv4). When FS_REVAL_DOT is set
+ * (which indicates that it's possible for the dentry to go stale), force
+ * a d_revalidate call before proceeding.
+ *
+ * Returns 0 if the revalidation was successful. If the revalidation fails,
+ * either return the error returned by d_revalidate or -ESTALE if the
+ * revalidation just returned 0. On error, references to the path are put.
+ * Note that on error, the dentry is not d_invalidated. If the VFS is assuming
+ * that the dentry is valid, then invalidating it could have adverse effects
+ * elsewhere.
+ */
+static int
+force_reval_path(struct path *path, struct nameidata *nd)
+{
+	int status;
+	struct dentry *dentry = path->dentry;
+
+	/*
+	 * only check on filesystems where it's possible for the dentry to
+	 * become stale. It's assumed that if this flag is set then the
+	 * d_revalidate op will also be defined.
+	 */
+	if (!(dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT))
+		return 0;
+
+	status = dentry->d_op->d_revalidate(dentry, nd);
+	if (status > 0)
+		return 0;
+
+	path_put(path);
+	return status ? status : -ESTALE;
+}
+
+/*
* Internal lookup() using the new generic dcache.
* SMP-safe
*/
@@ -641,6 +679,8 @@ static __always_inline int __do_follow_link(struct path *path, struct nameidata
error = 0;
if (s)
error = __vfs_follow_link(nd, s);
+ else if (nd->last_type == LAST_BIND)
+ error = force_reval_path(&nd->path, nd);
if (dentry->d_inode->i_op->put_link)
dentry->d_inode->i_op->put_link(dentry, nd, cookie);
}
--
1.5.5.6
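
For readers following along outside the kernel tree, the return convention of force_reval_path() in the patch above can be modelled in plain userspace C. This is only a sketch: MODEL_FS_REVAL_DOT, model_reval_fn and the model_* functions are hypothetical stand-ins invented for illustration, not kernel symbols, and the flag value is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the kernel's FS_REVAL_DOT; the value is arbitrary here. */
#define MODEL_FS_REVAL_DOT 0x1

/* Hypothetical stand-in for a filesystem's d_revalidate operation. */
typedef int (*model_reval_fn)(void);

/* Three sample outcomes a revalidation could report. */
static int model_reval_valid(void) { return 1; }	/* still valid */
static int model_reval_stale(void) { return 0; }	/* no longer valid */
static int model_reval_error(void) { return -EIO; }	/* hard error */

/*
 * Mirrors the control flow of force_reval_path(): skip filesystems that
 * do not set the flag, treat a positive result as success, pass negative
 * errors through, and turn a plain 0 into -ESTALE.
 */
static int model_force_reval(int fs_flags, model_reval_fn reval)
{
	int status;

	if (!(fs_flags & MODEL_FS_REVAL_DOT))
		return 0;

	/* The patch assumes the op exists whenever the flag is set. */
	assert(reval != NULL);

	status = reval();
	if (status > 0)
		return 0;

	return status ? status : -ESTALE;
}
```

The model leaves out the path_put() on failure, since there is no reference counting to imitate in userspace.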

2009-11-23 17:41:32

by Jeff Layton

Subject: [PATCH 2/3] vfs: force reval on dentry of bind mounted files on FS_REVAL_DOT filesystems

In the case of a bind mounted file, the path walking code will assume
that the cached dentry that was bind mounted is valid. This is a
problem for NFSv4 in a way that's similar to LAST_BIND symlinks.

Fix this by revalidating the dentry if LOOKUP_OPEN is set and
__follow_mount returns true.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/namei.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 2c10519..738b257 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -838,6 +838,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
{
struct vfsmount *mnt = nd->path.mnt;
struct dentry *dentry = __d_lookup(nd->path.dentry, name);
+ int error = 0;

if (!dentry)
goto need_lookup;
@@ -846,8 +847,9 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
done:
path->mnt = mnt;
path->dentry = dentry;
- __follow_mount(path);
- return 0;
+ if (__follow_mount(path))
+ error = force_reval_path(path, nd);
+ return error;

need_lookup:
dentry = real_lookup(nd->path.dentry, name, nd);
@@ -1835,6 +1837,9 @@ do_last:
error = -ELOOP;
if (flag & O_NOFOLLOW)
goto exit_dput;
+ error = force_reval_path(&path, &nd);
+ if (error)
+ goto exit;
}

error = -ENOENT;
--
1.5.5.6

2009-11-23 17:41:50

by Jeff Layton

Subject: [PATCH 3/3] vfs: check path permissions on target of LAST_BIND symlinks

Because LAST_BIND symlinks aren't subject to the normal path walking
routines, they sidestep all of the permission checking that occurs while
resolving a path. Fix this by adding a routine to walk back up the
directory tree and check MAY_EXEC permission on the entire path back up to
the root.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/namei.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 56 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 738b257..d4c1279 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -504,6 +504,57 @@ ok:
}

/*
+ * We have a cached struct path from a LAST_BIND symlink. The path is valid
+ * but it's possible that one of the components leading to this point
+ * might not be accessible. This check walks back up the tree to the root
+ * of the namespace and checks whether each component is accessible. The
+ * supplied path is put on error.
+ */
+static int
+check_path_accessible(struct path *path)
+{
+	int err;
+	struct dentry *parent, *tdentry = dget(path->dentry);
+	struct vfsmount *vfsmnt = mntget(path->mnt);
+	struct path root = current->fs->root;
+
+	path_get(&root);
+	for(;;) {
+		parent = dget_parent(tdentry);
+		err = inode_permission(parent->d_inode, MAY_EXEC);
+		if (err < 0) {
+			dput(parent);
+			break;
+		}
+		dput(tdentry);
+		tdentry = parent;
+
+		/* keep going if not to root of mnt */
+		if (!IS_ROOT(tdentry))
+			continue;
+
+		/* are we at global root or root of namespace? */
+		if ((tdentry == root.dentry && vfsmnt == root.mnt) ||
+		    vfsmnt->mnt_parent == vfsmnt)
+			break;
+
+		/* cross to parent mount and keep walking */
+		mntput(vfsmnt);
+		vfsmnt = mntget(vfsmnt->mnt_parent);
+		tdentry = dget(vfsmnt->mnt_mountpoint);
+		dput(parent);
+	}
+	dput(tdentry);
+	mntput(vfsmnt);
+	path_put(&root);
+
+	if (err)
+		path_put(path);
+
+	return err;
+}
+
+/*
* This is called when everything else fails, and we actually have
* to go to the low-level filesystem to find out what we should do..
*
@@ -679,8 +730,12 @@ static __always_inline int __do_follow_link(struct path *path, struct nameidata
error = 0;
if (s)
error = __vfs_follow_link(nd, s);
- else if (nd->last_type == LAST_BIND)
+ else if (nd->last_type == LAST_BIND) {
error = force_reval_path(&nd->path, nd);
+ if (!error)
+ error = check_path_accessible(&nd->path);
+ }
+
if (dentry->d_inode->i_op->put_link)
dentry->d_inode->i_op->put_link(dentry, nd, cookie);
}
--
1.5.5.6
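
The upward walk in check_path_accessible() can be pictured with a much smaller userspace model. The sketch below assumes a single mount and replaces inode_permission() with a per-directory flag; struct model_dentry and model_check_path_accessible() are hypothetical names used only for illustration, and no reference counting or mount crossing is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy dentry: the root is the node whose parent points back to itself. */
struct model_dentry {
	struct model_dentry *parent;
	int searchable;	/* nonzero if the caller would pass a MAY_EXEC check */
};

/*
 * Walk from the target up to the root and fail with -EACCES as soon as
 * one ancestor is not searchable. As in the patch, only the parents are
 * checked, never the target itself.
 */
static int model_check_path_accessible(const struct model_dentry *d)
{
	assert(d != NULL);

	do {
		d = d->parent;
		if (!d->searchable)
			return -EACCES;
	} while (d->parent != d);

	return 0;
}
```

With a tree of root, dir and file, clearing the searchable flag on dir makes the check on file fail even though file itself is untouched.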

2009-11-23 22:05:31

by Eric W. Biederman

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

Jeff Layton <[email protected]> writes:

> There are a few situations where a lookup can end up returning a dentry
> without revalidating it, and without checking whether the calling
> process has permissions to access it. Two situations identified so far
> are:
>
> 1) LAST_BIND symlinks (such as those under /proc/<pid>)
>
> 2) file bind mounts
>
> This patchset is intended to fix this by forcing revalidation of the
> returned dentries at appropriate locations.
>
> In the case of LAST_BIND symlinks it also adds a check to verify that
> the target of the symlink is accessible by the current process by
> walking mounts and dentries back up to the root and checking permission
> on each inode.
>
> This set fixes the reproducers I have (including the reproducer that
> Pavel provided for the permissions bypass). It's still pretty rough
> though and I expect that it'll need revision. At this point, I'm mainly
> looking to get these questions answered:
>
> 1) what should we do if these dentries are found to be invalid? Is it ok
> to d_invalidate them? Or is that likely to break something (particularly
> in the case of file bind mounts)?

The normal sequence in do_revalidate should be safe. In practice what we
should see is d_drop(). If we access the dentries via another path today
we already go through d_revalidate. It is only the reference count on
the dentry that keeps them alive and working. The cases I have looked
at for distributed filesystems have to call d_drop themselves so I don't
know if it would add anything if the vfs called d_revalidate. Especially
since FS_REVAL_DOT doesn't have that logic.

> 2) I'm using FS_REVAL_DOT as an indicator of whether to force a
> d_revalidate. I think that it's appropriate to key off of that flag, but
> we may want to rename it (maybe FS_FORCE_D_REVAL ?).

Perhaps FS_ALWAYS_REVAL. I don't think it makes much of
a difference either way. I expect a rename should come after we fix
nfsv4 so there is a chance at pushing the fixes back to stable.

> 3) is check_path_accessible racy? It seems to work, but something
> doesn't seem quite right with this approach. Is this defeatable somehow?
> Could a rename of one of the intermediate path components cause
> problems?

check_path_accessible seems pretty horrible. If a process is running
inside of a subdirectory it doesn't have permissions to access, say
a chroot, /proc/self/fd/XXX becomes completely unusable.

Eric

2009-11-23 22:36:29

by Jeff Layton

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Mon, 23 Nov 2009 14:05:24 -0800
[email protected] (Eric W. Biederman) wrote:

> Jeff Layton <[email protected]> writes:
>
> > There are a few situations where a lookup can end up returning a dentry
> > without revalidating it, and without checking whether the calling
> > process has permissions to access it. Two situations identified so far
> > are:
> >
> > 1) LAST_BIND symlinks (such as those under /proc/<pid>)
> >
> > 2) file bind mounts
> >
> > This patchset is intended to fix this by forcing revalidation of the
> > returned dentries at appropriate locations.
> >
> > In the case of LAST_BIND symlinks it also adds a check to verify that
> > the target of the symlink is accessible by the current process by
> > walking mounts and dentries back up to the root and checking permission
> > on each inode.
> >
> > This set fixes the reproducers I have (including the reproducer that
> > Pavel provided for the permissions bypass). It's still pretty rough
> > though and I expect that it'll need revision. At this point, I'm mainly
> > looking to get these questions answered:
> >
> > 1) what should we do if these dentries are found to be invalid? Is it ok
> > to d_invalidate them? Or is that likely to break something (particularly
> > in the case of file bind mounts)?
>
> The normal sequence in do_revalidate should be safe. In practice what we
> should see is d_drop(). If we access the dentries via another path today
> we already go through d_revalidate. It is only the reference count on
> the dentry that keeps them alive and working. The cases I have looked
> at for distributed filesystems have to call d_drop themselves so I don't
> know if it would add anything if the vfs called d_revalidate. Especially
> since FS_REVAL_DOT doesn't have that logic.
>
> > 2) I'm using FS_REVAL_DOT as an indicator of whether to force a
> > d_revalidate. I think that it's appropriate to key off of that flag, but
> > we may want to rename it (maybe FS_FORCE_D_REVAL ?).
>
> Perhaps FS_ALWAYS_REVAL. I don't think it makes much of
> a difference either way. I expect a rename should come after we fix
> nfsv4 so there is a chance at pushing the fixes back to stable.
>
> > 3) is check_path_accessible racy? It seems to work, but something
> > doesn't seem quite right with this approach. Is this defeatable somehow?
> > Could a rename of one of the intermediate path components cause
> > problems?
>
> check_path_accessible seems pretty horrible. If a process is running
> inside of a subdirectory it doesn't have permissions to access, say
> a chroot, /proc/self/fd/XXX becomes completely unusable.
>

Hmm...I have this in there:

+ /* are we at global root or root of namespace? */
+ if ((tdentry == root.dentry && vfsmnt == root.mnt) ||
+ vfsmnt->mnt_parent == vfsmnt)
+ break;

...In the case of a chroot, wouldn't "current->fs->root" point to the
root of the process' namespace? Or am I misunderstanding what
current->fs actually represents?

--
Jeff Layton <[email protected]>

2009-11-23 22:49:51

by Jamie Lokier

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

Jeff Layton wrote:
> > check_path_accessible seems pretty horrible. If a process is running
> > inside of a subdirectory it doesn't have permissions to access, say
> > a chroot, /proc/self/fd/XXX becomes completely unusable.
> >
>
> Hmm...I have this in there:
>
> + /* are we at global root or root of namespace? */
> + if ((tdentry == root.dentry && vfsmnt == root.mnt) ||
> + vfsmnt->mnt_parent == vfsmnt)
> + break;
>
> ...In the case of a chroot, wouldn't "current->fs->root" point to the
> root of the process' namespace? Or am I misunderstanding what
> current->fs actually represents?

A process can run inside a subdirectory it doesn't have permissions to
access without that being a chroot.

It can also run inside a subdirectory that isn't accessible from its
root, if that's how it was started - as well as having other
descriptors pointing to things outside its root.

It can also be passed file descriptors from outside its root while
it's running.

Really, I think the /proc/PID/fd/N check should restrict the open to
the O_* limitations that were used to open fd N before, and not have
any connection to actual paths at the time of this check.

-- Jamie
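
The rule proposed in the message above (limit a re-open through /proc/PID/fd/N to the access mode fd N was opened with) can be approximated from userspace as a rough sketch. This is not how the kernel would enforce it: mode_allows() and reopen_fd_checked() are hypothetical helper names, the policy encoded in mode_allows() is one possible reading of "the O_* limitations", and O_PATH descriptors are not handled.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/* Nonzero if access mode `want` stays within access mode `have`. */
static int mode_allows(int have, int want)
{
	have &= O_ACCMODE;
	want &= O_ACCMODE;
	return have == O_RDWR || have == want;
}

/*
 * Re-open `fd` through /proc/self/fd, but refuse any request that asks
 * for more access than the original descriptor has.
 * Returns a new descriptor, or a negative errno value.
 */
static int reopen_fd_checked(int fd, int flags)
{
	char path[64];
	int have, newfd;

	assert(fd >= 0);

	have = fcntl(fd, F_GETFL);
	if (have < 0)
		return -errno;
	if (!mode_allows(have, flags))
		return -EACCES;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	newfd = open(path, flags);
	return newfd < 0 ? -errno : newfd;
}
```

A descriptor opened O_RDONLY could then be re-opened O_RDONLY, while a request for O_WRONLY or O_RDWR would be rejected before the /proc path is touched.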

2009-11-23 23:16:12

by Jeff Layton

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Mon, 23 Nov 2009 22:49:48 +0000
Jamie Lokier <[email protected]> wrote:

> Jeff Layton wrote:
> > > check_path_accessible seems pretty horrible. If a process is running
> > > inside of a subdirectory it doesn't have permissions to access, say
> > > a chroot, /proc/self/fd/XXX becomes completely unusable.
> > >
> >
> > Hmm...I have this in there:
> >
> > + /* are we at global root or root of namespace? */
> > + if ((tdentry == root.dentry && vfsmnt == root.mnt) ||
> > + vfsmnt->mnt_parent == vfsmnt)
> > + break;
> >
> > ...In the case of a chroot, wouldn't "current->fs->root" point to the
> > root of the process' namespace? Or am I misunderstanding what
> > current->fs actually represents?
>
> A process can run inside a subdirectory it doesn't have permissions to
> access without that being a chroot.
>

Certainly.

> It can also run inside a subdirectory that isn't accessible from its
> root, if that's how it was started - as well as having other
> descriptors pointing to things outside its root.
>

Yes.

> It can also be passed file descriptors from outside its root while
> it's running.
>

Yep.

> Really, I think the /proc/PID/fd/N check should restrict the open to
> the O_* limitations that were used to open fd N before, and not have
> any connection to actual paths at the time of this check.
>

The big question with all of this is: Should a task have the ability
to follow a /proc/pid symlink to a path that it wouldn't ordinarily be
able to resolve with a path lookup? The consensus that I got from the
bugtraq discussion was that it should not, and this patch is an attempt
to prevent that.

I take it from your and Eric's comments that you disagree? If so, what's
your rationale for allowing a task to resolve this symlink when it
wouldn't ordinarily be able to do so if it were a "normal" symlink?

--
Jeff Layton <[email protected]>

2009-11-23 23:35:47

by Eric W. Biederman

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

Jeff Layton <[email protected]> writes:

> On Mon, 23 Nov 2009 22:49:48 +0000
> Jamie Lokier <[email protected]> wrote:
>
>> Jeff Layton wrote:
>> > > check_path_accessible seems pretty horrible. If a process is running
>> > > inside of a subdirectory it doesn't have permissions to access, say
>> > > a chroot, /proc/self/fd/XXX becomes completely unusable.
>> > >
>> >
>> > Hmm...I have this in there:
>> >
>> > + /* are we at global root or root of namespace? */
>> > + if ((tdentry == root.dentry && vfsmnt == root.mnt) ||
>> > + vfsmnt->mnt_parent == vfsmnt)
>> > + break;
>> >
>> > ...In the case of a chroot, wouldn't "current->fs->root" point to the
>> > root of the process' namespace? Or am I misunderstanding what
>> > current->fs actually represents?
>>
>> A process can run inside a subdirectory it doesn't have permissions to
>> access without that being a chroot.
>>
>
> Certainly.
>
>> It can also run inside a subdirectory that isn't accessible from its
>> root, if that's how it was started - as well as having other
>> descriptors pointing to things outside its root.
>>
>
> Yes.
>
>> It can also be passed file descriptors from outside its root while
>> it's running.
>>
>
> Yep.
>
>> Really, I think the /proc/PID/fd/N check should restrict the open to
>> the O_* limitations that were used to open fd N before, and not have
>> any connection to actual paths at the time of this check.
>>
>
> The big question with all of this is: Should a task have the ability
> to follow a /proc/pid symlink to a path that it wouldn't ordinarily be
> able to resolve with a path lookup? The consensus that I got from the
> bugtraq discussion was that it should not, and this patch is an attempt
> to prevent that.
>
> I take it from your and Eric's comments that you disagree? If so, what's
> your rationale for allowing a task to resolve this symlink when it
> wouldn't ordinarily be able to do so if it were a "normal" symlink?

For myself I start with the simple fact that the code has worked the
way it currently does since around linux 0.99. We have to be very
careful if we change this to avoid breaking existing applications.

So if we change the existing behaviour it must be done in such a way
that legitimate applications do not break.

Eric

2009-11-24 00:34:52

by Jeff Layton

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Mon, 23 Nov 2009 15:35:44 -0800
[email protected] (Eric W. Biederman) wrote:

> Jeff Layton <[email protected]> writes:
>
> > On Mon, 23 Nov 2009 22:49:48 +0000
> > Jamie Lokier <[email protected]> wrote:
> >
> >> Jeff Layton wrote:
> >> > > check_path_accessible seems pretty horrible. If a process is running
> >> > > inside of a subdirectory it doesn't have permissions to access, say
> >> > > a chroot, /proc/self/fd/XXX becomes completely unusable.
> >> > >
> >> >
> >> > Hmm...I have this in there:
> >> >
> >> > + /* are we at global root or root of namespace? */
> >> > + if ((tdentry == root.dentry && vfsmnt == root.mnt) ||
> >> > + vfsmnt->mnt_parent == vfsmnt)
> >> > + break;
> >> >
> >> > ...In the case of a chroot, wouldn't "current->fs->root" point to the
> >> > root of the process' namespace? Or am I misunderstanding what
> >> > current->fs actually represents?
> >>
> >> A process can run inside a subdirectory it doesn't have permissions to
> >> access without that being a chroot.
> >>
> >
> > Certainly.
> >
> >> It can also run inside a subdirectory that isn't accessible from its
> >> root, if that's how it was started - as well as having other
> >> descriptors pointing to things outside its root.
> >>
> >
> > Yes.
> >
> >> It can also be passed file descriptors from outside its root while
> >> it's running.
> >>
> >
> > Yep.
> >
> >> Really, I think the /proc/PID/fd/N check should restrict the open to
> >> the O_* limitations that were used to open fd N before, and not have
> >> any connection to actual paths at the time of this check.
> >>
> >
> > The big question with all of this is: Should a task have the ability
> > to follow a /proc/pid symlink to a path that it wouldn't ordinarily be
> > able to resolve with a path lookup? The consensus that I got from the
> > bugtraq discussion was that it should not, and this patch is an attempt
> > to prevent that.
> >
> > I take it from your and Eric's comments that you disagree? If so, what's
> > your rationale for allowing a task to resolve this symlink when it
> > wouldn't ordinarily be able to do so if it were a "normal" symlink?
>
> For myself I start with the simple fact that the code has worked the
> way it currently does since around linux 0.99. We have to be very
> careful if we change this to avoid breaking existing applications.
>
> So if we change the existing behaviour it must be done in such a way
> that legitimate applications do not break.
>

I certainly don't want to break existing apps. That said, applications
that are depending on /proc/pid symlinks to allow them to bypass
directory permissions or access files that aren't in their namespace
would seem to be unsafe, no?

I think all we can reasonably do is try to clearly lay out how these
symlinks are intended to work. I think it's logical that the result of
following these links should be more or less the same as if you were to
resolve the results of the readlink.

Is there some reason that we should expect them to provide anything
more? Do you have apps in mind that you think will break with this
change? If you think this is unreasonable, perhaps you could suggest an
alternative?

If this approach is reasonable, there is one thing I think that I'm
pretty sure will need to be fixed. It'll need to detect when the file
lies outside of its namespace altogether. I'm not quite sure how to
do that yet. I've not done much work with multiple namespaces, so I
could certainly use some guidance here.

--
Jeff Layton <[email protected]>

2009-11-24 01:20:34

by Jamie Lokier

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

Jeff Layton wrote:
> I certainly don't want to break existing apps. That said, applications
> that are depending on /proc/pid symlinks to allow them to bypass
> directory permissions or access files that aren't in their namespace
> would seem to be unsafe, no?

I think we can mostly agree on that :-)

> I think all we can reasonably do is try to clearly lay out how these
> symlinks are intended to work. I think it's logical that the result of
> following these links should be more or less the same as if you were to
> resolve the results of the readlink.
>
> Is there some reason that we should expect them to provide anything
> more? Do you have apps in mind that you think will break with this
> change?

Anything which is compiled with and uses the openat(), mkdirat()
etc. emulation in gnulib (formerly known as libiberty), and anything
using the same technique.

You know, GNU coreutils and other obscure things :-)

Of course there are real system calls for that, now, but there are
still compiled programs that don't know about the real system calls.
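
For illustration, the technique being described (resolving a name relative to an open directory by going through /proc/self/fd/N) looks roughly like the sketch below. It is not gnulib's actual code; emulated_openat() is a hypothetical name, and the sketch ignores O_CREAT (no mode argument) and any fallback for systems without /proc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>

/*
 * Open `name` relative to the directory open on `dirfd` by building a
 * path through /proc/self/fd. Absolute names and AT_FDCWD need no help.
 * Returns a descriptor, or -1 with errno set.
 */
static int emulated_openat(int dirfd, const char *name, int flags)
{
	char path[PATH_MAX];
	int n;

	assert(name != NULL);

	if (name[0] == '/' || dirfd == AT_FDCWD)
		return open(name, flags);

	n = snprintf(path, sizeof(path), "/proc/self/fd/%d/%s", dirfd, name);
	if (n < 0 || (size_t)n >= sizeof(path)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	return open(path, flags);
}
```

The lookup only works as long as following /proc/self/fd/N lands on the directory the descriptor refers to, whatever the permissions on the directories above it. That is the behaviour a check like check_path_accessible() would change.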

The same technique (traversing /proc/self/fd/N) is used on Solaris, by
the way. It's probably worth keeping a modicum of compatibility with
whatever Solaris does.

> If you think this is unreasonable, perhaps you could suggest an
> alternative?

I have, two mails up - did you read it? - and in the previous threads
which resulted in the bugtraq.

Please tell me why that approach does not work, thanks.

> If this approach is reasonable, there is one thing I think that I'm
> pretty sure will need to be fixed.

It's not reasonable for /proc/self/fd/N because that has historically
been a way to follow a directory (like openat) or dup() an open file
without sharing the seek offset, which is useful for multithreaded
code.
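
A short userspace sketch of the seek-offset point: a descriptor obtained with dup() shares its file offset with the original, while one obtained by opening /proc/self/fd/N gets an offset of its own. The helper names reopen_via_proc() and offsets_independent() are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Re-open an already open descriptor through its /proc/self/fd entry. */
static int reopen_via_proc(int fd, int flags)
{
	char path[64];

	assert(fd >= 0);
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return open(path, flags);
}

/*
 * Returns 1 if moving the offset of the original descriptor is visible
 * through a dup()ed descriptor but not through a /proc re-open,
 * 0 if not, and -1 if the setup failed.
 */
static int offsets_independent(const char *file)
{
	int fd, shared, fresh, result = -1;

	fd = open(file, O_RDONLY);
	if (fd < 0)
		return -1;

	shared = dup(fd);
	fresh = reopen_via_proc(fd, O_RDONLY);

	if (shared >= 0 && fresh >= 0 && lseek(fd, 1, SEEK_SET) == 1)
		result = lseek(shared, 0, SEEK_CUR) == 1 &&
			 lseek(fresh, 0, SEEK_CUR) == 0;

	if (fresh >= 0)
		close(fresh);
	if (shared >= 0)
		close(shared);
	close(fd);
	return result;
}
```

Calling offsets_independent() on any seekable regular file is expected to return 1 on a Linux system with /proc mounted.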

Same goes for /proc/self/exe: That has historically been a way to read
your own executable, e.g. for self-extracting executables, executables
with additional data glued on. That breaks if the executable at the
link target is not yourself.

But just to prove we've been over this before and never came to a
consensus or conclusion:

http://lkml.org/lkml/2008/3/23/3

(the whole thread is worth a read, but Denys Vlasenko's remarks are
especially relevant).

And for those who remember 2.0 :-)

http://lkml.indiana.edu/hypermail/linux/kernel/9609.2/0371.html

-- Jamie

2009-11-24 11:26:47

by Jeff Layton

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Tue, 24 Nov 2009 01:20:27 +0000
Jamie Lokier <[email protected]> wrote:

> Jeff Layton wrote:
> > I certainly don't want to break existing apps. That said, applications
> > that are depending on /proc/pid symlinks to allow them to bypass
> > directory permissions or access files that aren't in their namespace
> > would seem to be unsafe, no?
>
> I think we can mostly agree on that :-)
>
> > I think all we can reasonably do is try to clearly lay out how these
> > symlinks are intended to work. I think it's logical that the result of
> > following these links should be more or less the same as if you were to
> > resolve the results of the readlink.
> >
> > Is there some reason that we should expect them to provide anything
> > more? Do you have apps in mind that you think will break with this
> > change?
>
> Anything which is compiled with and uses the openat(), mkdirat()
> etc. emulation in gnulib (formerly known as libiberty), and anything
> using the same technique.
>
> You know, GNU coreutils and other obscure things :-)
>
> Of course there are real system calls for that, now, but there are
> still compiled programs that don't know about the real system calls.
>
> The same technique (traversing /proc/self/fd/N) is used on Solaris, by
> the way. It's probably worth keeping a modicum of compatibility with
> whatever Solaris does.
>

Oof...ok. I didn't realize that these symlinks were relied on in such a
fashion. I guess that means that there is a need to have them continue
to have special semantics beyond what a symlink would ordinarily have.

> > If you think this is unreasonable, perhaps you could suggest an
> > alternative?
>
> I have, two mails up - did you read it? - and in the previous threads
> which resulted in the bugtraq.
>
> Please tell me why that approach does not work, thanks.
>

I did read it, sorry I didn't comment on it before...

My immediate thought was that that approach only affects open calls. If
the containing directory weren't executable by the process it wouldn't
be able to (for instance) stat the file were these symlinks more
ordinary. Of course, the process could always do an fstat on the fd so
I don't suppose there's any harm in allowing it.

Since it's clear that these symlinks do need to have special semantics,
perhaps the approach you suggest would be the best thing. I'll have to
think about it a bit more.

> > If this approach is reasonable, there is one thing I think that I'm
> > pretty sure will need to be fixed.
>
> It's not reasonable for /proc/self/fd/N because that has historically
> been a way to follow a directory (like openat) or dup() an open file
> without sharing the seek offset, which is useful for multithreaded
> code.
>
> Same goes for /proc/self/exe: That has historically been a way to read
> your own executable, e.g. for self-extracting executables, executables
> with additional data glued on. That breaks if the executable at the
> link target is not yourself.
>
> But just to prove we've been over this before and never came to a
> consensus or conclusion:
>
> http://lkml.org/lkml/2008/3/23/3
>
> (the whole thread is worth a read, but Denys Vlasenko's remarks are
> especially relevant).
>
> And for those who remember 2.0 :-)
>
> http://lkml.indiana.edu/hypermail/linux/kernel/9609.2/0371.html
>

Thanks for the links. Those help clarify where you and Eric are coming
from. I'll need to rethink this.

Thank you (and Eric) for the comments so far.

Cheers,
--
Jeff Layton <[email protected]>

2009-11-24 11:53:24

by Miklos Szeredi

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Tue, 24 Nov 2009, Jeff Layton wrote:
> Since it's clear that these symlinks do need to have special semantics,
> perhaps the approach you suggest would be the best thing. I'll have to
> think about it a bit more.

open() is not the only thing you need to think about. Anything that
checks read or write permission on the inode (truncate, utimes,
*xattr) would have to be changed to respect the open mode.

See, this is not just about hacking the proc follow_symlink code to
check some lookup intent. It's about changing the permission checking
mechanism for these beasts. And since the permission checking is
inode based, this is not at all trivial to do.
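
The inode-based nature of these checks can be seen from userspace. In the sketch below (helper names are hypothetical, and the temporary file under /tmp is an assumption of the example), ftruncate() on a descriptor opened O_RDONLY is refused, but truncate() on the same file's /proc/self/fd/N path succeeds, because that call is checked against the inode's permissions rather than the open mode.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Truncate the file behind `fd` by path, going through its /proc entry. */
static int truncate_through_proc(int fd)
{
	char path[64];

	assert(fd >= 0);
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return truncate(path, 0);
}

/*
 * Returns 1 if ftruncate() on a read-only descriptor is refused while
 * truncate() through /proc empties the file, 0 otherwise, and -1 if the
 * temporary file could not be set up.
 */
static int readonly_fd_truncate_demo(void)
{
	char tmpl[] = "/tmp/procfd-demo-XXXXXX";
	struct stat st;
	int wfd, rfd, result = -1;

	wfd = mkstemp(tmpl);	/* created with mode 0600, owned by the caller */
	if (wfd < 0)
		return -1;

	rfd = open(tmpl, O_RDONLY);
	if (rfd >= 0 && write(wfd, "data", 4) == 4) {
		int refused = ftruncate(rfd, 0) != 0;
		int emptied = truncate_through_proc(rfd) == 0 &&
			      fstat(rfd, &st) == 0 && st.st_size == 0;

		result = refused && emptied;
	}

	if (rfd >= 0)
		close(rfd);
	close(wfd);
	unlink(tmpl);
	return result;
}
```

Making such a call respect the open mode would mean carrying that mode into a check that today only looks at the inode.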

I still believe leaving the current semantics and documenting them is
the best option.

Thanks,
Miklos

2009-11-24 12:28:17

by Pavel Machek

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Tue 2009-11-24 12:53:09, Miklos Szeredi wrote:
> On Tue, 24 Nov 2009, Jeff Layton wrote:
> > Since it's clear that these symlinks do need to have special semantics,
> > perhaps the approach you suggest would be the best thing. I'll have to
> > think about it a bit more.
>
> open() is not the only thing you need to think about. Anything that
> checks read or write permission on the inode (truncate, utimes,
> *xattr) would have to be changed to respect the open mode.
>
> See, this is not just about hacking the proc follow_symlink code to
> check some lookup intent. It's about changing the permission checking
> mechanism for these beasts. And since the permission checking is
> inode based, this is not at all trivial to do.
>
> I still believe leaving the current semantics and documenting them is
> the best option.

I believe that current semantics is ugly enough that 'documenting' it
is not enough... and people want to port from other systems, too, not
expecting nasty surprises like this...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-11-24 12:59:23

by Miklos Szeredi

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Tue, 24 Nov 2009, Pavel Machek wrote:
> I believe that current semantics is ugly enough that 'documenting' it
> is not enough... and people want to port from other systems, too, not
> expecting nasty surprises like this...

This hasn't been a problem for the last 12 years, and still we don't
see script kiddies exploiting this hole and sysadmins hurrying to
secure their system, even though it has been public for quite a while.

Why?

The reason might be, that there *is no* violation of security.

See this: the surprise isn't that an inode can be reached from
multiple paths, that has been possible with hard links for as long as
unix lived. The surprise is that the inode can be reached through
proc. So this "hole" that has been opened about 12 years ago in linux
is quite well known. Only this particular aspect of it isn't well
known, but that doesn't mean it's not right, does it?

Thanks,
Miklos

2009-11-24 13:13:56

by Duane Griffin

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

2009/11/24 Pavel Machek <[email protected]>:
> On Tue 2009-11-24 12:53:09, Miklos Szeredi wrote:
>> I still believe leaving the current semantics and documenting them is
>> the best option.
>
> I believe that current semantics is ugly enough that 'documenting' it
> is not enough... and people want to port from other systems, too, not
> expecting nasty surprises like this...

Solaris 10 works the same way as Linux does now, so I don't think the
porting argument gets you anywhere.

> Pavel

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2009-11-30 12:28:54

by Pavel Machek

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Tue 2009-11-24 13:59:06, Miklos Szeredi wrote:
> On Tue, 24 Nov 2009, Pavel Machek wrote:
> > I believe that current semantics is ugly enough that 'documenting' it
> > is not enough... and people want to port from other systems, too, not
> > expecting nasty surprises like this...
>
> This hasn't been a problem for the last 12 years, and still we don't
> see script kiddies exploiting this hole and sysadmins hurrying to
> secure their system, even though it has been public for quite a while.
>
> Why?

Because the conditions under which it hits are quite unusual?

> The reason might be, that there *is no* violation of security.

Well, security people disagree with you.

> See this: the surprise isn't that an inode can be reached from
> multiple paths, that has been possible with hard links for as long as
> unix lived. The surprise is that the inode can be reached through
> proc. So this "hole" that has been opened about 12 years ago in linux
> is quite well known. Only this particular aspect of it isn't well
> known, but that doesn't mean it's not right, does it?

It does. Bypassing checks on read-only file descriptors is a design
misfeature, and users are clearly unaware. (See bugtraq). Being "old"
does not mean it is right.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-11-30 19:00:46

by Jamie Lokier

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

Duane Griffin wrote:
> 2009/11/24 Pavel Machek <[email protected]>:
> > On Tue 2009-11-24 12:53:09, Miklos Szeredi wrote:
> >> I still believe leaving the current semantics and documenting them is
> >> the best option.
> >
> > I believe that current semantics is ugly enough that 'documenting' it
> > is not enough... and people want to port from other systems, too, not
> > expecting nasty surprises like this...
>
> Solaris 10 works the same way as Linux does now, so I don't think the
> porting argument gets you anywhere.

It certainly must be similar, as gnulib uses the same technique on
both Solaris and Linux.

I don't have a Solaris to try this on. Can you use /proc to re-open
with O_RDWR a file descriptor previously opened with O_RDONLY on
Solaris 10, assuming the underlying inode allows writing?
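For readers following along, the question can be phrased as a small test. This is only a sketch, not Pavel's recipe (which is not reproduced in this part of the thread). It assumes a Linux-style /proc/self/fd directory, and the helper name reopen_readwrite is made up for illustration.

```python
import os
import tempfile


def reopen_readwrite(path):
    """Open path read-only, then try to reopen the same descriptor
    read-write through /proc/self/fd (assumes a Linux-style procfs).
    Returns True if the reopen and a write through it succeed."""
    ro_fd = os.open(path, os.O_RDONLY)
    try:
        try:
            rw_fd = os.open("/proc/self/fd/%d" % ro_fd, os.O_RDWR)
        except OSError:
            return False  # reopen refused, or no procfs available
        try:
            os.write(rw_fd, b"written via /proc\n")
        finally:
            os.close(rw_fd)
        return True
    finally:
        os.close(ro_fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print("reopen with O_RDWR succeeded:", reopen_readwrite(tmp.name))
```

The reopen is checked against the permissions of the inode, not against the mode of the original descriptor, which is the behaviour under discussion.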

-- Jamie

2009-11-30 19:22:00

by Eric W. Biederman

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

Pavel Machek <[email protected]> writes:

> On Tue 2009-11-24 13:59:06, Miklos Szeredi wrote:
>> On Tue, 24 Nov 2009, Pavel Machek wrote:
>> > I believe that current semantics is ugly enough that 'documenting' it
>> > is not enough... and people want to port from other systems, too, not
>> > expecting nasty surprises like this...
>>
>> This hasn't been a problem for the last 12 years, and still we don't
>> see script kiddies exploiting this hole and sysadmins hurrying to
>> secure their system, even though it has been public for quite a while.
>>
>> Why?
>
> Because the conditions under which it hits are quite unusual?

So unusual perhaps that this is not a problem?

>> The reason might be, that there *is no* violation of security.
>
> Well, security people disagree with you.

Other security people disagree with you.

>> See this: the surprise isn't that an inode can be reached from
>> multiple paths, that has been possible with hard links for as long as
>> unix lived. The surprise is that the inode can be reached through
>> proc. So this "hole" that has been opened about 12 years ago in linux
>> is quite well known. Only this particular aspect of it isn't well
>> known, but that doesn't mean it's not right, does it?
>
> It does. Bypassing checks on read-only file descriptors is a design
> misfeature, and users are clearly unaware. (See bugtraq). Being "old"
> does not mean it is right.

Being "old" does mean that changing it is a regression if any valid
application depends on this feature.

Eric

2009-12-01 08:56:37

by Duane Griffin

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

2009/11/30 Jamie Lokier <[email protected]>:
> Duane Griffin wrote:
>> 2009/11/24 Pavel Machek <[email protected]>:
>> > On Tue 2009-11-24 12:53:09, Miklos Szeredi wrote:
>> >> I still believe leaving the current semantics and documenting them is
>> >> the best option.
>> >
>> > I believe that current semantics is ugly enough that 'documenting' it
>> > is not enough... and people want to port from other systems, too, not
>> > expecting nasty surprises like this...
>>
>> Solaris 10 works the same way as Linux does now, so I don't think the
>> porting argument gets you anywhere.
>
> It certainly must be similar, as gnulib uses the same technique on
> both Solaris and Linux.
>
> I don't have a Solaris to try this on. Can you use /proc to re-open
> with O_RDWR a file descriptor previously opened with O_RDONLY on
> Solaris 10, assuming the underlying inode allows writing?

Yep, I basically followed Pavel's recipe and got the same result. On
the other hand, it didn't work on AIX.

> -- Jamie

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2009-12-01 13:15:29

by Jeff Layton

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Mon, 23 Nov 2009 14:05:24 -0800
[email protected] (Eric W. Biederman) wrote:

> Jeff Layton <[email protected]> writes:
>
> > There are a few situations where a lookup can end up returning a dentry
> > without revalidating it, and without checking whether the calling
> > process has permissions to access it. Two situations identified so far
> > are:
> >
> > 1) LAST_BIND symlinks (such as those under /proc/<pid>)
> >
> > 2) file bind mounts
> >
> > This patchset is intended to fix this by forcing revalidation of the
> > returned dentries at appropriate locations.
> >
> > In the case of LAST_BIND symlinks it also adds a check to verify that
> > the target of the symlink is accessible by the current process by
> > walking mounts and dentries back up to the root and checking permission
> > on each inode.
> >
> > This set fixes the reproducers I have (including the reproducer that
> > Pavel provided for the permissions bypass). It's still pretty rough
> > though and I expect that it'll need revision. At this point, I'm mainly
> > looking to get these questions answered:
> >
> > 1) what should we do if these dentries are found to be invalid? Is it ok
> > to d_invalidate them? Or is that likely to break something (particularly
> > in the case of file bind mounts)?
>
> The normal sequence in do_revalidate should be safe. In practice what we
> should see is d_drop(). If we access the dentries via another path today
> we already go through d_revalidate. It is only the reference count on
> the dentry that keeps them alive and working. The cases I have looked
> at for distributed filesystems have to call d_drop themselves so I don't
> know if it would add anything if the vfs called d_revalidate. Especially
> since FS_REVAL_DOT doesn't have that logic.
>

There seems to be a lot of disagreement about whether the issue that
Pavel raised is even a bug. I think what I'm going to do at this point
is respin this patchset without that patch (just add the missing
revalidations).

I'll also plan to just have force_reval_path call do_revalidate instead
so that invalid dentries get d_invalidated too. Any other thoughts on
the first two patches in this set?

--
Jeff Layton <[email protected]>

2009-12-16 12:32:07

by Al Viro

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Mon, Nov 23, 2009 at 06:15:45PM -0500, Jeff Layton wrote:

> The big question with all of this is: Should a task have the ability
> to follow a /proc/pid symlink to a path that it wouldn't ordinarily be
> able to resolve with a path lookup. The consensus that I got from the
> bugtraq discussion was that it should not, and this patch is an attempt
> to prevent that.
>
> I take it from you and Eric's comments that you disagree? If so, what's
> your rationale for allowing a task to resolve this symlink when it
> wouldn't ordinarily be able to do so if it were a "normal" symlink?

WTF not? It's convenient and doesn't lose any real security. If your
code relies on inaccessibility of <path> since some component of that
path is inaccessible, you are *already* fscked. Consider e.g. fchdir()
and its implications - if you have an opened descriptor for parent,
having no exec permissions on grandparent won't stop you at all. Already.
On all Unices, regardless of openat(), etc.
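The fchdir() point can be illustrated with a short sketch. The directory names and the helper read_via_fchdir are hypothetical, chosen only to mirror the "parent" and "grandparent" wording above. The first element of the result depends on who runs it, since root bypasses the search-permission check.

```python
import os
import shutil
import tempfile


def read_via_fchdir():
    """Hold a descriptor for a directory, make its parent unsearchable,
    then reach a file inside it with fchdir()."""
    base = tempfile.mkdtemp()
    grandparent = os.path.join(base, "grandparent")
    parent = os.path.join(grandparent, "parent")
    os.makedirs(parent)
    with open(os.path.join(parent, "file"), "w") as f:
        f.write("hello")

    saved_cwd = os.getcwd()
    parent_fd = os.open(parent, os.O_RDONLY)   # taken while still reachable
    os.chmod(grandparent, 0)                   # drop search permission
    try:
        try:
            open(os.path.join(parent, "file")).close()
            path_lookup_ok = True              # e.g. when running as root
        except PermissionError:
            path_lookup_ok = False
        os.fchdir(parent_fd)                   # no walk through grandparent
        with open("file") as f:
            data = f.read()
    finally:
        os.chdir(saved_cwd)
        os.chmod(grandparent, 0o700)
        os.close(parent_fd)
        shutil.rmtree(base)
    return path_lookup_ok, data


if __name__ == "__main__":
    print(read_via_fchdir())
```

The file is read through the held descriptor even when the full pathname can no longer be resolved, with no procfs involved.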

And that's aside from being able to see the same object at some other pathname.
Which is also possible in a lot of ways. IOW, any code relying on that class
of assumptions is very likely to be widely b0rken, even if you leave aside the
long-standing behaviour of Linux.

I might buy the argument about restricting reopening with wider permissions,
but
    a) we still are looking at possible userland breakage of the worst
kind - random scripts passing /dev/fd/42 as command line arguments to
random programs. Once in a while. With error checking being... not quite
sufficient.
    b) it's not just open - we have at least chmod/chown/truncate to
deal with.

Prohibiting *all* access is a complete non-starter - things like
    cmp foo /dev/stdin || ....
would bloody better work and nobody cares whether you have redirect
from something out of your reach at the moment.

2009-12-20 19:59:13

by Pavel Machek

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Wed 2009-12-16 12:31:43, Al Viro wrote:
> On Mon, Nov 23, 2009 at 06:15:45PM -0500, Jeff Layton wrote:
>
> > The big question with all of this is: Should a task have the ability
> > to follow a /proc/pid symlink to a path that it wouldn't ordinarily be
> > able to resolve with a path lookup. The consensus that I got from the
> > bugtraq discussion was that it should not, and this patch is an attempt
> > to prevent that.
> >
> > I take it from you and Eric's comments that you disagree? If so, what's
> > your rationale for allowing a task to resolve this symlink when it
> > wouldn't ordinarily be able to do so if it were a "normal" symlink?
>
> WTF not? It's convenient and doesn't lose any real security. If your
> code relies on inaccessibility of <path> since some component of that
> path is inaccessible, you are *already* fscked. Consider e.g. fchdir()
> and its implications - if you have an opened descriptor for parent,
> having no exec permissions on grandparent won't stop you at all. Already.
> On all Unices, regardless of openat(), etc.

Consider FD passing over unix socket. Passing R/O file descriptor to
the other task, then having the task write to the file is certainly bad.

> I might buy the argument about restricting reopening with wider permissions,
> but
> a) we still are looking at possible userland breakage of the worst
> kind - random scripts passing /dev/fd/42 as command line arguments to
> random programs. Once in a while. With error checking being... not quite
> sufficient.
> b) it's not just open - we have at least chmod/chown/truncate to
> deal with.

That's indeed the sane way to solve that.

> Prohibiting *all* access is a complete non-starter - things like
> cmp foo /dev/stdin || ....
> would bloody better work and nobody cares whether you have redirect
> from something out of your reach at the moment.

Ok.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-12-20 21:04:18

by Al Viro

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Sun, Dec 20, 2009 at 08:59:03PM +0100, Pavel Machek wrote:
> > WTF not? It's convenient and doesn't lose any real security. If your
> > code relies on inaccessibility of <path> since some component of that
> > path is inaccessible, you are *already* fscked. Consider e.g. fchdir()
> > and its implications - if you have an opened descriptor for parent,
> > having no exec permissions on grandparent won't stop you at all. Already.
> > On all Unices, regardless of openat(), etc.
>
> Consider FD passing over unix socket. Passing R/O file descriptor to
> the other task, then having the task write to the file is certainly bad.

You've omitted the "R/O file descriptor of a file that is writable for
that other task" part...

2009-12-20 21:06:44

by Pavel Machek

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Sun 2009-12-20 21:04:04, Al Viro wrote:
> On Sun, Dec 20, 2009 at 08:59:03PM +0100, Pavel Machek wrote:
> > > WTF not? It's convenient and doesn't lose any real security. If your
> > > code relies on inaccessibility of <path> since some component of that
> > > path is inaccessible, you are *already* fscked. Consider e.g. fchdir()
> > > and its implications - if you have an opened descriptor for parent,
> > > having no exec permissions on grandparent won't stop you at all. Already.
> > > On all Unices, regardless of openat(), etc.
> >
> > Consider FD passing over unix socket. Passing R/O file descriptor to
> > the other task, then having the task write to the file is certainly bad.
>
> You've omitted the "R/O file descriptor of a file that is writable for
> that other task" part...

That is 666 for the other task. But the other task can't access it due
to directory being 700 or something. Your fchdir() argument does not
apply here.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-12-20 21:23:54

by Al Viro

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Sun, Dec 20, 2009 at 10:06:19PM +0100, Pavel Machek wrote:
> > > Consider FD passing over unix socket. Passing R/O file descriptor to
> > > the other task, then having the task write to the file is certainly bad.
> >
> > You've omitted the "R/O file descriptor of a file that is writable for
> > that other task" part...
>
> That is 666 for the other task. But the other task can't access it due
> to directory being 700 or something. Your fchdir() argument does not
> apply here.

*snort*

What you are advocating is a very limited class of setups that might be
usable for protecting files if not for the existing behaviour on a shitload
of systems.

The thing is, that class *is* very limited. E.g. introduce links and it's
fallen apart. Introduce bindings and the same will happen. Just try to
extend it one level deeper and fchdir() will bite you, etc. All of that
is not dependent on procfs even being there.

Access rights belong to file, not to a pathname (and there's no such thing
as _the_ pathname of a file).
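As a rough illustration of the hard-link case, the sketch below gives one inode two names, one of them inside a mode-700 directory. The names private, data and alias and the helper same_file_two_names are invented for the example, and it assumes the temporary directory lives on a filesystem that supports hard links.

```python
import os
import shutil
import tempfile


def same_file_two_names():
    """Create a file inside a 0700 directory, hard-link it to a second
    name outside that directory, and read it through the second name."""
    base = tempfile.mkdtemp()
    try:
        private_dir = os.path.join(base, "private")
        os.mkdir(private_dir, 0o700)
        original = os.path.join(private_dir, "data")
        with open(original, "w") as f:
            f.write("payload")

        alias = os.path.join(base, "alias")
        os.link(original, alias)           # second name, same inode

        st_a, st_b = os.stat(original), os.stat(alias)
        same_inode = (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
        with open(alias) as f:
            data = f.read()
        return same_inode, data
    finally:
        shutil.rmtree(base)


if __name__ == "__main__":
    print(same_file_two_names())
```

The mode bits travel with the inode, so the directory permissions on one name say nothing about access through the other.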

I'd buy that as a minor QoI issue; as a security one - no way.

2010-01-01 15:40:30

by Pavel Machek

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

Hi!

> > > > Consider FD passing over unix socket. Passing R/O file descriptor to
> > > > the other task, then having the task write to the file is certainly bad.
> > >
> > > You've omitted the "R/O file descriptor of a file that is writable for
> > > that other task" part...
> >
> > That is 666 for the other task. But the other task can't access it due
> > to directory being 700 or something. Your fchdir() argument does not
> > apply here.
>
> *snort*
>
> What you are advocating is a very limited class of setups that might be
> usable for protecting files if not for the existing behaviour on a shitload
> of systems.
>
> The thing is, that class *is* very limited. E.g. introduce links and it's
> fallen apart. Introduce bindings and the same will happen. Just try to
> extend it one level deeper and fchdir() will bite you, etc. All of that
> is not dependent on procfs even being there.
>
> Access rights belong to file, not to a pathname (and there's no such thing
> as _the_ pathname of a file).
>
> I'd buy that as a minor QoI issue; as a security one - no way.

Ok, so you see it as a (QoI) problem, but not too major. Good; I hope
it gets fixed one day.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-01-10 04:42:43

by Al Viro

Subject: Re: [PATCH 0/3] vfs: plug some holes involving LAST_BIND symlinks and file bind mounts (try #5)

On Fri, Jan 01, 2010 at 04:40:27PM +0100, Pavel Machek wrote:
> > Access rights belong to file, not to a pathname (and there's no such thing
> > as _the_ pathname of a file).
> >
> > I'd buy that as a minor QoI issue; as a security one - no way.
>
> Ok, so you see it as a (QoI) problem, but not too major. Good; I hope
> it gets fixed one day.

Actually, I'm not even sure that it *is* worse than what we'd get after
such change. Note that it's not just about trying to reopen a file
currently opened r/o for write; there's the opposite case. We'd break
scripts that try to read /dev/stderr and expect to be called with stderr
redirected to caller-writable file. With redirects done with 2> and not
2<>. Sure, it's a lousy practice. And scripts in question are not
well-written in general. Downright unmaintainable, in fact. Written
by sysadmin that had left the job five years ago and can't be located,
even if he could be bribed into touching That Shite(tm) ever again.
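The opposite case can be sketched the same way: a descriptor opened write-only (roughly what a 2> redirect produces) is read back by reopening it through procfs. This assumes a Linux-style /proc/self/fd, and read_back_writeonly is a hypothetical helper, not anything from the scripts described above.

```python
import os
import tempfile


def read_back_writeonly(path):
    """Write through a write-only descriptor, then read the file back
    by opening /proc/self/fd/<n> for reading (assumes Linux procfs)."""
    wr_fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        os.write(wr_fd, b"error output\n")
        # A fresh open of the same inode, so reading starts at offset 0.
        with open("/proc/self/fd/%d" % wr_fd, "rb") as f:
            return f.read()
    finally:
        os.close(wr_fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print(read_back_writeonly(tmp.name))
```

A change that tied reopening to the original open mode would make the read in this sketch fail, which is the breakage being described.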

We have far lousier kinds of behaviour we can't fix for compatibility
reasons. O_CREAT on dangling symlinks, for one. We tried to switch to
sane variant (from the current "create file wherever that symlink points
to") and had to revert due to userland crap that actually relied on that
insanity.
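The O_CREAT behaviour mentioned here is easy to observe from userland. A minimal sketch follows, with made-up names (target, link, create_through_dangling_symlink), run in a private temporary directory.

```python
import os
import shutil
import tempfile


def create_through_dangling_symlink():
    """Open a dangling symlink with O_CREAT (no O_EXCL) and report
    whether the file it points to was created."""
    base = tempfile.mkdtemp()
    try:
        target = os.path.join(base, "target")
        link = os.path.join(base, "link")
        os.symlink(target, link)               # target does not exist yet
        existed_before = os.path.exists(target)

        fd = os.open(link, os.O_WRONLY | os.O_CREAT, 0o600)
        os.close(fd)

        created = os.path.isfile(target)       # file appears at the target
        still_link = os.path.islink(link)      # the symlink itself is kept
        return existed_before, created, still_link
    finally:
        shutil.rmtree(base)


if __name__ == "__main__":
    print(create_through_dangling_symlink())
```

Adding O_EXCL changes this: open() then fails with EEXIST rather than following the link.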