This is a slightly extended repost of the patchset posted on
Jan 19. Current branch is in vfs.git#work.do_last, the main
difference from the last time around being a bit of do_last()
untangling added in the end of series. #work.openat2 is already
in mainline, which simplifies the series - now it's a straight
branch with no merges.
Individual patches are in the followups. Branch survives
the local testing (including ltp and xfstests).
Review and testing would be _very_ welcome; it does a lot
of massage, so there had been a plenty of opportunities to fuck up
and fail to spot that. The same goes for profiling - it doesn't
seem to slow the things down, but that needs to be verified.
part 1: follow_automount() cleanups and fixes.
Quite a bit of that function had been about working around the
wrong calling conventions of finish_automount(). The problem is that
finish_automount() misuses the primitive intended for mount(2) and
friends, where we want to mount on top of the pile, even if something
has managed to add to that while we'd been trying to lock the namespace.
For automount that's not the right thing to do - there we want to discard
whatever it was going to attach and just cross into what got mounted
there in the meanwhile (most likely - the results of the same automount
triggered by somebody else). Current mainline kinda-sorta manages to do
that, but it's unreliable and very convoluted. Much simpler approach
is to stop using lock_mount() in finish_automount() and have it bail
out if something turns out to have been mounted on top where we wanted
to attach. That allows to get rid of a lot of PITA in the caller.
Another simplification comes from not trying to cross into the results
of automount - simply ride through the next iteration of the loop and
let it move into overmount.
Another thing in the same series is divorcing follow_automount()
from nameidata; that'll play later when we get to unifying follow_down()
with the guts of follow_managed().
4 commits, the second one fixes a hard-to-hit race. The first
is a prereq for it.
1/34 do_add_mount(): lift lock_mount/unlock_mount into callers
2/34 fix automount/automount race properly
3/34 follow_automount(): get rid of dead^Wstillborn code
4/34 follow_automount() doesn't need the entire nameidata
part 2: unifying mount traversals in pathwalk.
Handling of mount traversal (follow_managed()) is currently called
in a bunch of places. Each of them is shortly followed by a call of
step_into() or an open-coded equivalent thereof. However, the locations
of those step_into() calls are far from preceding follow_managed();
moreover, that preceding call might happen on different paths that
converge to given step_into() call. It's harder to analyse that it should
be (especially when it comes to liveness analysis) and it forces rather
ugly calling conventions on lookup_fast()/atomic_open()/lookup_open().
The series below massages the code to the point when the calls of
follow_managed() (and __follow_mount_rcu()) move into the beginning of
step_into().
5/34 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
gets EEXIST handling in do_last() past the step_into() call there.
6/34 handle_mounts(): start building a sane wrapper for follow_managed()
rather than mangling follow_managed() itself (and creating conflicts
with openat2 series), add a wrapper that will absorb the required
interface changes.
7/34 atomic_open(): saner calling conventions (return dentry on success)
struct path passed to it is pure out parameter; only dentry part
ever varies, though - mnt is always nd->path.mnt. Just return
the dentry on success, and ERR_PTR(-E...) on failure.
8/34 lookup_open(): saner calling conventions (return dentry on success)
propagate the same change one level up the call chain.
9/34 do_last(): collapse the call of path_to_nameidata()
struct path filled in lookup_open() call is eventually given to
handle_mounts(); the only use it has before that is path_to_nameidata()
call in "->atomic_open() has actually opened it" case, and there
path_to_nameidata() is an overkill - we are guaranteed to replace
only nd->path.dentry. So have the struct path filled only immediately
prior to handle_mounts().
10/34 handle_mounts(): pass dentry in, turn path into a pure out argument
now all callers of handle_mount() are directly preceded by filling
struct path it gets. path->mnt is nd->path.mnt in all cases, so we can
pass just the dentry instead and fill path in handle_mount() itself.
Some boilerplate gone, path is pure out argument of handle_mount()
now.
11/34 lookup_fast(): consolidate the RCU success case
massage to gather what will become an RCU case equivalent of
handle_mounts(); basically, that's what we do if revalidate succeeds
in RCU case of lookup_fast(), including unlazy and fallback to
handle_mounts() if __follow_mount_rcu() says "it's too tricky".
12/34 teach handle_mounts() to handle RCU mode
... and take that into handle_mount() itself. The other caller of
__follow_mount_rcu() is fine with the same fallback (it just didn't
bother since it's in the very beginning of pathwalk), switched to
handle_mount() as well.
13/34 lookup_fast(): take mount traversal into callers
Now we are getting somewhere - both RCU and non-RCU success cases of
lookup_fast() are ended with the same return handle_mounts(...);
move that to the callers - there it will merge with the identical calls
that had been on the paths where we had to do slow lookups.
lookup_fast() returns dentry now.
14/34 new step_into() flag: WALK_NOFOLLOW
use step_into() instead of open-coding it in handle_lookup_down().
Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for
that (and eventually, I hope, for .. handling).
Now *all* calls of handle_mounts() and step_into() are right next to
each other.
15/34 fold handle_mounts() into step_into()
... and we can move the call of handle_mounts() into step_into(),
getting a slightly saner calling conventions out of that.
16/34 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
another payoff from 14/17 - we can teach path_lookupat() to do
what path_mountpointat() used to. And kill the latter, along with
its wrappers.
17/34 expand the only remaining call of path_lookup_conditional()
minor cleanup - RIP path_lookup_conditional(). Only one caller left.
Changes so far:
* mount traversal is taken into step_into().
* lookup_fast(), atomic_open() and lookup_open() calling conventions
are slightly changed. All of them return dentry now, instead of returning
an int and filling struct path on success. For lookup_fast() the old
"0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache
miss, dentry - for hit".
* step_into() can be called in RCU mode as well. Takes nameidata,
WALK_... flags, dentry and, in RCU case, corresponding inode and seq value.
Handles mount traversals, decides whether it's a symlink to be followed.
Error => returns -E...; symlink to follow => returns 1, puts symlink on stack;
non-symlink or symlink not to follow => returns 0, moves nd->path to new location.
* LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends
became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in flags.
part 3: untangling the symlink handling.
Right now when we decide to follow a symlink it happens this way:
* step_into() decides that it has been given a symlink that needs to
be followed.
* it calls pick_link(), which pushes the symlink on stack and
returns 1 on success / -E... on error. Symlink's mount/dentry/seq is
stored on stack and the inode is stashed in nd->link_inode.
* step_into() passes that 1 to its callers, which proceed to pass it
up the call chain for several layers. In all cases we get to get_link()
call shortly afterwards.
* get_link() is called, picks the inode stashed in nd->link_inode
by the pick_link(), does some checks, touches the atime, etc.
* get_link() either picks the link body out of inode or calls
->get_link(). If it's an absolute symlink, we move to the root and return
the relative portion of the body; if it's a relative one - just return the
body. If it's a procfs-style one, the call of nd_jump_link() has been
made and we'd moved to whatever location is desired. And return NULL,
same as we do for symlink to "/".
* the caller proceeds to deal with the string returned to it.
The sequence is the same in all cases (nested symlink, trailing
symlink on lookup, trailing symlink on open), but its pieces are not close
to each other and the bit between the call of pick_link() and (inevitable)
call of get_link() afterwards is not easy to follow. Moreover, a bunch
of functions (walk_component/lookup_last/do_last) ends up with the same
conventions for return values as step_into(). And those conventions
(see above) are not pretty - 0/1/-E... is asking for mistakes, especially
when returned 1 is used only to direct control flow on a rather twisted
way to matching get_link() call. And that path can be seriously twisted.
E.g. when we are trying to open /dev/stdin, we get the following sequence:
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast(). Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing. Into the stack it goes and we return 1.
* do_last() sees 1 and returns it.
* trailing_symlink() is called (in the top-level loop) and it
calls get_link(). OK, we get "/proc/self/fd/0" for body, move to
root again and return "proc/self/fd/0".
* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get
to look at what we'd been given. A symlink to follow, so on the stack
it goes and we return 1.
* again, do_last() passes 1 to caller
* trailing_symlink() is called and calls get_link().
* this time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin. And returns NULL. But the
fun doesn't stop yet.
* trailing_symlink() returns "" to the caller
* link_path_walk() is called on that and does nothing
whatsoever.
* do_last() is called and sees LAST_BIND left by the get_link().
It calls handle_dots()
* handle_dots() drops the symlink from stack and returns
* do_last() *FINALLY* proceeds to the point after its call of
step_into() (finish_open:) and gets around to opening the damn thing.
Making sense of the control flow through all of that is not fun,
to put it mildly; debugging anything in that area can be a massive PITA,
and this example has touched only one of 3 cases. Arguably, the worst
one, but... Anyway, it turns out that this code can be massaged to
considerably saner shape - both in terms of control flow and wrt calling
conventions.
18/34 merging pick_link() with get_link(), part 1
prep work: move the "hardening" crap from trailing_symlink() into
get_link() (conditional on the absense of LOOKUP_PARENT in nd->flags).
We'll be moving the calls of get_link() around quite a bit through that
series, and the next step will be to eliminate trailing_symlink().
19/34 merging pick_link() with get_link(), part 2
fold trailing_symlink() into lookup_last() and do_last().
Now these are returning strings; it's not the final calling conventions,
but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) =>
old -E..., we'd failed. string => old 1, and the string is the symlink
body to follow. Just as for trailing_symlink(), "/" and procfs ones
(where get_link() returns NULL) yield "", so the ugly song and dance
with no-op trip through link_path_walk()/handle_dots() still remains.
20/34 merging pick_link() with get_link(), part 3
elimination of that round-trip. In *all* cases having
get_link() return NULL on such symlinks means that we'll proceed to
drop the symlink from stack and get back to the point near that
get_link() call - basically, where we would be if it hadn't been
a symlink at all. The path by which we are getting there depends
upon the call site; the end result is the same in all cases - such
symlinks (procfs ones and symlink to "/") are fully processed by
the time get_link() returns, so we could as well drop them from the
stack right in get_link(). Makes life simpler in terms of control
flow analysis...
And now the calling conventions for do_last() and lookup_last()
have reached the final shape - ERR_PTR(-E...) for error, NULL for
"we are done", string for "traverse this".
21/34 merging pick_link() with get_link(), part 4
now all calls of walk_component() are followed by the same
boilerplate - "if it has returned 1, call get_link() and if that
has returned NULL treat that as if walk_component() has returned 0".
Eliminate by folding that into walk_component() itself. Now
walk_component() return value conventions have joined those of
do_last()/lookup_last().
22/34 merging pick_link() with get_link(), part 5
same as for the previous, only this time the boilerplate
migrates one level down, into step_into(). Only one caller of
get_link() left, step_into() has joined the same return value
conventions.
23/34 merging pick_link() with get_link(), part 6
move that thing into pick_link(). Now all traces of
"return 1 if we are following a symlink" are gone.
24/34 finally fold get_link() into pick_link()
ta-da - expand get_link() into the only caller. As a side
benefit, we get rid of stashing the inode in nd->link_inode - it
was done only to carry that piece of information from pick_link()
to eventual get_link(). That's not the main benefit, though - the
control flow became considerably easier to reason about.
For what it's worth, the example above (/dev/stdin) becomes
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast(). Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing. On the stack it goes and we get its body. Which is
"/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
* do_last() sees non-NULL and returns it - whether it's an error
or a pathname to traverse, we hadn't reached something we'll be opening.
* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get
to look at what we'd been given. A symlink to follow, so on the stack
it goes. This time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin. And returns NULL. So we
drop symlink from stack and return that NULL to caller.
* that NULL is returned by step_into(), same as if we had just
moved to a non-symlink.
* do_last() proceeds to open the damn thing.
part 4. some mount traversal cleanups.
25/34 massage __follow_mount_rcu() a bit
make it more similar to non-RCU counterpart
26/34 new helper: traverse_mounts()
the guts of follow_managed() are very similar to
follow_down(). The calling conventions are different (follow_managed()
works with nameidata, follow_down() - with standalone struct path),
but the core loop is pretty much the same in both. Turned that loop
into a common helper (traverse_mounts()) and since follow_managed()
becomes a very thin wrapper around it, expand follow_managed() at its
only call site (in handle_mounts()),
part 5. do_last() untangling.
Control flow in do_last() is an atrocity, and liveness analysis in there
is rather painful. What follows is a massage of that thing into (hopefully)
more straightforward shape; by the end of the series it's still unpleasant,
but at least easier to follow.
A major source of headache is treatment of "we'd already managed to
open it in ->atomic_open()" and "we'd just created that sucker" cases -
that's what gives complicated control flow graph. As it is, we
have the following horror:
|
/------* ends with . or ..?
| |
| /---* found in dcache, no O_CREAT?
| | |
| | # call lookup_open() here.
| | |
| | *---------------\ already opened in ->atomic_open()?
| | | #
| | *---\ | freshly created file?
| | | # |
| \---+ | | finish_lookup:
| | | |
| *---------------------> is it a symlink?
| | | |
\------+ | | finish_open:
# | |
+--/ | finish_open_created:
# |
+---------------/ opened:
#
To make it even more unpleasant, there is quite a bit of similar,
but not entirely identical logics on parallel branches, some of it
buried in lookup_open() *and* atomic_open() called by it. Keeping
track of that has been hard and that had lead to more than one bug.
27/34 atomic_open(): return the right dentry in FMODE_OPENED case
As it is, several invariants do not hold in "we'd already opened
it in ->atomic_open()" case. In particular, nd->path.dentry might be
pointing to the wrong place by the time we return to do_last() - on
that codepath we don't care anymore. That both makes it harder to reason
about and serves as an obstacle to transformations that would untangle
that mess. Fortunately, it's not hard to regularize.
28/34 atomic_open(): lift the call of may_open() into do_last()
may_open() is called before vfs_open() in "hadn't opened in
->atomic_open()" case. Rightfully so, since vfs_open() for e.g.
devices can have side effects. In "opened in ->atomic_open()" case
we have to do it after the actual opening - the whole point is to
combine open with lookup and we only get the information needed for
may_open() after the combined lookup/open has happened. That's
OK - no side effects are possible in that case. However, we don't
have to keep that call of may_open() inside fs/namei.c:atomic_open();
as the matter of fact, lifting it into do_last() allows to simplify
life there...
29/34 do_last(): merge the may_open() calls
... since now we have the "it's already opened" case in
do_last() rejoin the main path at earlier point.
At that point the horror graph from above has become
|
/------* ends with . or ..?
| |
| /---* found in dcache, no O_CREAT?
| | |
| | # call lookup_open() here.
| | |
| | *---------------\ already opened in ->atomic_open()?
| | | #
| | *---\ | freshly created file?
| | | # |
| \---+ | | finish_lookup:
| | | |
| *---------------------> is it a symlink?
| | | |
\------+ | | finish_open:
# | |
+--/------------/ finish_open_created:
|
30/34 do_last(): don't bother with keeping got_write in FMODE_OPENED case
Another source of unpleasantness is an attempt to be clever and
keep track of write access status; the thing is, it doesn't really buy
us anything - we could as well drop it right after the lookup_open() and
only regain it for truncation, should such be needed. Makes for much
simpler cleanups on failures and sets the things up for unification of
"already opened" and "new file" branches with the main path...
31/34 do_last(): rejoing the common path earlier in FMODE_{OPENED,CREATED} case
... which we do here.
32/34 do_last(): simplify the liveness analysis past finish_open_created
It also makes possible to shrink the liveness intervals for local
variables.
33/34 do_last(): rejoin the common path even earlier in FMODE_{OPENED,CREATED} case
Further unification of parallel branches.
At that point we get
|
/------* ends with . or ..?
| |
| /---* found in dcache, no O_CREAT?
| | |
| | # call lookup_open() here.
| | |
| | *---\ opened by ->atomic_open() or freshly creatd?
| | | |
| \---+ | finish_lookup:
| | |
| *---------------------> is it a symlink?
| | |
\------+ | finish_open:
| |
+--/ finish_open_created:
|
with very little work done between finish_open: and finish_open_created:,
as well as on any of the side branches. Moreover, we have a pretty clear
separation: most of the work on _opening_ is after finish_open_created
(some of it - conditional), while the work on lookups and creation is all
before that point. Even better, most of the local variables are used
either only before or only after that cutoff point.
34/34 split the lookup-related parts of do_last() into a separate helper
... which allows to separate the lookup-related parts from
open-related ones.
I'm not saying I'm entirely happy with the resulting state of
do_last() clusterfuck, but it got a lot easier to follow and reason
about. There are more cleanups possible (and needed) in there, though -
there will be followups.
From: Al Viro <[email protected]>
preparation to finish_automount() fix (next commit)
Signed-off-by: Al Viro <[email protected]>
---
fs/namespace.c | 47 ++++++++++++++++++++++++-----------------------
1 file changed, 24 insertions(+), 23 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 85b5f7bea82e..dcd015fafe01 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2697,45 +2697,32 @@ static int do_move_mount_old(struct path *path, const char *old_name)
/*
* add a mount into a namespace's mount tree
*/
-static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
+static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
+ struct path *path, int mnt_flags)
{
- struct mountpoint *mp;
- struct mount *parent;
- int err;
+ struct mount *parent = real_mount(path->mnt);
mnt_flags &= ~MNT_INTERNAL_FLAGS;
- mp = lock_mount(path);
- if (IS_ERR(mp))
- return PTR_ERR(mp);
-
- parent = real_mount(path->mnt);
- err = -EINVAL;
if (unlikely(!check_mnt(parent))) {
/* that's acceptable only for automounts done in private ns */
if (!(mnt_flags & MNT_SHRINKABLE))
- goto unlock;
+ return -EINVAL;
/* ... and for those we'd better have mountpoint still alive */
if (!parent->mnt_ns)
- goto unlock;
+ return -EINVAL;
}
/* Refuse the same filesystem on the same mount point */
- err = -EBUSY;
if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
path->mnt->mnt_root == path->dentry)
- goto unlock;
+ return -EBUSY;
- err = -EINVAL;
if (d_is_symlink(newmnt->mnt.mnt_root))
- goto unlock;
+ return -EINVAL;
newmnt->mnt.mnt_flags = mnt_flags;
- err = graft_tree(newmnt, parent, mp);
-
-unlock:
- unlock_mount(mp);
- return err;
+ return graft_tree(newmnt, parent, mp);
}
static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
@@ -2748,6 +2735,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
unsigned int mnt_flags)
{
struct vfsmount *mnt;
+ struct mountpoint *mp;
struct super_block *sb = fc->root->d_sb;
int error;
@@ -2768,7 +2756,13 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
mnt_warn_timestamp_expiry(mountpoint, mnt);
- error = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+ mp = lock_mount(mountpoint);
+ if (IS_ERR(mp)) {
+ mntput(mnt);
+ return PTR_ERR(mp);
+ }
+ error = do_add_mount(real_mount(mnt), mp, mountpoint, mnt_flags);
+ unlock_mount(mp);
if (error < 0)
mntput(mnt);
return error;
@@ -2830,6 +2824,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int finish_automount(struct vfsmount *m, struct path *path)
{
struct mount *mnt = real_mount(m);
+ struct mountpoint *mp;
int err;
/* The new mount record should have at least 2 refs to prevent it being
* expired before we get a chance to add it
@@ -2842,7 +2837,13 @@ int finish_automount(struct vfsmount *m, struct path *path)
goto fail;
}
- err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+ mp = lock_mount(path);
+ if (IS_ERR(mp)) {
+ err = PTR_ERR(mp);
+ goto fail;
+ }
+ err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+ unlock_mount(mp);
if (!err)
return 0;
fail:
--
2.11.0
From: Al Viro <[email protected]>
Protection against automount/automount races (two threads hitting the same
referral point at the same time) is based upon do_add_mount() prevention of
identical overmounts - trying to overmount the root of mounted tree with
the same tree fails with -EBUSY. It's unreliable (the other thread might've
mounted something on top of the automount it has triggered) *and* causes
no end of headache for follow_automount() and its caller, since
finish_automount() behaves like do_new_mount() - if the mountpoint to be is
overmounted, it mounts on top what's overmounting it. It's not only wrong
(we want to go into what's overmounting the automount point and quietly
discard what we planned to mount there), it introduces the possibility of
original parent mount getting dropped. That's what 8aef18845266 (VFS: Fix
vfsmount overput on simultaneous automount) deals with, but it can't do
anything about the reliability of conflict detection - if something had
been overmounted the other thread's automount (e.g. that other thread
having stepped into automount in mount(2)), we don't get that -EBUSY and
the result is
referral point under automounted NFS under explicit overmount
under another copy of automounted NFS
What we need is finish_automount() *NOT* digging into overmounts - if it
finds one, it should just quietly discard the thing it was asked to mount.
And don't bother with actually crossing into the results of finish_automount() -
the same loop that calls follow_automount() will do that just fine on the
next iteration.
IOW, instead of calling lock_mount() have finish_automount() do it manually,
_without_ the "move into overmount and retry" part. And leave crossing into
the results to the caller of follow_automount(), which simplifies it a lot.
Moral: if you end up with a lot of glue working around the calling conventions
of something, perhaps these calling conventions are simply wrong...
Fixes: 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount)
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 29 ++++-------------------------
fs/namespace.c | 41 ++++++++++++++++++++++++++++++++++-------
2 files changed, 38 insertions(+), 32 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index db6565c99825..626eddb33508 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1208,11 +1208,9 @@ EXPORT_SYMBOL(follow_up);
* - return -EISDIR to tell follow_managed() to stop and return the path we
* were called with.
*/
-static int follow_automount(struct path *path, struct nameidata *nd,
- bool *need_mntput)
+static int follow_automount(struct path *path, struct nameidata *nd)
{
struct vfsmount *mnt;
- int err;
if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
return -EREMOTE;
@@ -1253,29 +1251,10 @@ static int follow_automount(struct path *path, struct nameidata *nd,
return PTR_ERR(mnt);
}
- if (!mnt) /* mount collision */
- return 0;
-
- if (!*need_mntput) {
- /* lock_mount() may release path->mnt on error */
- mntget(path->mnt);
- *need_mntput = true;
- }
- err = finish_automount(mnt, path);
-
- switch (err) {
- case -EBUSY:
- /* Someone else made a mount here whilst we were busy */
+ if (!mnt)
return 0;
- case 0:
- path_put(path);
- path->mnt = mnt;
- path->dentry = dget(mnt->mnt_root);
- return 0;
- default:
- return err;
- }
+ return finish_automount(mnt, path);
}
/*
@@ -1333,7 +1312,7 @@ static int follow_managed(struct path *path, struct nameidata *nd)
/* Handle an automount point */
if (flags & DCACHE_NEED_AUTOMOUNT) {
- ret = follow_automount(path, nd, &need_mntput);
+ ret = follow_automount(path, nd);
if (ret < 0)
break;
continue;
diff --git a/fs/namespace.c b/fs/namespace.c
index dcd015fafe01..6228fd1ef94f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2823,6 +2823,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int finish_automount(struct vfsmount *m, struct path *path)
{
+ struct dentry *dentry = path->dentry;
struct mount *mnt = real_mount(m);
struct mountpoint *mp;
int err;
@@ -2832,21 +2833,47 @@ int finish_automount(struct vfsmount *m, struct path *path)
BUG_ON(mnt_get_count(mnt) < 2);
if (m->mnt_sb == path->mnt->mnt_sb &&
- m->mnt_root == path->dentry) {
+ m->mnt_root == dentry) {
err = -ELOOP;
- goto fail;
+ goto discard;
}
- mp = lock_mount(path);
+ /*
+ * we don't want to use lock_mount() - in this case finding something
+ * that overmounts our mountpoint to be means "quitely drop what we've
+ * got", not "try to mount it on top".
+ */
+ inode_lock(dentry->d_inode);
+ if (unlikely(cant_mount(dentry))) {
+ err = -ENOENT;
+ goto discard1;
+ }
+ namespace_lock();
+ rcu_read_lock();
+ if (unlikely(__lookup_mnt(path->mnt, dentry))) {
+ rcu_read_unlock();
+ err = 0;
+ goto discard2;
+ }
+ rcu_read_unlock();
+ mp = get_mountpoint(dentry);
if (IS_ERR(mp)) {
err = PTR_ERR(mp);
- goto fail;
+ goto discard2;
}
+
err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
unlock_mount(mp);
- if (!err)
- return 0;
-fail:
+ if (unlikely(err))
+ goto discard;
+ mntput(m);
+ return 0;
+
+discard2:
+ namespace_unlock();
+discard1:
+ inode_unlock(dentry->d_inode);
+discard:
/* remove m from any expiration list it may be on */
if (!list_empty(&mnt->mnt_expire)) {
namespace_lock();
--
2.11.0
From: Al Viro <[email protected]>
1) no instances of ->d_automount() have ever made use of the "return
ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
a rudiment of plans that got superseded before the thing went into
the tree. Despite the comment in follow_automount(), autofs has
never done that.
2) if there's no ->d_automount() in dentry_operations, filesystems
should not set DCACHE_NEED_AUTOMOUNT in the first place. None have
ever done so...
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 28 +++-------------------------
fs/namespace.c | 9 ++++++++-
2 files changed, 11 insertions(+), 26 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 626eddb33508..39dd56f5171f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1210,10 +1210,7 @@ EXPORT_SYMBOL(follow_up);
*/
static int follow_automount(struct path *path, struct nameidata *nd)
{
- struct vfsmount *mnt;
-
- if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
- return -EREMOTE;
+ struct dentry *dentry = path->dentry;
/* We don't want to mount if someone's just doing a stat -
* unless they're stat'ing a directory and appended a '/' to
@@ -1228,33 +1225,14 @@ static int follow_automount(struct path *path, struct nameidata *nd)
*/
if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
- path->dentry->d_inode)
+ dentry->d_inode)
return -EISDIR;
nd->total_link_count++;
if (nd->total_link_count >= 40)
return -ELOOP;
- mnt = path->dentry->d_op->d_automount(path);
- if (IS_ERR(mnt)) {
- /*
- * The filesystem is allowed to return -EISDIR here to indicate
- * it doesn't want to automount. For instance, autofs would do
- * this so that its userspace daemon can mount on this dentry.
- *
- * However, we can only permit this if it's a terminal point in
- * the path being looked up; if it wasn't then the remainder of
- * the path is inaccessible and we should say so.
- */
- if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT))
- return -EREMOTE;
- return PTR_ERR(mnt);
- }
-
- if (!mnt)
- return 0;
-
- return finish_automount(mnt, path);
+ return finish_automount(dentry->d_op->d_automount(path), path);
}
/*
diff --git a/fs/namespace.c b/fs/namespace.c
index 6228fd1ef94f..a9e556224945 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2824,9 +2824,16 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int finish_automount(struct vfsmount *m, struct path *path)
{
struct dentry *dentry = path->dentry;
- struct mount *mnt = real_mount(m);
struct mountpoint *mp;
+ struct mount *mnt;
int err;
+
+ if (!m)
+ return 0;
+ if (IS_ERR(m))
+ return PTR_ERR(m);
+
+ mnt = real_mount(m);
/* The new mount record should have at least 2 refs to prevent it being
* expired before we get a chance to add it
*/
--
2.11.0
From: Al Viro <[email protected]>
Only the address of ->total_link_count and the flags.
And fix an off-by-one is ELOOP detection - make it
consistent with symlink following, where we check if
the pre-increment value has reached 40, rather than
check the post-increment one.
[kudos to Christian Brauner for spotted braino]
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 39dd56f5171f..6721c5f7e9d5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1208,7 +1208,7 @@ EXPORT_SYMBOL(follow_up);
* - return -EISDIR to tell follow_managed() to stop and return the path we
* were called with.
*/
-static int follow_automount(struct path *path, struct nameidata *nd)
+static int follow_automount(struct path *path, int *count, unsigned lookup_flags)
{
struct dentry *dentry = path->dentry;
@@ -1223,13 +1223,12 @@ static int follow_automount(struct path *path, struct nameidata *nd)
* as being automount points. These will need the attentions
* of the daemon to instantiate them before they can be used.
*/
- if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
+ if (!(lookup_flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
dentry->d_inode)
return -EISDIR;
- nd->total_link_count++;
- if (nd->total_link_count >= 40)
+ if (count && (*count)++ >= MAXSYMLINKS)
return -ELOOP;
return finish_automount(dentry->d_op->d_automount(path), path);
@@ -1290,7 +1289,8 @@ static int follow_managed(struct path *path, struct nameidata *nd)
/* Handle an automount point */
if (flags & DCACHE_NEED_AUTOMOUNT) {
- ret = follow_automount(path, nd);
+ ret = follow_automount(path, &nd->total_link_count,
+ nd->flags);
if (ret < 0)
break;
continue;
--
2.11.0
From: Al Viro <[email protected]>
All callers of follow_managed() follow it on success with the same steps -
d_backing_inode(path->dentry) is calculated and stored into some struct inode *
variable and, in all but one case, an unsigned variable (nd->seq to be) is
zeroed. The single exception is lookup_fast() and there zeroing is correct
thing to do - not doing it is a pointless microoptimization.
Add a wrapper for follow_managed() that would do that combination.
It's mostly a vehicle for code massage - it will be changing quite a bit,
and the current calling conventions are by no means final. Right now it
takes path, nameidata and (as out params) inode and seq, similar to
__follow_mount_rcu(). Which will soon get folded into it...
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 6938d20aa73a..c104ec75faef 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1385,6 +1385,18 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
!(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
}
+static inline int handle_mounts(struct path *path, struct nameidata *nd,
+ struct inode **inode, unsigned int *seqp)
+{
+ int ret = follow_managed(path, nd);
+
+ if (likely(ret >= 0)) {
+ *inode = d_backing_inode(path->dentry);
+ *seqp = 0; /* out of RCU mode, so the value doesn't matter */
+ }
+ return ret;
+}
+
static int follow_dotdot_rcu(struct nameidata *nd)
{
struct inode *inode = nd->inode;
@@ -1607,7 +1619,6 @@ static int lookup_fast(struct nameidata *nd,
struct vfsmount *mnt = nd->path.mnt;
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
- int err;
/*
* Rename seqlock is not required here because in the off chance
@@ -1677,10 +1688,7 @@ static int lookup_fast(struct nameidata *nd,
path->mnt = mnt;
path->dentry = dentry;
- err = follow_managed(path, nd);
- if (likely(err > 0))
- *inode = d_backing_inode(path->dentry);
- return err;
+ return handle_mounts(path, nd, inode, seqp);
}
/* Fast lookup failed, do it the slow way */
@@ -1875,12 +1883,9 @@ static int walk_component(struct nameidata *nd, int flags)
return PTR_ERR(path.dentry);
path.mnt = nd->path.mnt;
- err = follow_managed(&path, nd);
+ err = handle_mounts(&path, nd, &inode, &seq);
if (unlikely(err < 0))
return err;
-
- seq = 0; /* we are already out of RCU mode */
- inode = d_backing_inode(path.dentry);
}
return step_into(nd, &path, flags, inode, seq);
@@ -2365,11 +2370,9 @@ static int handle_lookup_down(struct nameidata *nd)
return -ECHILD;
} else {
dget(path.dentry);
- err = follow_managed(&path, nd);
+ err = handle_mounts(&path, nd, &inode, &seq);
if (unlikely(err < 0))
return err;
- inode = d_backing_inode(path.dentry);
- seq = 0;
}
path_to_nameidata(&path, nd);
nd->inode = inode;
@@ -3392,12 +3395,9 @@ static int do_last(struct nameidata *nd,
got_write = false;
}
- error = follow_managed(&path, nd);
+ error = handle_mounts(&path, nd, &inode, &seq);
if (unlikely(error < 0))
return error;
-
- seq = 0; /* out of RCU mode, so the value doesn't matter */
- inode = d_backing_inode(path.dentry);
finish_lookup:
error = step_into(nd, &path, 0, inode, seq);
if (unlikely(error))
--
2.11.0
From: Al Viro <[email protected]>
Currently it either returns -E... or puts (nd->path.mnt,dentry)
into *path and returns 0. Make it return ERR_PTR(-E...) or
dentry; adjust the caller. Fewer arguments and it's easier
to keep track of *path contents that way.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 41 ++++++++++++++++++++++++-----------------
1 file changed, 24 insertions(+), 17 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index c104ec75faef..5f8b791a6d6e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3087,10 +3087,10 @@ static int may_o_create(const struct path *dir, struct dentry *dentry, umode_t m
*
* Returns an error code otherwise.
*/
-static int atomic_open(struct nameidata *nd, struct dentry *dentry,
- struct path *path, struct file *file,
- const struct open_flags *op,
- int open_flag, umode_t mode)
+static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
+ struct file *file,
+ const struct open_flags *op,
+ int open_flag, umode_t mode)
{
struct dentry *const DENTRY_NOT_SET = (void *) -1UL;
struct inode *dir = nd->path.dentry->d_inode;
@@ -3131,17 +3131,15 @@ static int atomic_open(struct nameidata *nd, struct dentry *dentry,
}
if (file->f_mode & FMODE_CREATED)
fsnotify_create(dir, dentry);
- if (unlikely(d_is_negative(dentry))) {
+ if (unlikely(d_is_negative(dentry)))
error = -ENOENT;
- } else {
- path->dentry = dentry;
- path->mnt = nd->path.mnt;
- return 0;
- }
}
}
- dput(dentry);
- return error;
+ if (error) {
+ dput(dentry);
+ dentry = ERR_PTR(error);
+ }
+ return dentry;
}
/*
@@ -3236,11 +3234,20 @@ static int lookup_open(struct nameidata *nd, struct path *path,
}
if (dir_inode->i_op->atomic_open) {
- error = atomic_open(nd, dentry, path, file, op, open_flag,
- mode);
- if (unlikely(error == -ENOENT) && create_error)
- error = create_error;
- return error;
+ dentry = atomic_open(nd, dentry, file, op, open_flag, mode);
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ if (unlikely(error == -ENOENT) && create_error)
+ error = create_error;
+ return error;
+ }
+ if (file->f_mode & FMODE_OPENED) {
+ dput(dentry);
+ return 0;
+ }
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ return 0;
}
no_open:
--
2.11.0
From: Al Viro <[email protected]>
O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
As it is, we might or might not have LOOKUP_FOLLOW in op->intent
in that case - that depends upon having O_NOFOLLOW in open flags.
It doesn't matter, since we won't be checking it in that case -
do_last() bails out earlier.
However, making sure it's not set (i.e. acting as if we had an explicit
O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the
check for O_CREAT | O_EXCL in do_last() with the call of step_into()
immediately following it.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 15 +++++----------
fs/open.c | 4 +++-
2 files changed, 8 insertions(+), 11 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 6721c5f7e9d5..6938d20aa73a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3396,22 +3396,17 @@ static int do_last(struct nameidata *nd,
if (unlikely(error < 0))
return error;
- /*
- * create/update audit record if it already exists.
- */
- audit_inode(nd->name, path.dentry, 0);
-
- if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
- path_to_nameidata(&path, nd);
- return -EEXIST;
- }
-
seq = 0; /* out of RCU mode, so the value doesn't matter */
inode = d_backing_inode(path.dentry);
finish_lookup:
error = step_into(nd, &path, 0, inode, seq);
if (unlikely(error))
return error;
+
+ if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
+ audit_inode(nd->name, nd->path.dentry, 0);
+ return -EEXIST;
+ }
finish_open:
/* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */
error = complete_walk(nd);
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..e5227cd533f4 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1049,8 +1049,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
- if (flags & O_EXCL)
+ if (flags & O_EXCL) {
op->intent |= LOOKUP_EXCL;
+ flags |= O_NOFOLLOW;
+ }
}
if (flags & O_DIRECTORY)
--
2.11.0
From: Al Viro <[email protected]>
same story as for atomic_open() in the previous commit.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 44 +++++++++++++++++++-------------------------
1 file changed, 19 insertions(+), 25 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 5f8b791a6d6e..4946d006ba20 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3157,10 +3157,9 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
*
* An error code is returned on failure.
*/
-static int lookup_open(struct nameidata *nd, struct path *path,
- struct file *file,
- const struct open_flags *op,
- bool got_write)
+static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
+ const struct open_flags *op,
+ bool got_write)
{
struct dentry *dir = nd->path.dentry;
struct inode *dir_inode = dir->d_inode;
@@ -3171,7 +3170,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
if (unlikely(IS_DEADDIR(dir_inode)))
- return -ENOENT;
+ return ERR_PTR(-ENOENT);
file->f_mode &= ~FMODE_CREATED;
dentry = d_lookup(dir, &nd->last);
@@ -3179,7 +3178,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
if (!dentry) {
dentry = d_alloc_parallel(dir, &nd->last, &wq);
if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ return dentry;
}
if (d_in_lookup(dentry))
break;
@@ -3195,7 +3194,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
}
if (dentry->d_inode) {
/* Cached positive dentry: will open in f_op->open */
- goto out_no_open;
+ return dentry;
}
/*
@@ -3236,18 +3235,10 @@ static int lookup_open(struct nameidata *nd, struct path *path,
if (dir_inode->i_op->atomic_open) {
dentry = atomic_open(nd, dentry, file, op, open_flag, mode);
if (IS_ERR(dentry)) {
- error = PTR_ERR(dentry);
- if (unlikely(error == -ENOENT) && create_error)
- error = create_error;
- return error;
+ if (dentry == ERR_PTR(-ENOENT) && create_error)
+ dentry = ERR_PTR(create_error);
}
- if (file->f_mode & FMODE_OPENED) {
- dput(dentry);
- return 0;
- }
- path->mnt = nd->path.mnt;
- path->dentry = dentry;
- return 0;
+ return dentry;
}
no_open:
@@ -3283,14 +3274,11 @@ static int lookup_open(struct nameidata *nd, struct path *path,
error = create_error;
goto out_dput;
}
-out_no_open:
- path->dentry = dentry;
- path->mnt = nd->path.mnt;
- return 0;
+ return dentry;
out_dput:
dput(dentry);
- return error;
+ return ERR_PTR(error);
}
/*
@@ -3309,6 +3297,7 @@ static int do_last(struct nameidata *nd,
unsigned seq;
struct inode *inode;
struct path path;
+ struct dentry *dentry;
int error;
nd->flags &= ~LOOKUP_PARENT;
@@ -3365,14 +3354,18 @@ static int do_last(struct nameidata *nd,
inode_lock(dir->d_inode);
else
inode_lock_shared(dir->d_inode);
- error = lookup_open(nd, &path, file, op, got_write);
+ dentry = lookup_open(nd, file, op, got_write);
if (open_flag & O_CREAT)
inode_unlock(dir->d_inode);
else
inode_unlock_shared(dir->d_inode);
- if (error)
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
goto out;
+ }
+ path.mnt = nd->path.mnt;
+ path.dentry = dentry;
if (file->f_mode & FMODE_OPENED) {
if ((file->f_mode & FMODE_CREATED) ||
@@ -3380,6 +3373,7 @@ static int do_last(struct nameidata *nd,
will_truncate = false;
audit_inode(nd->name, file->f_path.dentry, 0);
+ dput(dentry);
goto opened;
}
--
2.11.0
From: Al Viro <[email protected]>
... and shift filling struct path to just before the call of
handle_mounts(). All callers of handle_mounts() are
immediately preceded by path->mnt = nd->path.mnt now.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 4946d006ba20..f26af0559abf 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3364,8 +3364,6 @@ static int do_last(struct nameidata *nd,
error = PTR_ERR(dentry);
goto out;
}
- path.mnt = nd->path.mnt;
- path.dentry = dentry;
if (file->f_mode & FMODE_OPENED) {
if ((file->f_mode & FMODE_CREATED) ||
@@ -3382,7 +3380,8 @@ static int do_last(struct nameidata *nd,
open_flag &= ~O_TRUNC;
will_truncate = false;
acc_mode = 0;
- path_to_nameidata(&path, nd);
+ dput(nd->path.dentry);
+ nd->path.dentry = dentry;
goto finish_open_created;
}
@@ -3396,6 +3395,8 @@ static int do_last(struct nameidata *nd,
got_write = false;
}
+ path.mnt = nd->path.mnt;
+ path.dentry = dentry;
error = handle_mounts(&path, nd, &inode, &seq);
if (unlikely(error < 0))
return error;
--
2.11.0
From: Al Viro <[email protected]>
Current calling conventions: -E... on error, 0 on cache miss,
result of handle_mounts(nd, dentry, path, inode, seqp) on
success. Turn that into returning ERR_PTR(-E...), NULL and dentry
resp.; deal with handle_mounts() in the callers. The thing
is, they already do that in cache miss handling case, so we
just need to supply dentry to them and unify the mount traversal
in those cases. Fewer arguments that way, and we get closer
to merging handle_mounts() and step_into().
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 50 ++++++++++++++++++++++++--------------------------
1 file changed, 24 insertions(+), 26 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 3215b0da6e91..4c9b633e1981 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1628,9 +1628,9 @@ static struct dentry *__lookup_hash(const struct qstr *name,
return dentry;
}
-static int lookup_fast(struct nameidata *nd,
- struct path *path, struct inode **inode,
- unsigned *seqp)
+static struct dentry *lookup_fast(struct nameidata *nd,
+ struct inode **inode,
+ unsigned *seqp)
{
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
@@ -1645,8 +1645,8 @@ static int lookup_fast(struct nameidata *nd,
dentry = __d_lookup_rcu(parent, &nd->last, &seq);
if (unlikely(!dentry)) {
if (unlazy_walk(nd))
- return -ECHILD;
- return 0;
+ return ERR_PTR(-ECHILD);
+ return NULL;
}
/*
@@ -1655,7 +1655,7 @@ static int lookup_fast(struct nameidata *nd,
*/
*inode = d_backing_inode(dentry);
if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
- return -ECHILD;
+ return ERR_PTR(-ECHILD);
/*
* This sequence count validates that the parent had no
@@ -1665,30 +1665,30 @@ static int lookup_fast(struct nameidata *nd,
* enough, we can use __read_seqcount_retry here.
*/
if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq)))
- return -ECHILD;
+ return ERR_PTR(-ECHILD);
*seqp = seq;
status = d_revalidate(dentry, nd->flags);
if (likely(status > 0))
- return handle_mounts(nd, dentry, path, inode, seqp);
+ return dentry;
if (unlazy_child(nd, dentry, seq))
- return -ECHILD;
+ return ERR_PTR(-ECHILD);
if (unlikely(status == -ECHILD))
/* we'd been told to redo it in non-rcu mode */
status = d_revalidate(dentry, nd->flags);
} else {
dentry = __d_lookup(parent, &nd->last);
if (unlikely(!dentry))
- return 0;
+ return NULL;
status = d_revalidate(dentry, nd->flags);
}
if (unlikely(status <= 0)) {
if (!status)
d_invalidate(dentry);
dput(dentry);
- return status;
+ return ERR_PTR(status);
}
- return handle_mounts(nd, dentry, path, inode, seqp);
+ return dentry;
}
/* Fast lookup failed, do it the slow way */
@@ -1874,19 +1874,18 @@ static int walk_component(struct nameidata *nd, int flags)
put_link(nd);
return err;
}
- err = lookup_fast(nd, &path, &inode, &seq);
- if (unlikely(err <= 0)) {
- if (err < 0)
- return err;
+ dentry = lookup_fast(nd, &inode, &seq);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+ if (unlikely(!dentry)) {
dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
-
- err = handle_mounts(nd, dentry, &path, &inode, &seq);
- if (unlikely(err < 0))
- return err;
}
+ err = handle_mounts(nd, dentry, &path, &inode, &seq);
+ if (unlikely(err < 0))
+ return err;
return step_into(nd, &path, flags, inode, seq);
}
@@ -3304,13 +3303,12 @@ static int do_last(struct nameidata *nd,
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
- error = lookup_fast(nd, &path, &inode, &seq);
- if (likely(error > 0))
+ dentry = lookup_fast(nd, &inode, &seq);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+ if (likely(dentry))
goto finish_lookup;
- if (error < 0)
- return error;
-
BUG_ON(nd->inode != dir->d_inode);
BUG_ON(nd->flags & LOOKUP_RCU);
} else {
@@ -3385,10 +3383,10 @@ static int do_last(struct nameidata *nd,
got_write = false;
}
+finish_lookup:
error = handle_mounts(nd, dentry, &path, &inode, &seq);
if (unlikely(error < 0))
return error;
-finish_lookup:
error = step_into(nd, &path, 0, inode, seq);
if (unlikely(error))
return error;
--
2.11.0
From: Al Viro <[email protected]>
Tells step_into() not to follow symlinks, regardless of LOOKUP_FOLLOW.
Allows to switch handle_lookup_down() to of step_into(), getting
all follow_managed() and step_into() calls paired.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 4c9b633e1981..fe48a8d00ae5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1827,7 +1827,7 @@ static int pick_link(struct nameidata *nd, struct path *link,
return 1;
}
-enum {WALK_FOLLOW = 1, WALK_MORE = 2};
+enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
/*
* Do we need to follow links? We _really_ want to be able
@@ -1841,7 +1841,8 @@ static inline int step_into(struct nameidata *nd, struct path *path,
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
if (likely(!d_is_symlink(path->dentry)) ||
- !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) {
+ !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW) ||
+ flags & WALK_NOFOLLOW) {
/* not a symlink or should not follow */
path_to_nameidata(path, nd);
nd->inode = inode;
@@ -2363,10 +2364,7 @@ static int handle_lookup_down(struct nameidata *nd)
err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
if (unlikely(err < 0))
return err;
- path_to_nameidata(&path, nd);
- nd->inode = inode;
- nd->seq = seq;
- return 0;
+ return step_into(nd, &path, WALK_NOFOLLOW, inode, seq);
}
/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
--
2.11.0
From: Al Viro <[email protected]>
All callers are equivalent to
path->dentry = dentry;
path->mnt = nd->path.mnt;
err = handle_mounts(path, ...)
Pass dentry as an explicit argument, fill *path in handle_mounts()
itself.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 37 ++++++++++++++++++-------------------
1 file changed, 18 insertions(+), 19 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index f26af0559abf..033f91a72bb5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1385,11 +1385,15 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
!(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
}
-static inline int handle_mounts(struct path *path, struct nameidata *nd,
- struct inode **inode, unsigned int *seqp)
+static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
+ struct path *path, struct inode **inode,
+ unsigned int *seqp)
{
- int ret = follow_managed(path, nd);
+ int ret;
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ ret = follow_managed(path, nd);
if (likely(ret >= 0)) {
*inode = d_backing_inode(path->dentry);
*seqp = 0; /* out of RCU mode, so the value doesn't matter */
@@ -1685,10 +1689,7 @@ static int lookup_fast(struct nameidata *nd,
dput(dentry);
return status;
}
-
- path->mnt = mnt;
- path->dentry = dentry;
- return handle_mounts(path, nd, inode, seqp);
+ return handle_mounts(nd, dentry, path, inode, seqp);
}
/* Fast lookup failed, do it the slow way */
@@ -1859,6 +1860,7 @@ static inline int step_into(struct nameidata *nd, struct path *path,
static int walk_component(struct nameidata *nd, int flags)
{
struct path path;
+ struct dentry *dentry;
struct inode *inode;
unsigned seq;
int err;
@@ -1877,13 +1879,11 @@ static int walk_component(struct nameidata *nd, int flags)
if (unlikely(err <= 0)) {
if (err < 0)
return err;
- path.dentry = lookup_slow(&nd->last, nd->path.dentry,
- nd->flags);
- if (IS_ERR(path.dentry))
- return PTR_ERR(path.dentry);
+ dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
- path.mnt = nd->path.mnt;
- err = handle_mounts(&path, nd, &inode, &seq);
+ err = handle_mounts(nd, dentry, &path, &inode, &seq);
if (unlikely(err < 0))
return err;
}
@@ -2355,7 +2355,7 @@ static inline int lookup_last(struct nameidata *nd)
static int handle_lookup_down(struct nameidata *nd)
{
- struct path path = nd->path;
+ struct path path;
struct inode *inode = nd->inode;
unsigned seq = nd->seq;
int err;
@@ -2366,11 +2366,12 @@ static int handle_lookup_down(struct nameidata *nd)
* at the very beginning of walk, so we lose nothing
* if we simply redo everything in non-RCU mode
*/
+ path = nd->path;
if (unlikely(!__follow_mount_rcu(nd, &path, &inode, &seq)))
return -ECHILD;
} else {
- dget(path.dentry);
- err = handle_mounts(&path, nd, &inode, &seq);
+ dget(nd->path.dentry);
+ err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
if (unlikely(err < 0))
return err;
}
@@ -3395,9 +3396,7 @@ static int do_last(struct nameidata *nd,
got_write = false;
}
- path.mnt = nd->path.mnt;
- path.dentry = dentry;
- error = handle_mounts(&path, nd, &inode, &seq);
+ error = handle_mounts(nd, dentry, &path, &inode, &seq);
if (unlikely(error < 0))
return error;
finish_lookup:
--
2.11.0
From: Al Viro <[email protected]>
1) in case of __follow_mount_rcu() failure, lookup_fast() proceeds
to call unlazy_child() and, should it succeed, handle_mounts().
Note that we have status > 0 (or we wouldn't be calling
__follow_mount_rcu() at all), so all stuff conditional upon
non-positive status won't be even touched.
Consolidate just that sequence after the call of __follow_mount_rcu().
2) calling d_is_negative() and keeping its result is pointless -
we either don't get past checking ->d_seq (and don't use the results of
d_is_negative() at all), or we are guaranteed that ->d_inode and
type bits of ->d_flags had been consistent at the time of d_is_negative()
call. IOW, we could only get to the use of its result if it's
equal to !inode. The same ->d_seq check guarantees that after that point
this CPU won't observe ->d_flags values older than ->d_inode update.
So 'negative' variable is completely pointless these days.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 033f91a72bb5..53e859b80b4c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1631,7 +1631,6 @@ static int lookup_fast(struct nameidata *nd,
*/
if (nd->flags & LOOKUP_RCU) {
unsigned seq;
- bool negative;
dentry = __d_lookup_rcu(parent, &nd->last, &seq);
if (unlikely(!dentry)) {
if (unlazy_walk(nd))
@@ -1644,7 +1643,6 @@ static int lookup_fast(struct nameidata *nd,
* the dentry name information from lookup.
*/
*inode = d_backing_inode(dentry);
- negative = d_is_negative(dentry);
if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
return -ECHILD;
@@ -1665,12 +1663,15 @@ static int lookup_fast(struct nameidata *nd,
* Note: do negative dentry check after revalidation in
* case that drops it.
*/
- if (unlikely(negative))
+ if (unlikely(!inode))
return -ENOENT;
path->mnt = mnt;
path->dentry = dentry;
if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
return 1;
+ if (unlazy_child(nd, dentry, seq))
+ return -ECHILD;
+ return handle_mounts(nd, dentry, path, inode, seqp);
}
if (unlazy_child(nd, dentry, seq))
return -ECHILD;
--
2.11.0
From: Al Viro <[email protected]>
... and make the callers of __follow_mount_rcu() use handle_mounts().
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 46 +++++++++++++++++-----------------------------
1 file changed, 17 insertions(+), 29 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 53e859b80b4c..3215b0da6e91 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1393,6 +1393,18 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
path->mnt = nd->path.mnt;
path->dentry = dentry;
+ if (nd->flags & LOOKUP_RCU) {
+ unsigned int seq = *seqp;
+ if (unlikely(!*inode))
+ return -ENOENT;
+ if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
+ return 1;
+ if (unlazy_child(nd, dentry, seq))
+ return -ECHILD;
+ // *path might've been clobbered by __follow_mount_rcu()
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ }
ret = follow_managed(path, nd);
if (likely(ret >= 0)) {
*inode = d_backing_inode(path->dentry);
@@ -1620,7 +1632,6 @@ static int lookup_fast(struct nameidata *nd,
struct path *path, struct inode **inode,
unsigned *seqp)
{
- struct vfsmount *mnt = nd->path.mnt;
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
@@ -1658,21 +1669,8 @@ static int lookup_fast(struct nameidata *nd,
*seqp = seq;
status = d_revalidate(dentry, nd->flags);
- if (likely(status > 0)) {
- /*
- * Note: do negative dentry check after revalidation in
- * case that drops it.
- */
- if (unlikely(!inode))
- return -ENOENT;
- path->mnt = mnt;
- path->dentry = dentry;
- if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
- return 1;
- if (unlazy_child(nd, dentry, seq))
- return -ECHILD;
+ if (likely(status > 0))
return handle_mounts(nd, dentry, path, inode, seqp);
- }
if (unlazy_child(nd, dentry, seq))
return -ECHILD;
if (unlikely(status == -ECHILD))
@@ -2361,21 +2359,11 @@ static int handle_lookup_down(struct nameidata *nd)
unsigned seq = nd->seq;
int err;
- if (nd->flags & LOOKUP_RCU) {
- /*
- * don't bother with unlazy_walk on failure - we are
- * at the very beginning of walk, so we lose nothing
- * if we simply redo everything in non-RCU mode
- */
- path = nd->path;
- if (unlikely(!__follow_mount_rcu(nd, &path, &inode, &seq)))
- return -ECHILD;
- } else {
+ if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
- if (unlikely(err < 0))
- return err;
- }
+ err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
+ if (unlikely(err < 0))
+ return err;
path_to_nameidata(&path, nd);
nd->inode = inode;
nd->seq = seq;
--
2.11.0
From: Al Viro <[email protected]>
New LOOKUP flag, telling path_lookupat() to act as path_mountpointat().
IOW, traverse mounts at the final point and skip revalidation of the
location where it ends up.
Signed-off-by: Al Viro <[email protected]>
---
fs/autofs/dev-ioctl.c | 6 ++--
fs/internal.h | 1 -
fs/namei.c | 89 ++++-----------------------------------------------
fs/namespace.c | 4 +--
include/linux/namei.h | 2 +-
5 files changed, 12 insertions(+), 90 deletions(-)
diff --git a/fs/autofs/dev-ioctl.c b/fs/autofs/dev-ioctl.c
index a3cdb0036c5d..f3a0f412b43b 100644
--- a/fs/autofs/dev-ioctl.c
+++ b/fs/autofs/dev-ioctl.c
@@ -186,7 +186,7 @@ static int find_autofs_mount(const char *pathname,
struct path path;
int err;
- err = kern_path_mountpoint(AT_FDCWD, pathname, &path, 0);
+ err = kern_path(pathname, LOOKUP_MOUNTPOINT, &path);
if (err)
return err;
err = -ENOENT;
@@ -519,8 +519,8 @@ static int autofs_dev_ioctl_ismountpoint(struct file *fp,
if (!fp || param->ioctlfd == -1) {
if (autofs_type_any(type))
- err = kern_path_mountpoint(AT_FDCWD,
- name, &path, LOOKUP_FOLLOW);
+ err = kern_path(name, LOOKUP_FOLLOW | LOOKUP_MOUNTPOINT,
+ &path);
else
err = find_autofs_mount(name, &path,
test_by_type, &type);
diff --git a/fs/internal.h b/fs/internal.h
index f3f280b952a3..b108a8eb75ca 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -61,7 +61,6 @@ extern int finish_clean_context(struct fs_context *fc);
*/
extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
struct path *path, struct path *root);
-extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
const char *, unsigned int, struct path *);
long do_mknodat(int dfd, const char __user *filename, umode_t mode,
diff --git a/fs/namei.c b/fs/namei.c
index 28835ee7168a..6f1f46b931a6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2382,6 +2382,10 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path
if (!err && nd->flags & LOOKUP_DIRECTORY)
if (!d_can_lookup(nd->path.dentry))
err = -ENOTDIR;
+ if (!err && unlikely(nd->flags & LOOKUP_MOUNTPOINT)) {
+ err = handle_lookup_down(nd);
+ nd->flags &= ~LOOKUP_JUMPED; // no d_weak_revalidate(), please...
+ }
if (!err) {
*path = nd->path;
nd->path.mnt = NULL;
@@ -2410,7 +2414,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
retval = path_lookupat(&nd, flags | LOOKUP_REVAL, path);
if (likely(!retval))
- audit_inode(name, path->dentry, 0);
+ audit_inode(name, path->dentry,
+ flags & LOOKUP_MOUNTPOINT ? AUDIT_INODE_NOEVAL : 0);
restore_nameidata();
putname(name);
return retval;
@@ -2688,88 +2693,6 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
}
EXPORT_SYMBOL(user_path_at_empty);
-/**
- * path_mountpoint - look up a path to be umounted
- * @nd: lookup context
- * @flags: lookup flags
- * @path: pointer to container for result
- *
- * Look up the given name, but don't attempt to revalidate the last component.
- * Returns 0 and "path" will be valid on success; Returns error otherwise.
- */
-static int
-path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path)
-{
- const char *s = path_init(nd, flags);
- int err;
-
- while (!(err = link_path_walk(s, nd)) &&
- (err = lookup_last(nd)) > 0) {
- s = trailing_symlink(nd);
- }
- if (!err && (nd->flags & LOOKUP_RCU))
- err = unlazy_walk(nd);
- if (!err)
- err = handle_lookup_down(nd);
- if (!err) {
- *path = nd->path;
- nd->path.mnt = NULL;
- nd->path.dentry = NULL;
- }
- terminate_walk(nd);
- return err;
-}
-
-static int
-filename_mountpoint(int dfd, struct filename *name, struct path *path,
- unsigned int flags)
-{
- struct nameidata nd;
- int error;
- if (IS_ERR(name))
- return PTR_ERR(name);
- set_nameidata(&nd, dfd, name);
- error = path_mountpoint(&nd, flags | LOOKUP_RCU, path);
- if (unlikely(error == -ECHILD))
- error = path_mountpoint(&nd, flags, path);
- if (unlikely(error == -ESTALE))
- error = path_mountpoint(&nd, flags | LOOKUP_REVAL, path);
- if (likely(!error))
- audit_inode(name, path->dentry, AUDIT_INODE_NOEVAL);
- restore_nameidata();
- putname(name);
- return error;
-}
-
-/**
- * user_path_mountpoint_at - lookup a path from userland in order to umount it
- * @dfd: directory file descriptor
- * @name: pathname from userland
- * @flags: lookup flags
- * @path: pointer to container to hold result
- *
- * A umount is a special case for path walking. We're not actually interested
- * in the inode in this situation, and ESTALE errors can be a problem. We
- * simply want track down the dentry and vfsmount attached at the mountpoint
- * and avoid revalidating the last component.
- *
- * Returns 0 and populates "path" on success.
- */
-int
-user_path_mountpoint_at(int dfd, const char __user *name, unsigned int flags,
- struct path *path)
-{
- return filename_mountpoint(dfd, getname(name), path, flags);
-}
-
-int
-kern_path_mountpoint(int dfd, const char *name, struct path *path,
- unsigned int flags)
-{
- return filename_mountpoint(dfd, getname_kernel(name), path, flags);
-}
-EXPORT_SYMBOL(kern_path_mountpoint);
-
int __check_sticky(struct inode *dir, struct inode *inode)
{
kuid_t fsuid = current_fsuid();
diff --git a/fs/namespace.c b/fs/namespace.c
index a9e556224945..ef3f2a33992c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1669,7 +1669,7 @@ int ksys_umount(char __user *name, int flags)
struct path path;
struct mount *mnt;
int retval;
- int lookup_flags = 0;
+ int lookup_flags = LOOKUP_MOUNTPOINT;
if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
return -EINVAL;
@@ -1680,7 +1680,7 @@ int ksys_umount(char __user *name, int flags)
if (!(flags & UMOUNT_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
- retval = user_path_mountpoint_at(AT_FDCWD, name, lookup_flags, &path);
+ retval = user_path_at(AT_FDCWD, name, lookup_flags, &path);
if (retval)
goto out;
mnt = real_mount(path.mnt);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 0dd980d7318f..d9576a051808 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -23,6 +23,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_AUTOMOUNT 0x0004 /* force terminal automount */
#define LOOKUP_EMPTY 0x4000 /* accept empty path [user_... only] */
#define LOOKUP_DOWN 0x8000 /* follow mounts in the starting point */
+#define LOOKUP_MOUNTPOINT 0x0080 /* follow mounts in the end */
#define LOOKUP_REVAL 0x0020 /* tell ->d_revalidate() to trust no cache */
#define LOOKUP_RCU 0x0040 /* RCU pathwalk mode; semi-internal */
@@ -64,7 +65,6 @@ extern struct dentry *kern_path_create(int, const char *, struct path *, unsigne
extern struct dentry *user_path_create(int, const char __user *, struct path *, unsigned int);
extern void done_path_create(struct path *, struct dentry *);
extern struct dentry *kern_path_locked(const char *, struct path *);
-extern int kern_path_mountpoint(int, const char *, struct path *, unsigned int);
extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int);
extern struct dentry *lookup_one_len(const char *, struct dentry *, int);
--
2.11.0
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 6f1f46b931a6..3eed5784942a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -858,13 +858,6 @@ static int set_root(struct nameidata *nd)
return 0;
}
-static void path_put_conditional(struct path *path, struct nameidata *nd)
-{
- dput(path->dentry);
- if (path->mnt != nd->path.mnt)
- mntput(path->mnt);
-}
-
static inline void path_to_nameidata(const struct path *path,
struct nameidata *nd)
{
@@ -1312,8 +1305,11 @@ static int follow_managed(struct path *path, struct nameidata *nd)
ret = 1;
if (ret > 0 && unlikely(d_flags_negative(flags)))
ret = -ENOENT;
- if (unlikely(ret < 0))
- path_put_conditional(path, nd);
+ if (unlikely(ret < 0)) {
+ dput(path->dentry);
+ if (path->mnt != nd->path.mnt)
+ mntput(path->mnt);
+ }
return ret;
}
--
2.11.0
From: Al Viro <[email protected]>
The following is true:
* calls of handle_mounts() and step_into() are always
paired in sequences like
err = handle_mounts(nd, dentry, &path, &inode, &seq);
if (unlikely(err < 0))
return err;
err = step_into(nd, &path, flags, inode, seq);
* in all such sequences path is uninitialized before and
unused after this pair of calls
* in all such sequences inode and seq are unused afterwards.
So the call of handle_mounts() can be shifted inside step_into(),
turning 'path' into a local variable in the combined function.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 41 +++++++++++++++--------------------------
1 file changed, 15 insertions(+), 26 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index fe48a8d00ae5..28835ee7168a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1835,31 +1835,35 @@ enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
* so we keep a cache of "no, this doesn't need follow_link"
* for the common case.
*/
-static inline int step_into(struct nameidata *nd, struct path *path,
- int flags, struct inode *inode, unsigned seq)
+static int step_into(struct nameidata *nd, int flags,
+ struct dentry *dentry, struct inode *inode, unsigned seq)
{
+ struct path path;
+ int err = handle_mounts(nd, dentry, &path, &inode, &seq);
+
+ if (err < 0)
+ return err;
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- if (likely(!d_is_symlink(path->dentry)) ||
+ if (likely(!d_is_symlink(path.dentry)) ||
!(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW) ||
flags & WALK_NOFOLLOW) {
/* not a symlink or should not follow */
- path_to_nameidata(path, nd);
+ path_to_nameidata(&path, nd);
nd->inode = inode;
nd->seq = seq;
return 0;
}
/* make sure that d_is_symlink above matches inode */
if (nd->flags & LOOKUP_RCU) {
- if (read_seqcount_retry(&path->dentry->d_seq, seq))
+ if (read_seqcount_retry(&path.dentry->d_seq, seq))
return -ECHILD;
}
- return pick_link(nd, path, inode, seq);
+ return pick_link(nd, &path, inode, seq);
}
static int walk_component(struct nameidata *nd, int flags)
{
- struct path path;
struct dentry *dentry;
struct inode *inode;
unsigned seq;
@@ -1883,11 +1887,7 @@ static int walk_component(struct nameidata *nd, int flags)
if (IS_ERR(dentry))
return PTR_ERR(dentry);
}
-
- err = handle_mounts(nd, dentry, &path, &inode, &seq);
- if (unlikely(err < 0))
- return err;
- return step_into(nd, &path, flags, inode, seq);
+ return step_into(nd, flags, dentry, inode, seq);
}
/*
@@ -2354,17 +2354,10 @@ static inline int lookup_last(struct nameidata *nd)
static int handle_lookup_down(struct nameidata *nd)
{
- struct path path;
- struct inode *inode = nd->inode;
- unsigned seq = nd->seq;
- int err;
-
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
- if (unlikely(err < 0))
- return err;
- return step_into(nd, &path, WALK_NOFOLLOW, inode, seq);
+ return step_into(nd, WALK_NOFOLLOW,
+ nd->path.dentry, nd->inode, nd->seq);
}
/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3283,7 +3276,6 @@ static int do_last(struct nameidata *nd,
int acc_mode = op->acc_mode;
unsigned seq;
struct inode *inode;
- struct path path;
struct dentry *dentry;
int error;
@@ -3382,10 +3374,7 @@ static int do_last(struct nameidata *nd,
}
finish_lookup:
- error = handle_mounts(nd, dentry, &path, &inode, &seq);
- if (unlikely(error < 0))
- return error;
- error = step_into(nd, &path, 0, inode, seq);
+ error = step_into(nd, 0, dentry, inode, seq);
if (unlikely(error))
return error;
--
2.11.0
From: Al Viro <[email protected]>
After a pure jump ("/" or procfs-style symlink) we don't need to
hold the link anymore. link_path_walk() dropped it if such case
had been detected, lookup_last/do_last() (i.e. old trailing_symlink())
left it on the stack - it ended up calling terminate_walk() shortly
anyway, which would've purged the entire stack.
Do it in get_link() itself instead. Simpler logics that way...
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 888b1e5b994e..46cd3e5cb052 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1154,7 +1154,9 @@ const char *get_link(struct nameidata *nd)
} else {
res = get(dentry, inode, &last->done);
}
- if (IS_ERR_OR_NULL(res))
+ if (!res)
+ goto all_done;
+ if (IS_ERR(res))
return res;
}
if (*res == '/') {
@@ -1164,9 +1166,11 @@ const char *get_link(struct nameidata *nd)
while (unlikely(*++res == '/'))
;
}
- if (!*res)
- res = NULL;
- return res;
+ if (*res)
+ return res;
+all_done: // pure jump
+ put_link(nd);
+ return NULL;
}
/*
@@ -2211,11 +2215,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
if (IS_ERR(s))
return PTR_ERR(s);
- err = 0;
- if (unlikely(!s)) {
- /* jumped */
- put_link(nd);
- } else {
+ if (likely(s)) {
nd->stack[nd->depth - 1].name = name;
name = s;
continue;
--
2.11.0
From: Al Viro <[email protected]>
Move restoring LOOKUP_PARENT and zeroing nd->stack.name[0] past
the call of get_link() (nothing _currently_ uses them in there).
That allows to moved the call of may_follow_link() into get_link()
as well, since now the presence of LOOKUP_PARENT distinguishes
the callers from each other (link_path_walk() has it, trailing_symlink()
doesn't).
Preparations for folding trailing_symlink() into callers (lookup_last()
and do_last()) and changing the calling conventions of those. Next
stage after that will have get_link() call migrate into walk_component(),
then - into step_into(). It's tricky enough to warrant doing that
in stages, unfortunately...
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 3eed5784942a..3594f6a4998b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1115,6 +1115,12 @@ const char *get_link(struct nameidata *nd)
int error;
const char *res;
+ if (!(nd->flags & LOOKUP_PARENT)) {
+ error = may_follow_link(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ }
+
if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
return ERR_PTR(-ELOOP);
@@ -2329,13 +2335,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
static const char *trailing_symlink(struct nameidata *nd)
{
- const char *s;
- int error = may_follow_link(nd);
- if (unlikely(error))
- return ERR_PTR(error);
+ const char *s = get_link(nd);
nd->flags |= LOOKUP_PARENT;
nd->stack[0].name = NULL;
- s = get_link(nd);
return s ? s : "";
}
--
2.11.0
From: Al Viro <[email protected]>
Move the call of get_link() into walk_component(). Change the
calling conventions for walk_component() to returning the link
body to follow (if any).
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 60 +++++++++++++++++++++++++++---------------------------------
1 file changed, 27 insertions(+), 33 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 46cd3e5cb052..09e9f9969fd3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1868,7 +1868,7 @@ static int step_into(struct nameidata *nd, int flags,
return pick_link(nd, &path, inode, seq);
}
-static int walk_component(struct nameidata *nd, int flags)
+static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
struct inode *inode;
@@ -1883,17 +1883,23 @@ static int walk_component(struct nameidata *nd, int flags)
err = handle_dots(nd, nd->last_type);
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- return err;
+ return ERR_PTR(err);
}
dentry = lookup_fast(nd, &inode, &seq);
if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ return ERR_CAST(dentry);
if (unlikely(!dentry)) {
dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ return ERR_CAST(dentry);
}
- return step_into(nd, flags, dentry, inode, seq);
+ err = step_into(nd, flags, dentry, inode, seq);
+ if (!err)
+ return NULL;
+ else if (err > 0)
+ return get_link(nd);
+ else
+ return ERR_PTR(err);
}
/*
@@ -2145,6 +2151,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
/* At this point we know we have a real path component. */
for(;;) {
+ const char *link;
u64 hash_len;
int type;
@@ -2202,24 +2209,18 @@ static int link_path_walk(const char *name, struct nameidata *nd)
if (!name)
return 0;
/* last component of nested symlink */
- err = walk_component(nd, WALK_FOLLOW);
+ link = walk_component(nd, WALK_FOLLOW);
} else {
/* not the last component */
- err = walk_component(nd, WALK_FOLLOW | WALK_MORE);
+ link = walk_component(nd, WALK_FOLLOW | WALK_MORE);
}
- if (err < 0)
- return err;
-
- if (err) {
- const char *s = get_link(nd);
-
- if (IS_ERR(s))
- return PTR_ERR(s);
- if (likely(s)) {
- nd->stack[nd->depth - 1].name = name;
- name = s;
- continue;
- }
+ if (unlikely(link)) {
+ if (IS_ERR(link))
+ return PTR_ERR(link);
+ /* a symlink to follow */
+ nd->stack[nd->depth - 1].name = name;
+ name = link;
+ continue;
}
if (unlikely(!d_can_lookup(nd->path.dentry))) {
if (nd->flags & LOOKUP_RCU) {
@@ -2335,24 +2336,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
static inline const char *lookup_last(struct nameidata *nd)
{
- int err;
+ const char *link;
if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
nd->flags &= ~LOOKUP_PARENT;
- err = walk_component(nd, 0);
- if (unlikely(err)) {
- const char *s;
- if (err < 0)
- return PTR_ERR(err);
- s = get_link(nd);
- if (s) {
- nd->flags |= LOOKUP_PARENT;
- nd->stack[0].name = NULL;
- return s;
- }
+ link = walk_component(nd, 0);
+ if (link) {
+ nd->flags |= LOOKUP_PARENT;
+ nd->stack[0].name = NULL;
}
- return NULL;
+ return link;
}
static int handle_lookup_down(struct nameidata *nd)
--
2.11.0
From: Al Viro <[email protected]>
Fold trailing_symlink() into lookup_last() and do_last(), change
the calling conventions of those two. Rules change:
success, we are done => NULL instead of 0
error => ERR_PTR(-E...) instead of -E...
got a symlink to follow => return the path to be followed instead of 1
The loops calling those (in path_lookupat() and path_openat()) adjusted.
A subtle change of control flow here: originally a pure-jump trailing
symlink ("/" or procfs one) would've passed through the upper level
loop once more, with "" for path to traverse. That would've brought
us back to the lookup_last/do_last entry and we would've hit LAST_BIND
case (LAST_BIND left from get_link() called by trailing_symlink())
and pretty much skip to the point right after where we'd left the
sucker back when we picked that trailing symlink.
Now we don't bother with that extra pass through the upper level
loop - if get_link() says "I've just done a pure jump, nothing
else to do", we just treat that as non-symlink case.
Boilerplate added on that step will go away shortly - it'll migrate
into walk_component() and then to step_into(), collapsing into the
change of calling conventions for those.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 68 ++++++++++++++++++++++++++++++++++++--------------------------
1 file changed, 40 insertions(+), 28 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 3594f6a4998b..888b1e5b994e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2333,21 +2333,26 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
return s;
}
-static const char *trailing_symlink(struct nameidata *nd)
-{
- const char *s = get_link(nd);
- nd->flags |= LOOKUP_PARENT;
- nd->stack[0].name = NULL;
- return s ? s : "";
-}
-
-static inline int lookup_last(struct nameidata *nd)
+static inline const char *lookup_last(struct nameidata *nd)
{
+ int err;
if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
nd->flags &= ~LOOKUP_PARENT;
- return walk_component(nd, 0);
+ err = walk_component(nd, 0);
+ if (unlikely(err)) {
+ const char *s;
+ if (err < 0)
+ return PTR_ERR(err);
+ s = get_link(nd);
+ if (s) {
+ nd->flags |= LOOKUP_PARENT;
+ nd->stack[0].name = NULL;
+ return s;
+ }
+ }
+ return NULL;
}
static int handle_lookup_down(struct nameidata *nd)
@@ -2370,10 +2375,9 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path
s = ERR_PTR(err);
}
- while (!(err = link_path_walk(s, nd))
- && ((err = lookup_last(nd)) > 0)) {
- s = trailing_symlink(nd);
- }
+ while (!(err = link_path_walk(s, nd)) &&
+ (s = lookup_last(nd)) != NULL)
+ ;
if (!err)
err = complete_walk(nd);
@@ -3185,7 +3189,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
/*
* Handle the last step of open()
*/
-static int do_last(struct nameidata *nd,
+static const char *do_last(struct nameidata *nd,
struct file *file, const struct open_flags *op)
{
struct dentry *dir = nd->path.dentry;
@@ -3206,7 +3210,7 @@ static int do_last(struct nameidata *nd,
if (nd->last_type != LAST_NORM) {
error = handle_dots(nd, nd->last_type);
if (unlikely(error))
- return error;
+ return ERR_PTR(error);
goto finish_open;
}
@@ -3216,7 +3220,7 @@ static int do_last(struct nameidata *nd,
/* we _can_ be in RCU mode here */
dentry = lookup_fast(nd, &inode, &seq);
if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ return ERR_CAST(dentry);
if (likely(dentry))
goto finish_lookup;
@@ -3231,12 +3235,12 @@ static int do_last(struct nameidata *nd,
*/
error = complete_walk(nd);
if (error)
- return error;
+ return ERR_PTR(error);
audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
/* trailing slashes? */
if (unlikely(nd->last.name[nd->last.len]))
- return -EISDIR;
+ return ERR_PTR(-EISDIR);
}
if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
@@ -3296,18 +3300,28 @@ static int do_last(struct nameidata *nd,
finish_lookup:
error = step_into(nd, 0, dentry, inode, seq);
- if (unlikely(error))
- return error;
+ if (unlikely(error)) {
+ const char *s;
+ if (error < 0)
+ return ERR_PTR(error);
+ s = get_link(nd);
+ if (s) {
+ nd->flags |= LOOKUP_PARENT;
+ nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
+ nd->stack[0].name = NULL;
+ return s;
+ }
+ }
if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
audit_inode(nd->name, nd->path.dentry, 0);
- return -EEXIST;
+ return ERR_PTR(-EEXIST);
}
finish_open:
/* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */
error = complete_walk(nd);
if (error)
- return error;
+ return ERR_PTR(error);
audit_inode(nd->name, nd->path.dentry, 0);
if (open_flag & O_CREAT) {
error = -EISDIR;
@@ -3349,7 +3363,7 @@ static int do_last(struct nameidata *nd,
}
if (got_write)
mnt_drop_write(nd->path.mnt);
- return error;
+ return ERR_PTR(error);
}
struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode, int open_flag)
@@ -3452,10 +3466,8 @@ static struct file *path_openat(struct nameidata *nd,
} else {
const char *s = path_init(nd, flags);
while (!(error = link_path_walk(s, nd)) &&
- (error = do_last(nd, file, op)) > 0) {
- nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
- s = trailing_symlink(nd);
- }
+ (s = do_last(nd, file, op)) != NULL)
+ ;
terminate_walk(nd);
}
if (likely(!error)) {
--
2.11.0
From: Al Viro <[email protected]>
move get_link() call into step_into().
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 45 +++++++++++++++++++--------------------------
1 file changed, 19 insertions(+), 26 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 09e9f9969fd3..1e8548a547ff 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1841,14 +1841,14 @@ enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
* so we keep a cache of "no, this doesn't need follow_link"
* for the common case.
*/
-static int step_into(struct nameidata *nd, int flags,
+static const char *step_into(struct nameidata *nd, int flags,
struct dentry *dentry, struct inode *inode, unsigned seq)
{
struct path path;
int err = handle_mounts(nd, dentry, &path, &inode, &seq);
if (err < 0)
- return err;
+ return ERR_PTR(err);
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
if (likely(!d_is_symlink(path.dentry)) ||
@@ -1858,14 +1858,18 @@ static int step_into(struct nameidata *nd, int flags,
path_to_nameidata(&path, nd);
nd->inode = inode;
nd->seq = seq;
- return 0;
+ return NULL;
}
/* make sure that d_is_symlink above matches inode */
if (nd->flags & LOOKUP_RCU) {
if (read_seqcount_retry(&path.dentry->d_seq, seq))
- return -ECHILD;
+ return ERR_PTR(-ECHILD);
}
- return pick_link(nd, &path, inode, seq);
+ err = pick_link(nd, &path, inode, seq);
+ if (err > 0)
+ return get_link(nd);
+ else
+ return ERR_PTR(err);
}
static const char *walk_component(struct nameidata *nd, int flags)
@@ -1893,13 +1897,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
if (IS_ERR(dentry))
return ERR_CAST(dentry);
}
- err = step_into(nd, flags, dentry, inode, seq);
- if (!err)
- return NULL;
- else if (err > 0)
- return get_link(nd);
- else
- return ERR_PTR(err);
+ return step_into(nd, flags, dentry, inode, seq);
}
/*
@@ -2353,8 +2351,8 @@ static int handle_lookup_down(struct nameidata *nd)
{
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- return step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq);
+ return PTR_ERR(step_into(nd, WALK_NOFOLLOW,
+ nd->path.dentry, nd->inode, nd->seq));
}
/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3196,6 +3194,7 @@ static const char *do_last(struct nameidata *nd,
unsigned seq;
struct inode *inode;
struct dentry *dentry;
+ const char *link;
int error;
nd->flags &= ~LOOKUP_PARENT;
@@ -3293,18 +3292,12 @@ static const char *do_last(struct nameidata *nd,
}
finish_lookup:
- error = step_into(nd, 0, dentry, inode, seq);
- if (unlikely(error)) {
- const char *s;
- if (error < 0)
- return ERR_PTR(error);
- s = get_link(nd);
- if (s) {
- nd->flags |= LOOKUP_PARENT;
- nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
- nd->stack[0].name = NULL;
- return s;
- }
+ link = step_into(nd, 0, dentry, inode, seq);
+ if (unlikely(link)) {
+ nd->flags |= LOOKUP_PARENT;
+ nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
+ nd->stack[0].name = NULL;
+ return link;
}
if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
--
2.11.0
From: Al Viro <[email protected]>
move the only remaining call of get_link() into pick_link()
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 1e8548a547ff..fef2c447219d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1793,14 +1793,14 @@ static inline int handle_dots(struct nameidata *nd, int type)
return 0;
}
-static int pick_link(struct nameidata *nd, struct path *link,
+static const char *pick_link(struct nameidata *nd, struct path *link,
struct inode *inode, unsigned seq)
{
int error;
struct saved *last;
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
path_to_nameidata(link, nd);
- return -ELOOP;
+ return ERR_PTR(-ELOOP);
}
if (!(nd->flags & LOOKUP_RCU)) {
if (link->mnt == nd->path.mnt)
@@ -1821,7 +1821,7 @@ static int pick_link(struct nameidata *nd, struct path *link,
}
if (error) {
path_put(link);
- return error;
+ return ERR_PTR(error);
}
}
@@ -1830,7 +1830,7 @@ static int pick_link(struct nameidata *nd, struct path *link,
clear_delayed_call(&last->done);
nd->link_inode = inode;
last->seq = seq;
- return 1;
+ return get_link(nd);
}
enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
@@ -1865,11 +1865,7 @@ static const char *step_into(struct nameidata *nd, int flags,
if (read_seqcount_retry(&path.dentry->d_seq, seq))
return ERR_PTR(-ECHILD);
}
- err = pick_link(nd, &path, inode, seq);
- if (err > 0)
- return get_link(nd);
- else
- return ERR_PTR(err);
+ return pick_link(nd, &path, inode, seq);
}
static const char *walk_component(struct nameidata *nd, int flags)
--
2.11.0
From: Al Viro <[email protected]>
kill nd->link_inode, while we are at it
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 135 ++++++++++++++++++++++++++++---------------------------------
1 file changed, 61 insertions(+), 74 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index fef2c447219d..9b6c3e3edd75 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -503,7 +503,6 @@ struct nameidata {
} *stack, internal[EMBEDDED_LEVELS];
struct filename *name;
struct nameidata *saved;
- struct inode *link_inode;
unsigned root_seq;
int dfd;
} __randomize_layout;
@@ -962,9 +961,8 @@ int sysctl_protected_regular __read_mostly;
*
* Returns 0 if following the symlink is allowed, -ve on error.
*/
-static inline int may_follow_link(struct nameidata *nd)
+static inline int may_follow_link(struct nameidata *nd, const struct inode *inode)
{
- const struct inode *inode;
const struct inode *parent;
kuid_t puid;
@@ -972,7 +970,6 @@ static inline int may_follow_link(struct nameidata *nd)
return 0;
/* Allowed if owner and follower match. */
- inode = nd->link_inode;
if (uid_eq(current_cred()->fsuid, inode->i_uid))
return 0;
@@ -1106,73 +1103,6 @@ static int may_create_in_sticky(umode_t dir_mode, kuid_t dir_uid,
return 0;
}
-static __always_inline
-const char *get_link(struct nameidata *nd)
-{
- struct saved *last = nd->stack + nd->depth - 1;
- struct dentry *dentry = last->link.dentry;
- struct inode *inode = nd->link_inode;
- int error;
- const char *res;
-
- if (!(nd->flags & LOOKUP_PARENT)) {
- error = may_follow_link(nd);
- if (unlikely(error))
- return ERR_PTR(error);
- }
-
- if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
- return ERR_PTR(-ELOOP);
-
- if (!(nd->flags & LOOKUP_RCU)) {
- touch_atime(&last->link);
- cond_resched();
- } else if (atime_needs_update(&last->link, inode)) {
- if (unlikely(unlazy_walk(nd)))
- return ERR_PTR(-ECHILD);
- touch_atime(&last->link);
- }
-
- error = security_inode_follow_link(dentry, inode,
- nd->flags & LOOKUP_RCU);
- if (unlikely(error))
- return ERR_PTR(error);
-
- nd->last_type = LAST_BIND;
- res = READ_ONCE(inode->i_link);
- if (!res) {
- const char * (*get)(struct dentry *, struct inode *,
- struct delayed_call *);
- get = inode->i_op->get_link;
- if (nd->flags & LOOKUP_RCU) {
- res = get(NULL, inode, &last->done);
- if (res == ERR_PTR(-ECHILD)) {
- if (unlikely(unlazy_walk(nd)))
- return ERR_PTR(-ECHILD);
- res = get(dentry, inode, &last->done);
- }
- } else {
- res = get(dentry, inode, &last->done);
- }
- if (!res)
- goto all_done;
- if (IS_ERR(res))
- return res;
- }
- if (*res == '/') {
- error = nd_jump_root(nd);
- if (unlikely(error))
- return ERR_PTR(error);
- while (unlikely(*++res == '/'))
- ;
- }
- if (*res)
- return res;
-all_done: // pure jump
- put_link(nd);
- return NULL;
-}
-
/*
* follow_up - Find the mountpoint of path's vfsmount
*
@@ -1796,8 +1726,10 @@ static inline int handle_dots(struct nameidata *nd, int type)
static const char *pick_link(struct nameidata *nd, struct path *link,
struct inode *inode, unsigned seq)
{
- int error;
struct saved *last;
+ const char *res;
+ int error;
+
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
path_to_nameidata(link, nd);
return ERR_PTR(-ELOOP);
@@ -1828,9 +1760,64 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
last = nd->stack + nd->depth++;
last->link = *link;
clear_delayed_call(&last->done);
- nd->link_inode = inode;
last->seq = seq;
- return get_link(nd);
+
+ if (!(nd->flags & LOOKUP_PARENT)) {
+ error = may_follow_link(nd, inode);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ }
+
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ return ERR_PTR(-ELOOP);
+
+ if (!(nd->flags & LOOKUP_RCU)) {
+ touch_atime(&last->link);
+ cond_resched();
+ } else if (atime_needs_update(&last->link, inode)) {
+ if (unlikely(unlazy_walk(nd)))
+ return ERR_PTR(-ECHILD);
+ touch_atime(&last->link);
+ }
+
+ error = security_inode_follow_link(link->dentry, inode,
+ nd->flags & LOOKUP_RCU);
+ if (unlikely(error))
+ return ERR_PTR(error);
+
+ nd->last_type = LAST_BIND;
+ res = READ_ONCE(inode->i_link);
+ if (!res) {
+ const char * (*get)(struct dentry *, struct inode *,
+ struct delayed_call *);
+ get = inode->i_op->get_link;
+ if (nd->flags & LOOKUP_RCU) {
+ res = get(NULL, inode, &last->done);
+ if (res == ERR_PTR(-ECHILD)) {
+ if (unlikely(unlazy_walk(nd)))
+ return ERR_PTR(-ECHILD);
+ res = get(link->dentry, inode, &last->done);
+ }
+ } else {
+ res = get(link->dentry, inode, &last->done);
+ }
+ if (!res)
+ goto all_done;
+ if (IS_ERR(res))
+ return res;
+ }
+ if (*res == '/') {
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ while (unlikely(*++res == '/'))
+ ;
+ }
+ if (*res)
+ return res;
+all_done: // pure jump
+ put_link(nd);
+ return NULL;
}
enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
--
2.11.0
From: Al Viro <[email protected]>
make the loop more similar to that in follow_managed(), with
explicit tracking of flags, etc.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 70 +++++++++++++++++++++++++++++++-------------------------------
1 file changed, 35 insertions(+), 35 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 9b6c3e3edd75..e83071d25fae 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1269,12 +1269,6 @@ int follow_down_one(struct path *path)
}
EXPORT_SYMBOL(follow_down_one);
-static inline int managed_dentry_rcu(const struct path *path)
-{
- return (path->dentry->d_flags & DCACHE_MANAGE_TRANSIT) ?
- path->dentry->d_op->d_manage(path, true) : 0;
-}
-
/*
* Try to skip to top of mountpoint pile in rcuwalk mode. Fail if
* we meet a managed dentry that would need blocking.
@@ -1282,43 +1276,49 @@ static inline int managed_dentry_rcu(const struct path *path)
static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
struct inode **inode, unsigned *seqp)
{
+ struct dentry *dentry = path->dentry;
+ unsigned int flags = dentry->d_flags;
+
+ if (likely(!(flags & DCACHE_MANAGED_DENTRY)))
+ return true;
+
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return false;
+
for (;;) {
- struct mount *mounted;
/*
* Don't forget we might have a non-mountpoint managed dentry
* that wants to block transit.
*/
- switch (managed_dentry_rcu(path)) {
- case -ECHILD:
- default:
- return false;
- case -EISDIR:
- return true;
- case 0:
- break;
+ if (unlikely(flags & DCACHE_MANAGE_TRANSIT)) {
+ int res = dentry->d_op->d_manage(path, true);
+ if (res)
+ return res == -EISDIR;
+ flags = dentry->d_flags;
}
- if (!d_mountpoint(path->dentry))
- return !(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
-
- mounted = __lookup_mnt(path->mnt, path->dentry);
- if (!mounted)
- break;
- if (unlikely(nd->flags & LOOKUP_NO_XDEV))
- return false;
- path->mnt = &mounted->mnt;
- path->dentry = mounted->mnt.mnt_root;
- nd->flags |= LOOKUP_JUMPED;
- *seqp = read_seqcount_begin(&path->dentry->d_seq);
- /*
- * Update the inode too. We don't need to re-check the
- * dentry sequence number here after this d_inode read,
- * because a mount-point is always pinned.
- */
- *inode = path->dentry->d_inode;
+ if (flags & DCACHE_MOUNTED) {
+ struct mount *mounted = __lookup_mnt(path->mnt, dentry);
+ if (mounted) {
+ path->mnt = &mounted->mnt;
+ dentry = path->dentry = mounted->mnt.mnt_root;
+ nd->flags |= LOOKUP_JUMPED;
+ *seqp = read_seqcount_begin(&dentry->d_seq);
+ *inode = dentry->d_inode;
+ /*
+ * We don't need to re-check ->d_seq after this
+ * ->d_inode read - there will be an RCU delay
+ * between mount hash removal and ->mnt_root
+ * becoming unpinned.
+ */
+ flags = dentry->d_flags;
+ continue;
+ }
+ if (read_seqretry(&mount_lock, nd->m_seq))
+ return false;
+ }
+ return !(flags & DCACHE_NEED_AUTOMOUNT);
}
- return !read_seqretry(&mount_lock, nd->m_seq) &&
- !(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
}
static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
--
2.11.0
From: Al Viro <[email protected]>
there we'll be able to merge it with its counterparts in other
cases, and there's no reason to do it before the parent has
been unlocked
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 26 +++++++++++---------------
1 file changed, 11 insertions(+), 15 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index c2244ee4b2f0..16786be13050 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2956,23 +2956,12 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
d_lookup_done(dentry);
if (!error) {
if (file->f_mode & FMODE_OPENED) {
- int acc_mode = op->acc_mode;
if (unlikely(dentry != file->f_path.dentry)) {
dput(dentry);
dentry = dget(file->f_path.dentry);
}
- /*
- * We didn't have the inode before the open, so check open
- * permission here.
- */
- if (file->f_mode & FMODE_CREATED) {
- WARN_ON(!(open_flag & O_CREAT));
+ if (file->f_mode & FMODE_CREATED)
fsnotify_create(dir, dentry);
- acc_mode = 0;
- }
- error = may_open(&file->f_path, acc_mode, open_flag);
- if (WARN_ON(error > 0))
- error = -EINVAL;
} else if (WARN_ON(file->f_path.dentry == DENTRY_NOT_SET)) {
error = -EIO;
} else {
@@ -3216,12 +3205,19 @@ static const char *do_last(struct nameidata *nd,
}
if (file->f_mode & FMODE_OPENED) {
- if ((file->f_mode & FMODE_CREATED) ||
- !S_ISREG(file_inode(file)->i_mode))
+ if (file->f_mode & FMODE_CREATED) {
+ open_flag &= ~O_TRUNC;
+ will_truncate = false;
+ acc_mode = 0;
+ } else if (!S_ISREG(file_inode(file)->i_mode))
will_truncate = false;
audit_inode(nd->name, file->f_path.dentry, 0);
- dput(dentry);
+ dput(nd->path.dentry);
+ nd->path.dentry = dentry;
+ error = may_open(&nd->path, acc_mode, open_flag);
+ if (error)
+ goto out;
goto opened;
}
--
2.11.0
From: Al Viro <[email protected]>
common guts of follow_down() and follow_managed() taken to a new
helper - traverse_mounts(). The remnants of follow_managed()
are folded into its sole remaining caller (handle_mounts()).
Calling conventions of handle_mounts() slightly sanitized -
instead of the weird "1 for success, -E... for failure" that used
to be imposed by the calling conventions of walk_component() et.al.
we can use the normal "0 for success, -E... for failure".
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 177 +++++++++++++++++++++++++------------------------------------
1 file changed, 72 insertions(+), 105 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index e83071d25fae..a7730bbee162 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1168,91 +1168,79 @@ static int follow_automount(struct path *path, int *count, unsigned lookup_flags
}
/*
- * Handle a dentry that is managed in some way.
- * - Flagged for transit management (autofs)
- * - Flagged as mountpoint
- * - Flagged as automount point
- *
- * This may only be called in refwalk mode.
- * On success path->dentry is known positive.
- *
- * Serialization is taken care of in namespace.c
+ * mount traversal - out-of-line part. One note on ->d_flags accesses -
+ * dentries are pinned but not locked here, so negative dentry can go
+ * positive right under us. Use of smp_load_acquire() provides a barrier
+ * sufficient for ->d_inode and ->d_flags consistency.
*/
-static int follow_managed(struct path *path, struct nameidata *nd)
+static int __traverse_mounts(struct path *path, unsigned flags, bool *jumped,
+ int *count, unsigned lookup_flags)
{
- struct vfsmount *mnt = path->mnt; /* held by caller, must be left alone */
- unsigned flags;
+ struct vfsmount *mnt = path->mnt;
bool need_mntput = false;
int ret = 0;
- /* Given that we're not holding a lock here, we retain the value in a
- * local variable for each dentry as we look at it so that we don't see
- * the components of that value change under us */
- while (flags = smp_load_acquire(&path->dentry->d_flags),
- unlikely(flags & DCACHE_MANAGED_DENTRY)) {
+ while (flags & DCACHE_MANAGED_DENTRY) {
/* Allow the filesystem to manage the transit without i_mutex
* being held. */
if (flags & DCACHE_MANAGE_TRANSIT) {
- BUG_ON(!path->dentry->d_op);
- BUG_ON(!path->dentry->d_op->d_manage);
ret = path->dentry->d_op->d_manage(path, false);
flags = smp_load_acquire(&path->dentry->d_flags);
if (ret < 0)
break;
}
- /* Transit to a mounted filesystem. */
- if (flags & DCACHE_MOUNTED) {
+ if (flags & DCACHE_MOUNTED) { // something's mounted on it..
struct vfsmount *mounted = lookup_mnt(path);
- if (mounted) {
+ if (mounted) { // ... in our namespace
dput(path->dentry);
if (need_mntput)
mntput(path->mnt);
path->mnt = mounted;
path->dentry = dget(mounted->mnt_root);
+ // here we know it's positive
+ flags = path->dentry->d_flags;
need_mntput = true;
continue;
}
-
- /* Something is mounted on this dentry in another
- * namespace and/or whatever was mounted there in this
- * namespace got unmounted before lookup_mnt() could
- * get it */
}
- /* Handle an automount point */
- if (flags & DCACHE_NEED_AUTOMOUNT) {
- ret = follow_automount(path, &nd->total_link_count,
- nd->flags);
- if (ret < 0)
- break;
- continue;
- }
+ if (!(flags & DCACHE_NEED_AUTOMOUNT))
+ break;
- /* We didn't change the current path point */
- break;
+ // uncovered automount point
+ ret = follow_automount(path, count, lookup_flags);
+ flags = smp_load_acquire(&path->dentry->d_flags);
+ if (ret < 0)
+ break;
}
- if (need_mntput) {
- if (path->mnt == mnt)
- mntput(path->mnt);
- if (unlikely(nd->flags & LOOKUP_NO_XDEV))
- ret = -EXDEV;
- else
- nd->flags |= LOOKUP_JUMPED;
- }
- if (ret == -EISDIR || !ret)
- ret = 1;
- if (ret > 0 && unlikely(d_flags_negative(flags)))
+ if (ret == -EISDIR)
+ ret = 0;
+ // possible if you race with several mount --move
+ if (need_mntput && path->mnt == mnt)
+ mntput(path->mnt);
+ if (!ret && unlikely(d_flags_negative(flags)))
ret = -ENOENT;
- if (unlikely(ret < 0)) {
- dput(path->dentry);
- if (path->mnt != nd->path.mnt)
- mntput(path->mnt);
- }
+ *jumped = need_mntput;
return ret;
}
+static inline int traverse_mounts(struct path *path, bool *jumped,
+ int *count, unsigned lookup_flags)
+{
+ unsigned flags = smp_load_acquire(&path->dentry->d_flags);
+
+ /* fastpath */
+ if (likely(!(flags & DCACHE_MANAGED_DENTRY))) {
+ *jumped = false;
+ if (unlikely(d_flags_negative(flags)))
+ return -ENOENT;
+ return 0;
+ }
+ return __traverse_mounts(path, flags, jumped, count, lookup_flags);
+}
+
int follow_down_one(struct path *path)
{
struct vfsmount *mounted;
@@ -1270,6 +1258,23 @@ int follow_down_one(struct path *path)
EXPORT_SYMBOL(follow_down_one);
/*
+ * Follow down to the covering mount currently visible to userspace. At each
+ * point, the filesystem owning that dentry may be queried as to whether the
+ * caller is permitted to proceed or not.
+ */
+int follow_down(struct path *path)
+{
+ struct vfsmount *mnt = path->mnt;
+ bool jumped;
+ int ret = traverse_mounts(path, &jumped, NULL, 0);
+
+ if (path->mnt != mnt)
+ mntput(mnt);
+ return ret;
+}
+EXPORT_SYMBOL(follow_down);
+
+/*
* Try to skip to top of mountpoint pile in rcuwalk mode. Fail if
* we meet a managed dentry that would need blocking.
*/
@@ -1325,6 +1330,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
struct path *path, struct inode **inode,
unsigned int *seqp)
{
+ bool jumped;
int ret;
path->mnt = nd->path.mnt;
@@ -1334,15 +1340,25 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
if (unlikely(!*inode))
return -ENOENT;
if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
- return 1;
+ return 0;
if (unlazy_child(nd, dentry, seq))
return -ECHILD;
// *path might've been clobbered by __follow_mount_rcu()
path->mnt = nd->path.mnt;
path->dentry = dentry;
}
- ret = follow_managed(path, nd);
- if (likely(ret >= 0)) {
+ ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
+ if (jumped) {
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ ret = -EXDEV;
+ else
+ nd->flags |= LOOKUP_JUMPED;
+ }
+ if (unlikely(ret)) {
+ dput(path->dentry);
+ if (path->mnt != nd->path.mnt)
+ mntput(path->mnt);
+ } else {
*inode = d_backing_inode(path->dentry);
*seqp = 0; /* out of RCU mode, so the value doesn't matter */
}
@@ -1411,55 +1427,6 @@ static int follow_dotdot_rcu(struct nameidata *nd)
}
/*
- * Follow down to the covering mount currently visible to userspace. At each
- * point, the filesystem owning that dentry may be queried as to whether the
- * caller is permitted to proceed or not.
- */
-int follow_down(struct path *path)
-{
- unsigned managed;
- int ret;
-
- while (managed = READ_ONCE(path->dentry->d_flags),
- unlikely(managed & DCACHE_MANAGED_DENTRY)) {
- /* Allow the filesystem to manage the transit without i_mutex
- * being held.
- *
- * We indicate to the filesystem if someone is trying to mount
- * something here. This gives autofs the chance to deny anyone
- * other than its daemon the right to mount on its
- * superstructure.
- *
- * The filesystem may sleep at this point.
- */
- if (managed & DCACHE_MANAGE_TRANSIT) {
- BUG_ON(!path->dentry->d_op);
- BUG_ON(!path->dentry->d_op->d_manage);
- ret = path->dentry->d_op->d_manage(path, false);
- if (ret < 0)
- return ret == -EISDIR ? 0 : ret;
- }
-
- /* Transit to a mounted filesystem. */
- if (managed & DCACHE_MOUNTED) {
- struct vfsmount *mounted = lookup_mnt(path);
- if (!mounted)
- break;
- dput(path->dentry);
- mntput(path->mnt);
- path->mnt = mounted;
- path->dentry = dget(mounted->mnt_root);
- continue;
- }
-
- /* Don't handle automount points here */
- break;
- }
- return 0;
-}
-EXPORT_SYMBOL(follow_down);
-
-/*
* Skip to top of mountpoint pile in refwalk mode for follow_dotdot()
*/
static void follow_mount(struct path *path)
--
2.11.0
From: Al Viro <[email protected]>
->atomic_open() might have used a different alias than the one we'd
passed to it; in "not opened" case we take care of that, in "opened"
one we don't. Currently we don't care downstream of "opened" case
which alias to return; however, that will change shortly when we
get to unifying may_open() calls.
It's not hard to get right in all cases, anyway.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index a7730bbee162..c2244ee4b2f0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2956,11 +2956,15 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
d_lookup_done(dentry);
if (!error) {
if (file->f_mode & FMODE_OPENED) {
+ int acc_mode = op->acc_mode;
+ if (unlikely(dentry != file->f_path.dentry)) {
+ dput(dentry);
+ dentry = dget(file->f_path.dentry);
+ }
/*
* We didn't have the inode before the open, so check open
* permission here.
*/
- int acc_mode = op->acc_mode;
if (file->f_mode & FMODE_CREATED) {
WARN_ON(!(open_flag & O_CREAT));
fsnotify_create(dir, dentry);
--
2.11.0
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 56285466aa55..51283caaf7c4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3208,13 +3208,6 @@ static const char *do_last(struct nameidata *nd,
return ERR_CAST(dentry);
if (file->f_mode & FMODE_OPENED) {
- if (file->f_mode & FMODE_CREATED) {
- open_flag &= ~O_TRUNC;
- will_truncate = false;
- acc_mode = 0;
- } else if (!S_ISREG(file_inode(file)->i_mode))
- will_truncate = false;
-
audit_inode(nd->name, file->f_path.dentry, 0);
dput(nd->path.dentry);
nd->path.dentry = dentry;
@@ -3222,10 +3215,6 @@ static const char *do_last(struct nameidata *nd,
}
if (file->f_mode & FMODE_CREATED) {
- /* Don't check for write permission, don't truncate */
- open_flag &= ~O_TRUNC;
- will_truncate = false;
- acc_mode = 0;
dput(nd->path.dentry);
nd->path.dentry = dentry;
goto finish_open_created;
@@ -3260,10 +3249,16 @@ static const char *do_last(struct nameidata *nd,
}
if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
return ERR_PTR(-ENOTDIR);
- if (!d_is_reg(nd->path.dentry))
- will_truncate = false;
finish_open_created:
+ if (file->f_mode & FMODE_CREATED) {
+ /* Don't check for write permission, don't truncate */
+ open_flag &= ~O_TRUNC;
+ will_truncate = false;
+ acc_mode = 0;
+ } else if (!d_is_reg(nd->path.dentry)) {
+ will_truncate = false;
+ }
if (will_truncate) {
error = mnt_want_write(nd->path.mnt);
if (error)
--
2.11.0
From: Al Viro <[email protected]>
have FMODE_OPENED case rejoin the main path at earlier point
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 16786be13050..f79e020f08fe 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3215,10 +3215,7 @@ static const char *do_last(struct nameidata *nd,
audit_inode(nd->name, file->f_path.dentry, 0);
dput(nd->path.dentry);
nd->path.dentry = dentry;
- error = may_open(&nd->path, acc_mode, open_flag);
- if (error)
- goto out;
- goto opened;
+ goto finish_open_created;
}
if (file->f_mode & FMODE_CREATED) {
@@ -3285,11 +3282,10 @@ static const char *do_last(struct nameidata *nd,
error = may_open(&nd->path, acc_mode, open_flag);
if (error)
goto out;
- BUG_ON(file->f_mode & FMODE_OPENED); /* once it's opened, it's opened */
- error = vfs_open(&nd->path, file);
+ if (!(file->f_mode & FMODE_OPENED))
+ error = vfs_open(&nd->path, file);
if (error)
goto out;
-opened:
error = ima_file_check(file, op->acc_mode);
if (!error && will_truncate)
error = handle_truncate(file);
--
2.11.0
From: Al Viro <[email protected]>
Don't mess with got_write there - it is guaranteed to be false on
entry and it will be set true if and only if we decide to go for
truncation and manage to get write access for that.
Don't carry acc_mode through the entire thing - it's only used
in that part. And don't bother with gotos in there - compiler is
quite capable of optimizing that.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 28 +++++++++++-----------------
1 file changed, 11 insertions(+), 17 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 51283caaf7c4..ce6f2864a335 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3131,9 +3131,9 @@ static const char *do_last(struct nameidata *nd,
kuid_t dir_uid = nd->inode->i_uid;
umode_t dir_mode = nd->inode->i_mode;
int open_flag = op->open_flag;
- bool will_truncate = (open_flag & O_TRUNC) != 0;
+ bool do_truncate;
bool got_write = false;
- int acc_mode = op->acc_mode;
+ int acc_mode;
unsigned seq;
struct inode *inode;
struct dentry *dentry;
@@ -3251,36 +3251,30 @@ static const char *do_last(struct nameidata *nd,
return ERR_PTR(-ENOTDIR);
finish_open_created:
+ do_truncate = false;
+ acc_mode = op->acc_mode;
if (file->f_mode & FMODE_CREATED) {
/* Don't check for write permission, don't truncate */
open_flag &= ~O_TRUNC;
- will_truncate = false;
acc_mode = 0;
- } else if (!d_is_reg(nd->path.dentry)) {
- will_truncate = false;
- }
- if (will_truncate) {
+ } else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) {
error = mnt_want_write(nd->path.mnt);
if (error)
return ERR_PTR(error);
- got_write = true;
+ do_truncate = true;
}
error = may_open(&nd->path, acc_mode, open_flag);
- if (error)
- goto out;
- if (!(file->f_mode & FMODE_OPENED))
+ if (!error && !(file->f_mode & FMODE_OPENED))
error = vfs_open(&nd->path, file);
- if (error)
- goto out;
- error = ima_file_check(file, op->acc_mode);
- if (!error && will_truncate)
+ if (!error)
+ error = ima_file_check(file, op->acc_mode);
+ if (!error && do_truncate)
error = handle_truncate(file);
-out:
if (unlikely(error > 0)) {
WARN_ON(1);
error = -EINVAL;
}
- if (got_write)
+ if (do_truncate)
mnt_drop_write(nd->path.mnt);
return ERR_PTR(error);
}
--
2.11.0
From: Al Viro <[email protected]>
... getting may_create_in_sticky() checks in FMODE_OPENED case as well.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index ce6f2864a335..37cbe7806677 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3207,14 +3207,7 @@ static const char *do_last(struct nameidata *nd,
if (IS_ERR(dentry))
return ERR_CAST(dentry);
- if (file->f_mode & FMODE_OPENED) {
- audit_inode(nd->name, file->f_path.dentry, 0);
- dput(nd->path.dentry);
- nd->path.dentry = dentry;
- goto finish_open_created;
- }
-
- if (file->f_mode & FMODE_CREATED) {
+ if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
dput(nd->path.dentry);
nd->path.dentry = dentry;
goto finish_open_created;
@@ -3238,7 +3231,9 @@ static const char *do_last(struct nameidata *nd,
error = complete_walk(nd);
if (error)
return ERR_PTR(error);
- audit_inode(nd->name, nd->path.dentry, 0);
+finish_open_created:
+ if (!(file->f_mode & FMODE_CREATED))
+ audit_inode(nd->name, nd->path.dentry, 0);
if (open_flag & O_CREAT) {
if (d_is_dir(nd->path.dentry))
return ERR_PTR(-EISDIR);
@@ -3250,7 +3245,6 @@ static const char *do_last(struct nameidata *nd,
if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
return ERR_PTR(-ENOTDIR);
-finish_open_created:
do_truncate = false;
acc_mode = op->acc_mode;
if (file->f_mode & FMODE_CREATED) {
--
2.11.0
From: Al Viro <[email protected]>
it's easier to drop it right after lookup_open() and regain if
needed (i.e. if we will need to truncate). On the non-FMODE_OPENED
path we do that anyway. In case of FMODE_CREATED we won't be
needing it. And it's easier to prove correctness that way,
especially since the initial failure to get write access is not
always fatal; proving that we'll never end up truncating in that
case is rather convoluted.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 31 +++++++++++--------------------
1 file changed, 11 insertions(+), 20 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index f79e020f08fe..56285466aa55 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3199,11 +3199,14 @@ static const char *do_last(struct nameidata *nd,
else
inode_unlock_shared(dir->d_inode);
- if (IS_ERR(dentry)) {
- error = PTR_ERR(dentry);
- goto out;
+ if (got_write) {
+ mnt_drop_write(nd->path.mnt);
+ got_write = false;
}
+ if (IS_ERR(dentry))
+ return ERR_CAST(dentry);
+
if (file->f_mode & FMODE_OPENED) {
if (file->f_mode & FMODE_CREATED) {
open_flag &= ~O_TRUNC;
@@ -3228,16 +3231,6 @@ static const char *do_last(struct nameidata *nd,
goto finish_open_created;
}
- /*
- * If atomic_open() acquired write access it is dropped now due to
- * possible mount and symlink following (this might be optimized away if
- * necessary...)
- */
- if (got_write) {
- mnt_drop_write(nd->path.mnt);
- got_write = false;
- }
-
finish_lookup:
link = step_into(nd, 0, dentry, inode, seq);
if (unlikely(link)) {
@@ -3258,27 +3251,25 @@ static const char *do_last(struct nameidata *nd,
return ERR_PTR(error);
audit_inode(nd->name, nd->path.dentry, 0);
if (open_flag & O_CREAT) {
- error = -EISDIR;
if (d_is_dir(nd->path.dentry))
- goto out;
+ return ERR_PTR(-EISDIR);
error = may_create_in_sticky(dir_mode, dir_uid,
d_backing_inode(nd->path.dentry));
if (unlikely(error))
- goto out;
+ return ERR_PTR(error);
}
- error = -ENOTDIR;
if ((nd->flags & LOOKUP_DIRECTORY) && !d_can_lookup(nd->path.dentry))
- goto out;
+ return ERR_PTR(-ENOTDIR);
if (!d_is_reg(nd->path.dentry))
will_truncate = false;
+finish_open_created:
if (will_truncate) {
error = mnt_want_write(nd->path.mnt);
if (error)
- goto out;
+ return ERR_PTR(error);
got_write = true;
}
-finish_open_created:
error = may_open(&nd->path, acc_mode, open_flag);
if (error)
goto out;
--
2.11.0
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 51 +++++++++++++++++++++++++++++----------------------
1 file changed, 29 insertions(+), 22 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 37cbe7806677..96182a947ca1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3121,19 +3121,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
return ERR_PTR(error);
}
-/*
- * Handle the last step of open()
- */
-static const char *do_last(struct nameidata *nd,
+static const char *open_last_lookups(struct nameidata *nd,
struct file *file, const struct open_flags *op)
{
struct dentry *dir = nd->path.dentry;
- kuid_t dir_uid = nd->inode->i_uid;
- umode_t dir_mode = nd->inode->i_mode;
int open_flag = op->open_flag;
- bool do_truncate;
bool got_write = false;
- int acc_mode;
unsigned seq;
struct inode *inode;
struct dentry *dentry;
@@ -3145,9 +3138,9 @@ static const char *do_last(struct nameidata *nd,
if (nd->last_type != LAST_NORM) {
error = handle_dots(nd, nd->last_type);
- if (unlikely(error))
- return ERR_PTR(error);
- goto finish_open;
+ if (likely(!error))
+ error = complete_walk(nd);
+ return ERR_PTR(error);
}
if (!(open_flag & O_CREAT)) {
@@ -3160,7 +3153,6 @@ static const char *do_last(struct nameidata *nd,
if (likely(dentry))
goto finish_lookup;
- BUG_ON(nd->inode != dir->d_inode);
BUG_ON(nd->flags & LOOKUP_RCU);
} else {
/* create side of things */
@@ -3170,7 +3162,7 @@ static const char *do_last(struct nameidata *nd,
* about to look up
*/
error = complete_walk(nd);
- if (error)
+ if (unlikely(error))
return ERR_PTR(error);
audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
@@ -3199,10 +3191,8 @@ static const char *do_last(struct nameidata *nd,
else
inode_unlock_shared(dir->d_inode);
- if (got_write) {
+ if (got_write)
mnt_drop_write(nd->path.mnt);
- got_write = false;
- }
if (IS_ERR(dentry))
return ERR_CAST(dentry);
@@ -3210,7 +3200,7 @@ static const char *do_last(struct nameidata *nd,
if (file->f_mode & (FMODE_OPENED | FMODE_CREATED)) {
dput(nd->path.dentry);
nd->path.dentry = dentry;
- goto finish_open_created;
+ return NULL;
}
finish_lookup:
@@ -3226,12 +3216,29 @@ static const char *do_last(struct nameidata *nd,
audit_inode(nd->name, nd->path.dentry, 0);
return ERR_PTR(-EEXIST);
}
-finish_open:
+
/* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */
- error = complete_walk(nd);
- if (error)
- return ERR_PTR(error);
-finish_open_created:
+ return ERR_PTR(complete_walk(nd));
+}
+
+/*
+ * Handle the last step of open()
+ */
+static const char *do_last(struct nameidata *nd,
+ struct file *file, const struct open_flags *op)
+{
+ kuid_t dir_uid = nd->inode->i_uid;
+ umode_t dir_mode = nd->inode->i_mode;
+ int open_flag = op->open_flag;
+ bool do_truncate;
+ int acc_mode;
+ const char *link;
+ int error;
+
+ link = open_last_lookups(nd, file, op);
+ if (unlikely(link))
+ return link;
+
if (!(file->f_mode & FMODE_CREATED))
audit_inode(nd->name, nd->path.dentry, 0);
if (open_flag & O_CREAT) {
--
2.11.0
On Sat, Feb 22, 2020 at 5:16 PM Al Viro <[email protected]> wrote:
>
> +
> +discard2:
> + namespace_unlock();
> +discard1:
> + inode_unlock(dentry->d_inode);
> +discard:
> /* remove m from any expiration list it may be on */
Would you mind re-naming those labels?
I realize that the numbering may help show that the error handling is
done in the reverse order, but I bet that a nice name could so that
too.
Linus
On Sat, Feb 22, 2020 at 5:20 PM Al Viro <[email protected]> wrote:
>
> if (likely(!d_is_symlink(path->dentry)) ||
> - !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) {
> + !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW) ||
> + flags & WALK_NOFOLLOW) {
Humor me, and don't mix bitwise ops with logical boolean ops without
parentheses, ok?
And yes, the old code did it too, so it's not a new thing.
But as it gets even more complex, let's just generally strive for doing
(a & b) || (c & d)
instead of
a & b || c & d
to make it easier to mentally see the grouping.
Linus
Ok, so far I haven't seen anything bad. But I keep noticing these odd
stylistic things...
On Sat, Feb 22, 2020 at 5:22 PM Al Viro <[email protected]> wrote:
>
> - return step_into(nd, flags, dentry, inode, seq);
> + err = step_into(nd, flags, dentry, inode, seq);
> + if (!err)
> + return NULL;
> + else if (err > 0)
> + return get_link(nd);
> + else
> + return ERR_PTR(err);
> }
What?
Those "else" statements make no sense.
Each if-statement has a "return" in it. It's done. The else part is
not adding anything but confusion.
IOW, this should be
if (!err)
return NULL;
if (err > 0)
return get_link(nd);
return ERR_PTR(err);
with not an 'else' in sight.
Linus
Ok, one step back:
On Sat, Feb 22, 2020 at 5:22 PM Al Viro <[email protected]> wrote:
>
> + if (err > 0)
> + return get_link(nd);
> + else
> + return ERR_PTR(err);
> }
.. and two steps forward, as you then entirely remove the code I just
complained about.
> - err = step_into(nd, flags, dentry, inode, seq);
> - if (!err)
> - return NULL;
> - else if (err > 0)
> - return get_link(nd);
> - else
> - return ERR_PTR(err);
> + return step_into(nd, flags, dentry, inode, seq);
I'm waiting with bated breath if the next patch will then remove the
new odd if-return-else thing. But I'm not going to peek, it's going to
be a surprise.
Linus
On Sat, Feb 22, 2020 at 5:23 PM Al Viro <[email protected]> wrote:
>
> - err = pick_link(nd, &path, inode, seq);
> - if (err > 0)
> - return get_link(nd);
> - else
> - return ERR_PTR(err);
> + return pick_link(nd, &path, inode, seq);
Yay! It's like Christmas.
Except honestly, I think I'd have been happier if the intermediate
points didn't have that odd syntax in them. Even if it then gets
removed in the end.
Linus
On Sat, Feb 22, 2020 at 5:23 PM Al Viro <[email protected]> wrote:
>
> From: Al Viro <[email protected]>
>
> kill nd->link_inode, while we are at it
I like that part, but that pick_link() function is now the function from hell.
It's now something like a hundred lines of fairly dense code, isn't it?
Oh well. Maybe it's easier to follow since it's straight-line. So this
is more of an observation than a complaint.
Linus
On Sat, Feb 22, 2020 at 06:14:45PM -0800, Linus Torvalds wrote:
> On Sat, Feb 22, 2020 at 5:20 PM Al Viro <[email protected]> wrote:
> >
> > if (likely(!d_is_symlink(path->dentry)) ||
> > - !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) {
> > + !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW) ||
> > + flags & WALK_NOFOLLOW) {
>
> Humor me, and don't mix bitwise ops with logical boolean ops without
> parentheses, ok?
>
> And yes, the old code did it too, so it's not a new thing.
>
> But as it gets even more complex, let's just generally strive for doing
>
> (a & b) || (c & d)
>
> instead of
>
> a & b || c & d
>
> to make it easier to mentally see the grouping.
Can do... FWIW, the only case where the normal "'and' is multiplication,
'or' is addition" doesn't give the right result is
x | y && z
written instead of
x | (y && z)
where you would be better off rewriting the expression anyway.
FWIW, one of the things in the local pile is this:
commit cc1b6724b32de1be108cf6a5f28dbb5aa424b42f
Author: Al Viro <[email protected]>
Date: Sun Jan 19 12:44:18 2020 -0500
namei: invert the meaning of WALK_FOLLOW
old flags & WALK_FOLLOW <=> new !(flags & WALK_TRAILING)
That's what that flag had really been used for.
Signed-off-by: Al Viro <[email protected]>
diff --git a/fs/namei.c b/fs/namei.c
index 6eb708014d4b..7d938241157f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1786,7 +1786,7 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
return NULL;
}
-enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
+enum {WALK_TRAILING = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
/*
* Do we need to follow links? We _really_ want to be able
@@ -1805,7 +1805,7 @@ static const char *step_into(struct nameidata *nd, int flags,
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
if (likely(!d_is_symlink(path.dentry)) ||
- !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW) ||
+ (flags & WALK_TRAILING && !(nd->flags & LOOKUP_FOLLOW)) ||
flags & WALK_NOFOLLOW) {
/* not a symlink or should not follow */
path_to_nameidata(&path, nd);
@@ -2157,10 +2157,10 @@ static int link_path_walk(const char *name, struct nameidata *nd)
if (!name)
return 0;
/* last component of nested symlink */
- link = walk_component(nd, WALK_FOLLOW);
+ link = walk_component(nd, 0);
} else {
/* not the last component */
- link = walk_component(nd, WALK_FOLLOW | WALK_MORE);
+ link = walk_component(nd, WALK_MORE);
}
if (unlikely(link)) {
if (IS_ERR(link))
@@ -2288,7 +2288,7 @@ static inline const char *lookup_last(struct nameidata *nd)
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
nd->flags &= ~LOOKUP_PARENT;
- link = walk_component(nd, 0);
+ link = walk_component(nd, WALK_TRAILING);
if (link) {
nd->flags |= LOOKUP_PARENT;
nd->stack[0].name = NULL;
@@ -3241,7 +3241,7 @@ static const char *do_last(struct nameidata *nd,
}
finish_lookup:
- link = step_into(nd, 0, dentry, inode, seq);
+ link = step_into(nd, WALK_TRAILING, dentry, inode, seq);
if (unlikely(link)) {
nd->flags |= LOOKUP_PARENT;
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
and I can simply fold adding extra parens into it. Or I can fold that into
the patch you'd been replying to - I'm still uncertain about the series
containing WALK_TRAILING...
On Sun, Feb 23, 2020 at 01:12:21AM +0000, Al Viro wrote:
> This is a slightly extended repost of the patchset posted on
> Jan 19. Current branch is in vfs.git#work.do_last, the main
> difference from the last time around being a bit of do_last()
> untangling added in the end of series. #work.openat2 is already
> in mainline, which simplifies the series - now it's a straight
> branch with no merges.
Whee... While trying to massage ".." handling towards the use of
regular mount crossing semantics, I've found something interesting.
Namely, if you start in a directory with overmounted parent,
LOOKUP_NO_XDEV resolution of ../something will bloody well cross
into the overmount.
Reason: follow_dotdot() (and its RCU counterpart) check for LOOKUP_NO_XDEV
when crossing into underlying fs, but not when crossing into overmount
of the parent.
Interpretation of .. is basically
loop: if we are in root // uncommon
next = current position
else if we are in root of a mounted filesystem // more rare
move to underlying mountpoint
goto loop
else
next = parent directory of current position // most common
while next is overmounted // _VERY_ uncommon
next = whatever's mounted on next
move to next
The second loop should've been sharing code with the normal mountpoint
crossing. It doesn't, which has already lead to interesting inconsistencies
(e.g. autofs generally expects ->d_manage() to be called before crossing
into it; here it's not done). LOOKUP_NO_XDEV has just added one more...
Incidentally, another inconsistency is LOOKUP_BENEATH treatment in case
when we have walked out of the subtree by way of e.g. procfs symlink and
then ran into .. in the absolute root (that's
if (!follow_up(&nd->path))
break;
in follow_dotdot()). Shouldn't that give the same reaction as ..
in root (EXDEV on LOOKUP_BENEATH, that is)? It doesn't...
Another one is about LOOKUP_NO_XDEV again: suppose you have process'
root directly overmounted and cwd in the root of whatever's overmounting
it. Resolution of .. will stay in cwd - we have no parent within the
chroot jail we are in, so we move to whatever's overmounting that root.
Which is the original location. Should we fail on LOOKUP_NO_XDEV here?
Plain .. in the root of chroot jail (not overmounted by anything) does
*not*...
On Tue, Feb 25, 2020 at 01:24:57AM +0000, Al Viro wrote:
> Incidentally, another inconsistency is LOOKUP_BENEATH treatment in case
> when we have walked out of the subtree by way of e.g. procfs symlink and
> then ran into .. in the absolute root (that's
> if (!follow_up(&nd->path))
> break;
> in follow_dotdot()). Shouldn't that give the same reaction as ..
> in root (EXDEV on LOOKUP_BENEATH, that is)? It doesn't...
>
> Another one is about LOOKUP_NO_XDEV again: suppose you have process'
> root directly overmounted and cwd in the root of whatever's overmounting
> it. Resolution of .. will stay in cwd - we have no parent within the
> chroot jail we are in, so we move to whatever's overmounting that root.
> Which is the original location. Should we fail on LOOKUP_NO_XDEV here?
> Plain .. in the root of chroot jail (not overmounted by anything) does
> *not*...
FWIW, my preference would be the following (for non-RCU case; RCU
one is similar)
get_parent(nd)
{
if (path_equal(&nd->path, &nd->root))
return NULL;
if (nd->path.dentry != nd->path.mnt->mnt_root)
return dget_parent(nd->path.dentry);
m = real_mount(nd->path.mnt);
read_seqlock_excl(&mount_lock);
while (mnt_has_parent(m)) {
d = m->mnt_mountpoint;
m = m->mnt_parent;
if (&m->mnt == nd->root.mnt && d == nd->root.path) // root
break;
if (m->mnt_root != d) {
if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
read_sequnlock_excl(&mount_lock);
return ERR_PTR(-EXDEV);
}
mntget(&m->mnt);
dget(d);
read_sequnlock_excl(&mount_lock);
path_put(&nd->path);
nd->path.mnt = &m->mnt;
nd->path.dentry = d;
nd->inode = d->d_inode;
return dget_parent(d);
}
}
read_sequnlock_excl(&mount_lock);
return NULL;
}
with follow_dotdot() doing
parent = get_parent(nd);
if (unlikely(IS_ERR(parent)))
return PTR_ERR(parent);
if (unlikely(!parent)) { .. in root is a rare case
bugger off if LOOKUP_BENEATH
parent = dget(nd->path.dentry);
} else if (unlikely(!path_connected(nd->path.mnt, parent))) {
dput(parent);
return -ENOENT;
}
dput(nd->path.dentry);
nd->path.dentry = parent;
follow_mount(&nd->path);
... with the last part replaced with
step_into(nd, WALK_NOFOLLOW, dentry, NULL, 0);
later in this series, with similar in RCU case (only there we would want
inode and seq supplied, as usual, so it would be get_parent_rcu(nd, &inode,
&seq)).
On Sat, Feb 22, 2020 at 06:07:39PM -0800, Linus Torvalds wrote:
> On Sat, Feb 22, 2020 at 5:16 PM Al Viro <[email protected]> wrote:
> >
> > +
> > +discard2:
> > + namespace_unlock();
> > +discard1:
> > + inode_unlock(dentry->d_inode);
> > +discard:
> > /* remove m from any expiration list it may be on */
>
> Would you mind re-naming those labels?
>
> I realize that the numbering may help show that the error handling is
> done in the reverse order, but I bet that a nice name could so that
> too.
Umm... A bit of reordering in the beginning eliminates discard1, suggesting
s/discard2/discard_locked/... Incremental would be
diff --git a/fs/namespace.c b/fs/namespace.c
index 6228fd1ef94f..777c3116e62e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2844,22 +2844,22 @@ int finish_automount(struct vfsmount *m, struct path *path)
* got", not "try to mount it on top".
*/
inode_lock(dentry->d_inode);
+ namespace_lock();
if (unlikely(cant_mount(dentry))) {
err = -ENOENT;
- goto discard1;
+ goto discard_locked;
}
- namespace_lock();
rcu_read_lock();
if (unlikely(__lookup_mnt(path->mnt, dentry))) {
rcu_read_unlock();
err = 0;
- goto discard2;
+ goto discard_locked;
}
rcu_read_unlock();
mp = get_mountpoint(dentry);
if (IS_ERR(mp)) {
err = PTR_ERR(mp);
- goto discard2;
+ goto discard_locked;
}
err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
@@ -2869,9 +2869,8 @@ int finish_automount(struct vfsmount *m, struct path *path)
mntput(m);
return 0;
-discard2:
+discard_locked:
namespace_unlock();
-discard1:
inode_unlock(dentry->d_inode);
discard:
/* remove m from any expiration list it may be on */
On Thu, Feb 27, 2020 at 11:43 AM Al Viro <[email protected]> wrote:
>
> Umm... A bit of reordering in the beginning eliminates discard1, suggesting
> s/discard2/discard_locked/... Incremental would be
Ack, thanks.
Linus
On 2020-02-25, Al Viro <[email protected]> wrote:
> On Sun, Feb 23, 2020 at 01:12:21AM +0000, Al Viro wrote:
> > This is a slightly extended repost of the patchset posted on
> > Jan 19. Current branch is in vfs.git#work.do_last, the main
> > difference from the last time around being a bit of do_last()
> > untangling added in the end of series. #work.openat2 is already
> > in mainline, which simplifies the series - now it's a straight
> > branch with no merges.
>
> Whee... While trying to massage ".." handling towards the use of
> regular mount crossing semantics, I've found something interesting.
> Namely, if you start in a directory with overmounted parent,
> LOOKUP_NO_XDEV resolution of ../something will bloody well cross
> into the overmount.
Oh boy...
> Reason: follow_dotdot() (and its RCU counterpart) check for LOOKUP_NO_XDEV
> when crossing into underlying fs, but not when crossing into overmount
> of the parent.
>
> Interpretation of .. is basically
>
> loop: if we are in root // uncommon
> next = current position
> else if we are in root of a mounted filesystem // more rare
> move to underlying mountpoint
> goto loop
> else
> next = parent directory of current position // most common
>
> while next is overmounted // _VERY_ uncommon
> next = whatever's mounted on next
>
> move to next
>
> The second loop should've been sharing code with the normal mountpoint
> crossing. It doesn't, which has already lead to interesting inconsistencies
> (e.g. autofs generally expects ->d_manage() to be called before crossing
> into it; here it's not done). LOOKUP_NO_XDEV has just added one more...
You're quite right -- LOOKUP_NO_XDEV should block that and I missed it.
> Incidentally, another inconsistency is LOOKUP_BENEATH treatment in case
> when we have walked out of the subtree by way of e.g. procfs symlink and
> then ran into .. in the absolute root (that's
> if (!follow_up(&nd->path))
> break;
> in follow_dotdot()). Shouldn't that give the same reaction as ..
> in root (EXDEV on LOOKUP_BENEATH, that is)? It doesn't...
You can't go through procfs symlinks with LOOKUP_BENEATH, but if it's
possible to do that kind of jump then it should also be blocked (but I
would say that I'd prefer "block any kind of weird jump").
> Another one is about LOOKUP_NO_XDEV again: suppose you have process'
> root directly overmounted and cwd in the root of whatever's overmounting
> it. Resolution of .. will stay in cwd - we have no parent within the
> chroot jail we are in, so we move to whatever's overmounting that root.
> Which is the original location. Should we fail on LOOKUP_NO_XDEV here?
> Plain .. in the root of chroot jail (not overmounted by anything) does
> *not*...
I think LOOKUP_NO_XDEV should block that since you end up crossing a
mountpoint.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Fri, Feb 28, 2020 at 12:24:51PM +1100, Aleksa Sarai wrote:
> > Another one is about LOOKUP_NO_XDEV again: suppose you have process'
> > root directly overmounted and cwd in the root of whatever's overmounting
> > it. Resolution of .. will stay in cwd - we have no parent within the
> > chroot jail we are in, so we move to whatever's overmounting that root.
> > Which is the original location. Should we fail on LOOKUP_NO_XDEV here?
> > Plain .. in the root of chroot jail (not overmounted by anything) does
> > *not*...
>
> I think LOOKUP_NO_XDEV should block that since you end up crossing a
> mountpoint.
You are not. Your process' root is overmounted and your current directory
is on that overmount. You attempt to resolve ".." there.
# cd /
# unshare -m
# stat .
File: .
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: 801h/2049d Inode: 2 Links: 25
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-02-26 20:51:06.623409892 -0500
Modify: 2020-02-26 20:43:51.284020000 -0500
Change: 2020-02-26 20:43:51.284020000 -0500
Birth: -
# mkdir /tmp/foo
# mount -t tmpfs none /tmp/foo/
# for i in *; do test -d $i && mkdir /tmp/foo/$i && mount --rbind $i /tmp/foo/$i; done
# cd /tmp/foo/
# mount --move . /
# /bin/pwd
/
# ls
bin dev home lib32 libx32 ltp mnt proc run srv tmp var
boot etc lib lib64 lost+found media opt root sbin sys usr
# ls /
253_metadump etc lib32 media run usr
315.full home lib64 mnt sbin var
bin initrd.img libx32 opt srv vmlinuz
boot initrd.img.old lost+found proc sys vmlinuz.old
dev lib ltp root tmp
# stat ..
File: ..
Size: 500 Blocks: 0 IO Block: 4096 directory
Device: 19h/25d Inode: 1875746 Links: 25
Access: (1777/drwxrwxrwt) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-02-27 21:48:45.649705410 -0500
Modify: 2020-02-27 21:46:40.829607188 -0500
Change: 2020-02-27 21:46:40.829607188 -0500
Birth: -
# stat .
File: .
Size: 500 Blocks: 0 IO Block: 4096 directory
Device: 19h/25d Inode: 1875746 Links: 25
Access: (1777/drwxrwxrwt) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-02-27 21:48:45.649705410 -0500
Modify: 2020-02-27 21:46:40.829607188 -0500
Change: 2020-02-27 21:46:40.829607188 -0500
Birth: -
See what's going on? We have a tmpfs overmounting root; current directory
is the root of tmpfs; process root is on ext3. Lookups for . and .. yield
exactly the same result here - we stay in the root of tmpfs.
Anyway, see #work.dotdot in vfs.git
Extended since the last repost. The branch is in #work.dotdot;
#work.do_last is its beginning (about 2/3 of the total), slightly
reworked since the last time.
Individual patches are in the followups. Branch survives
the local testing (including ltp and xfstests).
Review and testing would be _very_ welcome; it does a lot
of massage, so there had been a plenty of opportunities to fuck up
and fail to spot that. The same goes for profiling - it doesn't
seem to slow the things down, but that needs to be verified. If
nobody screams, in a couple of days I'm going to push the beginning
of that thing (#work.do_last) into -next, with the rest to follow
it there next weekend or so.
Changes since v2: several followup cleanups in the end of
symlinks-related series (part 3) and large series on ".." handling
added in the end (part 6). The only thing suggested in review and
not done yet is splitup of pick_link() - it's going to happen, but
I want the tricky parts handled first; pick_link() is trivial
as far as control flow goes.
part 1: follow_automount() cleanups and fixes.
Quite a bit of that function had been about working around the
wrong calling conventions of finish_automount(). The problem is that
finish_automount() misuses the primitive intended for mount(2) and
friends, where we want to mount on top of the pile, even if something
has managed to add to that while we'd been trying to lock the namespace.
For automount that's not the right thing to do - there we want to discard
whatever it was going to attach and just cross into what got mounted
there in the meanwhile (most likely - the results of the same automount
triggered by somebody else). Current mainline kinda-sorta manages to do
that, but it's unreliable and very convoluted. Much simpler approach
is to stop using lock_mount() in finish_automount() and have it bail
out if something turns out to have been mounted on top where we wanted
to attach. That allows to get rid of a lot of PITA in the caller.
Another simplification comes from not trying to cross into the results
of automount - simply ride through the next iteration of the loop and
let it move into overmount.
Another thing in the same series is divorcing follow_automount()
from nameidata; that'll play later when we get to unifying follow_down()
with the guts of follow_managed().
4 commits, the second one fixes a hard-to-hit race. The first
is a prereq for it.
1/55 do_add_mount(): lift lock_mount/unlock_mount into callers
2/55 fix automount/automount race properly
3/55 follow_automount(): get rid of dead^Wstillborn code
4/55 follow_automount() doesn't need the entire nameidata
part 2: unifying mount traversals in pathwalk.
Handling of mount traversal (follow_managed()) is currently called
in a bunch of places. Each of them is shortly followed by a call of
step_into() or an open-coded equivalent thereof. However, the locations
of those step_into() calls are far from preceding follow_managed();
moreover, that preceding call might happen on different paths that
converge to given step_into() call. It's harder to analyse that it should
be (especially when it comes to liveness analysis) and it forces rather
ugly calling conventions on lookup_fast()/atomic_open()/lookup_open().
The series below massages the code to the point when the calls of
follow_managed() (and __follow_mount_rcu()) move into the beginning of
step_into().
5/55 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
gets EEXIST handling in do_last() past the step_into() call there.
6/55 handle_mounts(): start building a sane wrapper for follow_managed()
rather than mangling follow_managed() itself (and creating conflicts
with openat2 series), add a wrapper that will absorb the required
interface changes.
7/55 atomic_open(): saner calling conventions (return dentry on success)
struct path passed to it is pure out parameter; only dentry part
ever varies, though - mnt is always nd->path.mnt. Just return
the dentry on success, and ERR_PTR(-E...) on failure.
8/55 lookup_open(): saner calling conventions (return dentry on success)
propagate the same change one level up the call chain.
9/55 do_last(): collapse the call of path_to_nameidata()
struct path filled in lookup_open() call is eventually given to
handle_mounts(); the only use it has before that is path_to_nameidata()
call in "->atomic_open() has actually opened it" case, and there
path_to_nameidata() is an overkill - we are guaranteed to replace
only nd->path.dentry. So have the struct path filled only immediately
prior to handle_mounts().
10/55 handle_mounts(): pass dentry in, turn path into a pure out argument
now all callers of handle_mount() are directly preceded by filling
struct path it gets. path->mnt is nd->path.mnt in all cases, so we can
pass just the dentry instead and fill path in handle_mount() itself.
Some boilerplate gone, path is pure out argument of handle_mount()
now.
11/55 lookup_fast(): consolidate the RCU success case
massage to gather what will become an RCU case equivalent of
handle_mounts(); basically, that's what we do if revalidate succeeds
in RCU case of lookup_fast(), including unlazy and fallback to
handle_mounts() if __follow_mount_rcu() says "it's too tricky".
12/55 teach handle_mounts() to handle RCU mode
... and take that into handle_mount() itself. The other caller of
__follow_mount_rcu() is fine with the same fallback (it just didn't
bother since it's in the very beginning of pathwalk), switched to
handle_mount() as well.
13/55 lookup_fast(): take mount traversal into callers
Now we are getting somewhere - both RCU and non-RCU success cases of
lookup_fast() are ended with the same return handle_mounts(...);
move that to the callers - there it will merge with the identical calls
that had been on the paths where we had to do slow lookups.
lookup_fast() returns dentry now.
14/55 new step_into() flag: WALK_NOFOLLOW
use step_into() instead of open-coding it in handle_lookup_down().
Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for
that (and eventually, I hope, for .. handling).
Now *all* calls of handle_mounts() and step_into() are right next to
each other.
15/55 fold handle_mounts() into step_into()
... and we can move the call of handle_mounts() into step_into(),
getting a slightly saner calling conventions out of that.
16/55 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
another payoff from 14/17 - we can teach path_lookupat() to do
what path_mountpointat() used to. And kill the latter, along with
its wrappers.
17/55 expand the only remaining call of path_lookup_conditional()
minor cleanup - RIP path_lookup_conditional(). Only one caller left.
Changes so far:
* mount traversal is taken into step_into().
* lookup_fast(), atomic_open() and lookup_open() calling conventions
are slightly changed. All of them return dentry now, instead of returning
an int and filling struct path on success. For lookup_fast() the old
"0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache
miss, dentry - for hit".
* step_into() can be called in RCU mode as well. Takes nameidata,
WALK_... flags, dentry and, in RCU case, corresponding inode and seq value.
Handles mount traversals, decides whether it's a symlink to be followed.
Error => returns -E...; symlink to follow => returns 1, puts symlink on stack;
non-symlink or symlink not to follow => returns 0, moves nd->path to new location.
* LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends
became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in flags.
part 3: untangling the symlink handling.
Right now when we decide to follow a symlink it happens this way:
* step_into() decides that it has been given a symlink that needs to
be followed.
* it calls pick_link(), which pushes the symlink on stack and
returns 1 on success / -E... on error. Symlink's mount/dentry/seq is
stored on stack and the inode is stashed in nd->link_inode.
* step_into() passes that 1 to its callers, which proceed to pass it
up the call chain for several layers. In all cases we get to get_link()
call shortly afterwards.
* get_link() is called, picks the inode stashed in nd->link_inode
by the pick_link(), does some checks, touches the atime, etc.
* get_link() either picks the link body out of inode or calls
->get_link(). If it's an absolute symlink, we move to the root and return
the relative portion of the body; if it's a relative one - just return the
body. If it's a procfs-style one, the call of nd_jump_link() has been
made and we'd moved to whatever location is desired. And return NULL,
same as we do for symlink to "/".
* the caller proceeds to deal with the string returned to it.
The sequence is the same in all cases (nested symlink, trailing
symlink on lookup, trailing symlink on open), but its pieces are not close
to each other and the bit between the call of pick_link() and (inevitable)
call of get_link() afterwards is not easy to follow. Moreover, a bunch
of functions (walk_component/lookup_last/do_last) ends up with the same
conventions for return values as step_into(). And those conventions
(see above) are not pretty - 0/1/-E... is asking for mistakes, especially
when returned 1 is used only to direct control flow on a rather twisted
way to matching get_link() call. And that path can be seriously twisted.
E.g. when we are trying to open /dev/stdin, we get the following sequence:
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast(). Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing. Into the stack it goes and we return 1.
* do_last() sees 1 and returns it.
* trailing_symlink() is called (in the top-level loop) and it
calls get_link(). OK, we get "/proc/self/fd/0" for body, move to
root again and return "proc/self/fd/0".
* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get
to look at what we'd been given. A symlink to follow, so on the stack
it goes and we return 1.
* again, do_last() passes 1 to caller
* trailing_symlink() is called and calls get_link().
* this time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin. And returns NULL. But the
fun doesn't stop yet.
* trailing_symlink() returns "" to the caller
* link_path_walk() is called on that and does nothing
whatsoever.
* do_last() is called and sees LAST_BIND left by the get_link().
It calls handle_dots()
* handle_dots() drops the symlink from stack and returns
* do_last() *FINALLY* proceeds to the point after its call of
step_into() (finish_open:) and gets around to opening the damn thing.
Making sense of the control flow through all of that is not fun,
to put it mildly; debugging anything in that area can be a massive PITA,
and this example has touched only one of 3 cases. Arguably, the worst
one, but... Anyway, it turns out that this code can be massaged to
considerably saner shape - both in terms of control flow and wrt calling
conventions.
18/55 merging pick_link() with get_link(), part 1
prep work: move the "hardening" crap from trailing_symlink() into
get_link() (conditional on the absense of LOOKUP_PARENT in nd->flags).
We'll be moving the calls of get_link() around quite a bit through that
series, and the next step will be to eliminate trailing_symlink().
19/55 merging pick_link() with get_link(), part 2
fold trailing_symlink() into lookup_last() and do_last().
Now these are returning strings; it's not the final calling conventions,
but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) =>
old -E..., we'd failed. string => old 1, and the string is the symlink
body to follow. Just as for trailing_symlink(), "/" and procfs ones
(where get_link() returns NULL) yield "", so the ugly song and dance
with no-op trip through link_path_walk()/handle_dots() still remains.
20/55 merging pick_link() with get_link(), part 3
elimination of that round-trip. In *all* cases having
get_link() return NULL on such symlinks means that we'll proceed to
drop the symlink from stack and get back to the point near that
get_link() call - basically, where we would be if it hadn't been
a symlink at all. The path by which we are getting there depends
upon the call site; the end result is the same in all cases - such
symlinks (procfs ones and symlink to "/") are fully processed by
the time get_link() returns, so we could as well drop them from the
stack right in get_link(). Makes life simpler in terms of control
flow analysis...
And now the calling conventions for do_last() and lookup_last()
have reached the final shape - ERR_PTR(-E...) for error, NULL for
"we are done", string for "traverse this".
21/55 merging pick_link() with get_link(), part 4
now all calls of walk_component() are followed by the same
boilerplate - "if it has returned 1, call get_link() and if that
has returned NULL treat that as if walk_component() has returned 0".
Eliminate by folding that into walk_component() itself. Now
walk_component() return value conventions have joined those of
do_last()/lookup_last().
22/55 merging pick_link() with get_link(), part 5
same as for the previous, only this time the boilerplate
migrates one level down, into step_into(). Only one caller of
get_link() left, step_into() has joined the same return value
conventions.
23/55 merging pick_link() with get_link(), part 6
move that thing into pick_link(). Now all traces of
"return 1 if we are following a symlink" are gone.
24/55 finally fold get_link() into pick_link()
ta-da - expand get_link() into the only caller. As a side
benefit, we get rid of stashing the inode in nd->link_inode - it
was done only to carry that piece of information from pick_link()
to eventual get_link(). That's not the main benefit, though - the
control flow became considerably easier to reason about.
For what it's worth, the example above (/dev/stdin) becomes
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast(). Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing. On the stack it goes and we get its body. Which is
"/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
* do_last() sees non-NULL and returns it - whether it's an error
or a pathname to traverse, we hadn't reached something we'll be opening.
* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get
to look at what we'd been given. A symlink to follow, so on the stack
it goes. This time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin. And returns NULL. So we
drop symlink from stack and return that NULL to caller.
* that NULL is returned by step_into(), same as if we had just
moved to a non-symlink.
* do_last() proceeds to open the damn thing.
Some low-hanging fruits become available: LAST_BIND can be removed and
the predicate controlling may_follow_link() we'd moved into pick_link()
in #18 can be made more straightforward:
25/55 LAST_BIND removal
The only reason to keep it had just been eliminated - it was
needed to route the control flow through that weird last iteration
through the loop. With that iteration gone...
26/55 invert the meaning of WALK_FOLLOW
27/55 pick_link(): check for WALK_TRAILING, not LOOKUP_PARENT
In #18 the checks specific to trailing symlinks got moved into
pick_link(), where they were made conditional upon LOOKUP_PARENT in
nd->flags. That works, but it's more subtle than I would like it to
be - it depends upon the dynamic state (nd->flags) which gets changed
through the pathwalk and it's sensitive to exact locations where we
flip LOOKUP_PARENT. Now we have a more robust way to do that - the
call chains that end up in pick_link() with LOOKUP_PARENT in nd->flags
are those that had WALK_TRAILING passed to the immediate caller of
pick_link() (step_into()). So we can pass WALK_... down to pick_link()
and turn the check into explicit "if we are passed WALK_TRAILING,
it's a trailing symlink and we need to apply the checks in may_follow()".
We could, in principle, reorder these two commits into the
very beginning of symlink series; that would make #18 slightly simpler
at the cost of (marginally) more boilerplate to carry through the
get_link() call moves. Not sure if it's worth doing, though...
28/55 link_path_walk(): simplify stack handling
Another cleanup that becomes possible is handling of the stack(s).
We use nd->stack to store two things: pinning down the symlinks we are
resolving and resuming the name traversal when a nested symlink is finished.
Currently, nd->depth is used to keep track of both. It's 0 when we call
link_path_walk() for the first time (for the pathname itself) and 1 on
all subsequent calls (for trailing symlinks, if any). That's fine,
as far as pinning symlinks goes - when handling a trailing symlink,
the string we are interpreting is the body of symlink pinned down in
nd->stack[0]. It's rather inconvenient with respect to handling nested
symlinks, though - when we run out of a string we are currently interpreting,
we need to decide whether it's a nested symlink (in which case we need
to pick the string saved back when we started to interpret that nested
symlink and resume its traversal) or not (in which case we are done with
link_path_walk()).
Current solution is a bit of a kludge - in handling of trailing symlink
(in lookup_last() and open_last_lookups() we clear nd->stack[0].name.
That allows link_path_walk() to use the following rules when running out of
a string to interpret:
* if nd->depth is zero, we are at the end of pathname itself.
* if nd->depth is positive, check the saved string; for
nested symlink it will be non-NULL, for trailing symlink - NULL.
It works, but it's rather non-obvious. Note that we have two sets:
the set of symlinks currently being traversed and the set of postponed
pathname tails. The former is stored in nd->stack[0..nd->depth-1].link
and it's valid throught the pathname resolution; the latter is valid only
during an individual call of link_path_walk() and it occupies
nd->stack[0..nd->depth-1].name for the first call of link_path_walk() and
nd->stack[1..nd->depth-1].name for subsequent ones. The kludge is basically
a way to recognize the second set becoming empty.
The things get simpler if we keep track of the second set's size explicitly
and always store it in nd->stack[0..depth-1].name. We access the second set
only inside link_path_walk(), so its size can live in a local variable;
that way the check becomes trivial without the need of that kludge.
part 4. some mount traversal cleanups.
29/55 massage __follow_mount_rcu() a bit
make it more similar to non-RCU counterpart
30/55 new helper: traverse_mounts()
the guts of follow_managed() are very similar to
follow_down(). The calling conventions are different (follow_managed()
works with nameidata, follow_down() - with standalone struct path),
but the core loop is pretty much the same in both. Turned that loop
into a common helper (traverse_mounts()) and since follow_managed()
becomes a very thin wrapper around it, expand follow_managed() at its
only call site (in handle_mounts()),
part 5. do_last() untangling.
Control flow in do_last() is an atrocity, and liveness analysis in there
is rather painful. What follows is a massage of that thing into (hopefully)
more straightforward shape; by the end of the series it's still unpleasant,
but at least easier to follow.
A major source of headache is treatment of "we'd already managed to
open it in ->atomic_open()" and "we'd just created that sucker" cases -
that's what gives complicated control flow graph. As it is, we
have the following horror:
#
/------* ends with . or ..?
# |
| #
| /---* found in dcache, no O_CREAT?
| | |
| | # call lookup_open() here.
| | |
| | *---------------\ already opened in ->atomic_open()?
| | | #
| | *---\ | freshly created file?
| | | # |
| \---+ | | finish_lookup:
| # | |
| *---------------------> is it a symlink?
\------+ | | finish_open:
# | |
+--/ | finish_open_created:
# |
+---------------/ opened:
#
To make it even more unpleasant, there is quite a bit of similar,
but not entirely identical logics on parallel branches, some of it
buried in lookup_open() *and* atomic_open() called by it. Keeping
track of that has been hard and that had lead to more than one bug.
31/55 atomic_open(): return the right dentry in FMODE_OPENED case
As it is, several invariants do not hold in "we'd already opened
it in ->atomic_open()" case. In particular, nd->path.dentry might be
pointing to the wrong place by the time we return to do_last() - on
that codepath we don't care anymore. That both makes it harder to reason
about and serves as an obstacle to transformations that would untangle
that mess. Fortunately, it's not hard to regularize.
32/55 atomic_open(): lift the call of may_open() into do_last()
may_open() is called before vfs_open() in "hadn't opened in
->atomic_open()" case. Rightfully so, since vfs_open() for e.g.
devices can have side effects. In "opened in ->atomic_open()" case
we have to do it after the actual opening - the whole point is to
combine open with lookup and we only get the information needed for
may_open() after the combined lookup/open has happened. That's
OK - no side effects are possible in that case. However, we don't
have to keep that call of may_open() inside fs/namei.c:atomic_open();
as the matter of fact, lifting it into do_last() allows to simplify
life there...
33/55 do_last(): merge the may_open() calls
... since now we have the "it's already opened" case in
do_last() rejoin the main path at earlier point.
At that point the horror graph from above has become
#
/------* ends with . or ..?
# |
| #
| /---* found in dcache, no O_CREAT?
| | |
| | # call lookup_open() here.
| | |
| | *---------------\ already opened in ->atomic_open()?
| | | #
| | *---\ | freshly created file?
| | | # |
| \---+ | | finish_lookup:
| # | |
| *---------------------> is it a symlink?
\------+ | | finish_open:
# | |
+--/------------/ finish_open_created:
#
34/55 do_last(): don't bother with keeping got_write in FMODE_OPENED case
Another source of unpleasantness is an attempt to be clever and
keep track of write access status; the thing is, it doesn't really buy
us anything - we could as well drop it right after the lookup_open() and
only regain it for truncation, should such be needed. Makes for much
simpler cleanups on failures and sets the things up for unification of
"already opened" and "new file" branches with the main path...
35/55 do_last(): rejoing the common path earlier in FMODE_{OPENED,CREATED} case
... which we do here.
36/55 do_last(): simplify the liveness analysis past finish_open_created
It also makes possible to shrink the liveness intervals for local
variables.
37/55 do_last(): rejoin the common path even earlier in FMODE_{OPENED,CREATED} case
Further unification of parallel branches.
At that point we get
#
/------* ends with . or ..?
# |
| #
| /---* found in dcache, no O_CREAT?
| | |
| | # call lookup_open() here.
| | |
| | *---\ opened by ->atomic_open() or freshly creatd?
| \---+ | finish_lookup:
| # |
| *---------------------> is it a symlink?
\------+ | finish_open:
| |
+--/ finish_open_created:
#
with very little work done between finish_open: and finish_open_created:,
as well as on any of the side branches. Moreover, we have a pretty clear
separation: most of the work on _opening_ is after finish_open_created
(some of it - conditional), while the work on lookups and creation is all
before that point. Even better, most of the local variables are used
either only before or only after that cutoff point.
38/55 split the lookup-related parts of do_last() into a separate helper
... which allows to separate the lookup-related parts from
open-related ones.
I'm not saying I'm entirely happy with the resulting state of
do_last() clusterfuck, but it got a lot easier to follow and reason
about. There are more cleanups possible (and needed) in there, though -
there will be followups.
part 6: ".." handling
The main problem with .. traversal is that it has open-coded mount
crossing in the end. That mount crossing used to be identical to that done
after the normal name components, but it got overlooked in several series
(most recently - openat2, prior to that - mount traps and automounts) and
now we have an out-of-sync variant (two of them, actually - RCU and non-RCU
cases) festering there and breeding hard-to-spot bugs. The most recent
example was when openat2 got extra checks added to the normal mount crossing;
it added the same to RCU case of .. handling, but missed the non-RCU one.
Nobody noticed during the many rounds of review - me, Christoph and Linus
included.
Another issue is that we are heavier on the locking than we need to
during the rootwards mount traversal there; traversing mounts in other
direction (from mountpoint to mounted) gets by without grabbing mount_lock
exclusive (or dirtying its cacheline in any way, for that matter), even
in non-RCU case. The same should be the case for mounted-to-mountpoint
transitions - except for the case of very rare races with mount --move
and friends, we should be fine with just the seqcount checks there.
The following is how .. handling behaves on just about any post-v7
Unix:
while true
if the caller is chrooted into directory // rare
(A) parent = directory
break
if directory is absolute root // rare
(B) parent = directory
break
if directory is mounted on top of mountpoint // 2nd most common
(C) directory = mountpoint
else // the most common
(D) parent = the parent of directory (within its fs)
break
while something is mounted on parent // unusual setup
(E) parent = whatever overmounts it
return parent
There are 3 paths that execution commonly takes, and they cover almost everything
that occurs in practice:
1) [A] We are in / and we stay there
2) [C,D] We are in root of mounted filesystem,
we step into the underlying mountpoint,
then into the parent of mountpoint.
3) [D] The place we are in is not a root of mounted filesystem,
we step into its parent.
These cases are obvious. However, other execution paths are possible;
in fact, the only constraint is that if we leave the first loop via (A) or (B),
the body of the second loop (i.e. going from mountpoint to mounted) will be
executed at least as many times as (C) (going from mounted to mountpoint)
had been.
A closer look at the predicates in the above shows that "is absolute root"
is actually "is root of a mount and that mount is not attached to anything"
while "is mounted on top of mountpoint" is "is a root of a mount and the
mount is attached to mountpoint". Which suggest the following transformation:
choose_mountpoint(mount, &ancestor)
while mount is attached to something
d = mountpoint(mount)
mount = parent(mount)
if the caller is chrooted into <mount, d>
break
if d is a root of mount
ancestor = <mount, d>
return true
return false
handle_dotdot(directory)
if unlikely(the caller is chrooted into directory)
goto in_root
if unlikely(directory is a root of some mount)
if !choose_mountpoint(mount, &ancestor)
goto in_root
directory = ancestor
parent = the parent of directory (within its fs)
while unlikely(something is mounted on parent)
parent = whatever overmounts it
return parent
in_root:
parent = directory
while unlikely(something is mounted on parent)
parent = whatever overmounts it
return parent
In this form we have mounted-to-mountpoint mount traversals clearly
separated. Moreover, required updates of pathwalk context (nameidata)
can be packed into a call of the same primitive (step_into()) we use
for moves into normal components, including the forward mount
traversals.
NO_XDEV and BENEATH checks (added by openat2 series) fit into that
just fine - NO_XDEV at "directory = ancestor" part, BENEATH - at
in_root. Since the forward mount traversal is done by step_into(),
the regular NO_XDEV checks in there take care of the rest.
The following series massages follow_dotdot/follow_dotdot_rcu() to
that form and does choose_mountpoint() implementation with saner
locking than what we do in mainline now - for RCU case we only need
to check mount_lock seqcount once (in the caller), for non-RCU we
can use a loop similar to what lookup_mnt() does for forward traversals.
39/55 path_connected(): pass mount and dentry separately
40/55 path_parent_directory(): leave changing path->dentry to callers
41/55 follow_dotdot(): expand the call of path_parent_directory()
currently switching to parent is done inside path_parent_directory(),
called from the loop in follow_dotdot(). These 3 commits lift that into
the loop in follow_dotdot() itself...
42/55 follow_dotdot{,_rcu}(): lift switching nd->path to parent out of loop
43/55 follow_dotdot{,_rcu}(): lift LOOKUP_BENEATH checks out of loop
... and out of the loop(s) (both on the RCU and non-RCU sides)
44/55 move put_link() into handle_dots()
45/55 handle_dots(): return ERR_PTR/NULL instead of -E.../0
46/55 move follow_dotdot() and follow_dotdot_rcu() towards handle_dots()
47/55 follow_dotdot{,_rcu}(): preparation to switch to step_into()
gather the pieces for step_into() calls we are about to construct.
48/55 follow_dotdot{,_rcu}(): switch to use of step_into()
... and switch both to it. Now the RCU and non-RCU variants of
the loop that used to do forward mount traversal on .. are replaced with
step_into() calls...
49/55 lift all calls of step_into() out of follow_dotdot/follow_dotdot_rcu
... which can be consolidated. We are done with the forward traversal
parts.
50/55 follow_dotdot{,_rcu}(): massage loops
51/55 follow_dotdot_rcu(): be lazy about changing nd->path
52/55 follow_dotdot(): be lazy about changing nd->path
get the rootwards traversal into shape described above
53/55 helper for mount rootwards traversal
54/55 non-RCU analogue of the previous commit
... and introduce choose_mountpoint{,_rcu}(), switching both RCU and
non-RCU variants to it.
55/55 fs/namei.c: kill follow_mount()
detritus removal - the only remaining caller (path_pts()) ought to
use follow_down() anyway.
From: Al Viro <[email protected]>
preparation to finish_automount() fix (next commit)
Signed-off-by: Al Viro <[email protected]>
---
fs/namespace.c | 47 ++++++++++++++++++++++++-----------------------
1 file changed, 24 insertions(+), 23 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 85b5f7bea82e..dcd015fafe01 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2697,45 +2697,32 @@ static int do_move_mount_old(struct path *path, const char *old_name)
/*
* add a mount into a namespace's mount tree
*/
-static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
+static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
+ struct path *path, int mnt_flags)
{
- struct mountpoint *mp;
- struct mount *parent;
- int err;
+ struct mount *parent = real_mount(path->mnt);
mnt_flags &= ~MNT_INTERNAL_FLAGS;
- mp = lock_mount(path);
- if (IS_ERR(mp))
- return PTR_ERR(mp);
-
- parent = real_mount(path->mnt);
- err = -EINVAL;
if (unlikely(!check_mnt(parent))) {
/* that's acceptable only for automounts done in private ns */
if (!(mnt_flags & MNT_SHRINKABLE))
- goto unlock;
+ return -EINVAL;
/* ... and for those we'd better have mountpoint still alive */
if (!parent->mnt_ns)
- goto unlock;
+ return -EINVAL;
}
/* Refuse the same filesystem on the same mount point */
- err = -EBUSY;
if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
path->mnt->mnt_root == path->dentry)
- goto unlock;
+ return -EBUSY;
- err = -EINVAL;
if (d_is_symlink(newmnt->mnt.mnt_root))
- goto unlock;
+ return -EINVAL;
newmnt->mnt.mnt_flags = mnt_flags;
- err = graft_tree(newmnt, parent, mp);
-
-unlock:
- unlock_mount(mp);
- return err;
+ return graft_tree(newmnt, parent, mp);
}
static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
@@ -2748,6 +2735,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
unsigned int mnt_flags)
{
struct vfsmount *mnt;
+ struct mountpoint *mp;
struct super_block *sb = fc->root->d_sb;
int error;
@@ -2768,7 +2756,13 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
mnt_warn_timestamp_expiry(mountpoint, mnt);
- error = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+ mp = lock_mount(mountpoint);
+ if (IS_ERR(mp)) {
+ mntput(mnt);
+ return PTR_ERR(mp);
+ }
+ error = do_add_mount(real_mount(mnt), mp, mountpoint, mnt_flags);
+ unlock_mount(mp);
if (error < 0)
mntput(mnt);
return error;
@@ -2830,6 +2824,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int finish_automount(struct vfsmount *m, struct path *path)
{
struct mount *mnt = real_mount(m);
+ struct mountpoint *mp;
int err;
/* The new mount record should have at least 2 refs to prevent it being
* expired before we get a chance to add it
@@ -2842,7 +2837,13 @@ int finish_automount(struct vfsmount *m, struct path *path)
goto fail;
}
- err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+ mp = lock_mount(path);
+ if (IS_ERR(mp)) {
+ err = PTR_ERR(mp);
+ goto fail;
+ }
+ err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+ unlock_mount(mp);
if (!err)
return 0;
fail:
--
2.11.0
From: Al Viro <[email protected]>
Only the address of ->total_link_count and the flags.
And fix an off-by-one is ELOOP detection - make it
consistent with symlink following, where we check if
the pre-increment value has reached 40, rather than
check the post-increment one.
[kudos to Christian Brauner for spotted braino]
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 39dd56f5171f..6721c5f7e9d5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1208,7 +1208,7 @@ EXPORT_SYMBOL(follow_up);
* - return -EISDIR to tell follow_managed() to stop and return the path we
* were called with.
*/
-static int follow_automount(struct path *path, struct nameidata *nd)
+static int follow_automount(struct path *path, int *count, unsigned lookup_flags)
{
struct dentry *dentry = path->dentry;
@@ -1223,13 +1223,12 @@ static int follow_automount(struct path *path, struct nameidata *nd)
* as being automount points. These will need the attentions
* of the daemon to instantiate them before they can be used.
*/
- if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
+ if (!(lookup_flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
dentry->d_inode)
return -EISDIR;
- nd->total_link_count++;
- if (nd->total_link_count >= 40)
+ if (count && (*count)++ >= MAXSYMLINKS)
return -ELOOP;
return finish_automount(dentry->d_op->d_automount(path), path);
@@ -1290,7 +1289,8 @@ static int follow_managed(struct path *path, struct nameidata *nd)
/* Handle an automount point */
if (flags & DCACHE_NEED_AUTOMOUNT) {
- ret = follow_automount(path, nd);
+ ret = follow_automount(path, &nd->total_link_count,
+ nd->flags);
if (ret < 0)
break;
continue;
--
2.11.0
On Sun, Mar 1, 2020 at 3:51 PM Al Viro <[email protected]> wrote:
>
> Extended since the last repost. The branch is in #work.dotdot;
> #work.do_last is its beginning (about 2/3 of the total), slightly
> reworked since the last time.
I'm traveling, so only a quick read-through.
One request: can you add the total diffstat to the cover letter (along
with what you used as a base)? I did apply it to a branch just to look
at it more closely, so I can see the final diffstat that way:
Documentation/filesystems/path-lookup.rst | 7 +-
fs/autofs/dev-ioctl.c | 6 +-
fs/internal.h | 1 -
fs/namei.c | 1333 +++++++++------------
fs/namespace.c | 96 +-
fs/open.c | 4 +-
include/linux/namei.h | 4 +-
7 files changed, 642 insertions(+), 809 deletions(-)
but it would have been nice to see in your explanation too.
Anyway, from a quick read-through, I don't see anything that raises my
hackles - you've fixed the goto label naming, and I didn't notice
anything else odd.
Maybe that was because I wasn't careful enough. But the final line
count certainly speaks for the series..
Linus
On Sun, Mar 01, 2020 at 04:34:06PM -0600, Linus Torvalds wrote:
> On Sun, Mar 1, 2020 at 3:51 PM Al Viro <[email protected]> wrote:
> >
> > Extended since the last repost. The branch is in #work.dotdot;
> > #work.do_last is its beginning (about 2/3 of the total), slightly
> > reworked since the last time.
>
> I'm traveling, so only a quick read-through.
>
> One request: can you add the total diffstat to the cover letter (along
> with what you used as a base)?
Sure, no problem (and the base is still -rc1)
> I did apply it to a branch just to look
> at it more closely, so I can see the final diffstat that way:
>
> Documentation/filesystems/path-lookup.rst | 7 +-
> fs/autofs/dev-ioctl.c | 6 +-
> fs/internal.h | 1 -
> fs/namei.c | 1333 +++++++++------------
> fs/namespace.c | 96 +-
> fs/open.c | 4 +-
> include/linux/namei.h | 4 +-
> 7 files changed, 642 insertions(+), 809 deletions(-)
>
> but it would have been nice to see in your explanation too.
>
> Anyway, from a quick read-through, I don't see anything that raises my
> hackles - you've fixed the goto label naming, and I didn't notice
> anything else odd.
>
> Maybe that was because I wasn't careful enough. But the final line
> count certainly speaks for the series..
Heh... Part of my metrics is actually "how large a sheet of paper does
one need to fit the call graph on" ;-)
I hope it gets serious beating, though - it touches pretty much every
codepath in pathname resolution. Is there any way to sic the bots on
a branch, short of "push it into -next and wait for screams"?
Al Viro <[email protected]> writes:
> On Sun, Mar 01, 2020 at 04:34:06PM -0600, Linus Torvalds wrote:
>> On Sun, Mar 1, 2020 at 3:51 PM Al Viro <[email protected]> wrote:
>> >
>> > Extended since the last repost. The branch is in #work.dotdot;
>> > #work.do_last is its beginning (about 2/3 of the total), slightly
>> > reworked since the last time.
>>
>> I'm traveling, so only a quick read-through.
>>
>> One request: can you add the total diffstat to the cover letter (along
>> with what you used as a base)?
>
> Sure, no problem (and the base is still -rc1)
>
>> I did apply it to a branch just to look
>> at it more closely, so I can see the final diffstat that way:
>>
>> Documentation/filesystems/path-lookup.rst | 7 +-
>> fs/autofs/dev-ioctl.c | 6 +-
>> fs/internal.h | 1 -
>> fs/namei.c | 1333 +++++++++------------
>> fs/namespace.c | 96 +-
>> fs/open.c | 4 +-
>> include/linux/namei.h | 4 +-
>> 7 files changed, 642 insertions(+), 809 deletions(-)
>>
>> but it would have been nice to see in your explanation too.
>>
>> Anyway, from a quick read-through, I don't see anything that raises my
>> hackles - you've fixed the goto label naming, and I didn't notice
>> anything else odd.
>>
>> Maybe that was because I wasn't careful enough. But the final line
>> count certainly speaks for the series..
>
> Heh... Part of my metrics is actually "how large a sheet of paper does
> one need to fit the call graph on" ;-)
>
> I hope it gets serious beating, though - it touches pretty much every
> codepath in pathname resolution. Is there any way to sic the bots on
> a branch, short of "push it into -next and wait for screams"?
Last I looked pushing a branch to kernel.org was enough for the
kbuild bots. Sending patches to LKML is also enough for those bots.
I don't know if that kind of bot is what you need testing your code.
Eric
On Tue, Mar 03, 2020 at 05:48:31PM -0600, Eric W. Biederman wrote:
> > I hope it gets serious beating, though - it touches pretty much every
> > codepath in pathname resolution. Is there any way to sic the bots on
> > a branch, short of "push it into -next and wait for screams"?
>
> Last I looked pushing a branch to kernel.org was enough for the
> kbuild bots. Sending patches to LKML is also enough for those bots.
>
> I don't know if that kind of bot is what you need testing your code.
Build bots are generally nice, but in this case... pretty much all of
the changes are in fs/namei.c, which is not all that sensitive to
config/architecture/whatnot. Sure, something like "is audit enabled?"
may affect the build problems, but not much beyond that.
What was that Intel-run(?) bot that posts "such-and-such metrics has
42% regression on such-and-such commit" from time to time?
<checks>
Subject: [locking/qspinlock] 7b6da71157: unixbench.score 8.4% improvement
seems to be the latest of that sort,
From: kernel test robot <[email protected]>
Not sure how much of pathwalk-heavy loads is covered by profiling
bots of that sort, unfortunately... ;-/
Al Viro <[email protected]> writes:
> On Tue, Mar 03, 2020 at 05:48:31PM -0600, Eric W. Biederman wrote:
>
>> > I hope it gets serious beating, though - it touches pretty much every
>> > codepath in pathname resolution. Is there any way to sic the bots on
>> > a branch, short of "push it into -next and wait for screams"?
>>
>> Last I looked pushing a branch to kernel.org was enough for the
>> kbuild bots. Sending patches to LKML is also enough for those bots.
>>
>> I don't know if that kind of bot is what you need testing your code.
>
> Build bots are generally nice, but in this case... pretty much all of
> the changes are in fs/namei.c, which is not all that sensitive to
> config/architecture/whatnot. Sure, something like "is audit enabled?"
> may affect the build problems, but not much beyond that.
>
> What was that Intel-run(?) bot that posts "such-and-such metrics has
> 42% regression on such-and-such commit" from time to time?
> <checks>
> Subject: [locking/qspinlock] 7b6da71157: unixbench.score 8.4% improvement
> seems to be the latest of that sort,
> From: kernel test robot <[email protected]>
>
> Not sure how much of pathwalk-heavy loads is covered by profiling
> bots of that sort, unfortunately... ;-/
Do the xfs-tests cover that sort of thing?
The emphasis is stress testing the filesystem not the VFS but there is a
lot of overlap between the two.
Eric
On Tue, Mar 03, 2020 at 11:23:39PM -0600, Eric W. Biederman wrote:
> Al Viro <[email protected]> writes:
>
> > On Tue, Mar 03, 2020 at 05:48:31PM -0600, Eric W. Biederman wrote:
> >
> >> > I hope it gets serious beating, though - it touches pretty much every
> >> > codepath in pathname resolution. Is there any way to sic the bots on
> >> > a branch, short of "push it into -next and wait for screams"?
> >>
> >> Last I looked pushing a branch to kernel.org was enough for the
> >> kbuild bots. Sending patches to LKML is also enough for those bots.
> >>
> >> I don't know if that kind of bot is what you need testing your code.
> >
> > Build bots are generally nice, but in this case... pretty much all of
> > the changes are in fs/namei.c, which is not all that sensitive to
> > config/architecture/whatnot. Sure, something like "is audit enabled?"
> > may affect the build problems, but not much beyond that.
> >
> > What was that Intel-run(?) bot that posts "such-and-such metrics has
> > 42% regression on such-and-such commit" from time to time?
> > <checks>
> > Subject: [locking/qspinlock] 7b6da71157: unixbench.score 8.4% improvement
> > seems to be the latest of that sort,
> > From: kernel test robot <[email protected]>
> >
> > Not sure how much of pathwalk-heavy loads is covered by profiling
> > bots of that sort, unfortunately... ;-/
>
> Do the xfs-tests cover that sort of thing?
> The emphasis is stress testing the filesystem not the VFS but there is a
> lot of overlap between the two.
I do run xfstests. But "runs in KVM without visible slowdowns" != "won't
cause them on 48-core bare metal". And this area (especially when it
comes to RCU mode) can be, er, interesting in that respect.
FWIW, I'm putting together some litmus tests for pathwalk semantics -
one of the things I'd like to discuss at LSF; quite a few codepaths
are simply not touched by anything in xfstests.
On Wed, Mar 04, 2020 at 06:55:47AM +0000, Al Viro wrote:
> On Tue, Mar 03, 2020 at 11:23:39PM -0600, Eric W. Biederman wrote:
> > Do the xfs-tests cover that sort of thing?
> > The emphasis is stress testing the filesystem not the VFS but there is a
> > lot of overlap between the two.
>
> I do run xfstests. But "runs in KVM without visible slowdowns" != "won't
> cause them on 48-core bare metal". And this area (especially when it
> comes to RCU mode) can be, er, interesting in that respect.
>
> FWIW, I'm putting together some litmus tests for pathwalk semantics -
> one of the things I'd like to discuss at LSF; quite a few codepaths
> are simply not touched by anything in xfstests.
Might be more appropriate for LTP than xfstests? will-it-scale might be
the right place for performance benchmarks.
On Wed, Mar 04, 2020 at 05:28:12AM -0800, Matthew Wilcox wrote:
> On Wed, Mar 04, 2020 at 06:55:47AM +0000, Al Viro wrote:
> > On Tue, Mar 03, 2020 at 11:23:39PM -0600, Eric W. Biederman wrote:
> > > Do the xfs-tests cover that sort of thing?
> > > The emphasis is stress testing the filesystem not the VFS but there is a
> > > lot of overlap between the two.
> >
> > I do run xfstests. But "runs in KVM without visible slowdowns" != "won't
> > cause them on 48-core bare metal". And this area (especially when it
> > comes to RCU mode) can be, er, interesting in that respect.
> >
> > FWIW, I'm putting together some litmus tests for pathwalk semantics -
> > one of the things I'd like to discuss at LSF; quite a few codepaths
> > are simply not touched by anything in xfstests.
>
> Might be more appropriate for LTP than xfstests? will-it-scale might be
> the right place for performance benchmarks.
Might be... I do run LTP as well, but it's still a 4-way KVM on a 6-way
amd64 host (phenom II X6 1100T) - not well-suited for catching scalability
issues.
Litmus tests mentioned above are more about verifying the semantics;
I hadn't moved past the "bunch of home-grown scripts creating setups
that would exercise the codepaths in question + trivial pieces
in C, pretty much limited to syscall()" stage with that; moving those
to LTP framework is something I'll need to look into. Might very well
make sense; for now I just want a way to get test coverage of that code
with minimal headache.
On Wed, Mar 04, 2020 at 04:20:51PM +0000, Al Viro wrote:
> On Wed, Mar 04, 2020 at 05:28:12AM -0800, Matthew Wilcox wrote:
> > On Wed, Mar 04, 2020 at 06:55:47AM +0000, Al Viro wrote:
> > > On Tue, Mar 03, 2020 at 11:23:39PM -0600, Eric W. Biederman wrote:
> > > > Do the xfs-tests cover that sort of thing?
> > > > The emphasis is stress testing the filesystem not the VFS but there is a
> > > > lot of overlap between the two.
> > >
> > > I do run xfstests. But "runs in KVM without visible slowdowns" != "won't
> > > cause them on 48-core bare metal". And this area (especially when it
> > > comes to RCU mode) can be, er, interesting in that respect.
> > >
> > > FWIW, I'm putting together some litmus tests for pathwalk semantics -
> > > one of the things I'd like to discuss at LSF; quite a few codepaths
> > > are simply not touched by anything in xfstests.
> >
> > Might be more appropriate for LTP than xfstests? will-it-scale might be
> > the right place for performance benchmarks.
>
> Might be... I do run LTP as well, but it's still a 4-way KVM on a 6-way
> amd64 host (phenom II X6 1100T) - not well-suited for catching scalability
> issues.
>
> Litmus tests mentioned above are more about verifying the semantics;
> I hadn't moved past the "bunch of home-grown scripts creating setups
> that would exercise the codepaths in question + trivial pieces
> in C, pretty much limited to syscall()" stage with that; moving those
> to LTP framework is something I'll need to look into. Might very well
> make sense; for now I just want a way to get test coverage of that code
> with minimal headache.
FWIW, I've just took mainline and added to follow_dotdot{,_rcu}() (.. handling,
RCU and non-RCU cases) gathering the number of full iterations through the
first loop, the way that loop had been left and the number of full iterations
through the second loop. Anything other than 3 common cases (think /..,
/proc/.., /proc/scsi/..) get reported. Built, booted and started xfstests
on it. So far all it has caught was .. into overmounted root early in the
boot (0 full passes through the first loop, left it via the first break
when it ran into current->fs->root, 1 full pass through the second loop).
Nothing from xfstests run (yet); LTP will be the next one to go...
Note, BTW, that this unusual call in early boot has hit RCU case and would
almost certainly do so on all boxen. So in all of that there had not been
a single sodding call that would've hit the traversal into overmount of
parent in non-RCU case.
On Wed, Mar 04, 2020 at 09:59:46PM +1100, Aleksa Sarai wrote:
> > FWIW, I'm putting together some litmus tests for pathwalk semantics -
> > one of the things I'd like to discuss at LSF; quite a few codepaths
> > are simply not touched by anything in xfstests.
>
> I won't be at LSF unfortunately, but this is something I would be very
> interested in helping with -- one of the things I've noticed is the lack
> of a test-suite for some of the more generic VFS bits (such as namei).
BTW, has anyone tried to run tests with oprofile and see how much of the
core kernel gets exercised? That looks like an obvious thing to try -
at least the places outside of spin_lock_irq() ought to get lit after
a while...
Have any CI folks tried doing that, or am I missing some obvious reason
why that is not feasible?
On Wed, Mar 04, 2020 at 09:00:31PM +0000, Al Viro wrote:
> On Wed, Mar 04, 2020 at 09:59:46PM +1100, Aleksa Sarai wrote:
>
> > > FWIW, I'm putting together some litmus tests for pathwalk semantics -
> > > one of the things I'd like to discuss at LSF; quite a few codepaths
> > > are simply not touched by anything in xfstests.
> >
> > I won't be at LSF unfortunately, but this is something I would be very
> > interested in helping with -- one of the things I've noticed is the lack
> > of a test-suite for some of the more generic VFS bits (such as namei).
>
> BTW, has anyone tried to run tests with oprofile and see how much of the
> core kernel gets exercised? That looks like an obvious thing to try -
> at least the places outside of spin_lock_irq() ought to get lit after
> a while...
>
> Have any CI folks tried doing that, or am I missing some obvious reason
> why that is not feasible?
I don't know about oprofile, but LTP got their gcov patches merged
into 2.6.31:
http://ltp.sourceforge.net/coverage/gcov.php
Extended since the last repost. The branch is in #work.dotdot,
still based at 5.6-rc1. Diffstat is
Documentation/filesystems/path-lookup.rst | 7 +-
fs/autofs/dev-ioctl.c | 6 +-
fs/internal.h | 1 -
fs/namei.c | 1465 ++++++++++++-----------------
fs/namespace.c | 96 +-
fs/open.c | 4 +-
include/linux/namei.h | 4 +-
7 files changed, 675 insertions(+), 908 deletions(-)
Individual patches are in the followups. Branch survives
the local testing (including ltp and xfstests).
Review and testing would be _very_ welcome; it does a lot
of massage, so there had been a plenty of opportunities to fuck up
and fail to spot that. The same goes for profiling - it doesn't
seem to slow the things down, but that needs to be verified. If
nobody screams, into -next it goes in a few days...
Changes since v3: ".." series (part 6) got cleaner up,
a bunch of pick_link() cleanups added (part 7) and more path_openat()/
do_last()/open_last_lookups() refactoring is added in the end (part 8).
part 1: follow_automount() cleanups and fixes.
Quite a bit of that function had been about working around the
wrong calling conventions of finish_automount(). The problem is that
finish_automount() misuses the primitive intended for mount(2) and
friends, where we want to mount on top of the pile, even if something
has managed to add to that while we'd been trying to lock the namespace.
For automount that's not the right thing to do - there we want to discard
whatever it was going to attach and just cross into what got mounted
there in the meanwhile (most likely - the results of the same automount
triggered by somebody else). Current mainline kinda-sorta manages to do
that, but it's unreliable and very convoluted. Much simpler approach
is to stop using lock_mount() in finish_automount() and have it bail
out if something turns out to have been mounted on top where we wanted
to attach. That allows to get rid of a lot of PITA in the caller.
Another simplification comes from not trying to cross into the results
of automount - simply ride through the next iteration of the loop and
let it move into overmount.
Another thing in the same series is divorcing follow_automount()
from nameidata; that'll play later when we get to unifying follow_down()
with the guts of follow_managed().
4 commits, the second one fixes a hard-to-hit race. The first
is a prereq for it.
1/69 do_add_mount(): lift lock_mount/unlock_mount into callers
2/69 fix automount/automount race properly
3/69 follow_automount(): get rid of dead^Wstillborn code
4/69 follow_automount() doesn't need the entire nameidata
part 2: unifying mount traversals in pathwalk.
Handling of mount traversal (follow_managed()) is currently called
in a bunch of places. Each of them is shortly followed by a call of
step_into() or an open-coded equivalent thereof. However, the locations
of those step_into() calls are far from preceding follow_managed();
moreover, that preceding call might happen on different paths that
converge to given step_into() call. It's harder to analyse that it should
be (especially when it comes to liveness analysis) and it forces rather
ugly calling conventions on lookup_fast()/atomic_open()/lookup_open().
The series below massages the code to the point when the calls of
follow_managed() (and __follow_mount_rcu()) move into the beginning of
step_into().
5/69 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
gets EEXIST handling in do_last() past the step_into() call there.
6/69 handle_mounts(): start building a sane wrapper for follow_managed()
rather than mangling follow_managed() itself (and creating conflicts
with openat2 series), add a wrapper that will absorb the required
interface changes.
7/69 atomic_open(): saner calling conventions (return dentry on success)
struct path passed to it is pure out parameter; only dentry part
ever varies, though - mnt is always nd->path.mnt. Just return
the dentry on success, and ERR_PTR(-E...) on failure.
8/69 lookup_open(): saner calling conventions (return dentry on success)
propagate the same change one level up the call chain.
9/69 do_last(): collapse the call of path_to_nameidata()
struct path filled in lookup_open() call is eventually given to
handle_mounts(); the only use it has before that is path_to_nameidata()
call in "->atomic_open() has actually opened it" case, and there
path_to_nameidata() is an overkill - we are guaranteed to replace
only nd->path.dentry. So have the struct path filled only immediately
prior to handle_mounts().
10/69 handle_mounts(): pass dentry in, turn path into a pure out argument
now all callers of handle_mount() are directly preceded by filling
struct path it gets. path->mnt is nd->path.mnt in all cases, so we can
pass just the dentry instead and fill path in handle_mount() itself.
Some boilerplate gone, path is pure out argument of handle_mount()
now.
11/69 lookup_fast(): consolidate the RCU success case
massage to gather what will become an RCU case equivalent of
handle_mounts(); basically, that's what we do if revalidate succeeds
in RCU case of lookup_fast(), including unlazy and fallback to
handle_mounts() if __follow_mount_rcu() says "it's too tricky".
12/69 teach handle_mounts() to handle RCU mode
... and take that into handle_mount() itself. The other caller of
__follow_mount_rcu() is fine with the same fallback (it just didn't
bother since it's in the very beginning of pathwalk), switched to
handle_mount() as well.
13/69 lookup_fast(): take mount traversal into callers
Now we are getting somewhere - both RCU and non-RCU success cases of
lookup_fast() are ended with the same return handle_mounts(...);
move that to the callers - there it will merge with the identical calls
that had been on the paths where we had to do slow lookups.
lookup_fast() returns dentry now.
14/69 step_into() callers: dismiss the symlink earlier
dismiss the symlink being traversed as soon as we know we are done
looking at its body; do that directly from step_into() callers, don't
leave it for step_into() to do.
15/69 new step_into() flag: WALK_NOFOLLOW
use step_into() instead of open-coding it in handle_lookup_down().
Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for
that (and eventually, I hope, for .. handling).
Now *all* calls of handle_mounts() and step_into() are right next to
each other.
16/69 fold handle_mounts() into step_into()
... and we can move the call of handle_mounts() into step_into(),
getting a slightly saner calling conventions out of that.
17/69 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
another payoff from 14/17 - we can teach path_lookupat() to do
what path_mountpointat() used to. And kill the latter, along with
its wrappers.
18/69 expand the only remaining call of path_lookup_conditional()
minor cleanup - RIP path_lookup_conditional(). Only one caller left.
Changes so far:
* mount traversal is taken into step_into().
* lookup_fast(), atomic_open() and lookup_open() calling conventions
are slightly changed. All of them return dentry now, instead of returning
an int and filling struct path on success. For lookup_fast() the old
"0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache
miss, dentry - for hit".
* step_into() can be called in RCU mode as well. Takes nameidata,
WALK_... flags, dentry and, in RCU case, corresponding inode and seq value.
Handles mount traversals, decides whether it's a symlink to be followed.
Error => returns -E...; symlink to follow => returns 1, puts symlink on stack;
non-symlink or symlink not to follow => returns 0, moves nd->path to new location.
* LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends
became calls of user_path_at() et.al. with LOOKUP_MOUNTPOINT in flags.
part 3: untangling the symlink handling.
Right now when we decide to follow a symlink it happens this way:
* step_into() decides that it has been given a symlink that needs to
be followed.
* it calls pick_link(), which pushes the symlink on stack and
returns 1 on success / -E... on error. Symlink's mount/dentry/seq is
stored on stack and the inode is stashed in nd->link_inode.
* step_into() passes that 1 to its callers, which proceed to pass it
up the call chain for several layers. In all cases we get to get_link()
call shortly afterwards.
* get_link() is called, picks the inode stashed in nd->link_inode
by the pick_link(), does some checks, touches the atime, etc.
* get_link() either picks the link body out of inode or calls
->get_link(). If it's an absolute symlink, we move to the root and return
the relative portion of the body; if it's a relative one - just return the
body. If it's a procfs-style one, the call of nd_jump_link() has been
made and we'd moved to whatever location is desired. And return NULL,
same as we do for symlink to "/".
* the caller proceeds to deal with the string returned to it.
The sequence is the same in all cases (nested symlink, trailing
symlink on lookup, trailing symlink on open), but its pieces are not close
to each other and the bit between the call of pick_link() and (inevitable)
call of get_link() afterwards is not easy to follow. Moreover, a bunch
of functions (walk_component/lookup_last/do_last) ends up with the same
conventions for return values as step_into(). And those conventions
(see above) are not pretty - 0/1/-E... is asking for mistakes, especially
when returned 1 is used only to direct control flow on a rather twisted
way to matching get_link() call. And that path can be seriously twisted.
E.g. when we are trying to open /dev/stdin, we get the following sequence:
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast(). Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing. Into the stack it goes and we return 1.
* do_last() sees 1 and returns it.
* trailing_symlink() is called (in the top-level loop) and it
calls get_link(). OK, we get "/proc/self/fd/0" for body, move to
root again and return "proc/self/fd/0".
* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get
to look at what we'd been given. A symlink to follow, so on the stack
it goes and we return 1.
* again, do_last() passes 1 to caller
* trailing_symlink() is called and calls get_link().
* this time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin. And returns NULL. But the
fun doesn't stop yet.
* trailing_symlink() returns "" to the caller
* link_path_walk() is called on that and does nothing
whatsoever.
* do_last() is called and sees LAST_BIND left by the get_link().
It calls handle_dots()
* handle_dots() drops the symlink from stack and returns
* do_last() *FINALLY* proceeds to the point after its call of
step_into() (finish_open:) and gets around to opening the damn thing.
Making sense of the control flow through all of that is not fun,
to put it mildly; debugging anything in that area can be a massive PITA,
and this example has touched only one of 3 cases. Arguably, the worst
one, but... Anyway, it turns out that this code can be massaged to
considerably saner shape - both in terms of control flow and wrt calling
conventions.
19/69 merging pick_link() with get_link(), part 1
prep work: move the "hardening" crap from trailing_symlink() into
get_link() (conditional on the absense of LOOKUP_PARENT in nd->flags).
We'll be moving the calls of get_link() around quite a bit through that
series, and the next step will be to eliminate trailing_symlink().
20/69 merging pick_link() with get_link(), part 2
fold trailing_symlink() into lookup_last() and do_last().
Now these are returning strings; it's not the final calling conventions,
but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) =>
old -E..., we'd failed. string => old 1, and the string is the symlink
body to follow. Just as for trailing_symlink(), "/" and procfs ones
(where get_link() returns NULL) yield "", so the ugly song and dance
with no-op trip through link_path_walk()/handle_dots() still remains.
21/69 merging pick_link() with get_link(), part 3
elimination of that round-trip. In *all* cases having
get_link() return NULL on such symlinks means that we'll proceed to
drop the symlink from stack and get back to the point near that
get_link() call - basically, where we would be if it hadn't been
a symlink at all. The path by which we are getting there depends
upon the call site; the end result is the same in all cases - such
symlinks (procfs ones and symlink to "/") are fully processed by
the time get_link() returns, so we could as well drop them from the
stack right in get_link(). Makes life simpler in terms of control
flow analysis...
And now the calling conventions for do_last() and lookup_last()
have reached the final shape - ERR_PTR(-E...) for error, NULL for
"we are done", string for "traverse this".
22/69 merging pick_link() with get_link(), part 4
now all calls of walk_component() are followed by the same
boilerplate - "if it has returned 1, call get_link() and if that
has returned NULL treat that as if walk_component() has returned 0".
Eliminate by folding that into walk_component() itself. Now
walk_component() return value conventions have joined those of
do_last()/lookup_last().
23/69 merging pick_link() with get_link(), part 5
same as for the previous, only this time the boilerplate
migrates one level down, into step_into(). Only one caller of
get_link() left, step_into() has joined the same return value
conventions.
24/69 merging pick_link() with get_link(), part 6
move that thing into pick_link(). Now all traces of
"return 1 if we are following a symlink" are gone.
25/69 finally fold get_link() into pick_link()
ta-da - expand get_link() into the only caller. As a side
benefit, we get rid of stashing the inode in nd->link_inode - it
was done only to carry that piece of information from pick_link()
to eventual get_link(). That's not the main benefit, though - the
control flow became considerably easier to reason about.
For what it's worth, the example above (/dev/stdin) becomes
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast(). Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing. On the stack it goes and we get its body. Which is
"/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
* do_last() sees non-NULL and returns it - whether it's an error
or a pathname to traverse, we hadn't reached something we'll be opening.
* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get
to look at what we'd been given. A symlink to follow, so on the stack
it goes. This time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin. And returns NULL. So we
drop symlink from stack and return that NULL to caller.
* that NULL is returned by step_into(), same as if we had just
moved to a non-symlink.
* do_last() proceeds to open the damn thing.
Some low-hanging fruits become available: LAST_BIND can be removed and
the predicate controlling may_follow_link() we'd moved into pick_link()
in #18 can be made more straightforward:
26/69 LAST_BIND removal
The only reason to keep it had just been eliminated - it was
needed to route the control flow through that weird last iteration
through the loop. With that iteration gone...
27/69 invert the meaning of WALK_FOLLOW
28/69 pick_link(): check for WALK_TRAILING, not LOOKUP_PARENT
In #18 the checks specific to trailing symlinks got moved into
pick_link(), where they were made conditional upon LOOKUP_PARENT in
nd->flags. That works, but it's more subtle than I would like it to
be - it depends upon the dynamic state (nd->flags) which gets changed
through the pathwalk and it's sensitive to exact locations where we
flip LOOKUP_PARENT. Now we have a more robust way to do that - the
call chains that end up in pick_link() with LOOKUP_PARENT in nd->flags
are those that had WALK_TRAILING passed to the immediate caller of
pick_link() (step_into()). So we can pass WALK_... down to pick_link()
and turn the check into explicit "if we are passed WALK_TRAILING,
it's a trailing symlink and we need to apply the checks in may_follow()".
We could, in principle, reorder these two commits into the
very beginning of symlink series; that would make #18 slightly simpler
at the cost of (marginally) more boilerplate to carry through the
get_link() call moves. Not sure if it's worth doing, though...
29/69 link_path_walk(): simplify stack handling
Another cleanup that becomes possible is handling of the stack(s).
We use nd->stack to store two things: pinning down the symlinks we are
resolving and resuming the name traversal when a nested symlink is finished.
Currently, nd->depth is used to keep track of both. It's 0 when we call
link_path_walk() for the first time (for the pathname itself) and 1 on
all subsequent calls (for trailing symlinks, if any). That's fine,
as far as pinning symlinks goes - when handling a trailing symlink,
the string we are interpreting is the body of symlink pinned down in
nd->stack[0]. It's rather inconvenient with respect to handling nested
symlinks, though - when we run out of a string we are currently interpreting,
we need to decide whether it's a nested symlink (in which case we need
to pick the string saved back when we started to interpret that nested
symlink and resume its traversal) or not (in which case we are done with
link_path_walk()).
Current solution is a bit of a kludge - in handling of trailing symlink
(in lookup_last() and open_last_lookups() we clear nd->stack[0].name.
That allows link_path_walk() to use the following rules when running out of
a string to interpret:
* if nd->depth is zero, we are at the end of pathname itself.
* if nd->depth is positive, check the saved string; for
nested symlink it will be non-NULL, for trailing symlink - NULL.
It works, but it's rather non-obvious. Note that we have two sets:
the set of symlinks currently being traversed and the set of postponed
pathname tails. The former is stored in nd->stack[0..nd->depth-1].link
and it's valid throught the pathname resolution; the latter is valid only
during an individual call of link_path_walk() and it occupies
nd->stack[0..nd->depth-1].name for the first call of link_path_walk() and
nd->stack[1..nd->depth-1].name for subsequent ones. The kludge is basically
a way to recognize the second set becoming empty.
The things get simpler if we keep track of the second set's size explicitly
and always store it in nd->stack[0..depth-1].name. We access the second set
only inside link_path_walk(), so its size can live in a local variable;
that way the check becomes trivial without the need of that kludge.
30/69 namei: have link_path_walk() maintain LOOKUP_PARENT
just set it on the entry into link_path_walk() and clear when
we get to the last component. Removes boilerplate from lookup_last()
and do_last().
part 4. some mount traversal cleanups.
31/69 massage __follow_mount_rcu() a bit
make it more similar to non-RCU counterpart
32/69 new helper: traverse_mounts()
the guts of follow_managed() are very similar to
follow_down(). The calling conventions are different (follow_managed()
works with nameidata, follow_down() - with standalone struct path),
but the core loop is pretty much the same in both. Turned that loop
into a common helper (traverse_mounts()) and since follow_managed()
becomes a very thin wrapper around it, expand follow_managed() at its
only call site (in handle_mounts()),
part 5. do_last() untangling.
Control flow in do_last() is an atrocity, and liveness analysis in there
is rather painful. What follows is a massage of that thing into (hopefully)
more straightforward shape; by the end of the series it's still unpleasant,
but at least easier to follow.
A major source of headache is treatment of "we'd already managed to
open it in ->atomic_open()" and "we'd just created that sucker" cases -
that's what gives complicated control flow graph. As it is, we
have the following horror:
#
/------* ends with . or ..?
# |
| #
| /---* found in dcache, no O_CREAT?
| | |
| | # call lookup_open() here.
| | |
| | *---------------\ already opened in ->atomic_open()?
| | | #
| | *---\ | freshly created file?
| | | # |
| \---+ | | finish_lookup:
| # | |
| *---------------------> is it a symlink?
\------+ | | finish_open:
# | |
+--/ | finish_open_created:
# |
+---------------/ opened:
#
To make it even more unpleasant, there is quite a bit of similar,
but not entirely identical logics on parallel branches, some of it
buried in lookup_open() *and* atomic_open() called by it. Keeping
track of that has been hard and that had lead to more than one bug.
33/69 atomic_open(): return the right dentry in FMODE_OPENED case
As it is, several invariants do not hold in "we'd already opened
it in ->atomic_open()" case. In particular, nd->path.dentry might be
pointing to the wrong place by the time we return to do_last() - on
that codepath we don't care anymore. That both makes it harder to reason
about and serves as an obstacle to transformations that would untangle
that mess. Fortunately, it's not hard to regularize.
34/69 atomic_open(): lift the call of may_open() into do_last()
may_open() is called before vfs_open() in "hadn't opened in
->atomic_open()" case. Rightfully so, since vfs_open() for e.g.
devices can have side effects. In "opened in ->atomic_open()" case
we have to do it after the actual opening - the whole point is to
combine open with lookup and we only get the information needed for
may_open() after the combined lookup/open has happened. That's
OK - no side effects are possible in that case. However, we don't
have to keep that call of may_open() inside fs/namei.c:atomic_open();
as the matter of fact, lifting it into do_last() allows to simplify
life there...
35/69 do_last(): merge the may_open() calls
... since now we have the "it's already opened" case in
do_last() rejoin the main path at earlier point.
At that point the horror graph from above has become
#
/------* ends with . or ..?
# |
| #
| /---* found in dcache, no O_CREAT?
| | |
| | # call lookup_open() here.
| | |
| | *---------------\ already opened in ->atomic_open()?
| | | #
| | *---\ | freshly created file?
| | | # |
| \---+ | | finish_lookup:
| # | |
| *---------------------> is it a symlink?
\------+ | | finish_open:
# | |
+--/------------/ finish_open_created:
#
36/69 do_last(): don't bother with keeping got_write in FMODE_OPENED case
Another source of unpleasantness is an attempt to be clever and
keep track of write access status; the thing is, it doesn't really buy
us anything - we could as well drop it right after the lookup_open() and
only regain it for truncation, should such be needed. Makes for much
simpler cleanups on failures and sets the things up for unification of
"already opened" and "new file" branches with the main path...
37/69 do_last(): rejoing the common path earlier in FMODE_{OPENED,CREATED} case
... which we do here.
38/69 do_last(): simplify the liveness analysis past finish_open_created
It also makes possible to shrink the liveness intervals for local
variables.
39/69 do_last(): rejoin the common path even earlier in FMODE_{OPENED,CREATED} case
Further unification of parallel branches.
At that point we get
#
/------* ends with . or ..?
# |
| #
| /---* found in dcache, no O_CREAT?
| | |
| | # call lookup_open() here.
| | |
| | *---\ opened by ->atomic_open() or freshly creatd?
| \---+ | finish_lookup:
| # |
| *---------------------> is it a symlink?
\------+ | finish_open:
| |
+--/ finish_open_created:
#
with very little work done between finish_open: and finish_open_created:,
as well as on any of the side branches. Moreover, we have a pretty clear
separation: most of the work on _opening_ is after finish_open_created
(some of it - conditional), while the work on lookups and creation is all
before that point. Even better, most of the local variables are used
either only before or only after that cutoff point.
40/69 split the lookup-related parts of do_last() into a separate helper
... which allows to separate the lookup-related parts from
open-related ones.
I'm not saying I'm entirely happy with the resulting state of
do_last() clusterfuck, but it got a lot easier to follow and reason
about. There are more cleanups possible (and needed) in there, though -
there will be followups.
part 6: ".." handling
The main problem with .. traversal is that it has open-coded mount
crossing in the end. That mount crossing used to be identical to that done
after the normal name components, but it got overlooked in several series
(most recently - openat2, prior to that - mount traps and automounts) and
now we have an out-of-sync variant (two of them, actually - RCU and non-RCU
cases) festering there and breeding hard-to-spot bugs. The most recent
example was when openat2 got extra checks added to the normal mount crossing;
it added the same to RCU case of .. handling, but missed the non-RCU one.
Nobody noticed during the many rounds of review - me, Christoph and Linus
included.
Another issue is that we are heavier on the locking than we need to
during the rootwards mount traversal there; traversing mounts in other
direction (from mountpoint to mounted) gets by without grabbing mount_lock
exclusive (or dirtying its cacheline in any way, for that matter), even
in non-RCU case. The same should be the case for mounted-to-mountpoint
transitions - except for the case of very rare races with mount --move
and friends, we should be fine with just the seqcount checks there.
The following is how .. handling behaves on just about any post-v7
Unix:
while true
if the caller is chrooted into directory // rare
(A) parent = directory
break
if directory is absolute root // rare
(B) parent = directory
break
if directory is mounted on top of mountpoint // 2nd most common
(C) directory = mountpoint
else // the most common
(D) parent = the parent of directory (within its fs)
break
while something is mounted on parent // unusual setup
(E) parent = whatever overmounts it
return parent
There are 3 paths that execution commonly takes, and they cover almost everything
that occurs in practice:
1) [A] We are in / and we stay there
2) [C,D] We are in root of mounted filesystem,
we step into the underlying mountpoint,
then into the parent of mountpoint.
3) [D] The place we are in is not a root of mounted filesystem,
we step into its parent.
These cases are obvious. However, other execution paths are possible;
in fact, the only constraint is that if we leave the first loop via (A) or (B),
the body of the second loop (i.e. going from mountpoint to mounted) will be
executed at least as many times as (C) (going from mounted to mountpoint)
had been.
A closer look at the predicates in the above shows that "is absolute root"
is actually "is root of a mount and that mount is not attached to anything"
while "is mounted on top of mountpoint" is "is a root of a mount and the
mount is attached to mountpoint". Which suggest the following transformation:
choose_mountpoint(mount, &ancestor)
while mount is attached to something
d = mountpoint(mount)
mount = parent(mount)
if the caller is chrooted into <mount, d>
break
if d is a root of mount
ancestor = <mount, d>
return true
return false
handle_dotdot(directory)
if unlikely(the caller is chrooted into directory)
goto in_root
if unlikely(directory is a root of some mount)
if !choose_mountpoint(mount, &ancestor)
goto in_root
directory = ancestor
parent = the parent of directory (within its fs)
while unlikely(something is mounted on parent)
parent = whatever overmounts it
return parent
in_root:
parent = directory
while unlikely(something is mounted on parent)
parent = whatever overmounts it
return parent
In this form we have mounted-to-mountpoint mount traversals clearly
separated. Moreover, required updates of pathwalk context (nameidata)
can be packed into a call of the same primitive (step_into()) we use
for moves into normal components, including the forward mount
traversals.
NO_XDEV and BENEATH checks (added by openat2 series) fit into that
just fine - NO_XDEV at "directory = ancestor" part, BENEATH - at
in_root. Since the forward mount traversal is done by step_into(),
the regular NO_XDEV checks in there take care of the rest.
The following series massages follow_dotdot/follow_dotdot_rcu() to
that form and does choose_mountpoint() implementation with saner
locking than what we do in mainline now - for RCU case we only need
to check mount_lock seqcount once (in the caller), for non-RCU we
can use a loop similar to what lookup_mnt() does for forward traversals.
41/69 path_connected(): pass mount and dentry separately
42/69 path_parent_directory(): leave changing path->dentry to callers
43/69 follow_dotdot(): expand the call of path_parent_directory()
currently switching to parent is done inside path_parent_directory(),
called from the loop in follow_dotdot(). These 3 commits lift that into
the loop in follow_dotdot() itself...
44/69 follow_dotdot{,_rcu}(): lift switching nd->path to parent out of loop
45/69 follow_dotdot{,_rcu}(): lift LOOKUP_BENEATH checks out of loop
... and out of the loop(s) (both on the RCU and non-RCU sides)
Next part is to replace the second halves (crossing into parent and whatever
might be overmounting it) to step_into().
46/69 move handle_dots(), follow_dotdot() and follow_dotdot_rcu() past step_into()
pure move - get them into the right place
47/69 handle_dots(), follow_dotdot{,_rcu}(): preparation to switch to step_into()
convert to returning ERR_PTR()/NULL instead of -E.../0 - that's what
step_into() returns and the callers are actually happier that way.
48/69 follow_dotdot{,_rcu}(): switch to use of step_into()
... and switch both to it. Now the RCU and non-RCU variants of
the loop that used to do forward mount traversal on .. are replaced with
step_into() calls...
49/69 lift all calls of step_into() out of follow_dotdot/follow_dotdot_rcu
... which can be consolidated. We are done with the forward traversal
parts.
50/69 follow_dotdot{,_rcu}(): massage loops
51/69 follow_dotdot_rcu(): be lazy about changing nd->path
52/69 follow_dotdot(): be lazy about changing nd->path
get the rootwards traversal into shape described above
53/69 helper for mount rootwards traversal
54/69 non-RCU analogue of the previous commit
... and introduce choose_mountpoint{,_rcu}(), switching both RCU and
non-RCU variants to it.
55/69 fs/namei.c: kill follow_mount()
detritus removal - the only remaining caller (path_pts()) ought to
use follow_down() anyway.
That's about where the previous version of patchset used to end.
part 7: pick_link() and friends.
56/69 pick_link(): more straightforward handling of allocation failures
There's a rather annoying wart in pick_link() handling of stack
allocation failures. In RCU mode we try to do GFP_ATOMIC allocation; if
that fails, we need to unlazy and retry with GFP_NORMAL. The problem is
that we need to unlazy both the stuff in nameidata *and* the link we are
about to push onto stack. We need to do the link first (after successful
unlazy we'll have rcu_read_lock() already dropped, so it would be too late).
The question is what to do if trying to legitimize link fails. We might
need to drop references, so that can't happen until we drop rcu_read_lock().
OTOH, we have no place to stash it until the time it's normally done on
error. Result was microoptimized and confusing as hell - explaining the
reasons why we had to do anything special, let alone why this and not
something else was highly non-obvious. Turns out that there's a fairly
simple (and easily explained) solution, avoiding that mess.
57/69 pick_link(): pass it struct path already with normal refcounting rules
Some of the struct path instances share the mount reference with
nd->path. That's done to avoid grabbing/dropping mount references all the
time; unfortunately, it had been the source of quite a few bugs, with
those beasts confused for the normal ones (and vice versa) in some failure
exit. A lot of pathwalk-related code used to be exposed to those, mostly
due to unfortunate calling conventions. After the patches earlier in the
series that exposure is much more limited - step_into() and handle_mounts()
(where they are really needed - they'd been introduced for a good reason)
is pretty much all that is left. pick_link() is slightly exposed - it
gets link in such form, but immediately converts to the regular refcounting
rules. Doing that conversion in the caller (step_link()) makes for simpler
logics.
58/69 fold path_to_nameidata() into its only remaining caller
Another bit of exposure gone - path_to_nameidata() used to be
a primitive for working with those beasts and now it is called only in
step_into(). Expanding it there makes the things easier to read, actually.
59/69 pick_link(): take reserving space on stack into a new helper
60/69 reserve_stack(): switch to __nd_alloc_stack()
61/69 __nd_alloc_stack(): make it return bool
Take the stack allocation out of pick_link() and clean it up.
That's not the end of it for pick_link() - there are other pieces that would be
better off in separate helpers; that's for the latter, though.
part 8: more untangling of do_last()
While we'd already separated the lookup-related parts of do_last() into
a separate function (close analogue of lookup_last()), the top-level loop is
still calling do_last(), which starts with calling open_last_lookups(). If that
returns non-NULL, we leave do_last() immediately and go through the next iteration
of the top-level loop. Otherwise we proceed to do the work on actual opening
and whether it succeeds or not, we are done with pathwalk at that point - we know
it's not a trailing symlink for us to follow. More natural way to express that
is to have open_last_lookups() called by the loop, with the rest of do_last()
done after that loop has ended. Takes a bit or prep work, though:
62/69 link_path_walk(): sample parent's i_uid and i_mode for the last component
I tried to get that done by may_create_in_sticky(); however, we really need
the parent we'd walked through, _NOT_ whatever the file is in at the time of check.
The whole thing is about attacker guessing a name of something like a temp file
uncautious victim is about to open in e.g. /tmp with bare O_CREAT (no O_EXCL).
Attacker creates a file of his own there, lets the victim to open it and to start
writing to attacker-owned "temporary" file. Then attacker modifies the content
and gets the victim screwed. The check is predicated upon the directory being
a sticky one; fair enough, but the same property that makes the attack possible
allows the attacker to mkdir a non-sticky subdirectory there and move the
file to it. If they get it after the victim has looked the damn thing up (still
in the old place), but before it does may_create_in_sticky(), the parent of
file at the moment when may_create_in_sticky() gets called is *NOT* a sticky
directory at all. In other words, we really must sample mode/uid before
doing the lookup.
63/69 take post-lookup part of do_last() out of loop
That turns the loop much more similar to path_lookupat() one. And
unlike the full do_last(), open_last_lookups() is fairly similar to
lookup_last()/walk_component().
Next come some cleanups of open_last_lookups():
64/69 open_last_lookups(): consolidate fsnotify_create() calls
straightforward; the only thing to keep in mind is that we need it
done before unlocking the parent.
65/69 open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
... especially with big comment to the effect that this is all
the call is going to do.
66/69 open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
we move O_EXCL|O_CREAT check from the end of open_last_lookups()
(where we'd already decided we are not looking at symlink to follow) out
into do_open() (== what's left of do_last()). That's one of the cases when
analysis is much longer than the patch itself; see commit message for
details.
67/69 open_last_lookups(): move complete_walk() into do_open()
similar move for complete_walk()
68/69 atomic_open(): no need to pass struct open_flags anymore
unused argument I'd missed back in 2016; should've removed it back
then.
69/69 lookup_open(): don't bother with fallbacks to lookup+create
That's one of the payoffs we get from #66/69 - the reasons why
"we might not have permissions on parent/write access to the entire fs"
logics had been so complicated (sometimes we just trim O_CREAT and call
->atomic_open(), sometimes we forcibly fall back to lookup+create path)
no longer apply. O_TRUNC side effects wouldn't have been a problem
for quite a while now (we call handle_truncate() in the end anyway, so
trimming O_TRUNC for ->atomic_open() would be fine). O_CREAT|O_EXCL,
OTOH, _did_ require the fallback - trimming O_CREAT would not suffice,
since that would've disabled the checks in ->atomic_open() with nothing
downstream to catch it and fail with EEXIST. Well, now we do have
something downstream - the combination of FMODE_CREATED not set,
while O_CREAT|O_EXCL is present will trigger just that.
In addition to the simpler logics in lookup_open(), the last
commit opens the way to something more interesting, but that will have
to wait for the next cycle - I do have some stuff in that direction,
but it changes the ->atomic_open() calling conventions and it's too
late in this cycle for that. I'll post a separate RFC later this
week. Basically, switching ->atomic_open() to returning dentry with
the same interpretation of result as for ->lookup() would make it possible
to get rid both of the weird DENTRY_NOT_SET thing and of the irregularity
in refcounting we have when struct file is used as a vehicle for passing
the lookup result to the caller. No extra arguments are needed,
and return d_in_lookup(dentry) ? foo_lookup(dir, dentry, flags) : NULL;
becomes a legitimate instance, open-coded in lookup_open() side by
side with ->atomic_open() call, with codepaths converging immediately
after that. Anyway, that's a separate story and I'm still not entirely
sure about some of the details for calling conventions. Definitely
not in this series...
From: Al Viro <[email protected]>
preparation to finish_automount() fix (next commit)
Signed-off-by: Al Viro <[email protected]>
---
fs/namespace.c | 47 ++++++++++++++++++++++++-----------------------
1 file changed, 24 insertions(+), 23 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 85b5f7bea82e..dcd015fafe01 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2697,45 +2697,32 @@ static int do_move_mount_old(struct path *path, const char *old_name)
/*
* add a mount into a namespace's mount tree
*/
-static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
+static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
+ struct path *path, int mnt_flags)
{
- struct mountpoint *mp;
- struct mount *parent;
- int err;
+ struct mount *parent = real_mount(path->mnt);
mnt_flags &= ~MNT_INTERNAL_FLAGS;
- mp = lock_mount(path);
- if (IS_ERR(mp))
- return PTR_ERR(mp);
-
- parent = real_mount(path->mnt);
- err = -EINVAL;
if (unlikely(!check_mnt(parent))) {
/* that's acceptable only for automounts done in private ns */
if (!(mnt_flags & MNT_SHRINKABLE))
- goto unlock;
+ return -EINVAL;
/* ... and for those we'd better have mountpoint still alive */
if (!parent->mnt_ns)
- goto unlock;
+ return -EINVAL;
}
/* Refuse the same filesystem on the same mount point */
- err = -EBUSY;
if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
path->mnt->mnt_root == path->dentry)
- goto unlock;
+ return -EBUSY;
- err = -EINVAL;
if (d_is_symlink(newmnt->mnt.mnt_root))
- goto unlock;
+ return -EINVAL;
newmnt->mnt.mnt_flags = mnt_flags;
- err = graft_tree(newmnt, parent, mp);
-
-unlock:
- unlock_mount(mp);
- return err;
+ return graft_tree(newmnt, parent, mp);
}
static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
@@ -2748,6 +2735,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
unsigned int mnt_flags)
{
struct vfsmount *mnt;
+ struct mountpoint *mp;
struct super_block *sb = fc->root->d_sb;
int error;
@@ -2768,7 +2756,13 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
mnt_warn_timestamp_expiry(mountpoint, mnt);
- error = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+ mp = lock_mount(mountpoint);
+ if (IS_ERR(mp)) {
+ mntput(mnt);
+ return PTR_ERR(mp);
+ }
+ error = do_add_mount(real_mount(mnt), mp, mountpoint, mnt_flags);
+ unlock_mount(mp);
if (error < 0)
mntput(mnt);
return error;
@@ -2830,6 +2824,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int finish_automount(struct vfsmount *m, struct path *path)
{
struct mount *mnt = real_mount(m);
+ struct mountpoint *mp;
int err;
/* The new mount record should have at least 2 refs to prevent it being
* expired before we get a chance to add it
@@ -2842,7 +2837,13 @@ int finish_automount(struct vfsmount *m, struct path *path)
goto fail;
}
- err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+ mp = lock_mount(path);
+ if (IS_ERR(mp)) {
+ err = PTR_ERR(mp);
+ goto fail;
+ }
+ err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+ unlock_mount(mp);
if (!err)
return 0;
fail:
--
2.11.0
From: Al Viro <[email protected]>
Currently it either returns -E... or puts (nd->path.mnt,dentry)
into *path and returns 0. Make it return ERR_PTR(-E...) or
dentry; adjust the caller. Fewer arguments and it's easier
to keep track of *path contents that way.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 41 ++++++++++++++++++++++++-----------------
1 file changed, 24 insertions(+), 17 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index c104ec75faef..5f8b791a6d6e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3087,10 +3087,10 @@ static int may_o_create(const struct path *dir, struct dentry *dentry, umode_t m
*
* Returns an error code otherwise.
*/
-static int atomic_open(struct nameidata *nd, struct dentry *dentry,
- struct path *path, struct file *file,
- const struct open_flags *op,
- int open_flag, umode_t mode)
+static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
+ struct file *file,
+ const struct open_flags *op,
+ int open_flag, umode_t mode)
{
struct dentry *const DENTRY_NOT_SET = (void *) -1UL;
struct inode *dir = nd->path.dentry->d_inode;
@@ -3131,17 +3131,15 @@ static int atomic_open(struct nameidata *nd, struct dentry *dentry,
}
if (file->f_mode & FMODE_CREATED)
fsnotify_create(dir, dentry);
- if (unlikely(d_is_negative(dentry))) {
+ if (unlikely(d_is_negative(dentry)))
error = -ENOENT;
- } else {
- path->dentry = dentry;
- path->mnt = nd->path.mnt;
- return 0;
- }
}
}
- dput(dentry);
- return error;
+ if (error) {
+ dput(dentry);
+ dentry = ERR_PTR(error);
+ }
+ return dentry;
}
/*
@@ -3236,11 +3234,20 @@ static int lookup_open(struct nameidata *nd, struct path *path,
}
if (dir_inode->i_op->atomic_open) {
- error = atomic_open(nd, dentry, path, file, op, open_flag,
- mode);
- if (unlikely(error == -ENOENT) && create_error)
- error = create_error;
- return error;
+ dentry = atomic_open(nd, dentry, file, op, open_flag, mode);
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ if (unlikely(error == -ENOENT) && create_error)
+ error = create_error;
+ return error;
+ }
+ if (file->f_mode & FMODE_OPENED) {
+ dput(dentry);
+ return 0;
+ }
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ return 0;
}
no_open:
--
2.11.0
From: Al Viro <[email protected]>
1) no instances of ->d_automount() have ever made use of the "return
ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
a rudiment of plans that got superseded before the thing went into
the tree. Despite the comment in follow_automount(), autofs has
never done that.
2) if there's no ->d_automount() in dentry_operations, filesystems
should not set DCACHE_NEED_AUTOMOUNT in the first place. None have
ever done so...
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 28 +++-------------------------
fs/namespace.c | 9 ++++++++-
2 files changed, 11 insertions(+), 26 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 626eddb33508..39dd56f5171f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1210,10 +1210,7 @@ EXPORT_SYMBOL(follow_up);
*/
static int follow_automount(struct path *path, struct nameidata *nd)
{
- struct vfsmount *mnt;
-
- if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
- return -EREMOTE;
+ struct dentry *dentry = path->dentry;
/* We don't want to mount if someone's just doing a stat -
* unless they're stat'ing a directory and appended a '/' to
@@ -1228,33 +1225,14 @@ static int follow_automount(struct path *path, struct nameidata *nd)
*/
if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
- path->dentry->d_inode)
+ dentry->d_inode)
return -EISDIR;
nd->total_link_count++;
if (nd->total_link_count >= 40)
return -ELOOP;
- mnt = path->dentry->d_op->d_automount(path);
- if (IS_ERR(mnt)) {
- /*
- * The filesystem is allowed to return -EISDIR here to indicate
- * it doesn't want to automount. For instance, autofs would do
- * this so that its userspace daemon can mount on this dentry.
- *
- * However, we can only permit this if it's a terminal point in
- * the path being looked up; if it wasn't then the remainder of
- * the path is inaccessible and we should say so.
- */
- if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT))
- return -EREMOTE;
- return PTR_ERR(mnt);
- }
-
- if (!mnt)
- return 0;
-
- return finish_automount(mnt, path);
+ return finish_automount(dentry->d_op->d_automount(path), path);
}
/*
diff --git a/fs/namespace.c b/fs/namespace.c
index 777c3116e62e..743980380a8f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2824,9 +2824,16 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int finish_automount(struct vfsmount *m, struct path *path)
{
struct dentry *dentry = path->dentry;
- struct mount *mnt = real_mount(m);
struct mountpoint *mp;
+ struct mount *mnt;
int err;
+
+ if (!m)
+ return 0;
+ if (IS_ERR(m))
+ return PTR_ERR(m);
+
+ mnt = real_mount(m);
/* The new mount record should have at least 2 refs to prevent it being
* expired before we get a chance to add it
*/
--
2.11.0
From: Al Viro <[email protected]>
O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
As it is, we might or might not have LOOKUP_FOLLOW in op->intent
in that case - that depends upon having O_NOFOLLOW in open flags.
It doesn't matter, since we won't be checking it in that case -
do_last() bails out earlier.
However, making sure it's not set (i.e. acting as if we had an explicit
O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the
check for O_CREAT | O_EXCL in do_last() with the call of step_into()
immediately following it.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 15 +++++----------
fs/open.c | 4 +++-
2 files changed, 8 insertions(+), 11 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 6721c5f7e9d5..6938d20aa73a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3396,22 +3396,17 @@ static int do_last(struct nameidata *nd,
if (unlikely(error < 0))
return error;
- /*
- * create/update audit record if it already exists.
- */
- audit_inode(nd->name, path.dentry, 0);
-
- if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
- path_to_nameidata(&path, nd);
- return -EEXIST;
- }
-
seq = 0; /* out of RCU mode, so the value doesn't matter */
inode = d_backing_inode(path.dentry);
finish_lookup:
error = step_into(nd, &path, 0, inode, seq);
if (unlikely(error))
return error;
+
+ if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
+ audit_inode(nd->name, nd->path.dentry, 0);
+ return -EEXIST;
+ }
finish_open:
/* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */
error = complete_walk(nd);
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..e5227cd533f4 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1049,8 +1049,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
- if (flags & O_EXCL)
+ if (flags & O_EXCL) {
op->intent |= LOOKUP_EXCL;
+ flags |= O_NOFOLLOW;
+ }
}
if (flags & O_DIRECTORY)
--
2.11.0
From: Al Viro <[email protected]>
same story as for atomic_open() in the previous commit.
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 46 +++++++++++++++++++---------------------------
1 file changed, 19 insertions(+), 27 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 5f8b791a6d6e..b350e1e2b46f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3157,10 +3157,9 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
*
* An error code is returned on failure.
*/
-static int lookup_open(struct nameidata *nd, struct path *path,
- struct file *file,
- const struct open_flags *op,
- bool got_write)
+static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
+ const struct open_flags *op,
+ bool got_write)
{
struct dentry *dir = nd->path.dentry;
struct inode *dir_inode = dir->d_inode;
@@ -3171,7 +3170,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
if (unlikely(IS_DEADDIR(dir_inode)))
- return -ENOENT;
+ return ERR_PTR(-ENOENT);
file->f_mode &= ~FMODE_CREATED;
dentry = d_lookup(dir, &nd->last);
@@ -3179,7 +3178,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
if (!dentry) {
dentry = d_alloc_parallel(dir, &nd->last, &wq);
if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ return dentry;
}
if (d_in_lookup(dentry))
break;
@@ -3195,7 +3194,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
}
if (dentry->d_inode) {
/* Cached positive dentry: will open in f_op->open */
- goto out_no_open;
+ return dentry;
}
/*
@@ -3235,19 +3234,9 @@ static int lookup_open(struct nameidata *nd, struct path *path,
if (dir_inode->i_op->atomic_open) {
dentry = atomic_open(nd, dentry, file, op, open_flag, mode);
- if (IS_ERR(dentry)) {
- error = PTR_ERR(dentry);
- if (unlikely(error == -ENOENT) && create_error)
- error = create_error;
- return error;
- }
- if (file->f_mode & FMODE_OPENED) {
- dput(dentry);
- return 0;
- }
- path->mnt = nd->path.mnt;
- path->dentry = dentry;
- return 0;
+ if (unlikely(create_error) && dentry == ERR_PTR(-ENOENT))
+ dentry = ERR_PTR(create_error);
+ return dentry;
}
no_open:
@@ -3283,14 +3272,11 @@ static int lookup_open(struct nameidata *nd, struct path *path,
error = create_error;
goto out_dput;
}
-out_no_open:
- path->dentry = dentry;
- path->mnt = nd->path.mnt;
- return 0;
+ return dentry;
out_dput:
dput(dentry);
- return error;
+ return ERR_PTR(error);
}
/*
@@ -3309,6 +3295,7 @@ static int do_last(struct nameidata *nd,
unsigned seq;
struct inode *inode;
struct path path;
+ struct dentry *dentry;
int error;
nd->flags &= ~LOOKUP_PARENT;
@@ -3365,14 +3352,18 @@ static int do_last(struct nameidata *nd,
inode_lock(dir->d_inode);
else
inode_lock_shared(dir->d_inode);
- error = lookup_open(nd, &path, file, op, got_write);
+ dentry = lookup_open(nd, file, op, got_write);
if (open_flag & O_CREAT)
inode_unlock(dir->d_inode);
else
inode_unlock_shared(dir->d_inode);
- if (error)
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
goto out;
+ }
+ path.mnt = nd->path.mnt;
+ path.dentry = dentry;
if (file->f_mode & FMODE_OPENED) {
if ((file->f_mode & FMODE_CREATED) ||
@@ -3380,6 +3371,7 @@ static int do_last(struct nameidata *nd,
will_truncate = false;
audit_inode(nd->name, file->f_path.dentry, 0);
+ dput(dentry);
goto opened;
}
--
2.11.0
From: Al Viro <[email protected]>
kill nd->link_inode, while we are at it
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 135 ++++++++++++++++++++++++++++---------------------------------
1 file changed, 61 insertions(+), 74 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 969b99a88631..9b08c64397c8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -503,7 +503,6 @@ struct nameidata {
} *stack, internal[EMBEDDED_LEVELS];
struct filename *name;
struct nameidata *saved;
- struct inode *link_inode;
unsigned root_seq;
int dfd;
} __randomize_layout;
@@ -962,9 +961,8 @@ int sysctl_protected_regular __read_mostly;
*
* Returns 0 if following the symlink is allowed, -ve on error.
*/
-static inline int may_follow_link(struct nameidata *nd)
+static inline int may_follow_link(struct nameidata *nd, const struct inode *inode)
{
- const struct inode *inode;
const struct inode *parent;
kuid_t puid;
@@ -972,7 +970,6 @@ static inline int may_follow_link(struct nameidata *nd)
return 0;
/* Allowed if owner and follower match. */
- inode = nd->link_inode;
if (uid_eq(current_cred()->fsuid, inode->i_uid))
return 0;
@@ -1106,73 +1103,6 @@ static int may_create_in_sticky(umode_t dir_mode, kuid_t dir_uid,
return 0;
}
-static __always_inline
-const char *get_link(struct nameidata *nd)
-{
- struct saved *last = nd->stack + nd->depth - 1;
- struct dentry *dentry = last->link.dentry;
- struct inode *inode = nd->link_inode;
- int error;
- const char *res;
-
- if (!(nd->flags & LOOKUP_PARENT)) {
- error = may_follow_link(nd);
- if (unlikely(error))
- return ERR_PTR(error);
- }
-
- if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
- return ERR_PTR(-ELOOP);
-
- if (!(nd->flags & LOOKUP_RCU)) {
- touch_atime(&last->link);
- cond_resched();
- } else if (atime_needs_update(&last->link, inode)) {
- if (unlikely(unlazy_walk(nd)))
- return ERR_PTR(-ECHILD);
- touch_atime(&last->link);
- }
-
- error = security_inode_follow_link(dentry, inode,
- nd->flags & LOOKUP_RCU);
- if (unlikely(error))
- return ERR_PTR(error);
-
- nd->last_type = LAST_BIND;
- res = READ_ONCE(inode->i_link);
- if (!res) {
- const char * (*get)(struct dentry *, struct inode *,
- struct delayed_call *);
- get = inode->i_op->get_link;
- if (nd->flags & LOOKUP_RCU) {
- res = get(NULL, inode, &last->done);
- if (res == ERR_PTR(-ECHILD)) {
- if (unlikely(unlazy_walk(nd)))
- return ERR_PTR(-ECHILD);
- res = get(dentry, inode, &last->done);
- }
- } else {
- res = get(dentry, inode, &last->done);
- }
- if (!res)
- goto all_done;
- if (IS_ERR(res))
- return res;
- }
- if (*res == '/') {
- error = nd_jump_root(nd);
- if (unlikely(error))
- return ERR_PTR(error);
- while (unlikely(*++res == '/'))
- ;
- }
- if (*res)
- return res;
-all_done: // pure jump
- put_link(nd);
- return NULL;
-}
-
/*
* follow_up - Find the mountpoint of path's vfsmount
*
@@ -1796,8 +1726,10 @@ static inline int handle_dots(struct nameidata *nd, int type)
static const char *pick_link(struct nameidata *nd, struct path *link,
struct inode *inode, unsigned seq)
{
- int error;
struct saved *last;
+ const char *res;
+ int error;
+
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
path_to_nameidata(link, nd);
return ERR_PTR(-ELOOP);
@@ -1828,9 +1760,64 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
last = nd->stack + nd->depth++;
last->link = *link;
clear_delayed_call(&last->done);
- nd->link_inode = inode;
last->seq = seq;
- return get_link(nd);
+
+ if (!(nd->flags & LOOKUP_PARENT)) {
+ error = may_follow_link(nd, inode);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ }
+
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ return ERR_PTR(-ELOOP);
+
+ if (!(nd->flags & LOOKUP_RCU)) {
+ touch_atime(&last->link);
+ cond_resched();
+ } else if (atime_needs_update(&last->link, inode)) {
+ if (unlikely(unlazy_walk(nd)))
+ return ERR_PTR(-ECHILD);
+ touch_atime(&last->link);
+ }
+
+ error = security_inode_follow_link(link->dentry, inode,
+ nd->flags & LOOKUP_RCU);
+ if (unlikely(error))
+ return ERR_PTR(error);
+
+ nd->last_type = LAST_BIND;
+ res = READ_ONCE(inode->i_link);
+ if (!res) {
+ const char * (*get)(struct dentry *, struct inode *,
+ struct delayed_call *);
+ get = inode->i_op->get_link;
+ if (nd->flags & LOOKUP_RCU) {
+ res = get(NULL, inode, &last->done);
+ if (res == ERR_PTR(-ECHILD)) {
+ if (unlikely(unlazy_walk(nd)))
+ return ERR_PTR(-ECHILD);
+ res = get(link->dentry, inode, &last->done);
+ }
+ } else {
+ res = get(link->dentry, inode, &last->done);
+ }
+ if (!res)
+ goto all_done;
+ if (IS_ERR(res))
+ return res;
+ }
+ if (*res == '/') {
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ while (unlikely(*++res == '/'))
+ ;
+ }
+ if (*res)
+ return res;
+all_done: // pure jump
+ put_link(nd);
+ return NULL;
}
enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
--
2.11.0
On Fri, Mar 13, 2020 at 4:53 PM Al Viro <[email protected]> wrote:
>
> Review and testing would be _very_ welcome;
I didn't notice anythign else than the few things I sent out to
individual patches.
But I have to say, 69 patches is too long of a series to review. I
think you could send them out as multiple series (you describe them
that way anyway - parts 1-7) with a day in between.
Because my eyes were starting to glaze over about halfway in the series.
But don't do it for this version. If you do a #5. But it would be good
to be in -next regardless of whether you do a #5 or not.
Linus
On Fri, Mar 13, 2020 at 05:50:51PM -0700, Linus Torvalds wrote:
> On Fri, Mar 13, 2020 at 4:53 PM Al Viro <[email protected]> wrote:
> >
> > Review and testing would be _very_ welcome;
>
> I didn't notice anythign else than the few things I sent out to
> individual patches.
>
> But I have to say, 69 patches is too long of a series to review. I
> think you could send them out as multiple series (you describe them
> that way anyway - parts 1-7) with a day in between.
A week of fun? OK...
> Because my eyes were starting to glaze over about halfway in the series.
>
> But don't do it for this version. If you do a #5. But it would be good
> to be in -next regardless of whether you do a #5 or not.
FWIW, I've dealt with bisect hazards and I'll probably reorder #56 after
#57/#58, to get the stuff that deals with stack allocation (#56, #59..61)
together, without "reduce the exposure to weird struct path instances"
(57 and 58) mixed in the middle of that.
As for the rest... I'm not sure that choose_mountpoint{,_rcu}()
is inserted into the right place in text - might be better next to
follow_up(). There's also a couple of pick_link() pieces worth separate
helpers, but I'd rather leave that for the next cycle - the series is
bloody long as it is.
I'm not going to throw the immediate prereqs for ->atomic_open() calling
conventions change into that pile - they don't harm anything, but they
are unmotivated without the next step (method signature change) and it's
really too late in the cycle for that. That's going to be a separate
series, probably for the next cycle. Changes to instances are not
huge; ceph is the worst by far and that's only +27/-22 lines. So I don't
think there will be a lot of conflicts to cope with in the next cycle,
especially since ceph side of things looks like we want to do some
refactoring first, with much smaller changeover on top of that. That
refactoring itself won't have prereqs at all, so that can be dealt with
sanely and that'll soak most of the potential conflicts in.
There are other potential refactorings/cleanups, but that's definitely
not for this cycle. So... short of regressions found in that series
that's probably close to what I'll have for the coming window in
this branch. If I see something else in there that can be usefully
cleaned up, I'll keep it for after -rc1...
Next: context switch to uaccess series and getting that patchbomb ready.
Oh, well...