2019-12-30 05:22:21

by Aleksa Sarai

Subject: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

An undocumented feature of the mount interface was that it was possible
to mount over a symlink (even with the old mount API) by mounting over
/proc/self/fd/$n -- where the corresponding file descriptor was opened
with (O_PATH|O_NOFOLLOW). This didn't work with traditional "new" mounts
(for a variety of reasons), but MS_BIND worked without issue. With the
new mount API it was even easier.
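
The mechanism described above can be sketched in userspace C. This is a
minimal illustration under stated assumptions, not the reproducer attached
later in the thread: the helper names proc_fd_path() and
bind_mount_over_symlink() are hypothetical, and the mount(2) call needs
CAP_SYS_ADMIN (possibly inside a user namespace) to succeed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

/* Build "/proc/self/fd/<fd>" in buf; returns 0, or -1 if buf is too small. */
int proc_fd_path(int fd, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	int n = snprintf(buf, len, "/proc/self/fd/%d", fd);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper (not from the thread): bind-mount `source` on top of
 * the symlink `linkpath` itself. Returns 0, or -1 with errno set.
 */
int bind_mount_over_symlink(const char *source, const char *linkpath)
{
	char target[64];
	int fd, ret, saved;

	/* O_PATH|O_NOFOLLOW gives a descriptor for the symlink, not its target. */
	fd = open(linkpath, O_PATH | O_NOFOLLOW | O_CLOEXEC);
	if (fd < 0)
		return -1;
	if (proc_fd_path(fd, target, sizeof(target)) < 0) {
		close(fd);
		errno = ENAMETOOLONG;
		return -1;
	}
	/* The magic-link resolves to the symlink itself, so MS_BIND lands on it. */
	ret = mount(source, target, NULL, MS_BIND, NULL);
	saved = errno;
	close(fd);
	errno = saved;
	return ret;
}
```

A caller would use it as bind_mount_over_symlink("/some/file", "link"). On a
kernel carrying the proposed patch the mount(2) call would be expected to
fail with EINVAL instead.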

A reasonably detailed explanation of the issues is provided in the patch
itself, but the full traces produced by both the oopses and deadlocks are
included below (it makes little sense to include them in the commit since we
are disabling this feature, not directly fixing the bugs themselves).

I've posted this as an RFC on whether this feature should be allowed at
all (and if anyone knows of legitimate uses for it), or if we should
work on fixing these other kernel bugs that it exposes.

Oops on NULL dereference:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 8000000181b1f067 P4D 8000000181b1f067 PUD 24829c067 PMD 0
Oops: 0010 [#1] SMP PTI
CPU: 6 PID: 20796 Comm: mount_to_symlin Tainted: G OE 5.5.0-rc1+openat2~v18+ #123
Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffbc7d87e1bcb0 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffffa0c28cb633c0 RCX: 000000000000ae5a
RDX: 0000000000000089 RSI: ffffa0c0eece8840 RDI: ffffa0c0eb8843b0
RBP: ffffa0c0eb8843b0 R08: ffffdc7d7fbbb770 R09: ffffa0c0ca333000
R10: 0000000000000000 R11: 808080807fffffff R12: ffffa0c0eece8840
R13: 0000000000000089 R14: ffffbc7d87e1bdb0 R15: 0000000000000080
FS: 00007fd921508540(0000) GS:ffffa0c3cf580000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000018878a003 CR4: 00000000003606e0
Call Trace:
__lookup_slow+0x94/0x160
lookup_slow+0x36/0x50
path_mountpoint+0x1be/0x350
filename_mountpoint+0xa5/0x150
? __lookup_hash+0xa0/0xa0
ksys_umount+0x78/0x490
__x64_sys_umount+0x12/0x20
do_syscall_64+0x64/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fd92143f4e7
Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe98c89cc8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd92143f4e7
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000167a330
RBP: 00007ffe98c89da0 R08: 0000000000000000 R09: 000000000000000f
R10: 00000000004004c6 R11: 0000000000000202 R12: 00000000004010c0
R13: 00007ffe98c89e80 R14: 0000000000000000 R15: 0000000000000000
CR2: 0000000000000000

Oops on kernel address:
BUG: unable to handle page fault for address: ffffbc7d87e1bcc0
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 107d4a067 P4D 107d4a067 PUD 107d4b067 PMD 46d753067 PTE 0
Oops: 0002 [#2] SMP PTI
CPU: 4 PID: 20975 Comm: mount_to_symlin Tainted: G D OE 5.5.0-rc1+openat2~v18+ #123
Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
RIP: 0010:_raw_spin_lock_irqsave+0x28/0x50
Code: 00 00 0f 1f 44 00 00 41 54 53 48 89 fb 9c 58 0f 1f 44 00 00 49 89 c4
fa 66 0f 1f 44 00 00 e8 3f 55 82 ff 31 c0 ba 01 00 00 00 <f0> 0f b1
13 75 07 4c 89 e0 5b 41 5c c3 89 c6 48 89 df e8 01 52 77
RSP: 0018:ffffbc7d90067bd8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffbc7d87e1bcc0 RCX: 0000000200000000
RDX: 0000000000000001 RSI: ffffbc7d90067c50 RDI: ffffbc7d87e1bcc0
RBP: ffffbc7d87e1bcc0 R08: 0000000000000001 R09: 0000000000000003
R10: 0000000000000000 R11: 808080807fffffff R12: 0000000000000246
R13: ffffa0c28cb633c0 R14: ffffbc7d90067db0 R15: ffffa0c0eece8898
FS: 00007f4b80214540(0000) GS:ffffa0c3cf500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffbc7d87e1bcc0 CR3: 000000026d4d0002 CR4: 00000000003606e0
Call Trace:
add_wait_queue+0x15/0x40
d_alloc_parallel+0x36d/0x480
? get_acl+0x1a/0x160
? wake_up_q+0xa0/0xa0
__lookup_slow+0x6b/0x160
lookup_slow+0x36/0x50
path_mountpoint+0x1be/0x350
filename_mountpoint+0xa5/0x150
? __lookup_hash+0xa0/0xa0
ksys_umount+0x78/0x490
__x64_sys_umount+0x12/0x20
do_syscall_64+0x64/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4b8014b4e7
Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffee8041b28 EFLAGS: 00000206 ORIG_RAX: 00000000000000a6
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4b8014b4e7
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000019c8330
RBP: 00007ffee8041c00 R08: 0000000000000000 R09: 000000000000000f
R10: 00000000004004c6 R11: 0000000000000206 R12: 00000000004010c0
R13: 00007ffee8041ce0 R14: 0000000000000000 R15: 0000000000000000
CR2: ffffbc7d87e1bcc0

Apparent deadlock in d_alloc_parallel:
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount_to_symlin:21285]
CPU: 0 PID: 21285 Comm: mount_to_symlin Tainted: G D OE 5.5.0-rc1+openat2~v18+ #123
Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0
a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84
c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
RSP: 0018:ffffbc7d90547be8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000101 RBX: ffffffffbac7ac60 RCX: 0000000000000018
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa0c0eece8898
RBP: ffffa0c0eece8898 R08: 00000000006f6f66 R09: 0000000000000003
R10: 0000000000000000 R11: 808080807fffffff R12: 00000000e25b3c73
R13: ffffa0c28cb633c0 R14: ffffbc7d90547db0 R15: ffffa0c0eece8898
FS: 00007fbb1fd30540(0000) GS:ffffa0c3cf400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbb1fbd25a0 CR3: 0000000181ace005 CR4: 00000000003606f0
Call Trace:
_raw_spin_lock+0x1a/0x20
lockref_get_not_dead+0x4f/0x90
d_alloc_parallel+0x1a8/0x480
? get_acl+0x1a/0x160
__lookup_slow+0x6b/0x160
lookup_slow+0x36/0x50
path_mountpoint+0x1be/0x350
filename_mountpoint+0xa5/0x150
? __lookup_hash+0xa0/0xa0
ksys_umount+0x78/0x490
__x64_sys_umount+0x12/0x20
do_syscall_64+0x64/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fbb1fc674e7
Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd75fcb858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbb1fc674e7
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000f6c330
RBP: 00007ffd75fcb930 R08: 0000000000000000 R09: 000000000000000f
R10: 00000000004004a6 R11: 0000000000000202 R12: 00000000004010b0
R13: 00007ffd75fcba10 R14: 0000000000000000 R15: 0000000000000000

RCU stall when trying to grab /proc/$pid/stack for the stuck process:
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 0-....: (15000 ticks this GP) idle=2c6/1/0x4000000000000002 softirq=1172554/1172554 fqs=6849
(t=15001 jiffies g=1935177 q=25734)
NMI backtrace for cpu 0
CPU: 0 PID: 21285 Comm: mount_to_symlin Tainted: G D OEL 5.5.0-rc1+openat2~v18+ #123
Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
Call Trace:
<IRQ>
dump_stack+0x8f/0xd0
? lapic_can_unplug_cpu.cold+0x3e/0x3e
nmi_cpu_backtrace.cold+0x14/0x52
nmi_trigger_cpumask_backtrace+0xf6/0xf8
rcu_dump_cpu_stacks+0x8f/0xbd
rcu_sched_clock_irq.cold+0x1b2/0x39f
update_process_times+0x24/0x50
tick_sched_handle+0x22/0x60
tick_sched_timer+0x38/0x80
? tick_sched_do_timer+0x60/0x60
__hrtimer_run_queues+0xf6/0x270
hrtimer_interrupt+0x10e/0x240
smp_apic_timer_interrupt+0x6c/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0
a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0
75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
RSP: 0018:ffffbc7d90547be8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000101 RBX: ffffffffbac7ac60 RCX: 0000000000000018
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa0c0eece8898
RBP: ffffa0c0eece8898 R08: 00000000006f6f66 R09: 0000000000000003
R10: 0000000000000000 R11: 808080807fffffff R12: 00000000e25b3c73
R13: ffffa0c28cb633c0 R14: ffffbc7d90547db0 R15: ffffa0c0eece8898
_raw_spin_lock+0x1a/0x20
lockref_get_not_dead+0x4f/0x90
d_alloc_parallel+0x1a8/0x480
? get_acl+0x1a/0x160
__lookup_slow+0x6b/0x160
lookup_slow+0x36/0x50
path_mountpoint+0x1be/0x350
filename_mountpoint+0xa5/0x150
? __lookup_hash+0xa0/0xa0
ksys_umount+0x78/0x490
__x64_sys_umount+0x12/0x20
do_syscall_64+0x64/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fbb1fc674e7
Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd75fcb858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbb1fc674e7
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000f6c330
RBP: 00007ffd75fcb930 R08: 0000000000000000 R09: 000000000000000f
R10: 00000000004004a6 R11: 0000000000000202 R12: 00000000004010b0
R13: 00007ffd75fcba10 R14: 0000000000000000 R15: 0000000000000000

Deadlock on lock_mount after a successful umount(). The watchdog does trigger,
but I could only find this stall when trying to suspend the system in my logs:
Freezing of tasks failed after 20.010 seconds (2 tasks refusing to freeze, wq_busy=0):
mount_to_symlin D 0 5850 5849 0x00000004
Call Trace:
? __schedule+0x2dd/0x770
schedule+0x4a/0xb0
rwsem_down_write_slowpath+0x256/0x500
lock_mount+0x22/0xf0
do_mount+0x4b7/0x9f0
ksys_mount+0x7e/0xc0
__x64_sys_mount+0x21/0x30
do_syscall_64+0x64/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f86e6355fda
Code: Bad RIP value.
RSP: 002b:00007ffc36f952d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f86e6355fda
RDX: 0000000000402099 RSI: 00000000019a5310 RDI: 00007ffc36f96ee1
RBP: 00007ffc36f953b0 R08: 0000000000402099 R09: 000000000000000f
R10: 0000000000001000 R11: 0000000000000206 R12: 00000000004010c0
R13: 00007ffc36f95490 R14: 0000000000000000 R15: 0000000000000000

Cc: [email protected] # pre-git
Cc: Al Viro <[email protected]>
Cc: David Howells <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>

Aleksa Sarai (1):
mount: universally disallow mounting over symlinks

fs/namespace.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)


base-commit: fd6988496e79a6a4bdb514a4655d2920209eb85d
--
2.24.1


2019-12-30 05:22:31

by Aleksa Sarai

Subject: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks

An undocumented feature of the mount interface was that it was possible
to mount over a symlink (even with the old mount API) by mounting over
/proc/self/fd/$n -- where the corresponding file descriptor was opened
with (O_PATH|O_NOFOLLOW). This didn't work with traditional "new" mounts
(for a variety of reasons), but MS_BIND worked without issue. With the
new mount API it was even easier.

From userspace's perspective, this capability is only really useful as
an attack vector. Until the introduction of openat2(RESOLVE_NO_XDEV),
there was no trivial way to detect if a bind-mount was present. In the
container runtime context (in a similar vein to CVE-2019-19921), this
could result in a privileged process being unable to detect that a
configuration resulted in magic-link usage operating on the wrong
magic-links. Additionally, the API to use this feature was incredibly
strange -- in order to umount, you would have to go through
/proc/self/fd/$n again (umounting the path would result in the
*underlying* symlink being followed).
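
The umount side can be sketched the same way. This is an assumption-laden
illustration rather than the attached test program: umount_via_fd() is a
hypothetical name, and `fd` is assumed to be a descriptor obtained with
(O_PATH|O_NOFOLLOW) that refers to the mounted-over link.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: detach whatever is mounted at the location `fd`
 * refers to by naming it through /proc/self/fd/<fd> instead of its path,
 * so that the symlink underneath is never followed.
 * Returns 0, or -1 with errno set by umount2(2).
 */
int umount_via_fd(int fd, int flags)
{
	char path[64];
	int n = snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);

	assert(n > 0 && (size_t)n < sizeof(path));
	return umount2(path, flags);
}
```

A call such as umount_via_fd(fd, MNT_DETACH) needs the same privileges as
any other umount2(2) call.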

Which brings us to the issues on the kernel side. When umounting a mount
on top of a symlink, several oopses (both NULL and garbage kernel
address dereferences) and deadlocks could be triggered trivially. Note
that because this works in user namespaces, an unprivileged user could
trigger these oopses just as easily. While
these bugs could be fixed separately, it seems much cleaner to disable a
"feature" which clearly was not intentional (and is not used --
otherwise we would've seen bug reports about it breaking on umount).

Note that because the util-linux mount(8) helper will expand paths
containing symlinks in user-space, only users who called the mount(2)
syscall directly could possibly have seen this behaviour.
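
As a rough illustration of that user-space expansion (a sketch only; this is
not the mount helper's actual code, and canonical_mount_target() is a
hypothetical name), resolving the target with realpath(3) before calling
mount(2) means the kernel never sees the symlink itself:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: resolve every symlink in `path` in user-space.
 * Returns a malloc()ed canonical path that the caller must free(),
 * or NULL with errno set.
 */
char *canonical_mount_target(const char *path)
{
	assert(path != NULL);
	return realpath(path, NULL);
}
```

With "ln -s . link", canonical_mount_target("link") yields the current
directory, so a later mount(2) on the result covers the directory rather
than the link.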

Cc: [email protected] # pre-git
Cc: Al Viro <[email protected]>
Cc: David Howells <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
---
fs/namespace.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index be601d3a8008..01a62bce105f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2172,8 +2172,12 @@ static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
if (mnt->mnt.mnt_sb->s_flags & SB_NOUSER)
return -EINVAL;

+ if (d_is_symlink(mp->m_dentry) ||
+ d_is_symlink(mnt->mnt.mnt_root))
+ return -EINVAL;
+
if (d_is_dir(mp->m_dentry) !=
- d_is_dir(mnt->mnt.mnt_root))
+ d_is_dir(mnt->mnt.mnt_root))
return -ENOTDIR;

return attach_recursive_mnt(mnt, p, mp, false);
@@ -2251,6 +2255,9 @@ static struct mount *__do_loopback(struct path *old_path, int recurse)
if (IS_MNT_UNBINDABLE(old))
return mnt;

+ if (d_is_symlink(old_path->dentry))
+ return mnt;
+
if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
return mnt;

@@ -2635,6 +2642,10 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
if (old_path->dentry != old_path->mnt->mnt_root)
goto out;

+ if (d_is_symlink(new_path->dentry) ||
+ d_is_symlink(old_path->dentry))
+ goto out;
+
if (d_is_dir(new_path->dentry) !=
d_is_dir(old_path->dentry))
goto out;
@@ -2726,10 +2737,6 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
path->mnt->mnt_root == path->dentry)
goto unlock;

- err = -EINVAL;
- if (d_is_symlink(newmnt->mnt.mnt_root))
- goto unlock;
-
newmnt->mnt.mnt_flags = mnt_flags;
err = graft_tree(newmnt, parent, mp);

--
2.24.1

2019-12-30 05:45:43

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:

> A reasonably detailed explanation of the issues is provided in the patch
> itself, but the full traces produced by both the oopses and deadlocks are
> included below (it makes little sense to include them in the commit since we
> are disabling this feature, not directly fixing the bugs themselves).
>
> I've posted this as an RFC on whether this feature should be allowed at
> all (and if anyone knows of legitimate uses for it), or if we should
> work on fixing these other kernel bugs that it exposes.

Umm... Are all of those traces
a) reproducible on mainline and
b) reproducible as the first oopsen?

As it is, quite a few might be secondary results of earlier memory
corruption...

2019-12-30 05:50:30

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2019-12-30, Al Viro <[email protected]> wrote:
> On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
>
> > A reasonably detailed explanation of the issues is provided in the patch
> > itself, but the full traces produced by both the oopses and deadlocks are
> > included below (it makes little sense to include them in the commit since we
> > are disabling this feature, not directly fixing the bugs themselves).
> >
> > I've posted this as an RFC on whether this feature should be allowed at
> > all (and if anyone knows of legitimate uses for it), or if we should
> > work on fixing these other kernel bugs that it exposes.
>
> Umm... Are all of those traces
> a) reproducible on mainline and

This was on viro/for-next, I'll retry it on v5.5-rc4.

> b) reproducible as the first oopsen?

The NULL and garbage pointer derefs are reproducible as the first oops.
Looking at my logs, it looks like the deadlocks were always triggered
after the oops, but that might just have been a mistake on my part while
testing things.

> As it is, quite a few might be secondary results of earlier memory
> corruption...

Yeah, I thought that might be the case but decided to include them
anyway (the /proc/self/stack RCU stall is definitely the result of other
corruption and stalls).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2019-12-30 07:31:46

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2019-12-30, Aleksa Sarai <[email protected]> wrote:
> On 2019-12-30, Al Viro <[email protected]> wrote:
> > On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
> >
> > > A reasonably detailed explanation of the issues is provided in the patch
> > > itself, but the full traces produced by both the oopses and deadlocks are
> > > included below (it makes little sense to include them in the commit since we
> > > are disabling this feature, not directly fixing the bugs themselves).
> > >
> > > I've posted this as an RFC on whether this feature should be allowed at
> > > all (and if anyone knows of legitimate uses for it), or if we should
> > > work on fixing these other kernel bugs that it exposes.
> >
> > Umm... Are all of those traces
> > a) reproducible on mainline and
>
> This was on viro/for-next, I'll retry it on v5.5-rc4.

The NULL deref oops is reproducible on v5.5-rc4. Strangely it seems
harder to reproduce than on viro/for-next (I kept reproducing it there
by accident), but I'll double-check if that really is the case.

The simplest reproducer is (using the attached programs and .config):

ln -s . link
sudo ./umount_symlink link

There are also a few other wacky behaviours where you get -ELOOP or
-EACCES in cases where you shouldn't -- which results in MNT_DETACH
failing and the mount being impossible to get rid of. A good example is

sudo ./mount_to_symlink /proc/self/exe link
sudo ./umount_symlink link # -EACCES

Or

ln -s . link1
ln -s . link2
sudo ./mount_to_symlink link1 link2
sudo ./umount_symlink link1 # -ELOOP
sudo ./umount_symlink link2 # -ELOOP

But I am trying to find a reproducer for the "umount of a mount
triggering an Oops" issue.

On another note -- I guess this is considered a feature which should
"just work" and not a bug?

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 80000003c6fca067 P4D 80000003c6fca067 PUD 3c6f42067 PMD 0
Oops: 0010 [#1] SMP PTI
CPU: 4 PID: 4486 Comm: umount_symlink Tainted: G E 5.5.0-rc4-cyphar #126
Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET55W (1.30 ) 08/31/2018
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffb70b82963cc0 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc
RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0
RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000
R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0
R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080
FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0
Call Trace:
__lookup_slow+0x94/0x160
lookup_slow+0x36/0x50
path_mountpoint+0x1be/0x360
filename_mountpoint+0xa5/0x150
? __lookup_hash+0xa0/0xa0
ksys_umount+0x78/0x490
__x64_sys_umount+0x12/0x20
do_syscall_64+0x64/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fbc2a8274e7
Code: 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09
00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0
ff ff 73 01 c3 48 8b 0d 69 09 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd1da9b3f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc2a8274e7
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000001300310
RBP: 00007ffd1da9b4c0 R08: 0000000000000000 R09: 000000000000000f
R10: 00007fbc2a92f800 R11: 0000000000000202 R12: 0000000000401090
R13: 00007ffd1da9b5a0 R14: 0000000000000000 R15: 0000000000000000
Modules linked in: [snip]
CR2: 0000000000000000
---[ end trace ae473813e34e641d ]---
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffb70b82963cc0 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc
RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0
RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000
R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0
R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080
FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2019-12-30 07:36:04

by Linus Torvalds

Subject: Re: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks

On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai <[email protected]> wrote:
>
> + if (d_is_symlink(mp->m_dentry) ||
> + d_is_symlink(mnt->mnt.mnt_root))
> + return -EINVAL;

So I don't hate this kind of check in general - overmounting a symlink
sounds odd, but at the same time I get the feeling that the real issue
is that something went wrong earlier.

Yeah, the mount target kind of _is_ a path, but at the same time, we
most definitely want to have the permission to really open the
directory in question, don't we, and I don't see that we should accept
a O_PATH file descriptor.

I feel like the only valid use of "O_PATH" files is to then use them
as the base for an openat() and friends (ie fchmodat/execveat() etc).

But maybe I'm completely wrong, and people really do want O_PATH
handling exactly for mounting too. It does sound a bit odd. By
definition, mounting wants permissions to the mount-point, so what's
the point of using O_PATH?

So instead of saying "don't overmount symlinks", I would feel like
it's the mount system call that should use a proper file descriptor
that isn't FMODE_PATH.

Is it really the symlink that is the issue? Because if it's the
symlink that is the issue then I feel like O_NOFOLLOW should have
triggered it, but your other email seems to say that you really need
O_PATH | O_NOFOLLOW.

So I'm not saying that this patch is wrong, but it really smells a bit
like it's papering over the more fundamental issue.

For example, is the problem that when you do a proper

fd = open("somepath", O_PATH);

in one process, and then another thread does

fd = open("/proc/<pid>/fd/<opathfd>", O_RDWR);

then we get confused and do bad things on that *second* open? Because
now the second open doesn't have O_PATH, and doesn't get marked
FMODE_PATH, but the underlying file descriptor is one of those limited
"is really only useful for openat() and friends".

I dunno. I haven't thought through the whole thing. But the oopses you
quote seem like we're really doing something wrong, and it really does
feel like your patch in no way _fixes_ the wrong thing we're doing,
it's just hiding the symptoms.

Linus

2019-12-30 07:58:08

by Linus Torvalds

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Sun, Dec 29, 2019 at 11:30 PM Aleksa Sarai <[email protected]> wrote:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000

Would you mind building with debug info, and then running the oops through

scripts/decode_stacktrace.sh

which makes those addresses much more legible.

> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page

Somebody jumped through a NULL pointer.

> RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc
> RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0
> RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000
> R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0
> R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080
> FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0
> Call Trace:
> __lookup_slow+0x94/0x160

And "__lookup_slow()" has two indirect calls (they aren't obvious with
retpoline, but look for something like

call __x86_indirect_thunk_rax

which is the modern sad way of doing "call *%rax"). One is for
revalidating an old dentry, but the one I _suspect_ you trigger is
this one:

old = inode->i_op->lookup(inode, dentry, flags);

but I thought we only could get here if we know it's a directory.

How did we miss the "d_can_lookup()", which is what should check that
yes, we can call that ->lookup() routine.

This is why I have that suspicion that it's somehow that O_PATH fd
opened in another process without O_PATH causes confusion...

So what I think has happened is that because of the O_PATH thing,
we've ended up with an inode that has never been truly opened (because
O_PATH skips that part), but then with the /proc/<pid>/fd/xyz open, we
now have a file descriptor that _looks_ like it is valid, and we're
treating that inode as if it can be used.

But I'm handwaving.
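
The failure mode being guessed at here can be pictured with a userspace toy
model. Everything below is a stand-in: the toy_* types and functions are
hypothetical and are not the kernel's definitions, and the real
d_can_lookup() tests dentry type flags rather than the pointer. The sketch
only shows why an unguarded indirect call through a NULL ->lookup would
jump to address 0, and what a guard in front of it buys.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_inode;

/* Toy operations table: only directories provide a lookup method. */
struct toy_inode_ops {
	int (*lookup)(struct toy_inode *dir, const char *name);
};

struct toy_inode {
	int is_dir;
	const struct toy_inode_ops *ops;
};

/* Pretend lookup that always succeeds. */
int toy_dir_lookup(struct toy_inode *dir, const char *name)
{
	(void)dir;
	(void)name;
	return 0;
}

const struct toy_inode_ops toy_dir_ops = { .lookup = toy_dir_lookup };
const struct toy_inode_ops toy_symlink_ops = { .lookup = NULL };

/* Toy analogue of the guard: may ->lookup be called on this inode at all? */
int toy_can_lookup(const struct toy_inode *inode)
{
	return inode->is_dir && inode->ops != NULL && inode->ops->lookup != NULL;
}

/* Guarded lookup: refuse instead of calling through a NULL pointer. */
int toy_lookup(struct toy_inode *inode, const char *name)
{
	if (!toy_can_lookup(inode))
		return -ENOTDIR;
	return inode->ops->lookup(inode, name);
}
```

Without the toy_can_lookup() check, calling toy_lookup() on an inode that
uses toy_symlink_ops would be the userspace equivalent of the "RIP: 0x0"
instruction fetch in the oops.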

Linus

2019-12-30 08:29:58

by Aleksa Sarai

Subject: Re: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks

On 2019-12-29, Linus Torvalds <[email protected]> wrote:
> On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai <[email protected]> wrote:
> > + if (d_is_symlink(mp->m_dentry) ||
> > + d_is_symlink(mnt->mnt.mnt_root))
> > + return -EINVAL;
>
> So I don't hate this kind of check in general - overmounting a symlink
> sounds odd, but at the same time I get the feeling that the real issue
> is that something went wrong earlier.
>
> Yeah, the mount target kind of _is_ a path, but at the same time, we
> most definitely want to have the permission to really open the
> directory in question, don't we, and I don't see that we should accept
> a O_PATH file descriptor.

The new mount API uses O_PATH under the hood (which is a good thing
since some files you'd like to avoid actually opening -- FIFOs are the
obvious example) so I'm not sure that's something we could really avoid.

But if we block O_PATH for mounts this will achieve the same thing,
because the only way to get a file descriptor that references a symlink
is through (O_PATH | O_NOFOLLOW).
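
A small userspace sketch of that point, with hypothetical helper names
(open_symlink_itself() and fd_is_symlink() are not from any library): the
combined flags hand back a descriptor for the link, while O_NOFOLLOW on its
own makes open(2) fail with ELOOP.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open the symlink itself instead of whatever it points to. */
int open_symlink_itself(const char *path)
{
	assert(path != NULL);
	return open(path, O_PATH | O_NOFOLLOW | O_CLOEXEC);
}

/* Returns 1 if fd refers to a symlink, 0 if it does not, -1 on error. */
int fd_is_symlink(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

A runtime could use fd_is_symlink() on a candidate mount target to refuse
symlinks in userspace, independently of what the kernel decides to allow.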

> I feel like the only valid use of "O_PATH" files is to then use them
> as the base for an openat() and friends (ie fchmodat/execveat() etc).

See below, we use this for all sorts of dirty^Wclever tricks.

> But maybe I'm completely wrong, and people really do want O_PATH
> handling exactly for mounting too. It does sound a bit odd. By
> definition, mounting wants permissions to the mount-point, so what's
> the point of using O_PATH?

When you go through O_PATH, you still get a proper 'struct path' which
means that for operations such as mount (or open) you will operate on
the *real* underlying file.

This is part of what makes magic-links so useful (but also quite
terrifying).

> For example, is the problem that when you do a proper
>
> fd = open("somepath", O_PATH);
>
> in one process, and then another thread does
>
> fd = open("/proc/<pid>/fd/<opathfd>", O_RDWR);
>
> then we get confused and do bad things on that *second* open? Because
> now the second open doesn't have O_PATH, and doesn't get marked
> FMODE_PATH, but the underlying file descriptor is one of those limited
> "is really only useful for openat() and friends".

Actually, this isn't true (for the same reason as above) -- when you do
a re-open through /proc/$pid/fd/$n you get a real-as-a-heart-attack file
descriptor. We make lots of use of this in container runtimes in order
to do some dirty^Wfun tricks that help us harden the runtime against
malicious container processes.
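
The re-open behaviour can be seen from userspace with a few lines of C. This
is a sketch under the assumption that /proc is mounted; reopen_fd() is a
hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Re-open an existing descriptor (for example an O_PATH one) through its
 * /proc/self/fd magic-link with new open flags.
 * Returns the new descriptor, or -1 with errno set.
 */
int reopen_fd(int fd, int flags)
{
	char path[64];
	int n = snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);

	assert(n > 0 && (size_t)n < sizeof(path));
	return open(path, flags);
}
```

An O_PATH descriptor rejects write(2) with EBADF, whereas the descriptor
returned by reopen_fd(pfd, O_RDWR) is an ordinary one that can be read from
and written to, subject to the usual permission checks on the file.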

You might recall that when I was posting the earlier revisions of
openat2(), I also included a patch for O_EMPTYPATH (which basically did
a re-open of /proc/self/fd/$dfd but without needing /proc). That had
precisely the same semantics so that you could do the same operation
without procfs. That patch was dropped before Al merged openat2(), but I
am probably going to revive it for the reasons I outlined below.

> I dunno. I haven't thought through the whole thing. But the oopses you
> quote seem like we're really doing something wrong, and it really does
> feel like your patch in no way _fixes_ the wrong thing we're doing,
> it's just hiding the symptoms.

That's fair enough.

I'll be honest, the real reason why I don't want mounts over symlinks to
be possible is for an entirely different reason. I'm working on a safe
path resolution library to accompany openat2()[1] -- and one of the
things I want to do is to harden all of our uses of procfs (such that if
we are running in a context where procfs has been messed with -- such as
having files bind-mounted -- we can detect it and abort). The issue with
symlinks is that we need to be able to operate on magic-links (such as
/proc/self/fd/$n and /proc/self/exe) -- and if it's possible to bind-mount
over those magic-links then we can't detect it at all.

openat2(RESOLVE_NO_XDEV) would block it, but it also blocks going
through magic-links which change your mount (which would almost always
be true). You can't trust /proc/self/mountinfo by definition -- not just
because of the TOCTOU race but also because you can't depend on /proc to
harden against a "bad" /proc. All other options such as
umount2(MNT_EXPIRE) won't help with magic-links because we cannot take
an O_PATH to a magic-link and follow it -- O_PATHs of symlinks are
completely stunted in this respect.
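
For reference, a call with RESOLVE_NO_XDEV looks roughly like the sketch
below. It assumes the openat2() interface as it was later merged in Linux
5.6. To keep the example buildable without <linux/openat2.h>, the struct and
the two constants are local copies (the sketch_ and SKETCH_ names are
hypothetical), and the call goes through syscall(2) because libc may not
provide a wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local fallbacks; values mirror the merged uapi (assumption: Linux >= 5.6). */
#ifndef SYS_openat2
#define SYS_openat2 437
#endif
#define SKETCH_RESOLVE_NO_XDEV 0x01

/* Same layout as the kernel's struct open_how. */
struct sketch_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/*
 * Open `path` relative to `dirfd` as an O_PATH descriptor, failing if the
 * resolution would cross onto a different mount.
 * Returns the descriptor, or -1 with errno set.
 */
int open_no_xdev(int dirfd, const char *path)
{
	struct sketch_open_how how = {
		.flags = O_PATH | O_CLOEXEC,
		.resolve = SKETCH_RESOLVE_NO_XDEV,
	};

	assert(path != NULL);
	return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

On kernels without openat2() the call fails with ENOSYS, so callers need a
fallback path.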

If bind-mounts over symlinks are allowed (which I don't have a
problem with really), it just means we'll need a few more kernel pieces
to get this hardening to work. But these features would be useful
outside of the problems I'm dealing with (O_EMPTYPATH and some kind of
pidfd-based interface to grab the equivalent of /proc/self/exe and a few
other such magic-link targets).

[1]: https://github.com/openSUSE/libpathrs

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2019-12-30 08:33:34

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2019-12-29, Linus Torvalds <[email protected]> wrote:
> On Sun, Dec 29, 2019 at 11:30 PM Aleksa Sarai <[email protected]> wrote:
> >
> > BUG: kernel NULL pointer dereference, address: 0000000000000000
>
> Would you mind building with debug info, and then running the oops through
>
> scripts/decode_stacktrace.sh
>
> which makes those addresses much more legible.

Will do.

> > #PF: supervisor instruction fetch in kernel mode
> > #PF: error_code(0x0010) - not-present page
>
> Somebody jumped through a NULL pointer.
>
> > RAX: 0000000000000000 RBX: ffff906d0cc3bb40 RCX: 0000000000000abc
> > RDX: 0000000000000089 RSI: ffff906d74623cc0 RDI: ffff906d74475df0
> > RBP: ffff906d74475df0 R08: ffffd70b7fb24c20 R09: ffff906d066a5000
> > R10: 0000000000000000 R11: 8080807fffffffff R12: ffff906d74623cc0
> > R13: 0000000000000089 R14: ffffb70b82963dc0 R15: 0000000000000080
> > FS: 00007fbc2a8f0540(0000) GS:ffff906dcf500000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: ffffffffffffffd6 CR3: 00000003c68f8001 CR4: 00000000003606e0
> > Call Trace:
> > __lookup_slow+0x94/0x160
>
> And "__lookup_slow()" has two indirect calls (they aren't obvious with
> retpoline, but look for something like
>
> call __x86_indirect_thunk_rax
>
> which is the modern sad way of doing "call *%rax"). One is for
> revalidating an old dentry, but the one I _suspect_ you trigger is
> this one:
>
> old = inode->i_op->lookup(inode, dentry, flags);
>
> but I thought we only could get here if we know it's a directory.
>
> How did we miss the "d_can_lookup()", which is what should check that
> yes, we can call that ->lookup() routine.

I'll try applying a trivial patch to add d_can_lookup() to see if it
fixes the immediate issue.

> This is why I have that suspicion that it's somehow that O_PATH fd
> opened in another process without O_PATH causes confusion...
>
> So what I think has happened is that because of the O_PATH thing,
> we've ended up with an inode that has never been truly opened (because
> O_PATH skips that part), but then with the /proc/<pid>/fd/xyz open, we
> now have a file descriptor that _looks_ like it is valid, and we're
> treating that inode as if it can be used.

I'm not sure I agree -- as I mentioned in my other mail, re-opening
through /proc/self/fd/$n works *very* well and has for a long time (in
fact, both LXC and runc depend on this working).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-01-01 00:46:15

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Mon, Dec 30, 2019 at 06:29:59PM +1100, Aleksa Sarai wrote:
> On 2019-12-30, Aleksa Sarai <[email protected]> wrote:
> > On 2019-12-30, Al Viro <[email protected]> wrote:
> > > On Mon, Dec 30, 2019 at 04:20:35PM +1100, Aleksa Sarai wrote:
> > >
> > > > A reasonably detailed explanation of the issues is provided in the patch
> > > > itself, but the full traces produced by both the oopses and deadlocks is
> > > > included below (it makes little sense to include them in the commit since we
> > > > are disabling this feature, not directly fixing the bugs themselves).
> > > >
> > > > I've posted this as an RFC on whether this feature should be allowed at
> > > > all (and if anyone knows of legitimate uses for it), or if we should
> > > > work on fixing these other kernel bugs that it exposes.
> > >
> > > Umm... Are all of those traces
> > > a) reproducible on mainline and
> >
> > This was on viro/for-next, I'll retry it on v5.5-rc4.
>
> The NULL deref oops is reproducible on v5.5-rc4. Strangely it seems
> harder to reproduce than on viro/for-next (I kept reproducing it there
> by accident), but I'll double-check if that really is the case.
>
> The simplest reproducer is (using the attached programs and .config):
>
> ln -s . link
> sudo ./umount_symlink link

FWIW, the problem with that reproducer is that we *CAN'T* resolve that
path. Look: you have /proc/self/fd/3 resolve to ./link. OK, you've
asked to follow that. Got ./link, which is a symlink, so we need to
follow it further. Relative to what, though?

The meaning of symlink is dependent upon the directory you find it in.
And we don't have any here.
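
To make the "relative to what?" point concrete, here is a small user-space sketch (the function name and scratch paths are invented for the example). A symlink whose body is "." only means something in combination with the directory that contains it: resolved through its parent, it names that parent.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical demo: create <dir>/link -> "." and check that following
 * it through its parent directory lands on <dir> itself.
 * Returns 1 if so, 0 if not, -1 on setup failure.
 */
int dot_symlink_resolves_to_parent(void)
{
	char dir[] = "/tmp/dot-link-XXXXXX";
	char link[128];
	struct stat parent, via_link;
	int ret = -1;

	if (!mkdtemp(dir))
		return -1;
	snprintf(link, sizeof(link), "%s/link", dir);
	if (symlink(".", link) < 0)
		goto out_dir;

	/* stat() follows the symlink; "." is looked up inside <dir>. */
	if (stat(dir, &parent) == 0 && stat(link, &via_link) == 0)
		ret = parent.st_dev == via_link.st_dev &&
		      parent.st_ino == via_link.st_ino;

	unlink(link);
out_dir:
	rmdir(dir);
	return ret;
}
```

When the symlink is reached by jumping through /proc/self/fd/3 instead, there is no such parent directory to start from.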

The bug is in mountpoint_last() - we have
	if (unlikely(nd->last_type != LAST_NORM)) {
		error = handle_dots(nd, nd->last_type);
		if (error)
			return error;
		path.dentry = dget(nd->path.dentry);
	} else {
		path.dentry = d_lookup(dir, &nd->last);
		if (!path.dentry) {
			/*
			 * No cached dentry. Mounted dentries are pinned in the
			 * cache, so that means that this dentry is probably
			 * a symlink or the path doesn't actually point
			 * to a mounted dentry.
			 */
			path.dentry = lookup_slow(&nd->last, dir,
						  nd->flags | LOOKUP_NO_REVAL);
			if (IS_ERR(path.dentry))
				return PTR_ERR(path.dentry);
		}
	}
	if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) {
		dput(path.dentry);
		return -ENOENT;
	}
	path.mnt = nd->path.mnt;
	return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);
in there, and that ends up with step_into() called in case of LAST_DOT/LAST_DOTDOT
(where it's harmless) *AND* in case of LAST_BIND. Where it very much isn't.

I'm not sure if you have caught anything else, but we really, really should *NOT*
consider the LAST_BIND as "maybe we should follow the result" material. So
at least the following is needed; could you check if anything else remains
with that applied?

diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..d4fbbda8a7ff 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2656,10 +2656,7 @@ mountpoint_last(struct nameidata *nd)
 	nd->flags &= ~LOOKUP_PARENT;

 	if (unlikely(nd->last_type != LAST_NORM)) {
-		error = handle_dots(nd, nd->last_type);
-		if (error)
-			return error;
-		path.dentry = dget(nd->path.dentry);
+		return handle_dots(nd, nd->last_type);
 	} else {
 		path.dentry = d_lookup(dir, &nd->last);
 		if (!path.dentry) {

2020-01-01 00:59:33

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Wed, Jan 01, 2020 at 12:43:24AM +0000, Al Viro wrote:
> I'm not sure if you have caught anything else, but we really, really should *NOT*
> consider the LAST_BIND as "maybe we should follow the result" material. So
> at least the following is needed; could you check if anything else remains
> with that applied?
>
> diff --git a/fs/namei.c b/fs/namei.c
> index d6c91d1e88cb..d4fbbda8a7ff 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2656,10 +2656,7 @@ mountpoint_last(struct nameidata *nd)
> nd->flags &= ~LOOKUP_PARENT;
>
> if (unlikely(nd->last_type != LAST_NORM)) {
> - error = handle_dots(nd, nd->last_type);
> - if (error)
> - return error;
> - path.dentry = dget(nd->path.dentry);
> + return handle_dots(nd, nd->last_type);
> } else {
> path.dentry = d_lookup(dir, &nd->last);
> if (!path.dentry) {

Note, BTW, that lookup_last() (aka walk_component()) does just
that - we only hit step_into() on LAST_NORM. The same goes
for do_last(). mountpoint_last() not doing the same is _not_
intentional - it's definitely a bug.

Consider your testcase; link points to . here. So the only
thing you could expect from trying to follow it would be
the directory 'link' lives in. And you don't have it
when you reach the fscker via /proc/self/fd/3; what happens
instead is nd->path set to ./link (by nd_jump_link()) *AND*
step_into() called, pushing the same ./link onto stack.
It violates all kinds of assumptions made by fs/namei.c -
when pushing a symlink onto stack nd->path is expected to
contain the base directory for resolving it.

I'm fairly sure that this is the cause of at least some
of the insanity you've caught; there always could be
something else, of course, but this hole needs to be
closed in any case.

2020-01-01 03:12:09

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
> Note, BTW, that lookup_last() (aka walk_component()) does just
> that - we only hit step_into() on LAST_NORM. The same goes
> for do_last(). mountpoint_last() not doing the same is _not_
> intentional - it's definitely a bug.
>
> Consider your testcase; link points to . here. So the only
> thing you could expect from trying to follow it would be
> the directory 'link' lives in. And you don't have it
> when you reach the fscker via /proc/self/fd/3; what happens
> instead is nd->path set to ./link (by nd_jump_link()) *AND*
> step_into() called, pushing the same ./link onto stack.
> It violates all kinds of assumptions made by fs/namei.c -
> when pushing a symlink onto stack nd->path is expected to
> contain the base directory for resolving it.
>
> I'm fairly sure that this is the cause of at least some
> of the insanity you've caught; there always could be
> something else, of course, but this hole needs to be
> closed in any case.

... and with removal of now unused local variable, that's

mountpoint_last(): fix the treatment of LAST_BIND

step_into() should be attempted only in LAST_NORM
case, when we have the parent directory (in nd->path).
We get away with that for LAST_DOT and LAST_DOTDOT,
since those can't be symlinks, making step_into() an
equivalent of path_to_nameidata() - we do a bit of
useless work, but that's it. For LAST_BIND (i.e.
the case when we'd just followed a procfs-style
symlink) we really can't go there - result might
be a symlink and we really can't attempt following
it.

lookup_last() and do_last() do handle that properly;
mountpoint_last() should do the same.

Cc: [email protected]
Signed-off-by: Al Viro <[email protected]>
---
diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..13f9f973722b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2643,7 +2643,6 @@ EXPORT_SYMBOL(user_path_at_empty);
 static int
 mountpoint_last(struct nameidata *nd)
 {
-	int error = 0;
 	struct dentry *dir = nd->path.dentry;
 	struct path path;

@@ -2656,10 +2655,7 @@ mountpoint_last(struct nameidata *nd)
 	nd->flags &= ~LOOKUP_PARENT;

 	if (unlikely(nd->last_type != LAST_NORM)) {
-		error = handle_dots(nd, nd->last_type);
-		if (error)
-			return error;
-		path.dentry = dget(nd->path.dentry);
+		return handle_dots(nd, nd->last_type);
 	} else {
 		path.dentry = d_lookup(dir, &nd->last);
 		if (!path.dentry) {

2020-01-01 14:45:38

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2020-01-01, Al Viro <[email protected]> wrote:
> On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
> > Note, BTW, that lookup_last() (aka walk_component()) does just
> > that - we only hit step_into() on LAST_NORM. The same goes
> > for do_last(). mountpoint_last() not doing the same is _not_
> > intentional - it's definitely a bug.
> >
> > Consider your testcase; link points to . here. So the only
> > thing you could expect from trying to follow it would be
> > the directory 'link' lives in. And you don't have it
> > when you reach the fscker via /proc/self/fd/3; what happens
> > instead is nd->path set to ./link (by nd_jump_link()) *AND*
> > step_into() called, pushing the same ./link onto stack.
> > It violates all kinds of assumptions made by fs/namei.c -
> > when pushing a symlink onto stack nd->path is expected to
> > contain the base directory for resolving it.
> >
> > I'm fairly sure that this is the cause of at least some
> > of the insanity you've caught; there always could be
> > something else, of course, but this hole needs to be
> > closed in any case.
>
> ... and with removal of now unused local variable, that's
>
> mountpoint_last(): fix the treatment of LAST_BIND
>
> step_into() should be attempted only in LAST_NORM
> case, when we have the parent directory (in nd->path).
> We get away with that for LAST_DOT and LAST_DOTDOT,
> since those can't be symlinks, making step_into() an
> equivalent of path_to_nameidata() - we do a bit of
> useless work, but that's it. For LAST_BIND (i.e.
> the case when we'd just followed a procfs-style
> symlink) we really can't go there - result might
> be a symlink and we really can't attempt following
> it.
>
> lookup_last() and do_last() do handle that properly;
> mountpoint_last() should do the same.
>
> Cc: [email protected]
> Signed-off-by: Al Viro <[email protected]>

Thanks, this fixes the issue for me (and also fixes another reproducer I
found -- mounting a symlink on top of itself then trying to umount it).

Reported-by: Aleksa Sarai <[email protected]>
Tested-by: Aleksa Sarai <[email protected]>

As for the original topic of bind-mounting symlinks -- given this is a
supported feature, would you be okay with me sending an updated
O_EMPTYPATH series?

> ---
> diff --git a/fs/namei.c b/fs/namei.c
> index d6c91d1e88cb..13f9f973722b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2643,7 +2643,6 @@ EXPORT_SYMBOL(user_path_at_empty);
> static int
> mountpoint_last(struct nameidata *nd)
> {
> - int error = 0;
> struct dentry *dir = nd->path.dentry;
> struct path path;
>
> @@ -2656,10 +2655,7 @@ mountpoint_last(struct nameidata *nd)
> nd->flags &= ~LOOKUP_PARENT;
>
> if (unlikely(nd->last_type != LAST_NORM)) {
> - error = handle_dots(nd, nd->last_type);
> - if (error)
> - return error;
> - path.dentry = dget(nd->path.dentry);
> + return handle_dots(nd, nd->last_type);
> } else {
> path.dentry = d_lookup(dir, &nd->last);
> if (!path.dentry) {


--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-01-01 23:41:27

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:

> Thanks, this fixes the issue for me (and also fixes another reproducer I
> found -- mounting a symlink on top of itself then trying to umount it).
>
> Reported-by: Aleksa Sarai <[email protected]>
> Tested-by: Aleksa Sarai <[email protected]>

Pushed into #fixes.

> As for the original topic of bind-mounting symlinks -- given this is a
> supported feature, would you be okay with me sending an updated
> O_EMPTYPATH series?

Post it on fsdevel; I'll need to reread it anyway to say anything useful...

2020-01-02 04:00:41

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2020-01-01, Al Viro <[email protected]> wrote:
> On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
>
> > Thanks, this fixes the issue for me (and also fixes another reproducer I
> > found -- mounting a symlink on top of itself then trying to umount it).
> >
> > Reported-by: Aleksa Sarai <[email protected]>
> > Tested-by: Aleksa Sarai <[email protected]>
>
> Pushed into #fixes.

Thanks. One other thing I noticed is that umount applies to the
underlying symlink rather than the mountpoint on top. So, for example
(using the same scripts I posted in the thread):

# ln -s /tmp/foo link
# ./mount_to_symlink /etc/passwd link
# umount -l link # will attempt to unmount "/tmp/foo"

Is that intentional?

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-01-02 08:59:05

by David Laight

Subject: RE: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

From: Aleksa Sarai
> Sent: 30 December 2019 08:32
...
> I'm not sure I agree -- as I mentioned in my other mail, re-opening
> through /proc/self/fd/$n works *very* well and has for a long time (in
> fact, both LXC and runc depend on this working).

I thought it was marginally broken because it is followed as a symlink?
On, for example, NetBSD /proc/<n>/fd/<n> is a real reference to the
filesystem inode and can be used to link the file back into the filesystem
if all the directory entries have been removed.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-01-02 09:10:45

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2020-01-02, David Laight <[email protected]> wrote:
> From: Aleksa Sarai
> > Sent: 30 December 2019 08:32
> ...
> > I'm not sure I agree -- as I mentioned in my other mail, re-opening
> > through /proc/self/fd/$n works *very* well and has for a long time (in
> > fact, both LXC and runc depend on this working).
>
> I thought it was marginally broken because it is followed as a symlink?
> On, for example, NetBSD /proc/<n>/fd/<n> is a real reference to the
> filesystem inode and can be used to link the file back into the filesystem
> if all the directory entries have been removed.

That is also the case on Linux. It (strictly speaking) isn't a symlink
in the normal sense of the word, it's a magic-link (nd_jump_link
switches the nd->path to the actual 'struct file' in the case of
/proc/self/fd/$n).
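
A rough user-space illustration of that behaviour, assuming procfs is mounted at /proc (the function name is made up for the example): once the file has been unlinked it has no name left in any directory, yet opening the magic-link still reaches it.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical demo: write to a temporary file, unlink it, then re-open
 * it through /proc/self/fd/<n>. Returns 0 if the data can be read back
 * through the re-opened descriptor, -1 otherwise.
 */
int reopen_unlinked_via_procfs(void)
{
	char tmpl[] = "/tmp/magic-link-XXXXXX";
	char procpath[64], buf[16] = "";
	int fd, fd2, ret = -1;

	fd = mkstemp(tmpl);
	if (fd < 0)
		return -1;
	if (write(fd, "hello", 5) != 5 || unlink(tmpl) < 0)
		goto out;

	/* No directory entry is left; only the open file keeps it alive. */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
	fd2 = open(procpath, O_RDONLY);
	if (fd2 < 0)
		goto out;

	/* A fresh open file description, so reading starts at offset 0. */
	if (read(fd2, buf, sizeof(buf) - 1) == 5 && strcmp(buf, "hello") == 0)
		ret = 0;
	close(fd2);
out:
	close(fd);
	return ret;
}
```

An ordinary symlink could not do this, since its textual target would have to be looked up by name again.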

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-01-03 01:50:12

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> On 2020-01-01, Al Viro <[email protected]> wrote:
> > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> >
> > > Thanks, this fixes the issue for me (and also fixes another reproducer I
> > > found -- mounting a symlink on top of itself then trying to umount it).
> > >
> > > Reported-by: Aleksa Sarai <[email protected]>
> > > Tested-by: Aleksa Sarai <[email protected]>
> >
> > Pushed into #fixes.
>
> Thanks. One other thing I noticed is that umount applies to the
> underlying symlink rather than the mountpoint on top. So, for example
> (using the same scripts I posted in the thread):
>
> # ln -s /tmp/foo link
> # ./mount_to_symlink /etc/passwd link
> # umount -l link # will attempt to unmount "/tmp/foo"
>
> Is that intentional?

It's a mess, again in mountpoint_last(). FWIW, at some point I proposed
to have nd_jump_link() to fail with -ELOOP if the target was a symlink;
Linus asked for reasons deeper than my dislike of the semantics, I looked
around and hadn't spotted anything. And there hadn't been at the time,
but when four months later umount_lookup_last() went in I failed to look
for that source of potential problems in it ;-/

I've looked at that area again now. Aside of usual cursing at do_last()
horrors (yes, its control flow is a horror; yes, it needs serious massage;
no, it's not a good idea to get sidetracked into that right now), there
are several fun questions:
* d_manage() and d_automount(). We almost certainly don't
want those for autofs on the final component of pathname in umount,
including the trailing symlinks. But do we want those on usual access
via /proc/*/fd/*? I.e. suppose somebody does open() (O_PATH or not)
in autofs; do we want ->d_manage()/->d_automount() called when
resolving /proc/self/fd/<whatever>/foo/bar? We do not; is that
correct from autofs point of view? I suspect that refusing to
do ->d_automount() is correct, but I don't understand ->d_manage()
purpose well enough to tell.
* I really hope that the weird "trailing / forces automount
even in cases when we normally wouldn't trigger it" (stat /mnt/foo
vs. stat /mnt/foo/) is not meant to extend to umount. I'd like
Ian's confirmation, though.
* do we want ->d_manage() on following .. into overmounted
directory? Again, autofs question...

The minimal fix to mountpoint_last() would be to have
follow_mount() done in LAST_NORM case. However, I'd like to understand
(and hopefully regularize) the rules for follow_mount()/follow_managed().
Additional scary question is nfsd interplay with automount. For nfs4
exports it's potentially interesting...

Ian, could you comment on the autofs questions above?
I'd rather avoid doing changes in that area without your input -
it's subtle and breakage in automount-related behaviour can be
mysterious as hell.

2020-01-04 04:47:39

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks


It may be a bit off-topic here but, in autofs symlinks can be used
in place of mounts. That mechanism can be used (mostly nowadays) with
amd map format maps.

If I'm using symlinks instead of mounts (where I can) I definitely
don't want these to be over mounted by a mount.

I haven't seen problems like that happening but if it did happen
that would be a bug in automount or user mis-use of some sort.

On Fri, 2020-01-03 at 01:49 +0000, Al Viro wrote:
> On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> > On 2020-01-01, Al Viro <[email protected]> wrote:
> > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> > >
> > > > Thanks, this fixes the issue for me (and also fixes another
> > > > reproducer I
> > > > found -- mounting a symlink on top of itself then trying to
> > > > umount it).
> > > >
> > > > Reported-by: Aleksa Sarai <[email protected]>
> > > > Tested-by: Aleksa Sarai <[email protected]>
> > >
> > > Pushed into #fixes.
> >
> > Thanks. One other thing I noticed is that umount applies to the
> > underlying symlink rather than the mountpoint on top. So, for
> > example
> > (using the same scripts I posted in the thread):
> >
> > # ln -s /tmp/foo link
> > # ./mount_to_symlink /etc/passwd link
> > # umount -l link # will attempt to unmount "/tmp/foo"
> >
> > Is that intentional?
>
> It's a mess, again in mountpoint_last(). FWIW, at some point I
> proposed
> to have nd_jump_link() to fail with -ELOOP if the target was a
> symlink;
> Linus asked for reasons deeper than my dislike of the semantics, I
> looked
> around and hadn't spotted anything. And there hadn't been at the
> time,
> but when four months later umount_lookup_last() went in I failed to
> look
> for that source of potential problems in it ;-/
>
> I've looked at that area again now. Aside of usual cursing at
> do_last()
> horrors (yes, its control flow is a horror; yes, it needs serious
> massage;
> no, it's not a good idea to get sidetracked into that right now),
> there
> are several fun questions:
> * d_manage() and d_automount(). We almost certainly don't
> want those for autofs on the final component of pathname in umount,
> including the trailing symlinks. But do we want those on usual
> access
> via /proc/*/fd/*? I.e. suppose somebody does open() (O_PATH or not)
> in autofs; do we want ->d_manage()/->d_automount() called when
> resolving /proc/self/fd/<whatever>/foo/bar? We do not; is that
> correct from autofs point of view? I suspect that refusing to
> do ->d_automount() is correct, but I don't understand ->d_manage()
> purpose well enough to tell.

Yes, we don't want those on the final component of the path in
umount. The following of a symlink will give us a new path of
some sort so the rules would change to the usual ones for the
new path.

The semantics of following a symlink, be the source a proc entry
or not (I think) should always be the same. If the follow takes
us to an autofs file system (be it a trigger mount or an indirect
mount in an autofs file system) the behaviour should be that of
the autofs file system when we arrive there, from an auto-mount
POV.

The original intent of ->d_manage() was to prevent walks into an
under construction mount and that might not be as simple as mounting
a source on a mount point.

For example take the case of an automount indirect mount map entry
like this:

test    /some/path/one          server:/source/path1 \
        /some/path/two          server2:/source/path2 \
        /some/other/path        server:/source/path3 \
        /some/other/path/three  server:/source/path4

This entry has no mount at the root of the tree (so called root-less
multi-mount) but walks need to block when it's under construction as
the topology isn't known until the directory tree and any associated
mounts (usually trigger mounts) have been completed.

In this case it's needed to go to ref-walk mode and block until it's
done.

The other (perhaps not so obvious) use of ->d_manage() is to detect
expire to mount races, i.e. an automount expiring at the same time as
a process (that would cause an automount) is traversing the path. The
base (I'll not say root, since the root of the expire might not be the
root of the tree) needs to block the walk until the expire is done.

These multi-mounts are meant to provide a "mount as you go" mechanism
so that only portions of the tree of mounts are mounted or expired at
any one time.

For example, the offsets in the above entry are /some/path/one,
/some/path/two, /some/other/path and /some/other/path/three.

On access to <autofs mount>/test automount is meant to mount trigger
mounts for offsets /some/path/one, /some/path/two and /some/other/path
and mount an offset trigger for /some/other/path/three into the mount
for /some/other/path when it's accessed and that might not happen
during the initial mount of the tree. The reverse being done on umount
in sub-trees of mounts when a nesting point like /some/other/path is
encountered.

But that's something of an aside because in all cases below the root
there will be an actual mount preventing walks into the tree under
nesting point mounts being constructed or expired.

Anyway, returning to the topic at hand, the answer to whether we want
->d_manage()/->d_automount() after a symlink has been followed is
yes, I think, because at that point we could be within a file system
that has automounts of some sort.

But perhaps I'm missing something about the description of the case
above ...

> * I really hope that the weird "trailing / forces automount
> even in cases when we normally wouldn't trigger it" (stat /mnt/foo
> vs. stat /mnt/foo/) is not meant to extend to umount. I'd like
> Ian's confirmation, though.

I can't see any way that the trailing "/" can relate to umount.

It has always been meant to be used to trigger a mount on something
that would otherwise not be mounted and that's the only case I'm
aware of.

> * do we want ->d_manage() on following .. into overmounted
> directory? Again, autofs question...

I think that amounts to asking "can the target of the ../ be in the
process of being constructed or expired at this time" and that's
probably yes. A root-less multi-mount would be one case where this
could happen (although it's not strictly an over-mounted directory).

>
> The minimal fix to mountpoint_last() would be to have
> follow_mount() done in LAST_NORM case. However, I'd like to
> understand
> (and hopefully regularize) the rules for
> follow_mount()/follow_managed().
> Additional scary question is nfsd interplay with automount. For nfs4
> exports it's potentially interesting...

I'm not sure about nfs (and other cross mounting file systems). The
automounting in file systems other than autofs always has a real
mount as the target (AFAIK) so there's an implied blocking that occurs
on crossing the mount point. That's always made the nfs automounting
case simpler to my thinking anyway.

The real problem with nfs automount trees is when the topology of
the exports tree changes while parts of it are in use. People that
have any idea of how nfs cross mounting (and mount dependencies in
general) work shouldn't do that but they do it and then wonder why
things go wrong ...

>
> Ian, could you comment on the autofs questions above?
> I'd rather avoid doing changes in that area without your input -
> it's subtle and breakage in automount-related behaviour can be
> mysterious as hell.

Thanks for the heads up.

As always I can run tests on changes you want to do.
Fortunately that's generally worked out ok for us in the past.

Ian

2020-01-04 05:53:01

by Andy Lutomirski

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks


> On Jan 1, 2020, at 11:44 PM, Aleksa Sarai <[email protected]> wrote:
>
> On 2020-01-01, Al Viro <[email protected]> wrote:
>>> On Wed, Jan 01, 2020 at 12:54:46AM +0000, Al Viro wrote:
>>> Note, BTW, that lookup_last() (aka walk_component()) does just
>>> that - we only hit step_into() on LAST_NORM. The same goes
>>> for do_last(). mountpoint_last() not doing the same is _not_
>>> intentional - it's definitely a bug.
>>>
>>> Consider your testcase; link points to . here. So the only
>>> thing you could expect from trying to follow it would be
>>> the directory 'link' lives in. And you don't have it
>>> when you reach the fscker via /proc/self/fd/3; what happens
>>> instead is nd->path set to ./link (by nd_jump_link()) *AND*
>>> step_into() called, pushing the same ./link onto stack.
>>> It violates all kinds of assumptions made by fs/namei.c -
>>> when pushing a symlink onto stack nd->path is expected to
>>> contain the base directory for resolving it.
>>>
>>> I'm fairly sure that this is the cause of at least some
>>> of the insanity you've caught; there always could be
>>> something else, of course, but this hole needs to be
>>> closed in any case.
>>
>> ... and with removal of now unused local variable, that's
>>
>> mountpoint_last(): fix the treatment of LAST_BIND
>>
>> step_into() should be attempted only in LAST_NORM
>> case, when we have the parent directory (in nd->path).
>> We get away with that for LAST_DOT and LAST_DOTDOT,
>> since those can't be symlinks, making step_into() an
>> equivalent of path_to_nameidata() - we do a bit of
>> useless work, but that's it. For LAST_BIND (i.e.
>> the case when we'd just followed a procfs-style
>> symlink) we really can't go there - result might
>> be a symlink and we really can't attempt following
>> it.
>>
>> lookup_last() and do_last() do handle that properly;
>> mountpoint_last() should do the same.
>>
>> Cc: [email protected]
>> Signed-off-by: Al Viro <[email protected]>
>
> Thanks, this fixes the issue for me (and also fixes another reproducer I
> found -- mounting a symlink on top of itself then trying to umount it).
>
> Reported-by: Aleksa Sarai <[email protected]>
> Tested-by: Aleksa Sarai <[email protected]>
>
> As for the original topic of bind-mounting symlinks -- given this is a
> supported feature, would you be okay with me sending an updated
> O_EMPTYPATH series?

FWIW, I have an actual use case for mounting over a symlink: replacing /etc/resolv.conf. My virtme tool is presented with somewhat arbitrary crud in /etc, where /etc/resolv.conf might be a plain file or a symlink, but, regardless, has inappropriate contents. If it’s a file, I can mount a new file over it. If it’s a symlink and the kernel properly supported it, I could also mount over it.

Yes, I could also use overlayfs. Maybe I should regardless.

2020-01-08 03:14:27

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:

> It's a mess, again in mountpoint_last(). FWIW, at some point I proposed
> to have nd_jump_link() to fail with -ELOOP if the target was a symlink;
> Linus asked for reasons deeper than my dislike of the semantics, I looked
> around and hadn't spotted anything. And there hadn't been at the time,
> but when four months later umount_lookup_last() went in I failed to look
> for that source of potential problems in it ;-/
>
> I've looked at that area again now. Aside of usual cursing at do_last()
> horrors (yes, its control flow is a horror; yes, it needs serious massage;
> no, it's not a good idea to get sidetracked into that right now), there
> are several fun questions:
> * d_manage() and d_automount(). We almost certainly don't
> want those for autofs on the final component of pathname in umount,
> including the trailing symlinks. But do we want those on usual access
> via /proc/*/fd/*? I.e. suppose somebody does open() (O_PATH or not)
> in autofs; do we want ->d_manage()/->d_automount() called when
> resolving /proc/self/fd/<whatever>/foo/bar? We do not; is that
> correct from autofs point of view? I suspect that refusing to
> do ->d_automount() is correct, but I don't understand ->d_manage()
> purpose well enough to tell.
> * I really hope that the weird "trailing / forces automount
> even in cases when we normally wouldn't trigger it" (stat /mnt/foo
> vs. stat /mnt/foo/) is not meant to extend to umount. I'd like
> Ian's confirmation, though.
> * do we want ->d_manage() on following .. into overmounted
> directory? Again, autofs question...

FWIW, I suspect that we want to do something along the following lines:

1) make build_open_flags() treat O_CREAT | O_EXCL as if there had been
O_NOFOLLOW in the mix. Reason: if there is a trailing symlink, we want
to fail with EEXIST anyway. Benefit: this fragment in do_last()
	error = follow_managed(&path, nd);
	if (unlikely(error < 0))
		return error;

	/*
	 * create/update audit record if it already exists.
	 */
	audit_inode(nd->name, path.dentry, 0);

	if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
		path_to_nameidata(&path, nd);
		return -EEXIST;
	}

	seq = 0; /* out of RCU mode, so the value doesn't matter */
	inode = d_backing_inode(path.dentry);
finish_lookup:
	error = step_into(nd, &path, 0, inode, seq);
	if (unlikely(error))
		return error;
can become
	error = follow_managed(&path, nd);
	if (unlikely(error < 0))
		return error;

	seq = 0; /* out of RCU mode, so the value doesn't matter */
	inode = d_backing_inode(path.dentry);
finish_lookup:
	error = step_into(nd, &path, 0, inode, seq);
	if (unlikely(error))
		return error;

	if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
		audit_inode(nd->name, nd->path.dentry, 0);
		return -EEXIST;
	}
Equivalent transformation, since the only goto finish_lookup; is under
if (!(open_flag & O_CREAT)). What it buys us is more regular structure
of follow_managed() callers.
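
The user-visible semantics behind step (1) can be checked from userspace. The following is a minimal sketch, not part of the proposed patch: the helper name and the scratch directory under /tmp are assumptions made for illustration. It creates a dangling symlink and tries O_CREAT | O_EXCL on it; the open is expected to fail with EEXIST rather than create the link target.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): make a scratch directory,
 * put a dangling symlink in it, then try O_CREAT | O_EXCL on the symlink.
 * Returns the errno from open(), 0 if open() unexpectedly succeeded,
 * or -1 if the scratch setup itself failed.
 */
int excl_create_on_symlink(void)
{
	char dir[] = "/tmp/excl-demo-XXXXXX";
	char link_path[PATH_MAX], target_path[PATH_MAX];
	int fd, err;

	if (!mkdtemp(dir))
		return -1;
	snprintf(link_path, sizeof(link_path), "%s/link", dir);
	snprintf(target_path, sizeof(target_path), "%s/target", dir);

	/* "target" does not exist, so following the link could create it */
	if (symlink("target", link_path) < 0) {
		rmdir(dir);
		return -1;
	}

	fd = open(link_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	err = fd < 0 ? errno : 0;
	if (fd >= 0)
		close(fd);

	unlink(target_path);	/* only exists if the link was followed */
	unlink(link_path);
	rmdir(dir);
	return err;
}
```

Calling excl_create_on_symlink() on a Linux system should return EEXIST, which is the behaviour that treating O_CREAT | O_EXCL as implying O_NOFOLLOW would make explicit.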

2) make follow_managed() take &inode and &seq. Look: follow_managed() never
returns 0 (we have
	if (ret == -EISDIR || !ret)
		ret = 1;
on the way to the only return in it) and the callers are
	err = follow_managed(path, nd);
	if (likely(err > 0))
		*inode = d_backing_inode(path->dentry);
	return err;
in lookup_fast(),
	err = follow_managed(&path, nd);
	if (unlikely(err < 0))
		return err;

	seq = 0; /* we are already out of RCU mode */
	inode = d_backing_inode(path.dentry);
in walk_component(),
	err = follow_managed(&path, nd);
	if (unlikely(err < 0))
		return err;
	inode = d_backing_inode(path.dentry);
	seq = 0;
in handle_lookup_down() and (after the previous change)
	error = follow_managed(&path, nd);
	if (unlikely(error < 0))
		return error;

	seq = 0; /* out of RCU mode, so the value doesn't matter */
	inode = d_backing_inode(path.dentry);
in do_last(). That's begging to fold those followups into follow_managed()
itself, doesn't it? And having *seqp = 0; equivalent added in lookup_fast()
is not going to hurt the performance - in all callers it's an address of
local variable, right next to the one whose address is passed as inodep.
Which we'd just dirtied, and the cacheline is not going to have been shared
anyway.

Note that after that the arguments for follow_managed() become identical
to those for __follow_mount_rcu(). Which makes a lot of sense, since
the latter is RCU-mode counterpart of the former.
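
As a rough illustration of the calling convention proposed in (2), here is a toy model in plain C. Every type and function name in it is a hypothetical stand-in, not a kernel structure; it only shows the idea of having the function itself fill in the inode and seq out-parameters, so callers no longer repeat that follow-up.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the real kernel structures. */
struct toy_inode { int ino; };
struct toy_dentry { struct toy_inode *inode; struct toy_dentry *mounted; };
struct toy_path { struct toy_dentry *dentry; };

/*
 * Model of the proposed calling convention: cross any mounts stacked on
 * path->dentry, then report the resulting inode and a zero seq through
 * the out-parameters. Like the function described above, it never
 * returns 0: 1 on success, negative on error.
 */
int toy_follow_managed(struct toy_path *path,
		       struct toy_inode **inodep, unsigned *seqp)
{
	if (!path || !path->dentry)
		return -EINVAL;
	while (path->dentry->mounted)
		path->dentry = path->dentry->mounted;
	*inodep = path->dentry->inode;
	*seqp = 0;	/* out of RCU mode, so the value doesn't matter */
	return 1;
}

/* Small self-check: one dentry with a single mount on top of it. */
int toy_follow_managed_demo(void)
{
	struct toy_inode under = { 1 }, over = { 2 };
	struct toy_dentry top = { &over, NULL };
	struct toy_dentry covered = { &under, &top };
	struct toy_path path = { &covered };
	struct toy_inode *inode = NULL;
	unsigned seq = 12345;

	if (toy_follow_managed(&path, &inode, &seq) != 1)
		return -1;
	return (inode == &over && seq == 0 && path.dentry == &top) ? 0 : -1;
}
```

With the out-parameters filled in by the callee, each caller in the model shrinks to a single call plus an error check, which is the regularity the step is after.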

3) have the followup to failing __follow_mount_rcu() taken into it.
After (2) we have this in lookup_fast():

		*seqp = seq;
		status = d_revalidate(dentry, nd->flags);
		if (likely(status > 0)) {
			/*
			 * Note: do negative dentry check after revalidation in
			 * case that drops it.
			 */
			if (unlikely(negative))
				return -ENOENT;
			path->mnt = mnt;
			path->dentry = dentry;
			if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
				return 1;
		}
		if (unlazy_child(nd, dentry, seq))
			return -ECHILD;
		if (unlikely(status == -ECHILD))
			/* we'd been told to redo it in non-rcu mode */
			status = d_revalidate(dentry, nd->flags);
	} else {
		...
	}
	if (unlikely(status <= 0)) {
		if (!status)
			d_invalidate(dentry);
		dput(dentry);
		return status;
	}

	path->mnt = mnt;
	path->dentry = dentry;
	return follow_managed(path, nd, inode, seqp);

Suppose __follow_mount_rcu() returns false; what follows is
		if (unlazy_child(nd, dentry, seq))
			return -ECHILD;
seq here is equal to *seqp here, dentry - the value of path->dentry at the
time of __follow_mount_rcu() call.
		if (unlikely(status == -ECHILD))
			....
not taken - we know that status must have been positive
	if (unlikely(status <= 0)) {
		...
	}
ditto
	path->mnt = mnt;
	path->dentry = dentry;
	return follow_managed(path, nd, inode, seqp);
we return *path to original and call follow_managed(). IOW, we could bloody
well do all of that in the __follow_mount_rcu() itself, having it return 1
when the original would've returned true and doing that "revert *path,
call unlazy_child() and fall back to follow_managed() in case of success"
in cases when the original would've returned false. The caller turns into
			/*
			 * Note: do negative dentry check after revalidation in
			 * case that drops it.
			 */
			if (unlikely(negative))
				return -ENOENT;
			path->mnt = mnt;
			path->dentry = dentry;
			return __follow_mount_rcu(nd, path, inode, seqp);

4) fold __follow_mount_rcu() into follow_managed(), using the latter both in
RCU and non-RCU cases.

5) take the calls of follow_managed() out of lookup_fast() into its callers.
That would be
	err = lookup_fast(nd, &path, &inode, &seq);
	if (unlikely(err <= 0)) {
		if (err < 0)
			return err;
		path.dentry = lookup_slow(&nd->last, nd->path.dentry,
					  nd->flags);
		if (IS_ERR(path.dentry))
			return PTR_ERR(path.dentry);

		path.mnt = nd->path.mnt;
		err = follow_managed(&path, nd, &inode, &seq);
		if (unlikely(err < 0))
			return err;
	}
turning into
	err = lookup_fast(nd, &path, &inode, &seq);
	if (unlikely(err <= 0)) {
		if (err < 0)
			return err;
		path.dentry = lookup_slow(&nd->last, nd->path.dentry,
					  nd->flags);
		if (IS_ERR(path.dentry))
			return PTR_ERR(path.dentry);

		path.mnt = nd->path.mnt;
	}
	err = follow_managed(&path, nd, &inode, &seq);
	if (unlikely(err < 0))
		return err;
in walk_component() and
	error = lookup_fast(nd, &path, &inode, &seq);
	if (likely(error > 0))
		goto finish_lookup;
	...
	error = follow_managed(&path, nd, &inode, &seq);
	if (unlikely(error < 0))
		return error;
finish_lookup:
turning into
	error = lookup_fast(nd, &path, &inode, &seq);
	if (likely(error > 0))
		goto finish_lookup;
	...
finish_lookup:
	error = follow_managed(&path, nd, &inode, &seq);
	if (unlikely(error < 0))
		return error;
in do_last().

6) after that we have 3 callers of step_into(); the ones in
walk_component() and in do_last() would be immediately preceded
by the calls of follow_managed(). The last one is in
mountpoint_last(). That's
	if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) {
		dput(path.dentry);
		return -ENOENT;
	}
	path.mnt = nd->path.mnt;
	return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);
And that's where we are missing the mountpoint traversal in symlink case -
sure, the caller does follow_mount(), but it doesn't catch the case when
we have a symlink overmounted - we run into step_into() before that.
Note that smp_load_acquire + d_flags_negative is what we would've done
in follow_managed(), as well as getting d_backing_inode(). So here
we also have an open-coded bastardized variant of follow_managed().
The difference is, we don't want to trigger ->d_automount() and ->d_manage()
in that one.

And at that point the only call of follow_managed() *NOT* followed by
step_into() is in handle_lookup_down(). What it is followed by is
	path_to_nameidata(&path, nd);
	nd->inode = inode;
	nd->seq = seq;
And that's a piece of step_into():
	if (likely(!d_is_symlink(path->dentry)) ||
	    !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) {
		/* not a symlink or should not follow */
		path_to_nameidata(path, nd);
		nd->inode = inode;
		nd->seq = seq;
		return 0;
	}
is the normal path through that sucker. What's more, we are guaranteed
that this will _not_ be a symlink (it's the starting point of pathwalk,
and path_init() would've told us to sod off were it not a directory).

So if we manage to convert the damn thing in mountpoint_last() into
follow_managed(), we could fold follow_managed() into step_into().
Which suggests the way to do that - note that step_into() takes an
argument containing ORed WALK_... constants. So we can simply add
WALK_NOAUTOMOUNT and put a check for it into
	if (flags & DCACHE_MANAGE_TRANSIT) {
and
	if (flags & DCACHE_NEED_AUTOMOUNT) {
bodies, so that they would be ignored if that's passed to
follow_mount()/step_into() hybrid.
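
A toy sketch of the WALK_NOAUTOMOUNT idea, assuming made-up bit values (the TOY_ constants below are illustrative and do not match the kernel's definitions): the walk flag masks out the ->d_manage() and ->d_automount() handling while leaving plain mount crossing in place.

```c
#include <assert.h>

/* Illustrative bit values only; the kernel's real constants differ. */
#define TOY_DCACHE_MOUNTED		0x1
#define TOY_DCACHE_NEED_AUTOMOUNT	0x2
#define TOY_DCACHE_MANAGE_TRANSIT	0x4
#define TOY_WALK_NOAUTOMOUNT		0x100	/* the proposed new walk flag */

/*
 * Return the subset of "managed" dentry flags that a walk would act on.
 * With TOY_WALK_NOAUTOMOUNT set, ->d_manage() and ->d_automount() are
 * skipped and only plain mount crossing remains.
 */
unsigned toy_managed_actions(unsigned d_flags, unsigned walk_flags)
{
	unsigned actions = d_flags & (TOY_DCACHE_MOUNTED |
				      TOY_DCACHE_NEED_AUTOMOUNT |
				      TOY_DCACHE_MANAGE_TRANSIT);

	if (walk_flags & TOY_WALK_NOAUTOMOUNT)
		actions &= ~(TOY_DCACHE_NEED_AUTOMOUNT |
			     TOY_DCACHE_MANAGE_TRANSIT);
	return actions;
}
```

In this model the umount path would pass TOY_WALK_NOAUTOMOUNT and every other caller would pass 0, so a single primitive serves both.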

At that point we have one primitive for moving into child, handling
both the mountpoint traversals and keeping track of symlinks. Moreover,
there's a fairly strong argument for using it in case of .. as well.
As it is, if the parent is overmounted, we cross into whatever is
mounted on top of it. And we ignore ->d_manage/->d_automount on
the damn thing. Which is not an issue for anything other than
autofs (nobody else has ->d_manage() and nfs/afs/cifs automount
points don't have children) and for autofs we *want* those called;
that's not something likely to be encountered, but it's not an impossible
setup (autofs direct mount set on an ancestor of somebody's current
directory) and autofs does count upon not walking into something
being set up by the daemon.

I'll put together such a series and see how well it works; it would
fix the idiocies in user_path_mountpoint_at() and make the pathwalk
machinery easier to follow - the boilerplate around mountpoint
crossing and symlink handling is demonstrably easy to get wrong.
If that works and doesn't cause observable slowdown, I'll put it
into -next, either stepping around the changes done by openat2()
series, or rebasing it on top of that.

Another interesting question is whether we want O_PATH open
to trigger automounts. The thing is, we do *NOT* trigger them
(or traverse mountpoints) at the starting point of lookups.
I believe it's a mistake (and mine, at that), but I doubt that
there's anything that can be done about it at that point.
It's a user-visible behaviour and I can easily imagine
a custom /init that ends up relying upon it ;-/ mkdir /root,
mount the final root there, chdir /root, mount --move . /,
remove everything on initramfs using absolute pathnames
and chroot to "." to finish... Traversing mounts at the
beginning of pathwalk would break the hell out of that,
potentially with root filesystem contents wiped out... ;-/

I wish we could change that, but I'm afraid that's cast
in stone by now (and had been for 20 years or so). As it is,
we have an unpleasant side effect - O_PATH open does *NOT*
trigger automounts. So if you do that to e.g. referral point
and try to do ...at() syscalls with that as the origin, you'll
get an unpleasant surprise - automount won't trigger at all.

I think the easiest way to handle that is to have O_PATH
turn on LOOKUP_AUTOMOUNT, same as the normal open() does. That's
trivial to do, but that changes user-visible behaviour. OTOH,
with the current behaviour nobody can rely upon automount not
getting triggered by somebody else just as they are entering
their open(dir, O_PATH), so I think that's not a problem.

Linus, do you have any objections to such O_PATH semantics
change?

PS: I think I see how to untangle the control flow horrors
in do_last() with this massage done, but I'm not going there
until this is sorted out - by previous experience touching
the damn thing can easily turn into several weeks of digging
through the nfs/gfs2/etc. guts trying to verify something,
with a couple of detours into fixing something in there
found in process... ;-/

2020-01-08 03:55:18

by Linus Torvalds

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Tue, Jan 7, 2020 at 7:13 PM Al Viro wrote:
>
> FWIW, I suspect that we want to do something along the following lines:
>
> 1) make build_open_flags() treat O_CREAT | O_EXCL as if there had been
> O_NOFOLLOW in the mix.

My reaction to that is "Whee, that's a big change".

But:

> Benefit: this fragment in do_last()

you're right.

That's the semantics we have right now (and I think it's the correct
safe semantics when I think about it). But when I first looked at your
email I without thinking more about it actually thought we followed
the symlink, and then did the O_CREAT | O_EXCL on the target (and
potentially succeeded).

So I agree - making O_CREAT | O_EXCL imply O_NOFOLLOW seems to be the
right thing to do, and not only should simplify our code, it's much
more descriptive of what the real semantics are.

Even if my first reaction was that it would act differently.

Slash-and-burn approach to your explanatory subsequent steps:

> 2) make follow_managed() take &inode and &seq.
> 3) have the followup to failing __follow_mount_rcu() taken into it.
> 4) fold __follow_mount_rcu() into follow_managed(), using the latter both in
> RCU and non-RCU cases.
> 5) take the calls of follow_managed() out of lookup_fast() into its callers.
> 6) after that we have 3 callers of step_into(); [..]
> So if we manage to convert the damn thing in mountpoint_last() into
> follow_managed(), we could fold follow_managed() into step_into().

I think that all makes sense. I didn't go to look at the source, but
from the email contents your steps look reasonable to me.

> Another interesting question is whether we want O_PATH open
> to trigger automounts.

It does sound like they shouldn't, but as you say:

> The thing is, we do *NOT* trigger them
> (or traverse mountpoints) at the starting point of lookups.
> I believe it's a mistake (and mine, at that), but I doubt that
> there's anything that can be done about it at that point.
> It's a user-visible behaviour [..]

Hmm. I wonder how set in stone that is. We may have two decades of
history of not doing it at start point of lookups, but we do *not*
have two decades of history of O_PATH.

So what I think we agree would be sane behavior would be for O_PATH
opens to not trigger automounts (unless there's a slash at the end,
whatever), but _do_ add the mount-point traversal to the beginning of
lookups.

But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.

That way we maintain original behavior: if somebody overmounts your
cwd, you still see the pre-mount directory on lookups, because your
cwd is "under" the mount.

But if you open a file with O_PATH, and somebody does a mount
_afterwards_, the openat() will see that later mount and/or do the
automount.

Don't you think that would be the more sane/obvious semantics of how
O_PATH should work?

> I think the easiest way to handle that is to have O_PATH
> turn LOOKUP_AUTOMOUNT, same as the normal open() does. That's
> trivial to do, but that changes user-visible behaviour. OTOH,
> with the current behaviour nobody can rely upon automount not
> getting triggered by somebody else just as they are entering
> their open(dir, O_PATH), so I think that's not a problem.
>
> Linus, do you have any objections to such O_PATH semantics
> change?

See above: I think I'd prefer the O_PATH behavior the other way
around. That seems to be more of a consistent behavior of what
"O_PATH" means - it means "don't really open, we'll do it only when
you use it as a directory".

But I don't have any _strong_ opinions. If you have a good reason to
tell me that I'm being stupid, go ahead and do so and override my
stupidity.

Linus

2020-01-08 04:40:42

by Andy Lutomirski

Subject: Re: [PATCH RFC 1/1] mount: universally disallow mounting over symlinks

On Mon, Dec 30, 2019 at 12:29 AM Aleksa Sarai wrote:
>
> On 2019-12-29, Linus Torvalds wrote:
> > On Sun, Dec 29, 2019 at 9:21 PM Aleksa Sarai wrote:
>
> If allowing bind-mounts over symlinks is allowed (which I don't have a
> problem with really), it just means we'll need a few more kernel pieces
> to get this hardening to work. But these features would be useful
> outside of the problems I'm dealing with (O_EMPTYPATH and some kind of
> pidfd-based interface to grab the equivalent of /proc/self/exe and a few
> other such magic-link targets).

As one data point, I would use this ability in virtme: this would
allow me to more reliably mount over /etc/resolv.conf even when it's
a symlink.

(Perhaps I should use overlayfs instead. Hmm.)

2020-01-08 21:35:52

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Tue, Jan 07, 2020 at 07:54:02PM -0800, Linus Torvalds wrote:

> > Another interesting question is whether we want O_PATH open
> > to trigger automounts.
>
> It does sound like they shouldn't, but as you say:
>
> > The thing is, we do *NOT* trigger them
> > (or traverse mountpoints) at the starting point of lookups.
> > I believe it's a mistake (and mine, at that), but I doubt that
> > there's anything that can be done about it at that point.
> > It's a user-visible behaviour [..]
>
> Hmm. I wonder how set in stone that is. We may have two decades of
> history of not doing it at start point of lookups, but we do *not*
> have two decades of history of O_PATH.
>
> So what I think we agree would be sane behavior would be for O_PATH
> opens to not trigger automounts (unless there's a slash at the end,
> whatever), but _do_ add the mount-point traversal to the beginning of
> lookups.
>
> But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
>
> That way we maintain original behavior: if somebody overmounts your
> cwd, you still see the pre-mount directory on lookups, because your
> cwd is "under" the mount.
>
> But if you open a file with O_PATH, and somebody does a mount
> _afterwards_, the openat() will see that later mount and/or do the
> automount.
>
> Don't you think that would be the more sane/obvious semantics of how
> O_PATH should work?

Maybe, but... note that we do not (and AFAICS never had) follow mounts
on /proc/self/cwd, /proc/self/fd/42, etc. And there are very good
reasons for that. First of all, if your stdin is from /tmp/foo,
you'd better get that file when you open /dev/stdin, even if somebody
has done mount --bind /tmp/bar /tmp/foo; another issue is with
the use of stat("/proc/self/fd/42", &buf) - it should be an equivalent
of fstat(42, &buf), even if somebody has overmounted that. BTW, for
similar reason after
	link(".", "foo");
	fd = open("foo", O_PATH); // return 42
we really should (and do) have resolution of /proc/self/fd/42 stop at
foo, not . Reason: consistency of stat() behaviour...
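
The stat()/fstat() equivalence mentioned above can be probed from userspace. This is a minimal sketch under a few assumptions: procfs is mounted at /proc, the helper name is made up, and "/" is an arbitrary choice of object to open.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if stat() through /proc/self/fd/<fd> and fstat(fd) agree on
 * device and inode, 0 if they disagree, -1 if the check could not be run
 * (no procfs, open failure, ...). The choice of "/" is arbitrary.
 */
int procfd_stat_matches_fstat(void)
{
	char procpath[64];
	struct stat via_proc, via_fd;
	int fd, ret = -1;

	fd = open("/", O_PATH | O_DIRECTORY);
	if (fd < 0)
		return -1;
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
	if (stat(procpath, &via_proc) == 0 && fstat(fd, &via_fd) == 0)
		ret = via_proc.st_dev == via_fd.st_dev &&
		      via_proc.st_ino == via_fd.st_ino;
	close(fd);
	return ret;
}
```

The sketch does not set up an overmount (that needs privileges); it only shows the baseline equivalence that following mounts on procfs links would break.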

The point is, we'd never followed mounts on /proc/self/cwd et al.
I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me)
is that way. Actually, scratch that - 2.0 behaves the same way
(mountpoint crossing is done in iget() there; is that Minix influence
or straight from the Lions' book?)

Hmm... Looking through the history, we have

(for reference) v7: mount traversal in iget()
(forward) and namei() (back); due to the way it's done, forward
traversal happens
* at starting point
* after any component (. and .. included)
* on results of forward traversal (due to a loop in iget()).
Back traversal (to covered on .. from root directory) is also
to unlimited depth.

0.01: no mount handling

0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()
(not by Lions' Book, then - v6 didn't do back traversals at all).
Forward traversal
* after any component (. and .. included)
No traversal on starting point, no traversal on result of traversal.
OTOH, mount(2) refuses to mount on top of root, so the lack of the last
one is not an issue.

0.12: symlinks added; no mount traversal on starting point of those either.
We start at the process' root for absolute ones, even if it happens to be
overmounted, and we start from parent for relative ones. The latter matters
only if we were in the beginning of the pathwalk, since anything else would've
traversed mounts back when we'd picked said parent. Mount traversal takes
precedence over symlink traversal, but that's not an issue since mount follows
links on mountpoint. It does not, at that point, reject fs image with
symlink for root, but that actually more or less works.

0.97.3: same, with addition of procfs symlinks. No mount crossing on their
targets (for normal symlinks we don't do mount crossing in the beginning
and any component inside triggers mount crossing as usual; for procfs ones
there's no components inside)

Situation remains essentially unchanged until 2.1.42. Next few kernels
are in flux, to put it politely - initial merge had been insane and it
took until 2.1.44 or so for the things to get more or less working.

At 2.1.44: forward traversal in fs/namei.c:lookup(), back traversal in
fs/namei.c:reserved_lookup(). Otherwise the same behaviour as pre-dcache
(wrt mount traversals, that is).

2.1.51pre1: forward traversal moved into real_lookup() and __d_lookup().
Forward traversal happens *ONLY* after normal components - not after . or ..

2.1.61: forward traversal moved into follow_mount(), behaviour reverted to
pre-dcache one.

Previous is from reading through the historical trees; my involvement started
circa 2.1.120-something.

2.3.50pre3: call of follow_mount() moved a bit, reverting to 2.1.51pre1
behaviour (no traversal on . or ..) *again*. Not sure whose idea had that
been - might've been mine, but unlike the other patch that went into fs/namei.c
in the same release, I hadn't been able to find anything related to that
one. If your memories (or mail archives) are better...

2.3.99pre4-5: massive surgery in there. Preparations to allowing mount on top
of mount; forward traversal adjusted accordingly, back traversal still isn't.

2.3.99pre7-1: more surgery, back traversals are also to unlimited depth now
and mount on top of mount has been allowed.

2.3.99pre9-4: mount --bind taught to mount non-directories on top of
non-directories. At that point it does *NOT* follow trailing symlinks, so
mounting of symlinks and mounting on top of symlinks becomes possible.
Mount traversal still takes precedence over symlink traversal, symlink traversal
of mount traversal result still generally works, even though it's not something
I considered at the time.

v2.4.5.2: mount --bind started to follow symlinks. So that source of mounting
of and on the symlinks was no more.

2.5.0.5: forward mount traversal is done after .. (in handle_dotdot()).
That brings back the pre-dcache behaviour for those suckers. Still no
forward traversal after ., though.

At about the same time I'd been getting rid of the early-boot incestuous
relationships with fs/namespace.c (initramfs work) and that was probably
the last time we could realistically switch to following mounts at starting
point; I considered trying to do that, but decided not to. Pity, that...

2.6.5-rc2: normal mount now checks for corrupt fs with symlink for root.
Since it has always been following symlinks for mountpoint, the remaining
source of mounting of and on symlinks was gone; that lasted until
after O_PATH introduction.

2.6.39-rc1: mount traps support - instead of abusing ->follow_link()
for automounting, we have an explicit pair of methods that can be
called at the same places where we traverse mounts. None too consistent -
we don't do that on .. results. That was Dave Howells and Ian Kent.

2.6.39-rc1: O_PATH introduced and, later in the same series, allowed for
symlinks. That has changed things - now procfs symlink targets could
be symlinks themselves. Originally an attempt to follow those would
blow up with -ELOOP (there's simply no good way to follow such beast;
it's either "stop even if we are asked to follow" or "give an error").

3.6.0-rc1: nd_jump_link() introduction (hch) had unnoticed side effects -
we'd switched from "fail traversal with -ELOOP" to "stop there". Mostly it
doesn't change behaviour, but it has opened a way to mount symlinks and
mount on top of symlinks. Which generally worked.

circa 3.8--3.9: side effects had been noticed; my first reaction had been
"let's make nd_jump_link() return an error, then", but I hadn't been
able to find good reasons when challenged to do so. Did an audit,
found no obvious problems, went "oh, well - whether it works by accident
or by design, it doesn't break anything".

3.12.0-rc1: lookups for umount(2) are different - we don't want
revalidate on the last component. Which had been handled by
introduction of path_umountat()/umount_lookup_last(), parallel to
path_lookupat(). Which has gotten quite a few things wrong -
it *did* try to follow symlinks obtained by following procfs
ones (and blew up big way) and it didn't follow mounts on
overmounted trailing symlinks. Nobody noticed for 6 years,
until folks actually tried to play with mount-on-symlink...
Patches were by Jeff Layton, neither he nor I have spotted the
problem back then. And I should have, since it had been only
a few months since the audit for exactly that kind of problems...

AFAICS, there'd been no serious semantical changes since then. What we
have right now:
* no mount traversal on the starting point
* mount traversal after any component other than "."
* symlink traversal consists of possibly jumping to given
point plus following a given (possibly empty) series of components.
It can be both - e.g. symlink to "/foo/bar" is 'jump to root,
then traverse "foo", then traverse "bar"'. Procfs "magic" symlinks
are not really magical - they behave as symlinks to "/" as far as
the pathwalk semantics is concerned. The only difference is that
jump might be not to process' root.
* mount traversal takes precedence over symlink traversal.
* jump (if any) in symlink traversal is treated the same
as the starting point - it's not followed by mount traversal.
It's also not followed by symlink traversal, even if we are jumping
into a symlink. Of course, in any position other than the end of
pathname that's an instant error. That's also not different from
the starting point treatment - if ...at(2) is given a symlink for
starting point, it leaves it as-is if AT_EMPTY_PATH is given and
fails with -ENOTDIR otherwise.
* umount(2) handles the final component differently -
for one thing, it does not do revalidate, for another - its
mount traversal (if any) does not include automount-related
parts. And there we *do* want mount traversal at the final
point, for obvious reasons.
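
The starting-point rule for symlinks in the list above (left as-is with AT_EMPTY_PATH, -ENOTDIR otherwise) can be observed with a short userspace probe. This is a sketch under assumptions: the helper name, the scratch directory under /tmp and the link target "/" are all illustrative choices.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical probe. On success, *is_link says whether
 * fstatat(fd, "", AT_EMPTY_PATH) saw the symlink itself and *rel_errno
 * holds errno from fstatat(fd, "foo", 0). Returns 0, or -1 if the
 * scratch setup failed.
 */
int symlink_start_probe(int *is_link, int *rel_errno)
{
	char dir[] = "/tmp/symstart-XXXXXX";
	char link_path[PATH_MAX];
	struct stat st;
	int fd, ret = -1;

	if (!mkdtemp(dir))
		return -1;
	snprintf(link_path, sizeof(link_path), "%s/link", dir);
	if (symlink("/", link_path) < 0)
		goto out_dir;

	/* O_PATH | O_NOFOLLOW yields an fd for the symlink itself */
	fd = open(link_path, O_PATH | O_NOFOLLOW);
	if (fd < 0)
		goto out_link;

	if (fstatat(fd, "", &st, AT_EMPTY_PATH) == 0) {
		*is_link = S_ISLNK(st.st_mode);
		*rel_errno = fstatat(fd, "foo", &st, 0) < 0 ? errno : 0;
		ret = 0;
	}
	close(fd);
out_link:
	unlink(link_path);
out_dir:
	rmdir(dir);
	return ret;
}
```

On a current Linux kernel the probe is expected to report a symlink for the empty-path lookup and ENOTDIR for the relative one, even though the link points at a directory.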

> > I think the easiest way to handle that is to have O_PATH
> > turn LOOKUP_AUTOMOUNT, same as the normal open() does. That's
> > trivial to do, but that changes user-visible behaviour. OTOH,
> > with the current behaviour nobody can rely upon automount not
> > getting triggered by somebody else just as they are entering
> > their open(dir, O_PATH), so I think that's not a problem.
> >
> > Linus, do you have any objections to such O_PATH semantics
> > change?
>
> See above: I think I'd prefer the O_PATH behavior the other way
> around. That seems to be more of a consistent behavior of what
> "O_PATH" means - it means "don't really open, we'll do it only when
> you use it as a directory".

How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ)
vs. faccessat(42, "foo", MAY_READ)? The latter would trigger automount,
the former would not... Or would you extend that to "traverse mounts
upon following procfs links, if the file in question had been opened with
O_PATH"? We could do that (give nd_jump_link() an extra argument telling
if we want mount traversal), but I'm not sure if the resulting semantics
is sane...
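
For reference, the two lookups being compared look like this from userspace. The sketch assumes procfs at /proc and uses a scratch directory with no automount point involved, so it shows only the ordinary case where both forms agree. MAY_READ is the kernel-internal name; the userspace constant is R_OK. The helper name is made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical comparison on a plain directory with no automount
 * involved. Returns 1 if both lookups give the same result, 0 if they
 * differ, -1 if the scratch setup failed or procfs is unavailable.
 */
int procfd_access_matches_faccessat(void)
{
	char dir[] = "/tmp/pathfd-XXXXXX";
	char file_path[PATH_MAX], proc_path[64];
	int dfd, fd, ret = -1;

	if (access("/proc/self/fd", F_OK) != 0)
		return -1;
	if (!mkdtemp(dir))
		return -1;
	snprintf(file_path, sizeof(file_path), "%s/foo", dir);
	fd = open(file_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		goto out_dir;
	close(fd);

	dfd = open(dir, O_PATH | O_DIRECTORY);
	if (dfd >= 0) {
		snprintf(proc_path, sizeof(proc_path),
			 "/proc/self/fd/%d/foo", dfd);
		/* same file reached by procfs link and by dirfd */
		ret = access(proc_path, R_OK) == faccessat(dfd, "foo", R_OK, 0);
		close(dfd);
	}
	unlink(file_path);
out_dir:
	rmdir(dir);
	return ret;
}
```

The question raised above is what happens when the directory behind the fd is an automount point; the sketch deliberately avoids that case and only pins down the two call shapes.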

Note, BTW, that O_PATH users really can't rely upon automounts _not_
being triggered - all it takes is a lookup on bogus path with such prefix
by anybody who can reach that place... We are not opening anything,
really, but we are not able to ignore automounts triggered by somebody
else.

2020-01-10 00:11:33

by Linus Torvalds

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Wed, Jan 8, 2020 at 1:34 PM Al Viro wrote:
>
> The point is, we'd never followed mounts on /proc/self/cwd et.al.
> I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me)
> is that way.

Hmm. If that's the case, maybe they should be marked implicitly as
O_PATH when opened?

> Actually, scratch that - 2.0 behaves the same way
> (mountpoint crossing is done in iget() there; is that Minix influence
> or straight from the Lions' book?)

I don't think I ever had access to Lions' - I've _seen_ a printout of
it later, and obviously maybe others did,

More likely it's from Maurice Bach: the Design of the Unix Operating
System. I'm pretty sure that's where a lot of the FS layer stuff came
from. Certainly the bad old buffer head interfaces, and quite likely
the iget() stuff too.

> 0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()

Whee, you _really_ went back in time.

So I did too.

And looking at that code in iget(), I doubt it came from anywhere.
Christ. It's just looping over a fixed-size array, both when finding
the inode, and finding the superblock.

Cute, but unbelievably stupid. It was a more innocent time.

In other words, I think you can chalk it up to just me, because
blaming anybody else for that garbage would be very very unfair indeed
;)

> How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ)
> vs. faccessat(42, "foo", MAY_READ)?

I think that in a perfect world, the O_PATH'ness of '42' would be the
deciding factor. Wouldn't those be the best and most consistent
semantics?

And then 'cwd'/'root' always have the O_PATH behavior.

> The latter would trigger automount,
> the former would not... Or would you extend that to "traverse mounts
> upon following procfs links, if the file in question had been opened with
> O_PATH"?

Exactly.

But you know what? I do not believe this is all that important, and I
doubt it will matter to anybody.

So what matters most is what makes the most sense to the VFS layer,
and what makes the most sense to _you_.

Because my reaction from this thread is that not only have you thought
about this issue and followed the history a whole lot more than I
would ever have done, it's also that I trust you to DTRT.

I think it would be good to have some self-consistency, but at the
same time clearly we already don't really, and our behavior here has
subtly changed over the years (and not so subtly - if you go back
sufficiently far, /proc behavior wrt file descriptors has had both
"dup()" behavior and "make a new file descriptor with the same inode"
behavior, afaik).

Linus

2020-01-10 04:16:49

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote:
> On Wed, Jan 8, 2020 at 1:34 PM Al Viro wrote:
> >
> > The point is, we'd never followed mounts on /proc/self/cwd et.al.
> > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from me)
> > is that way.
>
> Hmm. If that's the case, maybe they should be marked implicitly as
> O_PATH when opened?

I thought you wanted O_PATH as starting point to have mounts traversed?
Confused...

> > Actually, scratch that - 2.0 behaves the same way
> > (mountpoint crossing is done in iget() there; is that Minix influence
> > or straight from the Lions' book?)
>
> I don't think I ever had access to Lions' - I've _seen_ a printout of
> it later, and obviously maybe others did,
>
> More likely it's from Maurice Bach: the Design of the Unix Operating
> System. I'm pretty sure that's where a lot of the FS layer stuff came
> from. Certainly the bad old buffer head interfaces, and quite likely
> the iget() stuff too.
>
> > 0.10: forward traversal in iget(), back traversal in fs/namei.c:find_entry()
>
> Whee, you _really_ went back in time.
>
> So I did too.
>
> And looking at that code in iget(), I doubt it came from anywhere.
> Christ. It's just looping over a fixed-size array, both when finding
> the inode, and finding the superblock.
>
> Cute, but unbelievably stupid. It was a more innocent time.
>
> In other words, I think you can chalk it up to just me, because
> blaming anybody else for that garbage would be very very unfair indeed
> ;)

See https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c
Exactly the same algorithm, complete with linear searches over those
fixed-size arrays.

<grabs Bach> Right, he simply transcribes v7 iget().

So I suspect that you are right - your variant of iget was pretty much
one-to-one implementation of Bach's description of v7 iget.

Your namei wasn't - Bach has 'if the entry points to root and you are
in the root and name is "..", find mount table entry (by device number),
drop your directory inode, grab the inode of mountpoint and restart
the search for ".." in there', which gives back traversals to arbitrary
depth. And v7 namei() (as Bach mentions) uses iget() for starting point
as well as for each component. You kept pointers instead, which is where
the other difference has come from (no mount traversal at the starting
point)...

Actually, I've misread your code in 0.10 - it does unlimited forward
traversals; it's back traversals that go only one level. The forward
ones got limited to one level in 0.95, but then mount-over-root had
been banned all along. I'd read the pre-dcache variant of iget(),
seen it go pretty much all the way back to beginning and hadn't
sorted out the 0.12 -> 0.95 transition...

> > How would your proposal deal with access("/proc/self/fd/42/foo", MAY_READ)
> > vs. faccessat(42, "foo", MAY_READ)?
>
> I think that in a perfect world, the O_PATH'ness of '42' would be the
> deciding factor. Wouldn't those be the best and most consistent
> semantics?
>
> And then 'cwd'/'root' always have the O_PATH behavior.

See above - unless I'm misparsing you, you wanted mount traversals in the
starting point if it's ...at() with O_PATH fd. With O_PATH open() not
doing them.

For cwd and root the situation is opposite - we do NOT traverse mounts
for those. And that's really too late to change.

> > The latter would trigger automount,
> > the former would not... Or would you extend that to "traverse mounts
> > upon following procfs links, if the file in question had been opened with
> > O_PATH"?
>
> Exactly.
>
> But you know what? I do not believe this is all that important, and I
> doubt it will matter to anybody.

FWIW, digging through the automount-related parts of that stuff has
caught several fun issues. One (and I'm rather embarrassed by it)
should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount
overput on simultaneous automount). To quote the commit message:
	The problem is that lock_mount() drops the caller's reference to the
	mountpoint's vfsmount in the case where it finds something already mounted on
	the mountpoint as it transits to the mounted filesystem and replaces path->mnt
	with the new mountpoint vfsmount.

	During a pathwalk, however, we don't take a reference on the vfsmount if it is
	the same as the one in the nameidata struct, but do_add_mount() doesn't know
	this.
At which point I should've gone "what the fuck?" - lock_mount() does, indeed,
drop path->mnt in this situation and replaces it with whatever's come to
cover it. For mount(2) that's the right thing to do - we _want_ to mount
on top of whatever we have at the mountpoint. For automounts we very much
don't want that - it's either "mount right on top of the automount trigger"
or discard whatever we'd been about to mount and walk into whatever's got
mounted there (presumably the same thing triggered by another process).
We kinda-sorta get that effect, but in a very convoluted way: do_add_mount()
will refuse to mount something on top of itself -
	/* Refuse the same filesystem on the same mount point */
	err = -EBUSY;
	if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
	    path->mnt->mnt_root == path->dentry)
		goto unlock;
which will end up with -EBUSY returned (and recognized by follow_automount()).

First of all, that's unreliable. If somebody not only has triggered that
automount, but managed to _mount_ something else on top (for example,
has triggered it by lookup of mountpoint-to-be in mount(2)), we'll end
up not triggering that check. In which case we'll get something like
nfs referral point under nfs automounted there under tmpfs from explicit
overmount under same nfs mount we'd automounted there - identical to what's
been buried under tmpfs. It's hard to hit, but not impossibly so.

What's more, the whole solution is a kludge - the root of the problem is
that lock_mount() is the wrong thing to do in case of finish_automount().
We don't want to go into whatever's overmounting us there, both for
the reasons above *and* because it's a PITA for the caller. So the
right solution is
* lift lock_mount() call from do_add_mount() into its callers
(all 2 of them); while we are at it, lift unlock_mount() as well
(makes for simpler failure exits in do_add_mount()).
* replace the call of lock_mount() in finish_automount()
with variant that doesn't do "unlock, walk deeper and retry locking",
returning ERR_PTR(-EBUSY) in such case.
* get rid of the kludge introduced in that commit. Better
yet, don't bother with traversing into the covering mount in case
of success - let the caller of follow_automount() do that. Which
eliminates the need to pass need_mntput to the sucker and suggests
an even better solution - have this analogue of lock_mount()
return NULL instead of ERR_PTR(-EBUSY) and treat it in finish_automount()
as "OK, discard what we wanted to mount and return 0". That gets
rid of the entire
	err = finish_automount(mnt, path);
	switch (err) {
	case -EBUSY:
		/* Someone else made a mount here whilst we were busy */
		return 0;
	case 0:
		path_put(path);
		path->mnt = mnt;
		path->dentry = dget(mnt->mnt_root);
		return 0;
	default:
		return err;
	}
chunk in follow_automount() - it would just be
	return finish_automount(mnt, path);
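
The control flow proposed in the last bullet can be modelled in user space
roughly as follows. This is only a toy sketch, not kernel code: struct
toy_mountpoint, toy_lock_uncovered() and toy_finish_automount() are
hypothetical names invented for illustration, and the "locking" is a plain
flag. It shows a lock helper that returns NULL when the mountpoint is already
covered, and a caller that treats NULL as "discard what we wanted to mount and
return 0".

```c
#include <stddef.h>

/* Toy stand-in for a mountpoint; not the kernel's struct mountpoint. */
struct toy_mountpoint {
	int covered;	/* something is already mounted on top */
	int locked;	/* held by toy_lock_uncovered() */
};

/*
 * Hypothetical analogue of lock_mount() for the automount case: it never
 * walks into a covering mount, it just reports one by returning NULL.
 */
static struct toy_mountpoint *toy_lock_uncovered(struct toy_mountpoint *mp)
{
	if (mp->covered)
		return NULL;
	mp->locked = 1;
	return mp;
}

/*
 * Toy finish_automount(): returns 0 both when the new mount is attached and
 * when it is discarded; *attached tells the two cases apart.
 */
static int toy_finish_automount(struct toy_mountpoint *mp, int *attached)
{
	struct toy_mountpoint *locked = toy_lock_uncovered(mp);

	if (!locked) {
		*attached = 0;	/* lost the race: drop what we wanted to mount */
		return 0;
	}
	locked->covered = 1;	/* attach the new mount on the trigger */
	locked->locked = 0;	/* unlock */
	*attached = 1;
	return 0;
}
```

With this shape the caller needs no switch on -EBUSY: both outcomes return 0,
and walking into whatever now covers the trigger is left to the caller.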

Another thing (in the same area) is not a bug per se, but...
after the call of ->d_automount() we have this:
	if (IS_ERR(mnt)) {
		/*
		 * The filesystem is allowed to return -EISDIR here to indicate
		 * it doesn't want to automount.  For instance, autofs would do
		 * this so that its userspace daemon can mount on this dentry.
		 *
		 * However, we can only permit this if it's a terminal point in
		 * the path being looked up; if it wasn't then the remainder of
		 * the path is inaccessible and we should say so.
		 */
		if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT))
			return -EREMOTE;
		return PTR_ERR(mnt);
	}
Except that not a single instance of ->d_automount() has ever returned
-EISDIR. Certainly not the autofs one, despite what the comment says.
That chunk has come from dhowells, back when the whole mount trap series
had been merged. After talking that thing over (fun: trying to figure
out what had been intended nearly 9 years ago, when people involved are
in UK, US east coast and AU west coast respectively. The only way it
could suck more would've been if I were on the west coast - then all
timezone deltas would be 8-hour ones)... looks like it's a rudiment
of plans that got superseded during the series development, nobody
quite remembers exact details. Conclusion: it's not even dead, it's
stillborn; bury it.

Unfortunately, there are other interesting questions related to
autofs-specific bits (->d_manage()) and the timezone-related fun
is, of course, still there. I hope to sort that out today or
tomorrow, at least enough to do a reasonable set of backportable
fixes to put in front of follow_managed()/step_into() queue.
Oh, well...

2020-01-10 05:04:40

by Linus Torvalds

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Thu, Jan 9, 2020 at 8:15 PM Al Viro <[email protected]> wrote:
> >
> > Hmm. If that's the case, maybe they should be marked implicitly as
> > O_PATH when opened?
>
> I thought you wanted O_PATH as starting point to have mounts traversed?
> Confused...

No, I'm confused. I meant "non-O_PATH", just got the rules reversed in my mind.

So cwd/root would always act as if non-O_PATH, and only using an
actual fd would look at the O_PATH flag, and if it was set would walk
the mountpoints.
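
For reference, the two lookup routes being compared in this thread look like
this from user space. This is a minimal sketch under a few assumptions:
compare_routes() and demo_routes() are hypothetical helper names, the scratch
directory lives under /tmp, and /proc is mounted. Note that MAY_READ is a
kernel-internal name; the user-space spelling is R_OK. On a plain directory
with no mounts or automount triggers involved, the two routes are expected to
agree; the discussion here is about the cases where they might not.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Probe "name" under dirfd by both routes.
 * Returns 1 if both succeed, 0 if both fail, -1 if they disagree.
 */
static int compare_routes(int dirfd, const char *name)
{
	char via_proc[PATH_MAX];
	int by_fd, by_proc;

	snprintf(via_proc, sizeof(via_proc), "/proc/self/fd/%d/%s", dirfd, name);
	by_fd = faccessat(dirfd, name, R_OK, 0);	/* ...at() route */
	by_proc = access(via_proc, R_OK);		/* procfs link route */
	if ((by_fd == 0) != (by_proc == 0))
		return -1;
	return by_fd == 0;
}

/*
 * Hypothetical demo in a scratch directory holding one file, "foo".
 * Returns 0 if both routes agree for a present and a missing name.
 */
static int demo_routes(void)
{
	char dir[] = "/tmp/opath-demo-XXXXXX";
	int dirfd, fd, present, missing;

	if (!mkdtemp(dir))
		return -1;
	dirfd = open(dir, O_PATH | O_DIRECTORY);
	if (dirfd < 0) {
		rmdir(dir);
		return -1;
	}
	fd = openat(dirfd, "foo", O_CREAT | O_WRONLY, 0600);
	if (fd >= 0)
		close(fd);
	present = compare_routes(dirfd, "foo");
	missing = compare_routes(dirfd, "bar");
	unlinkat(dirfd, "foo", 0);
	close(dirfd);
	rmdir(dir);
	return (fd >= 0 && present == 1 && missing == 0) ? 0 : -1;
}
```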

> <grabs Bach> Right, he simply transcribes v7 iget().
>
> So I suspect that you are right - your variant of iget was pretty much
> one-to-one implementation of Bach's description of v7 iget.

Ok, that makes sense. My copy of Bach literally had the system call
list "marked off" when I implemented them back when.

I may still have that paperback copy somewhere. I don't _think_ I'd
have thrown it out, it has sentimental value.

> > I think that in a perfect world, the O_PATH'ness of '42' would be the
> > deciding factor. Wouldn't those be the best and most consistent
> > semantics?
> >
> > And then 'cwd'/'root' always have the O_PATH behavior.
>
> See above - unless I'm misparsing you, you wanted mount traversals in the
> starting point if it's ...at() with O_PATH fd.

.. and see above, it was just my confusion about the sense of O_PATH.

> For cwd and root the situation is opposite - we do NOT traverse mounts
> for those. And that's really too late to change.

Oh, absolutely.

[ snip some more about your automount digging. Looks about right, but
I'm not going to make a peep after getting O_PATH reversed ;) ]

Linus

2020-01-10 06:22:19

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Fri, 2020-01-10 at 04:15 +0000, Al Viro wrote:
> On Thu, Jan 09, 2020 at 04:08:16PM -0800, Linus Torvalds wrote:
> > On Wed, Jan 8, 2020 at 1:34 PM Al Viro <[email protected]>
> > wrote:
> > > The point is, we'd never followed mounts on /proc/self/cwd et.al.
> > > I hadn't checked 2.0, but 2.1.100 ('97, before any changes from
> > > me)
> > > is that way.
> >
> > Hmm. If that's the case, maybe they should be marked implicitly as
> > O_PATH when opened?
>
> I thought you wanted O_PATH as starting point to have mounts
> traversed?
> Confused...
>
> > > Actually, scratch that - 2.0 behaves the same way
> > > (mountpoint crossing is done in iget() there; is that Minix
> > > influence
> > > or straight from the Lions' book?)
> >
> > I don't think I ever had access to Lions' - I've _seen_ a printout
> > of
> > it later, and obviously maybe others did,
> >
> > More likely it's from Maurice Bach: the Design of the Unix
> > Operating
> > System. I'm pretty sure that's where a lot of the FS layer stuff
> > came
> > from. Certainly the bad old buffer head interfaces, and quite
> > likely
> > the iget() stuff too.
> >
> > > 0.10: forward traversal in iget(), back traversal in
> > > fs/namei.c:find_entry()
> >
> > Whee, you _really_ went back in time.
> >
> > So I did too.
> >
> > And looking at that code in iget(), I doubt it came from anywhere.
> > Christ. It's just looping over a fixed-size array, both when
> > finding
> > the inode, and finding the superblock.
> >
> > Cute, but unbelievably stupid. It was a more innocent time.
> >
> > In other words, I think you can chalk it up to just me, because
> > blaming anybody else for that garbage would be very very unfair
> > indeed
> > ;)
>
> See
> https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/iget.c
> Exactly the same algorithm, complete with linear searches over those
> fixed-sized array.
>
> <grabs Bach> Right, he simply transcribes v7 iget().
>
> So I suspect that you are right - your variant of iget was pretty
> much
> one-to-one implementation of Bach's description of v7 iget.
>
> Your namei wasn't - Bach has 'if the entry points to root and you are
> in the root and name is "..", find mount table entry (by device
> number),
> drop your directory inode, grab the inode of mountpount and restart
> the search for ".." in there', which gives back traversals to
> arbitrary
> depth. And v7 namei() (as Bach mentions) uses iget() for starting
> point
> as well as for each component. You kept pointers instead, which is
> where
> the other difference has come from (no mount traversal at the
> starting
> point)...
>
> Actually, I've misread your code in 0.10 - it does unlimited forward
> traversals; it's back traversals that go only one level. The forward
> ones got limited to one level in 0.95, but then mount-over-root had
> been banned all along. I'd read the pre-dcache variant of iget(),
> seen it go pretty much all the way back to beginning and hadn't
> sorted out the 0.12 -> 0.95 transition...
>
> > > How would your proposal deal with access("/proc/self/fd/42/foo",
> > > MAY_READ)
> > > vs. faccessat(42, "foo", MAY_READ)?
> >
> > I think that in a perfect world, the O_PATH'ness of '42' would be
> > the
> > deciding factor. Wouldn't those be the best and most consistent
> > semantics?
> >
> > And then 'cwd'/'root' always have the O_PATH behavior.
>
> See above - unless I'm misparsing you, you wanted mount traversals in
> the
> starting point if it's ...at() with O_PATH fd. With O_PATH open()
> not
> doing them.
>
> For cwd and root the situation is opposite - we do NOT traverse
> mounts
> for those. And that's really too late to change.
>
> > > The latter would trigger automount,
> > > the former would not... Or would you extend that to "traverse
> > > mounts
> > > upon following procfs links, if the file in question had been
> > > opened with
> > > O_PATH"?
> >
> > Exactly.
> >
> > But you know what? I do not believe this is all that important, and
> > I
> > doubt it will matter to anybody.
>
> FWIW, digging through the automount-related parts of that stuff has
> caught several fun issues. One (and I'm rather embarrassed by it)
> should've been caught back in commit 8aef18845266 (VFS: Fix vfsmount
> overput on simultaneous automount). To quote the commit message:
> The problem is that lock_mount() drops the caller's reference to
> the
> mountpoint's vfsmount in the case where it finds something
> already mounted on
> the mountpoint as it transits to the mounted filesystem and
> replaces path->mnt
> with the new mountpoint vfsmount.
>
> During a pathwalk, however, we don't take a reference on the
> vfsmount if it is
> the same as the one in the nameidata struct, but do_add_mount()
> doesn't know
> this.
> At which point I should've gone "what the fuck?" - lock_mount() does,
> indeed,
> drop path->mnt in this situation and replaces it with the whatever's
> come to
> cover it. For mount(2) that's the right thing to do - we _want_ to
> mount
> on top of whatever we have at the mountpoint. For automounts we very
> much
> don't want that - it's either "mount right on top of the automount
> trigger"
> or discard whatever we'd been about to mount and walk into whatever's
> got
> mounted there (presumably the same thing triggered by another
> process).
> We kinda-sorta get that effect, but in a very convoluted way:
> do_add_mount()
> will refuse to mount something on top of itself -
> /* Refuse the same filesystem on the same mount point */
> err = -EBUSY;
> if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
> path->mnt->mnt_root == path->dentry)
> goto unlock;
> which will end up with -EBUSY returned (and recognized by
> follow_automount()).
>
> First of all, that's unreliable. If somebody not only has triggered
> that
> automount, but managed to _mount_ something else on top (for example,
> has triggered it by lookup of mountpoint-to-be in mount(2)), we'll
> end
> up not triggering that check. In which case we'll get something like
> nfs referral point under nfs automounted there under tmpfs from
> explicit
> overmount under same nfs mount we'd automounted there - identical to
> what's
> been buried under tmpfs. It's hard to hit, but not impossibly so.
>
> What's more, the whole solution is a kludge - the root of problem is
> that lock_mount() is the wrong thing to do in case of
> finish_automount().
> We don't want to go into whatever's overmounting us there, both for
> the reasons above *and* because it's a PITA for the caller. So the
> right solution is
> * lift lock_mount() call from do_add_mount() into its callers
> (all 2 of them); while we are at it, lift unlock_mount() as well
> (makes for simpler failure exits in do_add_mount()).
> * replace the call of lock_mount() in finish_automount()
> with variant that doesn't do "unlock, walk deeper and retry locking",
> returning ERR_PTR(-EBUSY) in such case.
> * get rid of the kludge introduced in that commit. Better
> yet, don't bother with traversing into the covering mount in case
> of success - let the caller of follow_automount() do that. Which
> eliminates the need to pass need_mntput to the sucker and suggests
> an even better solution - have this analogue of lock_mount()
> return NULL instead of ERR_PTR(-EBUSY) and treat it in
> finish_automount()
> as "OK, discard what we wanted to mount and return 0". That gets
> rid of the entire
> err = finish_automount(mnt, path);
> switch (err) {
> case -EBUSY:
> /* Someone else made a mount here whilst we were busy
> */
> return 0;
> case 0:
> path_put(path);
> path->mnt = mnt;
> path->dentry = dget(mnt->mnt_root);
> return 0;
> default:
> return err;
> }
> chunk in follow_automount() - it would just be
> return finish_automount(mnt, path);
>
> Another thing (in the same area) is not a bug per se, but...
> after the call of ->d_automount() we have this:
> if (IS_ERR(mnt)) {
> /*
> * The filesystem is allowed to return -EISDIR here
> to indicate
> * it doesn't want to automount. For instance,
> autofs would do
> * this so that its userspace daemon can mount on
> this dentry.
> *
> * However, we can only permit this if it's a
> terminal point in
> * the path being looked up; if it wasn't then the
> remainder of
> * the path is inaccessible and we should say so.
> */
> if (PTR_ERR(mnt) == -EISDIR && (nd->flags &
> LOOKUP_PARENT))
> return -EREMOTE;
> return PTR_ERR(mnt);
> }
> Except that not a single instance of ->d_automount() has ever
> returned
> -EISDIR. Certainly not autofs one, despite the what the comment
> says.
> That chunk has come from dhowells, back when the whole mount trap
> series
> had been merged. After talking that thing over (fun: trying to
> figure
> out what had been intended nearly 9 years ago, when people involved
> are
> in UK, US east coast and AU west coast respectively. The only way it
> could suck more would've been if I were on the west coast - then all
> timezone deltas would be 8-hour ones)... looks like it's a rudiment
> of plans that got superseded during the series development, nobody
> quite remembers exact details. Conclusion: it's not even dead, it's
> stillborn; bury it.

Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
we get there it's not relevant any more, so that check looks
redundant. I'm not aware of any other fs automount implementation
that needs that EISDIR pass-thru function.

I didn't notice it at the time of the merge, sorry about that.

While we're at it, that:
	if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
		return -EREMOTE;

at the top of follow_automount() isn't going to be relevant
for autofs because ->d_automount() really must always be defined
for it.

But, at the time of the merge, I didn't object to it because
there were (are) other file systems that use the VFS automount
function which may accidentally not define the method.

>
> Unfortunately, there are other interesting questions related to
> autofs-specific bits (->d_manage()) and the timezone-related fun
> is, of course, still there. I hope to sort that out today or
> tomorrow, at least enough to do a reasonable set of backportable
> fixes to put in front of follow_managed()/step_into() queue.
> Oh, well...

Yeah, I know it slows you down but I kink-off like having a chance
to look at what's going on and think about your questions before trying
to answer them, rather than replying prematurely, as I usually do ...

It's been a bit of a busy day so far but I'm getting to look into
the questions you've asked.

Ian

2020-01-10 21:08:59

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2020-01-07, Linus Torvalds <[email protected]> wrote:
> On Tue, Jan 7, 2020 at 7:13 PM Al Viro <[email protected]> wrote:
> > Another interesting question is whether we want O_PATH open
> > to trigger automounts.
>
> It does sound like they shouldn't, but as you say:
>
> > The thing is, we do *NOT* trigger them
> > (or traverse mountpoints) at the starting point of lookups.
> > I believe it's a mistake (and mine, at that), but I doubt that
> > there's anything that can be done about it at that point.
> > It's a user-visible behaviour [..]
>
> Hmm. I wonder how set in stone that is. We may have two decades of
> history of not doing it at start point of lookups, but we do *not*
> have two decades of history of O_PATH.
>
> So what I think we agree would be sane behavior would be for O_PATH
> opens to not trigger automounts (unless there's a slash at the end,
> whatever), but _do_ add the mount-point traversal to the beginning of
> lookups.
>
> But only do it for the actual O_PATH fd case, not the cwd/root/non-O_PATH case.
>
> That way we maintain original behavior: if somebody overmounts your
> cwd, you still see the pre-mount directory on lookups, because your
> cwd is "under" the mount.
>
> But if you open a file with O_PATH, and somebody does a mount
> _afterwards_, the openat() will see that later mount and/or do the
> automount.
>
> Don't you think that would be the more sane/obvious semantics of how
> O_PATH should work?

If I'm understanding this proposal correctly, this would be a problem
for the libpathrs use-case -- if this is done then there's no way to
avoid a TOCTOU with someone mounting and the userspace program checking
whether something is a mountpoint (unless you have Linux >= 5.6 and
RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:

 1. Open the candidate directory.
 2. umount2(MNT_EXPIRE) the fd.
    * -EINVAL means it wasn't a mountpoint when we got the fd, and the
      fd is a stable handle to the underlying directory.
    * -EAGAIN or -EBUSY means that it was a mountpoint or became a
      mountpoint after the fd was opened (we don't care about that, but
      fail-safe is better here).
 3. Use the fd from (1) for all operations.
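
The steps above can be sketched in C roughly as follows. Several things here
are assumptions rather than anything stated in the thread: umount2() takes a
path, so the sketch reaches the fd through /proc/self/fd/<n>; the names
classify_expire() and probe_fd() are hypothetical; a return of 0 (the mount was
already marked expired and has now been unmounted) and any other error, such
as -EPERM for a caller without CAP_SYS_ADMIN, are treated as "not safe".

```c
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

enum expire_verdict {
	HANDLE_STABLE,		/* -EINVAL: not a mountpoint when the fd was taken */
	HANDLE_MOUNTPOINT,	/* -EAGAIN/-EBUSY (or success): is or was a mountpoint */
	HANDLE_UNKNOWN,		/* anything else: fail safe, do not trust the fd */
};

/* Map the result of umount2(..., MNT_EXPIRE) onto the cases in step 2. */
static enum expire_verdict classify_expire(int ret, int err)
{
	if (ret == 0)
		return HANDLE_MOUNTPOINT;	/* assumption: treat as unsafe */
	switch (err) {
	case EINVAL:
		return HANDLE_STABLE;
	case EAGAIN:
	case EBUSY:
		return HANDLE_MOUNTPOINT;
	default:
		return HANDLE_UNKNOWN;
	}
}

/* Step 2: probe an already-open directory fd via its procfs link. */
static enum expire_verdict probe_fd(int fd)
{
	char path[64];
	int ret;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	ret = umount2(path, MNT_EXPIRE);
	return classify_expire(ret, ret ? errno : 0);
}
```

A caller would continue with step 3 only when probe_fd() returns
HANDLE_STABLE. Be aware that the probe is not side-effect free: on a real,
unused mount, MNT_EXPIRE marks it as expired.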

Don't get me wrong, I want to fix this issue *properly* by adding some
new kernel features that allow us to avoid worrying about
mounts-over-magiclinks -- but on old kernels (which libpathrs cares
about) I would be worried about changes like this being backported
resulting in it being not possible to implement the hardening I
mentioned up-thread.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


2020-01-10 23:22:09

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:
> On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> > On 2020-01-01, Al Viro <[email protected]> wrote:
> > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> > >
> > > > Thanks, this fixes the issue for me (and also fixes another reproducer I
> > > > found -- mounting a symlink on top of itself then trying to umount it).
> > > >
> > > > Reported-by: Aleksa Sarai <[email protected]>
> > > > Tested-by: Aleksa Sarai <[email protected]>
> > >
> > > Pushed into #fixes.
> >
> > Thanks. One other thing I noticed is that umount applies to the
> > underlying symlink rather than the mountpoint on top. So, for example
> > (using the same scripts I posted in the thread):
> >
> > # ln -s /tmp/foo link
> > # ./mount_to_symlink /etc/passwd link
> > # umount -l link # will attempt to unmount "/tmp/foo"
> >
> > Is that intentional?
>
> It's a mess, again in mountpoint_last(). FWIW, at some point I proposed
> to have nd_jump_link() to fail with -ELOOP if the target was a symlink;
> Linus asked for reasons deeper than my dislike of the semantics, I looked
> around and hadn't spotted anything. And there hadn't been at the time,
> but when four months later umount_lookup_last() went in I failed to look
> for that source of potential problems in it ;-/

FWIW, since Ian appears to agree that we want ->d_manage() on the mount
crossing at the end of umount(2) lookup, here's a much simpler solution -
kill mountpoint_last() and switch to using lookup_last(). As a side
benefit, LOOKUP_NO_REVAL also goes away. It's possible to trim the
things even more (path_mountpoint() is very similar to path_lookupat()
at that point, and it's not hard to make the differences conditional on
something like LOOKUP_UMOUNT); I would rather do that part in the
cleanups series - the one below is easier to backport.

Aleksa, Ian - could you see if the patch below works for you?

commit e56b43b971a7c08762fceab330a52b7245041dbc
Author: Al Viro <[email protected]>
Date: Fri Jan 10 17:17:19 2020 -0500

reimplement path_mountpoint() with less magic

... and get rid of a bunch of bugs in it. Background:
the reason for path_mountpoint() is that umount() really doesn't
want attempts to revalidate the root of what it's trying to umount.
The thing we want to avoid actually happens in complete_walk();
solution was to do something parallel to normal path_lookupat()
and it both went overboard and got the boilerplate subtly
(and not so subtly) wrong.

A better solution is to do pretty much what the normal path_lookupat()
does, but instead of complete_walk() do unlazy_walk(). All it takes
to avoid that ->d_weak_revalidate() call... mountpoint_last() goes
away, along with everything it got wrong, and so does the magic around
LOOKUP_NO_REVAL.

Another source of bugs is that when we traverse mounts at the final
location (and we need to do that - umount . expects to get whatever's
overmounting ., if any, out of the lookup) we really ought to take
care of ->d_manage() - as it is, manual umount of autofs automount
in progress can lead to unpleasant surprises for the daemon. Easily
solved by using handle_lookup_down() instead of follow_mount().

Signed-off-by: Al Viro <[email protected]>

diff --git a/fs/namei.c b/fs/namei.c
index d6c91d1e88cb..1793661c3342 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1649,17 +1649,15 @@ static struct dentry *__lookup_slow(const struct qstr *name,
if (IS_ERR(dentry))
return dentry;
if (unlikely(!d_in_lookup(dentry))) {
- if (!(flags & LOOKUP_NO_REVAL)) {
- int error = d_revalidate(dentry, flags);
- if (unlikely(error <= 0)) {
- if (!error) {
- d_invalidate(dentry);
- dput(dentry);
- goto again;
- }
+ int error = d_revalidate(dentry, flags);
+ if (unlikely(error <= 0)) {
+ if (!error) {
+ d_invalidate(dentry);
dput(dentry);
- dentry = ERR_PTR(error);
+ goto again;
}
+ dput(dentry);
+ dentry = ERR_PTR(error);
}
} else {
old = inode->i_op->lookup(inode, dentry, flags);
@@ -2618,72 +2616,6 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
EXPORT_SYMBOL(user_path_at_empty);

/**
- * mountpoint_last - look up last component for umount
- * @nd: pathwalk nameidata - currently pointing at parent directory of "last"
- *
- * This is a special lookup_last function just for umount. In this case, we
- * need to resolve the path without doing any revalidation.
- *
- * The nameidata should be the result of doing a LOOKUP_PARENT pathwalk. Since
- * mountpoints are always pinned in the dcache, their ancestors are too. Thus,
- * in almost all cases, this lookup will be served out of the dcache. The only
- * cases where it won't are if nd->last refers to a symlink or the path is
- * bogus and it doesn't exist.
- *
- * Returns:
- * -error: if there was an error during lookup. This includes -ENOENT if the
- * lookup found a negative dentry.
- *
- * 0: if we successfully resolved nd->last and found it to not to be a
- * symlink that needs to be followed.
- *
- * 1: if we successfully resolved nd->last and found it to be a symlink
- * that needs to be followed.
- */
-static int
-mountpoint_last(struct nameidata *nd)
-{
- int error = 0;
- struct dentry *dir = nd->path.dentry;
- struct path path;
-
- /* If we're in rcuwalk, drop out of it to handle last component */
- if (nd->flags & LOOKUP_RCU) {
- if (unlazy_walk(nd))
- return -ECHILD;
- }
-
- nd->flags &= ~LOOKUP_PARENT;
-
- if (unlikely(nd->last_type != LAST_NORM)) {
- error = handle_dots(nd, nd->last_type);
- if (error)
- return error;
- path.dentry = dget(nd->path.dentry);
- } else {
- path.dentry = d_lookup(dir, &nd->last);
- if (!path.dentry) {
- /*
- * No cached dentry. Mounted dentries are pinned in the
- * cache, so that means that this dentry is probably
- * a symlink or the path doesn't actually point
- * to a mounted dentry.
- */
- path.dentry = lookup_slow(&nd->last, dir,
- nd->flags | LOOKUP_NO_REVAL);
- if (IS_ERR(path.dentry))
- return PTR_ERR(path.dentry);
- }
- }
- if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags))) {
- dput(path.dentry);
- return -ENOENT;
- }
- path.mnt = nd->path.mnt;
- return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);
-}
-
-/**
* path_mountpoint - look up a path to be umounted
* @nd: lookup context
* @flags: lookup flags
@@ -2699,14 +2631,17 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path)
int err;

while (!(err = link_path_walk(s, nd)) &&
- (err = mountpoint_last(nd)) > 0) {
+ (err = lookup_last(nd)) > 0) {
s = trailing_symlink(nd);
}
+ if (!err)
+ err = unlazy_walk(nd);
+ if (!err)
+ err = handle_lookup_down(nd);
if (!err) {
*path = nd->path;
nd->path.mnt = NULL;
nd->path.dentry = NULL;
- follow_mount(path);
}
terminate_walk(nd);
return err;
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index f64a33d2a1d1..2a82dcce5fc1 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -206,7 +206,6 @@ TRACE_DEFINE_ENUM(LOOKUP_AUTOMOUNT);
TRACE_DEFINE_ENUM(LOOKUP_PARENT);
TRACE_DEFINE_ENUM(LOOKUP_REVAL);
TRACE_DEFINE_ENUM(LOOKUP_RCU);
-TRACE_DEFINE_ENUM(LOOKUP_NO_REVAL);
TRACE_DEFINE_ENUM(LOOKUP_OPEN);
TRACE_DEFINE_ENUM(LOOKUP_CREATE);
TRACE_DEFINE_ENUM(LOOKUP_EXCL);
@@ -224,7 +223,6 @@ TRACE_DEFINE_ENUM(LOOKUP_DOWN);
{ LOOKUP_PARENT, "PARENT" }, \
{ LOOKUP_REVAL, "REVAL" }, \
{ LOOKUP_RCU, "RCU" }, \
- { LOOKUP_NO_REVAL, "NO_REVAL" }, \
{ LOOKUP_OPEN, "OPEN" }, \
{ LOOKUP_CREATE, "CREATE" }, \
{ LOOKUP_EXCL, "EXCL" }, \
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 7fe7b87a3ded..07bfb0874033 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -34,7 +34,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};

/* internal use only */
#define LOOKUP_PARENT 0x0010
-#define LOOKUP_NO_REVAL 0x0080
#define LOOKUP_JUMPED 0x1000
#define LOOKUP_ROOT 0x2000
#define LOOKUP_ROOT_GRABBED 0x0008

2020-01-12 21:41:07

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote:

> Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
> we get there it's not relevant any more, so that check looks
> redundant. I'm not aware of any other fs automount implementation
> that needs that EISDIR pass-thru function.
>
> I didn't notice it at the time of the merge, sorry about that.
>
> While we're at it that:
> if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
> return -EREMOTE;
>
> at the top of follow_automount() isn't going to be be relevant
> for autofs because ->d_automount() really must always be defined
> for it.
>
> But, at the time of the merge, I didn't object to it because
> there were (are) other file systems that use the VFS automount
> function which may accidentally not define the method.

OK...

> > Unfortunately, there are other interesting questions related to
> > autofs-specific bits (->d_manage()) and the timezone-related fun
> > is, of course, still there. I hope to sort that out today or
> > tomorrow, at least enough to do a reasonable set of backportable
> > fixes to put in front of follow_managed()/step_into() queue.
> > Oh, well...
>
> Yeah, I know it slows you down but I kink-off like having a chance

Nice typo, that ;-)

> to look at what's going and think about your questions before trying
> to answer them, rather than replying prematurely, as I usually do ...
>
> It's been a bit of a busy day so far but I'm getting to look into
> the questions you've asked.

Here's a bit more of those (I might've missed some of your replies on
IRC; my apologies if that's the case):

1) AFAICS, -EISDIR from ->d_manage() actually means "don't even try
->d_automount() here". If its effect can be delayed until the decision
to call ->d_automount(), the things seem to get simpler. Is it ever
returned in situation when the sucker _is_ overmounted?

2) can autofs_d_automount() ever be called for a daemon? Looks like it
shouldn't be...

3) is _anything_ besides root directory ever created in direct autofs
superblocks by anyone? If not, why does autofs_lookup() even bother to
do anything there? IOW, why not have it return ERR_PTR(-ENOENT) immediately
for direct ones? Or am I missing something and it is, in fact, possible
to have the daemon create something in those?

4) Symlinks look like they should qualify for parent being non-empty;
at least autofs_d_manage() seems to think so (simple_empty() use).
So shouldn't we remove the trap from its parent on symlink/restore on
unlink if parent gets empty? For version 4 or earlier, that is. Or is
it simply that daemon only creates symlinks in root directory?


Anyway, intermediate state of the series is in #work.namei right now,
and some _very_ interesting possibilities open up. It definitely
needs more massage around __follow_mount_rcu() (as it is, the
fastpath in there is still too twisted). Said that
* call graph is less convoluted
* follow_managed() calls are folded into step_into(). Interface:
int step_into(nd, flags, dentry, inode, seq), with inode/seq used only
if we are in RCU mode.
* ".." still doesn't use that; it probably ought to.
* lookup_fast() doesn't take path - nd, &inode, &seq and returns dentry
* lookup_open() and fs/namei.c:atomic_open() get similar treatment
- don't take path, return dentry.
* calls of follow_managed()/step_into() combination returning 1
are always followed by get_link(), and very shortly, at that. So much
that we can realistically merge pick_link() (in the end of
step_into()) with get_link(). That merge is NOT done in this branch yet.

The last one promises to get rid of a rather unpleasant group of calling
conventions. Right now we have several functions (step_into()/
walk_component()/lookup_last()/do_last()) with the following calling
conventions:
-E... => error
0 => non-symlink or symlink not followed; nd->path points to it
1 => picked a symlink to follow; its mount/dentry/seq has been
pushed on nd->stack[]; its inode is stashed into nd->link_inode for
subsequent get_link() to pick. nd->path is left unchanged.

That way all of those become
ERR_PTR(-E...) => error
NULL => non-symlink, symlink not followed or a pure
jump (bare "/" or procfs ones); nd->path points to where we end up
string => symlink being followed; the sucker's pushed
to stack, initial jump (if any) has been handled and the string returned
is what we need to traverse.

IMO it's less arbitrary that way. More importantly, the separation between
step_into() committing to symlink traversal and (inevitably following)
get_link() is gone - it's one operation after that change. No nd->link_inode
either - it's only needed to carry the information from pick_link() to the
next get_link().
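
To make the proposed convention concrete, here is a minimal userspace
sketch. It is not kernel code: err_ptr(), is_err() and ptr_err() are local
stand-ins for the kernel's ERR_PTR(), IS_ERR() and PTR_ERR(), while
toy_step_into() and struct toy_entry are hypothetical names invented for
illustration. It only shows how one pointer return value can carry all
three outcomes (error, nothing more to follow, symlink body to traverse).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for ERR_PTR()/IS_ERR()/PTR_ERR(); like the kernel
 * helpers they assume the top MAX_ERRNO addresses are never valid pointers. */
#define MAX_ERRNO 4095

static inline const char *err_ptr(long err)
{
	return (const char *)(intptr_t)err;
}

static inline int is_err(const char *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static inline long ptr_err(const char *p)
{
	return (long)(intptr_t)p;
}

/* Hypothetical lookup result, for illustration only. */
enum toy_kind { TOY_MISSING, TOY_PLAIN, TOY_SYMLINK };

struct toy_entry {
	enum toy_kind kind;
	const char *target;	/* symlink body, used only for TOY_SYMLINK */
};

/*
 * Returns:
 *   err_ptr(-E...)  => error
 *   NULL            => not a symlink (or not followed); nothing to traverse
 *   string          => symlink being followed; this is what to walk next
 */
static const char *toy_step_into(const struct toy_entry *e)
{
	switch (e->kind) {
	case TOY_MISSING:
		return err_ptr(-ENOENT);
	case TOY_PLAIN:
		return NULL;
	case TOY_SYMLINK:
		return e->target;
	}
	return err_ptr(-EINVAL);
}
```

A caller checks is_err() first, then NULL, and otherwise treats the result
as the next string to walk; no separate "1 => go call get_link()" state is
needed.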

Loops turn into
while (!(err = link_path_walk(nd, s)) &&
(s = lookup_last(nd)) != NULL)
;
and
while (!(err = link_path_walk(nd, s)) &&
(s = do_last(nd, file, op)) != NULL)
;

trailing_symlink() goes away (folded into pick_link()/get_link() combo,
conditional upon nd->depth at the entry). And in link_path_walk() we'll
have
if (unlikely(!*name)) {
/* pathname body, done */
if (!nd->depth)
return 0;
name = nd->stack[nd->depth - 1].name;
/* trailing symlink, done */
if (!name)
return 0;
/* last component of nested symlink */
s = walk_component(nd, WALK_FOLLOW);
} else {
/* not the last component */
s = walk_component(nd, WALK_FOLLOW | WALK_MORE);
}
if (s) {
if (IS_ERR(s))
return PTR_ERR(s);
/* a symlink to follow */
nd->stack[nd->depth - 1].name = name;
name = s;
continue;
}
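
The name-stack handling in the fragment above can be imitated in userspace.
The sketch below is a toy under loud assumptions: toy_fs, toy_lookup() and
toy_walk() are made up, the "filesystem" is a flat table keyed by full path,
symlink bodies are relative only, and the walk does not distinguish trailing
symlinks from nested ones (so there is no WALK_MORE equivalent). What it
keeps is the central idea: when a component turns out to be a symlink, the
remainder of the current string is pushed, the link body becomes the string
being walked, and the current position is left unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAXDEPTH 8
#define TOY_PATHLEN 128

/* Hypothetical in-memory "filesystem"; link == NULL means not a symlink. */
struct toy_node {
	const char *path;
	const char *link;
};

static const struct toy_node toy_fs[] = {
	{ "a", NULL },
	{ "a/b", "c/d" },	/* symlink: a/b -> c/d */
	{ "a/c", NULL },
	{ "a/c/d", "e" },	/* symlink: a/c/d -> e */
	{ "a/c/e", NULL },
	{ "a/c/e/f", NULL },
	{ "loop", "loop" },	/* symlink pointing at itself */
};

static const struct toy_node *toy_lookup(const char *path)
{
	for (size_t i = 0; i < sizeof(toy_fs) / sizeof(toy_fs[0]); i++)
		if (!strcmp(toy_fs[i].path, path))
			return &toy_fs[i];
	return NULL;
}

/* Walk 'name' from the toy root; on success 'cur' holds the final path. */
static int toy_walk(const char *name, char *cur, size_t len)
{
	const char *stack[TOY_MAXDEPTH];	/* saved remainders of outer strings */
	int depth = 0;

	cur[0] = '\0';
	for (;;) {
		char next[TOY_PATHLEN];
		const struct toy_node *n;
		size_t clen;

		while (*name == '/')
			name++;
		if (!*name) {
			if (!depth)
				return 0;	/* pathname body, done */
			name = stack[--depth];	/* resume the string that held the symlink */
			continue;
		}
		clen = strcspn(name, "/");
		if (snprintf(next, sizeof(next), "%s%s%.*s", cur,
			     *cur ? "/" : "", (int)clen, name) >= (int)sizeof(next))
			return -ENAMETOOLONG;
		name += clen;
		n = toy_lookup(next);
		if (!n)
			return -ENOENT;
		if (n->link) {
			if (depth == TOY_MAXDEPTH)
				return -ELOOP;
			stack[depth++] = name;	/* remember where to resume */
			name = n->link;		/* a symlink to follow; cur unchanged */
			continue;
		}
		if (strlen(next) >= len)
			return -ENAMETOOLONG;
		strcpy(cur, next);		/* move into the component */
	}
}
```

Walking "a/b/f" with this table goes through both symlinks and ends at
"a/c/e/f"; walking "loop" exhausts the stack and fails with -ELOOP.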

Anyway, before I try that one I'm going to fold path_openat2() into
that series - that step is definitely going to require some massage
there; it's too close to get_link() changes done in Aleksa's series.

If we do that, we get a single primitive for "here's the result of
lookup; traverse mounts and either move into the result or, if
it's a symlink that needs to be traversed, start the symlink
traversal - jump into the base position for it (if needed) and
return the pathname that needs to be handled". As it is, mainline
has that logic spread over about a dozen locations...

Diffstat at the moment:
fs/autofs/dev-ioctl.c | 6 +-
fs/internal.h | 1 -
fs/namei.c | 460 ++++++++++++++------------------------------------
fs/namespace.c | 97 +++++++----
fs/nfs/nfstrace.h | 2 -
fs/open.c | 4 +-
include/linux/namei.h | 3 +-
7 files changed, 197 insertions(+), 376 deletions(-)

In the current form the sucker appears to work (so far - about 30%
into the usual xfstests run) without visible slowdowns...

2020-01-13 01:49:46

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Fri, 2020-01-10 at 23:19 +0000, Al Viro wrote:
> On Fri, Jan 03, 2020 at 01:49:01AM +0000, Al Viro wrote:
> > On Thu, Jan 02, 2020 at 02:59:20PM +1100, Aleksa Sarai wrote:
> > > On 2020-01-01, Al Viro <[email protected]> wrote:
> > > > On Thu, Jan 02, 2020 at 01:44:07AM +1100, Aleksa Sarai wrote:
> > > >
> > > > > Thanks, this fixes the issue for me (and also fixes another
> > > > > reproducer I
> > > > > found -- mounting a symlink on top of itself then trying to
> > > > > umount it).
> > > > >
> > > > > Reported-by: Aleksa Sarai <[email protected]>
> > > > > Tested-by: Aleksa Sarai <[email protected]>
> > > >
> > > > Pushed into #fixes.
> > >
> > > Thanks. One other thing I noticed is that umount applies to the
> > > underlying symlink rather than the mountpoint on top. So, for
> > > example
> > > (using the same scripts I posted in the thread):
> > >
> > > # ln -s /tmp/foo link
> > > # ./mount_to_symlink /etc/passwd link
> > > # umount -l link # will attempt to unmount "/tmp/foo"
> > >
> > > Is that intentional?
> >
> > It's a mess, again in mountpoint_last(). FWIW, at some point I
> > proposed
> > to have nd_jump_link() to fail with -ELOOP if the target was a
> > symlink;
> > Linus asked for reasons deeper than my dislike of the semantics, I
> > looked
> > around and hadn't spotted anything. And there hadn't been at the
> > time,
> > but when four months later umount_lookup_last() went in I failed to
> > look
> > for that source of potential problems in it ;-/
>
> FWIW, since Ian appears to agree that we want ->d_manage() on the
> mount
> crossing at the end of umount(2) lookup, here's a much simpler
> solution -
> kill mountpoint_last() and switch to using lookup_last(). As a side
> benefit, LOOKUP_NO_REVAL also goes away. It's possible to trim the
> things even more (path_mountpoint() is very similar to
> path_lookupat()
> at that point, and it's not hard to make the differences conditional
> on
> something like LOOKUP_UMOUNT); I would rather do that part in the
> cleanups series - the one below is easier to backport.
>
> Aleksa, Ian - could you see if the patch below works for you?

I did try this patch and I was trying to work out why it didn't
work. But thought I'd let you know what I saw.

Applying it to current Linus tree systemd stops at switch root.

Not sure what causes that, I couldn't see any reason for it.

I see you have a development branch in your repo. I'll have a look
at that rather than continue with this.

>
> commit e56b43b971a7c08762fceab330a52b7245041dbc
> Author: Al Viro <[email protected]>
> Date: Fri Jan 10 17:17:19 2020 -0500
>
> reimplement path_mountpoint() with less magic
>
> ... and get rid of a bunch of bugs in it. Background:
> the reason for path_mountpoint() is that umount() really doesn't
> want attempts to revalidate the root of what it's trying to
> umount.
> The thing we want to avoid actually happen from complete_walk();
> solution was to do something parallel to normal path_lookupat()
> and it both went overboard and got the boilerplate subtly
> (and not so subtly) wrong.
>
> A better solution is to do pretty much what the normal
> path_lookupat()
> does, but instead of complete_walk() do unlazy_walk(). All it
> takes
> to avoid that ->d_weak_revalidate() call... mountpoint_last()
> goes
> away, along with everything it got wrong, and so does the magic
> around
> LOOKUP_NO_REVAL.
>
> Another source of bugs is that when we traverse mounts at the
> final
> location (and we need to do that - umount . expects to get
> whatever's
> overmounting ., if any, out of the lookup) we really ought to
> take
> care of ->d_manage() - as it is, manual umount of autofs
> automount
> in progress can lead to unpleasant surprises for the
> daemon. Easily
> solved by using handle_lookup_down() instead of follow_mount().
>
> Signed-off-by: Al Viro <[email protected]>
>
> diff --git a/fs/namei.c b/fs/namei.c
> index d6c91d1e88cb..1793661c3342 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1649,17 +1649,15 @@ static struct dentry *__lookup_slow(const
> struct qstr *name,
> if (IS_ERR(dentry))
> return dentry;
> if (unlikely(!d_in_lookup(dentry))) {
> - if (!(flags & LOOKUP_NO_REVAL)) {
> - int error = d_revalidate(dentry, flags);
> - if (unlikely(error <= 0)) {
> - if (!error) {
> - d_invalidate(dentry);
> - dput(dentry);
> - goto again;
> - }
> + int error = d_revalidate(dentry, flags);
> + if (unlikely(error <= 0)) {
> + if (!error) {
> + d_invalidate(dentry);
> dput(dentry);
> - dentry = ERR_PTR(error);
> + goto again;
> }
> + dput(dentry);
> + dentry = ERR_PTR(error);
> }
> } else {
> old = inode->i_op->lookup(inode, dentry, flags);
> @@ -2618,72 +2616,6 @@ int user_path_at_empty(int dfd, const char
> __user *name, unsigned flags,
> EXPORT_SYMBOL(user_path_at_empty);
>
> /**
> - * mountpoint_last - look up last component for umount
> - * @nd: pathwalk nameidata - currently pointing at parent
> directory of "last"
> - *
> - * This is a special lookup_last function just for umount. In this
> case, we
> - * need to resolve the path without doing any revalidation.
> - *
> - * The nameidata should be the result of doing a LOOKUP_PARENT
> pathwalk. Since
> - * mountpoints are always pinned in the dcache, their ancestors are
> too. Thus,
> - * in almost all cases, this lookup will be served out of the
> dcache. The only
> - * cases where it won't are if nd->last refers to a symlink or the
> path is
> - * bogus and it doesn't exist.
> - *
> - * Returns:
> - * -error: if there was an error during lookup. This includes
> -ENOENT if the
> - * lookup found a negative dentry.
> - *
> - * 0: if we successfully resolved nd->last and found it to not
> to be a
> - * symlink that needs to be followed.
> - *
> - * 1: if we successfully resolved nd->last and found it to be a
> symlink
> - * that needs to be followed.
> - */
> -static int
> -mountpoint_last(struct nameidata *nd)
> -{
> - int error = 0;
> - struct dentry *dir = nd->path.dentry;
> - struct path path;
> -
> - /* If we're in rcuwalk, drop out of it to handle last component
> */
> - if (nd->flags & LOOKUP_RCU) {
> - if (unlazy_walk(nd))
> - return -ECHILD;
> - }
> -
> - nd->flags &= ~LOOKUP_PARENT;
> -
> - if (unlikely(nd->last_type != LAST_NORM)) {
> - error = handle_dots(nd, nd->last_type);
> - if (error)
> - return error;
> - path.dentry = dget(nd->path.dentry);
> - } else {
> - path.dentry = d_lookup(dir, &nd->last);
> - if (!path.dentry) {
> - /*
> - * No cached dentry. Mounted dentries are
> pinned in the
> - * cache, so that means that this dentry is
> probably
> - * a symlink or the path doesn't actually point
> - * to a mounted dentry.
> - */
> - path.dentry = lookup_slow(&nd->last, dir,
> - nd->flags |
> LOOKUP_NO_REVAL);
> - if (IS_ERR(path.dentry))
> - return PTR_ERR(path.dentry);
> - }
> - }
> - if (d_flags_negative(smp_load_acquire(&path.dentry->d_flags)))
> {
> - dput(path.dentry);
> - return -ENOENT;
> - }
> - path.mnt = nd->path.mnt;
> - return step_into(nd, &path, 0, d_backing_inode(path.dentry),
> 0);
> -}
> -
> -/**
> * path_mountpoint - look up a path to be umounted
> * @nd: lookup context
> * @flags: lookup flags
> @@ -2699,14 +2631,17 @@ path_mountpoint(struct nameidata *nd,
> unsigned flags, struct path *path)
> int err;
>
> while (!(err = link_path_walk(s, nd)) &&
> - (err = mountpoint_last(nd)) > 0) {
> + (err = lookup_last(nd)) > 0) {
> s = trailing_symlink(nd);
> }
> + if (!err)
> + err = unlazy_walk(nd);
> + if (!err)
> + err = handle_lookup_down(nd);
> if (!err) {
> *path = nd->path;
> nd->path.mnt = NULL;
> nd->path.dentry = NULL;
> - follow_mount(path);
> }
> terminate_walk(nd);
> return err;
> diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
> index f64a33d2a1d1..2a82dcce5fc1 100644
> --- a/fs/nfs/nfstrace.h
> +++ b/fs/nfs/nfstrace.h
> @@ -206,7 +206,6 @@ TRACE_DEFINE_ENUM(LOOKUP_AUTOMOUNT);
> TRACE_DEFINE_ENUM(LOOKUP_PARENT);
> TRACE_DEFINE_ENUM(LOOKUP_REVAL);
> TRACE_DEFINE_ENUM(LOOKUP_RCU);
> -TRACE_DEFINE_ENUM(LOOKUP_NO_REVAL);
> TRACE_DEFINE_ENUM(LOOKUP_OPEN);
> TRACE_DEFINE_ENUM(LOOKUP_CREATE);
> TRACE_DEFINE_ENUM(LOOKUP_EXCL);
> @@ -224,7 +223,6 @@ TRACE_DEFINE_ENUM(LOOKUP_DOWN);
> { LOOKUP_PARENT, "PARENT" }, \
> { LOOKUP_REVAL, "REVAL" }, \
> { LOOKUP_RCU, "RCU" }, \
> - { LOOKUP_NO_REVAL, "NO_REVAL" }, \
> { LOOKUP_OPEN, "OPEN" }, \
> { LOOKUP_CREATE, "CREATE" }, \
> { LOOKUP_EXCL, "EXCL" }, \
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 7fe7b87a3ded..07bfb0874033 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -34,7 +34,6 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT,
> LAST_BIND};
>
> /* internal use only */
> #define LOOKUP_PARENT 0x0010
> -#define LOOKUP_NO_REVAL 0x0080
> #define LOOKUP_JUMPED 0x1000
> #define LOOKUP_ROOT 0x2000
> #define LOOKUP_ROOT_GRABBED 0x0008

2020-01-13 03:01:59

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Sun, 2020-01-12 at 21:33 +0000, Al Viro wrote:
> On Fri, Jan 10, 2020 at 02:20:55PM +0800, Ian Kent wrote:
>
> > Yeah, autofs ->d_automount() doesn't return -EISDIR, by the time
> > we get there it's not relevant any more, so that check looks
> > redundant. I'm not aware of any other fs automount implementation
> > that needs that EISDIR pass-thru function.
> >
> > I didn't notice it at the time of the merge, sorry about that.
> >
> > While we're at it that:
> > if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
> > return -EREMOTE;
> >
> > at the top of follow_automount() isn't going to be relevant
> > for autofs because ->d_automount() really must always be defined
> > for it.
> >
> > But, at the time of the merge, I didn't object to it because
> > there were (are) other file systems that use the VFS automount
> > function which may accidentally not define the method.
>
> OK...
>
> > > Unfortunately, there are other interesting questions related to
> > > autofs-specific bits (->d_manage()) and the timezone-related fun
> > > is, of course, still there. I hope to sort that out today or
> > > tomorrow, at least enough to do a reasonable set of backportable
> > > fixes to put in front of follow_managed()/step_into() queue.
> > > Oh, well...
> >
> > Yeah, I know it slows you down but I kink-off like having a chance
>
> Nice typo, that ;-)
>
> > to look at what's going and think about your questions before
> > trying
> > to answer them, rather than replying prematurely, as I usually do
> > ...
> >
> > It's been a bit of a busy day so far but I'm getting to look into
> > the questions you've asked.
>
> Here's a bit more of those (I might've missed some of your replies on
> IRC; my apologies if that's the case):
>
> 1) AFAICS, -EISDIR from ->d_manage() actually means "don't even try
> ->d_automount() here". If its effect can be delayed until the
> decision
> to call ->d_automount(), the things seem to get simpler. Is it ever
> returned in situation when the sucker _is_ overmounted?

In theory it shouldn't need to be returned when there is an
actual mount there.

If there is a real mount at this point that should be enough to
prevent walks into that mount until its mount is complete.

The whole idea of -EISDIR is to prevent processes from walking
into a directory tree that "doesn't have a real mount at its
base" (the so called multi-mount map construct).

>
> 2) can autofs_d_automount() ever be called for a daemon? Looks like
> it
> shouldn't be...

Can't do that, it will lead to deadlock very quickly.

>
> 3) is _anything_ besides root directory ever created in direct autofs
> superblocks by anyone? If not, why does autofs_lookup() even bother
> to
> do anything there? IOW, why not have it return ERR_PTR(-ENOENT)
> immediately
> for direct ones? Or am I missing something and it is, in fact,
> possible
> to have the daemon create something in those?

Short answer is no, longer answer is directories "shouldn't" ever
be created inside direct mount points.

The thing is that the multi-mount map construct can be used with
direct mounts too, but they must always have a real mount at the
base because they are direct mounts. So processes should not be
able to walk into them while they are being mounted (constructed).

But I'm pretty sure it's rare (maybe not done at all) that this
map construct is used with direct mounts.

>
> 4) Symlinks look like they should qualify for parent being non-empty;
> at least autofs_d_manage() seems to think so (simple_empty() use).
> So shouldn't we remove the trap from its parent on symlink/restore on
> unlink if parent gets empty? For version 4 or earlier, that is. Or
> is
> it simply that daemon only creates symlinks in root directory?

Yes, they have to be empty.

If a symlink is to be used (based on autofs config or map option)
and the "browse" option is used for the indirect mount (browse
only makes sense for indirect autofs managed mounts) then the
mount point directory has to be removed and a symlink created
so it must be empty for this to make sense.

If it's a "nobrowse" autofs mount then nothing should already
exist, it just gets created.

The catch is that a map entry for which a symlink is to be used
instead of a mount can't be a multi-mount. I'm pretty sure I don't
have sufficient error checking for that in the daemon but I also
haven't had reports of problems with it either.

For a very long time the use of symlinks was not common but when
the amd format map parser was added it made sense to use symlinks
in some cases for those. That was partly to reduce the number of
mounts needed and because I deliberately don't support amd map
entries that provide the multi-mount construct. The way amd did
this looked ugly to me, very much a hack to add a Sun format
mount feature.

As far as keeping the trap flags up to date, I don't.

It seemed so much simpler to just leave the flags in place but,
at that time, symlinks were not used (although it was possible to
do so). Now that's changed, fiddling with the flags might make
sense.

As I said on IRC:
"DCACHE_NEED_AUTOMOUNT is set on symlink dentries because, when
->lookup() is called the dentry may trigger a callback to the
daemon that will either create a directory (since, in this case,
one does not already exist) and attempt to mount on it or create
a symlink if the autofs config/map requires it.

I didn't think there would be potential simplification by setting
and clearing the DCACHE_NEED_AUTOMOUNT flag based on it being a
directory (mountpoint) or a symlink so the flag is always left set.
Although, as you point out, symlinks won't actually trigger mounts
so the flag being left set when the dentry is a symlink is due to
laziness, since there's nothing to gain. If you can see potential
simplification in the VFS code by managing this flag better then
that would be worthwhile."

>
>
> Anyway, intermediate state of the series is in #work.namei right now,
> and some _very_ interesting possibilities open up. It definitely
> needs more massage around __follow_mount_rcu() (as it is, the
> fastpath in there is still too twisted). Said that
> * call graph is less convoluted
> * follow_managed() calls are folded into
> step_into(). Interface:
> int step_into(nd, flags, dentry, inode, seq), with inode/seq used
> only
> if we are in RCU mode.
> * ".." still doesn't use that; it probably ought to.
> * lookup_fast() doesn't take path - nd, &inode, &seq and
> returns dentry
> * lookup_open() and fs/namei.c:atomic_open() get similar
> treatment
> - don't take path, return dentry.
> * calls of follow_managed()/step_into() combination returning 1
> are always followed by get_link(), and very shortly, at that. So
> much
> that we can realistically merge pick_link() (in the end of
> step_into()) with get_link(). That merge is NOT done in this branch
> yet.
>
> The last one promises to get rid of a rather unpleasant group of
> calling
> conventions. Right now we have several functions (step_into()/
> walk_component()/lookup_last()/do_last()) with the following calling
> conventions:
> -E... => error
> 0 => non-symlink or symlink not followed; nd->path points to it
> 1 => picked a symlink to follow; its mount/dentry/seq has been
> pushed on nd->stack[]; its inode is stashed into nd->link_inode for
> subsequent get_link() to pick. nd->path is left unchanged.
>
> That way all of those become
> ERR_PTR(-E...) => error
> NULL => non-symlink, symlink not followed or a
> pure
> jump (bare "/" or procfs ones); nd->path points to where we end up
> string => symlink being followed; the sucker's
> pushed
> to stack, initial jump (if any) has been handled and the string
> returned
> is what we need to traverse.
>
> IMO it's less arbitrary that way. More importantly, the separation
> between
> step_into() committing to symlink traversal and (inevitably
> following)
> get_link() is gone - it's one operation after that change. No
> nd->link_inode either - it's only needed to carry the information
> from pick_link() to the next get_link().
>
> Loops turn into
> while (!(err = link_path_walk(nd, s)) &&
> (s = lookup_last(nd)) != NULL)
> ;
> and
> while (!(err = link_path_walk(nd, s)) &&
> (s = do_last(nd, file, op)) != NULL)
> ;
>
> trailing_symlink() goes away (folded into pick_link()/get_link()
> combo,
> conditional upon nd->depth at the entry). And in link_path_walk()
> we'll
> have
> if (unlikely(!*name)) {
> /* pathname body, done */
> if (!nd->depth)
> return 0;
> name = nd->stack[nd->depth - 1].name;
> /* trailing symlink, done */
> if (!name)
> return 0;
> /* last component of nested symlink */
> s = walk_component(nd, WALK_FOLLOW);
> } else {
> /* not the last component */
> s = walk_component(nd, WALK_FOLLOW |
> WALK_MORE);
> }
> if (s) {
> if (IS_ERR(s))
> return PTR_ERR(s);
> /* a symlink to follow */
> nd->stack[nd->depth - 1].name = name;
> name = s;
> continue;
> }
>
> Anyway, before I try that one I'm going to fold path_openat2() into
> that series - that step is definitely going to require some massage
> there; it's too close to get_link() changes done in Aleksa's series.
>
> If we do that, we get a single primitive for "here's the result of
> lookup; traverse mounts and either move into the result or, if
> it's a symlink that needs to be traversed, start the symlink
> traversal - jump into the base position for it (if needed) and
> return the pathname that needs to be handled". As it is, mainline
> has that logic spread over about a dozen locations...
>
> Diffstat at the moment:
> fs/autofs/dev-ioctl.c | 6 +-
> fs/internal.h | 1 -
> fs/namei.c | 460 ++++++++++++++------------------------
> ------------
> fs/namespace.c | 97 +++++++----
> fs/nfs/nfstrace.h | 2 -
> fs/open.c | 4 +-
> include/linux/namei.h | 3 +-
> 7 files changed, 197 insertions(+), 376 deletions(-)
>
> In the current form the sucker appears to work (so far - about 30%
> into the usual xfstests run) without visible slowdowns...

Ok, I'll have a look at that branch, ;)

Ian

2020-01-13 03:55:30

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote:

> I did try this patch and I was trying to work out why it didn't
> work. But thought I'd let you know what I saw.
>
> Applying it to current Linus tree systemd stops at switch root.
>
> Not sure what causes that, I couldn't see any reason for it.

Wait a minute... So you are seeing problems early in the boot,
before any autofs ioctls might come into play?

Sigh... Guess I'll have to dig that Fedora KVM image out and
try to see what it's about... ;-/ Here comes a couple of hours
of build...

2020-01-13 06:07:03

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Mon, 2020-01-13 at 03:54 +0000, Al Viro wrote:
> On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote:
>
> > I did try this patch and I was trying to work out why it didn't
> > work. But thought I'd let you know what I saw.
> >
> > Applying it to current Linus tree systemd stops at switch root.
> >
> > Not sure what causes that, I couldn't see any reason for it.
>
> Wait a minute... So you are seeing problems early in the boot,
> before any autofs ioctls might come into play?

I did, then I checked it booted without the patch, then tried
building from scratch with the patch twice and same thing
happened each time.

Looked like this, such as it is:
[ OK ] Reached target Switch Root.
[ OK ] Started Plymouth switch root service.
Starting Switch Root...

I don't have any evidence but thought it might be this:
https://github.com/karelzak/util-linux/blob/master/sys-utils/switch_root.c

Mind you, that's not the actual systemd repo either. I probably
need to look a lot deeper (and at the actual systemd repo) to
work out what's actually being called.

>
> Sigh... Guess I'll have to dig that Fedora KVM image out and
> try to see what it's about... ;-/ Here comes a couple of hours
> of build...

2020-01-13 06:13:28

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Mon, 2020-01-13 at 14:00 +0800, Ian Kent wrote:
> On Mon, 2020-01-13 at 03:54 +0000, Al Viro wrote:
> > On Mon, Jan 13, 2020 at 09:48:23AM +0800, Ian Kent wrote:
> >
> > > I did try this patch and I was trying to work out why it didn't
> > > work. But thought I'd let you know what I saw.
> > >
> > > Applying it to current Linus tree systemd stops at switch root.
> > >
> > > Not sure what causes that, I couldn't see any reason for it.
> >
> > Wait a minute... So you are seeing problems early in the boot,
> > before any autofs ioctls might come into play?
>
> I did, then I checked it booted without the patch, then tried
> building from scratch with the patch twice and same thing
> happened each time.
>
> Looked like this, such as it is:
> [ OK ] Reached target Switch Root.
> [ OK ] Started Plymouth switch root service.
> Starting Switch Root...
>
> I don't have any evidence but thought it might be this:
> https://github.com/karelzak/util-linux/blob/master/sys-utils/switch_root.c

Oh wait, for systemd I was actually looking at:
https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c

>
> Mind you, that's not the actual systemd repo either. I probably
> need to look a lot deeper (and at the actual systemd repo) to
> work out what's actually being called.
>
> > Sigh... Guess I'll have to dig that Fedora KVM image out and
> > try to see what it's about... ;-/ Here comes a couple of hours
> > of build...

2020-01-13 13:32:10

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote:

> Oh wait, for systemd I was actually looking at:
> https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
>
> >
> > Mind you, that's not the actual systemd repo either. I probably
> > need to look a lot deeper (and at the actual systemd repo) to
> > work out what's actually being called.
> >
> > > Sigh... Guess I'll have to dig that Fedora KVM image out and
> > > try to see what it's about... ;-/ Here comes a couple of hours
> > > of build...

D'oh... And yes, that would've been a bisect hazard - switch to
path_lookupat() later in the series gets rid of that. Incremental
(to be folded, of course):

diff --git a/fs/namei.c b/fs/namei.c
index 1793661c3342..204677c37751 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path)
(err = lookup_last(nd)) > 0) {
s = trailing_symlink(nd);
}
- if (!err)
+ if (!err && (nd->flags & LOOKUP_RCU))
err = unlazy_walk(nd);
if (!err)
err = handle_lookup_down(nd);
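
A toy model of why the guard matters, assuming (as the removed
mountpoint_last() did with its own LOOKUP_RCU test before calling
unlazy_walk()) that unlazy_walk() is only meant to be called while the walk
is still in RCU mode. By the time the loop finishes, the walk may or may not
have left RCU mode already. Everything prefixed toy_ or TOY_ is hypothetical,
the flag value is arbitrary, and the assert() merely stands in for whatever
sanity check the kernel applies; this is a sketch, not the kernel's code.

```c
#include <assert.h>

#define TOY_LOOKUP_RCU 0x0040	/* arbitrary value, illustration only */

struct toy_nameidata {
	unsigned int flags;
};

/* Models only the precondition: may be called in RCU mode, and leaves it. */
static int toy_unlazy_walk(struct toy_nameidata *nd)
{
	assert(nd->flags & TOY_LOOKUP_RCU);	/* calling this outside RCU mode is a bug */
	nd->flags &= ~TOY_LOOKUP_RCU;
	return 0;
}

/* Tail of a lookup: drop out of RCU mode only if we are still in it. */
static int toy_finish_lookup(struct toy_nameidata *nd, int err)
{
	if (!err && (nd->flags & TOY_LOOKUP_RCU))
		err = toy_unlazy_walk(nd);
	return err;
}
```

Without the flag test, the second case in the usage below (a walk that has
already left RCU mode) would trip the assertion instead of returning 0.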

2020-01-14 00:28:40

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Mon, 2020-01-13 at 10:59 +0800, Ian Kent wrote:
>
> > 3) is _anything_ besides root directory ever created in direct
> > autofs
> > superblocks by anyone? If not, why does autofs_lookup() even
> > bother
> > to
> > do anything there? IOW, why not have it return ERR_PTR(-ENOENT)
> > immediately
> > for direct ones? Or am I missing something and it is, in fact,
> > possible
> > to have the daemon create something in those?
>
> Short answer is no, longer answer is directories "shouldn't" ever
> be created inside direct mount points.
>
> The thing is that the multi-mount map construct can be used with
> direct mounts too, but they must always have a real mount at the
> base because they are direct mounts. So processes should not be
> able to walk into them while they are being mounted (constructed).
>
> But I'm pretty sure it's rare (maybe not done at all) that this
> map construct is used with direct mounts.

This isn't right.

There's actually nothing stopping a user from using a direct map
entry that's a multi-mount without an actual mount at its root.
So there could be directories created under these, it's just not
usually done.

I'm pretty sure I don't check and disallow this.

Ian

2020-01-14 04:40:47

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:

> This isn't right.
>
> There's actually nothing stopping a user from using a direct map
> entry that's a multi-mount without an actual mount at its root.
> So there could be directories created under these, it's just not
> usually done.
>
> I'm pretty sure I don't check and disallow this.

IDGI... How the hell will that work in v5? Who will set _any_
traps outside the one in root in that scenario? autofs_lookup()
won't (there it's conditional upon indirect mount). Neither
will autofs_dir_mkdir() (conditional upon version being less
than 5). Who will, then?

Confused...

2020-01-14 05:02:52

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote:
> On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
>
> > This isn't right.
> >
> > There's actually nothing stopping a user from using a direct map
> > entry that's a multi-mount without an actual mount at its root.
> > So there could be directories created under these, it's just not
> > usually done.
> >
> > I'm pretty sure I don't check and disallow this.
>
> IDGI... How the hell will that work in v5? Who will set _any_
> traps outside the one in root in that scenario? autofs_lookup()
> won't (there it's conditional upon indirect mount). Neither
> will autofs_dir_mkdir() (conditional upon version being less
> than 5). Who will, then?
>
> Confused...

It's easy to miss.

For autofs type direct and offset mounts the flags are set at fill
super time.

They have to be set then because they are direct mounts and offset
mounts behave the same as direct mounts so they need to be set then
too. So, like direct mounts, offset mounts are each distinct autofs
(trigger) mounts.

I could check for this construct and refuse it if that's really
needed. I'm pretty sure this map construct isn't much used by
people using direct mounts.

Ian

2020-01-14 05:06:28

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:

> If I'm understanding this proposal correctly, this would be a problem
> for the libpathrs use-case -- if this is done then there's no way to
> avoid a TOCTOU with someone mounting and the userspace program checking
> whether something is a mountpoint (unless you have Linux >5.6 and
> RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
>
> 1. Open the candidate directory.
> 2. umount2(MNT_EXPIRE) the fd.
> * -EINVAL means it wasn't a mountpoint when we got the fd, and the
> fd is a stable handle to the underlying directory.
> * -EAGAIN or -EBUSY means that it was a mountpoint or became a
> mountpoint after the fd was opened (we don't care about that, but
> fail-safe is better here).
> 3. Use the fd from (1) for all operations.

... except that foo/../bar *WILL* cross into the covering mount, on any
kernel that supports ...at(2) at all, so I would be very cautious about
any kind of "hardening" claims in that case.
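
For reference, the MNT_EXPIRE probe from the quoted list could look roughly like the C below. This is only a sketch under stated assumptions: the enum and both function names are made up for illustration, umount2() takes a path rather than an fd (so the fd is named through /proc/self/fd, assuming procfs is mounted at /proc), and the caller needs CAP_SYS_ADMIN. EINVAL can also have other causes, and nothing here addresses the foo/../bar crossing described above.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

/* Hypothetical result type for this sketch; not part of any real API. */
enum mnt_probe {
	PROBE_NOT_MOUNTPOINT,   /* EINVAL: nothing was mounted on the fd's directory */
	PROBE_MAYBE_MOUNTPOINT, /* EAGAIN, EBUSY or an actual unmount: fail safe */
	PROBE_ERROR             /* anything else, e.g. EPERM without CAP_SYS_ADMIN */
};

/* Map the errno left by umount2(..., MNT_EXPIRE) onto the cases in the list. */
enum mnt_probe classify_expire_errno(int err)
{
	switch (err) {
	case EINVAL:
		return PROBE_NOT_MOUNTPOINT;
	case 0:      /* the call succeeded, so something really was unmounted */
	case EAGAIN:
	case EBUSY:
		return PROBE_MAYBE_MOUNTPOINT;
	default:
		return PROBE_ERROR;
	}
}

/* Step 2 of the list: probe an already-opened directory fd. */
enum mnt_probe probe_fd_is_mountpoint(int fd)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	if (umount2(path, MNT_EXPIRE) == 0)
		return classify_expire_errno(0);
	return classify_expire_errno(errno);
}
```

Only PROBE_NOT_MOUNTPOINT should be treated as "safe to keep using the fd"; every other outcome is a reason to bail out.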

I'm not sure about Linus' proposal - it looks rather convoluted and we
get a hard to describe twist of semantics in an area (procfs symlinks
vs. mount traversal) on top of everything else in there...

Anyway, a couple of questions:

1) do you see any problems on your testcases with the current #fixes?
That's commit 7a955b7363b8 as branch tip.

2) do you have any updates you would like to fold into stuff in
#work.openat2? Right now I have a local variant of #work.namei (with
fairly cosmetic change compared to vfs.git one) that merges clean
with #work.openat2; I would like to do any updates/fold-ins/etc.
of #work.openat2 *before* doing a merge and continuing to work on
top of the merge results...

2020-01-14 05:14:16

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Tue, Jan 14, 2020 at 04:57:33AM +0000, Al Viro wrote:
> On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:
>
> > If I'm understanding this proposal correctly, this would be a problem
> > for the libpathrs use-case -- if this is done then there's no way to
> > avoid a TOCTOU with someone mounting and the userspace program checking
> > whether something is a mountpoint (unless you have Linux >5.6 and
> > RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
> >
> > 1. Open the candidate directory.
> > 2. umount2(MNT_EXPIRE) the fd.
> > * -EINVAL means it wasn't a mountpoint when we got the fd, and the
> > fd is a stable handle to the underlying directory.
> > * -EAGAIN or -EBUSY means that it was a mountpoint or became a
> > mountpoint after the fd was opened (we don't care about that, but
> > fail-safe is better here).
> > 3. Use the fd from (1) for all operations.
>
> ... except that foo/../bar *WILL* cross into the covering mount, on any
> kernel that supports ...at(2) at all, so I would be very cautious about
> any kind of "hardening" claims in that case.
>
> I'm not sure about Linus' proposal - it looks rather convoluted and we
> get a hard to describe twist of semantics in an area (procfs symlinks
> vs. mount traversal) on top of everything else in there...

PS: one thing that might be interesting is exposing LOOKUP_DOWN via
AT_... flag - it would allow to request mount traversals at the starting
point explicitly. Pretty much all code needed for that is already there;
all it would take is checking the flag in path_openat() and path_parentat()
and having handle_lookup_down() called there, same as in path_lookupat().

A tricky question is whether such flag should affect absolute symlinks -
i.e.

chdir /foo
ln -s /bar barf
overmount /
do lookup with that flag for /bar/splat
do lookup with that flag for barf/splat

Do we want the same results in both calls? The first one would
traverse mounts on / and walk into /bar/splat in overmounting;
the second - see no mounts whatsoever on current directory (/foo
in old root), see the symlink to "/bar", jump to process' root
and proceed from there, first for "bar", then "splat" in it...

2020-01-14 06:01:07

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Tue, 2020-01-14 at 13:01 +0800, Ian Kent wrote:
> On Tue, 2020-01-14 at 04:39 +0000, Al Viro wrote:
> > On Tue, Jan 14, 2020 at 08:25:19AM +0800, Ian Kent wrote:
> >
> > > This isn't right.
> > >
> > > There's actually nothing stopping a user from using a direct map
> > > entry that's a multi-mount without an actual mount at its root.
> > > So there could be directories created under these, it's just not
> > > usually done.
> > >
> > > I'm pretty sure I don't check and disallow this.
> >
> > IDGI... How the hell will that work in v5? Who will set _any_
> > traps outside the one in root in that scenario? autofs_lookup()
> > won't (there it's conditional upon indirect mount). Neither
> > will autofs_dir_mkdir() (conditional upon version being less
> > than 5). Who will, then?
> >
> > Confused...
>
> It's easy to miss.
>
> For autofs type direct and offset mounts the flags are set at fill
> super time.
>
> They have to be set then because they are direct mounts and offset
> mounts behave the same as direct mounts so they need to be set then
> too. So, like direct mounts, offset mounts are each distinct autofs
> (trigger) mounts.
>
> I could check for this construct and refuse it if that's really
> needed. I'm pretty sure this map construct isn't much used by
> people using direct mounts.

Ok, once again I'm not exactly accurate in some of what I said.

It turns out that the autofs connectathon tests, one of the tests
that I use, does test direct mounts with offsets both with and
without a real mount at the base of the mount.

Based on that, I have to say this map construct is meant to be
supported with Sun format maps of autofs (even though I think it's
probably not used much).

So not allowing it is probably the wrong thing to do.

OTOH initial testing with the #work.namei branch shows these are
functioning as required.

Ian

2020-01-14 07:26:58

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Mon, 2020-01-13 at 13:30 +0000, Al Viro wrote:
> On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote:
>
> > Oh wait, for systemd I was actually looking at:
> > https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
> >
> > > Mind you, that's not the actual systemd repo either. I probably
> > > need to look a lot deeper (and at the actual systemd repo) to
> > > work out what's actually being called.
> > >
> > > > Sigh... Guess I'll have to dig that Fedora KVM image out and
> > > > try to see what it's about... ;-/ Here comes a couple of hours
> > > > of build...
>
> D'oh... And yes, that would've been a bisect hazard - switch to
> path_lookupat() later in the series gets rid of that. Incremental
> (to be folded, of course):
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 1793661c3342..204677c37751 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path)
> (err = lookup_last(nd)) > 0) {
> s = trailing_symlink(nd);
> }
> - if (!err)
> + if (!err && (nd->flags & LOOKUP_RCU))
> err = unlazy_walk(nd);
> if (!err)
> err = handle_lookup_down(nd);

Ok, so I've tested with the updated patch.

The autofs connectathon tests I use function fine.

I also tested sending a SIGKILL to the daemon with about 180 active
mounts and restarted the daemon to test the function of the ioctls
that Al was concerned about.

While the connectathon test expired everything I had 3 mounts left
after allowing sufficient expire time with the SIGKILL test.

Those mounts correspond to one map entry that has a mix of NFS
vers=3 and vers=2 mount options and NFSv2 isn't supported by the
servers I use in testing.

I'm inclined to think this is a bug in the automount mount tree
re-connection code rather than a problem with this patch since
all the other mounts, some simple and others with not so simple
constructs, expired fine after automount re-connected to them.

There are two other map entries that have an NFS vers=2 option but
they are simple mounts that will fail on attempting the automount
because the server doesn't support v2 so they don't end up with
mounts to reconnect to.

This particular map entry, having a mix of NFS vers=3 and vers=2
in the offsets of the entry, will lead to a partial mount of the
map entry which is probably not being handled properly by automount
when re-connecting to the mounts in the tree.

So I think the patch here is fine from an autofs POV.

Ian

2020-01-14 12:18:29

by Ian Kent

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Tue, 2020-01-14 at 15:25 +0800, Ian Kent wrote:
> On Mon, 2020-01-13 at 13:30 +0000, Al Viro wrote:
> > On Mon, Jan 13, 2020 at 02:03:00PM +0800, Ian Kent wrote:
> >
> > > Oh wait, for systemd I was actually looking at:
> > > https://github.com/systemd/systemd/blob/master/src/shared/switch-root.c
> > >
> > > > Mind you, that's not the actual systemd repo either. I probably
> > > > need to look a lot deeper (and at the actual systemd repo) to
> > > > work out what's actually being called.
> > > >
> > > > > Sigh... Guess I'll have to dig that Fedora KVM image out and
> > > > > try to see what it's about... ;-/ Here comes a couple of hours
> > > > > of build...
> >
> > D'oh... And yes, that would've been a bisect hazard - switch to
> > path_lookupat() later in the series gets rid of that. Incremental
> > (to be folded, of course):
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 1793661c3342..204677c37751 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2634,7 +2634,7 @@ path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path)
> > (err = lookup_last(nd)) > 0) {
> > s = trailing_symlink(nd);
> > }
> > - if (!err)
> > + if (!err && (nd->flags & LOOKUP_RCU))
> > err = unlazy_walk(nd);
> > if (!err)
> > err = handle_lookup_down(nd);
>
> Ok, so I've tested with the updated patch.
>
> The autofs connectathon tests I use function fine.
>
> I also tested sending a SIGKILL to the daemon with about 180 active
> mounts and restarted the daemon to test the function of the ioctls
> that Al was concerned about.
>
> While the connectathon test expired everything I had 3 mounts left
> after allowing sufficient expire time with the SIGKILL test.
>
> Those mounts correspond to one map entry that has a mix of NFS
> vers=3 and vers=2 mount options and NFSv2 isn't supported by the
> servers I use in testing.
>
> I'm inclined to think this is a bug in the automount mount tree
> re-connection code rather than a problem with this patch since
> all the other mounts, some simple and others with not so simple
> constructs, expired fine after automount re-connected to them.
>
> There are two other map entries that have an NFS vers=2 option but
> they are simple mounts that will fail on attempting the automount
> because the server doesn't support v2 so they don't end up with
> mounts to reconnect to.
>
> This particular map entry, having a mix of NFS vers=3 and vers=2
> in the offsets of the entry, will lead to a partial mount of the
> map entry which is probably not being handled properly by automount
> when re-connecting to the mounts in the tree.
>
> So I think the patch here is fine from an autofs POV.

Umm ... unfortunately further testing shows an autofs problem.

It appears to be present in the current kernel (so far I've only
been able to check the current git head and an earlier kernel
but can't remember the version and can't check) so I must have
missed it.

I'm attempting to bisect now but managed to trash the root
file system on my VM. I'll get this done as quickly as I can.

Ian

2020-01-14 20:04:52

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2020-01-14, Al Viro <[email protected]> wrote:
> On Sat, Jan 11, 2020 at 08:07:19AM +1100, Aleksa Sarai wrote:
>
> > If I'm understanding this proposal correctly, this would be a problem
> > for the libpathrs use-case -- if this is done then there's no way to
> > avoid a TOCTOU with someone mounting and the userspace program checking
> > whether something is a mountpoint (unless you have Linux >5.6 and
> > RESOLVE_NO_XDEV). Today, you can (in theory) do it with MNT_EXPIRE:
> >
> > 1. Open the candidate directory.
> > 2. umount2(MNT_EXPIRE) the fd.
> > * -EINVAL means it wasn't a mountpoint when we got the fd, and the
> > fd is a stable handle to the underlying directory.
> > * -EAGAIN or -EBUSY means that it was a mountpoint or became a
> > mountpoint after the fd was opened (we don't care about that, but
> > fail-safe is better here).
> > 3. Use the fd from (1) for all operations.
>
> ... except that foo/../bar *WILL* cross into the covering mount, on any
> kernel that supports ...at(2) at all, so I would be very cautious about
> any kind of "hardening" claims in that case.

In the use-case I have, we would have full control over what the path
being opened is (and thus you wouldn't open "foo/../bar"). But I agree
that generally the MNT_EXPIRE solution is really non-ideal anyway.

Not to mention that we're still screwed when it comes to using
magic-links (because if someone bind-mounts a magic-link over a
magic-link there's absolutely no race-free way to be sure that we're
traversing the right magic-link -- for that we'll need to have a
different solution).

> I'm not sure about Linus' proposal - it looks rather convoluted and we
> get a hard to describe twist of semantics in an area (procfs symlinks
> vs. mount traversal) on top of everything else in there...

Yeah, I agree.

> 1) do you see any problems on your testcases with the current #fixes?
> That's commit 7a955b7363b8 as branch tip.

I will take a quick look later today, but I'm currently at a conference.

> 2) do you have any updates you would like to fold into stuff in
> #work.openat2? Right now I have a local variant of #work.namei (with
> fairly cosmetic change compared to vfs.git one) that merges clean
> with #work.openat2; I would like to do any updates/fold-ins/etc.
> of #work.openat2 *before* doing a merge and continuing to work on
> top of the merge results...

Yes, there were two patches I sent a while ago[1]. I can re-send them if
you like. The second patch switches open_how->mode to a u64, but I'm
still on the fence about whether that makes sense to do...

[1]: https://lore.kernel.org/lkml/[email protected]/

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-01-15 13:58:37

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2020-01-14, Al Viro <[email protected]> wrote:
> 1) do you see any problems on your testcases with the current #fixes?
> That's commit 7a955b7363b8 as branch tip.

I just finished testing the few cases I reported earlier and they both
appear to be fixed with the current #work.namei branch. And I don't have
any troubles booting whatsoever.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-01-15 14:26:52

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:

> Yes, there were two patches I sent a while ago[1]. I can re-send them if
> you like. The second patch switches open_how->mode to a u64, but I'm
> still on the fence about whether that makes sense to do...

IMO plain __u64 is better than games with __aligned_u64 - all sizes are
fixed, so...

> [1]: https://lore.kernel.org/lkml/[email protected]/

Do you want that series folded into "open: introduce openat2(2) syscall"
and "selftests: add openat2(2) selftests" or would you rather have them
appended at the end of the series? Personally I'd go for "fold them in"
if it had been about my code, but it's really up to you.

2020-01-15 14:30:46

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2020-01-15, Al Viro <[email protected]> wrote:
> On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
>
> > Yes, there were two patches I sent a while ago[1]. I can re-send them if
> > you like. The second patch switches open_how->mode to a u64, but I'm
> > still on the fence about whether that makes sense to do...
>
> IMO plain __u64 is better than games with __aligned_u64 - all sizes are
> fixed, so...
>
> > [1]: https://lore.kernel.org/lkml/[email protected]/
>
> Do you want that series folded into "open: introduce openat2(2) syscall"
> and "selftests: add openat2(2) selftests" or would you rather have them
> appended at the end of the series. Personally I'd go for "fold them in"
> if it had been about my code, but it's really up to you.

"fold them in" would probably be better to avoid making the mainline
history confusing afterwards. Thanks.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-01-15 14:37:37

by Aleksa Sarai

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On 2020-01-16, Aleksa Sarai <[email protected]> wrote:
> On 2020-01-15, Al Viro <[email protected]> wrote:
> > On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
> >
> > > Yes, there were two patches I sent a while ago[1]. I can re-send them if
> > > you like. The second patch switches open_how->mode to a u64, but I'm
> > > still on the fence about whether that makes sense to do...
> >
> > IMO plain __u64 is better than games with __aligned_u64 - all sizes are
> > fixed, so...
> >
> > > [1]: https://lore.kernel.org/lkml/[email protected]/
> >
> > Do you want that series folded into "open: introduce openat2(2) syscall"
> > and "selftests: add openat2(2) selftests" or would you rather have them
> > appended at the end of the series. Personally I'd go for "fold them in"
> > if it had been about my code, but it's really up to you.
>
> "fold them in" would probably be better to avoid making the mainline
> history confusing afterwards. Thanks.

Also (if you prefer) I can send a v3 which uses u64s rather than
aligned_u64s.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-01-15 14:49:51

by Al Viro

Subject: Re: [PATCH RFC 0/1] mount: universally disallow mounting over symlinks

On Thu, Jan 16, 2020 at 01:34:59AM +1100, Aleksa Sarai wrote:
> On 2020-01-16, Aleksa Sarai <[email protected]> wrote:
> > On 2020-01-15, Al Viro <[email protected]> wrote:
> > > On Wed, Jan 15, 2020 at 07:01:50AM +1100, Aleksa Sarai wrote:
> > >
> > > > Yes, there were two patches I sent a while ago[1]. I can re-send them if
> > > > you like. The second patch switches open_how->mode to a u64, but I'm
> > > > still on the fence about whether that makes sense to do...
> > >
> > > IMO plain __u64 is better than games with __aligned_u64 - all sizes are
> > > fixed, so...
> > >
> > > > [1]: https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Do you want that series folded into "open: introduce openat2(2) syscall"
> > > and "selftests: add openat2(2) selftests" or would you rather have them
> > > appended at the end of the series. Personally I'd go for "fold them in"
> > > if it had been about my code, but it's really up to you.
> >
> > "fold them in" would probably be better to avoid making the mainline
> > history confusing afterwards. Thanks.
>
> Also (if you prefer) I can send a v3 which uses u64s rather than
> aligned_u64s.

<mode "lazy bastard">
Could you fold and resend the results of folding (i.e. replacements
for two commits in question)?
</mode>

The hard part is, of course, in updating commit messages ;-)

2020-01-18 12:09:37

by Aleksa Sarai

Subject: [PATCH v3 0/2] openat2: minor uapi cleanups

Patch changelog:
v3:
* Merge changes into the original patches to make Al's life easier.
[Al Viro]
v2:
* Add include <linux/types.h> to openat2.h. [Florian Weimer]
* Move OPEN_HOW_SIZE_* constants out of UAPI. [Florian Weimer]
* Switch from __aligned_u64 to __u64 since it isn't necessary.
[David Laight]
v1: <https://lore.kernel.org/lkml/[email protected]/>

While openat2(2) is still not yet in Linus's tree, we can take this
opportunity to iron out some small warts that weren't noticed earlier:

* A fix was suggested by Florian Weimer, to separate the openat2
definitions so glibc can use the header directly. I've put the
maintainership under VFS but let me know if you'd prefer it belong
to the fcntl folks.

* Having heterogeneous field sizes in an extensible struct results in
"padding hole" problems when adding new fields (in addition, the
correct error to use for non-zero padding isn't entirely clear).
The simplest solution is to just copy clone3(2)'s model -- always use
u64s. It will waste a little more space in the struct, but it
removes a possible future headache.
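
To make the "padding hole" point concrete, here is a small C sketch. Both struct names and both helper functions are hypothetical and exist only for this illustration; they are not the definitions from the patch. It compares a layout with a 16-bit mode field against one that uses only 64-bit fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with a 16-bit mode between two 64-bit fields. */
struct how_mixed {
	uint64_t flags;
	uint16_t mode;    /* the compiler pads after this field */
	uint64_t resolve;
};

/* The model argued for above: every field is 64 bits wide. */
struct how_uniform {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Bytes of implicit padding between mode and resolve in the mixed layout. */
size_t mixed_hole(void)
{
	return offsetof(struct how_mixed, resolve)
	       - offsetof(struct how_mixed, mode) - sizeof(uint16_t);
}

/* Same measurement for the all-u64 layout. */
size_t uniform_hole(void)
{
	return offsetof(struct how_uniform, resolve)
	       - offsetof(struct how_uniform, mode) - sizeof(uint64_t);
}
```

On common 64-bit ABIs the mixed layout leaves six bytes that userspace may or may not zero and that the kernel would have to decide how to validate. The uniform layout has no such gap.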

This patch is intended to replace the corresponding patches in Al's
#work.openat2 tree (and *will not* apply on Linus' tree).

@Al: I will send some additional patches later, but they will require
proper design review since they're ABI-related features (namely,
adding a way to check what features a syscall supports as I
outlined in my talk here[1]).

[1]: https://youtu.be/ggD-eb3yPVs

Aleksa Sarai (2):
open: introduce openat2(2) syscall
selftests: add openat2(2) selftests

CREDITS | 4 +-
MAINTAINERS | 1 +
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/open.c | 147 +++--
include/linux/fcntl.h | 16 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 2 +-
include/uapi/linux/openat2.h | 39 ++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 106 ++++
.../testing/selftests/openat2/openat2_test.c | 312 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
34 files changed, 1418 insertions(+), 39 deletions(-)
create mode 100644 include/uapi/linux/openat2.h
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c

--
2.24.1

2020-01-18 12:09:52

by Aleksa Sarai

Subject: [PATCH v3 1/2] open: introduce openat2(2) syscall

/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.

In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.

Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).

/* Syscall Prototype. */
/*
* open_how is an extensible structure (similar in interface to
* clone3(2) or sched_setattr(2)). The size parameter must be set to
* sizeof(struct open_how), to allow for future extensions. All future
* extensions will be appended to open_how, with their zero value
* acting as a no-op default.
*/
struct open_how { /* ... */ };

int openat2(int dfd, const char *pathname,
struct open_how *how, size_t size);
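
The size-based extension rule in the comment above can be modelled in plain userspace C. This is a sketch of the idea rather than the kernel code: the function name copy_extensible is made up here, and the choice of -E2BIG for "caller set fields this side does not know about" follows my understanding of the kernel's copy_struct_from_user() helper.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of copying an extensible struct. "ksize" is the size
 * the callee knows about, "usize" the size the caller passed in.
 * Returns 0 on success, or -E2BIG if a newer caller set bytes the
 * callee does not understand.
 */
int copy_extensible(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;
	size_t i;

	/* Newer caller: unknown trailing fields must all be zero. */
	for (i = size; i < usize; i++)
		if (tail[i - size] != 0)
			return -E2BIG;

	/* Older caller: fields it did not supply default to zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

With this rule an old binary keeps working on a new kernel (missing fields read as zero, the no-op default), and a new binary works on an old kernel as long as it leaves the newer fields zeroed.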

/* Description. */
The initial version of 'struct open_how' contains the following fields:

flags
Used to specify openat(2)-style flags. However, any unknown flag
bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
will result in -EINVAL. In addition, this field is 64 bits wide to
allow for more O_ flags than currently permitted with openat(2).

mode
The file mode for O_CREAT or O_TMPFILE.

Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

resolve
Restricts path resolution (in contrast to O_* flags, these affect all
path components). The current set of flags is as follows (at the
moment, each RESOLVE_ flag is implemented by simply passing the
corresponding LOOKUP_ flag):

RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH => LOOKUP_BENEATH
RESOLVE_IN_ROOT => LOOKUP_IN_ROOT

open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though ->mode arguably should be a u16) to avoid having padding fields
which are never used in the future.
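
A minimal userspace sketch of calling the new syscall follows. Several things in it are assumptions rather than part of this patch text: the struct is a local copy of the three-u64 layout described above (named open_how_sketch so it cannot clash with a real header), the RESOLVE_ values are the ones I believe the merged uapi header uses, the fallback syscall number 437 is what the arm table in this patch uses (alpha uses 547), and open_beneath is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437 /* assumption: not valid on alpha, which uses 547 */
#endif

/* Local copy of the layout described above (three u64 fields). */
struct open_how_sketch {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Assumed values of the corresponding RESOLVE_ flags. */
#define SKETCH_RESOLVE_NO_SYMLINKS 0x04
#define SKETCH_RESOLVE_BENEATH     0x08

/* Open "path" relative to dirfd, refusing symlinks and escapes from dirfd. */
int open_beneath(int dirfd, const char *path)
{
	struct open_how_sketch how;

	memset(&how, 0, sizeof(how)); /* unused fields must be zero */
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = SKETCH_RESOLVE_BENEATH | SKETCH_RESOLVE_NO_SYMLINKS;

	/* Returns a new fd, or -1 with errno set (ENOSYS on older kernels). */
	return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

A caller would typically treat ENOSYS as "fall back to a slower userspace-only resolution" and any other failure as a hard error.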

Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).

After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.

/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.

In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).

/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).

Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.

Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).

[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/[email protected]/
[6]: https://youtu.be/ggD-eb3yPVs

Suggested-by: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
CREDITS | 4 +-
MAINTAINERS | 1 +
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/open.c | 147 +++++++++++++++-----
include/linux/fcntl.h | 16 ++-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/fcntl.h | 2 +-
include/uapi/linux/openat2.h | 39 ++++++
26 files changed, 198 insertions(+), 39 deletions(-)
create mode 100644 include/uapi/linux/openat2.h

diff --git a/CREDITS b/CREDITS
index 9602b0fa1c95..a97d3280a627 100644
--- a/CREDITS
+++ b/CREDITS
@@ -3302,7 +3302,9 @@ S: France
N: Aleksa Sarai
E: [email protected]
W: https://www.cyphar.com/
-D: `pids` cgroup subsystem
+D: /sys/fs/cgroup/pids
+D: openat2(2)
+S: Sydney, Australia

N: Dipankar Sarma
E: [email protected]
diff --git a/MAINTAINERS b/MAINTAINERS
index bd5847e802de..737ada377ac3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6397,6 +6397,7 @@ F: fs/*
F: include/linux/fs.h
F: include/linux/fs_types.h
F: include/uapi/linux/fs.h
+F: include/uapi/linux/openat2.h

FINTEK F75375S HARDWARE MONITOR AND FAN CONTROLLER DRIVER
M: Riku Voipio <[email protected]>
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8e13b0b2928d..4d7f2ffa957c 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
543 common fspick sys_fspick
544 common pidfd_open sys_pidfd_open
# 545 reserved for clone3
+547 common openat2 sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..4ba54bc7e19a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+437 common openat2 sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..8aa00ccb0b96 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)

-#define __NR_compat_syscalls 436
+#define __NR_compat_syscalls 438
#endif

#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..57f6f592d460 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#define __NR_clone3 435
__SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)

/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..8d36f2e2dc89 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..2559925f1924 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..c04385e60833 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+437 common openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index e7c5ab38e403..68c9ec06851f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,3 +374,4 @@
433 n32 fspick sys_fspick
434 n32 pidfd_open sys_pidfd_open
435 n32 clone3 __sys_clone3
+437 n32 openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 13cd66581f3b..42a72d010050 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,3 +350,4 @@
433 n64 fspick sys_fspick
434 n64 pidfd_open sys_pidfd_open
435 n64 clone3 __sys_clone3
+437 n64 openat2 sys_openat2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 353539ea4140..f114c4aed0ed 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,3 +423,4 @@
433 o32 fspick sys_fspick
434 o32 pidfd_open sys_pidfd_open
435 o32 clone3 __sys_clone3
+437 o32 openat2 sys_openat2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 285ff516150c..b550ae9a7fea 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3_wrapper
+437 common openat2 sys_openat2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 43f736ed47f2..a8b5ecb5b602 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 nospu clone3 ppc_clone3
+437 common openat2 sys_openat2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 3054e9c035a3..16b571c06161 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick sys_fspick
434 common pidfd_open sys_pidfd_open sys_pidfd_open
435 common clone3 sys_clone3 sys_clone3
+437 common openat2 sys_openat2 sys_openat2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index b5ed26c4c005..a7185cc18626 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8c8cc7537fb2..b11c19552022 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+437 common openat2 sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 15908eb9b17e..d22a8b5c3fab 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
435 i386 clone3 sys_clone3 __ia32_sys_clone3
+437 i386 openat2 sys_openat2 __ia32_sys_openat2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..9035647ef236 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
435 common clone3 __x64_sys_clone3/ptregs
+437 common openat2 __x64_sys_openat2

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 25f4de729a6d..f0a68013c038 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,3 +406,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+437 common openat2 sys_openat2
diff --git a/fs/open.c b/fs/open.c
index b62f5c0923a8..8cdb2b675867 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -955,48 +955,84 @@ struct file *open_with_fake_path(const struct path *path, int flags,
}
EXPORT_SYMBOL(open_with_fake_path);

-static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
+#define WILL_CREATE(flags) (flags & (O_CREAT | __O_TMPFILE))
+#define O_PATH_FLAGS (O_DIRECTORY | O_NOFOLLOW | O_PATH | O_CLOEXEC)
+
+static inline struct open_how build_open_how(int flags, umode_t mode)
+{
+ struct open_how how = {
+ .flags = flags & VALID_OPEN_FLAGS,
+ .mode = mode & S_IALLUGO,
+ };
+
+ /* O_PATH beats everything else. */
+ if (how.flags & O_PATH)
+ how.flags &= O_PATH_FLAGS;
+ /* Modes should only be set for create-like flags. */
+ if (!WILL_CREATE(how.flags))
+ how.mode = 0;
+ return how;
+}
+
+static inline int build_open_flags(const struct open_how *how,
+ struct open_flags *op)
{
+ int flags = how->flags;
int lookup_flags = 0;
int acc_mode = ACC_MODE(flags);

+ /* Must never be set by userspace */
+ flags &= ~(FMODE_NONOTIFY | O_CLOEXEC);
+
/*
- * Clear out all open flags we don't know about so that we don't report
- * them in fcntl(F_GETFD) or similar interfaces.
+ * Older syscalls implicitly clear all of the invalid flags or argument
+ * values before calling build_open_flags(), but openat2(2) checks all
+ * of its arguments.
*/
- flags &= VALID_OPEN_FLAGS;
+ if (flags & ~VALID_OPEN_FLAGS)
+ return -EINVAL;
+ if (how->resolve & ~VALID_RESOLVE_FLAGS)
+ return -EINVAL;

- if (flags & (O_CREAT | __O_TMPFILE))
- op->mode = (mode & S_IALLUGO) | S_IFREG;
- else
+ /* Deal with the mode. */
+ if (WILL_CREATE(flags)) {
+ if (how->mode & ~S_IALLUGO)
+ return -EINVAL;
+ op->mode = how->mode | S_IFREG;
+ } else {
+ if (how->mode != 0)
+ return -EINVAL;
op->mode = 0;
-
- /* Must never be set by userspace */
- flags &= ~FMODE_NONOTIFY & ~O_CLOEXEC;
+ }

/*
- * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only
- * check for O_DSYNC if the need any syncing at all we enforce it's
- * always set instead of having to deal with possibly weird behaviour
- * for malicious applications setting only __O_SYNC.
+ * In order to ensure programs get explicit errors when trying to use
+ * O_TMPFILE on old kernels, O_TMPFILE is implemented such that it
+ * looks like (O_DIRECTORY|O_RDWR & ~O_CREAT) to old kernels. But we
+ * have to require userspace to explicitly set it.
*/
- if (flags & __O_SYNC)
- flags |= O_DSYNC;
-
if (flags & __O_TMPFILE) {
if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
return -EINVAL;
if (!(acc_mode & MAY_WRITE))
return -EINVAL;
- } else if (flags & O_PATH) {
- /*
- * If we have O_PATH in the open flag. Then we
- * cannot have anything other than the below set of flags
- */
- flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+ }
+ if (flags & O_PATH) {
+ /* O_PATH only permits certain other flags to be set. */
+ if (flags & ~O_PATH_FLAGS)
+ return -EINVAL;
acc_mode = 0;
}

+ /*
+ * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only
+ * check for O_DSYNC if they need any syncing at all we enforce it's
+ * always set instead of having to deal with possibly weird behaviour
+ * for malicious applications setting only __O_SYNC.
+ */
+ if (flags & __O_SYNC)
+ flags |= O_DSYNC;
+
op->open_flag = flags;

/* O_TRUNC implies we need access checks for write permissions */
@@ -1022,6 +1058,18 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
+
+ if (how->resolve & RESOLVE_NO_XDEV)
+ lookup_flags |= LOOKUP_NO_XDEV;
+ if (how->resolve & RESOLVE_NO_MAGICLINKS)
+ lookup_flags |= LOOKUP_NO_MAGICLINKS;
+ if (how->resolve & RESOLVE_NO_SYMLINKS)
+ lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (how->resolve & RESOLVE_BENEATH)
+ lookup_flags |= LOOKUP_BENEATH;
+ if (how->resolve & RESOLVE_IN_ROOT)
+ lookup_flags |= LOOKUP_IN_ROOT;
+
op->lookup_flags = lookup_flags;
return 0;
}
@@ -1040,8 +1088,11 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
struct file *file_open_name(struct filename *name, int flags, umode_t mode)
{
struct open_flags op;
- int err = build_open_flags(flags, mode, &op);
- return err ? ERR_PTR(err) : do_filp_open(AT_FDCWD, name, &op);
+ struct open_how how = build_open_how(flags, mode);
+ int err = build_open_flags(&how, &op);
+ if (err)
+ return ERR_PTR(err);
+ return do_filp_open(AT_FDCWD, name, &op);
}

/**
@@ -1072,17 +1123,19 @@ struct file *file_open_root(struct dentry *dentry, struct vfsmount *mnt,
const char *filename, int flags, umode_t mode)
{
struct open_flags op;
- int err = build_open_flags(flags, mode, &op);
+ struct open_how how = build_open_how(flags, mode);
+ int err = build_open_flags(&how, &op);
if (err)
return ERR_PTR(err);
return do_file_open_root(dentry, mnt, filename, &op);
}
EXPORT_SYMBOL(file_open_root);

-long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
+static long do_sys_openat2(int dfd, const char __user *filename,
+ struct open_how *how)
{
struct open_flags op;
- int fd = build_open_flags(flags, mode, &op);
+ int fd = build_open_flags(how, &op);
struct filename *tmp;

if (fd)
@@ -1092,7 +1145,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
if (IS_ERR(tmp))
return PTR_ERR(tmp);

- fd = get_unused_fd_flags(flags);
+ fd = get_unused_fd_flags(how->flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
@@ -1107,12 +1160,16 @@ long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
return fd;
}

-SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
+long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
- if (force_o_largefile())
- flags |= O_LARGEFILE;
+ struct open_how how = build_open_how(flags, mode);
+ return do_sys_openat2(dfd, filename, &how);
+}

- return do_sys_open(AT_FDCWD, filename, flags, mode);
+
+SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
+{
+ return ksys_open(filename, flags, mode);
}

SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
@@ -1120,10 +1177,32 @@ SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
{
if (force_o_largefile())
flags |= O_LARGEFILE;
-
return do_sys_open(dfd, filename, flags, mode);
}

+SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
+ struct open_how __user *, how, size_t, usize)
+{
+ int err;
+ struct open_how tmp;
+
+ BUILD_BUG_ON(sizeof(struct open_how) < OPEN_HOW_SIZE_VER0);
+ BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_LATEST);
+
+ if (unlikely(usize < OPEN_HOW_SIZE_VER0))
+ return -EINVAL;
+
+ err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
+ if (err)
+ return err;
+
+ /* O_LARGEFILE is only allowed for non-O_PATH. */
+ if (!(tmp.flags & O_PATH) && force_o_largefile())
+ tmp.flags |= O_LARGEFILE;
+
+ return do_sys_openat2(dfd, filename, &tmp);
+}
+
#ifdef CONFIG_COMPAT
/*
* Exactly like sys_open(), except that it doesn't set the
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d019df946cb2..7bcdcf4f6ab2 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -2,15 +2,29 @@
#ifndef _LINUX_FCNTL_H
#define _LINUX_FCNTL_H

+#include <linux/stat.h>
#include <uapi/linux/fcntl.h>

-/* list of all valid flags for the open/openat flags argument: */
+/* List of all valid flags for the open/openat flags argument: */
#define VALID_OPEN_FLAGS \
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)

+/* List of all valid flags for the how->upgrade_mask argument: */
+#define VALID_UPGRADE_FLAGS \
+ (UPGRADE_NOWRITE | UPGRADE_NOREAD)
+
+/* List of all valid flags for the how->resolve argument: */
+#define VALID_RESOLVE_FLAGS \
+ (RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
+ RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+
+/* List of all open_how "versions". */
+#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
+#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
+
#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d0391cc2dae9..cd9f27cbc567 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ struct rseq;
union bpf_attr;
struct io_uring_params;
struct clone_args;
+struct open_how;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -439,6 +440,8 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
umode_t mode);
+asmlinkage long sys_openat2(int dfd, const char __user *filename,
+ struct open_how *how, size_t size);
asmlinkage long sys_close(unsigned int fd);
asmlinkage long sys_vhangup(void);

diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1fc8faa6e973..d4122c091472 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -851,8 +851,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
__SYSCALL(__NR_clone3, sys_clone3)
#endif

+#define __NR_openat2 437
+__SYSCALL(__NR_openat2, sys_openat2)
+
#undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 438

/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1f97b33c840e..ca88b7bce553 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -3,6 +3,7 @@
#define _UAPI_LINUX_FCNTL_H

#include <asm/fcntl.h>
+#include <linux/openat2.h>

#define F_SETLEASE (F_LINUX_SPECIFIC_BASE + 0)
#define F_GETLEASE (F_LINUX_SPECIFIC_BASE + 1)
@@ -100,5 +101,4 @@

#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */

-
#endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
new file mode 100644
index 000000000000..58b1eb711360
--- /dev/null
+++ b/include/uapi/linux/openat2.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_OPENAT2_H
+#define _UAPI_LINUX_OPENAT2_H
+
+#include <linux/types.h>
+
+/*
+ * Arguments for how openat2(2) should open the target path. If only @flags and
+ * @mode are non-zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown or invalid bits in @flags result in
+ * -EINVAL rather than being silently ignored. @mode must be zero unless one of
+ * {O_CREAT, O_TMPFILE} is set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+ __u64 flags;
+ __u64 mode;
+ __u64 resolve;
+};
+
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
+ (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
+ "magic-links". */
+#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks
+ (implies RESOLVE_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like
+ "..", symlinks, and absolute
+ paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
+ be scoped inside the dirfd
+ (similar to chroot(2)). */
+
+#endif /* _UAPI_LINUX_OPENAT2_H */
--
2.24.1

2020-01-18 12:10:03

by Aleksa Sarai

Subject: [PATCH v3 2/2] selftests: add openat2(2) selftests

Test all of the various openat2(2) flags. A small stress-test of a
symlink-rename attack is included to show that the protections against
".."-based attacks are sufficient.
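
For readers unfamiliar with the interface under test, the following is a
minimal sketch of calling openat2(2) directly through syscall(2) with
RESOLVE_IN_ROOT. It is illustrative only and not part of the patch:
open_in_root() and struct open_how_sketch are hypothetical names, the struct
is a local mirror of struct open_how, and the syscall number fallback of 437
is the one helpers.h uses (it is not correct on every architecture, for
example alpha uses 547).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: same fallback number as helpers.h; wrong on e.g. alpha. */
#ifndef __NR_openat2
#define __NR_openat2 437
#endif

/* Value taken from include/uapi/linux/openat2.h in the patch. */
#ifndef RESOLVE_IN_ROOT
#define RESOLVE_IN_ROOT 0x10
#endif

/* Hypothetical local mirror of struct open_how (three __u64 fields). */
struct open_how_sketch {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* The version-0 struct is 24 bytes (OPEN_HOW_SIZE_VER0). */
_Static_assert(sizeof(struct open_how_sketch) == 24,
	       "open_how mirror must match OPEN_HOW_SIZE_VER0");

/*
 * Open @path read-only with all "/" and ".." jumps scoped inside @rootfd.
 * Returns a file descriptor on success or -errno on failure, matching the
 * convention used by raw_openat2() in the selftests.
 */
int open_in_root(int rootfd, const char *path)
{
	struct open_how_sketch how;
	long ret;

	/* Zero everything so that no unknown bits are passed by accident. */
	memset(&how, 0, sizeof(how));
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = RESOLVE_IN_ROOT;

	ret = syscall(__NR_openat2, rootfd, path, &how, sizeof(how));
	return ret >= 0 ? (int)ret : -errno;
}
```

On a kernel without openat2(2) this is expected to return -ENOSYS (or
another negative errno if a sandbox filters the syscall), so callers have to
be prepared to handle that case themselves.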

The main things these self-tests are enforcing are:

* The struct+usize ABI for openat2(2) and copy_struct_from_user() to
ensure that upgrades will be handled gracefully (in addition,
ensuring that misaligned structures are also handled correctly).

* The -EINVAL checks for openat2(2) are all correctly handled to avoid
userspace passing unknown or conflicting flag sets (most
importantly, ensuring that invalid flag combinations are checked).

* All of the RESOLVE_* semantics (including errno values) are
correctly handled with various combinations of paths and flags.

* RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
attack that has been responsible for several CVEs (and likely will
be responsible for several more).
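
The struct+usize rules from the first point can be modelled in userspace
roughly as follows. This is only a sketch of the semantics the tests expect:
copy_struct_sketch() is a hypothetical name and not the kernel's
copy_struct_from_user() implementation, and it leaves out the separate
openat2(2) check that rejects usize < OPEN_HOW_SIZE_VER0 with -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model (hypothetical) of the extensible-struct copy rules.
 * @dst/@ksize describe the struct the "kernel" knows about, @src/@usize
 * describe what the caller passed in.
 *
 * Returns 0 on success, or -E2BIG if the caller set bytes beyond @ksize
 * that this "kernel" does not understand.
 */
int copy_struct_sketch(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;

	assert(dst != NULL && src != NULL);

	if (usize < ksize) {
		/* Older caller: fields it does not know about read as zero. */
		memset((unsigned char *)dst + size, 0, ksize - size);
	} else {
		/* Newer caller: unknown trailing bytes must all be zero. */
		for (size_t i = 0; i < usize - ksize; i++)
			if (tail[i] != 0)
				return -E2BIG;
	}

	/* Copy the part of the struct both sides agree on. */
	memcpy(dst, src, size);
	return 0;
}
```

This mirrors the cases in test_openat2_struct(): a larger struct with zeroed
trailing bytes is accepted, while any non-zero byte in the "future fields"
yields -E2BIG.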

Cc: Shuah Khan <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat2/.gitignore | 1 +
tools/testing/selftests/openat2/Makefile | 8 +
tools/testing/selftests/openat2/helpers.c | 109 ++++
tools/testing/selftests/openat2/helpers.h | 106 ++++
.../testing/selftests/openat2/openat2_test.c | 312 +++++++++++
.../selftests/openat2/rename_attack_test.c | 160 ++++++
.../testing/selftests/openat2/resolve_test.c | 523 ++++++++++++++++++
8 files changed, 1220 insertions(+)
create mode 100644 tools/testing/selftests/openat2/.gitignore
create mode 100644 tools/testing/selftests/openat2/Makefile
create mode 100644 tools/testing/selftests/openat2/helpers.c
create mode 100644 tools/testing/selftests/openat2/helpers.h
create mode 100644 tools/testing/selftests/openat2/openat2_test.c
create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
create mode 100644 tools/testing/selftests/openat2/resolve_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b001c602414b..4f502448dc7e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -40,6 +40,7 @@ TARGETS += powerpc
TARGETS += proc
TARGETS += pstore
TARGETS += ptrace
+TARGETS += openat2
TARGETS += rseq
TARGETS += rtc
TARGETS += seccomp
diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index 000000000000..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index 000000000000..4b93b1417b86
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined
+TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index 000000000000..e9a6557ab16f
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+
+#include "helpers.h"
+
+bool needs_openat2(const struct open_how *how)
+{
+ return how->resolve != 0;
+}
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size)
+{
+ int ret = syscall(__NR_openat2, dfd, path, how, size);
+ return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat2(int dfd, const char *path, struct open_how *how)
+{
+ return raw_openat2(dfd, path, how, sizeof(*how));
+}
+
+int sys_openat(int dfd, const char *path, struct open_how *how)
+{
+ int ret = openat(dfd, path, how->flags, how->mode);
+ return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags)
+{
+ int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+ newdirfd, newpath, flags);
+ return ret >= 0 ? ret : -errno;
+}
+
+int touchat(int dfd, const char *path)
+{
+ int fd = openat(dfd, path, O_CREAT, 0700);
+ if (fd >= 0)
+ close(fd);
+ return fd;
+}
+
+char *fdreadlink(int fd)
+{
+ char *target, *tmp;
+
+ E_asprintf(&tmp, "/proc/self/fd/%d", fd);
+
+ target = malloc(PATH_MAX);
+ if (!target)
+ ksft_exit_fail_msg("fdreadlink: malloc failed\n");
+ memset(target, 0, PATH_MAX);
+
+ E_readlink(tmp, target, PATH_MAX);
+ free(tmp);
+ return target;
+}
+
+bool fdequal(int fd, int dfd, const char *path)
+{
+ char *fdpath, *dfdpath, *other;
+ bool cmp;
+
+ fdpath = fdreadlink(fd);
+ dfdpath = fdreadlink(dfd);
+
+ if (!path)
+ E_asprintf(&other, "%s", dfdpath);
+ else if (*path == '/')
+ E_asprintf(&other, "%s", path);
+ else
+ E_asprintf(&other, "%s/%s", dfdpath, path);
+
+ cmp = !strcmp(fdpath, other);
+
+ free(fdpath);
+ free(dfdpath);
+ free(other);
+ return cmp;
+}
+
+bool openat2_supported = false;
+
+void __attribute__((constructor)) init(void)
+{
+ struct open_how how = {};
+ int fd;
+
+ BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0);
+
+ /* Check openat2(2) support. */
+ fd = sys_openat2(AT_FDCWD, ".", &how);
+ openat2_supported = (fd >= 0);
+
+ if (fd >= 0)
+ close(fd);
+}
diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
new file mode 100644
index 000000000000..a6ea27344db2
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.h
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#ifndef __RESOLVEAT_H__
+#define __RESOLVEAT_H__
+
+#define _GNU_SOURCE
+#include <stdint.h>
+#include <errno.h>
+#include <linux/types.h>
+#include "../kselftest.h"
+
+#define ARRAY_LEN(X) (sizeof (X) / sizeof (*(X)))
+#define BUILD_BUG_ON(e) ((void)(sizeof(struct { int:(-!!(e)); })))
+
+#ifndef SYS_openat2
+#ifndef __NR_openat2
+#define __NR_openat2 437
+#endif /* __NR_openat2 */
+#define SYS_openat2 __NR_openat2
+#endif /* SYS_openat2 */
+
+/*
+ * Arguments for how openat2(2) should open the target path. If @resolve is
+ * zero, then openat2(2) operates very similarly to openat(2).
+ *
+ * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
+ * than being silently ignored. @mode must be zero unless one of {O_CREAT,
+ * O_TMPFILE} is set.
+ *
+ * @flags: O_* flags.
+ * @mode: O_CREAT/O_TMPFILE file mode.
+ * @resolve: RESOLVE_* flags.
+ */
+struct open_how {
+ __u64 flags;
+ __u64 mode;
+ __u64 resolve;
+};
+
+#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
+#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
+
+bool needs_openat2(const struct open_how *how);
+
+#ifndef RESOLVE_IN_ROOT
+/* how->resolve flags for openat2(2). */
+#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
+ (includes bind-mounts). */
+#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
+ "magic-links". */
+#define RESOLVE_NO_SYMLINKS 0x04 /* Block traversal through all symlinks
+ (implies RESOLVE_NO_MAGICLINKS) */
+#define RESOLVE_BENEATH 0x08 /* Block "lexical" trickery like
+ "..", symlinks, and absolute
+ paths which escape the dirfd. */
+#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
+ be scoped inside the dirfd
+ (similar to chroot(2)). */
+#endif /* RESOLVE_IN_ROOT */
+
+#define E_func(func, ...) \
+ do { \
+ if (func(__VA_ARGS__) < 0) \
+ ksft_exit_fail_msg("%s:%d %s failed\n", \
+ __FILE__, __LINE__, #func);\
+ } while (0)
+
+#define E_asprintf(...) E_func(asprintf, __VA_ARGS__)
+#define E_chmod(...) E_func(chmod, __VA_ARGS__)
+#define E_dup2(...) E_func(dup2, __VA_ARGS__)
+#define E_fchdir(...) E_func(fchdir, __VA_ARGS__)
+#define E_fstatat(...) E_func(fstatat, __VA_ARGS__)
+#define E_kill(...) E_func(kill, __VA_ARGS__)
+#define E_mkdirat(...) E_func(mkdirat, __VA_ARGS__)
+#define E_mount(...) E_func(mount, __VA_ARGS__)
+#define E_prctl(...) E_func(prctl, __VA_ARGS__)
+#define E_readlink(...) E_func(readlink, __VA_ARGS__)
+#define E_setresuid(...) E_func(setresuid, __VA_ARGS__)
+#define E_symlinkat(...) E_func(symlinkat, __VA_ARGS__)
+#define E_touchat(...) E_func(touchat, __VA_ARGS__)
+#define E_unshare(...) E_func(unshare, __VA_ARGS__)
+
+#define E_assert(expr, msg, ...) \
+ do { \
+ if (!(expr)) \
+ ksft_exit_fail_msg("ASSERT(%s:%d) failed (%s): " msg "\n", \
+ __FILE__, __LINE__, #expr, ##__VA_ARGS__); \
+ } while (0)
+
+int raw_openat2(int dfd, const char *path, void *how, size_t size);
+int sys_openat2(int dfd, const char *path, struct open_how *how);
+int sys_openat(int dfd, const char *path, struct open_how *how);
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags);
+
+int touchat(int dfd, const char *path);
+char *fdreadlink(int fd);
+bool fdequal(int fd, int dfd, const char *path);
+
+extern bool openat2_supported;
+
+#endif /* __RESOLVEAT_H__ */
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
new file mode 100644
index 000000000000..b386367c606b
--- /dev/null
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -0,0 +1,312 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * O_LARGEFILE is set to 0 by glibc.
+ * XXX: This is wrong on {mips, parisc, powerpc, sparc}.
+ */
+#undef O_LARGEFILE
+#define O_LARGEFILE 0x8000
+
+struct open_how_ext {
+ struct open_how inner;
+ uint32_t extra1;
+ char pad1[128];
+ uint32_t extra2;
+ char pad2[128];
+ uint32_t extra3;
+};
+
+struct struct_test {
+ const char *name;
+ struct open_how_ext arg;
+ size_t size;
+ int err;
+};
+
+#define NUM_OPENAT2_STRUCT_TESTS 7
+#define NUM_OPENAT2_STRUCT_VARIATIONS 13
+
+void test_openat2_struct(void)
+{
+ int misalignments[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 17, 87 };
+
+ struct struct_test tests[] = {
+ /* Normal struct. */
+ { .name = "normal struct",
+ .arg.inner.flags = O_RDONLY,
+ .size = sizeof(struct open_how) },
+ /* Bigger struct, with zeroed out end. */
+ { .name = "bigger struct (zeroed out)",
+ .arg.inner.flags = O_RDONLY,
+ .size = sizeof(struct open_how_ext) },
+
+ /* TODO: Once expanded, check zero-padding. */
+
+ /* Smaller than version-0 struct. */
+ { .name = "zero-sized 'struct'",
+ .arg.inner.flags = O_RDONLY, .size = 0, .err = -EINVAL },
+ { .name = "smaller-than-v0 struct",
+ .arg.inner.flags = O_RDONLY,
+ .size = OPEN_HOW_SIZE_VER0 - 1, .err = -EINVAL },
+
+ /* Bigger struct, with non-zero trailing bytes. */
+ { .name = "bigger struct (non-zero data in first 'future field')",
+ .arg.inner.flags = O_RDONLY, .arg.extra1 = 0xdeadbeef,
+ .size = sizeof(struct open_how_ext), .err = -E2BIG },
+ { .name = "bigger struct (non-zero data in middle of 'future fields')",
+ .arg.inner.flags = O_RDONLY, .arg.extra2 = 0xfeedcafe,
+ .size = sizeof(struct open_how_ext), .err = -E2BIG },
+ { .name = "bigger struct (non-zero data at end of 'future fields')",
+ .arg.inner.flags = O_RDONLY, .arg.extra3 = 0xabad1dea,
+ .size = sizeof(struct open_how_ext), .err = -E2BIG },
+ };
+
+ BUILD_BUG_ON(ARRAY_LEN(misalignments) != NUM_OPENAT2_STRUCT_VARIATIONS);
+ BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_STRUCT_TESTS);
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ struct struct_test *test = &tests[i];
+ struct open_how_ext how_ext = test->arg;
+
+ for (int j = 0; j < ARRAY_LEN(misalignments); j++) {
+ int fd, misalign = misalignments[j];
+ char *fdpath = NULL;
+ bool failed;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+ void *copy = NULL, *how_copy = &how_ext;
+
+ if (!openat2_supported) {
+ ksft_print_msg("openat2(2) unsupported\n");
+ resultfn = ksft_test_result_skip;
+ goto skip;
+ }
+
+ if (misalign) {
+ /*
+ * Explicitly misalign the structure by copying it to the given
+ * (mis)alignment offset. The bytes before it are set to non-zero to
+ * make sure that non-zero bytes outside the struct aren't checked.
+ *
+ * This is effectively to check that check_zeroed_user() works.
+ */
+ copy = malloc(misalign + sizeof(how_ext));
+ how_copy = copy + misalign;
+ memset(copy, 0xff, misalign);
+ memcpy(how_copy, &how_ext, sizeof(how_ext));
+ }
+
+ fd = raw_openat2(AT_FDCWD, ".", how_copy, test->size);
+ if (test->err >= 0)
+ failed = (fd < 0);
+ else
+ failed = (fd != test->err);
+ if (fd >= 0) {
+ fdpath = fdreadlink(fd);
+ close(fd);
+ }
+
+ if (failed) {
+ resultfn = ksft_test_result_fail;
+
+ ksft_print_msg("openat2 unexpectedly returned ");
+ if (fdpath)
+ ksft_print_msg("%d['%s']\n", fd, fdpath);
+ else
+ ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+ }
+
+skip:
+ if (test->err >= 0)
+ resultfn("openat2 with %s argument [misalign=%d] succeeds\n",
+ test->name, misalign);
+ else
+ resultfn("openat2 with %s argument [misalign=%d] fails with %d (%s)\n",
+ test->name, misalign, test->err,
+ strerror(-test->err));
+
+ free(copy);
+ free(fdpath);
+ fflush(stdout);
+ }
+ }
+}
+
+struct flag_test {
+ const char *name;
+ struct open_how how;
+ int err;
+};
+
+#define NUM_OPENAT2_FLAG_TESTS 23
+
+void test_openat2_flags(void)
+{
+ struct flag_test tests[] = {
+ /* O_TMPFILE is incompatible with O_PATH and O_CREAT. */
+ { .name = "incompatible flags (O_TMPFILE | O_PATH)",
+ .how.flags = O_TMPFILE | O_PATH | O_RDWR, .err = -EINVAL },
+ { .name = "incompatible flags (O_TMPFILE | O_CREAT)",
+ .how.flags = O_TMPFILE | O_CREAT | O_RDWR, .err = -EINVAL },
+
+ /* O_PATH only permits certain other flags to be set ... */
+ { .name = "compatible flags (O_PATH | O_CLOEXEC)",
+ .how.flags = O_PATH | O_CLOEXEC },
+ { .name = "compatible flags (O_PATH | O_DIRECTORY)",
+ .how.flags = O_PATH | O_DIRECTORY },
+ { .name = "compatible flags (O_PATH | O_NOFOLLOW)",
+ .how.flags = O_PATH | O_NOFOLLOW },
+ /* ... and others are absolutely not permitted. */
+ { .name = "incompatible flags (O_PATH | O_RDWR)",
+ .how.flags = O_PATH | O_RDWR, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_CREAT)",
+ .how.flags = O_PATH | O_CREAT, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_EXCL)",
+ .how.flags = O_PATH | O_EXCL, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_NOCTTY)",
+ .how.flags = O_PATH | O_NOCTTY, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_DIRECT)",
+ .how.flags = O_PATH | O_DIRECT, .err = -EINVAL },
+ { .name = "incompatible flags (O_PATH | O_LARGEFILE)",
+ .how.flags = O_PATH | O_LARGEFILE, .err = -EINVAL },
+
+ /* ->mode must only be set with O_{CREAT,TMPFILE}. */
+ { .name = "non-zero how.mode and O_RDONLY",
+ .how.flags = O_RDONLY, .how.mode = 0600, .err = -EINVAL },
+ { .name = "non-zero how.mode and O_PATH",
+ .how.flags = O_PATH, .how.mode = 0600, .err = -EINVAL },
+ { .name = "valid how.mode and O_CREAT",
+ .how.flags = O_CREAT, .how.mode = 0600 },
+ { .name = "valid how.mode and O_TMPFILE",
+ .how.flags = O_TMPFILE | O_RDWR, .how.mode = 0600 },
+ /* ->mode must only contain 0777 bits. */
+ { .name = "invalid how.mode and O_CREAT",
+ .how.flags = O_CREAT,
+ .how.mode = 0xFFFF, .err = -EINVAL },
+ { .name = "invalid (very large) how.mode and O_CREAT",
+ .how.flags = O_CREAT,
+ .how.mode = 0xC000000000000000ULL, .err = -EINVAL },
+ { .name = "invalid how.mode and O_TMPFILE",
+ .how.flags = O_TMPFILE | O_RDWR,
+ .how.mode = 0x1337, .err = -EINVAL },
+ { .name = "invalid (very large) how.mode and O_TMPFILE",
+ .how.flags = O_TMPFILE | O_RDWR,
+ .how.mode = 0x0000A00000000000ULL, .err = -EINVAL },
+
+ /* ->resolve must only contain RESOLVE_* flags. */
+ { .name = "invalid how.resolve and O_RDONLY",
+ .how.flags = O_RDONLY,
+ .how.resolve = 0x1337, .err = -EINVAL },
+ { .name = "invalid how.resolve and O_CREAT",
+ .how.flags = O_CREAT,
+ .how.resolve = 0x1337, .err = -EINVAL },
+ { .name = "invalid how.resolve and O_TMPFILE",
+ .how.flags = O_TMPFILE | O_RDWR,
+ .how.resolve = 0x1337, .err = -EINVAL },
+ { .name = "invalid how.resolve and O_PATH",
+ .how.flags = O_PATH,
+ .how.resolve = 0x1337, .err = -EINVAL },
+ };
+
+ BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS);
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ int fd, fdflags = -1;
+ char *path, *fdpath = NULL;
+ bool failed = false;
+ struct flag_test *test = &tests[i];
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+
+ if (!openat2_supported) {
+ ksft_print_msg("openat2(2) unsupported\n");
+ resultfn = ksft_test_result_skip;
+ goto skip;
+ }
+
+ path = (test->how.flags & O_CREAT) ? "/tmp/ksft.openat2_tmpfile" : ".";
+ unlink(path);
+
+ fd = sys_openat2(AT_FDCWD, path, &test->how);
+ if (test->err >= 0)
+ failed = (fd < 0);
+ else
+ failed = (fd != test->err);
+ if (fd >= 0) {
+ int otherflags;
+
+ fdpath = fdreadlink(fd);
+ fdflags = fcntl(fd, F_GETFL);
+ otherflags = fcntl(fd, F_GETFD);
+ close(fd);
+
+ E_assert(fdflags >= 0, "fcntl F_GETFL of new fd");
+ E_assert(otherflags >= 0, "fcntl F_GETFD of new fd");
+
+ /* O_CLOEXEC isn't shown in F_GETFL. */
+ if (otherflags & FD_CLOEXEC)
+ fdflags |= O_CLOEXEC;
+ /* O_CREAT is hidden from F_GETFL. */
+ if (test->how.flags & O_CREAT)
+ fdflags |= O_CREAT;
+ if (!(test->how.flags & O_LARGEFILE))
+ fdflags &= ~O_LARGEFILE;
+ failed |= (fdflags != test->how.flags);
+ }
+
+ if (failed) {
+ resultfn = ksft_test_result_fail;
+
+ ksft_print_msg("openat2 unexpectedly returned ");
+ if (fdpath)
+ ksft_print_msg("%d['%s'] with %X (!= %X)\n",
+ fd, fdpath, fdflags,
+ test->how.flags);
+ else
+ ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+ }
+
+skip:
+ if (test->err >= 0)
+ resultfn("openat2 with %s succeeds\n", test->name);
+ else
+ resultfn("openat2 with %s fails with %d (%s)\n",
+ test->name, test->err, strerror(-test->err));
+
+ free(fdpath);
+ fflush(stdout);
+ }
+}
+
+#define NUM_TESTS (NUM_OPENAT2_STRUCT_VARIATIONS * NUM_OPENAT2_STRUCT_TESTS + \
+ NUM_OPENAT2_FLAG_TESTS)
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(NUM_TESTS);
+
+ test_openat2_struct();
+ test_openat2_flags();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/rename_attack_test.c b/tools/testing/selftests/openat2/rename_attack_test.c
new file mode 100644
index 000000000000..0a770728b436
--- /dev/null
+++ b/tools/testing/selftests/openat2/rename_attack_test.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <syscall.h>
+#include <limits.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/* Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- a/
+ * | `-- c/
+ * `-- b/
+ */
+int setup_testdir(void)
+{
+ int dfd;
+ char dirname[] = "/tmp/ksft-openat2-rename-attack.XXXXXX";
+
+ /* Make the top-level directory. */
+ if (!mkdtemp(dirname))
+ ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+ dfd = open(dirname, O_PATH | O_DIRECTORY);
+ if (dfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+ E_mkdirat(dfd, "a", 0755);
+ E_mkdirat(dfd, "b", 0755);
+ E_mkdirat(dfd, "a/c", 0755);
+
+ return dfd;
+}
+
+/* Swap @dirfd/@a and @dirfd/@b constantly. Parent must kill this process. */
+pid_t spawn_attack(int dirfd, char *a, char *b)
+{
+ pid_t child = fork();
+ if (child != 0)
+ return child;
+
+ /* If the parent (the test process) dies, kill ourselves too. */
+ E_prctl(PR_SET_PDEATHSIG, SIGKILL);
+
+ /* Swap @a and @b. */
+ for (;;)
+ renameat2(dirfd, a, dirfd, b, RENAME_EXCHANGE);
+ exit(1);
+}
+
+#define NUM_RENAME_TESTS 2
+#define ROUNDS 400000
+
+const char *flagname(int resolve)
+{
+ switch (resolve) {
+ case RESOLVE_IN_ROOT:
+ return "RESOLVE_IN_ROOT";
+ case RESOLVE_BENEATH:
+ return "RESOLVE_BENEATH";
+ }
+ return "(unknown)";
+}
+
+void test_rename_attack(int resolve)
+{
+ int dfd, afd;
+ pid_t child;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ int escapes = 0, other_errs = 0, exdevs = 0, eagains = 0, successes = 0;
+
+ struct open_how how = {
+ .flags = O_PATH,
+ .resolve = resolve,
+ };
+
+ if (!openat2_supported) {
+ how.resolve = 0;
+ ksft_print_msg("openat2(2) unsupported -- using openat(2) instead\n");
+ }
+
+ dfd = setup_testdir();
+ afd = openat(dfd, "a", O_PATH);
+ if (afd < 0)
+ ksft_exit_fail_msg("test_rename_attack: failed to open 'a'\n");
+
+ child = spawn_attack(dfd, "a/c", "b");
+
+ for (int i = 0; i < ROUNDS; i++) {
+ int fd;
+ char *victim_path = "c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../../c/../..";
+
+ if (openat2_supported)
+ fd = sys_openat2(afd, victim_path, &how);
+ else
+ fd = sys_openat(afd, victim_path, &how);
+
+ if (fd < 0) {
+ if (fd == -EAGAIN)
+ eagains++;
+ else if (fd == -EXDEV)
+ exdevs++;
+ else if (fd == -ENOENT)
+ escapes++; /* escaped outside and got ENOENT... */
+ else
+ other_errs++; /* unexpected error */
+ } else {
+ if (fdequal(fd, afd, NULL))
+ successes++;
+ else
+ escapes++; /* we got an unexpected fd */
+ }
+ close(fd);
+ }
+
+ if (escapes > 0)
+ resultfn = ksft_test_result_fail;
+ ksft_print_msg("non-escapes: EAGAIN=%d EXDEV=%d E<other>=%d success=%d\n",
+ eagains, exdevs, other_errs, successes);
+ resultfn("rename attack with %s (%d runs, got %d escapes)\n",
+ flagname(resolve), ROUNDS, escapes);
+
+ /* Should be killed anyway, but might as well make sure. */
+ E_kill(child, SIGKILL);
+}
+
+#define NUM_TESTS NUM_RENAME_TESTS
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(NUM_TESTS);
+
+ test_rename_attack(RESOLVE_BENEATH);
+ test_rename_attack(RESOLVE_IN_ROOT);
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/openat2/resolve_test.c b/tools/testing/selftests/openat2/resolve_test.c
new file mode 100644
index 000000000000..7a94b1da8e7b
--- /dev/null
+++ b/tools/testing/selftests/openat2/resolve_test.c
@@ -0,0 +1,523 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <[email protected]>
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+
+#include "../kselftest.h"
+#include "helpers.h"
+
+/*
+ * Construct a test directory with the following structure:
+ *
+ * root/
+ * |-- procexe -> /proc/self/exe
+ * |-- procroot -> /proc/self/root
+ * |-- root/
+ * |-- mnt/ [mountpoint]
+ * | |-- self -> ../mnt/
+ * | `-- absself -> /mnt/
+ * |-- etc/
+ * | `-- passwd
+ * |-- creatlink -> /newfile3
+ * |-- reletc -> etc/
+ * |-- relsym -> etc/passwd
+ * |-- absetc -> /etc/
+ * |-- abssym -> /etc/passwd
+ * |-- abscheeky -> /cheeky
+ * `-- cheeky/
+ * |-- absself -> /
+ * |-- self -> ../../root/
+ * |-- garbageself -> /../../root/
+ * |-- passwd -> ../cheeky/../cheeky/../etc/../etc/passwd
+ * |-- abspasswd -> /../cheeky/../cheeky/../etc/../etc/passwd
+ * |-- dotdotlink -> ../../../../../../../../../../../../../../etc/passwd
+ * `-- garbagelink -> /../../../../../../../../../../../../../../etc/passwd
+ */
+int setup_testdir(void)
+{
+ int dfd, tmpfd;
+ char dirname[] = "/tmp/ksft-openat2-testdir.XXXXXX";
+
+ /* Unshare and make /tmp a new directory. */
+ E_unshare(CLONE_NEWNS);
+ E_mount("", "/tmp", "", MS_PRIVATE, "");
+
+ /* Make the top-level directory. */
+ if (!mkdtemp(dirname))
+ ksft_exit_fail_msg("setup_testdir: failed to create tmpdir\n");
+ dfd = open(dirname, O_PATH | O_DIRECTORY);
+ if (dfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+
+ /* A sub-directory which is actually used for tests. */
+ E_mkdirat(dfd, "root", 0755);
+ tmpfd = openat(dfd, "root", O_PATH | O_DIRECTORY);
+ if (tmpfd < 0)
+ ksft_exit_fail_msg("setup_testdir: failed to open tmpdir\n");
+ close(dfd);
+ dfd = tmpfd;
+
+ E_symlinkat("/proc/self/exe", dfd, "procexe");
+ E_symlinkat("/proc/self/root", dfd, "procroot");
+ E_mkdirat(dfd, "root", 0755);
+
+ /* There is no mountat(2), so use chdir. */
+ E_mkdirat(dfd, "mnt", 0755);
+ E_fchdir(dfd);
+ E_mount("tmpfs", "./mnt", "tmpfs", MS_NOSUID | MS_NODEV, "");
+ E_symlinkat("../mnt/", dfd, "mnt/self");
+ E_symlinkat("/mnt/", dfd, "mnt/absself");
+
+ E_mkdirat(dfd, "etc", 0755);
+ E_touchat(dfd, "etc/passwd");
+
+ E_symlinkat("/newfile3", dfd, "creatlink");
+ E_symlinkat("etc/", dfd, "reletc");
+ E_symlinkat("etc/passwd", dfd, "relsym");
+ E_symlinkat("/etc/", dfd, "absetc");
+ E_symlinkat("/etc/passwd", dfd, "abssym");
+ E_symlinkat("/cheeky", dfd, "abscheeky");
+
+ E_mkdirat(dfd, "cheeky", 0755);
+
+ E_symlinkat("/", dfd, "cheeky/absself");
+ E_symlinkat("../../root/", dfd, "cheeky/self");
+ E_symlinkat("/../../root/", dfd, "cheeky/garbageself");
+
+ E_symlinkat("../cheeky/../etc/../etc/passwd", dfd, "cheeky/passwd");
+ E_symlinkat("/../cheeky/../etc/../etc/passwd", dfd, "cheeky/abspasswd");
+
+ E_symlinkat("../../../../../../../../../../../../../../etc/passwd",
+ dfd, "cheeky/dotdotlink");
+ E_symlinkat("/../../../../../../../../../../../../../../etc/passwd",
+ dfd, "cheeky/garbagelink");
+
+ return dfd;
+}
+
+struct basic_test {
+ const char *name;
+ const char *dir;
+ const char *path;
+ struct open_how how;
+ bool pass;
+ union {
+ int err;
+ const char *path;
+ } out;
+};
+
+#define NUM_OPENAT2_OPATH_TESTS 88
+
+void test_openat2_opath_tests(void)
+{
+ int rootfd, hardcoded_fd;
+ char *procselfexe, *hardcoded_fdpath;
+
+ E_asprintf(&procselfexe, "/proc/%d/exe", getpid());
+ rootfd = setup_testdir();
+
+ hardcoded_fd = open("/dev/null", O_RDONLY);
+ E_assert(hardcoded_fd >= 0, "open fd to hardcode");
+ E_asprintf(&hardcoded_fdpath, "self/fd/%d", hardcoded_fd);
+
+ struct basic_test tests[] = {
+ /** RESOLVE_BENEATH **/
+ /* Attempts to cross dirfd should be blocked. */
+ { .name = "[beneath] jump to /",
+ .path = "/", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] absolute link to $root",
+ .path = "cheeky/absself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] chained absolute links to $root",
+ .path = "abscheeky/absself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] jump outside $root",
+ .path = "..", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] temporary jump outside $root",
+ .path = "../root/", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] symlink temporary jump outside $root",
+ .path = "cheeky/self", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] chained symlink temporary jump outside $root",
+ .path = "abscheeky/self", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] garbage links to $root",
+ .path = "cheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] chained garbage links to $root",
+ .path = "abscheeky/garbageself", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ /* Only relative paths that stay inside dirfd should work. */
+ { .name = "[beneath] ordinary path to 'root'",
+ .path = "root", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "root", .pass = true },
+ { .name = "[beneath] ordinary path to 'etc'",
+ .path = "etc", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc", .pass = true },
+ { .name = "[beneath] ordinary path to 'etc/passwd'",
+ .path = "etc/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[beneath] relative symlink inside $root",
+ .path = "relsym", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[beneath] chained-'..' relative symlink inside $root",
+ .path = "cheeky/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[beneath] absolute symlink component outside $root",
+ .path = "abscheeky/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] absolute symlink target outside $root",
+ .path = "abssym", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] absolute path outside $root",
+ .path = "/etc/passwd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] cheeky absolute path outside $root",
+ .path = "cheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] chained cheeky absolute path outside $root",
+ .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ /* Tricky paths should fail. */
+ { .name = "[beneath] tricky '..'-chained symlink outside $root",
+ .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] tricky absolute + '..'-chained symlink outside $root",
+ .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] tricky garbage link outside $root",
+ .path = "cheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[beneath] tricky absolute + garbage link outside $root",
+ .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_BENEATH,
+ .out.err = -EXDEV, .pass = false },
+
+ /** RESOLVE_IN_ROOT **/
+ /* All attempts to cross the dirfd will be scoped-to-root. */
+ { .name = "[in_root] jump to /",
+ .path = "/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .name = "[in_root] absolute symlink to /root",
+ .path = "cheeky/absself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .name = "[in_root] chained absolute symlinks to /root",
+ .path = "abscheeky/absself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .name = "[in_root] '..' at root",
+ .path = "..", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = NULL, .pass = true },
+ { .name = "[in_root] '../root' at root",
+ .path = "../root/", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] relative symlink containing '..' above root",
+ .path = "cheeky/self", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] garbage link to /root",
+ .path = "cheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] chained garbage links to /root",
+ .path = "abscheeky/garbageself", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] relative path to 'root'",
+ .path = "root", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "root", .pass = true },
+ { .name = "[in_root] relative path to 'etc'",
+ .path = "etc", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc", .pass = true },
+ { .name = "[in_root] relative path to 'etc/passwd'",
+ .path = "etc/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] relative symlink to 'etc/passwd'",
+ .path = "relsym", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] chained-'..' relative symlink to 'etc/passwd'",
+ .path = "cheeky/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] chained-'..' absolute + relative symlink to 'etc/passwd'",
+ .path = "abscheeky/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] absolute symlink to 'etc/passwd'",
+ .path = "abssym", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] absolute path 'etc/passwd'",
+ .path = "/etc/passwd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] cheeky absolute path 'etc/passwd'",
+ .path = "cheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] chained cheeky absolute path 'etc/passwd'",
+ .path = "abscheeky/abspasswd", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky '..'-chained symlink outside $root",
+ .path = "cheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky absolute + '..'-chained symlink outside $root",
+ .path = "abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky absolute path + absolute + '..'-chained symlink outside $root",
+ .path = "/../../../../abscheeky/dotdotlink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky garbage link outside $root",
+ .path = "cheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky absolute + garbage link outside $root",
+ .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ { .name = "[in_root] tricky absolute path + absolute + garbage link outside $root",
+ .path = "/../../../../abscheeky/garbagelink", .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "etc/passwd", .pass = true },
+ /* O_CREAT should handle trailing symlinks correctly. */
+ { .name = "[in_root] O_CREAT of relative path inside $root",
+ .path = "newfile1", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile1", .pass = true },
+ { .name = "[in_root] O_CREAT of absolute path",
+ .path = "/newfile2", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile2", .pass = true },
+ { .name = "[in_root] O_CREAT of tricky symlink outside root",
+ .path = "/creatlink", .how.flags = O_CREAT,
+ .how.mode = 0700,
+ .how.resolve = RESOLVE_IN_ROOT,
+ .out.path = "newfile3", .pass = true },
+
+ /** RESOLVE_NO_XDEV **/
+ /* Crossing *down* into a mountpoint is disallowed. */
+ { .name = "[no_xdev] cross into $mnt",
+ .path = "mnt", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] cross into $mnt/",
+ .path = "mnt/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] cross into $mnt/.",
+ .path = "mnt/.", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Crossing *up* out of a mountpoint is disallowed. */
+ { .name = "[no_xdev] goto mountpoint root",
+ .dir = "mnt", .path = ".", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "mnt", .pass = true },
+ { .name = "[no_xdev] cross up through '..'",
+ .dir = "mnt", .path = "..", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] temporary cross up through '..'",
+ .dir = "mnt", .path = "../mnt", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] temporary relative symlink cross up",
+ .dir = "mnt", .path = "self", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] temporary absolute symlink cross up",
+ .dir = "mnt", .path = "absself", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Jumping to "/" is ok, but later components cannot cross. */
+ { .name = "[no_xdev] jump to / directly",
+ .dir = "mnt", .path = "/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/", .pass = true },
+ { .name = "[no_xdev] jump to / (from /) directly",
+ .dir = "/", .path = "/", .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/", .pass = true },
+ { .name = "[no_xdev] jump to / then proc",
+ .path = "/proc/1", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] jump to / then tmp",
+ .path = "/tmp", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Magic-links are blocked since they can switch vfsmounts. */
+ { .name = "[no_xdev] cross through magic-link to self/root",
+ .dir = "/proc", .path = "self/root", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ { .name = "[no_xdev] cross through magic-link to self/cwd",
+ .dir = "/proc", .path = "self/cwd", .how.resolve = RESOLVE_NO_XDEV,
+ .out.err = -EXDEV, .pass = false },
+ /* Except magic-link jumps inside the same vfsmount. */
+ { .name = "[no_xdev] jump through magic-link to same procfs",
+ .dir = "/proc", .path = hardcoded_fdpath, .how.resolve = RESOLVE_NO_XDEV,
+ .out.path = "/proc", .pass = true, },
+
+ /** RESOLVE_NO_MAGICLINKS **/
+ /* Regular symlinks should work. */
+ { .name = "[no_magiclinks] ordinary relative symlink",
+ .path = "relsym", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.path = "etc/passwd", .pass = true },
+ /* Magic-links should not work. */
+ { .name = "[no_magiclinks] symlink to magic-link",
+ .path = "procexe", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_magiclinks] normal path to magic-link",
+ .path = "/proc/self/exe", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_magiclinks] normal path to magic-link with O_NOFOLLOW",
+ .path = "/proc/self/exe", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.path = procselfexe, .pass = true },
+ { .name = "[no_magiclinks] symlink to magic-link path component",
+ .path = "procroot/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_magiclinks] magic-link path component",
+ .path = "/proc/self/root/etc", .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_magiclinks] magic-link path component with O_NOFOLLOW",
+ .path = "/proc/self/root/etc", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_MAGICLINKS,
+ .out.err = -ELOOP, .pass = false },
+
+ /** RESOLVE_NO_SYMLINKS **/
+ /* Normal paths should work. */
+ { .name = "[no_symlinks] ordinary path to '.'",
+ .path = ".", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = NULL, .pass = true },
+ { .name = "[no_symlinks] ordinary path to 'root'",
+ .path = "root", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "root", .pass = true },
+ { .name = "[no_symlinks] ordinary path to 'etc'",
+ .path = "etc", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "etc", .pass = true },
+ { .name = "[no_symlinks] ordinary path to 'etc/passwd'",
+ .path = "etc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "etc/passwd", .pass = true },
+ /* Regular symlinks are blocked. */
+ { .name = "[no_symlinks] relative symlink target",
+ .path = "relsym", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] relative symlink component",
+ .path = "reletc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] absolute symlink target",
+ .path = "abssym", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] absolute symlink component",
+ .path = "absetc/passwd", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] cheeky garbage link",
+ .path = "cheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] cheeky absolute + garbage link",
+ .path = "abscheeky/garbagelink", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] cheeky absolute + absolute symlink",
+ .path = "abscheeky/absself", .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ /* Trailing symlinks with O_NOFOLLOW. */
+ { .name = "[no_symlinks] relative symlink with O_NOFOLLOW",
+ .path = "relsym", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "relsym", .pass = true },
+ { .name = "[no_symlinks] absolute symlink with O_NOFOLLOW",
+ .path = "abssym", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "abssym", .pass = true },
+ { .name = "[no_symlinks] trailing symlink with O_NOFOLLOW",
+ .path = "cheeky/garbagelink", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.path = "cheeky/garbagelink", .pass = true },
+ { .name = "[no_symlinks] multiple symlink components with O_NOFOLLOW",
+ .path = "abscheeky/absself", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ { .name = "[no_symlinks] multiple symlink (and garbage link) components with O_NOFOLLOW",
+ .path = "abscheeky/garbagelink", .how.flags = O_NOFOLLOW,
+ .how.resolve = RESOLVE_NO_SYMLINKS,
+ .out.err = -ELOOP, .pass = false },
+ };
+
+ BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_OPATH_TESTS);
+
+ for (int i = 0; i < ARRAY_LEN(tests); i++) {
+ int dfd, fd;
+ char *fdpath = NULL;
+ bool failed;
+ void (*resultfn)(const char *msg, ...) = ksft_test_result_pass;
+ struct basic_test *test = &tests[i];
+
+ if (!openat2_supported) {
+ ksft_print_msg("openat2(2) unsupported\n");
+ resultfn = ksft_test_result_skip;
+ goto skip;
+ }
+
+ /* Auto-set O_PATH. */
+ if (!(test->how.flags & O_CREAT))
+ test->how.flags |= O_PATH;
+
+ if (test->dir)
+ dfd = openat(rootfd, test->dir, O_PATH | O_DIRECTORY);
+ else
+ dfd = dup(rootfd);
+ E_assert(dfd >= 0, "failed to openat root '%s': %m", test->dir);
+
+ E_dup2(dfd, hardcoded_fd);
+
+ fd = sys_openat2(dfd, test->path, &test->how);
+ if (test->pass)
+ failed = (fd < 0 || !fdequal(fd, rootfd, test->out.path));
+ else
+ failed = (fd != test->out.err);
+ if (fd >= 0) {
+ fdpath = fdreadlink(fd);
+ close(fd);
+ }
+ close(dfd);
+
+ if (failed) {
+ resultfn = ksft_test_result_fail;
+
+ ksft_print_msg("openat2 unexpectedly returned ");
+ if (fdpath)
+ ksft_print_msg("%d['%s']\n", fd, fdpath);
+ else
+ ksft_print_msg("%d (%s)\n", fd, strerror(-fd));
+ }
+
+skip:
+ if (test->pass)
+ resultfn("%s gives path '%s'\n", test->name,
+ test->out.path ?: ".");
+ else
+ resultfn("%s fails with %d (%s)\n", test->name,
+ test->out.err, strerror(-test->out.err));
+
+ fflush(stdout);
+ free(fdpath);
+ }
+
+ free(procselfexe);
+ close(rootfd);
+
+ free(hardcoded_fdpath);
+ close(hardcoded_fd);
+}
+
+#define NUM_TESTS NUM_OPENAT2_OPATH_TESTS
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+ ksft_set_plan(NUM_TESTS);
+
+ /* NOTE: We should be checking for CAP_SYS_ADMIN here... */
+ if (geteuid() != 0)
+ ksft_exit_skip("all tests require euid == 0\n");
+
+ test_openat2_opath_tests();
+
+ if (ksft_get_fail_cnt() + ksft_get_error_cnt() > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
--
2.24.1
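
The "bigger struct" cases in the struct test above expect -E2BIG whenever a caller passes a larger struct whose extra bytes are non-zero. A minimal userspace sketch of that size-versioned check follows. It is only a model, not the kernel's implementation: the names `how_base`, `how_bigger`, `copy_extensible` and `check_extensible` are hypothetical, and the zero-fill behaviour for smaller structs and the success case for all-zero extras are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical base layout: the fields this "implementation" knows about. */
struct how_base {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Hypothetical larger layout, as a newer caller might pass. */
struct how_bigger {
	struct how_base inner;
	uint64_t extra;
};

/*
 * Copy a caller-supplied struct of usize bytes into a ksize-byte struct.
 * Returns 0 on success, or -E2BIG if the caller's struct is larger and
 * any of the bytes beyond ksize are non-zero.
 */
int copy_extensible(void *dst, size_t ksize, const void *src, size_t usize)
{
	const unsigned char *bytes = src;
	size_t ncopy = usize < ksize ? usize : ksize;

	/* Larger caller struct: every byte we do not understand must be zero. */
	for (size_t i = ksize; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* Smaller caller struct (assumption): missing fields default to zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, ncopy);
	return 0;
}

/* Zeroed extras are accepted; non-zero extras are rejected with -E2BIG. */
void check_extensible(void)
{
	struct how_base out;
	struct how_bigger in = { .inner.flags = 1 };

	assert(copy_extensible(&out, sizeof(out), &in, sizeof(in)) == 0);
	assert(out.flags == 1);

	in.extra = 0xdeadbeef;
	assert(copy_extensible(&out, sizeof(out), &in, sizeof(in)) == -E2BIG);
}
```

The point of the rule is that an older implementation can safely accept a newer caller as long as the caller is not relying on any field the implementation cannot see.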


2020-01-18 15:30:54

by Al Viro

Subject: Re: [PATCH v3 0/2] openat2: minor uapi cleanups

On Sat, Jan 18, 2020 at 11:07:58PM +1100, Aleksa Sarai wrote:
> Patch changelog:
> v3:
> * Merge changes into the original patches to make Al's life easier.
> [Al Viro]
> v2:
> * Add include <linux/types.h> to openat2.h. [Florian Weimer]
> * Move OPEN_HOW_SIZE_* constants out of UAPI. [Florian Weimer]
> * Switch from __aligned_u64 to __u64 since it isn't necessary.
> [David Laight]
> v1: <https://lore.kernel.org/lkml/[email protected]/>
>
> While openat2(2) is still not yet in Linus's tree, we can take this
> opportunity to iron out some small warts that weren't noticed earlier:
>
> * A fix was suggested by Florian Weimer, to separate the openat2
> definitions so glibc can use the header directly. I've put the
> maintainership under VFS but let me know if you'd prefer it belong
> to the fcntl folks.
>
> * Having heterogeneous field sizes in an extensible struct results in
> "padding hole" problems when adding new fields (in addition, the
> correct error to use for non-zero padding isn't entirely clear).
> The simplest solution is to just copy clone(3)'s model -- always use
> u64s. It will waste a little more space in the struct, but it
> removes a possible future headache.
>
> This patch is intended to replace the corresponding patches in Al's
> #work.openat2 tree (and *will not* apply on Linus' tree).
>
> @Al: I will send some additional patches later, but they will require
> proper design review since they're ABI-related features (namely,
> adding a way to check what features a syscall supports as I
> outlined in my talk here[1]).

#work.openat2 updated, #for-next rebuilt and force-pushed. There's
a massive update of #work.namei as well, also pushed out; not in
#for-next yet, will post the patch series for review later today.
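
The "padding hole" concern in the quoted cover letter can be seen with a small user-space sketch. Both structs below are hypothetical and are not the real openat2 layout; they only contrast mixed field widths with the always-u64 model the cover letter describes. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixed-width layout; NOT the real openat2 struct. */
struct toy_how_mixed {
	uint32_t flags;
	uint16_t mode;
	/* the compiler usually pads here so that 'resolve' is aligned */
	uint64_t resolve;
};

/* Hypothetical all-u64 layout, following the "always use u64s" idea. */
struct toy_how_u64 {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Bytes of padding between 'mode' and 'resolve' in the mixed layout. */
static size_t toy_mixed_hole(void)
{
	return offsetof(struct toy_how_mixed, resolve) -
	       (offsetof(struct toy_how_mixed, mode) + sizeof(uint16_t));
}

/* Bytes of padding anywhere in the all-u64 layout. */
static size_t toy_u64_padding(void)
{
	return sizeof(struct toy_how_u64) - 3 * sizeof(uint64_t);
}
```

On common 64-bit ABIs toy_mixed_hole() reports a non-zero hole, which is exactly the space a later field could end up overlapping, while toy_u64_padding() is zero. The exact hole size is ABI-dependent, which is part of the headache.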

2020-01-18 18:10:57

by Al Viro

Subject: Re: [PATCH v3 0/2] openat2: minor uapi cleanups

On Sat, Jan 18, 2020 at 03:28:33PM +0000, Al Viro wrote:

> #work.openat2 updated, #for-next rebuilt and force-pushed. There's
> a massive update of #work.namei as well, also pushed out; not in
> #for-next yet, will post the patch series for review later today.

BTW, looking through that code again, how could this
static bool legitimize_root(struct nameidata *nd)
{
/*
* For scoped-lookups (where nd->root has been zeroed), we need to
* restart the whole lookup from scratch -- because set_root() is wrong
* for these lookups (nd->dfd is the root, not the filesystem root).
*/
if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED))
return false;

possibly trigger? The only things that ever clean ->root.mnt are

1) failing legitimize_path(nd, &nd->root, nd->root_seq) in
legitimize_root() itself. If *ANY* legitimize_path() has failed,
we are through - RCU pathwalk is given up. In particular, if you
look at the call chains leading to legitimize_root(), you'll see
that it's called by unlazy_walk() or unlazy_child() and failure
has either of those bugger off immediately. The same goes for
their callers; fail any of those and we are done; the very next
thing that will be done with that nameidata is going to be
terminate_walk(). We don't look at its fields, etc. - just return
to the top level ASAP and call terminate_walk() on it. Which is where
we run into
if (nd->flags & LOOKUP_ROOT_GRABBED) {
path_put(&nd->root);
nd->flags &= ~LOOKUP_ROOT_GRABBED;
}
paired with setting LOOKUP_ROOT_GRABBED just before the attempt
to legitimize in legitimize_root(). The next thing *after*
terminate_walk() is either path_init() or the end of life for
that struct nameidata instance.
This is really, really fundamental for understanding the whole
thing - a failure of unlazy_walk/unlazy_child means that we are through
with that attempt.

2) complete_walk() doing
if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
nd->root.mnt = NULL;
Can't happen with LOOKUP_IS_SCOPED in flags, obviously.

3) path_init(). Where it's followed either by leaving through
if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
....
}
(and LOOKUP_IS_SCOPED includes LOOKUP_IN_ROOT) or with a failure exit
(no calls of *anything* but terminate_walk() after that) or with
if (flags & LOOKUP_IS_SCOPED) {
nd->root = nd->path;
... and that makes damn sure nd->root.mnt is not NULL.

And neither of the LOOKUP_IS_SCOPED bits ever gets changed in nd->flags -
they remain as path_init() has set them.

The same, BTW, goes for the check you've added in the beginning of
set_root() - set_root() is called only with NULL nd->root.mnt (trivial to
prove) and that is incompatible with LOOKUP_IS_SCOPED. I'm kinda-sorta
OK with having WARN_ON() there for a while, but IMO the check in the
beginning of legitimize_root() should go away - this kind of defensive
programming only makes it harder to reason about the behaviour of the
entire thing. And fs/namei.c is too convoluted as it is...
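
The pairing described in point 1 (set the "grabbed" flag just before the legitimize attempt, let terminate_walk() do the drop) can be modelled in user space. This is a toy sketch under stated assumptions, not the fs/namei.c code: struct toy_nd, TOY_ROOT_GRABBED and the plain integer refcount are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ROOT_GRABBED 0x1u	/* stands in for LOOKUP_ROOT_GRABBED */

struct toy_nd {
	unsigned int flags;
	int *root_ref;	/* stands in for nd->root; NULL models a cleared root */
};

/*
 * Model of the legitimize step: the flag is set before the attempt, so
 * the cleanup path knows it may have something to drop. A failed attempt
 * clears the root, as the failing legitimize does in the description.
 */
static bool toy_legitimize_root(struct toy_nd *nd, bool attempt_ok)
{
	nd->flags |= TOY_ROOT_GRABBED;
	if (!attempt_ok) {
		nd->root_ref = NULL;
		return false;
	}
	(*nd->root_ref)++;
	return true;
}

/* Model of the cleanup: drop whatever was grabbed, then clear the flag. */
static void toy_terminate_walk(struct toy_nd *nd)
{
	if (nd->flags & TOY_ROOT_GRABBED) {
		if (nd->root_ref)
			(*nd->root_ref)--;
		nd->flags &= ~TOY_ROOT_GRABBED;
	}
}

/* One lookup attempt; returns the reference count left behind. */
static int toy_attempt(bool attempt_ok)
{
	int refs = 0;
	struct toy_nd nd = { .flags = 0, .root_ref = &refs };

	if (!toy_legitimize_root(&nd, attempt_ok)) {
		/* failure: nothing else looks at nd, go straight to cleanup */
		toy_terminate_walk(&nd);
		return refs;
	}
	/* ... a real walk would continue in ref-walk mode here ... */
	toy_terminate_walk(&nd);
	assert(nd.flags == 0);
	return refs;
}
```

Both toy_attempt(true) and toy_attempt(false) leave the count balanced, and in the failure case nothing inspects the cleared root before cleanup, which is the property the argument above relies on.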

2020-01-18 23:04:35

by Aleksa Sarai

Subject: Re: [PATCH v3 0/2] openat2: minor uapi cleanups

On 2020-01-18, Al Viro <[email protected]> wrote:
> On Sat, Jan 18, 2020 at 03:28:33PM +0000, Al Viro wrote:
>
> > #work.openat2 updated, #for-next rebuilt and force-pushed. There's
> > a massive update of #work.namei as well, also pushed out; not in
> > #for-next yet, will post the patch series for review later today.
>
> BTW, looking through that code again, how could this
> static bool legitimize_root(struct nameidata *nd)
> {
> /*
> * For scoped-lookups (where nd->root has been zeroed), we need to
> * restart the whole lookup from scratch -- because set_root() is wrong
> * for these lookups (nd->dfd is the root, not the filesystem root).
> */
> if (!nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED))
> return false;
>
> possibly trigger? The only things that ever clean ->root.mnt are

You're quite right -- the codepath I was worried about was pick_link()
failing (which *does* clear nd->path.mnt, and I must've misread it at
the time as nd->root.mnt).

We can drop this check, though now complete_walk()'s main defence
against a NULL nd->root.mnt is that path_is_under() will fail and
trigger -EXDEV (or set_root() will fail at some point in the future).
However, as you pointed out, a NULL nd->root.mnt won't happen with
things as they stand today -- I might be a little too paranoid. :P

> This is really, really fundamental for understanding the whole
> thing - a failure of unlazy_walk/unlazy_child means that we are through
> with that attempt.

Yup -- see above, the worry was about pick_link() not about how the
RCU-walk and REF-walk dances operate.

> The same, BTW, goes for the check you've added in the beginning of
> set_root() - set_root() is called only with NULL nd->root.mnt (trivial to
> prove) and that is incompatible with LOOKUP_IS_SCOPED. I'm kinda-sorta
> OK with having WARN_ON() there for a while, but IMO the check in the
> beginning of legitimize_root() should go away -

You're quite right about dropping the legitimize_root() check, but I'd
like to keep the WARN_ON() in set_root(). The main reason being that it
makes us very damn sure that a future change won't accidentally break
the nd->root contract which all of the LOOKUP_IS_SCOPED changes rely on.
Then again, this might be my paranoia popping up again.

> this kind of defensive programming only makes it harder to reason about
> the behaviour of the entire thing. And fs/namei.c is too convoluted
> as it is...

If you feel that dropping some of these more defensive checks is better
for the codebase as a whole, then I defer to your judgement. I
completely agree that namei is a pretty complicated chunk of code.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-01-19 01:14:57

by Al Viro

Subject: Re: [PATCH v3 0/2] openat2: minor uapi cleanups

On Sun, Jan 19, 2020 at 10:03:13AM +1100, Aleksa Sarai wrote:

> > possibly trigger? The only things that ever clean ->root.mnt are
>
> You're quite right -- the codepath I was worried about was pick_link()
> failing (which *does* clear nd->path.mnt, and I must've misread it at
> the time as nd->root.mnt).

pick_link() (allocation failure of external stack in RCU case, followed
by failure to legitimize the link) is, unfortunately, subtle and nasty.
We *must* path_put() the link; if we'd managed to legitimize the mount
and failed on dentry, the mount needs to be dropped. No way around it.
And while everything else there can be left for soon-to-be-reached
terminate_walk(), this cannot. We have no good way to pass what
we need to drop to the place where that eventual terminate_walk()
drops rcu_read_lock(). So we end up having to do what terminate_walk()
would've done and do it right there, so we could do that path_put(link)
before we bugger off.

I'm not happy about that, but I don't see cleaner solutions, more's the
pity. However, it doesn't mess with ->root - nor should it, since
we don't have LOOKUP_ROOT_GRABBED (not in RCU mode), so it can and
should be left alone.

> We can drop this check, though now complete_walk()'s main defence
> against a NULL nd->root.mnt is that path_is_under() will fail and
> trigger -EXDEV (or set_root() will fail at some point in the future).
> However, as you pointed out, a NULL nd->root.mnt won't happen with
> things as they stand today -- I might be a little too paranoid. :P

The only reason why complete_walk() zeroes nd->root in some cases is
microoptimization - we *know* we won't be using it later, so we don't
care whether it's stale or not and can spare unlazy_walk() a bit of
work. All there is to that one.

I don't see any reason for adding code that would clear nd->root in later
work; if such thing does get added (again, I don't see what purpose
could that possibly serve), we'll need to watch out for a lot of things.
Starting with LOOKUP_ROOT case... It's not something likely to slip
in unnoticed.

2020-01-19 03:16:10

by Al Viro

Subject: [RFC][PATCHSET][CFT] pathwalk cleanups and fixes

OK, vfs.git #work.namei seems to survive xfstests. I think
it cleans the things quite a bit, but it obviously needs more
review and testing.

Review and testing would be _very_ welcome; it does a lot
of massage, so there had been plenty of opportunities to fuck up
and fail to spot that. The same goes for profiling - it doesn't
seem to slow the things down, but that needs to be verified.

It does include #work.openat2. Topology: 17 commits, followed
by clean merge with #work.openat2, followed by 9 followups. The
part in #work.openat2 is as posted by Aleksa; I can repost it, but
I don't see much point. Description of the rest follows; patches
themselves will be in followups.

part 1: follow_automount() cleanups and fixes.

Quite a bit of that function had been about working around the
wrong calling conventions of finish_automount(). The problem is that
finish_automount() misuses the primitive intended for mount(2) and
friends, where we want to mount on top of the pile, even if something
has managed to add to that while we'd been trying to lock the namespace.
For automount that's not the right thing to do - there we want to discard
whatever it was going to attach and just cross into what got mounted
there in the meanwhile (most likely - the results of the same automount
triggered by somebody else). Current mainline kinda-sorta manages to do
that, but it's unreliable and very convoluted. Much simpler approach
is to stop using lock_mount() in finish_automount() and have it bail
out if something turns out to have been mounted on top where we wanted
to attach. That allows us to get rid of a lot of PITA in the caller.
Another simplification comes from not trying to cross into the results
of automount - simply ride through the next iteration of the loop and
let it move into overmount.

Another thing in the same series is divorcing follow_automount()
from nameidata; that'll play later when we get to unifying follow_down()
with the guts of follow_managed().

4 commits, the second one fixes a hard-to-hit race. The first
is a prereq for it.

1/17 do_add_mount(): lift lock_mount/unlock_mount into callers
2/17 fix automount/automount race properly
3/17 follow_automount(): get rid of dead^Wstillborn code
4/17 follow_automount() doesn't need the entire nameidata
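
To make the part 1 behaviour change concrete, here is a toy user-space model of the two policies: mount(2)-style "mount on top of the pile" versus the automount policy of discarding the new mount when something is already there. Everything in it (struct toy_mountpoint, the function names, the string labels) is hypothetical, and it deliberately ignores locking and the -EBUSY check.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_STACK 4

/* Toy mountpoint: a small pile of mounts, top of the pile is last. */
struct toy_mountpoint {
	const char *stack[TOY_MAX_STACK];
	int depth;
};

/* mount(2)-style policy: always attach on top of whatever is there. */
static int toy_mount_on_top(struct toy_mountpoint *mp, const char *what)
{
	if (mp->depth == TOY_MAX_STACK)
		return -1;
	mp->stack[mp->depth++] = what;
	return 0;
}

/*
 * Automount policy from the description: if something got mounted here
 * in the meanwhile, quietly discard 'what' and report success; the
 * caller's loop will cross into the existing mount on its next pass.
 */
static int toy_finish_automount(struct toy_mountpoint *mp, const char *what)
{
	if (mp->depth > 0)
		return 0;
	return toy_mount_on_top(mp, what);
}

/* Two racers hit the same point; returns the resulting pile depth. */
static int toy_race_depth(int discard_on_overmount)
{
	struct toy_mountpoint mp = { .depth = 0 };
	const char *racers[] = { "nfs (thread A)", "nfs (thread B)" };
	size_t i;

	for (i = 0; i < 2; i++) {
		int err = discard_on_overmount ?
			toy_finish_automount(&mp, racers[i]) :
			toy_mount_on_top(&mp, racers[i]);
		assert(err == 0);
	}
	return mp.depth;
}
```

With the discard policy the pile ends up one deep regardless of who wins; with the stacking policy the second racer's copy lands on top of the first, which is the shape of the problem the series is getting rid of.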

part 2: unifying mount traversals in pathwalk.

Handling of mount traversal (follow_managed()) is currently called
in a bunch of places. Each of them is shortly followed by a call of
step_into() or an open-coded equivalent thereof. However, the locations
of those step_into() calls are far from preceding follow_managed();
moreover, that preceding call might happen on different paths that
converge to given step_into() call. It's harder to analyse than it should
be (especially when it comes to liveness analysis) and it forces rather
ugly calling conventions on lookup_fast()/atomic_open()/lookup_open().
The series below massages the code to the point when the calls of
follow_managed() (and __follow_mount_rcu()) move into the beginning of
step_into().

5/17 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
gets EEXIST handling in do_last() past the step_into() call there.
6/17 handle_mounts(): start building a sane wrapper for follow_managed()
rather than mangling follow_managed() itself (and creating conflicts
with openat2 series), add a wrapper that will absorb the required
interface changes.
7/17 atomic_open(): saner calling conventions (return dentry on success)
struct path passed to it is pure out parameter; only dentry part
ever varies, though - mnt is always nd->path.mnt. Just return
the dentry on success, and ERR_PTR(-E...) on failure.
8/17 lookup_open(): saner calling conventions (return dentry on success)
propagate the same change one level up the call chain.
9/17 do_last(): collapse the call of path_to_nameidata()
struct path filled in lookup_open() call is eventually given to
handle_mounts(); the only use it has before that is path_to_nameidata()
call in "->atomic_open() has actually opened it" case, and there
path_to_nameidata() is an overkill - we are guaranteed to replace
only nd->path.dentry. So have the struct path filled only immediately
prior to handle_mounts().
10/17 handle_mounts(): pass dentry in, turn path into a pure out argument
now all callers of handle_mounts() are directly preceded by filling
struct path it gets. path->mnt is nd->path.mnt in all cases, so we can
pass just the dentry instead and fill path in handle_mounts() itself.
Some boilerplate gone, path is pure out argument of handle_mounts()
now.
11/17 lookup_fast(): consolidate the RCU success case
massage to gather what will become an RCU case equivalent of
handle_mounts(); basically, that's what we do if revalidate succeeds
in RCU case of lookup_fast(), including unlazy and fallback to
handle_mounts() if __follow_mount_rcu() says "it's too tricky".
12/17 teach handle_mounts() to handle RCU mode
... and take that into handle_mounts() itself. The other caller of
__follow_mount_rcu() is fine with the same fallback (it just didn't
bother since it's in the very beginning of pathwalk), switched to
handle_mounts() as well.
13/17 lookup_fast(): take mount traversal into callers
Now we are getting somewhere - both RCU and non-RCU success cases of
lookup_fast() are ended with the same return handle_mounts(...);
move that to the callers - there it will merge with the identical calls
that had been on the paths where we had to do slow lookups.
lookup_fast() returns dentry now.
14/17 new step_into() flag: WALK_NOFOLLOW
use step_into() instead of open-coding it in handle_lookup_down().
Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for
that (and eventually, I hope, for .. handling).
Now *all* calls of handle_mounts() and step_into() are right next to
each other.
15/17 fold handle_mounts() into step_into()
... and we can move the call of handle_mounts() into step_into(),
getting slightly saner calling conventions out of that.
16/17 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
another payoff from 14/17 - we can teach path_lookupat() to do
what path_mountpointat() used to. And kill the latter, along with
its wrappers.
17/17 expand the only remaining call of path_lookup_conditional()
minor cleanup - RIP path_lookup_conditional(). Only one caller left.

At that point we run out of things that can be done without textual conflicts
with openat2 series. Changes so far:
* mount traversal is taken into step_into().
* lookup_fast(), atomic_open() and lookup_open() calling conventions
are slightly changed. All of them return dentry now, instead of returning
an int and filling struct path on success. For lookup_fast() the old
"0 for cache miss, 1 for cache hit" is replaced with "NULL stands for cache
miss, dentry - for hit".
* step_into() can be called in RCU mode as well. Takes nameidata,
WALK_... flags, dentry and, in RCU case, corresponding inode and seq value.
Handles mount traversals, decides whether it's a symlink to be followed.
Error => returns -E...; symlink to follow => returns 1, puts symlink on stack;
non-symlink or symlink not to follow => returns 0, moves nd->path to new location.
* LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends
became calls of user_path_at() et al. with LOOKUP_MOUNTPOINT in flags.

Next comes the merge with Aleksa's openat2 patchset; everything up to that point
had been non-conflicting with it. That patchset has been posted earlier;
it's in #work.openat2. The next series comes on top of the merge.

part 3: untangling the symlink handling.

Right now when we decide to follow a symlink it happens this way:
* step_into() decides that it has been given a symlink that needs to
be followed.
* it calls pick_link(), which pushes the symlink on stack and
returns 1 on success / -E... on error. Symlink's mount/dentry/seq is
stored on stack and the inode is stashed in nd->link_inode.
* step_into() passes that 1 to its callers, which proceed to pass it
up the call chain for several layers. In all cases we get to get_link()
call shortly afterwards.
* get_link() is called, picks the inode stashed in nd->link_inode
by the pick_link(), does some checks, touches the atime, etc.
* get_link() either picks the link body out of inode or calls
->get_link(). If it's an absolute symlink, we move to the root and return
the relative portion of the body; if it's a relative one - just return the
body. If it's a procfs-style one, the call of nd_jump_link() has been
made and we'd moved to whatever location is desired. And return NULL,
same as we do for symlink to "/".
* the caller proceeds to deal with the string returned to it.

The sequence is the same in all cases (nested symlink, trailing
symlink on lookup, trailing symlink on open), but its pieces are not close
to each other and the bit between the call of pick_link() and (inevitable)
call of get_link() afterwards is not easy to follow. Moreover, a bunch
of functions (walk_component/lookup_last/do_last) ends up with the same
conventions for return values as step_into(). And those conventions
(see above) are not pretty - 0/1/-E... is asking for mistakes, especially
when returned 1 is used only to direct control flow on a rather twisted
way to matching get_link() call. And that path can be seriously twisted.
E.g. when we are trying to open /dev/stdin, we get the following sequence:
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast(). Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing. Into the stack it goes and we return 1.
* do_last() sees 1 and returns it.
* trailing_symlink() is called (in the top-level loop) and it
calls get_link(). OK, we get "/proc/self/fd/0" for body, move to
root again and return "proc/self/fd/0".
* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get
to look at what we'd been given. A symlink to follow, so on the stack
it goes and we return 1.
* again, do_last() passes 1 to caller
* trailing_symlink() is called and calls get_link().
* this time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin. And returns NULL. But the
fun doesn't stop yet.
* trailing_symlink() returns "" to the caller
* link_path_walk() is called on that and does nothing
whatsoever.
* do_last() is called and sees LAST_BIND left by the get_link().
It calls handle_dots()
* handle_dots() drops the symlink from stack and returns
* do_last() *FINALLY* proceeds to the point after its call of
step_into() (finish_open:) and gets around to opening the damn thing.

Making sense of the control flow through all of that is not fun,
to put it mildly; debugging anything in that area can be a massive PITA,
and this example has touched only one of 3 cases. Arguably, the worst
one, but... Anyway, it turns out that this code can be massaged to
considerably saner shape - both in terms of control flow and wrt calling
conventions.

1/9 merging pick_link() with get_link(), part 1
prep work: move the "hardening" crap from trailing_symlink() into
get_link() (conditional on the absence of LOOKUP_PARENT in nd->flags).
We'll be moving the calls of get_link() around quite a bit through that
series, and the next step will be to eliminate trailing_symlink().
2/9 merging pick_link() with get_link(), part 2
fold trailing_symlink() into lookup_last() and do_last().
Now these are returning strings; it's not the final calling conventions,
but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) =>
old -E..., we'd failed. string => old 1, and the string is the symlink
body to follow. Just as for trailing_symlink(), "/" and procfs ones
(where get_link() returns NULL) yield "", so the ugly song and dance
with no-op trip through link_path_walk()/handle_dots() still remains.
3/9 merging pick_link() with get_link(), part 3
elimination of that round-trip. In *all* cases having
get_link() return NULL on such symlinks means that we'll proceed to
drop the symlink from stack and get back to the point near that
get_link() call - basically, where we would be if it hadn't been
a symlink at all. The path by which we are getting there depends
upon the call site; the end result is the same in all cases - such
symlinks (procfs ones and symlink to "/") are fully processed by
the time get_link() returns, so we could as well drop them from the
stack right in get_link(). Makes life simpler in terms of control
flow analysis...
And now the calling conventions for do_last() and lookup_last()
have reached the final shape - ERR_PTR(-E...) for error, NULL for
"we are done", string for "traverse this".
4/9 merging pick_link() with get_link(), part 4
now all calls of walk_component() are followed by the same
boilerplate - "if it has returned 1, call get_link() and if that
has returned NULL treat that as if walk_component() has returned 0".
Eliminate by folding that into walk_component() itself. Now
walk_component() return value conventions have joined those of
do_last()/lookup_last().
5/9 merging pick_link() with get_link(), part 5
same as for the previous, only this time the boilerplate
migrates one level down, into step_into(). Only one caller of
get_link() left, step_into() has joined the same return value
conventions.
6/9 merging pick_link() with get_link(), part 6
move that thing into pick_link(). Now all traces of
"return 1 if we are following a symlink" are gone.
7/9 finally fold get_link() into pick_link()
ta-da - expand get_link() into the only caller. As a side
benefit, we get rid of stashing the inode in nd->link_inode - it
was done only to carry that piece of information from pick_link()
to eventual get_link(). That's not the main benefit, though - the
control flow became considerably easier to reason about.

For what it's worth, the example above (/dev/stdin) becomes
* path_init() has put us into root and returned "/dev/stdin"
* link_path_walk() has eventually reached /dev and left
<LAST_NORM, "stdin"> in nd->last_type/nd->last
* we call do_last(), which sees that we have LAST_NORM and calls
lookup_fast(). Let's assume that everything is in dcache; we get the
dentry of /dev/stdin and proceed to finish_lookup:, where we call step_into()
* it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
damn thing. On the stack it goes and we get its body. Which is
"/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
* do_last() sees non-NULL and returns it - whether it's an error
or a pathname to traverse, we hadn't reached something we'll be opening.
* link_path_walk() is given that string, eventually leading us into
/proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
* do_last() is called, and similar to the previous case we
eventually reach the call of step_into() with dentry of /proc/self/fd/0.
* _now_ we can discard /dev/stdin from the stack (we'd been
using its body until now). It's dropped (from step_into()) and we get
to look at what we'd been given. A symlink to follow, so on the stack
it goes. This time it's a procfs symlink and its ->get_link() method
moves us to the mount/dentry of our stdin. And returns NULL. So we
drop symlink from stack and return that NULL to caller.
* that NULL is returned by step_into(), same as if we had just
moved to a non-symlink.
* do_last() proceeds to open the damn thing.
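
The reworked flow above can be sketched as a toy user-space loop using the same return convention: an error pointer for failure, NULL for "we are done", a string for "traverse this". This is an illustration only. The toy_* helpers imitate the kernel's error-pointer encoding rather than using it, the lookup table is invented, and "/dev/pts/0" is an assumed stand-in for wherever stdin really points.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_ERRNO	4095
#define TOY_MAXSYMLINKS	40

/* User-space imitation of error pointers (hypothetical helpers). */
static const char *toy_err_ptr(long err) { return (const char *)(intptr_t)err; }
static long toy_ptr_err(const char *p) { return (long)(intptr_t)p; }
static int toy_is_err(const char *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

enum toy_kind { TOY_FILE, TOY_SYMLINK, TOY_JUMP };

struct toy_entry {
	const char *name;
	enum toy_kind kind;
	const char *target;	/* link body, or jump destination */
};

/* Invented namespace: one ordinary symlink, one procfs-style link. */
static const struct toy_entry toy_fs[] = {
	{ "/dev/stdin",      TOY_SYMLINK, "/proc/self/fd/0" },
	{ "/proc/self/fd/0", TOY_JUMP,    "/dev/pts/0" },
	{ "/dev/pts/0",      TOY_FILE,    NULL },
	{ "/loop",           TOY_SYMLINK, "/loop" },
};

/* error pointer => failed, NULL => moved to *where, string => traverse it */
static const char *toy_step_into(const char *name, const char **where)
{
	size_t i;

	for (i = 0; i < sizeof(toy_fs) / sizeof(toy_fs[0]); i++) {
		const struct toy_entry *e = &toy_fs[i];

		if (strcmp(e->name, name) != 0)
			continue;
		if (e->kind == TOY_SYMLINK)
			return e->target;	/* body to traverse */
		/* jump-style link or plain file: fully processed here */
		*where = (e->kind == TOY_JUMP) ? e->target : e->name;
		return NULL;
	}
	return toy_err_ptr(-ENOENT);
}

/* Top-level loop: keep traversing until a step says "done" or fails. */
static const char *toy_resolve(const char *name)
{
	const char *where = NULL;
	int links = 0;

	while (name) {
		name = toy_step_into(name, &where);
		if (toy_is_err(name))
			return name;
		if (name && ++links > TOY_MAXSYMLINKS)
			return toy_err_ptr(-ELOOP);
	}
	return where;
}
```

Note that the procfs-style entry never hands an empty string back to the loop; it is fully handled inside the step, which mirrors the elimination of the no-op round trip in 3/9. toy_ptr_err() recovers the errno from a failed toy_resolve() call.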

part 4. some mount traversal cleanups.

8/9 massage __follow_mount_rcu() a bit
make it more similar to non-RCU counterpart
9/9 new helper: traverse_mounts()
the guts of follow_managed() are very similar to
follow_down(). The calling conventions are different (follow_managed()
works with nameidata, follow_down() - with standalone struct path),
but the core loop is pretty much the same in both. Turned that loop
into a common helper (traverse_mounts()) and since follow_managed()
becomes a very thin wrapper around it, expand follow_managed() at its
only call site (in handle_mounts()).
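
The shape of 9/9 (one core loop, two thin wrappers with different calling conventions) can be sketched like this. It is a toy model with invented types: struct toy_loc, struct toy_path and struct toy_nd are stand-ins, and the loop ignores automounts, managed dentries and RCU entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Toy location: at most one thing mounted on top of it. */
struct toy_loc {
	const char *name;
	struct toy_loc *overmount;	/* what is mounted here, or NULL */
};

struct toy_path { struct toy_loc *loc; };		/* standalone path */
struct toy_nd { struct toy_path path; unsigned int flags; };	/* walk state */

/* Shared core loop: keep crossing into whatever is mounted on top. */
static int toy_traverse_mounts(struct toy_path *path)
{
	int crossed = 0;

	while (path->loc->overmount) {
		path->loc = path->loc->overmount;
		crossed++;
	}
	return crossed;
}

/* Wrapper for callers holding a standalone path. */
static int toy_follow_down(struct toy_path *path)
{
	return toy_traverse_mounts(path);
}

/* Wrapper for callers holding nameidata-like walk state. */
static int toy_handle_mounts(struct toy_nd *nd)
{
	return toy_traverse_mounts(&nd->path);
}

/* Returns the number of crossings, or -1 if we did not end on the top. */
static int toy_demo(int via_nd)
{
	struct toy_loc top = { "top", NULL };
	struct toy_loc mid = { "mid", &top };
	struct toy_loc base = { "base", &mid };
	int crossed;

	if (via_nd) {
		struct toy_nd nd = { .path = { &base }, .flags = 0 };

		crossed = toy_handle_mounts(&nd);
		return nd.path.loc == &top ? crossed : -1;
	} else {
		struct toy_path path = { &base };

		crossed = toy_follow_down(&path);
		return path.loc == &top ? crossed : -1;
	}
}
```

Both entry points end up at the same place with the same number of crossings, since the only difference between them is how the path is handed to the shared loop.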

That's where the series stands right now. FWIW, at 5.5-rc1 fs/namei.c
had been 4867 lines, at the tip of #work.openat2 - 4998, at the
tip of #work.namei (containing #work.openat2) - 4730... And IMO
the thing has become considerably easier to follow.

What's more, it might be possible to untangle the control flow in
do_last() now. Probably a separate series, though - do_last() is
one hell of a tarpit, so I'm not stepping into it for the rest
of this cycle...

2020-01-19 03:19:27

by Al Viro

Subject: [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers

From: Al Viro <[email protected]>

preparation to finish_automount() fix (next commit)

Signed-off-by: Al Viro <[email protected]>
---
fs/namespace.c | 47 ++++++++++++++++++++++++-----------------------
1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2fd0c8bcb8c1..5f0a80f17651 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2697,45 +2697,32 @@ static int do_move_mount_old(struct path *path, const char *old_name)
/*
* add a mount into a namespace's mount tree
*/
-static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
+static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
+ struct path *path, int mnt_flags)
{
- struct mountpoint *mp;
- struct mount *parent;
- int err;
+ struct mount *parent = real_mount(path->mnt);

mnt_flags &= ~MNT_INTERNAL_FLAGS;

- mp = lock_mount(path);
- if (IS_ERR(mp))
- return PTR_ERR(mp);
-
- parent = real_mount(path->mnt);
- err = -EINVAL;
if (unlikely(!check_mnt(parent))) {
/* that's acceptable only for automounts done in private ns */
if (!(mnt_flags & MNT_SHRINKABLE))
- goto unlock;
+ return -EINVAL;
/* ... and for those we'd better have mountpoint still alive */
if (!parent->mnt_ns)
- goto unlock;
+ return -EINVAL;
}

/* Refuse the same filesystem on the same mount point */
- err = -EBUSY;
if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
path->mnt->mnt_root == path->dentry)
- goto unlock;
+ return -EBUSY;

- err = -EINVAL;
if (d_is_symlink(newmnt->mnt.mnt_root))
- goto unlock;
+ return -EINVAL;

newmnt->mnt.mnt_flags = mnt_flags;
- err = graft_tree(newmnt, parent, mp);
-
-unlock:
- unlock_mount(mp);
- return err;
+ return graft_tree(newmnt, parent, mp);
}

static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags);
@@ -2748,6 +2735,7 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
unsigned int mnt_flags)
{
struct vfsmount *mnt;
+ struct mountpoint *mp;
struct super_block *sb = fc->root->d_sb;
int error;

@@ -2768,7 +2756,13 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,

mnt_warn_timestamp_expiry(mountpoint, mnt);

- error = do_add_mount(real_mount(mnt), mountpoint, mnt_flags);
+ mp = lock_mount(mountpoint);
+ if (IS_ERR(mp)) {
+ mntput(mnt);
+ return PTR_ERR(mp);
+ }
+ error = do_add_mount(real_mount(mnt), mp, mountpoint, mnt_flags);
+ unlock_mount(mp);
if (error < 0)
mntput(mnt);
return error;
@@ -2830,6 +2824,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int finish_automount(struct vfsmount *m, struct path *path)
{
struct mount *mnt = real_mount(m);
+ struct mountpoint *mp;
int err;
/* The new mount record should have at least 2 refs to prevent it being
* expired before we get a chance to add it
@@ -2842,7 +2837,13 @@ int finish_automount(struct vfsmount *m, struct path *path)
goto fail;
}

- err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+ mp = lock_mount(path);
+ if (IS_ERR(mp)) {
+ err = PTR_ERR(mp);
+ goto fail;
+ }
+ err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
+ unlock_mount(mp);
if (!err)
return 0;
fail:
--
2.20.1

2020-01-19 03:19:57

by Al Viro

Subject: [PATCH 02/17] fix automount/automount race properly

From: Al Viro <[email protected]>

Protection against automount/automount races (two threads hitting the same
referral point at the same time) is based upon do_add_mount() prevention of
identical overmounts - trying to overmount the root of mounted tree with
the same tree fails with -EBUSY. It's unreliable (the other thread might've
mounted something on top of the automount it has triggered) *and* causes
no end of headache for follow_automount() and its caller, since
finish_automount() behaves like do_new_mount() - if the mountpoint to be is
overmounted, it mounts on top of what's overmounting it. It's not only wrong
(we want to go into what's overmounting the automount point and quietly
discard what we planned to mount there), it introduces the possibility of
original parent mount getting dropped. That's what 8aef18845266 (VFS: Fix
vfsmount overput on simultaneous automount) deals with, but it can't do
anything about the reliability of conflict detection - if something had
been mounted over the other thread's automount (e.g. that other thread
having stepped into automount in mount(2)), we don't get that -EBUSY and
the result is
referral point under automounted NFS under explicit overmount
under another copy of automounted NFS

What we need is finish_automount() *NOT* digging into overmounts - if it
finds one, it should just quietly discard the thing it was asked to mount.
And don't bother with actually crossing into the results of finish_automount() -
the same loop that calls follow_automount() will do that just fine on the
next iteration.

IOW, instead of calling lock_mount() have finish_automount() do it manually,
_without_ the "move into overmount and retry" part. And leave crossing into
the results to the caller of follow_automount(), which simplifies it a lot.

Moral: if you end up with a lot of glue working around the calling conventions
of something, perhaps these calling conventions are simply wrong...

Fixes: 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount)
Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 29 ++++-------------------------
fs/namespace.c | 41 ++++++++++++++++++++++++++++++++++-------
2 files changed, 38 insertions(+), 32 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d2720dc71d0e..bd036dfdb0d9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1133,11 +1133,9 @@ EXPORT_SYMBOL(follow_up);
* - return -EISDIR to tell follow_managed() to stop and return the path we
* were called with.
*/
-static int follow_automount(struct path *path, struct nameidata *nd,
- bool *need_mntput)
+static int follow_automount(struct path *path, struct nameidata *nd)
{
struct vfsmount *mnt;
- int err;

if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
return -EREMOTE;
@@ -1178,29 +1176,10 @@ static int follow_automount(struct path *path, struct nameidata *nd,
return PTR_ERR(mnt);
}

- if (!mnt) /* mount collision */
- return 0;
-
- if (!*need_mntput) {
- /* lock_mount() may release path->mnt on error */
- mntget(path->mnt);
- *need_mntput = true;
- }
- err = finish_automount(mnt, path);
-
- switch (err) {
- case -EBUSY:
- /* Someone else made a mount here whilst we were busy */
+ if (!mnt)
return 0;
- case 0:
- path_put(path);
- path->mnt = mnt;
- path->dentry = dget(mnt->mnt_root);
- return 0;
- default:
- return err;
- }

+ return finish_automount(mnt, path);
}

/*
@@ -1258,7 +1237,7 @@ static int follow_managed(struct path *path, struct nameidata *nd)

/* Handle an automount point */
if (flags & DCACHE_NEED_AUTOMOUNT) {
- ret = follow_automount(path, nd, &need_mntput);
+ ret = follow_automount(path, nd);
if (ret < 0)
break;
continue;
diff --git a/fs/namespace.c b/fs/namespace.c
index 5f0a80f17651..f1817eb5f87d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2823,6 +2823,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,

int finish_automount(struct vfsmount *m, struct path *path)
{
+ struct dentry *dentry = path->dentry;
struct mount *mnt = real_mount(m);
struct mountpoint *mp;
int err;
@@ -2832,21 +2833,47 @@ int finish_automount(struct vfsmount *m, struct path *path)
BUG_ON(mnt_get_count(mnt) < 2);

if (m->mnt_sb == path->mnt->mnt_sb &&
- m->mnt_root == path->dentry) {
+ m->mnt_root == dentry) {
err = -ELOOP;
- goto fail;
+ goto discard;
}

- mp = lock_mount(path);
+ /*
+ * we don't want to use lock_mount() - in this case finding something
+ * that overmounts our mountpoint to be means "quietly drop what we've
+ * got", not "try to mount it on top".
+ */
+ inode_lock(dentry->d_inode);
+ if (unlikely(cant_mount(dentry))) {
+ err = -ENOENT;
+ goto discard1;
+ }
+ namespace_lock();
+ rcu_read_lock();
+ if (unlikely(__lookup_mnt(path->mnt, dentry))) {
+ rcu_read_unlock();
+ err = 0;
+ goto discard2;
+ }
+ rcu_read_unlock();
+ mp = get_mountpoint(dentry);
if (IS_ERR(mp)) {
err = PTR_ERR(mp);
- goto fail;
+ goto discard2;
}
+
err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
unlock_mount(mp);
- if (!err)
- return 0;
-fail:
+ if (unlikely(err))
+ goto discard;
+ mntput(m);
+ return 0;
+
+discard2:
+ namespace_unlock();
+discard1:
+ inode_unlock(dentry->d_inode);
+discard:
/* remove m from any expiration list it may be on */
if (!list_empty(&mnt->mnt_expire)) {
namespace_lock();
--
2.20.1
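
The staged discard labels in the finish_automount() hunk above can be modelled in userspace. The following is only a sketch under stated assumptions: struct demo_state and demo_finish() are hypothetical names, and plain counters stand in for the inode lock and the namespace lock. It shows that "found an overmount" returns 0 while still dropping the new mount and releasing both locks in reverse order.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bookkeeping; counters stand in for the real locks. */
struct demo_state {
	int inode_locked;	/* models inode_lock()/inode_unlock() */
	int ns_locked;		/* models namespace_lock()/namespace_unlock() */
	int discarded;		/* set when the new mount is dropped */
	int mounted;		/* set when the new mount is attached */
};

/*
 * Model of the control flow: 'dead' plays the role of cant_mount(),
 * 'overmounted' the role of __lookup_mnt() finding something on top.
 */
static int demo_finish(struct demo_state *s, bool dead, bool overmounted)
{
	int err;

	assert(s);
	s->inode_locked++;
	if (dead) {
		err = -ENOENT;
		goto discard1;
	}
	s->ns_locked++;
	if (overmounted) {
		err = 0;	/* not an error: quietly drop our mount */
		goto discard2;
	}
	s->mounted = 1;
	s->ns_locked--;
	s->inode_locked--;
	return 0;

discard2:
	s->ns_locked--;
discard1:
	s->inode_locked--;
	s->discarded = 1;
	return err;
}
```

In this model the caller cannot tell "mounted" from "somebody else got there first" by the return value alone, which matches the intent of leaving the crossing to the loop in the caller.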

2020-01-19 03:20:59

by Al Viro

Subject: [PATCH 05/17] make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW

From: Al Viro <[email protected]>

O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
As it is, we might or might not have LOOKUP_FOLLOW in op->intent
in that case - that depends upon having O_NOFOLLOW in open flags.
It doesn't matter, since we won't be checking it in that case -
do_last() bails out earlier.

However, making sure it's not set (i.e. acting as if we had an explicit
O_NOFOLLOW) makes the behaviour more explicit and allows us to reorder the
check for O_CREAT | O_EXCL in do_last() with the call of step_into()
immediately following it.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 15 +++++----------
fs/open.c | 4 +++-
2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 3b6f60c02f8a..c19b458f66da 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3262,22 +3262,17 @@ static int do_last(struct nameidata *nd,
if (unlikely(error < 0))
return error;

- /*
- * create/update audit record if it already exists.
- */
- audit_inode(nd->name, path.dentry, 0);
-
- if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
- path_to_nameidata(&path, nd);
- return -EEXIST;
- }
-
seq = 0; /* out of RCU mode, so the value doesn't matter */
inode = d_backing_inode(path.dentry);
finish_lookup:
error = step_into(nd, &path, 0, inode, seq);
if (unlikely(error))
return error;
+
+ if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
+ audit_inode(nd->name, nd->path.dentry, 0);
+ return -EEXIST;
+ }
finish_open:
/* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */
error = complete_walk(nd);
diff --git a/fs/open.c b/fs/open.c
index b62f5c0923a8..ba7009a5dd1a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1014,8 +1014,10 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o

if (flags & O_CREAT) {
op->intent |= LOOKUP_CREATE;
- if (flags & O_EXCL)
+ if (flags & O_EXCL) {
op->intent |= LOOKUP_EXCL;
+ flags |= O_NOFOLLOW;
+ }
}

if (flags & O_DIRECTORY)
--
2.20.1
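
The userspace-visible rule this patch relies on (O_CREAT | O_EXCL fails with EEXIST on a trailing symlink, even a dangling one) can be checked with a small program. This is a minimal sketch, assuming a writable /tmp; excl_open_on_symlink() is a hypothetical helper name, not anything from the series.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create a dangling symlink in a fresh temporary directory, try to open
 * it with O_CREAT | O_EXCL, clean up, and return the errno from open().
 * Returns 0 if open() unexpectedly succeeded, -1 if the setup failed.
 */
static int excl_open_on_symlink(void)
{
	char dir[] = "/tmp/excl-demo-XXXXXX";
	char linkpath[sizeof(dir) + 8];
	char target[sizeof(dir) + 8];
	int fd, n, err = 0;

	if (!mkdtemp(dir))
		return -1;
	n = snprintf(linkpath, sizeof(linkpath), "%s/link", dir);
	assert(n > 0 && (size_t)n < sizeof(linkpath));
	n = snprintf(target, sizeof(target), "%s/target", dir);
	assert(n > 0 && (size_t)n < sizeof(target));

	/* "target" does not exist, so the symlink is dangling */
	if (symlink("target", linkpath) < 0) {
		rmdir(dir);
		return -1;
	}

	fd = open(linkpath, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		err = errno;
	else
		close(fd);

	unlink(target);		/* exists only if the symlink was followed */
	unlink(linkpath);
	rmdir(dir);
	return err;
}
```

Since the open never follows the link in this case, treating the flag combination as if O_NOFOLLOW had been passed does not change what userspace sees.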

2020-01-19 03:21:20

by Al Viro

Subject: [PATCH 03/17] follow_automount(): get rid of dead^Wstillborn code

From: Al Viro <[email protected]>

1) no instances of ->d_automount() have ever made use of the "return
ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
a rudiment of plans that got superseded before the thing went into
the tree. Despite the comment in follow_automount(), autofs has
never done that.

2) if there's no ->d_automount() in dentry_operations, filesystems
should not set DCACHE_NEED_AUTOMOUNT in the first place. None have
ever done so...

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 28 +++-------------------------
fs/namespace.c | 9 ++++++++-
2 files changed, 11 insertions(+), 26 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index bd036dfdb0d9..d30a74a18da9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1135,10 +1135,7 @@ EXPORT_SYMBOL(follow_up);
*/
static int follow_automount(struct path *path, struct nameidata *nd)
{
- struct vfsmount *mnt;
-
- if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
- return -EREMOTE;
+ struct dentry *dentry = path->dentry;

/* We don't want to mount if someone's just doing a stat -
* unless they're stat'ing a directory and appended a '/' to
@@ -1153,33 +1150,14 @@ static int follow_automount(struct path *path, struct nameidata *nd)
*/
if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
- path->dentry->d_inode)
+ dentry->d_inode)
return -EISDIR;

nd->total_link_count++;
if (nd->total_link_count >= 40)
return -ELOOP;

- mnt = path->dentry->d_op->d_automount(path);
- if (IS_ERR(mnt)) {
- /*
- * The filesystem is allowed to return -EISDIR here to indicate
- * it doesn't want to automount. For instance, autofs would do
- * this so that its userspace daemon can mount on this dentry.
- *
- * However, we can only permit this if it's a terminal point in
- * the path being looked up; if it wasn't then the remainder of
- * the path is inaccessible and we should say so.
- */
- if (PTR_ERR(mnt) == -EISDIR && (nd->flags & LOOKUP_PARENT))
- return -EREMOTE;
- return PTR_ERR(mnt);
- }
-
- if (!mnt)
- return 0;
-
- return finish_automount(mnt, path);
+ return finish_automount(dentry->d_op->d_automount(path), path);
}

/*
diff --git a/fs/namespace.c b/fs/namespace.c
index f1817eb5f87d..b37dc59bfa05 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2824,9 +2824,16 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int finish_automount(struct vfsmount *m, struct path *path)
{
struct dentry *dentry = path->dentry;
- struct mount *mnt = real_mount(m);
struct mountpoint *mp;
+ struct mount *mnt;
int err;
+
+ if (!m)
+ return 0;
+ if (IS_ERR(m))
+ return PTR_ERR(m);
+
+ mnt = real_mount(m);
/* The new mount record should have at least 2 refs to prevent it being
* expired before we get a chance to add it
*/
--
2.20.1

2020-01-19 03:21:23

by Al Viro

Subject: [PATCH 07/17] atomic_open(): saner calling conventions (return dentry on success)

From: Al Viro <[email protected]>

Currently it either returns -E... or puts (nd->path.mnt,dentry)
into *path and returns 0. Make it return ERR_PTR(-E...) or
dentry; adjust the caller. Fewer arguments and it's easier
to keep track of *path contents that way.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 37 ++++++++++++++++++++-----------------
1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4c867d0970d5..9d8837432a7b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2955,10 +2955,10 @@ static int may_o_create(const struct path *dir, struct dentry *dentry, umode_t m
*
* Returns an error code otherwise.
*/
-static int atomic_open(struct nameidata *nd, struct dentry *dentry,
- struct path *path, struct file *file,
- const struct open_flags *op,
- int open_flag, umode_t mode)
+static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
+ struct file *file,
+ const struct open_flags *op,
+ int open_flag, umode_t mode)
{
struct dentry *const DENTRY_NOT_SET = (void *) -1UL;
struct inode *dir = nd->path.dentry->d_inode;
@@ -2999,17 +2999,15 @@ static int atomic_open(struct nameidata *nd, struct dentry *dentry,
}
if (file->f_mode & FMODE_CREATED)
fsnotify_create(dir, dentry);
- if (unlikely(d_is_negative(dentry))) {
+ if (unlikely(d_is_negative(dentry)))
error = -ENOENT;
- } else {
- path->dentry = dentry;
- path->mnt = nd->path.mnt;
- return 0;
- }
}
}
- dput(dentry);
- return error;
+ if (error) {
+ dput(dentry);
+ dentry = ERR_PTR(error);
+ }
+ return dentry;
}

/*
@@ -3104,11 +3102,16 @@ static int lookup_open(struct nameidata *nd, struct path *path,
}

if (dir_inode->i_op->atomic_open) {
- error = atomic_open(nd, dentry, path, file, op, open_flag,
- mode);
- if (unlikely(error == -ENOENT) && create_error)
- error = create_error;
- return error;
+ dentry = atomic_open(nd, dentry, file, op, open_flag, mode);
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ if (unlikely(error == -ENOENT) && create_error)
+ error = create_error;
+ return error;
+ }
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ return 0;
}

no_open:
--
2.20.1
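
The "return ERR_PTR(-E...) or dentry" convention described above can be illustrated outside the kernel. The sketch below uses simplified userspace stand-ins for ERR_PTR(), PTR_ERR() and IS_ERR() (the real ones live in the kernel's err.h); struct item, find_item() and lookup_id() are hypothetical names for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* the top MAX_ERRNO addresses are reserved for error codes */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct item {
	int id;
};

static struct item table[] = { { 1 }, { 2 } };

/* Returns the object itself on success, ERR_PTR(-ENOENT) on failure. */
static struct item *find_item(int id)
{
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (table[i].id == id)
			return &table[i];
	return ERR_PTR(-ENOENT);
}

/* A caller that still wants "int plus out argument" unpacks it itself. */
static int lookup_id(int id, struct item **out)
{
	struct item *it = find_item(id);

	assert(out);
	if (IS_ERR(it))
		return (int)PTR_ERR(it);
	*out = it;
	return 0;
}
```

The callee loses one argument, and the out parameter is written in exactly one place in the caller, which is the bookkeeping benefit the commit message points at.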

2020-01-19 03:21:26

by Al Viro

Subject: [PATCH 04/17] follow_automount() doesn't need the entire nameidata

From: Al Viro <[email protected]>

only the address of ->total_link_count and the flags

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d30a74a18da9..3b6f60c02f8a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1133,7 +1133,7 @@ EXPORT_SYMBOL(follow_up);
* - return -EISDIR to tell follow_managed() to stop and return the path we
* were called with.
*/
-static int follow_automount(struct path *path, struct nameidata *nd)
+static int follow_automount(struct path *path, int *count, unsigned lookup_flags)
{
struct dentry *dentry = path->dentry;

@@ -1148,13 +1148,12 @@ static int follow_automount(struct path *path, struct nameidata *nd)
* as being automount points. These will need the attentions
* of the daemon to instantiate them before they can be used.
*/
- if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
+ if (!(lookup_flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
dentry->d_inode)
return -EISDIR;

- nd->total_link_count++;
- if (nd->total_link_count >= 40)
+ if (count && (*count)++ >= 40)
return -ELOOP;

return finish_automount(dentry->d_op->d_automount(path), path);
@@ -1215,7 +1214,8 @@ static int follow_managed(struct path *path, struct nameidata *nd)

/* Handle an automount point */
if (flags & DCACHE_NEED_AUTOMOUNT) {
- ret = follow_automount(path, nd);
+ ret = follow_automount(path, &nd->total_link_count,
+ nd->flags);
if (ret < 0)
break;
continue;
--
2.20.1
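
A general C note on bumping a counter through a pointer, as the link-count check in this patch does: postfix ++ binds tighter than unary *, so (*count)++ increments the pointed-to integer, while *count++ reads the value and advances the pointer instead. The sketch below is userspace only; bump_link_count() and count_after() are hypothetical helpers, and MAX_LINKS mirrors the literal 40 in the hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_LINKS 40	/* same limit as the literal 40 in the hunk */

/* Bump the counter behind 'count'; report -ELOOP once the limit is hit. */
static int bump_link_count(int *count)
{
	/*
	 * (*count)++ increments the integer that 'count' points to.
	 * Writing *count++ here would only advance the local pointer
	 * and leave the caller's counter unchanged.
	 */
	if (count && (*count)++ >= MAX_LINKS)
		return -ELOOP;
	return 0;
}

/* Call bump_link_count() n times on a fresh counter; return the counter. */
static int count_after(int n)
{
	int counter = 0;

	assert(n >= 0);
	while (n--)
		bump_link_count(&counter);
	return counter;
}
```

A NULL pointer is accepted and simply skips the accounting, mirroring the "count &&" guard.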

2020-01-19 03:21:44

by Al Viro

Subject: [PATCH 08/17] lookup_open(): saner calling conventions (return dentry on success)

From: Al Viro <[email protected]>

same story as for atomic_open() in the previous commit.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 39 ++++++++++++++++++---------------------
1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 9d8837432a7b..30503f114142 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3025,10 +3025,9 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
*
* An error code is returned on failure.
*/
-static int lookup_open(struct nameidata *nd, struct path *path,
- struct file *file,
- const struct open_flags *op,
- bool got_write)
+static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
+ const struct open_flags *op,
+ bool got_write)
{
struct dentry *dir = nd->path.dentry;
struct inode *dir_inode = dir->d_inode;
@@ -3039,7 +3038,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

if (unlikely(IS_DEADDIR(dir_inode)))
- return -ENOENT;
+ return ERR_PTR(-ENOENT);

file->f_mode &= ~FMODE_CREATED;
dentry = d_lookup(dir, &nd->last);
@@ -3047,7 +3046,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
if (!dentry) {
dentry = d_alloc_parallel(dir, &nd->last, &wq);
if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ return dentry;
}
if (d_in_lookup(dentry))
break;
@@ -3063,7 +3062,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
}
if (dentry->d_inode) {
/* Cached positive dentry: will open in f_op->open */
- goto out_no_open;
+ return dentry;
}

/*
@@ -3104,14 +3103,10 @@ static int lookup_open(struct nameidata *nd, struct path *path,
if (dir_inode->i_op->atomic_open) {
dentry = atomic_open(nd, dentry, file, op, open_flag, mode);
if (IS_ERR(dentry)) {
- error = PTR_ERR(dentry);
- if (unlikely(error == -ENOENT) && create_error)
- error = create_error;
- return error;
+ if (dentry == ERR_PTR(-ENOENT) && create_error)
+ dentry = ERR_PTR(create_error);
}
- path->mnt = nd->path.mnt;
- path->dentry = dentry;
- return 0;
+ return dentry;
}

no_open:
@@ -3147,14 +3142,11 @@ static int lookup_open(struct nameidata *nd, struct path *path,
error = create_error;
goto out_dput;
}
-out_no_open:
- path->dentry = dentry;
- path->mnt = nd->path.mnt;
- return 0;
+ return dentry;

out_dput:
dput(dentry);
- return error;
+ return ERR_PTR(error);
}

/*
@@ -3171,6 +3163,7 @@ static int do_last(struct nameidata *nd,
unsigned seq;
struct inode *inode;
struct path path;
+ struct dentry *dentry;
int error;

nd->flags &= ~LOOKUP_PARENT;
@@ -3227,14 +3220,18 @@ static int do_last(struct nameidata *nd,
inode_lock(dir->d_inode);
else
inode_lock_shared(dir->d_inode);
- error = lookup_open(nd, &path, file, op, got_write);
+ dentry = lookup_open(nd, file, op, got_write);
if (open_flag & O_CREAT)
inode_unlock(dir->d_inode);
else
inode_unlock_shared(dir->d_inode);

- if (error)
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
goto out;
+ }
+ path.mnt = nd->path.mnt;
+ path.dentry = dentry;

if (file->f_mode & FMODE_OPENED) {
if ((file->f_mode & FMODE_CREATED) ||
--
2.20.1

2020-01-19 03:22:05

by Al Viro

Subject: [PATCH 09/17] do_last(): collapse the call of path_to_nameidata()

From: Al Viro <[email protected]>

... and shift filling struct path to just before the call of
handle_mounts(). All callers of handle_mounts() are
immediately preceded by path->mnt = nd->path.mnt now.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 30503f114142..f66553ef436a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3230,8 +3230,6 @@ static int do_last(struct nameidata *nd,
error = PTR_ERR(dentry);
goto out;
}
- path.mnt = nd->path.mnt;
- path.dentry = dentry;

if (file->f_mode & FMODE_OPENED) {
if ((file->f_mode & FMODE_CREATED) ||
@@ -3247,7 +3245,8 @@ static int do_last(struct nameidata *nd,
open_flag &= ~O_TRUNC;
will_truncate = false;
acc_mode = 0;
- path_to_nameidata(&path, nd);
+ dput(nd->path.dentry);
+ nd->path.dentry = dentry;
goto finish_open_created;
}

@@ -3261,6 +3260,8 @@ static int do_last(struct nameidata *nd,
got_write = false;
}

+ path.mnt = nd->path.mnt;
+ path.dentry = dentry;
error = handle_mounts(&path, nd, &inode, &seq);
if (unlikely(error < 0))
return error;
--
2.20.1

2020-01-19 03:22:12

by Al Viro

Subject: [PATCH 06/17] handle_mounts(): start building a sane wrapper for follow_managed()

From: Al Viro <[email protected]>

All callers of follow_managed() follow it on success with the same steps -
d_backing_inode(path->dentry) is calculated and stored into some struct inode *
variable and, in all but one case, an unsigned variable (nd->seq to be) is
zeroed. The single exception is lookup_fast(), and there zeroing is the correct
thing to do - not doing it is a pointless microoptimization.

Add a wrapper for follow_managed() that would do that combination.
It's mostly a vehicle for code massage - it will be changing quite a bit,
and the current calling conventions are by no means final. Right now it
takes path, nameidata and (as out params) inode and seq, similar to
__follow_mount_rcu(). Which will soon get folded into it...

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c19b458f66da..4c867d0970d5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1304,6 +1304,18 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
!(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
}

+static inline int handle_mounts(struct path *path, struct nameidata *nd,
+ struct inode **inode, unsigned int *seqp)
+{
+ int ret = follow_managed(path, nd);
+
+ if (likely(ret >= 0)) {
+ *inode = d_backing_inode(path->dentry);
+ *seqp = 0; /* out of RCU mode, so the value doesn't matter */
+ }
+ return ret;
+}
+
static int follow_dotdot_rcu(struct nameidata *nd)
{
struct inode *inode = nd->inode;
@@ -1514,7 +1526,6 @@ static int lookup_fast(struct nameidata *nd,
struct vfsmount *mnt = nd->path.mnt;
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
- int err;

/*
* Rename seqlock is not required here because in the off chance
@@ -1584,10 +1595,7 @@ static int lookup_fast(struct nameidata *nd,

path->mnt = mnt;
path->dentry = dentry;
- err = follow_managed(path, nd);
- if (likely(err > 0))
- *inode = d_backing_inode(path->dentry);
- return err;
+ return handle_mounts(path, nd, inode, seqp);
}

/* Fast lookup failed, do it the slow way */
@@ -1761,12 +1769,9 @@ static int walk_component(struct nameidata *nd, int flags)
return PTR_ERR(path.dentry);

path.mnt = nd->path.mnt;
- err = follow_managed(&path, nd);
+ err = handle_mounts(&path, nd, &inode, &seq);
if (unlikely(err < 0))
return err;
-
- seq = 0; /* we are already out of RCU mode */
- inode = d_backing_inode(path.dentry);
}

return step_into(nd, &path, flags, inode, seq);
@@ -2233,11 +2238,9 @@ static int handle_lookup_down(struct nameidata *nd)
return -ECHILD;
} else {
dget(path.dentry);
- err = follow_managed(&path, nd);
+ err = handle_mounts(&path, nd, &inode, &seq);
if (unlikely(err < 0))
return err;
- inode = d_backing_inode(path.dentry);
- seq = 0;
}
path_to_nameidata(&path, nd);
nd->inode = inode;
@@ -3258,12 +3261,9 @@ static int do_last(struct nameidata *nd,
got_write = false;
}

- error = follow_managed(&path, nd);
+ error = handle_mounts(&path, nd, &inode, &seq);
if (unlikely(error < 0))
return error;
-
- seq = 0; /* out of RCU mode, so the value doesn't matter */
- inode = d_backing_inode(path.dentry);
finish_lookup:
error = step_into(nd, &path, 0, inode, seq);
if (unlikely(error))
--
2.20.1

2020-01-19 03:22:23

by Al Viro

Subject: [PATCH 10/17] handle_mounts(): pass dentry in, turn path into a pure out argument

From: Al Viro <[email protected]>

All callers are equivalent to
path->dentry = dentry;
path->mnt = nd->path.mnt;
err = handle_mounts(path, ...)
Pass dentry as an explicit argument, fill *path in handle_mounts()
itself.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 37 ++++++++++++++++++-------------------
1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index f66553ef436a..f95c072bad03 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1304,11 +1304,15 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
!(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
}

-static inline int handle_mounts(struct path *path, struct nameidata *nd,
- struct inode **inode, unsigned int *seqp)
+static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
+ struct path *path, struct inode **inode,
+ unsigned int *seqp)
{
- int ret = follow_managed(path, nd);
+ int ret;

+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ ret = follow_managed(path, nd);
if (likely(ret >= 0)) {
*inode = d_backing_inode(path->dentry);
*seqp = 0; /* out of RCU mode, so the value doesn't matter */
@@ -1592,10 +1596,7 @@ static int lookup_fast(struct nameidata *nd,
dput(dentry);
return status;
}
-
- path->mnt = mnt;
- path->dentry = dentry;
- return handle_mounts(path, nd, inode, seqp);
+ return handle_mounts(nd, dentry, path, inode, seqp);
}

/* Fast lookup failed, do it the slow way */
@@ -1745,6 +1746,7 @@ static inline int step_into(struct nameidata *nd, struct path *path,
static int walk_component(struct nameidata *nd, int flags)
{
struct path path;
+ struct dentry *dentry;
struct inode *inode;
unsigned seq;
int err;
@@ -1763,13 +1765,11 @@ static int walk_component(struct nameidata *nd, int flags)
if (unlikely(err <= 0)) {
if (err < 0)
return err;
- path.dentry = lookup_slow(&nd->last, nd->path.dentry,
- nd->flags);
- if (IS_ERR(path.dentry))
- return PTR_ERR(path.dentry);
+ dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);

- path.mnt = nd->path.mnt;
- err = handle_mounts(&path, nd, &inode, &seq);
+ err = handle_mounts(nd, dentry, &path, &inode, &seq);
if (unlikely(err < 0))
return err;
}
@@ -2223,7 +2223,7 @@ static inline int lookup_last(struct nameidata *nd)

static int handle_lookup_down(struct nameidata *nd)
{
- struct path path = nd->path;
+ struct path path;
struct inode *inode = nd->inode;
unsigned seq = nd->seq;
int err;
@@ -2234,11 +2234,12 @@ static int handle_lookup_down(struct nameidata *nd)
* at the very beginning of walk, so we lose nothing
* if we simply redo everything in non-RCU mode
*/
+ path = nd->path;
if (unlikely(!__follow_mount_rcu(nd, &path, &inode, &seq)))
return -ECHILD;
} else {
- dget(path.dentry);
- err = handle_mounts(&path, nd, &inode, &seq);
+ dget(nd->path.dentry);
+ err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
if (unlikely(err < 0))
return err;
}
@@ -3260,9 +3261,7 @@ static int do_last(struct nameidata *nd,
got_write = false;
}

- path.mnt = nd->path.mnt;
- path.dentry = dentry;
- error = handle_mounts(&path, nd, &inode, &seq);
+ error = handle_mounts(nd, dentry, &path, &inode, &seq);
if (unlikely(error < 0))
return error;
finish_lookup:
--
2.20.1

2020-01-19 03:22:58

by Al Viro

Subject: [PATCH 12/17] teach handle_mounts() to handle RCU mode

From: Al Viro <[email protected]>

... and make the callers of __follow_mount_rcu() use handle_mounts().

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 46 +++++++++++++++++-----------------------------
1 file changed, 17 insertions(+), 29 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 2e416bd8ee26..a3bed1307a4b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1312,6 +1312,18 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,

path->mnt = nd->path.mnt;
path->dentry = dentry;
+ if (nd->flags & LOOKUP_RCU) {
+ unsigned int seq = *seqp;
+ if (unlikely(!*inode))
+ return -ENOENT;
+ if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
+ return 1;
+ if (unlazy_child(nd, dentry, seq))
+ return -ECHILD;
+ // *path might've been clobbered by __follow_mount_rcu()
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ }
ret = follow_managed(path, nd);
if (likely(ret >= 0)) {
*inode = d_backing_inode(path->dentry);
@@ -1527,7 +1539,6 @@ static int lookup_fast(struct nameidata *nd,
struct path *path, struct inode **inode,
unsigned *seqp)
{
- struct vfsmount *mnt = nd->path.mnt;
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;

@@ -1565,21 +1576,8 @@ static int lookup_fast(struct nameidata *nd,

*seqp = seq;
status = d_revalidate(dentry, nd->flags);
- if (likely(status > 0)) {
- /*
- * Note: do negative dentry check after revalidation in
- * case that drops it.
- */
- if (unlikely(!inode))
- return -ENOENT;
- path->mnt = mnt;
- path->dentry = dentry;
- if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
- return 1;
- if (unlazy_child(nd, dentry, seq))
- return -ECHILD;
+ if (likely(status > 0))
return handle_mounts(nd, dentry, path, inode, seqp);
- }
if (unlazy_child(nd, dentry, seq))
return -ECHILD;
if (unlikely(status == -ECHILD))
@@ -2229,21 +2227,11 @@ static int handle_lookup_down(struct nameidata *nd)
unsigned seq = nd->seq;
int err;

- if (nd->flags & LOOKUP_RCU) {
- /*
- * don't bother with unlazy_walk on failure - we are
- * at the very beginning of walk, so we lose nothing
- * if we simply redo everything in non-RCU mode
- */
- path = nd->path;
- if (unlikely(!__follow_mount_rcu(nd, &path, &inode, &seq)))
- return -ECHILD;
- } else {
+ if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
- if (unlikely(err < 0))
- return err;
- }
+ err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
+ if (unlikely(err < 0))
+ return err;
path_to_nameidata(&path, nd);
nd->inode = inode;
nd->seq = seq;
--
2.20.1

2020-01-19 03:23:23

by Al Viro

Subject: [PATCH 14/17] new step_into() flag: WALK_NOFOLLOW

From: Al Viro <[email protected]>

Tells step_into() not to follow symlinks, regardless of LOOKUP_FOLLOW.
Allows us to switch handle_lookup_down() to use of step_into(), getting
all follow_managed() and step_into() calls paired.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d529c1e138ff..44634643475d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1713,7 +1713,7 @@ static int pick_link(struct nameidata *nd, struct path *link,
return 1;
}

-enum {WALK_FOLLOW = 1, WALK_MORE = 2};
+enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};

/*
* Do we need to follow links? We _really_ want to be able
@@ -1727,7 +1727,8 @@ static inline int step_into(struct nameidata *nd, struct path *path,
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
if (likely(!d_is_symlink(path->dentry)) ||
- !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW)) {
+ !(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW) ||
+ flags & WALK_NOFOLLOW) {
/* not a symlink or should not follow */
path_to_nameidata(path, nd);
nd->inode = inode;
@@ -2231,10 +2232,7 @@ static int handle_lookup_down(struct nameidata *nd)
err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
if (unlikely(err < 0))
return err;
- path_to_nameidata(&path, nd);
- nd->inode = inode;
- nd->seq = seq;
- return 0;
+ return step_into(nd, &path, WALK_NOFOLLOW, inode, seq);
}

/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
--
2.20.1
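
The "stop here, do not follow" condition in step_into() after this patch can be written out as a standalone predicate. This is a sketch for illustration: the WALK_* values are the ones from the hunk, while DEMO_LOOKUP_FOLLOW and stops_here() are hypothetical stand-ins for nd->flags & LOOKUP_FOLLOW and for the inline condition.

```c
#include <assert.h>
#include <stdbool.h>

enum { WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4 };

/* Hypothetical stand-in for the LOOKUP_FOLLOW bit in nd->flags. */
#define DEMO_LOOKUP_FOLLOW 0x1u

/*
 * True when the walk should settle on the dentry instead of picking up
 * a link: not a symlink, or nobody asked to follow, or the caller
 * explicitly forbade following with WALK_NOFOLLOW.
 */
static bool stops_here(bool is_symlink, int walk_flags, unsigned nd_flags)
{
	/* only the three known WALK_* bits may be set */
	assert(!(walk_flags & ~(WALK_FOLLOW | WALK_MORE | WALK_NOFOLLOW)));

	return !is_symlink ||
	       !((walk_flags & WALK_FOLLOW) || (nd_flags & DEMO_LOOKUP_FOLLOW)) ||
	       (walk_flags & WALK_NOFOLLOW);
}
```

The last clause is what makes WALK_NOFOLLOW win over both ways of requesting a follow.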

2020-01-19 03:23:45

by Al Viro

Subject: [PATCH 15/17] fold handle_mounts() into step_into()

From: Al Viro <[email protected]>

The following is true:
* calls of handle_mounts() and step_into() are always
paired in sequences like
err = handle_mounts(nd, dentry, &path, &inode, &seq);
if (unlikely(err < 0))
return err;
err = step_into(nd, &path, flags, inode, seq);
* in all such sequences path is uninitialized before and
unused after this pair of calls
* in all such sequences inode and seq are unused afterwards.

So the call of handle_mounts() can be shifted inside step_into(),
turning 'path' into a local variable in the combined function.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 41 +++++++++++++++--------------------------
1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 44634643475d..6c28b969f4d1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1721,31 +1721,35 @@ enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
* so we keep a cache of "no, this doesn't need follow_link"
* for the common case.
*/
-static inline int step_into(struct nameidata *nd, struct path *path,
- int flags, struct inode *inode, unsigned seq)
+static int step_into(struct nameidata *nd, int flags,
+ struct dentry *dentry, struct inode *inode, unsigned seq)
{
+ struct path path;
+ int err = handle_mounts(nd, dentry, &path, &inode, &seq);
+
+ if (err < 0)
+ return err;
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- if (likely(!d_is_symlink(path->dentry)) ||
+ if (likely(!d_is_symlink(path.dentry)) ||
!(flags & WALK_FOLLOW || nd->flags & LOOKUP_FOLLOW) ||
flags & WALK_NOFOLLOW) {
/* not a symlink or should not follow */
- path_to_nameidata(path, nd);
+ path_to_nameidata(&path, nd);
nd->inode = inode;
nd->seq = seq;
return 0;
}
/* make sure that d_is_symlink above matches inode */
if (nd->flags & LOOKUP_RCU) {
- if (read_seqcount_retry(&path->dentry->d_seq, seq))
+ if (read_seqcount_retry(&path.dentry->d_seq, seq))
return -ECHILD;
}
- return pick_link(nd, path, inode, seq);
+ return pick_link(nd, &path, inode, seq);
}

static int walk_component(struct nameidata *nd, int flags)
{
- struct path path;
struct dentry *dentry;
struct inode *inode;
unsigned seq;
@@ -1769,11 +1773,7 @@ static int walk_component(struct nameidata *nd, int flags)
if (IS_ERR(dentry))
return PTR_ERR(dentry);
}
-
- err = handle_mounts(nd, dentry, &path, &inode, &seq);
- if (unlikely(err < 0))
- return err;
- return step_into(nd, &path, flags, inode, seq);
+ return step_into(nd, flags, dentry, inode, seq);
}

/*
@@ -2222,17 +2222,10 @@ static inline int lookup_last(struct nameidata *nd)

static int handle_lookup_down(struct nameidata *nd)
{
- struct path path;
- struct inode *inode = nd->inode;
- unsigned seq = nd->seq;
- int err;
-
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- err = handle_mounts(nd, nd->path.dentry, &path, &inode, &seq);
- if (unlikely(err < 0))
- return err;
- return step_into(nd, &path, WALK_NOFOLLOW, inode, seq);
+ return step_into(nd, WALK_NOFOLLOW,
+ nd->path.dentry, nd->inode, nd->seq);
}

/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3149,7 +3142,6 @@ static int do_last(struct nameidata *nd,
int acc_mode = op->acc_mode;
unsigned seq;
struct inode *inode;
- struct path path;
struct dentry *dentry;
int error;

@@ -3247,10 +3239,7 @@ static int do_last(struct nameidata *nd,
}

finish_lookup:
- error = handle_mounts(nd, dentry, &path, &inode, &seq);
- if (unlikely(error < 0))
- return error;
- error = step_into(nd, &path, 0, inode, seq);
+ error = step_into(nd, 0, dentry, inode, seq);
if (unlikely(error))
return error;

--
2.20.1

2020-01-19 03:23:46

by Al Viro

Subject: [PATCH 11/17] lookup_fast(): consolidate the RCU success case

From: Al Viro <[email protected]>

1) in case of __follow_mount_rcu() failure, lookup_fast() proceeds
to call unlazy_child() and, should it succeed, handle_mounts().
Note that we have status > 0 (or we wouldn't be calling
__follow_mount_rcu() at all), so all stuff conditional upon
non-positive status won't even be touched.

Consolidate just that sequence after the call of __follow_mount_rcu().

2) calling d_is_negative() and keeping its result is pointless -
we either don't get past checking ->d_seq (and don't use the results of
d_is_negative() at all), or we are guaranteed that ->d_inode and
type bits of ->d_flags had been consistent at the time of d_is_negative()
call. IOW, we could only get to the use of its result if it's
equal to !inode. The same ->d_seq check guarantees that after that point
this CPU won't observe ->d_flags values older than ->d_inode update.
So 'negative' variable is completely pointless these days.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index f95c072bad03..2e416bd8ee26 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1538,7 +1538,6 @@ static int lookup_fast(struct nameidata *nd,
*/
if (nd->flags & LOOKUP_RCU) {
unsigned seq;
- bool negative;
dentry = __d_lookup_rcu(parent, &nd->last, &seq);
if (unlikely(!dentry)) {
if (unlazy_walk(nd))
@@ -1551,7 +1550,6 @@ static int lookup_fast(struct nameidata *nd,
* the dentry name information from lookup.
*/
*inode = d_backing_inode(dentry);
- negative = d_is_negative(dentry);
if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
return -ECHILD;

@@ -1572,12 +1570,15 @@ static int lookup_fast(struct nameidata *nd,
* Note: do negative dentry check after revalidation in
* case that drops it.
*/
- if (unlikely(negative))
+ if (unlikely(!*inode))
return -ENOENT;
path->mnt = mnt;
path->dentry = dentry;
if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
return 1;
+ if (unlazy_child(nd, dentry, seq))
+ return -ECHILD;
+ return handle_mounts(nd, dentry, path, inode, seqp);
}
if (unlazy_child(nd, dentry, seq))
return -ECHILD;
--
2.20.1

2020-01-19 03:24:05

by Al Viro

Subject: [PATCH 13/17] lookup_fast(): take mount traversal into callers

From: Al Viro <[email protected]>

Current calling conventions: -E... on error, 0 on cache miss,
result of handle_mounts(nd, dentry, path, inode, seqp) on
success. Turn that into returning ERR_PTR(-E...), NULL and dentry
resp.; deal with handle_mounts() in the callers. The thing
is, they already do that in cache miss handling case, so we
just need to supply dentry to them and unify the mount traversal
in those cases. Fewer arguments that way, and we get closer
to merging handle_mounts() and step_into().
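
A userspace sketch of the new convention, for illustration only. The toy_* helpers are hypothetical stand-ins modelled on the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(), and the two lookup functions are made up; only the shape of the caller follows the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins modelled on the kernel's error-pointer helpers. */
#define TOY_MAX_ERRNO 4095
static void *toy_err_ptr(long error) { return (void *)(intptr_t)error; }
static long toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_dentry { const char *name; };

static struct toy_dentry cached = { "cached" };
static struct toy_dentry slow = { "slow" };

/* New-style fast lookup: error => ERR_PTR, cache miss => NULL, hit => object. */
static struct toy_dentry *toy_lookup_fast(int key)
{
	if (key < 0)
		return toy_err_ptr(-ECHILD);
	if (key == 0)
		return NULL;
	return &cached;
}

static struct toy_dentry *toy_lookup_slow(int key)
{
	(void)key;
	return &slow;
}

/* Caller shape: hit and miss share one tail after the lookup. */
static int toy_walk_component(int key, const char **found)
{
	struct toy_dentry *dentry = toy_lookup_fast(key);

	if (toy_is_err(dentry))
		return (int)toy_ptr_err(dentry);
	if (!dentry) {
		dentry = toy_lookup_slow(key);
		if (toy_is_err(dentry))
			return (int)toy_ptr_err(dentry);
	}
	/* stands in for the shared handle_mounts() + step_into() tail */
	*found = dentry->name;
	return 0;
}
```

With a single pointer carrying all three outcomes, the caller no longer needs a separate struct path argument just to learn which dentry was found.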

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 50 ++++++++++++++++++++++++--------------------------
1 file changed, 24 insertions(+), 26 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a3bed1307a4b..d529c1e138ff 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1535,9 +1535,9 @@ static struct dentry *__lookup_hash(const struct qstr *name,
return dentry;
}

-static int lookup_fast(struct nameidata *nd,
- struct path *path, struct inode **inode,
- unsigned *seqp)
+static struct dentry *lookup_fast(struct nameidata *nd,
+ struct inode **inode,
+ unsigned *seqp)
{
struct dentry *dentry, *parent = nd->path.dentry;
int status = 1;
@@ -1552,8 +1552,8 @@ static int lookup_fast(struct nameidata *nd,
dentry = __d_lookup_rcu(parent, &nd->last, &seq);
if (unlikely(!dentry)) {
if (unlazy_walk(nd))
- return -ECHILD;
- return 0;
+ return ERR_PTR(-ECHILD);
+ return NULL;
}

/*
@@ -1562,7 +1562,7 @@ static int lookup_fast(struct nameidata *nd,
*/
*inode = d_backing_inode(dentry);
if (unlikely(read_seqcount_retry(&dentry->d_seq, seq)))
- return -ECHILD;
+ return ERR_PTR(-ECHILD);

/*
* This sequence count validates that the parent had no
@@ -1572,30 +1572,30 @@ static int lookup_fast(struct nameidata *nd,
* enough, we can use __read_seqcount_retry here.
*/
if (unlikely(__read_seqcount_retry(&parent->d_seq, nd->seq)))
- return -ECHILD;
+ return ERR_PTR(-ECHILD);

*seqp = seq;
status = d_revalidate(dentry, nd->flags);
if (likely(status > 0))
- return handle_mounts(nd, dentry, path, inode, seqp);
+ return dentry;
if (unlazy_child(nd, dentry, seq))
- return -ECHILD;
+ return ERR_PTR(-ECHILD);
if (unlikely(status == -ECHILD))
/* we'd been told to redo it in non-rcu mode */
status = d_revalidate(dentry, nd->flags);
} else {
dentry = __d_lookup(parent, &nd->last);
if (unlikely(!dentry))
- return 0;
+ return NULL;
status = d_revalidate(dentry, nd->flags);
}
if (unlikely(status <= 0)) {
if (!status)
d_invalidate(dentry);
dput(dentry);
- return status;
+ return ERR_PTR(status);
}
- return handle_mounts(nd, dentry, path, inode, seqp);
+ return dentry;
}

/* Fast lookup failed, do it the slow way */
@@ -1760,19 +1760,18 @@ static int walk_component(struct nameidata *nd, int flags)
put_link(nd);
return err;
}
- err = lookup_fast(nd, &path, &inode, &seq);
- if (unlikely(err <= 0)) {
- if (err < 0)
- return err;
+ dentry = lookup_fast(nd, &inode, &seq);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+ if (unlikely(!dentry)) {
dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
-
- err = handle_mounts(nd, dentry, &path, &inode, &seq);
- if (unlikely(err < 0))
- return err;
}

+ err = handle_mounts(nd, dentry, &path, &inode, &seq);
+ if (unlikely(err < 0))
+ return err;
return step_into(nd, &path, flags, inode, seq);
}

@@ -3170,13 +3169,12 @@ static int do_last(struct nameidata *nd,
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
- error = lookup_fast(nd, &path, &inode, &seq);
- if (likely(error > 0))
+ dentry = lookup_fast(nd, &inode, &seq);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+ if (likely(dentry))
goto finish_lookup;

- if (error < 0)
- return error;
-
BUG_ON(nd->inode != dir->d_inode);
BUG_ON(nd->flags & LOOKUP_RCU);
} else {
@@ -3250,10 +3248,10 @@ static int do_last(struct nameidata *nd,
got_write = false;
}

+finish_lookup:
error = handle_mounts(nd, dentry, &path, &inode, &seq);
if (unlikely(error < 0))
return error;
-finish_lookup:
error = step_into(nd, &path, 0, inode, seq);
if (unlikely(error))
return error;
--
2.20.1

2020-01-19 03:24:33

by Al Viro

Subject: [PATCH 17/17] expand the only remaining call of path_put_conditional()

From: Al Viro <[email protected]>

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 6852a0dcb25d..e840472ab9bf 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -816,13 +816,6 @@ static void set_root(struct nameidata *nd)
}
}

-static void path_put_conditional(struct path *path, struct nameidata *nd)
-{
- dput(path->dentry);
- if (path->mnt != nd->path.mnt)
- mntput(path->mnt);
-}
-
static inline void path_to_nameidata(const struct path *path,
struct nameidata *nd)
{
@@ -1233,8 +1226,11 @@ static int follow_managed(struct path *path, struct nameidata *nd)
ret = 1;
if (ret > 0 && unlikely(d_flags_negative(flags)))
ret = -ENOENT;
- if (unlikely(ret < 0))
- path_put_conditional(path, nd);
+ if (unlikely(ret < 0)) {
+ dput(path->dentry);
+ if (path->mnt != nd->path.mnt)
+ mntput(path->mnt);
+ }
return ret;
}

--
2.20.1

2020-01-19 03:24:57

by Al Viro

Subject: [PATCH 1/9] merging pick_link() with get_link(), part 1

From: Al Viro <[email protected]>

Move restoring LOOKUP_PARENT and zeroing nd->stack[0].name past
the call of get_link() (nothing _currently_ uses them in there).
That allows moving the call of may_follow_link() into get_link()
as well, since now the presence of LOOKUP_PARENT distinguishes
the callers from each other (link_path_walk() has it, trailing_symlink()
doesn't).

Preparations for folding trailing_symlink() into callers (lookup_last()
and do_last()) and changing the calling conventions of those. Next
stage after that will have get_link() call migrate into walk_component(),
then - into step_into(). It's tricky enough to warrant doing that
in stages, unfortunately...

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index f9fa8579cf6a..45cedbe267ab 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1114,6 +1114,12 @@ const char *get_link(struct nameidata *nd)
int error;
const char *res;

+ if (!(nd->flags & LOOKUP_PARENT)) {
+ error = may_follow_link(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ }
+
if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
return ERR_PTR(-ELOOP);

@@ -2328,13 +2334,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)

static const char *trailing_symlink(struct nameidata *nd)
{
- const char *s;
- int error = may_follow_link(nd);
- if (unlikely(error))
- return ERR_PTR(error);
+ const char *s = get_link(nd);
nd->flags |= LOOKUP_PARENT;
nd->stack[0].name = NULL;
- s = get_link(nd);
return s ? s : "";
}

--
2.20.1

2020-01-19 03:25:17

by Al Viro

Subject: [PATCH 16/17] LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()

From: Al Viro <[email protected]>

New LOOKUP flag, telling path_lookupat() to act as path_mountpointat().
IOW, traverse mounts at the final point and skip revalidation of the
location where it ends up.
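
For illustration, the umount side of this can be sketched in userspace as a pure flag translation. The TOY_* constants and toy_umount_lookup_flags() are hypothetical; the 0x0080 value is the one this patch adds, while the remaining values are assumed to match the usual kernel and libc headers.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative copies of the flag values; assumptions, not the real headers. */
#define TOY_LOOKUP_FOLLOW	0x0001
#define TOY_LOOKUP_MOUNTPOINT	0x0080
#define TOY_MNT_FORCE		0x1
#define TOY_MNT_DETACH		0x2
#define TOY_MNT_EXPIRE		0x4
#define TOY_UMOUNT_NOFOLLOW	0x8

/*
 * Translate umount flags into lookup flags the way the patched
 * ksys_umount() does: always ask for mount traversal at the end,
 * follow a trailing symlink unless the caller said not to.
 */
static int toy_umount_lookup_flags(int flags, unsigned *lookup_flags)
{
	if (flags & ~(TOY_MNT_FORCE | TOY_MNT_DETACH |
		      TOY_MNT_EXPIRE | TOY_UMOUNT_NOFOLLOW))
		return -EINVAL;

	*lookup_flags = TOY_LOOKUP_MOUNTPOINT;
	if (!(flags & TOY_UMOUNT_NOFOLLOW))
		*lookup_flags |= TOY_LOOKUP_FOLLOW;
	return 0;
}
```

The point of the fold is visible here: the "this is a mountpoint lookup" property becomes one more bit in the flags word rather than a separate family of lookup entry points.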

Signed-off-by: Al Viro <[email protected]>
---
fs/autofs/dev-ioctl.c | 6 +--
fs/internal.h | 1 -
fs/namei.c | 89 +++----------------------------------------
fs/namespace.c | 4 +-
include/linux/namei.h | 2 +-
5 files changed, 12 insertions(+), 90 deletions(-)

diff --git a/fs/autofs/dev-ioctl.c b/fs/autofs/dev-ioctl.c
index a3cdb0036c5d..f3a0f412b43b 100644
--- a/fs/autofs/dev-ioctl.c
+++ b/fs/autofs/dev-ioctl.c
@@ -186,7 +186,7 @@ static int find_autofs_mount(const char *pathname,
struct path path;
int err;

- err = kern_path_mountpoint(AT_FDCWD, pathname, &path, 0);
+ err = kern_path(pathname, LOOKUP_MOUNTPOINT, &path);
if (err)
return err;
err = -ENOENT;
@@ -519,8 +519,8 @@ static int autofs_dev_ioctl_ismountpoint(struct file *fp,

if (!fp || param->ioctlfd == -1) {
if (autofs_type_any(type))
- err = kern_path_mountpoint(AT_FDCWD,
- name, &path, LOOKUP_FOLLOW);
+ err = kern_path(name, LOOKUP_FOLLOW | LOOKUP_MOUNTPOINT,
+ &path);
else
err = find_autofs_mount(name, &path,
test_by_type, &type);
diff --git a/fs/internal.h b/fs/internal.h
index 4a7da1df573d..07695e0f56fe 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -61,7 +61,6 @@ extern int finish_clean_context(struct fs_context *fc);
*/
extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
struct path *path, struct path *root);
-extern int user_path_mountpoint_at(int, const char __user *, unsigned int, struct path *);
extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
const char *, unsigned int, struct path *);
long do_mknodat(int dfd, const char __user *filename, umode_t mode,
diff --git a/fs/namei.c b/fs/namei.c
index 6c28b969f4d1..6852a0dcb25d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2250,6 +2250,10 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path
if (!err && nd->flags & LOOKUP_DIRECTORY)
if (!d_can_lookup(nd->path.dentry))
err = -ENOTDIR;
+ if (!err && unlikely(nd->flags & LOOKUP_MOUNTPOINT)) {
+ err = handle_lookup_down(nd);
+ nd->flags &= ~LOOKUP_JUMPED; // no d_weak_revalidate(), please...
+ }
if (!err) {
*path = nd->path;
nd->path.mnt = NULL;
@@ -2278,7 +2282,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
retval = path_lookupat(&nd, flags | LOOKUP_REVAL, path);

if (likely(!retval))
- audit_inode(name, path->dentry, 0);
+ audit_inode(name, path->dentry,
+ flags & LOOKUP_MOUNTPOINT ? AUDIT_INODE_NOEVAL : 0);
restore_nameidata();
putname(name);
return retval;
@@ -2556,88 +2561,6 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
}
EXPORT_SYMBOL(user_path_at_empty);

-/**
- * path_mountpoint - look up a path to be umounted
- * @nd: lookup context
- * @flags: lookup flags
- * @path: pointer to container for result
- *
- * Look up the given name, but don't attempt to revalidate the last component.
- * Returns 0 and "path" will be valid on success; Returns error otherwise.
- */
-static int
-path_mountpoint(struct nameidata *nd, unsigned flags, struct path *path)
-{
- const char *s = path_init(nd, flags);
- int err;
-
- while (!(err = link_path_walk(s, nd)) &&
- (err = lookup_last(nd)) > 0) {
- s = trailing_symlink(nd);
- }
- if (!err && (nd->flags & LOOKUP_RCU))
- err = unlazy_walk(nd);
- if (!err)
- err = handle_lookup_down(nd);
- if (!err) {
- *path = nd->path;
- nd->path.mnt = NULL;
- nd->path.dentry = NULL;
- }
- terminate_walk(nd);
- return err;
-}
-
-static int
-filename_mountpoint(int dfd, struct filename *name, struct path *path,
- unsigned int flags)
-{
- struct nameidata nd;
- int error;
- if (IS_ERR(name))
- return PTR_ERR(name);
- set_nameidata(&nd, dfd, name);
- error = path_mountpoint(&nd, flags | LOOKUP_RCU, path);
- if (unlikely(error == -ECHILD))
- error = path_mountpoint(&nd, flags, path);
- if (unlikely(error == -ESTALE))
- error = path_mountpoint(&nd, flags | LOOKUP_REVAL, path);
- if (likely(!error))
- audit_inode(name, path->dentry, AUDIT_INODE_NOEVAL);
- restore_nameidata();
- putname(name);
- return error;
-}
-
-/**
- * user_path_mountpoint_at - lookup a path from userland in order to umount it
- * @dfd: directory file descriptor
- * @name: pathname from userland
- * @flags: lookup flags
- * @path: pointer to container to hold result
- *
- * A umount is a special case for path walking. We're not actually interested
- * in the inode in this situation, and ESTALE errors can be a problem. We
- * simply want track down the dentry and vfsmount attached at the mountpoint
- * and avoid revalidating the last component.
- *
- * Returns 0 and populates "path" on success.
- */
-int
-user_path_mountpoint_at(int dfd, const char __user *name, unsigned int flags,
- struct path *path)
-{
- return filename_mountpoint(dfd, getname(name), path, flags);
-}
-
-int
-kern_path_mountpoint(int dfd, const char *name, struct path *path,
- unsigned int flags)
-{
- return filename_mountpoint(dfd, getname_kernel(name), path, flags);
-}
-EXPORT_SYMBOL(kern_path_mountpoint);
-
int __check_sticky(struct inode *dir, struct inode *inode)
{
kuid_t fsuid = current_fsuid();
diff --git a/fs/namespace.c b/fs/namespace.c
index b37dc59bfa05..b31a75782a59 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1669,7 +1669,7 @@ int ksys_umount(char __user *name, int flags)
struct path path;
struct mount *mnt;
int retval;
- int lookup_flags = 0;
+ int lookup_flags = LOOKUP_MOUNTPOINT;

if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
return -EINVAL;
@@ -1680,7 +1680,7 @@ int ksys_umount(char __user *name, int flags)
if (!(flags & UMOUNT_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;

- retval = user_path_mountpoint_at(AT_FDCWD, name, lookup_flags, &path);
+ retval = user_path_at(AT_FDCWD, name, lookup_flags, &path);
if (retval)
goto out;
mnt = real_mount(path.mnt);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 07bfb0874033..df3549de1cd1 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -22,6 +22,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_AUTOMOUNT 0x0004 /* force terminal automount */
#define LOOKUP_EMPTY 0x4000 /* accept empty path [user_... only] */
#define LOOKUP_DOWN 0x8000 /* follow mounts in the starting point */
+#define LOOKUP_MOUNTPOINT 0x0080 /* follow mounts in the end */

#define LOOKUP_REVAL 0x0020 /* tell ->d_revalidate() to trust no cache */
#define LOOKUP_RCU 0x0040 /* RCU pathwalk mode; semi-internal */
@@ -54,7 +55,6 @@ extern struct dentry *kern_path_create(int, const char *, struct path *, unsigne
extern struct dentry *user_path_create(int, const char __user *, struct path *, unsigned int);
extern void done_path_create(struct path *, struct dentry *);
extern struct dentry *kern_path_locked(const char *, struct path *);
-extern int kern_path_mountpoint(int, const char *, struct path *, unsigned int);

extern struct dentry *try_lookup_one_len(const char *, struct dentry *, int);
extern struct dentry *lookup_one_len(const char *, struct dentry *, int);
--
2.20.1

2020-01-19 03:26:14

by Al Viro

Subject: [PATCH 4/9] merging pick_link() with get_link(), part 4

From: Al Viro <[email protected]>

Move the call of get_link() into walk_component(). Change the
calling conventions for walk_component() to returning the link
body to follow (if any).

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 60 ++++++++++++++++++++++++------------------------------
1 file changed, 27 insertions(+), 33 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index fe03e4d1144b..2c7778d95d32 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1867,7 +1867,7 @@ static int step_into(struct nameidata *nd, int flags,
return pick_link(nd, &path, inode, seq);
}

-static int walk_component(struct nameidata *nd, int flags)
+static const char *walk_component(struct nameidata *nd, int flags)
{
struct dentry *dentry;
struct inode *inode;
@@ -1882,17 +1882,23 @@ static int walk_component(struct nameidata *nd, int flags)
err = handle_dots(nd, nd->last_type);
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
- return err;
+ return ERR_PTR(err);
}
dentry = lookup_fast(nd, &inode, &seq);
if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ return ERR_CAST(dentry);
if (unlikely(!dentry)) {
dentry = lookup_slow(&nd->last, nd->path.dentry, nd->flags);
if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ return ERR_CAST(dentry);
}
- return step_into(nd, flags, dentry, inode, seq);
+ err = step_into(nd, flags, dentry, inode, seq);
+ if (!err)
+ return NULL;
+ else if (err > 0)
+ return get_link(nd);
+ else
+ return ERR_PTR(err);
}

/*
@@ -2144,6 +2150,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)

/* At this point we know we have a real path component. */
for(;;) {
+ const char *link;
u64 hash_len;
int type;

@@ -2201,24 +2208,18 @@ static int link_path_walk(const char *name, struct nameidata *nd)
if (!name)
return 0;
/* last component of nested symlink */
- err = walk_component(nd, WALK_FOLLOW);
+ link = walk_component(nd, WALK_FOLLOW);
} else {
/* not the last component */
- err = walk_component(nd, WALK_FOLLOW | WALK_MORE);
+ link = walk_component(nd, WALK_FOLLOW | WALK_MORE);
}
- if (err < 0)
- return err;
-
- if (err) {
- const char *s = get_link(nd);
-
- if (IS_ERR(s))
- return PTR_ERR(s);
- if (likely(s)) {
- nd->stack[nd->depth - 1].name = name;
- name = s;
- continue;
- }
+ if (unlikely(link)) {
+ if (IS_ERR(link))
+ return PTR_ERR(link);
+ /* a symlink to follow */
+ nd->stack[nd->depth - 1].name = name;
+ name = link;
+ continue;
}
if (unlikely(!d_can_lookup(nd->path.dentry))) {
if (nd->flags & LOOKUP_RCU) {
@@ -2334,24 +2335,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)

static inline const char *lookup_last(struct nameidata *nd)
{
- int err;
+ const char *link;
if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;

nd->flags &= ~LOOKUP_PARENT;
- err = walk_component(nd, 0);
- if (unlikely(err)) {
- const char *s;
- if (err < 0)
- return ERR_PTR(err);
- s = get_link(nd);
- if (s) {
- nd->flags |= LOOKUP_PARENT;
- nd->stack[0].name = NULL;
- return s;
- }
+ link = walk_component(nd, 0);
+ if (link) {
+ nd->flags |= LOOKUP_PARENT;
+ nd->stack[0].name = NULL;
}
- return NULL;
+ return link;
}

static int handle_lookup_down(struct nameidata *nd)
--
2.20.1

2020-01-19 03:26:14

by Al Viro

Subject: [PATCH 2/9] merging pick_link() with get_link(), part 2

From: Al Viro <[email protected]>

Fold trailing_symlink() into lookup_last() and do_last(), change
the calling conventions of those two. Rules change:
success, we are done => NULL instead of 0
error => ERR_PTR(-E...) instead of -E...
got a symlink to follow => return the path to be followed instead of 1

The loops calling those (in path_lookupat() and path_openat()) adjusted.
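
The adjusted loop shape can be sketched in userspace as follows. This is a toy model under stated assumptions: the symlink table, struct toy_nd, TOY_MAXSYMLINKS and every toy_* function are hypothetical; only the rules (NULL when done, error pointer on failure, link body otherwise) and the form of the while loop follow the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins modelled on the kernel's error-pointer helpers. */
#define TOY_MAX_ERRNO 4095
#define TOY_MAXSYMLINKS 40
static void *toy_err_ptr(long error) { return (void *)(intptr_t)error; }
static long toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_nd { const char *last; int hops; };

/* Made-up symlink table: name -> link body. */
static const struct { const char *name, *target; } toy_links[] = {
	{ "a", "b" }, { "b", "c" }, { "loop", "loop" },
};

/* Like link_path_walk(): an error pointer passed in is turned into -E... */
static int toy_link_path_walk(const char *s, struct toy_nd *nd)
{
	if (toy_is_err(s))
		return (int)toy_ptr_err(s);
	nd->last = s;
	return 0;
}

/* NULL => done, error pointer => failure, anything else => body to follow. */
static const char *toy_lookup_last(struct toy_nd *nd)
{
	size_t i;

	for (i = 0; i < sizeof(toy_links) / sizeof(toy_links[0]); i++) {
		if (strcmp(nd->last, toy_links[i].name) == 0) {
			if (nd->hops++ >= TOY_MAXSYMLINKS)
				return toy_err_ptr(-ELOOP);
			return toy_links[i].target;
		}
	}
	return NULL;
}

static int toy_path_lookup(const char *s, const char **result)
{
	struct toy_nd nd = { NULL, 0 };
	int err;

	while (!(err = toy_link_path_walk(s, &nd)) &&
	       (s = toy_lookup_last(&nd)) != NULL)
		;
	if (!err)
		*result = nd.last;
	return err;
}
```

Note how an error pointer returned by the last-component step is not tested in the loop condition itself; it is fed back into the walk function, which converts it to a negative errno and ends the loop.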

A subtle change of control flow here: originally a pure-jump trailing
symlink ("/" or procfs one) would've passed through the upper level
loop once more, with "" for path to traverse. That would've brought
us back to the lookup_last/do_last entry and we would've hit LAST_BIND
case (LAST_BIND left from get_link() called by trailing_symlink())
and pretty much skipped to the point right after where we'd left the
sucker back when we picked that trailing symlink.

Now we don't bother with that extra pass through the upper level
loop - if get_link() says "I've just done a pure jump, nothing
else to do", we just treat that as non-symlink case.

Boilerplate added on that step will go away shortly - it'll migrate
into walk_component() and then to step_into(), collapsing into the
change of calling conventions for those.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 68 ++++++++++++++++++++++++++++++++----------------------
1 file changed, 40 insertions(+), 28 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 45cedbe267ab..d93e155caded 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2332,21 +2332,26 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
return s;
}

-static const char *trailing_symlink(struct nameidata *nd)
-{
- const char *s = get_link(nd);
- nd->flags |= LOOKUP_PARENT;
- nd->stack[0].name = NULL;
- return s ? s : "";
-}
-
-static inline int lookup_last(struct nameidata *nd)
+static inline const char *lookup_last(struct nameidata *nd)
{
+ int err;
if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;

nd->flags &= ~LOOKUP_PARENT;
- return walk_component(nd, 0);
+ err = walk_component(nd, 0);
+ if (unlikely(err)) {
+ const char *s;
+ if (err < 0)
+ return ERR_PTR(err);
+ s = get_link(nd);
+ if (s) {
+ nd->flags |= LOOKUP_PARENT;
+ nd->stack[0].name = NULL;
+ return s;
+ }
+ }
+ return NULL;
}

static int handle_lookup_down(struct nameidata *nd)
@@ -2369,10 +2374,9 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path
s = ERR_PTR(err);
}

- while (!(err = link_path_walk(s, nd))
- && ((err = lookup_last(nd)) > 0)) {
- s = trailing_symlink(nd);
- }
+ while (!(err = link_path_walk(s, nd)) &&
+ (s = lookup_last(nd)) != NULL)
+ ;
if (!err)
err = complete_walk(nd);

@@ -3184,7 +3188,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
/*
* Handle the last step of open()
*/
-static int do_last(struct nameidata *nd,
+static const char *do_last(struct nameidata *nd,
struct file *file, const struct open_flags *op)
{
struct dentry *dir = nd->path.dentry;
@@ -3203,7 +3207,7 @@ static int do_last(struct nameidata *nd,
if (nd->last_type != LAST_NORM) {
error = handle_dots(nd, nd->last_type);
if (unlikely(error))
- return error;
+ return ERR_PTR(error);
goto finish_open;
}

@@ -3213,7 +3217,7 @@ static int do_last(struct nameidata *nd,
/* we _can_ be in RCU mode here */
dentry = lookup_fast(nd, &inode, &seq);
if (IS_ERR(dentry))
- return PTR_ERR(dentry);
+ return ERR_CAST(dentry);
if (likely(dentry))
goto finish_lookup;

@@ -3228,12 +3232,12 @@ static int do_last(struct nameidata *nd,
*/
error = complete_walk(nd);
if (error)
- return error;
+ return ERR_PTR(error);

audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
/* trailing slashes? */
if (unlikely(nd->last.name[nd->last.len]))
- return -EISDIR;
+ return ERR_PTR(-EISDIR);
}

if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
@@ -3292,18 +3296,28 @@ static int do_last(struct nameidata *nd,

finish_lookup:
error = step_into(nd, 0, dentry, inode, seq);
- if (unlikely(error))
- return error;
+ if (unlikely(error)) {
+ const char *s;
+ if (error < 0)
+ return ERR_PTR(error);
+ s = get_link(nd);
+ if (s) {
+ nd->flags |= LOOKUP_PARENT;
+ nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
+ nd->stack[0].name = NULL;
+ return s;
+ }
+ }

if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
audit_inode(nd->name, nd->path.dentry, 0);
- return -EEXIST;
+ return ERR_PTR(-EEXIST);
}
finish_open:
/* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */
error = complete_walk(nd);
if (error)
- return error;
+ return ERR_PTR(error);
audit_inode(nd->name, nd->path.dentry, 0);
if (open_flag & O_CREAT) {
error = -EISDIR;
@@ -3345,7 +3359,7 @@ static int do_last(struct nameidata *nd,
}
if (got_write)
mnt_drop_write(nd->path.mnt);
- return error;
+ return ERR_PTR(error);
}

struct dentry *vfs_tmpfile(struct dentry *dentry, umode_t mode, int open_flag)
@@ -3448,10 +3462,8 @@ static struct file *path_openat(struct nameidata *nd,
} else {
const char *s = path_init(nd, flags);
while (!(error = link_path_walk(s, nd)) &&
- (error = do_last(nd, file, op)) > 0) {
- nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
- s = trailing_symlink(nd);
- }
+ (s = do_last(nd, file, op)) != NULL)
+ ;
terminate_walk(nd);
}
if (likely(!error)) {
--
2.20.1

2020-01-19 03:26:28

by Al Viro

Subject: [PATCH 5/9] merging pick_link() with get_link(), part 5

From: Al Viro <[email protected]>

move get_link() call into step_into().

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 45 +++++++++++++++++++--------------------------
1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 2c7778d95d32..ad6de8b4167e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1840,14 +1840,14 @@ enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
* so we keep a cache of "no, this doesn't need follow_link"
* for the common case.
*/
-static int step_into(struct nameidata *nd, int flags,
+static const char *step_into(struct nameidata *nd, int flags,
struct dentry *dentry, struct inode *inode, unsigned seq)
{
struct path path;
int err = handle_mounts(nd, dentry, &path, &inode, &seq);

if (err < 0)
- return err;
+ return ERR_PTR(err);
if (!(flags & WALK_MORE) && nd->depth)
put_link(nd);
if (likely(!d_is_symlink(path.dentry)) ||
@@ -1857,14 +1857,18 @@ static int step_into(struct nameidata *nd, int flags,
path_to_nameidata(&path, nd);
nd->inode = inode;
nd->seq = seq;
- return 0;
+ return NULL;
}
/* make sure that d_is_symlink above matches inode */
if (nd->flags & LOOKUP_RCU) {
if (read_seqcount_retry(&path.dentry->d_seq, seq))
- return -ECHILD;
+ return ERR_PTR(-ECHILD);
}
- return pick_link(nd, &path, inode, seq);
+ err = pick_link(nd, &path, inode, seq);
+ if (err > 0)
+ return get_link(nd);
+ else
+ return ERR_PTR(err);
}

static const char *walk_component(struct nameidata *nd, int flags)
@@ -1892,13 +1896,7 @@ static const char *walk_component(struct nameidata *nd, int flags)
if (IS_ERR(dentry))
return ERR_CAST(dentry);
}
- err = step_into(nd, flags, dentry, inode, seq);
- if (!err)
- return NULL;
- else if (err > 0)
- return get_link(nd);
- else
- return ERR_PTR(err);
+ return step_into(nd, flags, dentry, inode, seq);
}

/*
@@ -2352,8 +2350,8 @@ static int handle_lookup_down(struct nameidata *nd)
{
if (!(nd->flags & LOOKUP_RCU))
dget(nd->path.dentry);
- return step_into(nd, WALK_NOFOLLOW,
- nd->path.dentry, nd->inode, nd->seq);
+ return PTR_ERR(step_into(nd, WALK_NOFOLLOW,
+ nd->path.dentry, nd->inode, nd->seq));
}

/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
@@ -3193,6 +3191,7 @@ static const char *do_last(struct nameidata *nd,
unsigned seq;
struct inode *inode;
struct dentry *dentry;
+ const char *link;
int error;

nd->flags &= ~LOOKUP_PARENT;
@@ -3289,18 +3288,12 @@ static const char *do_last(struct nameidata *nd,
}

finish_lookup:
- error = step_into(nd, 0, dentry, inode, seq);
- if (unlikely(error)) {
- const char *s;
- if (error < 0)
- return ERR_PTR(error);
- s = get_link(nd);
- if (s) {
- nd->flags |= LOOKUP_PARENT;
- nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
- nd->stack[0].name = NULL;
- return s;
- }
+ link = step_into(nd, 0, dentry, inode, seq);
+ if (unlikely(link)) {
+ nd->flags |= LOOKUP_PARENT;
+ nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
+ nd->stack[0].name = NULL;
+ return link;
}

if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
--
2.20.1

2020-01-19 03:26:50

by Al Viro

Subject: [PATCH 3/9] merging pick_link() with get_link(), part 3

From: Al Viro <[email protected]>

After a pure jump ("/" or procfs-style symlink) we don't need to
hold the link anymore. link_path_walk() dropped it if such case
had been detected, lookup_last/do_last() (i.e. old trailing_symlink())
left it on the stack - it ended up calling terminate_walk() shortly
anyway, which would've purged the entire stack.

Do it in get_link() itself instead. Simpler logic that way...
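
The "pure jump" test can be sketched in userspace. toy_link_remainder() is a hypothetical helper that mirrors only the tail of get_link(): the *jumped flag stands in for nd_jump_root(), and the put_link() call that this patch adds on the NULL path is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the part of a link body that still has to be walked,
 * or NULL for a pure jump. *jumped is set when the body was
 * absolute (in the kernel this is where nd_jump_root() runs).
 */
static const char *toy_link_remainder(const char *res, int *jumped)
{
	*jumped = 0;
	if (!res)
		return NULL;	/* procfs-style link: the jump already happened */
	if (*res == '/') {
		*jumped = 1;
		while (*++res == '/')
			;	/* skip any extra leading slashes */
	}
	return *res ? res : NULL;
}
```

A body of "/" or "///" leaves nothing to walk after the jump, so it is reported the same way as a procfs-style link that returned no body at all; that is the case where the link no longer needs to be held.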

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d93e155caded..fe03e4d1144b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1153,7 +1153,9 @@ const char *get_link(struct nameidata *nd)
} else {
res = get(dentry, inode, &last->done);
}
- if (IS_ERR_OR_NULL(res))
+ if (!res)
+ goto all_done;
+ if (IS_ERR(res))
return res;
}
if (*res == '/') {
@@ -1163,9 +1165,11 @@ const char *get_link(struct nameidata *nd)
while (unlikely(*++res == '/'))
;
}
- if (!*res)
- res = NULL;
- return res;
+ if (*res)
+ return res;
+all_done: // pure jump
+ put_link(nd);
+ return NULL;
}

/*
@@ -2210,11 +2214,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)

if (IS_ERR(s))
return PTR_ERR(s);
- err = 0;
- if (unlikely(!s)) {
- /* jumped */
- put_link(nd);
- } else {
+ if (likely(s)) {
nd->stack[nd->depth - 1].name = name;
name = s;
continue;
--
2.20.1

2020-01-19 03:27:01

by Al Viro

Subject: [PATCH 6/9] merging pick_link() with get_link(), part 6

From: Al Viro <[email protected]>

move the only remaining call of get_link() into pick_link()

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index ad6de8b4167e..adb573e0f424 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1792,14 +1792,14 @@ static inline int handle_dots(struct nameidata *nd, int type)
return 0;
}

-static int pick_link(struct nameidata *nd, struct path *link,
+static const char *pick_link(struct nameidata *nd, struct path *link,
struct inode *inode, unsigned seq)
{
int error;
struct saved *last;
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
path_to_nameidata(link, nd);
- return -ELOOP;
+ return ERR_PTR(-ELOOP);
}
if (!(nd->flags & LOOKUP_RCU)) {
if (link->mnt == nd->path.mnt)
@@ -1820,7 +1820,7 @@ static int pick_link(struct nameidata *nd, struct path *link,
}
if (error) {
path_put(link);
- return error;
+ return ERR_PTR(error);
}
}

@@ -1829,7 +1829,7 @@ static int pick_link(struct nameidata *nd, struct path *link,
clear_delayed_call(&last->done);
nd->link_inode = inode;
last->seq = seq;
- return 1;
+ return get_link(nd);
}

enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
@@ -1864,11 +1864,7 @@ static const char *step_into(struct nameidata *nd, int flags,
if (read_seqcount_retry(&path.dentry->d_seq, seq))
return ERR_PTR(-ECHILD);
}
- err = pick_link(nd, &path, inode, seq);
- if (err > 0)
- return get_link(nd);
- else
- return ERR_PTR(err);
+ return pick_link(nd, &path, inode, seq);
}

static const char *walk_component(struct nameidata *nd, int flags)
--
2.20.1

2020-01-19 03:27:22

by Al Viro

Subject: [PATCH 7/9] finally fold get_link() into pick_link()

From: Al Viro <[email protected]>

kill nd->link_inode, while we are at it

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 135 ++++++++++++++++++++++++-----------------------------
1 file changed, 61 insertions(+), 74 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index adb573e0f424..40263f89a54f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -503,7 +503,6 @@ struct nameidata {
} *stack, internal[EMBEDDED_LEVELS];
struct filename *name;
struct nameidata *saved;
- struct inode *link_inode;
unsigned root_seq;
int dfd;
} __randomize_layout;
@@ -962,9 +961,8 @@ int sysctl_protected_regular __read_mostly;
*
* Returns 0 if following the symlink is allowed, -ve on error.
*/
-static inline int may_follow_link(struct nameidata *nd)
+static inline int may_follow_link(struct nameidata *nd, const struct inode *inode)
{
- const struct inode *inode;
const struct inode *parent;
kuid_t puid;

@@ -972,7 +970,6 @@ static inline int may_follow_link(struct nameidata *nd)
return 0;

/* Allowed if owner and follower match. */
- inode = nd->link_inode;
if (uid_eq(current_cred()->fsuid, inode->i_uid))
return 0;

@@ -1105,73 +1102,6 @@ static int may_create_in_sticky(struct dentry * const dir,
return 0;
}

-static __always_inline
-const char *get_link(struct nameidata *nd)
-{
- struct saved *last = nd->stack + nd->depth - 1;
- struct dentry *dentry = last->link.dentry;
- struct inode *inode = nd->link_inode;
- int error;
- const char *res;
-
- if (!(nd->flags & LOOKUP_PARENT)) {
- error = may_follow_link(nd);
- if (unlikely(error))
- return ERR_PTR(error);
- }
-
- if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
- return ERR_PTR(-ELOOP);
-
- if (!(nd->flags & LOOKUP_RCU)) {
- touch_atime(&last->link);
- cond_resched();
- } else if (atime_needs_update(&last->link, inode)) {
- if (unlikely(unlazy_walk(nd)))
- return ERR_PTR(-ECHILD);
- touch_atime(&last->link);
- }
-
- error = security_inode_follow_link(dentry, inode,
- nd->flags & LOOKUP_RCU);
- if (unlikely(error))
- return ERR_PTR(error);
-
- nd->last_type = LAST_BIND;
- res = READ_ONCE(inode->i_link);
- if (!res) {
- const char * (*get)(struct dentry *, struct inode *,
- struct delayed_call *);
- get = inode->i_op->get_link;
- if (nd->flags & LOOKUP_RCU) {
- res = get(NULL, inode, &last->done);
- if (res == ERR_PTR(-ECHILD)) {
- if (unlikely(unlazy_walk(nd)))
- return ERR_PTR(-ECHILD);
- res = get(dentry, inode, &last->done);
- }
- } else {
- res = get(dentry, inode, &last->done);
- }
- if (!res)
- goto all_done;
- if (IS_ERR(res))
- return res;
- }
- if (*res == '/') {
- error = nd_jump_root(nd);
- if (unlikely(error))
- return ERR_PTR(error);
- while (unlikely(*++res == '/'))
- ;
- }
- if (*res)
- return res;
-all_done: // pure jump
- put_link(nd);
- return NULL;
-}
-
/*
* follow_up - Find the mountpoint of path's vfsmount
*
@@ -1795,8 +1725,10 @@ static inline int handle_dots(struct nameidata *nd, int type)
static const char *pick_link(struct nameidata *nd, struct path *link,
struct inode *inode, unsigned seq)
{
- int error;
struct saved *last;
+ const char *res;
+ int error;
+
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
path_to_nameidata(link, nd);
return ERR_PTR(-ELOOP);
@@ -1827,9 +1759,64 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
last = nd->stack + nd->depth++;
last->link = *link;
clear_delayed_call(&last->done);
- nd->link_inode = inode;
last->seq = seq;
- return get_link(nd);
+
+ if (!(nd->flags & LOOKUP_PARENT)) {
+ error = may_follow_link(nd, inode);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ }
+
+ if (unlikely(nd->flags & LOOKUP_NO_SYMLINKS))
+ return ERR_PTR(-ELOOP);
+
+ if (!(nd->flags & LOOKUP_RCU)) {
+ touch_atime(&last->link);
+ cond_resched();
+ } else if (atime_needs_update(&last->link, inode)) {
+ if (unlikely(unlazy_walk(nd)))
+ return ERR_PTR(-ECHILD);
+ touch_atime(&last->link);
+ }
+
+ error = security_inode_follow_link(link->dentry, inode,
+ nd->flags & LOOKUP_RCU);
+ if (unlikely(error))
+ return ERR_PTR(error);
+
+ nd->last_type = LAST_BIND;
+ res = READ_ONCE(inode->i_link);
+ if (!res) {
+ const char * (*get)(struct dentry *, struct inode *,
+ struct delayed_call *);
+ get = inode->i_op->get_link;
+ if (nd->flags & LOOKUP_RCU) {
+ res = get(NULL, inode, &last->done);
+ if (res == ERR_PTR(-ECHILD)) {
+ if (unlikely(unlazy_walk(nd)))
+ return ERR_PTR(-ECHILD);
+ res = get(link->dentry, inode, &last->done);
+ }
+ } else {
+ res = get(link->dentry, inode, &last->done);
+ }
+ if (!res)
+ goto all_done;
+ if (IS_ERR(res))
+ return res;
+ }
+ if (*res == '/') {
+ error = nd_jump_root(nd);
+ if (unlikely(error))
+ return ERR_PTR(error);
+ while (unlikely(*++res == '/'))
+ ;
+ }
+ if (*res)
+ return res;
+all_done: // pure jump
+ put_link(nd);
+ return NULL;
}

enum {WALK_FOLLOW = 1, WALK_MORE = 2, WALK_NOFOLLOW = 4};
--
2.20.1
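
The tail of the folded pick_link() decides what the caller gets back from a symlink body: an absolute body means a jump to root with the leading slashes skipped, and an empty remainder means there is nothing left to traverse. A minimal user-space sketch of just that string handling follows. strip_link_body() and its jump_root out-parameter are hypothetical names used for illustration only, not kernel API, and the sketch leaves out the RCU, atime and security checks entirely.

```c
#include <stddef.h>

/*
 * Hypothetical user-space model of the end of pick_link().
 * body      - symlink body, or NULL when ->get_link() has already
 *             moved the walk (procfs-style links)
 * jump_root - set to 1 when the body was absolute (models nd_jump_root())
 * Returns the relative path left to traverse, or NULL if none is left.
 */
static const char *strip_link_body(const char *body, int *jump_root)
{
	*jump_root = 0;
	if (body == NULL)
		return NULL;
	if (*body == '/') {
		*jump_root = 1;
		while (*++body == '/')
			;	/* skip any further leading slashes */
	}
	return *body ? body : NULL;
}
```

With a body of "/proc/self/fd/0" this yields "proc/self/fd/0" with jump_root set; a body of "/" yields NULL, which matches the all_done path in the patch.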

2020-01-19 03:27:40

by Al Viro

[permalink] [raw]
Subject: [PATCH 8/9] massage __follow_mount_rcu() a bit

From: Al Viro <[email protected]>

make the loop more similar to that in follow_managed(), with
explicit tracking of flags, etc.

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 70 +++++++++++++++++++++++++++---------------------------
1 file changed, 35 insertions(+), 35 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 40263f89a54f..310c5ccddf42 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1268,12 +1268,6 @@ int follow_down_one(struct path *path)
}
EXPORT_SYMBOL(follow_down_one);

-static inline int managed_dentry_rcu(const struct path *path)
-{
- return (path->dentry->d_flags & DCACHE_MANAGE_TRANSIT) ?
- path->dentry->d_op->d_manage(path, true) : 0;
-}
-
/*
* Try to skip to top of mountpoint pile in rcuwalk mode. Fail if
* we meet a managed dentry that would need blocking.
@@ -1281,43 +1275,49 @@ static inline int managed_dentry_rcu(const struct path *path)
static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
struct inode **inode, unsigned *seqp)
{
+ struct dentry *dentry = path->dentry;
+ unsigned int flags = dentry->d_flags;
+
+ if (likely(!(flags & DCACHE_MANAGED_DENTRY)))
+ return true;
+
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ return false;
+
for (;;) {
- struct mount *mounted;
/*
* Don't forget we might have a non-mountpoint managed dentry
* that wants to block transit.
*/
- switch (managed_dentry_rcu(path)) {
- case -ECHILD:
- default:
- return false;
- case -EISDIR:
- return true;
- case 0:
- break;
+ if (unlikely(flags & DCACHE_MANAGE_TRANSIT)) {
+ int res = dentry->d_op->d_manage(path, true);
+ if (res)
+ return res == -EISDIR;
+ flags = dentry->d_flags;
}

- if (!d_mountpoint(path->dentry))
- return !(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
-
- mounted = __lookup_mnt(path->mnt, path->dentry);
- if (!mounted)
- break;
- if (unlikely(nd->flags & LOOKUP_NO_XDEV))
- return false;
- path->mnt = &mounted->mnt;
- path->dentry = mounted->mnt.mnt_root;
- nd->flags |= LOOKUP_JUMPED;
- *seqp = read_seqcount_begin(&path->dentry->d_seq);
- /*
- * Update the inode too. We don't need to re-check the
- * dentry sequence number here after this d_inode read,
- * because a mount-point is always pinned.
- */
- *inode = path->dentry->d_inode;
+ if (flags & DCACHE_MOUNTED) {
+ struct mount *mounted = __lookup_mnt(path->mnt, dentry);
+ if (mounted) {
+ path->mnt = &mounted->mnt;
+ dentry = path->dentry = mounted->mnt.mnt_root;
+ nd->flags |= LOOKUP_JUMPED;
+ *seqp = read_seqcount_begin(&dentry->d_seq);
+ *inode = dentry->d_inode;
+ /*
+ * We don't need to re-check ->d_seq after this
+ * ->d_inode read - there will be an RCU delay
+ * between mount hash removal and ->mnt_root
+ * becoming unpinned.
+ */
+ flags = dentry->d_flags;
+ continue;
+ }
+ if (read_seqretry(&mount_lock, nd->m_seq))
+ return false;
+ }
+ return !(flags & DCACHE_NEED_AUTOMOUNT);
}
- return !read_seqretry(&mount_lock, nd->m_seq) &&
- !(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
}

static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
--
2.20.1
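
The patch replaces the switch on the ->d_manage() result with "if (res) return res == -EISDIR;". The two forms should map every return value to the same outcome; the sketch below models both as pure functions so that can be checked in user space. The enum and the function names are hypothetical and exist only for this comparison.

```c
#include <errno.h>

/* Hypothetical outcome of one transit check in the RCU mount walk. */
enum transit {
	TRANSIT_CONTINUE,	/* keep looking at this dentry */
	TRANSIT_STOP_OK,	/* stop, walk may proceed (old "return true") */
	TRANSIT_STOP_FAIL	/* stop, fall back to refwalk (old "return false") */
};

/* Mapping expressed by the removed switch statement. */
static enum transit old_mapping(int res)
{
	switch (res) {
	case 0:
		return TRANSIT_CONTINUE;
	case -EISDIR:
		return TRANSIT_STOP_OK;
	case -ECHILD:
	default:
		return TRANSIT_STOP_FAIL;
	}
}

/* Mapping expressed by "if (res) return res == -EISDIR;". */
static enum transit new_mapping(int res)
{
	if (res)
		return res == -EISDIR ? TRANSIT_STOP_OK : TRANSIT_STOP_FAIL;
	return TRANSIT_CONTINUE;
}
```

Any value other than 0 and -EISDIR, including positive ones, ends the RCU attempt in both versions.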

2020-01-19 03:27:58

by Al Viro

[permalink] [raw]
Subject: [PATCH 9/9] new helper: traverse_mounts()

From: Al Viro <[email protected]>

common guts of follow_down() and follow_managed() taken to a new
helper - traverse_mounts(). The remnants of follow_managed()
are folded into its sole remaining caller (handle_mounts()).
Calling conventions of handle_mounts() slightly sanitized -
instead of the weird "1 for success, -E... for failure" that used
to be imposed by the calling conventions of walk_component() et al.,
we can use the normal "0 for success, -E... for failure".

Signed-off-by: Al Viro <[email protected]>
---
fs/namei.c | 177 ++++++++++++++++++++++-------------------------------
1 file changed, 72 insertions(+), 105 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 310c5ccddf42..d3172e2c7f7f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1167,91 +1167,79 @@ static int follow_automount(struct path *path, int *count, unsigned lookup_flags
}

/*
- * Handle a dentry that is managed in some way.
- * - Flagged for transit management (autofs)
- * - Flagged as mountpoint
- * - Flagged as automount point
- *
- * This may only be called in refwalk mode.
- * On success path->dentry is known positive.
- *
- * Serialization is taken care of in namespace.c
+ * mount traversal - out-of-line part. One note on ->d_flags accesses -
+ * dentries are pinned but not locked here, so negative dentry can go
+ * positive right under us. Use of smp_load_acquire() provides a barrier
+ * sufficient for ->d_inode and ->d_flags consistency.
*/
-static int follow_managed(struct path *path, struct nameidata *nd)
+static int __traverse_mounts(struct path *path, unsigned flags, bool *jumped,
+ int *count, unsigned lookup_flags)
{
- struct vfsmount *mnt = path->mnt; /* held by caller, must be left alone */
- unsigned flags;
+ struct vfsmount *mnt = path->mnt;
bool need_mntput = false;
int ret = 0;

- /* Given that we're not holding a lock here, we retain the value in a
- * local variable for each dentry as we look at it so that we don't see
- * the components of that value change under us */
- while (flags = smp_load_acquire(&path->dentry->d_flags),
- unlikely(flags & DCACHE_MANAGED_DENTRY)) {
+ while (flags & DCACHE_MANAGED_DENTRY) {
/* Allow the filesystem to manage the transit without i_mutex
* being held. */
if (flags & DCACHE_MANAGE_TRANSIT) {
- BUG_ON(!path->dentry->d_op);
- BUG_ON(!path->dentry->d_op->d_manage);
ret = path->dentry->d_op->d_manage(path, false);
flags = smp_load_acquire(&path->dentry->d_flags);
if (ret < 0)
break;
}

- /* Transit to a mounted filesystem. */
- if (flags & DCACHE_MOUNTED) {
+ if (flags & DCACHE_MOUNTED) { // something's mounted on it..
struct vfsmount *mounted = lookup_mnt(path);
- if (mounted) {
+ if (mounted) { // ... in our namespace
dput(path->dentry);
if (need_mntput)
mntput(path->mnt);
path->mnt = mounted;
path->dentry = dget(mounted->mnt_root);
+ // here we know it's positive
+ flags = path->dentry->d_flags;
need_mntput = true;
continue;
}
-
- /* Something is mounted on this dentry in another
- * namespace and/or whatever was mounted there in this
- * namespace got unmounted before lookup_mnt() could
- * get it */
}

- /* Handle an automount point */
- if (flags & DCACHE_NEED_AUTOMOUNT) {
- ret = follow_automount(path, &nd->total_link_count,
- nd->flags);
- if (ret < 0)
- break;
- continue;
- }
+ if (!(flags & DCACHE_NEED_AUTOMOUNT))
+ break;

- /* We didn't change the current path point */
- break;
+ // uncovered automount point
+ ret = follow_automount(path, count, lookup_flags);
+ flags = smp_load_acquire(&path->dentry->d_flags);
+ if (ret < 0)
+ break;
}

- if (need_mntput) {
- if (path->mnt == mnt)
- mntput(path->mnt);
- if (unlikely(nd->flags & LOOKUP_NO_XDEV))
- ret = -EXDEV;
- else
- nd->flags |= LOOKUP_JUMPED;
- }
- if (ret == -EISDIR || !ret)
- ret = 1;
- if (ret > 0 && unlikely(d_flags_negative(flags)))
+ if (ret == -EISDIR)
+ ret = 0;
+ // possible if you race with several mount --move
+ if (need_mntput && path->mnt == mnt)
+ mntput(path->mnt);
+ if (!ret && unlikely(d_flags_negative(flags)))
ret = -ENOENT;
- if (unlikely(ret < 0)) {
- dput(path->dentry);
- if (path->mnt != nd->path.mnt)
- mntput(path->mnt);
- }
+ *jumped = need_mntput;
return ret;
}

+static inline int traverse_mounts(struct path *path, bool *jumped,
+ int *count, unsigned lookup_flags)
+{
+ unsigned flags = smp_load_acquire(&path->dentry->d_flags);
+
+ /* fastpath */
+ if (likely(!(flags & DCACHE_MANAGED_DENTRY))) {
+ *jumped = false;
+ if (unlikely(d_flags_negative(flags)))
+ return -ENOENT;
+ return 0;
+ }
+ return __traverse_mounts(path, flags, jumped, count, lookup_flags);
+}
+
int follow_down_one(struct path *path)
{
struct vfsmount *mounted;
@@ -1268,6 +1256,23 @@ int follow_down_one(struct path *path)
}
EXPORT_SYMBOL(follow_down_one);

+/*
+ * Follow down to the covering mount currently visible to userspace. At each
+ * point, the filesystem owning that dentry may be queried as to whether the
+ * caller is permitted to proceed or not.
+ */
+int follow_down(struct path *path)
+{
+ struct vfsmount *mnt = path->mnt;
+ bool jumped;
+ int ret = traverse_mounts(path, &jumped, NULL, 0);
+
+ if (path->mnt != mnt)
+ mntput(mnt);
+ return ret;
+}
+EXPORT_SYMBOL(follow_down);
+
/*
* Try to skip to top of mountpoint pile in rcuwalk mode. Fail if
* we meet a managed dentry that would need blocking.
@@ -1324,6 +1329,7 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
struct path *path, struct inode **inode,
unsigned int *seqp)
{
+ bool jumped;
int ret;

path->mnt = nd->path.mnt;
@@ -1333,15 +1339,25 @@ static inline int handle_mounts(struct nameidata *nd, struct dentry *dentry,
if (unlikely(!*inode))
return -ENOENT;
if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
- return 1;
+ return 0;
if (unlazy_child(nd, dentry, seq))
return -ECHILD;
// *path might've been clobbered by __follow_mount_rcu()
path->mnt = nd->path.mnt;
path->dentry = dentry;
}
- ret = follow_managed(path, nd);
- if (likely(ret >= 0)) {
+ ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
+ if (jumped) {
+ if (unlikely(nd->flags & LOOKUP_NO_XDEV))
+ ret = -EXDEV;
+ else
+ nd->flags |= LOOKUP_JUMPED;
+ }
+ if (unlikely(ret)) {
+ dput(path->dentry);
+ if (path->mnt != nd->path.mnt)
+ mntput(path->mnt);
+ } else {
*inode = d_backing_inode(path->dentry);
*seqp = 0; /* out of RCU mode, so the value doesn't matter */
}
@@ -1409,55 +1425,6 @@ static int follow_dotdot_rcu(struct nameidata *nd)
return 0;
}

-/*
- * Follow down to the covering mount currently visible to userspace. At each
- * point, the filesystem owning that dentry may be queried as to whether the
- * caller is permitted to proceed or not.
- */
-int follow_down(struct path *path)
-{
- unsigned managed;
- int ret;
-
- while (managed = READ_ONCE(path->dentry->d_flags),
- unlikely(managed & DCACHE_MANAGED_DENTRY)) {
- /* Allow the filesystem to manage the transit without i_mutex
- * being held.
- *
- * We indicate to the filesystem if someone is trying to mount
- * something here. This gives autofs the chance to deny anyone
- * other than its daemon the right to mount on its
- * superstructure.
- *
- * The filesystem may sleep at this point.
- */
- if (managed & DCACHE_MANAGE_TRANSIT) {
- BUG_ON(!path->dentry->d_op);
- BUG_ON(!path->dentry->d_op->d_manage);
- ret = path->dentry->d_op->d_manage(path, false);
- if (ret < 0)
- return ret == -EISDIR ? 0 : ret;
- }
-
- /* Transit to a mounted filesystem. */
- if (managed & DCACHE_MOUNTED) {
- struct vfsmount *mounted = lookup_mnt(path);
- if (!mounted)
- break;
- dput(path->dentry);
- mntput(path->mnt);
- path->mnt = mounted;
- path->dentry = dget(mounted->mnt_root);
- continue;
- }
-
- /* Don't handle automount points here */
- break;
- }
- return 0;
-}
-EXPORT_SYMBOL(follow_down);
-
/*
* Skip to top of mountpoint pile in refwalk mode for follow_dotdot()
*/
--
2.20.1

2020-01-19 14:35:08

by Ian Kent

[permalink] [raw]
Subject: Re: [RFC][PATCHSET][CFT] pathwalk cleanups and fixes

On Sun, 2020-01-19 at 03:14 +0000, Al Viro wrote:
> OK, vfs.git #work.namei seems to survive xfstests. I think
> it cleans the things quite a bit, but it obviously needs more
> review and testing.
>
> Review and testing would be _very_ welcome; it does a lot
> of massage, so there had been plenty of opportunities to fuck up
> and fail to spot that. The same goes for profiling - it doesn't
> seem to slow the things down, but that needs to be verified.

I have run my usual tests (the second run of my submount-test is still
going) and they have run through fine.

I will spend what time I can looking through the series tomorrow, but
will probably need to complete that when I return from my trip to
Albany (Western Australia) some time on Friday.

>
> It does include #work.openat2. Topology: 17 commits, followed
> by clean merge with #work.openat2, followed by 9 followups. The
> part in #work.openat2 is as posted by Aleksa; I can repost it, but
> I don't see much point. Description of the rest follows; patches
> themselves will be in followups.
>
> part 1: follow_automount() cleanups and fixes.
>
> Quite a bit of that function had been about working around the
> wrong calling conventions of finish_automount(). The problem is that
> finish_automount() misuses the primitive intended for mount(2) and
> friends, where we want to mount on top of the pile, even if something
> has managed to add to that while we'd been trying to lock the
> namespace. For automount that's not the right thing to do - there we
> want to discard whatever it was going to attach and just cross into
> what got mounted there in the meanwhile (most likely - the results of
> the same automount triggered by somebody else). Current mainline
> kinda-sorta manages to do that, but it's unreliable and very
> convoluted. Much simpler approach is to stop using lock_mount() in
> finish_automount() and have it bail out if something turns out to have
> been mounted on top where we wanted to attach. That allows to get rid
> of a lot of PITA in the caller. Another simplification comes from not
> trying to cross into the results of automount - simply ride through
> the next iteration of the loop and let it move into overmount.
>
> Another thing in the same series is divorcing follow_automount()
> from nameidata; that'll play later when we get to unifying
> follow_down() with the guts of follow_managed().
>
> 4 commits, the second one fixes a hard-to-hit race. The first
> is a prereq for it.
>
> 1/17 do_add_mount(): lift lock_mount/unlock_mount into callers
> 2/17 fix automount/automount race properly
> 3/17 follow_automount(): get rid of dead^Wstillborn code
> 4/17 follow_automount() doesn't need the entire nameidata
>
> part 2: unifying mount traversals in pathwalk.
>
> Handling of mount traversal (follow_managed()) is currently called
> in a bunch of places. Each of them is shortly followed by a call of
> step_into() or an open-coded equivalent thereof. However, the
> locations of those step_into() calls are far from preceding
> follow_managed(); moreover, that preceding call might happen on
> different paths that converge to given step_into() call. It's harder
> to analyse than it should be (especially when it comes to liveness
> analysis) and it forces rather ugly calling conventions on
> lookup_fast()/atomic_open()/lookup_open(). The series below massages
> the code to the point when the calls of follow_managed() (and
> __follow_mount_rcu()) move into the beginning of step_into().
>
> 5/17 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
> gets EEXIST handling in do_last() past the step_into() call there.
> 6/17 handle_mounts(): start building a sane wrapper for follow_managed()
> rather than mangling follow_managed() itself (and creating conflicts
> with openat2 series), add a wrapper that will absorb the required
> interface changes.
> 7/17 atomic_open(): saner calling conventions (return dentry on success)
> struct path passed to it is pure out parameter; only dentry part
> ever varies, though - mnt is always nd->path.mnt. Just return
> the dentry on success, and ERR_PTR(-E...) on failure.
> 8/17 lookup_open(): saner calling conventions (return dentry on success)
> propagate the same change one level up the call chain.
> 9/17 do_last(): collapse the call of path_to_nameidata()
> struct path filled in lookup_open() call is eventually given to
> handle_mounts(); the only use it has before that is path_to_nameidata()
> call in "->atomic_open() has actually opened it" case, and there
> path_to_nameidata() is an overkill - we are guaranteed to replace
> only nd->path.dentry. So have the struct path filled only immediately
> prior to handle_mounts().
> 10/17 handle_mounts(): pass dentry in, turn path into a pure out argument
> now all callers of handle_mounts() are directly preceded by filling
> struct path it gets. path->mnt is nd->path.mnt in all cases, so we can
> pass just the dentry instead and fill path in handle_mounts() itself.
> Some boilerplate gone, path is pure out argument of handle_mounts()
> now.
> 11/17 lookup_fast(): consolidate the RCU success case
> massage to gather what will become an RCU case equivalent of
> handle_mounts(); basically, that's what we do if revalidate succeeds
> in RCU case of lookup_fast(), including unlazy and fallback to
> handle_mounts() if __follow_mount_rcu() says "it's too tricky".
> 12/17 teach handle_mounts() to handle RCU mode
> ... and take that into handle_mounts() itself. The other caller of
> __follow_mount_rcu() is fine with the same fallback (it just didn't
> bother since it's in the very beginning of pathwalk), switched to
> handle_mounts() as well.
> 13/17 lookup_fast(): take mount traversal into callers
> Now we are getting somewhere - both RCU and non-RCU success cases of
> lookup_fast() are ended with the same return handle_mounts(...);
> move that to the callers - there it will merge with the identical calls
> that had been on the paths where we had to do slow lookups.
> lookup_fast() returns dentry now.
> 14/17 new step_into() flag: WALK_NOFOLLOW
> use step_into() instead of open-coding it in handle_lookup_down().
> Add a flag for "don't follow symlinks regardless of LOOKUP_FOLLOW" for
> that (and eventually, I hope, for .. handling).
> Now *all* calls of handle_mounts() and step_into() are right next to
> each other.
> 15/17 fold handle_mounts() into step_into()
> ... and we can move the call of handle_mounts() into step_into(),
> getting a slightly saner calling conventions out of that.
> 16/17 LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()
> another payoff from 14/17 - we can teach path_lookupat() to do
> what path_mountpointat() used to. And kill the latter, along with
> its wrappers.
> 17/17 expand the only remaining call of path_lookup_conditional()
> minor cleanup - RIP path_lookup_conditional(). Only one caller left.
>
> At that point we run out of things that can be done without textual
> conflicts with openat2 series. Changes so far:
> * mount traversal is taken into step_into().
> * lookup_fast(), atomic_open() and lookup_open() calling conventions
> are slightly changed. All of them return dentry now, instead of
> returning an int and filling struct path on success. For lookup_fast()
> the old "0 for cache miss, 1 for cache hit" is replaced with "NULL
> stands for cache miss, dentry - for hit".
> * step_into() can be called in RCU mode as well. Takes nameidata,
> WALK_... flags, dentry and, in RCU case, corresponding inode and seq
> value. Handles mount traversals, decides whether it's a symlink to be
> followed. Error => returns -E...; symlink to follow => returns 1, puts
> symlink on stack; non-symlink or symlink not to follow => returns 0,
> moves nd->path to new location.
> * LOOKUP_MOUNTPOINT introduced; user_path_mountpoint_at() and friends
> became calls of user_path_at() et al. with LOOKUP_MOUNTPOINT in flags.
>
> Next comes the merge with Aleksa's openat2 patchset; everything up to
> that point had been non-conflicting with it. That patchset has been
> posted earlier; it's in #work.openat2. The next series comes on top of
> the merge.
>
> part 3: untangling the symlink handling.
>
> Right now when we decide to follow a symlink it happens this way:
> * step_into() decides that it has been given a symlink that needs to
> be followed.
> * it calls pick_link(), which pushes the symlink on stack and
> returns 1 on success / -E... on error. Symlink's mount/dentry/seq is
> stored on stack and the inode is stashed in nd->link_inode.
> * step_into() passes that 1 to its callers, which proceed to pass it
> up the call chain for several layers. In all cases we get to get_link()
> call shortly afterwards.
> * get_link() is called, picks the inode stashed in nd->link_inode
> by the pick_link(), does some checks, touches the atime, etc.
> * get_link() either picks the link body out of inode or calls
> ->get_link(). If it's an absolute symlink, we move to the root and
> return the relative portion of the body; if it's a relative one - just
> return the body. If it's a procfs-style one, the call of nd_jump_link()
> has been made and we'd moved to whatever location is desired. And
> return NULL, same as we do for symlink to "/".
> * the caller proceeds to deal with the string returned to it.
>
> The sequence is the same in all cases (nested symlink, trailing
> symlink on lookup, trailing symlink on open), but its pieces are not
> close to each other and the bit between the call of pick_link() and
> (inevitable) call of get_link() afterwards is not easy to follow.
> Moreover, a bunch of functions (walk_component/lookup_last/do_last)
> ends up with the same conventions for return values as step_into().
> And those conventions (see above) are not pretty - 0/1/-E... is asking
> for mistakes, especially when returned 1 is used only to direct control
> flow on a rather twisted way to matching get_link() call. And that path
> can be seriously twisted. E.g. when we are trying to open /dev/stdin,
> we get the following sequence:
> * path_init() has put us into root and returned "/dev/stdin"
> * link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> * we call do_last(), which sees that we have LAST_NORM and calls
> lookup_fast(). Let's assume that everything is in dcache; we get the
> dentry of /dev/stdin and proceed to finish_lookup:, where we call
> step_into()
> * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
> damn thing. Into the stack it goes and we return 1.
> * do_last() sees 1 and returns it.
> * trailing_symlink() is called (in the top-level loop) and it
> calls get_link(). OK, we get "/proc/self/fd/0" for body, move to
> root again and return "proc/self/fd/0".
> * link_path_walk() is given that string, eventually leading us into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> * do_last() is called, and similar to the previous case we
> eventually reach the call of step_into() with dentry of
> /proc/self/fd/0.
> * _now_ we can discard /dev/stdin from the stack (we'd been
> using its body until now). It's dropped (from step_into()) and we get
> to look at what we'd been given. A symlink to follow, so on the stack
> it goes and we return 1.
> * again, do_last() passes 1 to caller
> * trailing_symlink() is called and calls get_link().
> * this time it's a procfs symlink and its ->get_link() method
> moves us to the mount/dentry of our stdin. And returns NULL. But the
> fun doesn't stop yet.
> * trailing_symlink() returns "" to the caller
> * link_path_walk() is called on that and does nothing whatsoever.
> * do_last() is called and sees LAST_BIND left by the get_link().
> It calls handle_dots()
> * handle_dots() drops the symlink from stack and returns
> * do_last() *FINALLY* proceeds to the point after its call of
> step_into() (finish_open:) and gets around to opening the damn thing.
>
> Making sense of the control flow through all of that is not fun,
> to put it mildly; debugging anything in that area can be a massive
> PITA, and this example has touched only one of 3 cases. Arguably, the
> worst one, but... Anyway, it turns out that this code can be massaged
> to considerably saner shape - both in terms of control flow and wrt
> calling conventions.
>
> 1/9 merging pick_link() with get_link(), part 1
> prep work: move the "hardening" crap from trailing_symlink() into
> get_link() (conditional on the absence of LOOKUP_PARENT in nd->flags).
> We'll be moving the calls of get_link() around quite a bit through that
> series, and the next step will be to eliminate trailing_symlink().
> 2/9 merging pick_link() with get_link(), part 2
> fold trailing_symlink() into lookup_last() and do_last().
> Now these are returning strings; it's not the final calling conventions,
> but it's almost there. NULL => old 0, we are done. ERR_PTR(-E...) =>
> old -E..., we'd failed. string => old 1, and the string is the symlink
> body to follow. Just as for trailing_symlink(), "/" and procfs ones
> (where get_link() returns NULL) yield "", so the ugly song and dance
> with no-op trip through link_path_walk()/handle_dots() still remains.
> 3/9 merging pick_link() with get_link(), part 3
> elimination of that round-trip. In *all* cases having
> get_link() return NULL on such symlinks means that we'll proceed to
> drop the symlink from stack and get back to the point near that
> get_link() call - basically, where we would be if it hadn't been
> a symlink at all. The path by which we are getting there depends
> upon the call site; the end result is the same in all cases - such
> symlinks (procfs ones and symlink to "/") are fully processed by
> the time get_link() returns, so we could as well drop them from the
> stack right in get_link(). Makes life simpler in terms of control
> flow analysis...
> And now the calling conventions for do_last() and lookup_last()
> have reached the final shape - ERR_PTR(-E...) for error, NULL for
> "we are done", string for "traverse this".
> 4/9 merging pick_link() with get_link(), part 4
> now all calls of walk_component() are followed by the same
> boilerplate - "if it has returned 1, call get_link() and if that
> has returned NULL treat that as if walk_component() has returned 0".
> Eliminate by folding that into walk_component() itself. Now
> walk_component() return value conventions have joined those of
> do_last()/lookup_last().
> 5/9 merging pick_link() with get_link(), part 5
> same as for the previous, only this time the boilerplate
> migrates one level down, into step_into(). Only one caller of
> get_link() left, step_into() has joined the same return value
> conventions.
> 6/9 merging pick_link() with get_link(), part 6
> move that thing into pick_link(). Now all traces of
> "return 1 if we are following a symlink" are gone.
> 7/9 finally fold get_link() into pick_link()
> ta-da - expand get_link() into the only caller. As a side
> benefit, we get rid of stashing the inode in nd->link_inode - it
> was done only to carry that piece of information from pick_link()
> to eventual get_link(). That's not the main benefit, though - the
> control flow became considerably easier to reason about.
>
> For what it's worth, the example above (/dev/stdin) becomes
> * path_init() has put us into root and returned "/dev/stdin"
> * link_path_walk() has eventually reached /dev and left
> <LAST_NORM, "stdin"> in nd->last_type/nd->last
> * we call do_last(), which sees that we have LAST_NORM and calls
> lookup_fast(). Let's assume that everything is in dcache; we get the
> dentry of /dev/stdin and proceed to finish_lookup:, where we call
> step_into()
> * it's a symlink, we have LOOKUP_FOLLOW, so we decide to pick the
> damn thing. On the stack it goes and we get its body. Which is
> "/proc/self/fd/0", so we move to root and return "proc/self/fd/0".
> * do_last() sees non-NULL and returns it - whether it's an error
> or a pathname to traverse, we hadn't reached something we'll be
> opening.
> * link_path_walk() is given that string, eventually leading us into
> /proc/self/fd, with <LAST_NORM, "0"> left as the component to handle.
> * do_last() is called, and similar to the previous case we
> eventually reach the call of step_into() with dentry of
> /proc/self/fd/0.
> * _now_ we can discard /dev/stdin from the stack (we'd been
> using its body until now). It's dropped (from step_into()) and we get
> to look at what we'd been given. A symlink to follow, so on the stack
> it goes. This time it's a procfs symlink and its ->get_link() method
> moves us to the mount/dentry of our stdin. And returns NULL. So we
> drop symlink from stack and return that NULL to caller.
> * that NULL is returned by step_into(), same as if we had just
> moved to a non-symlink.
> * do_last() proceeds to open the damn thing.
>
> part 4. some mount traversal cleanups.
>
> 8/9 massage __follow_mount_rcu() a bit
> make it more similar to non-RCU counterpart
> 9/9 new helper: traverse_mounts()
> the guts of follow_managed() are very similar to
> follow_down(). The calling conventions are different (follow_managed()
> works with nameidata, follow_down() - with standalone struct path),
> but the core loop is pretty much the same in both. Turned that loop
> into a common helper (traverse_mounts()) and since follow_managed()
> becomes a very thin wrapper around it, expand follow_managed() at its
> only call site (in handle_mounts()).
>
> That's where the series stands right now. FWIW, at 5.5-rc1 fs/namei.c
> had been 4867 lines, at the tip of #work.openat2 - 4998, at the
> tip of #work.namei (containing #work.openat2) - 4730... And IMO
> the thing has become considerably easier to follow.
>
> What's more, it might be possible to untangle the control flow in
> do_last() now. Probably a separate series, though - do_last() is
> one hell of a tarpit, so I'm not stepping into it for the rest
> of this cycle...
>

2020-01-30 14:14:39

by Christian Brauner

Subject: Re: [PATCH 01/17] do_add_mount(): lift lock_mount/unlock_mount into callers

On Sun, Jan 19, 2020 at 03:17:13AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> preparation to finish_automount() fix (next commit)
>
> Signed-off-by: Al Viro <[email protected]>

Just a naming nit below.
Acked-by: Christian Brauner <[email protected]>

> ---
> fs/namespace.c | 47 ++++++++++++++++++++++++-----------------------
> 1 file changed, 24 insertions(+), 23 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 2fd0c8bcb8c1..5f0a80f17651 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2697,45 +2697,32 @@ static int do_move_mount_old(struct path *path, const char *old_name)
> /*
> * add a mount into a namespace's mount tree
> */
> -static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
> +static int do_add_mount(struct mount *newmnt, struct mountpoint *mp,
> + struct path *path, int mnt_flags)

Maybe this should now be named do_add_mount_locked() so callers know
that they need to do locking themselves?
But that's bikeshedding...

Christian

2020-01-30 14:36:55

by Christian Brauner

Subject: Re: [PATCH 02/17] fix automount/automount race properly

On Sun, Jan 19, 2020 at 03:17:14AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> Protection against automount/automount races (two threads hitting the same
> referral point at the same time) is based upon do_add_mount() prevention of
> identical overmounts - trying to overmount the root of mounted tree with
> the same tree fails with -EBUSY. It's unreliable (the other thread might've
> mounted something on top of the automount it has triggered) *and* causes
> no end of headache for follow_automount() and its caller, since
> finish_automount() behaves like do_new_mount() - if the mountpoint to be is
> overmounted, it mounts on top what's overmounting it. It's not only wrong
> (we want to go into what's overmounting the automount point and quietly
> discard what we planned to mount there), it introduces the possibility of
> original parent mount getting dropped. That's what 8aef18845266 (VFS: Fix
> vfsmount overput on simultaneous automount) deals with, but it can't do
> anything about the reliability of conflict detection - if something had
> been overmounted the other thread's automount (e.g. that other thread
> having stepped into automount in mount(2)), we don't get that -EBUSY and
> the result is
> referral point under automounted NFS under explicit overmount
> under another copy of automounted NFS
>
> What we need is finish_automount() *NOT* digging into overmounts - if it
> finds one, it should just quietly discard the thing it was asked to mount.
> And don't bother with actually crossing into the results of finish_automount() -
> the same loop that calls follow_automount() will do that just fine on the
> next iteration.
>
> IOW, instead of calling lock_mount() have finish_automount() do it manually,
> _without_ the "move into overmount and retry" part. And leave crossing into
> the results to the caller of follow_automount(), which simplifies it a lot.
>
> Moral: if you end up with a lot of glue working around the calling conventions
> of something, perhaps these calling conventions are simply wrong...
>
> Fixes: 8aef18845266 (VFS: Fix vfsmount overput on simultaneous automount)
> Signed-off-by: Al Viro <[email protected]>

I mean, just reading this is awfully complicated, but the code seems
fine.
Acked-by: Christian Brauner <[email protected]>

> ---
> fs/namei.c | 29 ++++-------------------------
> fs/namespace.c | 41 ++++++++++++++++++++++++++++++++++-------
> 2 files changed, 38 insertions(+), 32 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index d2720dc71d0e..bd036dfdb0d9 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1133,11 +1133,9 @@ EXPORT_SYMBOL(follow_up);
> * - return -EISDIR to tell follow_managed() to stop and return the path we
> * were called with.
> */
> -static int follow_automount(struct path *path, struct nameidata *nd,
> - bool *need_mntput)
> +static int follow_automount(struct path *path, struct nameidata *nd)
> {
> struct vfsmount *mnt;
> - int err;
>
> if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
> return -EREMOTE;
> @@ -1178,29 +1176,10 @@ static int follow_automount(struct path *path, struct nameidata *nd,
> return PTR_ERR(mnt);
> }
>
> - if (!mnt) /* mount collision */
> - return 0;
> -
> - if (!*need_mntput) {
> - /* lock_mount() may release path->mnt on error */
> - mntget(path->mnt);
> - *need_mntput = true;
> - }
> - err = finish_automount(mnt, path);
> -
> - switch (err) {
> - case -EBUSY:
> - /* Someone else made a mount here whilst we were busy */
> + if (!mnt)
> return 0;
> - case 0:
> - path_put(path);
> - path->mnt = mnt;
> - path->dentry = dget(mnt->mnt_root);
> - return 0;
> - default:
> - return err;
> - }
>
> + return finish_automount(mnt, path);
> }
>
> /*
> @@ -1258,7 +1237,7 @@ static int follow_managed(struct path *path, struct nameidata *nd)
>
> /* Handle an automount point */
> if (flags & DCACHE_NEED_AUTOMOUNT) {
> - ret = follow_automount(path, nd, &need_mntput);
> + ret = follow_automount(path, nd);
> if (ret < 0)
> break;
> continue;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 5f0a80f17651..f1817eb5f87d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2823,6 +2823,7 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
>
> int finish_automount(struct vfsmount *m, struct path *path)
> {
> + struct dentry *dentry = path->dentry;
> struct mount *mnt = real_mount(m);
> struct mountpoint *mp;
> int err;
> @@ -2832,21 +2833,47 @@ int finish_automount(struct vfsmount *m, struct path *path)
> BUG_ON(mnt_get_count(mnt) < 2);
>
> if (m->mnt_sb == path->mnt->mnt_sb &&
> - m->mnt_root == path->dentry) {
> + m->mnt_root == dentry) {
> err = -ELOOP;
> - goto fail;
> + goto discard;
> }
>
> - mp = lock_mount(path);
> + /*
> + * we don't want to use lock_mount() - in this case finding something
> + * that overmounts our mountpoint to be means "quietly drop what we've
> + * got", not "try to mount it on top".
> + */
> + inode_lock(dentry->d_inode);
> + if (unlikely(cant_mount(dentry))) {
> + err = -ENOENT;
> + goto discard1;
> + }
> + namespace_lock();
> + rcu_read_lock();
> + if (unlikely(__lookup_mnt(path->mnt, dentry))) {

That means someone has already performed that mount in the meantime, I
take it.

> + rcu_read_unlock();
> + err = 0;
> + goto discard2;
> + }
> + rcu_read_unlock();
> + mp = get_mountpoint(dentry);
> if (IS_ERR(mp)) {
> err = PTR_ERR(mp);
> - goto fail;
> + goto discard2;
> }
> +
> err = do_add_mount(mnt, mp, path, path->mnt->mnt_flags | MNT_SHRINKABLE);
> unlock_mount(mp);
> - if (!err)
> - return 0;
> -fail:
> + if (unlikely(err))
> + goto discard;
> + mntput(m);

Probably being dense here but better safe than sorry: this mntput()
corresponds to the get_mountpoint() above, right?

2020-01-30 14:40:54

by Christian Brauner

Subject: Re: [PATCH 03/17] follow_automount(): get rid of dead^Wstillborn code

On Sun, Jan 19, 2020 at 03:17:15AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> 1) no instances of ->d_automount() have ever made use of the "return
> ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
> a rudiment of plans that got superseded before the thing went into
> the tree. Despite the comment in follow_automount(), autofs has
> never done that.
>
> 2) if there's no ->d_automount() in dentry_operations, filesystems
> should not set DCACHE_NEED_AUTOMOUNT in the first place. None have
> ever done so...
>
> Signed-off-by: Al Viro <[email protected]>

I can't speak to 1) but code seems correct:
Acked-by: Christian Brauner <[email protected]>

2020-01-30 14:47:09

by Christian Brauner

Subject: Re: [PATCH 04/17] follow_automount() doesn't need the entire nameidata

On Sun, Jan 19, 2020 at 03:17:16AM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> only the address of ->total_link_count and the flags
>
> Signed-off-by: Al Viro <[email protected]>
> ---
> fs/namei.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index d30a74a18da9..3b6f60c02f8a 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1133,7 +1133,7 @@ EXPORT_SYMBOL(follow_up);
> * - return -EISDIR to tell follow_managed() to stop and return the path we
> * were called with.
> */
> -static int follow_automount(struct path *path, struct nameidata *nd)
> +static int follow_automount(struct path *path, int *count, unsigned lookup_flags)
> {
> struct dentry *dentry = path->dentry;
>
> @@ -1148,13 +1148,12 @@ static int follow_automount(struct path *path, struct nameidata *nd)
> * as being automount points. These will need the attentions
> * of the daemon to instantiate them before they can be used.
> */
> - if (!(nd->flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
> + if (!(lookup_flags & (LOOKUP_PARENT | LOOKUP_DIRECTORY |
> LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_AUTOMOUNT)) &&
> dentry->d_inode)
> return -EISDIR;
>
> - nd->total_link_count++;
> - if (nd->total_link_count >= 40)
> + if (count && *count++ >= 40)

He, side-effects galore. :)
Isn't this incrementing the address but you want to increment the
counter?
Seems like this should be

if (count && (*count)++ >= 40)

and even then it seems to me not incrementing at all when we have hit
the limit seems more natural?
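
To make the precedence point concrete, here is a minimal userspace sketch, not the kernel code. The two helper names are made up for illustration; only the two expressions are taken from the patch and from the suggested fix. Postfix ++ binds tighter than unary *, so the posted form advances the local pointer and leaves the counter alone.

```c
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the expression as posted:
 * *count++ parses as *(count++), so the counter is read and then the
 * local pointer is advanced; the counter itself never changes.
 */
static bool limit_hit_as_posted(int *count)
{
	return count && *count++ >= 40;
}

/*
 * Hypothetical helper mirroring the suggested fix:
 * (*count)++ reads the counter, compares the old value, and then
 * increments the counter itself.
 */
static bool limit_hit_fixed(int *count)
{
	return count && (*count)++ >= 40;
}
```

With a counter starting at 0, neither helper reports the limit, but only the second one leaves the counter at 1 afterwards.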

Christian

2020-01-30 15:39:55

by Al Viro

Subject: Re: [PATCH 04/17] follow_automount() doesn't need the entire nameidata

On Thu, Jan 30, 2020 at 03:45:20PM +0100, Christian Brauner wrote:
> > - nd->total_link_count++;
> > - if (nd->total_link_count >= 40)
> > + if (count && *count++ >= 40)
>
> He, side-effects galore. :)
> Isn't this incrementing the address but you want to increment the
> counter?
> Seems like this should be
>
> if (count && (*count)++ >= 40)

Nice catch; incidentally, it means that usual testsuites (xfstests,
LTP) are missing the automount loop detection. Hmm...

Something like

export $FOO over nfs4 to localhost

mkdir $FOO/sub
touch $FOO/a
mount $SCRATCH_DEV $FOO/sub
touch $FOO/sub/a
cd $BAR
mkdir nfs
mount -t nfs localhost:$FOO nfs
for i in `seq 40`; do ln -s l`expr $i - 1` l$i; done
for i in `seq 40`; do ln -s m`expr $i - 1` m$i; done
ln -s nfs/sub/a l0
ln -s nfs/a m0
for i in `seq 40`; do
umount nfs/sub 2>/dev/null
cat l$i m$i
done

BTW, the check of pre-increment value is more correct - it's
accidental, but it does give consistency with the normal symlink
following. We do allow up to 40 symlinks over the pathname
resolution, not up to 39.

The thing above should produce
cat: l39: Too many levels of symbolic links
cat: l40: Too many levels of symbolic links
cat: m40: Too many levels of symbolic links

Here l<n> and m<n> go through n + 1 symlinks, ending at
nfs/sub/a and nfs/a resp.; the former does trigger an automount,
the latter does not.

On mainline it actually starts to complain about l38, l39, l40 and m40,
due to that off-by-one in follow_automount().
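
The arithmetic behind those numbers can be checked with a toy userspace model, sketched here under stated assumptions: one shared counter, bumped once per symlink followed and once more when the walk ends by triggering an automount. The names walk_succeeds() and LINK_LIMIT are invented for the sketch and are not kernel identifiers.

```c
#include <stdbool.h>

#define LINK_LIMIT 40	/* stands in for the limit of 40 discussed above */

/*
 * Toy model of one pathname resolution.
 * nr_symlinks: symlinks followed (n + 1 for l<n> and m<n>)
 * automount:   true for the l<n> chains, false for the m<n> chains
 * old_check:   true models "increment, then compare" (mainline at the time),
 *              false models "compare old value, then increment" (the fix)
 */
static bool walk_succeeds(int nr_symlinks, bool automount, bool old_check)
{
	int count = 0;

	/* Symlink following compares the pre-increment value. */
	for (int i = 0; i < nr_symlinks; i++)
		if (count++ >= LINK_LIMIT)
			return false;

	if (!automount)
		return true;

	if (old_check) {
		count++;			/* bump first ... */
		return count < LINK_LIMIT;	/* ... then compare */
	}
	return count++ < LINK_LIMIT;		/* compare, then bump */
}
```

In this model the fixed check first fails at l39 and m40, while the old check already fails at l38, which matches the lists given above.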

2020-01-30 15:56:29

by Al Viro

Subject: Re: [PATCH 04/17] follow_automount() doesn't need the entire nameidata

On Thu, Jan 30, 2020 at 03:38:25PM +0000, Al Viro wrote:
> On Thu, Jan 30, 2020 at 03:45:20PM +0100, Christian Brauner wrote:
> > > - nd->total_link_count++;
> > > - if (nd->total_link_count >= 40)
> > > + if (count && *count++ >= 40)
> >
> > He, side-effects galore. :)
> > Isn't this incrementing the address but you want to increment the
> > counter?
> > Seems like this should be
> >
> > if (count && (*count)++ >= 40)
>
> Nice catch; incidentally, it means that usual testsuites (xfstests,
> LTP) are missing the automount loop detection. Hmm...

Fix folded and pushed (the series in #next.namei now, on top of
#work.openat2 + #fixes)