mqueue_evict_inode() doesn't access the ipc namespace if it was
already freed. It can happen if in a new IPC namespace the inode was
created without a prior mq_open() which creates the vfsmount used to
access the superblock from mq_clear_sbinfo().
Keep a direct pointer to the superblock used by the inodes so we can
correctly reset the reference to the IPC namespace being destroyed.
Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call
kern_mount_data in new namespaces")
==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:183
[inline]
BUG: KASAN: use-after-free in atomic_read arch/x86/include/asm/atomic.h:27
[inline]
BUG: KASAN: use-after-free in refcount_inc_not_zero+0x16e/0x180
lib/refcount.c:120
Read of size 4 at addr ffff8801c51bb200 by task syzkaller711981/3156
CPU: 1 PID: 3156 Comm: syzkaller711981 Not tainted 4.15.0-rc2-mm1+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
print_address_description+0x73/0x250 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x25b/0x340 mm/kasan/report.c:409
__asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
__read_once_size include/linux/compiler.h:183 [inline]
atomic_read arch/x86/include/asm/atomic.h:27 [inline]
refcount_inc_not_zero+0x16e/0x180 lib/refcount.c:120
refcount_inc+0x15/0x50 lib/refcount.c:153
get_ipc_ns include/linux/ipc_namespace.h:129 [inline]
__get_ns_from_inode ipc/mqueue.c:110 [inline]
get_ns_from_inode ipc/mqueue.c:118 [inline]
mqueue_evict_inode+0x137/0x9c0 ipc/mqueue.c:402
evict+0x481/0x920 fs/inode.c:552
iput_final fs/inode.c:1514 [inline]
iput+0x7b9/0xaf0 fs/inode.c:1541
dentry_unlink_inode+0x4b0/0x5e0 fs/dcache.c:376
__dentry_kill+0x3b7/0x6d0 fs/dcache.c:573
shrink_dentry_list+0x3c5/0xcf0 fs/dcache.c:1020
shrink_dcache_parent+0xba/0x230 fs/dcache.c:1454
do_one_tree+0x15/0x50 fs/dcache.c:1485
shrink_dcache_for_umount+0xbb/0x290 fs/dcache.c:1502
generic_shutdown_super+0xcd/0x540 fs/super.c:424
kill_anon_super fs/super.c:987 [inline]
kill_litter_super+0x72/0x90 fs/super.c:997
deactivate_locked_super+0x88/0xd0 fs/super.c:312
deactivate_super+0x141/0x1b0 fs/super.c:343
cleanup_mnt+0xb2/0x150 fs/namespace.c:1173
__cleanup_mnt+0x16/0x20 fs/namespace.c:1180
task_work_run+0x199/0x270 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x9bb/0x1ae0 kernel/exit.c:869
do_group_exit+0x149/0x400 kernel/exit.c:972
SYSC_exit_group kernel/exit.c:983 [inline]
SyS_exit_group+0x1d/0x20 kernel/exit.c:981
entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x440729
RSP: 002b:00007ffd090ef228 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0030656c69662f2e RCX: 0000000000440729
RDX: 0000000000440729 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00000000006cb018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000401bf0
R13: 0000000000401c80 R14: 0000000000000000 R15: 0000000000000000
Allocated by task 3156:
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3614
kmalloc include/linux/slab.h:516 [inline]
create_ipc_ns ipc/namespace.c:45 [inline]
copy_ipcs+0x1b3/0x520 ipc/namespace.c:96
create_new_namespaces+0x278/0x880 kernel/nsproxy.c:87
unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:206
SYSC_unshare kernel/fork.c:2421 [inline]
SyS_unshare+0x653/0xfa0 kernel/fork.c:2371
entry_SYSCALL_64_fastpath+0x1f/0x96
Freed by task 3156:
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
__cache_free mm/slab.c:3492 [inline]
kfree+0xca/0x250 mm/slab.c:3807
free_ipc_ns ipc/namespace.c:139 [inline]
put_ipc_ns+0x112/0x150 ipc/namespace.c:164
free_nsproxy+0xc0/0x1f0 kernel/nsproxy.c:180
switch_task_namespaces+0x9d/0xc0 kernel/nsproxy.c:229
exit_task_namespaces+0x17/0x20 kernel/nsproxy.c:234
do_exit+0x9b6/0x1ae0 kernel/exit.c:868
do_group_exit+0x149/0x400 kernel/exit.c:972
SYSC_exit_group kernel/exit.c:983 [inline]
SyS_exit_group+0x1d/0x20 kernel/exit.c:981
entry_SYSCALL_64_fastpath+0x1f/0x96
The buggy address belongs to the object at ffff8801c51bb200
which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 0 bytes inside of
2048-byte region [ffff8801c51bb200, ffff8801c51bba00)
The buggy address belongs to the page:
page:000000007764ba6d count:1 mapcount:0 mapping:000000002c36623f index:0x0
compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffff8801c51ba100 0000000000000000 0000000100000003
raw: ffffea000715d320 ffff8801dac01950 ffff8801dac00c40 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8801c51bb100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8801c51bb180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff8801c51bb200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801c51bb280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8801c51bb300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Reported-by: syzbot <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
---
include/linux/ipc_namespace.h | 3 ++-
ipc/mqueue.c | 6 ++++--
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 554e31494f69..29ae2ede7602 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -52,7 +52,8 @@ struct ipc_namespace {
struct notifier_block ipcns_nb;
/* The kern_mount of the mqueuefs sb. We take a ref on it */
- struct vfsmount *mq_mnt;
+ struct vfsmount *mq_mnt;
+ struct super_block *mq_sb;
/* # queues in this ns, protected by mq_lock */
unsigned int mq_queues_count;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 36f177dcb39a..d664c0b0f075 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -341,6 +341,7 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
sb->s_root = d_make_root(inode);
if (!sb->s_root)
return -ENOMEM;
+ ns->mq_sb = sb;
return 0;
}
@@ -1554,6 +1555,7 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount)
ns->mq_msg_max = DFLT_MSGMAX;
ns->mq_msgsize_max = DFLT_MSGSIZEMAX;
ns->mq_msg_default = DFLT_MSG;
+ ns->mq_sb = NULL;
ns->mq_msgsize_default = DFLT_MSGSIZE;
if (!mount)
@@ -1573,8 +1575,8 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount)
void mq_clear_sbinfo(struct ipc_namespace *ns)
{
- if (ns->mq_mnt)
- ns->mq_mnt->mnt_sb->s_fs_info = NULL;
+ if (ns->mq_sb)
+ ns->mq_sb->s_fs_info = NULL;
}
void mq_put_mnt(struct ipc_namespace *ns)
--
2.14.3
On Tue, Dec 19, 2017 at 11:14:40AM +0100, Giuseppe Scrivano wrote:
> mqueue_evict_inode() doesn't access the ipc namespace if it was
> already freed. It can happen if in a new IPC namespace the inode was
> created without a prior mq_open() which creates the vfsmount used to
> access the superblock from mq_clear_sbinfo().
>
> Keep a direct pointer to the superblock used by the inodes so we can
> correctly reset the reference to the IPC namespace being destroyed.
>
> Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call
> kern_mount_data in new namespaces")
And just what will happen in the same scenario if you mount the damn
thing in userland without ever calling mq_open(), touch a file there,
then unmount and then leave the ipc namespace?
On Tue, Dec 19, 2017 at 11:48:19AM +0000, Al Viro wrote:
> On Tue, Dec 19, 2017 at 11:14:40AM +0100, Giuseppe Scrivano wrote:
> > mqueue_evict_inode() doesn't access the ipc namespace if it was
> > already freed. It can happen if in a new IPC namespace the inode was
> > created without a prior mq_open() which creates the vfsmount used to
> > access the superblock from mq_clear_sbinfo().
> >
> > Keep a direct pointer to the superblock used by the inodes so we can
> > correctly reset the reference to the IPC namespace being destroyed.
> >
> > Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call
> > kern_mount_data in new namespaces")
>
> And just what will happen in the same scenario if you mount the damn
> thing in userland without ever calling mq_open(), touch a file there,
> then unmount and then leave the ipc namespace?
FWIW, the real solution would be to have userland mounts trigger the creation
of internal one, same as mq_open() would. Something along these lines
(completely untested, on top of vfs.git#for-next). Care to give it some
beating?
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 10b82338415b..30327e201571 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -343,18 +343,46 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
return 0;
}
+static struct file_system_type mqueue_fs_type;
+/*
+ * Return value is pinned only by reference in ->mq_mnt; it will
+ * live until ipcns dies. Caller does not need to drop it.
+ */
+static struct vfsmount *mq_internal_mount(void)
+{
+ struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+ struct vfsmount *m = ns->mq_mnt;
+ if (m)
+ return m;
+ m = kern_mount_data(&mqueue_fs_type, ns);
+ spin_lock(&mq_lock);
+ if (unlikely(ns->mq_mnt)) {
+ spin_unlock(&mq_lock);
+ if (!IS_ERR(m))
+ kern_unmount(m);
+ return ns->mq_mnt;
+ }
+ if (!IS_ERR(m))
+ ns->mq_mnt = m;
+ spin_unlock(&mq_lock);
+ return m;
+}
+
static struct dentry *mqueue_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
void *data)
{
- struct ipc_namespace *ns;
- if (flags & MS_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = current->nsproxy->ipc_ns;
- }
- return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
+ struct ipc_namespace *ns = data;
+ struct vfsmount *m;
+ if (flags & MS_KERNMOUNT)
+ return mount_ns(fs_type, flags, NULL, ns, ns->user_ns,
+ mqueue_fill_super);
+ m = mq_internal_mount();
+ if (IS_ERR(m))
+ return ERR_CAST(m);
+ atomic_inc(&m->mnt_sb->s_active);
+ down_write(&m->mnt_sb->s_umount);
+ return dget(m->mnt_root);
}
static void init_once(void *foo)
@@ -743,13 +771,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
struct mq_attr *attr)
{
- struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt;
- struct dentry *root = mnt->mnt_root;
+ struct vfsmount *mnt = mq_internal_mount();
+ struct dentry *root;
struct filename *name;
struct path path;
int fd, error;
int ro;
+ if (IS_ERR(mnt))
+ return PTR_ERR(mnt);
+
audit_mq_open(oflag, mode, attr);
if (IS_ERR(name = getname(u_name)))
@@ -760,6 +791,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
goto out_putname;
ro = mnt_want_write(mnt); /* we'll drop it in any case */
+ root = mnt->mnt_root;
inode_lock(d_inode(root));
path.dentry = lookup_one_len(name->name, root, strlen(name->name));
if (IS_ERR(path.dentry)) {
@@ -1535,27 +1567,24 @@ int mq_init_ns(struct ipc_namespace *ns)
ns->mq_msg_default = DFLT_MSG;
ns->mq_msgsize_default = DFLT_MSGSIZE;
- ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
- if (IS_ERR(ns->mq_mnt)) {
- int err = PTR_ERR(ns->mq_mnt);
- ns->mq_mnt = NULL;
- return err;
- }
return 0;
}
void mq_clear_sbinfo(struct ipc_namespace *ns)
{
- ns->mq_mnt->mnt_sb->s_fs_info = NULL;
+ if (ns->mq_mnt)
+ ns->mq_mnt->mnt_sb->s_fs_info = NULL;
}
void mq_put_mnt(struct ipc_namespace *ns)
{
- kern_unmount(ns->mq_mnt);
+ if (ns->mq_mnt)
+ kern_unmount(ns->mq_mnt);
}
static int __init init_mqueue_fs(void)
{
+ struct vfsmount *m;
int error;
mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
@@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
if (error)
goto out_filesystem;
+ m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
+ if (IS_ERR(m))
+ goto out_filesystem;
+ init_ipc_ns.mq_mnt = m;
return 0;
out_filesystem:
On Tue, Dec 19, 2017 at 03:32:25PM +0000, Al Viro wrote:
> + m = mq_internal_mount();
> + if (IS_ERR(m))
> + return ERR_CAST(m);
> + atomic_inc(&m->mnt_sb->s_active);
> + down_write(&m->mnt_sb->s_umount);
> + return dget(m->mnt_root);
Note: this is stripped down mount_subtree(m, ""), of course;
it might make sense to recognize that case and bypass the
create_mnt_ns/vfs_path_lookup/put_mnt_ns business in
mount_subtree() when the relative pathname is empty, replacing
it with path.mnt = mntget(mnt); path.dentry = dget(mnt->mnt_root);
in such case. That'd allow to simply call mount_subtree() here.
It would work as-is, but it's ridiculously heavy for such use
right now.
> static int __init init_mqueue_fs(void)
> {
> + struct vfsmount *m;
> int error;
>
> mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
> @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
> if (error)
> goto out_filesystem;
>
> + m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
> + if (IS_ERR(m))
> + goto out_filesystem;
> + init_ipc_ns.mq_mnt = m;
> return 0;
>
> out_filesystem:
Unrelated issue, but register_filesystem() should be the last thing
module_init() of a filesystem driver does. It's a separate story,
in any case...
On Tue, Dec 19, 2017 at 4:44 PM, Al Viro <[email protected]> wrote:
> On Tue, Dec 19, 2017 at 03:32:25PM +0000, Al Viro wrote:
>> + m = mq_internal_mount();
>> + if (IS_ERR(m))
>> + return ERR_CAST(m);
>> + atomic_inc(&m->mnt_sb->s_active);
>> + down_write(&m->mnt_sb->s_umount);
>> + return dget(m->mnt_root);
>
> Note: this is stripped down mount_subtree(m, ""), of course;
> it might make sense to recognize that case and bypass the
> create_mnt_ns/vfs_path_lookup/put_mnt_ns business in
> mount_subtree() when the relative pathname is empty, replacing
> it with path.mnt = mntget(mnt); path.dentry = dget(mnt->mnt_root);
> in such case. That'd allow to simply call mount_subtree() here.
> It would work as-is, but it's ridiculously heavy for such use
> right now.
>
>> static int __init init_mqueue_fs(void)
>> {
>> + struct vfsmount *m;
>> int error;
>>
>> mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
>> @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
>> if (error)
>> goto out_filesystem;
>>
>> + m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
>> + if (IS_ERR(m))
>> + goto out_filesystem;
>> + init_ipc_ns.mq_mnt = m;
>> return 0;
>>
>> out_filesystem:
>
> Unrelated issue, but register_filesystem() should be the last thing
> module_init() of a filesystem driver does. It's a separate story,
> in any case...
Giuseppe, what report is this?
If there is a reproducer, you can ask syzbot to test a patch.
Al Viro <[email protected]> writes:
> On Tue, Dec 19, 2017 at 11:48:19AM +0000, Al Viro wrote:
>> On Tue, Dec 19, 2017 at 11:14:40AM +0100, Giuseppe Scrivano wrote:
>> > mqueue_evict_inode() doesn't access the ipc namespace if it was
>> > already freed. It can happen if in a new IPC namespace the inode was
>> > created without a prior mq_open() which creates the vfsmount used to
>> > access the superblock from mq_clear_sbinfo().
>> >
>> > Keep a direct pointer to the superblock used by the inodes so we can
>> > correctly reset the reference to the IPC namespace being destroyed.
>> >
>> > Bug introduced with 9c583773d03633 ("ipc, mqueue: lazy call
>> > kern_mount_data in new namespaces")
>>
>> And just what will happen in the same scenario if you mount the damn
>> thing in userland without ever calling mq_open(), touch a file there,
>> then unmount and then leave the ipc namespace?
>
> FWIW, the real solution would be to have userland mounts trigger the creation
> of internal one, same as mq_open() would. Something along these lines
> (completely untested, on top of vfs.git#for-next). Care to give it some
> beating?
thanks for the patch. It seems to work after this minor fixup on top of
it:
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 30327e201571..636989a44fae 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -360,7 +360,7 @@ static struct vfsmount *mq_internal_mount(void)
spin_unlock(&mq_lock);
if (!IS_ERR(m))
kern_unmount(m);
- return ns->mq_mnt;
+ return ns->mq_mnt;
}
if (!IS_ERR(m))
ns->mq_mnt = m;
@@ -1560,6 +1560,7 @@ static struct file_system_type mqueue_fs_type = {
int mq_init_ns(struct ipc_namespace *ns)
{
+ ns->mq_mnt = NULL;
ns->mq_queues_count = 0;
ns->mq_queues_max = DFLT_QUEUESMAX;
ns->mq_msg_max = DFLT_MSGMAX;
The only issue I've seen with my version is that if I do:
# unshare -im /bin/sh
# mount -t mqueue mqueue /dev/mqueue
# touch /dev/mqueue/foo
# umount /dev/mqueue
# mount -t mqueue mqueue /dev/mqueue
then /dev/mqueue/foo doesn't exist at this point. Your patch does not
have this problem and /dev/mqueue/foo is again accessible after the
second mount.
Regards,
Giuseppe
Dmitry Vyukov <[email protected]> writes:
>> Unrelated issue, but register_filesystem() should be the last thing
>> module_init() of a filesystem driver does. It's a separate story,
>> in any case...
>
> Giuseppe, what report is this?
> If there is a reproducer, you can ask syzbot to test a patch.
I have tried locally the reproducer and the issue seems fixed both in
Al's patch and in my version.
In any case, the original issue was:
https://groups.google.com/forum/#!msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ
Thanks,
Giuseppe
Giuseppe Scrivano <[email protected]> writes:
> The only issue I've seen with my version is that if I do:
>
> # unshare -im /bin/sh
> # mount -t mqueue mqueue /dev/mqueue
> # touch /dev/mqueue/foo
> # umount /dev/mqueue
> # mount -t mqueue mqueue /dev/mqueue
>
> then /dev/mqueue/foo doesn't exist at this point. Your patch does not
> have this problem and /dev/mqueue/foo is again accessible after the
> second mount.
although, how much is that of an issue? Is there any other way to delay
the cost of kern_mount_data()? Most containers have /dev/mqueue mounted
but it is not really going to be used.
Would it be possible somehow to postpone it until the first inode is
created?
Thanks,
Giuseppe
On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote:
> Giuseppe Scrivano <[email protected]> writes:
>
> > The only issue I've seen with my version is that if I do:
> >
> > # unshare -im /bin/sh
> > # mount -t mqueue mqueue /dev/mqueue
> > # touch /dev/mqueue/foo
> > # umount /dev/mqueue
> > # mount -t mqueue mqueue /dev/mqueue
> >
> > then /dev/mqueue/foo doesn't exist at this point. Your patch does not
> > have this problem and /dev/mqueue/foo is again accessible after the
> > second mount.
>
> although, how much is that of an issue? Is there any other way to delay
You do realize that with your patch you would end up with worse than that -
mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed
super_block. With really unpleasant effects when you quit that ipcns...
> the cost of kern_mount_data()? Most containers have /dev/mqueue mounted
> but it is not really going to be used.
_What_ cost? At mount(2) time you are setting the superblock up anyway, so
what would you be delaying? kmem_cache_alloc() for struct mount and assignments
to its fields? That's noise; if anything, I would expect the main cost with
a plenty of containers to be in sget() scanning the list of mqueue superblocks.
And we can get rid of that, while we are at it - to hell with mount_ns(), with
that approach we can just use mount_nodev() instead. The logics in
mq_internal_mount() will deal with multiple instances - if somebody has already
triggered creation of internal mount, all subsequent calls in that ipcns will
end up avoiding kern_mount_data() entirely. And if you have two callers
racing - sure, you will get two superblocks. Not for long, though - the first
one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
and prompty destroys his vfsmount and superblock. I seriously suspect that
variant below would cut down on the cost a whole lot more - as it is, we have
the total of O(N^2) spent in the loop inside of sget_userns() when we create
N ipcns and mount in each of those; this patch should cut that to O(N)...
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 10b82338415b..e9870b0cda29 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -325,8 +325,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
{
struct inode *inode;
- struct ipc_namespace *ns = sb->s_fs_info;
+ struct ipc_namespace *ns = data;
+ sb->s_fs_info = ns;
sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
@@ -343,18 +344,44 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
return 0;
}
+static struct file_system_type mqueue_fs_type;
+/*
+ * Return value is pinned only by reference in ->mq_mnt; it will
+ * live until ipcns dies. Caller does not need to drop it.
+ */
+static struct vfsmount *mq_internal_mount(void)
+{
+ struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+ struct vfsmount *m = ns->mq_mnt;
+ if (m)
+ return m;
+ m = kern_mount_data(&mqueue_fs_type, ns);
+ spin_lock(&mq_lock);
+ if (unlikely(ns->mq_mnt)) {
+ spin_unlock(&mq_lock);
+ if (!IS_ERR(m))
+ kern_unmount(m);
+ return ns->mq_mnt;
+ }
+ if (!IS_ERR(m))
+ ns->mq_mnt = m;
+ spin_unlock(&mq_lock);
+ return m;
+}
+
static struct dentry *mqueue_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
void *data)
{
- struct ipc_namespace *ns;
- if (flags & MS_KERNMOUNT) {
- ns = data;
- data = NULL;
- } else {
- ns = current->nsproxy->ipc_ns;
- }
- return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
+ struct vfsmount *m;
+ if (flags & MS_KERNMOUNT)
+ return mount_nodev(fs_type, flags, data, mqueue_fill_super);
+ m = mq_internal_mount();
+ if (IS_ERR(m))
+ return ERR_CAST(m);
+ atomic_inc(&m->mnt_sb->s_active);
+ down_write(&m->mnt_sb->s_umount);
+ return dget(m->mnt_root);
}
static void init_once(void *foo)
@@ -743,13 +770,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
struct mq_attr *attr)
{
- struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt;
- struct dentry *root = mnt->mnt_root;
+ struct vfsmount *mnt = mq_internal_mount();
+ struct dentry *root;
struct filename *name;
struct path path;
int fd, error;
int ro;
+ if (IS_ERR(mnt))
+ return PTR_ERR(mnt);
+
audit_mq_open(oflag, mode, attr);
if (IS_ERR(name = getname(u_name)))
@@ -760,6 +790,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
goto out_putname;
ro = mnt_want_write(mnt); /* we'll drop it in any case */
+ root = mnt->mnt_root;
inode_lock(d_inode(root));
path.dentry = lookup_one_len(name->name, root, strlen(name->name));
if (IS_ERR(path.dentry)) {
@@ -1534,28 +1565,26 @@ int mq_init_ns(struct ipc_namespace *ns)
ns->mq_msgsize_max = DFLT_MSGSIZEMAX;
ns->mq_msg_default = DFLT_MSG;
ns->mq_msgsize_default = DFLT_MSGSIZE;
+ ns->mq_mnt = NULL;
- ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
- if (IS_ERR(ns->mq_mnt)) {
- int err = PTR_ERR(ns->mq_mnt);
- ns->mq_mnt = NULL;
- return err;
- }
return 0;
}
void mq_clear_sbinfo(struct ipc_namespace *ns)
{
- ns->mq_mnt->mnt_sb->s_fs_info = NULL;
+ if (ns->mq_mnt)
+ ns->mq_mnt->mnt_sb->s_fs_info = NULL;
}
void mq_put_mnt(struct ipc_namespace *ns)
{
- kern_unmount(ns->mq_mnt);
+ if (ns->mq_mnt)
+ kern_unmount(ns->mq_mnt);
}
static int __init init_mqueue_fs(void)
{
+ struct vfsmount *m;
int error;
mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
@@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
if (error)
goto out_filesystem;
+ m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
+ if (IS_ERR(m))
+ goto out_filesystem;
+ init_ipc_ns.mq_mnt = m;
return 0;
out_filesystem:
Al Viro <[email protected]> writes:
> On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote:
>> Giuseppe Scrivano <[email protected]> writes:
>>
>> > The only issue I've seen with my version is that if I do:
>> >
>> > # unshare -im /bin/sh
>> > # mount -t mqueue mqueue /dev/mqueue
>> > # touch /dev/mqueue/foo
>> > # umount /dev/mqueue
>> > # mount -t mqueue mqueue /dev/mqueue
>> >
>> > then /dev/mqueue/foo doesn't exist at this point. Your patch does not
>> > have this problem and /dev/mqueue/foo is again accessible after the
>> > second mount.
>>
>> although, how much is that of an issue? Is there any other way to delay
>
> You do realize that with your patch you would end up with worse than that -
> mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed
> super_block. With really unpleasant effects when you quit that ipcns...
>
>> the cost of kern_mount_data()? Most containers have /dev/mqueue mounted
>> but it is not really going to be used.
>
> _What_ cost? At mount(2) time you are setting the superblock up anyway, so
> what would you be delaying? kmem_cache_alloc() for struct mount and assignments
> to its fields? That's noise; if anything, I would expect the main cost with
> a plenty of containers to be in sget() scanning the list of mqueue superblocks.
> And we can get rid of that, while we are at it - to hell with mount_ns(), with
> that approach we can just use mount_nodev() instead. The logics in
> mq_internal_mount() will deal with multiple instances - if somebody has already
> triggered creation of internal mount, all subsequent calls in that ipcns will
> end up avoiding kern_mount_data() entirely. And if you have two callers
> racing - sure, you will get two superblocks. Not for long, though - the first
> one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
> and prompty destroys his vfsmount and superblock. I seriously suspect that
> variant below would cut down on the cost a whole lot more - as it is, we have
> the total of O(N^2) spent in the loop inside of sget_userns() when we create
> N ipcns and mount in each of those; this patch should cut that to
> O(N)...
If that is where the cost is, is there any point in delaying creating
the internal mount at all?
Eric
>
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 10b82338415b..e9870b0cda29 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -325,8 +325,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
> static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
> {
> struct inode *inode;
> - struct ipc_namespace *ns = sb->s_fs_info;
> + struct ipc_namespace *ns = data;
>
> + sb->s_fs_info = ns;
> sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
> sb->s_blocksize = PAGE_SIZE;
> sb->s_blocksize_bits = PAGE_SHIFT;
> @@ -343,18 +344,44 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
> return 0;
> }
>
> +static struct file_system_type mqueue_fs_type;
> +/*
> + * Return value is pinned only by reference in ->mq_mnt; it will
> + * live until ipcns dies. Caller does not need to drop it.
> + */
> +static struct vfsmount *mq_internal_mount(void)
> +{
> + struct ipc_namespace *ns = current->nsproxy->ipc_ns;
> + struct vfsmount *m = ns->mq_mnt;
> + if (m)
> + return m;
> + m = kern_mount_data(&mqueue_fs_type, ns);
> + spin_lock(&mq_lock);
> + if (unlikely(ns->mq_mnt)) {
> + spin_unlock(&mq_lock);
> + if (!IS_ERR(m))
> + kern_unmount(m);
> + return ns->mq_mnt;
> + }
> + if (!IS_ERR(m))
> + ns->mq_mnt = m;
> + spin_unlock(&mq_lock);
> + return m;
> +}
> +
> static struct dentry *mqueue_mount(struct file_system_type *fs_type,
> int flags, const char *dev_name,
> void *data)
> {
> - struct ipc_namespace *ns;
> - if (flags & MS_KERNMOUNT) {
> - ns = data;
> - data = NULL;
> - } else {
> - ns = current->nsproxy->ipc_ns;
> - }
> - return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
> + struct vfsmount *m;
> + if (flags & MS_KERNMOUNT)
> + return mount_nodev(fs_type, flags, data, mqueue_fill_super);
> + m = mq_internal_mount();
> + if (IS_ERR(m))
> + return ERR_CAST(m);
> + atomic_inc(&m->mnt_sb->s_active);
> + down_write(&m->mnt_sb->s_umount);
> + return dget(m->mnt_root);
> }
>
> static void init_once(void *foo)
> @@ -743,13 +770,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
> static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
> struct mq_attr *attr)
> {
> - struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt;
> - struct dentry *root = mnt->mnt_root;
> + struct vfsmount *mnt = mq_internal_mount();
> + struct dentry *root;
> struct filename *name;
> struct path path;
> int fd, error;
> int ro;
>
> + if (IS_ERR(mnt))
> + return PTR_ERR(mnt);
> +
> audit_mq_open(oflag, mode, attr);
>
> if (IS_ERR(name = getname(u_name)))
> @@ -760,6 +790,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
> goto out_putname;
>
> ro = mnt_want_write(mnt); /* we'll drop it in any case */
> + root = mnt->mnt_root;
> inode_lock(d_inode(root));
> path.dentry = lookup_one_len(name->name, root, strlen(name->name));
> if (IS_ERR(path.dentry)) {
> @@ -1534,28 +1565,26 @@ int mq_init_ns(struct ipc_namespace *ns)
> ns->mq_msgsize_max = DFLT_MSGSIZEMAX;
> ns->mq_msg_default = DFLT_MSG;
> ns->mq_msgsize_default = DFLT_MSGSIZE;
> + ns->mq_mnt = NULL;
>
> - ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
> - if (IS_ERR(ns->mq_mnt)) {
> - int err = PTR_ERR(ns->mq_mnt);
> - ns->mq_mnt = NULL;
> - return err;
> - }
> return 0;
> }
>
> void mq_clear_sbinfo(struct ipc_namespace *ns)
> {
> - ns->mq_mnt->mnt_sb->s_fs_info = NULL;
> + if (ns->mq_mnt)
> + ns->mq_mnt->mnt_sb->s_fs_info = NULL;
> }
>
> void mq_put_mnt(struct ipc_namespace *ns)
> {
> - kern_unmount(ns->mq_mnt);
> + if (ns->mq_mnt)
> + kern_unmount(ns->mq_mnt);
> }
>
> static int __init init_mqueue_fs(void)
> {
> + struct vfsmount *m;
> int error;
>
> mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
> @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
> if (error)
> goto out_filesystem;
>
> + m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
> + if (IS_ERR(m))
> + goto out_filesystem;
> + init_ipc_ns.mq_mnt = m;
> return 0;
>
> out_filesystem:
On Tue, Dec 19, 2017 at 03:49:24PM -0600, Eric W. Biederman wrote:
> > what would you be delaying? kmem_cache_alloc() for struct mount and assignments
> > to its fields? That's noise; if anything, I would expect the main cost with
> > a plenty of containers to be in sget() scanning the list of mqueue superblocks.
> > And we can get rid of that, while we are at it - to hell with mount_ns(), with
> > that approach we can just use mount_nodev() instead. The logics in
> > mq_internal_mount() will deal with multiple instances - if somebody has already
> > triggered creation of internal mount, all subsequent calls in that ipcns will
> > end up avoiding kern_mount_data() entirely. And if you have two callers
> > racing - sure, you will get two superblocks. Not for long, though - the first
> > one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
> > and prompty destroys his vfsmount and superblock. I seriously suspect that
> > variant below would cut down on the cost a whole lot more - as it is, we have
> > the total of O(N^2) spent in the loop inside of sget_userns() when we create
> > N ipcns and mount in each of those; this patch should cut that to
> > O(N)...
>
> If that is where the cost is, is there any point in delaying creating
> the internal mount at all?
We won't know without the profiles... Incidentally, is there any point in
using mount_ns() for procfs? Similar scheme (with ->proc_mnt instead of
->mq_mnt, of course) would live with mount_nodev() just fine, and it's
definitely less costly - we don't bother with the loop in sget_userns()
at all that way.
Al Viro <[email protected]> writes:
> On Tue, Dec 19, 2017 at 03:49:24PM -0600, Eric W. Biederman wrote:
>> > what would you be delaying? kmem_cache_alloc() for struct mount and assignments
>> > to its fields? That's noise; if anything, I would expect the main cost with
>> > a plenty of containers to be in sget() scanning the list of mqueue superblocks.
>> > And we can get rid of that, while we are at it - to hell with mount_ns(), with
>> > that approach we can just use mount_nodev() instead. The logics in
>> > mq_internal_mount() will deal with multiple instances - if somebody has already
>> > triggered creation of internal mount, all subsequent calls in that ipcns will
>> > end up avoiding kern_mount_data() entirely. And if you have two callers
>> > racing - sure, you will get two superblocks. Not for long, though - the first
>> > one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
>> > and prompty destroys his vfsmount and superblock. I seriously suspect that
>> > variant below would cut down on the cost a whole lot more - as it is, we have
>> > the total of O(N^2) spent in the loop inside of sget_userns() when we create
>> > N ipcns and mount in each of those; this patch should cut that to
>> > O(N)...
>>
>> If that is where the cost is, is there any point in delaying creating
>> the internal mount at all?
>
> We won't know without the profiles... Incidentally, is there any point in
> using mount_ns() for procfs? Similar scheme (with ->proc_mnt instead of
> ->mq_mnt, of course) would live with mount_nodev() just fine, and it's
> definitely less costly - we don't bother with the loop in sget_userns()
> at all that way.
The mechanism of mqueuefs and proc are different for dealing with a
filesystem that continues to be mounted/referenced after the namespace
exists.
Proc actually takes a reference on the pid namespace so it is easier to
work with. pid_ns_prepare_proc and and pid_ns_release_proc are the
namespace side of that dependency.
So yes we could look at a local cache in the namespace and all
would be well for proc. I don't know what we would use for locking when
we start allowing more that one path to set it.
atmoic_cmpxchg(&proc_mnt, NULL)?
That makes me suspect we could have a common helper that does the work.
I do know that the reason I moved proc to mount_ns is that it had simply
been open coding that function.
Eric
Al Viro <[email protected]> writes:
> On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote:
>> Giuseppe Scrivano <[email protected]> writes:
>>
>> > The only issue I've seen with my version is that if I do:
>> >
>> > # unshare -im /bin/sh
>> > # mount -t mqueue mqueue /dev/mqueue
>> > # touch /dev/mqueue/foo
>> > # umount /dev/mqueue
>> > # mount -t mqueue mqueue /dev/mqueue
>> >
>> > then /dev/mqueue/foo doesn't exist at this point. Your patch does not
>> > have this problem and /dev/mqueue/foo is again accessible after the
>> > second mount.
>>
>> although, how much is that of an issue? Is there any other way to delay
>
> You do realize that with your patch you would end up with worse than that -
> mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed
> super_block. With really unpleasant effects when you quit that ipcns...
>
>> the cost of kern_mount_data()? Most containers have /dev/mqueue mounted
>> but it is not really going to be used.
>
> _What_ cost? At mount(2) time you are setting the superblock up anyway, so
> what would you be delaying? kmem_cache_alloc() for struct mount and assignments
> to its fields? That's noise; if anything, I would expect the main cost with
> a plenty of containers to be in sget() scanning the list of mqueue superblocks.
> And we can get rid of that, while we are at it - to hell with mount_ns(), with
> that approach we can just use mount_nodev() instead. The logics in
> mq_internal_mount() will deal with multiple instances - if somebody has already
> triggered creation of internal mount, all subsequent calls in that ipcns will
> end up avoiding kern_mount_data() entirely. And if you have two callers
> racing - sure, you will get two superblocks. Not for long, though - the first
> one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
> and prompty destroys his vfsmount and superblock. I seriously suspect that
> variant below would cut down on the cost a whole lot more - as it is, we have
> the total of O(N^2) spent in the loop inside of sget_userns() when we create
> N ipcns and mount in each of those; this patch should cut that to O(N)...
Thanks for the explanation. I see what you mean now and how this cost is
inevitable as anyway we need to setup the superblock.
Giuseppe