From: ebiederm@xmission.com (Eric W. Biederman)
To: Al Viro
Cc: Giuseppe Scrivano, Andrew Morton, LKML, alexander.deucher@amd.com,
	broonie@kernel.org, chris@chris-wilson.co.uk, David Miller,
	deepa.kernel@gmail.com, Greg KH, luc.vanoostenryck@gmail.com,
	lucien xin, Ingo Molnar, Neil Horman, syzkaller-bugs@googlegroups.com,
	Vladislav Yasevich
Subject: Re: [PATCH linux-next] mqueue: fix IPC namespace use-after-free
Date: Tue, 19 Dec 2017 15:49:24 -0600
Message-ID: <8737465qxn.fsf@xmission.com>
In-Reply-To: <20171219201411.GT21978@ZenIV.linux.org.uk> (Al Viro's message of
	"Tue, 19 Dec 2017 20:14:12 +0000")
References: <20171219101440.19736-1-gscrivan@redhat.com>
	<20171219114819.GQ21978@ZenIV.linux.org.uk>
	<20171219153225.GA14771@ZenIV.linux.org.uk>
	<874lomhcwb.fsf@redhat.com>
	<87vah2ftn8.fsf@redhat.com>
	<20171219201411.GT21978@ZenIV.linux.org.uk>

Al Viro writes:

> On Tue, Dec 19, 2017 at 07:40:43PM +0100, Giuseppe Scrivano wrote:
>> Giuseppe Scrivano writes:
>>
>> > The only issue I've seen with my version is that if I do:
>> >
>> > # unshare -im /bin/sh
>> > # mount -t mqueue mqueue /dev/mqueue
>> > # touch /dev/mqueue/foo
>> > # umount /dev/mqueue
>> > # mount -t mqueue mqueue /dev/mqueue
>> >
>> > then /dev/mqueue/foo doesn't exist at this point.  Your patch does not
>> > have this problem and /dev/mqueue/foo is again accessible after the
>> > second mount.
>>
>> although, how much is that of an issue?
>> Is there any other way to delay
>
> You do realize that with your patch you would end up with worse than that -
> mount/touch/umount and now you've got ->mq_sb non-NULL and pointing to freed
> super_block.  With really unpleasant effects when you quit that ipcns...
>
>> the cost of kern_mount_data()?  Most containers have /dev/mqueue mounted
>> but it is not really going to be used.
>
> _What_ cost?  At mount(2) time you are setting the superblock up anyway, so
> what would you be delaying?  kmem_cache_alloc() for struct mount and assignments
> to its fields?  That's noise; if anything, I would expect the main cost with
> a plenty of containers to be in sget() scanning the list of mqueue superblocks.
> And we can get rid of that, while we are at it - to hell with mount_ns(), with
> that approach we can just use mount_nodev() instead.  The logics in
> mq_internal_mount() will deal with multiple instances - if somebody has already
> triggered creation of internal mount, all subsequent calls in that ipcns will
> end up avoiding kern_mount_data() entirely.  And if you have two callers
> racing - sure, you will get two superblocks.  Not for long, though - the first
> one to get to setting ->mq_mnt (serialized on mq_lock) wins, the second loses
> and promptly destroys his vfsmount and superblock.  I seriously suspect that
> variant below would cut down on the cost a whole lot more - as it is, we have
> the total of O(N^2) spent in the loop inside of sget_userns() when we create
> N ipcns and mount in each of those; this patch should cut that to
> O(N)...

If that is where the cost is, is there any point in delaying creating
the internal mount at all?

Eric

> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 10b82338415b..e9870b0cda29 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -325,8 +325,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
>  static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
>  {
>  	struct inode *inode;
> -	struct ipc_namespace *ns = sb->s_fs_info;
> +	struct ipc_namespace *ns = data;
>  
> +	sb->s_fs_info = ns;
>  	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
>  	sb->s_blocksize = PAGE_SIZE;
>  	sb->s_blocksize_bits = PAGE_SHIFT;
> @@ -343,18 +344,44 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
>  	return 0;
>  }
>  
> +static struct file_system_type mqueue_fs_type;
> +/*
> + * Return value is pinned only by reference in ->mq_mnt; it will
> + * live until ipcns dies.  Caller does not need to drop it.
> + */
> +static struct vfsmount *mq_internal_mount(void)
> +{
> +	struct ipc_namespace *ns = current->nsproxy->ipc_ns;
> +	struct vfsmount *m = ns->mq_mnt;
> +	if (m)
> +		return m;
> +	m = kern_mount_data(&mqueue_fs_type, ns);
> +	spin_lock(&mq_lock);
> +	if (unlikely(ns->mq_mnt)) {
> +		spin_unlock(&mq_lock);
> +		if (!IS_ERR(m))
> +			kern_unmount(m);
> +		return ns->mq_mnt;
> +	}
> +	if (!IS_ERR(m))
> +		ns->mq_mnt = m;
> +	spin_unlock(&mq_lock);
> +	return m;
> +}
> +
>  static struct dentry *mqueue_mount(struct file_system_type *fs_type,
>  			 int flags, const char *dev_name,
>  			 void *data)
>  {
> -	struct ipc_namespace *ns;
> -	if (flags & MS_KERNMOUNT) {
> -		ns = data;
> -		data = NULL;
> -	} else {
> -		ns = current->nsproxy->ipc_ns;
> -	}
> -	return mount_ns(fs_type, flags, data, ns, ns->user_ns, mqueue_fill_super);
> +	struct vfsmount *m;
> +	if (flags & MS_KERNMOUNT)
> +		return mount_nodev(fs_type, flags, data, mqueue_fill_super);
> +	m = mq_internal_mount();
> +	if (IS_ERR(m))
> +		return ERR_CAST(m);
> +	atomic_inc(&m->mnt_sb->s_active);
> +	down_write(&m->mnt_sb->s_umount);
> +	return dget(m->mnt_root);
>  }
>  
>  static void init_once(void *foo)
> @@ -743,13 +770,16 @@ static int prepare_open(struct dentry *dentry, int oflag, int ro,
>  static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
>  		      struct mq_attr *attr)
>  {
> -	struct vfsmount *mnt = current->nsproxy->ipc_ns->mq_mnt;
> -	struct dentry *root = mnt->mnt_root;
> +	struct vfsmount *mnt = mq_internal_mount();
> +	struct dentry *root;
>  	struct filename *name;
>  	struct path path;
>  	int fd, error;
>  	int ro;
>  
> +	if (IS_ERR(mnt))
> +		return PTR_ERR(mnt);
> +
>  	audit_mq_open(oflag, mode, attr);
>  
>  	if (IS_ERR(name = getname(u_name)))
> @@ -760,6 +790,7 @@ static int do_mq_open(const char __user *u_name, int oflag, umode_t mode,
>  		goto out_putname;
>  
>  	ro = mnt_want_write(mnt);	/* we'll drop it in any case */
> +	root = mnt->mnt_root;
>  	inode_lock(d_inode(root));
>  	path.dentry = lookup_one_len(name->name, root, strlen(name->name));
>  	if (IS_ERR(path.dentry)) {
> @@ -1534,28 +1565,26 @@ int mq_init_ns(struct ipc_namespace *ns)
>  	ns->mq_msgsize_max = DFLT_MSGSIZEMAX;
>  	ns->mq_msg_default = DFLT_MSG;
>  	ns->mq_msgsize_default = DFLT_MSGSIZE;
> +	ns->mq_mnt = NULL;
>  
> -	ns->mq_mnt = kern_mount_data(&mqueue_fs_type, ns);
> -	if (IS_ERR(ns->mq_mnt)) {
> -		int err = PTR_ERR(ns->mq_mnt);
> -		ns->mq_mnt = NULL;
> -		return err;
> -	}
>  	return 0;
>  }
>  
>  void mq_clear_sbinfo(struct ipc_namespace *ns)
>  {
> -	ns->mq_mnt->mnt_sb->s_fs_info = NULL;
> +	if (ns->mq_mnt)
> +		ns->mq_mnt->mnt_sb->s_fs_info = NULL;
>  }
>  
>  void mq_put_mnt(struct ipc_namespace *ns)
>  {
> -	kern_unmount(ns->mq_mnt);
> +	if (ns->mq_mnt)
> +		kern_unmount(ns->mq_mnt);
>  }
>  
>  static int __init init_mqueue_fs(void)
>  {
> +	struct vfsmount *m;
>  	int error;
>  
>  	mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
> @@ -1577,6 +1606,10 @@ static int __init init_mqueue_fs(void)
>  	if (error)
>  		goto out_filesystem;
>  
> +	m = kern_mount_data(&mqueue_fs_type, &init_ipc_ns);
> +	if (IS_ERR(m))
> +		goto out_filesystem;
> +	init_ipc_ns.mq_mnt = m;
>  	return 0;
>  
>  out_filesystem:
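
[Editorial note: for readers following the race Al describes above, the pattern
in mq_internal_mount() is the common "do the expensive setup outside the lock,
publish under the lock, loser discards its copy" idiom.  Below is a minimal
userspace sketch of that idiom in plain C with pthreads; it is not kernel code,
and the names make_object(), ns_cache and cache_lock are invented here purely to
stand in for kern_mount_data(), ns->mq_mnt and mq_lock.]

/*
 * Illustrative userspace sketch (NOT from the patch above): allocate
 * outside the lock, publish under the lock, and have a racing loser
 * throw its copy away - the analogue of kern_unmount()ing the extra
 * vfsmount when two callers race in mq_internal_mount().
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct object { int id; };

static struct object *ns_cache;          /* stands in for ns->mq_mnt */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for mq_lock */

static struct object *make_object(void)  /* stands in for kern_mount_data() */
{
	struct object *o = malloc(sizeof(*o));
	if (o)
		o->id = rand();
	return o;
}

static struct object *get_internal_object(void)
{
	struct object *o = ns_cache;     /* fast path: already published */

	if (o)
		return o;

	o = make_object();               /* expensive work, done unlocked */

	pthread_mutex_lock(&cache_lock);
	if (ns_cache) {                  /* lost the race: keep the winner's copy */
		pthread_mutex_unlock(&cache_lock);
		free(o);                 /* discard the loser, like kern_unmount() */
		return ns_cache;
	}
	if (o)
		ns_cache = o;            /* won the race: publish for everyone else */
	pthread_mutex_unlock(&cache_lock);
	return o;
}

static void *worker(void *arg)
{
	(void)arg;
	printf("got object %p\n", (void *)get_internal_object());
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;
}

[In the kernel version the unlocked read of ns->mq_mnt works because the pointer
is only ever set once per ipcns, under mq_lock, and torn down only when the
namespace itself goes away; a strict C11 userspace program would want an atomic
load for the fast-path check.]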