Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752969AbZA0V41 (ORCPT ); Tue, 27 Jan 2009 16:56:27 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752747AbZA0V4M (ORCPT ); Tue, 27 Jan 2009 16:56:12 -0500 Received: from e34.co.us.ibm.com ([32.97.110.152]:41954 "EHLO e34.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754006AbZA0V4L (ORCPT ); Tue, 27 Jan 2009 16:56:11 -0500 Date: Tue, 27 Jan 2009 15:56:08 -0600 From: "Serge E. Hallyn" To: Andrew Morton Cc: linux-kernel@vger.kernel.org, containers@lists.osdl.org Subject: Re: [PATCH 2/3] ipc namespaces: implement support for posix msqueues Message-ID: <20090127215608.GA13210@us.ibm.com> References: <20090117020248.GA8615@us.ibm.com> <20090117020332.GB8708@us.ibm.com> <20090126152839.b35dbb45.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090126152839.b35dbb45.akpm@linux-foundation.org> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4949 Lines: 127 Quoting Andrew Morton (akpm@linux-foundation.org): > On Fri, 16 Jan 2009 20:03:32 -0600 > "Serge E. Hallyn" wrote: > > > Implement multiple mounts of the mqueue file system, and > > link it to usage of CLONE_NEWIPC. > > > > Each ipc ns has a corresponding mqueuefs superblock. When > > a user does clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the > > unshare will cause an internal mount of a new mqueuefs sb > > linked to the new ipc ns. > > > > When a user does 'mount -t mqueue mqueue /dev/mqueue', he > > mounts the mqueuefs superblock. > > > > Posix message queues can be worked with both through the > > mq_* system calls (see mq_overview(7)), and through the VFS > > through the mqueue mount. Any usage of mq_open() and friends > > will work with the acting task's ipc namespace. Any actions > > through the VFS will work with the mqueuefs in which the > > file was created. So if a user doesn't remount mqueuefs > > after unshare(CLONE_NEWIPC), mq_open("/ab") will not be > > reflected in "ls /dev/mqueue". > > > > If task a mounts mqueue for ipc_ns:1, then clones task b with > > a new ipcns, ipcns:2, and then task a is the last task in > > ipc_ns:1 to exit, then (1) ipc_ns:1 will be freed, (2) it's > > superblock will live on until task b umounts the corresponding > > mqueuefs, and vfs actions will continue to succeed, but (3) > > sb->s_fs_info will be NULL for the sb corresponding to the > > deceased ipc_ns:1. > > I suppose that stuff like this should find its way into a manpage one > day. That'll be fun for someone. Nadia wrote an update for the mq_overview.7 page. But the semantics have changed a bit since then so I'll need to revise that. > > +static int get_sb_single_ns(struct file_system_type *fs_type, > > + int flags, void *data, > > + int (*fill_super)(struct super_block *, void *, int), > > + struct vfsmount *mnt) > > +{ > > + struct super_block *s; > > + int error; > > + > > + s = sget(fs_type, compare_sb_single_ns, set_sb_single_ns, data); > > + if (IS_ERR(s)) > > + return PTR_ERR(s); > > + if (!s->s_root) { > > + s->s_flags = flags; > > + error = fill_super(s, data, flags & MS_SILENT ? 1 : 0); > > + if (error) { > > + up_write(&s->s_umount); > > + deactivate_super(s); > > + return error; > > + } > > + s->s_flags |= MS_ACTIVE; > > + } > > + do_remount_sb(s, flags, data, 0); > > + return simple_set_mnt(mnt, s); > > } > > The above doesn't seem specific to mqueue. Is it in the best place? No, I think there is commonality with devpts and perhaps also proc. Now that devpts namespaces are upstream i should first move it to using a more generic helper, then introduce posixmq namespaces as another user. > > @@ -266,10 +317,12 @@ static void mqueue_delete_inode(struct inode *inode) > > if (user) { > > spin_lock(&mq_lock); > > user->mq_bytes -= mq_bytes; > > - ipc_ns->mq_queues_count--; > > + if (ipc_ns) > > The reader might be wondering why ipc_ns==NULL is an acceptable state, > and what that state actually means. This reader is wondering that. That's actually a central part of how we resolved the dependency between the mqfs mount (pinned by vfs) and mqns (pinned by nsproxy). So the mqns pins its mqfs vfsmount, but the mnt->mnt_sb->s_fs_info is not pinned by the mqns. Rather, s_fs_info is protected by the mq_lock and either non-NULL and valid, or NULL. The mqns takes the mq_lock to set it to NULL, and any reader of s_fs_info must dereference it and pin the mqns under mq_lock. > > + * put_ipc_ns - drop a reference to an ipc namespace. > > + * @ns: the namespace to put > > + * > > + * If this is the last task in the namespace exiting, and > > + * it is dropping the refcount to 0, then it can race with > > + * a task in another ipc namespace but in a mounts namespace > > + * which has this ipcns's mqueuefs mounted, doing some action > > + * with one of the mqueuefs files. That can raise the refcount. > > + * So dropping the refcount, and raising the refcount when > > + * accessing it through the VFS, are protected with mq_lock. > > + * > > + * (Clearly, a task raising the refcount on its own ipc_ns > > + * needn't take mq_lock since it can't race with the last task > > + * in the ipcns exiting). > > + */ > > +void put_ipc_ns(struct ipc_namespace *ns) > > { > > - struct ipc_namespace *ns; > > + if (ns && atomic_dec_and_lock(&ns->count, &mq_lock)) { > > + mq_clear_sbinfo(ns); > > + spin_unlock(&mq_lock); > > + mq_put_mnt(ns); > > + free_ipc_ns(ns); > > + } > > +} > > Why are we supporting calls with a NULL arg here? Hmm, I'm not sure that's necessary. Will look into it. Every other comment I'll fix in the next version. Thanks very much. -serge -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/