Subject: Race between "mount" uevent and /proc/mounts?

Hi,

the 2.6.13 and 2.6.14-* kernels seem susceptible to a race condition
between the sending of a "mount" uevent and the actual mount becoming
visible thru /proc/mounts, at least when the kernel is configured
with voluntary preemption.

The following scenario:
- system is using the HAL daemon, configured to monitor kernel uvents
- someone (usually some kind of volume manager in response to
a device hotplug, but could also a manual mount) mounts a filesystem
- "mount" uevent is emitted
- HAL daemon reads the event, then opens and reads /proc/mounts
(in order to determine the corresponding mount point, since the uevent
contains only the sysfs device name), but /proc/mounts does not
(yet) contain the corresponding line

There has been one previous post about this:
http://marc.theaimsgroup.com/?l=linux-kernel&m=112670567427154

Cheers, Roderich


2005-10-25 14:00:43

by Al Viro

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Tue, Oct 25, 2005 at 03:20:10PM +0200, Schupp Roderich (extern) BenQ MD PD SWP 2 CM MCH wrote:
> Hi,
>
> the 2.6.13 and 2.6.14-* kernels seem susceptible to a race condition
> between the sending of a "mount" uevent and the actual mount becoming
> visible thru /proc/mounts, at least when the kernel is configured
> with voluntary preemption.
>
> The following scenario:
> - system is using the HAL daemon, configured to monitor kernel uvents
> - someone (usually some kind of volume manager in response to
> a device hotplug, but could also a manual mount) mounts a filesystem
> - "mount" uevent is emitted

... said event happens to be a piece of junk with ill-defined semantics.

> - HAL daemon reads the event, then opens and reads /proc/mounts

real useful, since
a) we have no idea if mount() is being done in the same namespace
b) we have no idea if mount() actually succeeds
c) even if we manage to find a mountpoint, we have no idea if it
gets e.g. mount --move just as we'd finished reading from /prov/mounts
d) if the goal is to see which devices are held by mounted fs,
you'll miss such things as e.g. external journals.

> (in order to determine the corresponding mount point, since the uevent

*the* corresponding mountpoint? Which one? There might be any number
of those...

2005-10-26 10:27:28

by Sergey Vlasov

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

(sorry for the broken message sent yesterday)

On Tue, 25 Oct 2005 15:00:41 +0100 Al Viro wrote:

> On Tue, Oct 25, 2005 at 03:20:10PM +0200, Schupp Roderich (extern) BenQ MD PD SWP 2 CM MCH wrote:
> > the 2.6.13 and 2.6.14-* kernels seem susceptible to a race condition
> > between the sending of a "mount" uevent and the actual mount becoming
> > visible thru /proc/mounts, at least when the kernel is configured
> > with voluntary preemption.

The "umount" uevent has the same problem - the event is emitted at the
start of umount process, but the umount can really complete much later
(e.g., if there was a lot of cached writes). This is really bad,
because the "umount" uevent cannot be used to tell the user when it is
safe to remove the device.

> > The following scenario:
> > - system is using the HAL daemon, configured to monitor kernel uvents
> > - someone (usually some kind of volume manager in response to
> > a device hotplug, but could also a manual mount) mounts a filesystem
> > - "mount" uevent is emitted
>
> ... said event happens to be a piece of junk with ill-defined semantics.

Hmm, and what should be the proper semantics for such an event?

Currently the "mount" uevent signals that the device is busy (and the
"umount" uevent then should signal that the device is no longer in use,
but that is currently broken).

> > - HAL daemon reads the event, then opens and reads /proc/mounts
>
> real useful, since
> a) we have no idea if mount() is being done in the same namespace
> b) we have no idea if mount() actually succeeds
> c) even if we manage to find a mountpoint, we have no idea if it
> gets e.g. mount --move just as we'd finished reading from /prov/mounts
> d) if the goal is to see which devices are held by mounted fs,
> you'll miss such things as e.g. external journals.
>
> > (in order to determine the corresponding mount point, since the uevent
>
> *the* corresponding mountpoint? Which one? There might be any number
> of those...


Attachments:
(No filename) (1.94 kB)
(No filename) (189.00 B)
Download all attachments

2005-10-26 11:15:10

by Al Viro

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Wed, Oct 26, 2005 at 02:27:10PM +0400, Sergey Vlasov wrote:
> > ... said event happens to be a piece of junk with ill-defined semantics.
>
> Hmm, and what should be the proper semantics for such an event?

> Currently the "mount" uevent signals that the device is busy

But it does not. The thing that is fundamentally wrong about it is
that it doesn't match any of the real objects. It assumes that
* there is such thing as "filesystem of the device"
* there is such thing as "mountpoint of the filesystem"
* that no two filesystems use the same device
* that no filesystem would use more than one device
* that the thing we get on mountpoint is fully determined by fs
mounted there.

All of these assumptions used to be true for v7. Guess what? The world
had changed. _None_ of the above is true anymore.

First of all, the fundamental property of filesystem is its type. And
that's the only universal property these objects have. Everything else
is type-dependent.

*Some* types happen to use one or more block devices. How they use those
depends on the type. E.g. ext3 can span two devices (journal being the
second one).

Some fs types claim devices they use exclusively. Some do not; e.g. stuff
that does online resizing via secondary fs a-la ext2meta will, by design,
coexist with normal fs on the same device.

Each fs has a directory tree. Pieces of these trees can be glued together
into unified tree; the same subtree can be seen in many places, several
different subtrees can be mounted even when the entire tree is not.
Moreover, different processes can see different mount trees. Filesystem
can be active even if not present in mount trees of any processes - it can
be kept alive by e.g. open files on it; that's what happens if we do
umount -l and something in the subtree is still busy.

_That_ is the reality and any reasonable system of events should match it.
The objects are:

* fs types: flat set, depending on the kernel config
* active filesystems: each belongs to fs type, each has associated directory
tree and files in that tree.
* mounts: each maps a subtree of some filesystem.
* mount trees (aka namespaces): trees of mounts, providing a unified directory
tree as seen by processes; they glue together the subtrees from individual
mounts.
* block devices: used by many things in many ways; a lot of active filesystems
happen to use them; the number and kind of use depends on fs type.

Semantics for events depends on which objects you are interested in.
Existing ones do not match _any_ of the real objects and I have no
idea what exactly had been intended for them. I've asked gregkh, but
he didn't remember that either. Apparently they are used by different
people as (bad) approximations to different things. Which doesn't work
well. And until somebody cares to describe what exactly are they trying
to watch the situation obviously won't improve.

2005-10-26 14:34:23

by Kay Sievers

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Wed, Oct 26, 2005 at 12:15:06PM +0100, Al Viro wrote:
> On Wed, Oct 26, 2005 at 02:27:10PM +0400, Sergey Vlasov wrote:
> > > ... said event happens to be a piece of junk with ill-defined semantics.
> >
> > Hmm, and what should be the proper semantics for such an event?
>
> > Currently the "mount" uevent signals that the device is busy
>
> But it does not. The thing that is fundamentally wrong about it is
> that it doesn't match any of the real objects. It assumes that
> * there is such thing as "filesystem of the device"
> * there is such thing as "mountpoint of the filesystem"
> * that no two filesystems use the same device
> * that no filesystem would use more than one device
> * that the thing we get on mountpoint is fully determined by fs
> mounted there.
>
> All of these assumptions used to be true for v7. Guess what? The world
> had changed. _None_ of the above is true anymore.
>
> First of all, the fundamental property of filesystem is its type. And
> that's the only universal property these objects have. Everything else
> is type-dependent.
>
> *Some* types happen to use one or more block devices. How they use those
> depends on the type. E.g. ext3 can span two devices (journal being the
> second one).
>
> Some fs types claim devices they use exclusively. Some do not; e.g. stuff
> that does online resizing via secondary fs a-la ext2meta will, by design,
> coexist with normal fs on the same device.
>
> Each fs has a directory tree. Pieces of these trees can be glued together
> into unified tree; the same subtree can be seen in many places, several
> different subtrees can be mounted even when the entire tree is not.
> Moreover, different processes can see different mount trees. Filesystem
> can be active even if not present in mount trees of any processes - it can
> be kept alive by e.g. open files on it; that's what happens if we do
> umount -l and something in the subtree is still busy.
>
> _That_ is the reality and any reasonable system of events should match it.
> The objects are:
>
> * fs types: flat set, depending on the kernel config
> * active filesystems: each belongs to fs type, each has associated directory
> tree and files in that tree.
> * mounts: each maps a subtree of some filesystem.
> * mount trees (aka namespaces): trees of mounts, providing a unified directory
> tree as seen by processes; they glue together the subtrees from individual
> mounts.
> * block devices: used by many things in many ways; a lot of active filesystems
> happen to use them; the number and kind of use depends on fs type.
>
> Semantics for events depends on which objects you are interested in.
> Existing ones do not match _any_ of the real objects and I have no
> idea what exactly had been intended for them. I've asked gregkh, but
> he didn't remember that either. Apparently they are used by different
> people as (bad) approximations to different things. Which doesn't work
> well. And until somebody cares to describe what exactly are they trying
> to watch the situation obviously won't improve.

They are actually events for claim/release of a block device. As uevents
are bound to kobjects we needed to send these events from an existing device
which is the blockdev itself.

Sure, the event itself, has nothing to do with a filesystem. The names are
like this for historical reasons and "CLAIM/RELEASE" may be less confusing.
The events are used as a trigger to rescan /proc/mounts instead of polling
it constantly.

If you have a better idea where to plug it, or if we better rename it, we
should do that...

Thanks,
Kay

2005-10-26 14:46:08

by Xavier Bestel

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Wed, 2005-10-26 at 16:34, Kay Sievers wrote:

> They are actually events for claim/release of a block device. As uevents
> are bound to kobjects we needed to send these events from an existing device
> which is the blockdev itself.
>
> Sure, the event itself, has nothing to do with a filesystem. The names are
> like this for historical reasons and "CLAIM/RELEASE" may be less confusing.
> The events are used as a trigger to rescan /proc/mounts instead of polling
> it constantly.
>
> If you have a better idea where to plug it, or if we better rename it, we
> should do that...

Make /proc/mount send an inotify event ?


2005-10-26 19:29:01

by Al Viro

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Wed, Oct 26, 2005 at 04:34:17PM +0200, Kay Sievers wrote:
> > Semantics for events depends on which objects you are interested in.
> > Existing ones do not match _any_ of the real objects and I have no
> > idea what exactly had been intended for them. I've asked gregkh, but
> > he didn't remember that either. Apparently they are used by different
> > people as (bad) approximations to different things. Which doesn't work
> > well. And until somebody cares to describe what exactly are they trying
> > to watch the situation obviously won't improve.
>
> They are actually events for claim/release of a block device. As uevents
> are bound to kobjects we needed to send these events from an existing device
> which is the blockdev itself.
>
> Sure, the event itself, has nothing to do with a filesystem. The names are
> like this for historical reasons and "CLAIM/RELEASE" may be less confusing.
> The events are used as a trigger to rescan /proc/mounts instead of polling
> it constantly.

But that makes no sense. /proc/*/mounts changes when mount tree changes.
Which is obviously not an event happening to block devices. Moreover,
changes of mount tree may involve no changes in the set of active filesystems
or be separated in time from such changes by arbitrary intervals.

Looks like seriously wrong assumptions in userland code working with these
events... _IF_ you want to keep track of /proc/*/mounts changes, the obvious
solution would be to implement ->poll() for them. However, if you are
really interested in block devices, keep in mind that
* getting them claimed happens before your event is generated
* eventually the filesystem claiming them becomes active (or doesn't,
if mount fails)
* eventually an active fs may (or may not) become visible in mount
tree.
* not every umount leads to deactivation
* deactivation can happen long after the fs is no longer present in
mount tree
* fs may become visible in mount tree again without being deactivated
and activated again - (mount /dev/foo /mnt; exec </mnt/bar; umount -l /mnt;
sleep 100; mount /dev/foo /tmp/barf) in case of block filesystem will do just
that; fs gets activated, mounted, unmounted and mounted again 100 seconds
later.
* deactivated fs gives up its claim on device(s). Incidentally,
your UMOUNT event is triggered before either thing happens; any amount of
IO on the device(s) can happen after it.

Oh, and there are things other than filesystems that can (and do) claim
block devices.

So what's really going on? If you want to know when device gets busy, you
need events in fs/block_dev.c and no expectation regarding /proc/mounts.
If you want to know when mount tree changes, you need events on attach_mnt/
detach_mnt (and I would seriously suggest ->poll() rather than wanking with
events). If you want something more complex, you might or might not be
SOL, depending on what you are trying to achieve.

2005-11-01 00:28:54

by Kay Sievers

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Wed, Oct 26, 2005 at 08:28:59PM +0100, Al Viro wrote:
> On Wed, Oct 26, 2005 at 04:34:17PM +0200, Kay Sievers wrote:
> > > Semantics for events depends on which objects you are interested in.
> > > Existing ones do not match _any_ of the real objects and I have no
> > > idea what exactly had been intended for them. I've asked gregkh, but
> > > he didn't remember that either. Apparently they are used by different
> > > people as (bad) approximations to different things. Which doesn't work
> > > well. And until somebody cares to describe what exactly are they trying
> > > to watch the situation obviously won't improve.
> >
> > They are actually events for claim/release of a block device. As uevents
> > are bound to kobjects we needed to send these events from an existing device
> > which is the blockdev itself.
> >
> > Sure, the event itself, has nothing to do with a filesystem. The names are
> > like this for historical reasons and "CLAIM/RELEASE" may be less confusing.
> > The events are used as a trigger to rescan /proc/mounts instead of polling
> > it constantly.
>
> But that makes no sense. /proc/*/mounts changes when mount tree changes.
> Which is obviously not an event happening to block devices. Moreover,
> changes of mount tree may involve no changes in the set of active filesystems
> or be separated in time from such changes by arbitrary intervals.
>
> Looks like seriously wrong assumptions in userland code working with these
> events... _IF_ you want to keep track of /proc/*/mounts changes, the obvious
> solution would be to implement ->poll() for them.

Ok, makes sense. The attached seems to work for me. If we can get
something like this, we can remove the superblock claim/release events
completely and just read the events from the /proc/mounts file itself.

Thanks,
Kay


diff --git a/fs/namespace.c b/fs/namespace.c
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -26,6 +26,7 @@
#include <asm/unistd.h>

extern int __init init_rootfs(void);
+DECLARE_WAIT_QUEUE_HEAD(mounts_wait);

#ifdef CONFIG_SYSFS
extern int __init sysfs_init(void);
@@ -120,6 +121,8 @@ static void detach_mnt(struct vfsmount *
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_hash);
old_nd->dentry->d_mounted--;
+ current->namespace->mounts_poll_pending = 1;
+ wake_up_interruptible(&mounts_wait);
}

static void attach_mnt(struct vfsmount *mnt, struct nameidata *nd)
@@ -129,6 +132,8 @@ static void attach_mnt(struct vfsmount *
list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
nd->dentry->d_mounted++;
+ current->namespace->mounts_poll_pending = 1;
+ wake_up_interruptible(&mounts_wait);
}

static struct vfsmount *next_mnt(struct vfsmount *p, struct vfsmount *root)
@@ -1395,6 +1400,7 @@ static void __init init_mount_tree(void)
list_add(&mnt->mnt_list, &namespace->list);
namespace->root = mnt;
mnt->mnt_namespace = namespace;
+ namespace->mounts_poll_pending = 0;

init_task.namespace = namespace;
read_lock(&tasklist_lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -57,6 +57,7 @@
#include <linux/init.h>
#include <linux/file.h>
#include <linux/string.h>
+#include <linux/poll.h>
#include <linux/seq_file.h>
#include <linux/namei.h>
#include <linux/namespace.h>
@@ -660,6 +661,8 @@ static struct file_operations proc_smaps
#endif

extern struct seq_operations mounts_op;
+extern wait_queue_head_t mounts_wait;
+
static int mounts_open(struct inode *inode, struct file *file)
{
struct task_struct *task = proc_task(inode);
@@ -692,10 +695,36 @@ static int mounts_release(struct inode *
return seq_release(inode, file);
}

+static unsigned int mounts_poll(struct file *file, poll_table *wait)
+{
+ struct task_struct *task = proc_task(file->f_dentry->d_inode);
+ struct namespace *namespace;
+ int ret = 0;
+
+ task_lock(task);
+ namespace = task->namespace;
+ if (namespace)
+ get_namespace(namespace);
+ task_unlock(task);
+
+ if (!namespace)
+ return -EINVAL;
+
+ poll_wait(file, &mounts_wait, wait);
+ if (namespace->mounts_poll_pending) {
+ namespace->mounts_poll_pending = 0;
+ ret = POLLIN | POLLRDNORM;
+ }
+ put_namespace(namespace);
+
+ return ret;
+}
+
static struct file_operations proc_mounts_operations = {
.open = mounts_open,
.read = seq_read,
.llseek = seq_lseek,
+ .poll = mounts_poll,
.release = mounts_release,
};

diff --git a/include/linux/namespace.h b/include/linux/namespace.h
--- a/include/linux/namespace.h
+++ b/include/linux/namespace.h
@@ -10,6 +10,7 @@ struct namespace {
struct vfsmount * root;
struct list_head list;
struct rw_semaphore sem;
+ int mounts_poll_pending;
};

extern int copy_namespace(int, struct task_struct *);

2005-11-01 03:58:23

by Kay Sievers

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Tue, Nov 01, 2005 at 01:28:46AM +0100, Kay Sievers wrote:
> On Wed, Oct 26, 2005 at 08:28:59PM +0100, Al Viro wrote:
> > On Wed, Oct 26, 2005 at 04:34:17PM +0200, Kay Sievers wrote:
> > > > Semantics for events depends on which objects you are interested in.
> > > > Existing ones do not match _any_ of the real objects and I have no
> > > > idea what exactly had been intended for them. I've asked gregkh, but
> > > > he didn't remember that either. Apparently they are used by different
> > > > people as (bad) approximations to different things. Which doesn't work
> > > > well. And until somebody cares to describe what exactly are they trying
> > > > to watch the situation obviously won't improve.
> > >
> > > They are actually events for claim/release of a block device. As uevents
> > > are bound to kobjects we needed to send these events from an existing device
> > > which is the blockdev itself.
> > >
> > > Sure, the event itself, has nothing to do with a filesystem. The names are
> > > like this for historical reasons and "CLAIM/RELEASE" may be less confusing.
> > > The events are used as a trigger to rescan /proc/mounts instead of polling
> > > it constantly.
> >
> > But that makes no sense. /proc/*/mounts changes when mount tree changes.
> > Which is obviously not an event happening to block devices. Moreover,
> > changes of mount tree may involve no changes in the set of active filesystems
> > or be separated in time from such changes by arbitrary intervals.
> >
> > Looks like seriously wrong assumptions in userland code working with these
> > events... _IF_ you want to keep track of /proc/*/mounts changes, the obvious
> > solution would be to implement ->poll() for them.
>
> Ok, makes sense. The attached seems to work for me. If we can get
> something like this, we can remove the superblock claim/release events
> completely and just read the events from the /proc/mounts file itself.

New patch. Missed a check for namespace == NULL in detach_mnt().

Thanks,
Kay

diff --git a/fs/namespace.c b/fs/namespace.c
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -26,6 +26,7 @@
#include <asm/unistd.h>

extern int __init init_rootfs(void);
+DECLARE_WAIT_QUEUE_HEAD(mounts_wait);

#ifdef CONFIG_SYSFS
extern int __init sysfs_init(void);
@@ -120,6 +121,10 @@ static void detach_mnt(struct vfsmount *
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_hash);
old_nd->dentry->d_mounted--;
+ if (current->namespace) {
+ current->namespace->mounts_poll_pending = 1;
+ wake_up_interruptible(&mounts_wait);
+ }
}

static void attach_mnt(struct vfsmount *mnt, struct nameidata *nd)
@@ -129,6 +134,8 @@ static void attach_mnt(struct vfsmount *
list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
nd->dentry->d_mounted++;
+ current->namespace->mounts_poll_pending = 1;
+ wake_up_interruptible(&mounts_wait);
}

static struct vfsmount *next_mnt(struct vfsmount *p, struct vfsmount *root)
@@ -1093,6 +1100,7 @@ int copy_namespace(int flags, struct tas
atomic_set(&new_ns->count, 1);
init_rwsem(&new_ns->sem);
INIT_LIST_HEAD(&new_ns->list);
+ new_ns->mounts_poll_pending = 0;

down_write(&tsk->namespace->sem);
/* First pass: copy the tree topology */
@@ -1395,6 +1403,7 @@ static void __init init_mount_tree(void)
list_add(&mnt->mnt_list, &namespace->list);
namespace->root = mnt;
mnt->mnt_namespace = namespace;
+ namespace->mounts_poll_pending = 0;

init_task.namespace = namespace;
read_lock(&tasklist_lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -57,6 +57,7 @@
#include <linux/init.h>
#include <linux/file.h>
#include <linux/string.h>
+#include <linux/poll.h>
#include <linux/seq_file.h>
#include <linux/namei.h>
#include <linux/namespace.h>
@@ -660,6 +661,8 @@ static struct file_operations proc_smaps
#endif

extern struct seq_operations mounts_op;
+extern wait_queue_head_t mounts_wait;
+
static int mounts_open(struct inode *inode, struct file *file)
{
struct task_struct *task = proc_task(inode);
@@ -692,10 +695,36 @@ static int mounts_release(struct inode *
return seq_release(inode, file);
}

+static unsigned int mounts_poll(struct file *file, poll_table *wait)
+{
+ struct task_struct *task = proc_task(file->f_dentry->d_inode);
+ struct namespace *namespace;
+ int ret = 0;
+
+ task_lock(task);
+ namespace = task->namespace;
+ if (namespace)
+ get_namespace(namespace);
+ task_unlock(task);
+
+ if (!namespace)
+ return -EINVAL;
+
+ poll_wait(file, &mounts_wait, wait);
+ if (namespace->mounts_poll_pending) {
+ namespace->mounts_poll_pending = 0;
+ ret = POLLIN | POLLRDNORM;
+ }
+ put_namespace(namespace);
+
+ return ret;
+}
+
static struct file_operations proc_mounts_operations = {
.open = mounts_open,
.read = seq_read,
.llseek = seq_lseek,
+ .poll = mounts_poll,
.release = mounts_release,
};

diff --git a/include/linux/namespace.h b/include/linux/namespace.h
--- a/include/linux/namespace.h
+++ b/include/linux/namespace.h
@@ -10,6 +10,7 @@ struct namespace {
struct vfsmount * root;
struct list_head list;
struct rw_semaphore sem;
+ int mounts_poll_pending;
};

extern int copy_namespace(int, struct task_struct *);

2005-11-01 19:55:13

by Sergey Vlasov

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Tue, Nov 01, 2005 at 04:58:16AM +0100, Kay Sievers wrote:
> On Tue, Nov 01, 2005 at 01:28:46AM +0100, Kay Sievers wrote:
> > Ok, makes sense. The attached seems to work for me. If we can get
> > something like this, we can remove the superblock claim/release events
> > completely and just read the events from the /proc/mounts file itself.

No, we need both events. When you need to tell the user when it is
safe to disconnect the storage device, the event from detach_mnt() is
useless - it happens too early. In fact, even the current way of
sending the event from kill_block_super() is broken, because the event
is generated before generic_shutdown_super() and sync_blockdev(), and
writing out cached data may take some time.

We could try to emit busy/free events from bd_claim() and
bd_release(); this would be triggered by most "interesting" users
(even opens with O_EXCL), but not by things like volume_id.

> New patch. Missed a check for namespace == NULL in detach_mnt().
[skip]
> +static unsigned int mounts_poll(struct file *file, poll_table *wait)
> +{
> + struct task_struct *task = proc_task(file->f_dentry->d_inode);
> + struct namespace *namespace;
> + int ret = 0;
> +
> + task_lock(task);
> + namespace = task->namespace;
> + if (namespace)
> + get_namespace(namespace);
> + task_unlock(task);
> +
> + if (!namespace)
> + return -EINVAL;
> +
> + poll_wait(file, &mounts_wait, wait);
> + if (namespace->mounts_poll_pending) {
> + namespace->mounts_poll_pending = 0;
> + ret = POLLIN | POLLRDNORM;
> + }

This assumes that there will be only one process per namespace which
will call poll() on /proc/mounts. Even though someone may argue that
it is the right approach (have a single process which watches
/proc/mounts and broadcasts updates to other interested processes,
e.g., over dbus), with the above implementation any unprivileged user
can call poll() and interfere with the operation of that designated
process.


Attachments:
(No filename) (1.90 kB)
(No filename) (189.00 B)
Download all attachments

2005-11-01 21:35:32

by Kay Sievers

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Tue, Nov 01, 2005 at 10:54:49PM +0300, Sergey Vlasov wrote:
> On Tue, Nov 01, 2005 at 04:58:16AM +0100, Kay Sievers wrote:
> > On Tue, Nov 01, 2005 at 01:28:46AM +0100, Kay Sievers wrote:
> > > Ok, makes sense. The attached seems to work for me. If we can get
> > > something like this, we can remove the superblock claim/release events
> > > completely and just read the events from the /proc/mounts file itself.
>
> No, we need both events. When you need to tell the user when it is
> safe to disconnect the storage device, the event from detach_mnt() is
> useless - it happens too early. In fact, even the current way of
> sending the event from kill_block_super() is broken, because the event
> is generated before generic_shutdown_super() and sync_blockdev(), and
> writing out cached data may take some time.
>
> We could try to emit busy/free events from bd_claim() and
> bd_release(); this would be triggered by most "interesting" users
> (even opens with O_EXCL), but not by things like volume_id.

Hmm, HAL polls optical drives every 2 seconds with O_EXCL to detect media
changes. You need to do it EXCL, cause otherwise some cd burners fail.

> > New patch. Missed a check for namespace == NULL in detach_mnt().
> [skip]
> > +static unsigned int mounts_poll(struct file *file, poll_table *wait)
> > +{
> > + struct task_struct *task = proc_task(file->f_dentry->d_inode);
> > + struct namespace *namespace;
> > + int ret = 0;
> > +
> > + task_lock(task);
> > + namespace = task->namespace;
> > + if (namespace)
> > + get_namespace(namespace);
> > + task_unlock(task);
> > +
> > + if (!namespace)
> > + return -EINVAL;
> > +
> > + poll_wait(file, &mounts_wait, wait);
> > + if (namespace->mounts_poll_pending) {
> > + namespace->mounts_poll_pending = 0;
> > + ret = POLLIN | POLLRDNORM;
> > + }
>
> This assumes that there will be only one process per namespace which
> will call poll() on /proc/mounts. Even though someone may argue that
> it is the right approach (have a single process which watches
> /proc/mounts and broadcasts updates to other interested processes,
> e.g., over dbus), with the above implementation any unprivileged user
> can call poll() and interfere with the operation of that designated
> process.

Sure, capable(CAP_SYS_ADMIN) could prevent this.

Thanks,
Kay

2005-11-02 13:01:26

by Sergey Vlasov

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Tue, Nov 01, 2005 at 10:35:25PM +0100, Kay Sievers wrote:
> On Tue, Nov 01, 2005 at 10:54:49PM +0300, Sergey Vlasov wrote:
> > On Tue, Nov 01, 2005 at 04:58:16AM +0100, Kay Sievers wrote:
> > > On Tue, Nov 01, 2005 at 01:28:46AM +0100, Kay Sievers wrote:
> > > > Ok, makes sense. The attached seems to work for me. If we can get
> > > > something like this, we can remove the superblock claim/release events
> > > > completely and just read the events from the /proc/mounts file itself.
> >
> > No, we need both events. When you need to tell the user when it is
> > safe to disconnect the storage device, the event from detach_mnt() is
> > useless - it happens too early. In fact, even the current way of
> > sending the event from kill_block_super() is broken, because the event
> > is generated before generic_shutdown_super() and sync_blockdev(), and
> > writing out cached data may take some time.
> >
> > We could try to emit busy/free events from bd_claim() and
> > bd_release(); this would be triggered by most "interesting" users
> > (even opens with O_EXCL), but not by things like volume_id.
>
> Hmm, HAL polls optical drives every 2 seconds with O_EXCL to detect media
> changes. You need to do it EXCL, cause otherwise some cd burners fail.

Crap... Then we can only move the bdev_uevent() call in
kill_block_super() later, and other users of block devices will not be
noticed at all.

> > > New patch. Missed a check for namespace == NULL in detach_mnt().
[skip]
> > This assumes that there will be only one process per namespace which
> > will call poll() on /proc/mounts. Even though someone may argue that
> > it is the right approach (have a single process which watches
> > /proc/mounts and broadcasts updates to other interested processes,
> > e.g., over dbus), with the above implementation any unprivileged user
> > can call poll() and interfere with the operation of that designated
> > process.
>
> Sure, capable(CAP_SYS_ADMIN) could prevent this.

poll() protected by CAP_SYS_ADMIN? Ewww.

/proc/bus/usb/devices uses event counters in its poll() implementation;
something like this could be done here too. What about the following
patch (compile tested only):

-----------------------------------------------------------------------

Add poll support for /proc/.../mounts

Now poll() or select() system calls can be used to wait for mount tree
changes.

Signed-off-by: Sergey Vlasov <[email protected]>


---

fs/namespace.c | 19 ++++++++++-
fs/proc/base.c | 75 ++++++++++++++++++++++++++++++++++-----------
include/linux/namespace.h | 9 +++++
3 files changed, 82 insertions(+), 21 deletions(-)

applies-to: b87f06d928e0ea06ae6244c1aeecf3e745f39bb9
1704c384737e1cdbca3194f3471195c9ee4377a5
diff --git a/fs/namespace.c b/fs/namespace.c
index 2fa9fdf..4182210 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -39,6 +39,8 @@ static inline int sysfs_init(void)
/* spinlock for vfsmount related operations, inplace of dcache_lock */
__cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock);

+DECLARE_WAIT_QUEUE_HEAD(mounts_wait);
+
static struct list_head *mount_hashtable;
static int hash_mask __read_mostly, hash_bits __read_mostly;
static kmem_cache_t *mnt_cache;
@@ -120,6 +122,10 @@ static void detach_mnt(struct vfsmount *
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_hash);
old_nd->dentry->d_mounted--;
+ if (current->namespace) {
+ current->namespace->event++;
+ wake_up_interruptible(&mounts_wait);
+ }
}

static void attach_mnt(struct vfsmount *mnt, struct nameidata *nd)
@@ -129,6 +135,8 @@ static void attach_mnt(struct vfsmount *
list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
nd->dentry->d_mounted++;
+ current->namespace->event++;
+ wake_up_interruptible(&mounts_wait);
}

static struct vfsmount *next_mnt(struct vfsmount *p, struct vfsmount *root)
@@ -185,7 +193,8 @@ EXPORT_SYMBOL(__mntput);
/* iterator */
static void *m_start(struct seq_file *m, loff_t *pos)
{
- struct namespace *n = m->private;
+ struct proc_mounts_private *private = m->private;
+ struct namespace *n = private->namespace;
struct list_head *p;
loff_t l = *pos;

@@ -198,7 +207,8 @@ static void *m_start(struct seq_file *m,

static void *m_next(struct seq_file *m, void *v, loff_t *pos)
{
- struct namespace *n = m->private;
+ struct proc_mounts_private *private = m->private;
+ struct namespace *n = private->namespace;
struct list_head *p = ((struct vfsmount *)v)->mnt_list.next;
(*pos)++;
return p==&n->list ? NULL : list_entry(p, struct vfsmount, mnt_list);
@@ -206,7 +216,8 @@ static void *m_next(struct seq_file *m,

static void m_stop(struct seq_file *m, void *v)
{
- struct namespace *n = m->private;
+ struct proc_mounts_private *private = m->private;
+ struct namespace *n = private->namespace;
up_read(&n->sem);
}

@@ -1093,6 +1104,7 @@ int copy_namespace(int flags, struct tas
atomic_set(&new_ns->count, 1);
init_rwsem(&new_ns->sem);
INIT_LIST_HEAD(&new_ns->list);
+ new_ns->event = 0;

down_write(&tsk->namespace->sem);
/* First pass: copy the tree topology */
@@ -1394,6 +1406,7 @@ static void __init init_mount_tree(void)
init_rwsem(&namespace->sem);
list_add(&mnt->mnt_list, &namespace->list);
namespace->root = mnt;
+ namespace->event = 0;
mnt->mnt_namespace = namespace;

init_task.namespace = namespace;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index a170450..41462c0 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -57,6 +57,7 @@
#include <linux/init.h>
#include <linux/file.h>
#include <linux/string.h>
+#include <linux/poll.h>
#include <linux/seq_file.h>
#include <linux/namei.h>
#include <linux/namespace.h>
@@ -663,39 +664,77 @@ extern struct seq_operations mounts_op;
static int mounts_open(struct inode *inode, struct file *file)
{
struct task_struct *task = proc_task(inode);
- int ret = seq_open(file, &mounts_op);
+ struct proc_mounts_private *private;
+ struct seq_file *m;
+ struct namespace *namespace;
+ int ret = -ENOMEM;
+
+ private = kmalloc(sizeof(*private), GFP_KERNEL);
+ if (!private)
+ goto out;
+ ret = seq_open(file, &mounts_op);
+ if (ret)
+ goto out_free_private;
+ m = file->private_data;
+ m->private = private;

- if (!ret) {
- struct seq_file *m = file->private_data;
- struct namespace *namespace;
- task_lock(task);
- namespace = task->namespace;
- if (namespace)
- get_namespace(namespace);
- task_unlock(task);
-
- if (namespace)
- m->private = namespace;
- else {
- seq_release(inode, file);
- ret = -EINVAL;
- }
+ task_lock(task);
+ namespace = task->namespace;
+ if (namespace)
+ get_namespace(namespace);
+ task_unlock(task);
+ if (!namespace) {
+ ret = -EINVAL;
+ goto out_release;
}
+
+ private->namespace = namespace;
+ down_read(&namespace->sem);
+ private->last_event = namespace->event;
+ up_read(&namespace->sem);
+out:
return ret;
+
+out_release:
+ seq_release(inode, file);
+out_free_private:
+ kfree(private);
+ goto out;
}

static int mounts_release(struct inode *inode, struct file *file)
{
struct seq_file *m = file->private_data;
- struct namespace *namespace = m->private;
- put_namespace(namespace);
+ struct proc_mounts_private *private = m->private;
+ put_namespace(private->namespace);
+ kfree(private);
return seq_release(inode, file);
}

+static unsigned int mounts_poll(struct file *file, poll_table *wait)
+{
+ struct seq_file *m = file->private_data;
+ struct proc_mounts_private *private = m->private;
+ struct namespace *namespace = private->namespace;
+ int ret = 0;
+
+ poll_wait(file, &mounts_wait, wait);
+
+ down_read(&namespace->sem);
+ if (private->last_event != namespace->event) {
+ private->last_event = namespace->event;
+ ret = POLLIN | POLLRDNORM;
+ }
+ up_read(&namespace->sem);
+
+ return ret;
+}
+
static struct file_operations proc_mounts_operations = {
.open = mounts_open,
.read = seq_read,
.llseek = seq_lseek,
+ .poll = mounts_poll,
.release = mounts_release,
};

diff --git a/include/linux/namespace.h b/include/linux/namespace.h
index 0e5a86f..fe5840a 100644
--- a/include/linux/namespace.h
+++ b/include/linux/namespace.h
@@ -10,6 +10,7 @@ struct namespace {
struct vfsmount * root;
struct list_head list;
struct rw_semaphore sem;
+ unsigned int event;
};

extern int copy_namespace(int, struct task_struct *);
@@ -38,5 +39,13 @@ static inline void get_namespace(struct
atomic_inc(&namespace->count);
}

+/* /proc/.../mounts file descriptor state */
+struct proc_mounts_private {
+ struct namespace * namespace;
+ unsigned int last_event;
+};
+
+extern wait_queue_head_t mounts_wait;
+
#endif
#endif
---
0.99.8.GIT


Attachments:
(No filename) (8.53 kB)
(No filename) (189.00 B)
Download all attachments

2005-11-03 08:07:18

by Al Viro

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Wed, Nov 02, 2005 at 04:01:18PM +0300, Sergey Vlasov wrote:
> @@ -120,6 +122,10 @@ static void detach_mnt(struct vfsmount *
> list_del_init(&mnt->mnt_child);
> list_del_init(&mnt->mnt_hash);
> old_nd->dentry->d_mounted--;
> + if (current->namespace) {
> + current->namespace->event++;
> + wake_up_interruptible(&mounts_wait);
> + }
> }
>
> static void attach_mnt(struct vfsmount *mnt, struct nameidata *nd)

Ugh... So umount -l gives one hell of a spew for no good reason.

> @@ -129,6 +135,8 @@ static void attach_mnt(struct vfsmount *
> list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
> list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
> nd->dentry->d_mounted++;
> + current->namespace->event++;
> + wake_up_interruptible(&mounts_wait);
> }

Bad idea - copy_tree() will spew *and* we get bogus events on CLONE_NEWNS
(i.e. current->namespace is not even the namespace being modified).

> @@ -1093,6 +1104,7 @@ int copy_namespace(int flags, struct tas
> atomic_set(&new_ns->count, 1);
> init_rwsem(&new_ns->sem);
> INIT_LIST_HEAD(&new_ns->list);
> + new_ns->event = 0;

BTW, I'd rather make that queue per-namespace...

> + down_read(&namespace->sem);
> + if (private->last_event != namespace->event) {
> + private->last_event = namespace->event;
> + ret = POLLIN | POLLRDNORM;

Umm... I'd rather use POLLERR, since POLLIN doesn't apply here - it's not
a stream of data that gives blocking read() when reached the end.


IMO the right approach is to have global event counter and do the following:
if (namespace->event != event) {
namespace->event = event;
wake...
}
in tree modifications and bump the event counter as soon as we decide
to do something. I'll see how well that works with shared-subtree stuff
and post the patch if it turns out to be usable...

2005-11-03 10:52:44

by Sergey Vlasov

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Thu, Nov 03, 2005 at 08:07:13AM +0000, Al Viro wrote:
> On Wed, Nov 02, 2005 at 04:01:18PM +0300, Sergey Vlasov wrote:
> > @@ -120,6 +122,10 @@ static void detach_mnt(struct vfsmount *
> > list_del_init(&mnt->mnt_child);
> > list_del_init(&mnt->mnt_hash);
> > old_nd->dentry->d_mounted--;
> > + if (current->namespace) {
> > + current->namespace->event++;
> > + wake_up_interruptible(&mounts_wait);
> > + }
> > }
> >
> > static void attach_mnt(struct vfsmount *mnt, struct nameidata *nd)
>
> Ugh... So umount -l gives one hell of a spew for no good reason.

umount -l will change contents of /proc/mounts, so waking up poll() on
that file seems to be right in this case (even if the filesystem is still
mounted internally, it is no longer accessible).

> > @@ -129,6 +135,8 @@ static void attach_mnt(struct vfsmount *
> > list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
> > list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
> > nd->dentry->d_mounted++;
> > + current->namespace->event++;
> > + wake_up_interruptible(&mounts_wait);
> > }
>
> Bad idea - copy_tree() will spew *and* we get bogus events on CLONE_NEWNS
> (i.e. current->namespace is not even the namespace being modified).

IMHO it's not spew, but real changes in the mount tree.

CLONE_NEWNS handling may really be broken (maybe mnt->mnt_namespace should
be used instead of current->namespace, but I'm not sure if it is set
correctly at this place - it is certainly wrong in detach_mnt()).

> > @@ -1093,6 +1104,7 @@ int copy_namespace(int flags, struct tas
> > atomic_set(&new_ns->count, 1);
> > init_rwsem(&new_ns->sem);
> > INIT_LIST_HEAD(&new_ns->list);
> > + new_ns->event = 0;
>
> BTW, I'd rather make that queue per-namespace...

You mean mount_wait, so that only tasks which wait for changes in a
particular namespace would be woken up? Yes, that would be better (if
namespaces are really used).

> > + down_read(&namespace->sem);
> > + if (private->last_event != namespace->event) {
> > + private->last_event = namespace->event;
> > + ret = POLLIN | POLLRDNORM;
>
> Umm... I'd rather use POLLERR, since POLLIN doesn't apply here - it's not
> a stream of data that gives blocking read() when reached the end.

This is copied from /proc/bus/usb/devices, which has similar behavior.


Attachments:
(No filename) (2.26 kB)
(No filename) (189.00 B)
Download all attachments

2005-11-03 11:30:25

by Al Viro

[permalink] [raw]
Subject: Re: Race between "mount" uevent and /proc/mounts?

On Thu, Nov 03, 2005 at 01:52:35PM +0300, Sergey Vlasov wrote:
> > Ugh... So umount -l gives one hell of a spew for no good reason.
>
> umount -l will change contents of /proc/mounts, so waking up poll() on
> that file seems to be right in this case (even if the filesystem is still
> mounted internally, it is no longer accessible).

Yes, but that's a single change.

> > Bad idea - copy_tree() will spew *and* we get bogus events on CLONE_NEWNS
> > (i.e. current->namespace is not even the namespace being modified).
>
> IMHO it's not spew, but real changes in the mount tree.

Again, mount --rbind is a single change. And fsckloads of attach_mnt().
IOW, you are doing that on too low level. Right ones:

* graft_tree()
* do_move_mount()
* sys_pivot_root()
* expire_mount() (BTW, again wrong namespace touched in your variant)
* lazy one in umount_tree() (single update of event in do_umount(),
then touch ->mnt_namespace for each vfsmount - with shared-subtree it may
affect more than one namespace)
* couple of extra places introduced by shared-subtree patchset