Hello,
Announcing the release of inotify 0.10.0.
Attached is a patch to 2.6.8.1.
--New in this version--
-code cleanup (rml)
-fix an OOPS when events where queued (me)
-cleaned up the events (me)
- remove IN_CREATE, IN_DELETED, IN_RENAME and IN_MOVE
- added IN_CREATE_SUBDIR, IN_CREATE_FILE, IN_DELETE_SUBDIR,
IN_DELETE_FILE, IN_DELETE_SELF, IN_MOVED_FROM and IN_MOVED_TO
-added a cookie field to the event structure to pair MOVED_FROM/MOVED_TO
-misc fixes (me)
-added description of inotify events to the release notes (me)
-updated inotify-utils to work with this version (me)
John McCutchan
Release notes:
--Why Not dnotify and Why inotify (By Robert Love)--
Everyone seems quick to deride the blunder known as "dnotify" and
applaud a
replacement, any replacement, man anything but that current mess, but in
the
name of fairness I present my treatise on why dnotify is what one might
call
not good:
* dnotify requires the opening of one fd per each directory that you
intend to
watch.
o The file descriptor pins the directory, disallowing the backing
device to be unmounted, which absolutely wrecks havoc with removable
media.
o Watching many directories results in many open file descriptors,
possibly hitting a per-process fd limit.
* dnotify is directory-based. You only learn about changes to
directories.
Sure, a change to a file in a directory affects the directory, but you
are
then forced to keep a cache of stat structures around to compare
things in
order to find out which file.
* dnotify's interface to user-space is awful.
o dnotify uses signals to communicate with user-space.
o Specifically, dnotify uses SIGIO.
o But then you can pick a different signal! So by "signals," I really
meant you need to use real-time signals if you want to queue the
events.
* dnotify basically ignores any problems that would arise in the VFS
from hard
links.
* Rumor is that the "d" in "dnotify" does not stand for "directory" but
for
"suck."
A suitable replacement is "inotify." And now, my tract on what inotify
brings
to the table:
* inotify's interface is a device node, not SIGIO.
o You open only a single fd, to the device node. No more pinning
directories or opening a million file descriptors.
o Usage is nice: open the device, issue simple commands via ioctl(),
and then block on the device. It returns events when, well, there are
events to be returned.
o You can select() on the device node and so it integrates with main
loops like coffee mixed with vanilla milkshake.
* inotify has an event that says "the filesystem that the item you were
watching is on was unmounted" (this is particularly cool).
* inotify can watch directories or files.
* The "i" in inotify does not stand for "suck" but for "inode" -- the
logical
choice since inotify is inode-based.
--COMPLEXITY--
I have been asked what the complexity of inotify is. Inotify has
2 path codes where complexity could be an issue:
Adding a watcher to a device
This code has to check if the inode is already being watched
by the device, this is O(1) since the maximum number of
devices is limited to 8.
Removing a watch from a device
This code has to do a search of all watches on the device to
find the watch descriptor that is being asked to remove.
This involves a linear search, but should not really be an issue
because it is limited to 8192 entries. If this does turn in to
a concern, I would replace the list of watches on the device
with a sorted binary tree, so that the search could be done
very quickly.
The calls to inotify from the VFS code has a complexity of O(1) so
inotify does not affect the speed of VFS operations.
--MEMORY USAGE--
The inotify data structures are light weight:
inotify watch is 40 bytes
inotify device is 68 bytes
inotify event is 272 bytes
So assuming a device has 8192 watches, the structures are only going
to consume 320KB of memory. With a maximum number of 8 devices allowed
to exist at a time, this is still only 2.5 MB
Each device can also have 256 events queued at a time, which sums to
68KB per device. And only .5 MB if all devices are opened and have
a full event queue.
So approximately 3 MB of memory are used in the rare case of
everything open and full.
Each inotify watch pins the inode of a directory/file in memory,
the size of an inode is different per file system but lets assume
that it is 512 byes.
So assuming the maximum number of global watches are active, this would
pin down 32 MB of inodes in the inode cache. Again not a problem
on a modern system.
On smaller systems, the maximum watches / events could be lowered
to provide a smaller foot print.
Keep in mind that this is an absolute worst case memory analysis.
In reality it will most likely cost approximately 5MB.
--HOWTO USE--
Inotify is a character device that when opened offers 2 IOCTL's.
(It actually has 4 but the other 2 are used for debugging)
INOTIFY_WATCH:
Which takes a path and event mask and returns a unique
(to the instance of the driver) integer (wd [watch descriptor]
from here on) that is a 1:1 mapping to the path passed.
What happens is inotify gets the inode (and ref's the inode)
for the path and adds a inotify_watcher structure to the inodes
list of watchers. If this instance of the driver is already
watching the path, the event mask will be updated and
the original wd will be returned.
INOTIFY_IGNORE:
Which takes an integer (that you got from INOTIFY_WATCH)
representing a wd that you are not interested in watching
anymore. This will:
send an IGNORE event to the device
remove the inotify_watcher structure from the device and
from the inode and unref the inode.
After you are watching 1 or more paths, you can read from the fd
and get events. The events are struct inotify_event. If you are
watching a directory and something happens to a file in the directory
the event will contain the filename (just the filename not the full
path).
-- EVENTS --
IN_ACCESS - Sent when file is accessed.
IN_MODIFY - Sent when file is modified.
IN_ATTRIB - Sent when file is chmod'ed.
IN_CLOSE - Sent when file is closed
IN_OPEN - Sent when file is opened.
IN_MOVED_FROM - Sent to the source folder of a move.
IN_MOVED_TO - Sent to the destination folder of a move.
IN_DELETE_SUBDIR - Sent when a sub directory is deleted. (When watching
parent)
IN_DELETE_FILE - Sent when a file is deleted. (When watching parent)
IN_CREATE_SUBDIR - Sent when a sub directory is created. (When watching
parent)
IN_CREATE_FILE - Sent when a file is created. (When watching parent)
IN_DELETE_SELF - Sent when file is deleted.
IN_UNMOUNT - Sent when the filesystem is being unmounted.
IN_Q_OVERFLOW - Sent when your event queue has over flowed.
The MOVED_FROM/MOVED_TO events are always sent in pairs.
MOVED_FROM/MOVED_TO
is also sent when a file is renamed. The cookie field in the event pairs
up MOVED_FROM/MOVED_TO events. These two events are not guaranteed to be
successive in the event stream. You must rely on the cookie to pair
them up. (Note, the cookie is not sent yet.)
If you aren't watching the source and destination folders in a MOVE.
You will only get MOVED_TO or MOVED_FROM. In this case, MOVED_TO
is equivelent to a CREATE and MOVED_FROM is equivelent to a DELETE.
--KERNEL CHANGES--
inotify char device driver.
Adding calls to inotify_inode_queue_event and
inotify_dentry_parent_queue_event from VFS operations.
Dnotify has the same function calls. The complexity of the VFS
operations is not affected because inotify_*_queue_event is O(1).
Adding a call to inotify_super_block_umount from
generic_shutdown_superblock
inotify_super_block_umount consists of this:
find all of the inodes that are on the super block being shut down,
sends each watcher on each inode the UNMOUNT and IGNORED event
removes the watcher structures from each instance of the device driver
and each inode.
unref's the inode.
John McCutchan <[email protected]> wrote:
>
> Announcing the release of inotify 0.10.0.
I can't say I'm thrilled by the locking.
> Attached is a patch to 2.6.8.1.
Please raise patches against current kernels from
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots. This kernel is six
weeks old.
> +#define INOTIFY_VERSION "0.10.0"
You should plan to remove this - it becomes meainingless once the code
is merged up, and nobody ever updates it as they patch things.
> +#define MAX_INOTIFY_DEVS 8 /* max open inotify devices */
> +#define MAX_INOTIFY_DEV_WATCHERS 8192 /* max total watches */
> +#define MAX_INOTIFY_QUEUED_EVENTS 256 /* max events queued on the dev*/
hmm.
+#define INOTIFY_DEV_TIMER_TIME (jiffies + (HZ/4))
ick. Don't hide the logic in a #define.
static inline arm_dev_timer(struct inotify_device *dev)
{
mod_timer(&dev->timer, jiffies + HZ/4);
}
is nicer.
> +/* For debugging */
> +static int event_object_count;
> +static int watcher_object_count;
> +static int inode_ref_count;
OK. These are accessed racily. Either make them atomic_t's or remove them.
> +struct inotify_device {
> + unsigned long bitmask[MAX_INOTIFY_DEV_WATCHERS/BITS_PER_LONG];
> + struct timer_list timer;
> + wait_queue_head_t wait;
> + struct list_head events;
> + struct list_head watchers;
> + spinlock_t lock;
> + unsigned int event_count;
> + int nr_watches;
> +};
> +
> +struct inotify_watcher {
> + int wd; // watcher descriptor
> + unsigned long mask;
> + struct inode * inode;
> + struct inotify_device * dev;
> + struct list_head d_list; // device list
> + struct list_head i_list; // inode list
> + struct list_head u_list; // unmount list
> +};
> ...
> +struct inotify_kernel_event {
> + struct list_head list;
> + struct inotify_event event;
> +};
Would be nice to add commentary documenting what lock protects the various
fields.
> +static int find_inode(const char __user *dirname, struct inode **inode)
This can just return an inode*, or an IS_ERR() errno, I think?
> +struct inotify_kernel_event *kernel_event(int wd, int mask,
> + const char *filename)
> +{
> + struct inotify_kernel_event *kevent;
> +
> + kevent = kmem_cache_alloc(kevent_cache, GFP_ATOMIC);
Try to rearrange things so the allocation mode here can become GFP_KERNEL.
Are there ever enough of these objects in flight to justify a standalone
slab cache?
> + watcher = kmem_cache_alloc(watcher_cache, GFP_KERNEL);
Ditto.
> +void inotify_inode_queue_event(struct inode *inode, unsigned long mask,
> + const char *filename)
> +{
> + struct inotify_watcher *watcher;
> +
> + spin_lock(&inode->i_lock);
i_lock is documented as an "innermost" lock.
> + list_for_each_entry(watcher, &inode->watchers, i_list) {
> + spin_lock(&watcher->dev->lock);
But here, dev->lock is nesting inside it. Not necessarily a bug per-se,
but it changes and complicates kernel locking rules, and invalidates
commentary elsewhere.
If possible, please try to avoid using i_lock. Use i_sem instead.
> +void inotify_inode_queue_event_pair(struct inode *inode1, unsigned long mask1,
> + const char *filename1,
> + struct inode *inode2, unsigned long mask2,
> + const char *filename2)
> +{
> + struct inotify_watcher *watcher;
> +
> + spin_lock(&inode1->i_lock);
> + spin_lock(&inode2->i_lock);
A bug, I think. What happens if another CPU comes in and tries to take these
two locks in the opposite order?
The same problem applies if you switch to i_sem. The standard fix is to
take the lowest-addressed lock first. See d_move() for an example.
> +
> +void inotify_dentry_parent_queue_event(struct dentry *dentry,
> + unsigned long mask,
> + const char *filename)
> +{
> + struct dentry *parent;
> +
> + spin_lock(&dentry->d_lock);
> + dget(dentry->d_parent);
> + parent = dentry->d_parent;
> + inotify_inode_queue_event(parent->d_inode, mask, filename);
> + dput(parent);
> + spin_unlock(&dentry->d_lock);
> +}
Does d_lock need to be held across the inotify_inode_queue_event() call?
If (hopefully) not, simply use dget_parent() in here.
> +EXPORT_SYMBOL_GPL(inotify_dentry_parent_queue_event);
> +
> +static void ignore_helper(struct inotify_watcher *watcher, int event)
> +{
> + struct inotify_device *dev;
> + struct inode *inode;
> +
> + spin_lock(&watcher->dev->lock);
> + spin_lock(&watcher->inode->i_lock);
Bug. Elsewhere, i_lock nests outside dev->lock.
> +static void build_umount_list(struct list_head *head, struct super_block *sb,
> + struct list_head *umount)
> +{
> + struct inode * inode;
> +
> + list_for_each_entry(inode, head, i_list) {
> + struct inotify_watcher *watcher;
> +
> + if (inode->i_sb != sb)
> + continue;
> +
> + spin_lock(&inode->i_lock);
> +
> + list_for_each_entry(watcher, &inode->watchers, i_list)
> + list_add(&watcher->u_list, umount);
> +
> + spin_unlock(&inode->i_lock);
> + }
> +}
I'll be merging invalidate_inodes-speedup.patch once the 2.6.10 stream
opens. That will make the above code simpler, faster and quite different.
> +void inotify_inode_is_dead(struct inode *inode)
> +{
> + struct inotify_watcher *watcher, *next;
> +
> + list_for_each_entry_safe(watcher, next, &inode->watchers, i_list) {
> + ignore_helper(watcher, 0);
> +}
> +EXPORT_SYMBOL_GPL(inotify_inode_is_dead);
This needs locking, I think. Or a comment explaining why it does not.
> + add_wait_queue(&dev->wait, &wait);
> +repeat:
> + if (signal_pending(current)) {
> + spin_unlock(&dev->lock);
> + out = -ERESTARTSYS;
> + set_current_state(TASK_RUNNING);
> + remove_wait_queue(&dev->wait, &wait);
> + goto out;
> + }
> + set_current_state(TASK_INTERRUPTIBLE);
> + if (!inotify_dev_has_events(dev)) {
> + spin_unlock(&dev->lock);
> + schedule();
> + spin_lock(&dev->lock);
> + goto repeat;
> + }
> +
> + set_current_state(TASK_RUNNING);
> + remove_wait_queue(&dev->wait, &wait);
The above seems a bit clumsy.
> + err = !access_ok(VERIFY_WRITE, (void *)buf,
> + sizeof(struct inotify_event));
> +
> + if (err) {
> + out = -EFAULT;
> + goto out;
> + }
> +
The above is unneeded. copy_to_user() will check.
> + /* Copy all the events we can to the event buffer */
> + for (event_count = 0; event_count < events; event_count++) {
> + kevent = inotify_dev_get_event(dev);
> + eventbuf[event_count] = kevent->event;
> + inotify_dev_event_dequeue(dev);
> + }
> +
> + spin_unlock(&dev->lock);
> +
> + /* Send the event buffer to user space */
> + err = copy_to_user(buf, eventbuf,
> + events * sizeof(struct inotify_event));
> +
> + buf += sizeof(struct inotify_event) * events;
> +
> + out = buf - obuf;
> +
> +out:
> + kfree(eventbuf);
> + return out;
> +}
Except the copy_to_user() return value is being ignored!
I don't think you need event_buf at all. Just copy one event at a time to
userspace. copy_to_user() is efficient. Using an intermediate buffer like
this consumes extra CPU dcache and may even be slower.
Probably slower, given that the temp buffer is kmalloced.
> +static int inotify_open(struct inode *inode, struct file *file)
> +{
> + struct inotify_device *dev;
> +
> + if (atomic_read(&watcher_count) == MAX_INOTIFY_DEVS)
> + return -ENODEV;
> +
> + atomic_inc(&watcher_count);
> +
> + dev = kmalloc(sizeof(struct inotify_device), GFP_KERNEL);
> + if (!dev)
> + return -ENOMEM;
> +
> + memset(dev->bitmask, 0,
> + sizeof(unsigned long) * MAX_INOTIFY_DEV_WATCHERS / BITS_PER_LONG);
What purpose does this bitmask serve, anyway??
> + add_timer(&dev->timer);
And what does the timer do? (You surely explained the feature earlier,
but I was asleep, sorry).
> +static void inotify_release_all_watchers(struct inotify_device *dev)
> +{
> + struct inotify_watcher *watcher,*next;
> +
> + list_for_each_entry_safe(watcher, next, &dev->watchers, d_list)
> + ignore_helper(watcher, 0);
> +}
Locking?
> +
> +static int inotify_release(struct inode *inode, struct file *file)
> +{
> + if (file->private_data) {
Why test ->private_data here?
If it indeed needs testing here, shouldn't it be zeroed out as well, with
appropriate locking?
> + struct inotify_device *dev;
> +
> + dev = (struct inotify_device *) file->private_data;
Please don't typecast when assigning to and from void*'s
> + del_timer_sync(&dev->timer);
> + inotify_release_all_watchers(dev);
> + inotify_release_all_events(dev);
> + kfree(dev);
> + }
> +
> + printk(KERN_ALERT "inotify device released\n");
> +
> + atomic_dec(&watcher_count);
If file->private_data was zero, we shouldn't have decremented this?
+static int inotify_watch(struct inotify_device *dev,
+ struct inotify_watch_request *request)
> +{
> ...
> + spin_lock(&dev->lock);
> + spin_lock(&inode->i_lock);
Lock ranking.
> +static int inotify_ioctl(struct inode *ip, struct file *fp,
> + unsigned int cmd, unsigned long arg) {
An errant brace!
> +
> + if (_IOC_DIR(cmd) & _IOC_READ)
> + err = !access_ok(VERIFY_READ, (void *) arg, _IOC_SIZE(cmd));
> +
> + if (err)
> + err = -EFAULT;
> + goto out;
> +
eh? The above is missing braces, and cannot possibly have worked.
> + if (_IOC_DIR(cmd) & _IOC_WRITE)
> + err = !access_ok(VERIFY_WRITE, (void *)arg, _IOC_SIZE(cmd));
> +
> + if (err) {
> + err = -EFAULT;
> + goto out;
> + }
Why are these access_ok() checks here? I think they can (should) go away.
> + err = -EINVAL;
> +
> + switch (cmd) {
> + case INOTIFY_WATCH:
We often do:
switch (cmd) {
case INOTIFY_WATCH:
to save a tabstop.
> +struct miscdevice inotify_device = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "inotify",
> + .fops = &inotify_fops,
> +};
Please update devices.txt
> +/* this size could limit things, since technically we could need PATH_MAX */
> +#define INOTIFY_FILENAME_MAX 256
> +
> +/*
> + * struct inotify_event - structure read from the inotify device for each event
> + *
> + * When you are watching a directory, you will receive the filename for events
> + * such as IN_CREATE, IN_DELETE, IN_OPEN, IN_CLOSE, ...
> + *
> + * Note: When reading from the device you must provide a buffer that is a
> + * multiple of sizeof(struct inotify_event)
> + */
> +struct inotify_event {
> + int wd;
> + int mask;
> + int cookie;
> + char filename[INOTIFY_FILENAME_MAX];
> +};
yeah, that's not very nice. Better to kmalloc the pathname.
> +/* Adds event to all watchers on inode that are interested in mask */
> +void inotify_inode_queue_event (struct inode *inode, unsigned long mask,
> + const char *filename);
> +
> +/* Same as above but uses dentry's inode */
> +void inotify_dentry_parent_queue_event (struct dentry *dentry,
> + unsigned long mask, const char *filename);
> +
> +/* This will remove all watchers from all inodes on the superblock */
> +void inotify_super_block_umount (struct super_block *sb);
> +
> +/* Call this when an inode is dead, and inotify should ignore it */
> +void inotify_inode_is_dead (struct inode *inode);
> +
Please add CONFIG_INOTIFY and make all this:
> --- clean/linux/include/linux/fs.h 2004-08-14 06:55:09.000000000 -0400
> +++ linux/include/linux/fs.h 2004-09-18 02:24:33.000000000 -0400
> @@ -458,6 +458,10 @@
> unsigned long i_dnotify_mask; /* Directory notify events */
> struct dnotify_struct *i_dnotify; /* for directory notifications */
>
> + struct list_head watchers;
> + unsigned long watchers_mask;
> + int watcher_count;
> +
> unsigned long i_state;
> unsigned long dirtied_when; /* jiffies of first dirtying */
>
and this:
> @@ -216,8 +217,11 @@
> ret = file->f_op->read(file, buf, count, pos);
> else
> ret = do_sync_read(file, buf, count, pos);
> - if (ret > 0)
> + if (ret > 0) {
> dnotify_parent(file->f_dentry, DN_ACCESS);
> + inotify_dentry_parent_queue_event(file->f_dentry, IN_ACCESS, file->f_dentry->d_name.name);
> + inotify_inode_queue_event (file->f_dentry->d_inode, IN_ACCESS, NULL);
> + }
> }
go away if the user doesn't want inotify. And remember to test with
CONFIG_INOTIFY=n!
> + INIT_LIST_HEAD (&inode->watchers);
Please review the entire patch and ensure that all macros and function
calls have no space between the identifier and the opening parenthesis.
Can't say that I'm thrilled that it uses an ioctl.
What are the security policies?
Am I allowed to monitor an inode for which I do not have read permission?
If so, can I use that to pin files which root wants to delete, thus causing
the disk to fill up?
On Sun, 2004-09-26 at 22:02 -0400, John McCutchan wrote:
Hi,
> -cleaned up the events (me)
> - remove IN_CREATE, IN_DELETED, IN_RENAME and IN_MOVE
> - added IN_CREATE_SUBDIR, IN_CREATE_FILE, IN_DELETE_SUBDIR,
> IN_DELETE_FILE, IN_DELETE_SELF, IN_MOVED_FROM and IN_MOVED_TO
I have just looked quickly at the inotify backend this weekend, so not
really clued up yet, so excuse the stupid question: Does this mean
inotify is now up to par with dnotify/poll feature wise? Or should we
still look at getting inotify backend to use poll?
--
Martin Schlemmer
On Mon, 2004-09-27 at 18:21 +0200, Martin Schlemmer [c] wrote:
> I have just looked quickly at the inotify backend this weekend, so not
> really clued up yet, so excuse the stupid question: Does this mean
> inotify is now up to par with dnotify/poll feature wise? Or should we
> still look at getting inotify backend to use poll?
inotify supports poll in the sense that you can poll() or select()
on /dev/inotify.
dnotify does not support poll() at all (it is signal based), so your
question is kind of confusing.
I cannot think of any features that dnotify has that inotify does not.
Robert Love
On Sun, 2004-09-26 at 22:02 -0400, John McCutchan wrote:
Hi,
> -cleaned up the events (me)
> - remove IN_CREATE, IN_DELETED, IN_RENAME and IN_MOVE
> - added IN_CREATE_SUBDIR, IN_CREATE_FILE, IN_DELETE_SUBDIR,
> IN_DELETE_FILE, IN_DELETE_SELF, IN_MOVED_FROM and IN_MOVED_TO
I have just looked quickly at the inotify backend this weekend, so not
really clued up yet, so excuse the stupid question: Does this mean
inotify is now up to par with dnotify/poll feature wise? Or should we
still look at getting inotify backend to use poll?
--
Martin Schlemmer
On Mon, 2004-09-27 at 12:24 -0400, Robert Love wrote:
> On Mon, 2004-09-27 at 18:21 +0200, Martin Schlemmer [c] wrote:
>
> > I have just looked quickly at the inotify backend this weekend, so not
> > really clued up yet, so excuse the stupid question: Does this mean
> > inotify is now up to par with dnotify/poll feature wise? Or should we
> > still look at getting inotify backend to use poll?
>
> inotify supports poll in the sense that you can poll() or select()
> on /dev/inotify.
>
> dnotify does not support poll() at all (it is signal based), so your
> question is kind of confusing.
>
(Sorry for double posting - keep forgetting with what email addy I am
subscribed to this list)
Right, but using gamin, dnotify+poll support some things that inotify
do not support. Basically the dnotify backend also use the poll backend
to enhance it to be able to monitor some events that dnotify by itself
cannot (Daniel will be the better person to ask here). Here is a small
snippit from another thread:
----
> > > > inotify should be able to fallback to poll too, otherwise it's
> > > > > a bug in the inotify back-end, think for example for NFS resources.
> > > > >
> > > >
> > > > Hmm - well it does not use poll. I will see if I can have a
> > > > look at at dnotify backend and cook something up. I assume
> > > > John never got to it, or you added the poll support to dnotify
> > > > after inotify was added?
> > >
> > > I added it after, yes.
> > >
> >
> > Ok, thanks.
> >
>
> Actually, initially the inotify backend did support using the poll
> backend as well. I got rid of it because at the time inotify provided
> everything we needed. Now gamin supports watching things that inotify
> can't, like directories that don't exist.
----
Thanks,
--
Martin Schlemmer
On Mon, 2004-09-27 at 18:30 +0200, Martin Schlemmer [c] wrote:
> Right, but using gamin, dnotify+poll support some things that inotify
> do not support. Basically the dnotify backend also use the poll backend
> to enhance it to be able to monitor some events that dnotify by itself
> cannot (Daniel will be the better person to ask here). Here is a small
> snippit from another thread:
I am thinking that we have two different definitions of poll here,
sorry.
I guess you mean poll as in some sort of Gamin backend type?
Anyhow, I guess this is a Gamin question, not a inotify kernel-side
question.
Robert Love
On Mon, 2004-09-27 at 12:35 -0400, Robert Love wrote:
> On Mon, 2004-09-27 at 18:30 +0200, Martin Schlemmer [c] wrote:
>
> > Right, but using gamin, dnotify+poll support some things that inotify
> > do not support. Basically the dnotify backend also use the poll backend
> > to enhance it to be able to monitor some events that dnotify by itself
> > cannot (Daniel will be the better person to ask here). Here is a small
> > snippit from another thread:
>
> I am thinking that we have two different definitions of poll here,
> sorry.
>
> I guess you mean poll as in some sort of Gamin backend type?
>
Yip.
> Anyhow, I guess this is a Gamin question, not a inotify kernel-side
> question.
>
Sorry - we talked about this during last week, and I did not think
to trim the CC list. I will do so from now on.
Regards,
--
Martin Schlemmer
On Sun, 2004-09-26 at 22:02 -0400, John McCutchan wrote:
Hi, John.
> Announcing the release of inotify 0.10.0.
> Attached is a patch to 2.6.8.1.
Attached patch addresses a few misc. issues that Andrew raised:
- Get rid of INOTIFY_VERSION. I agree with Andrew, once
inotify is in the kernel, version information like this
is worthless and thus shunned.
- Just open code INOTIFY_DEV_TIMER_TIME.
- Add comment re 'bitmask'
- Don't typecast when assigning from a void *.
- Misc. and minor coding style fixes
Best,
Robert Love
> unsigned long bitmask[MAX_INOTIFY_DEV_WATCHERS/BITS_PER_LONG];
This assumes that MAX_INOTIFY_DEV_WATCHERS is an integral multiple
of BITS_PER_LONG, otherwise, the last word will be missing.
Perhaps this would this better be written as:
DECLARE_BITMAP(bitmask, MAX_INOTIFY_DEV_WATCHERS);
and the clearing of it in the original patch:
> + memset(dev->bitmask, 0,
> + sizeof(unsigned long) * MAX_INOTIFY_DEV_WATCHERS / BITS_PER_LONG);
might better be written as:
CLEAR_BITMAP(dev->bitmask, MAX_INOTIFY_DEV_WATCHERS);
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
On Sun, 2004-09-26 at 22:02 -0400, John McCutchan wrote:
> Announcing the release of inotify 0.10.0.
> Attached is a patch to 2.6.8.1.
Hi, John.
Attached patch makes inotify configurable via the boolean
CONFIG_INOTIFY. We cannot allow CONFIG_INOTIFY=m without playing with
function pointers, since the inotify (as with dnotify) code is called
from generic code.
Compile tested both on and off.
Also removed some module stuff that was unneeded, since we are not a
module.
A nice TODO would be to make dnotify a compile option, since any modern
desktop using Gamin would not need nor want dnotify. I will look at
that later.
Rock,
Robert Love
On Sun, 2004-09-26 at 22:02 -0400, John McCutchan wrote:
> Announcing the release of inotify 0.10.0.
> Attached is a patch to 2.6.8.1.
Don't even ask.
And definitely don't laugh.
Robert Love
On Sun, 2004-09-26 at 22:02 -0400, John McCutchan wrote:
> Announcing the release of inotify 0.10.0.
> Attached is a patch to 2.6.8.1.
John,
find_inode() can just return an inode directly, so we don't need the
second argument and the double indirection.
FYI, errors can be encoded in pointers via ERR_PTR(), checked via IS_ERR
(), and retrieved via PTR_ERR().
Patch is on top of your patch and all of my previous patches.
Robert Love
On Mon, 2004-09-27 at 12:48 -0700, Paul Jackson wrote:
> > unsigned long bitmask[MAX_INOTIFY_DEV_WATCHERS/BITS_PER_LONG];
>
> This assumes that MAX_INOTIFY_DEV_WATCHERS is an integral multiple
> of BITS_PER_LONG, otherwise, the last word will be missing.
Yah. Since we defined MAX_INOTIFY_DEV_WATCHERS, I presumed to be able
to ensure it was a multiple (e.g. just keep it a power of two) ...
> Perhaps this would this better be written as:
>
> DECLARE_BITMAP(bitmask, MAX_INOTIFY_DEV_WATCHERS);
... but this is indeed cleaner.
> and the clearing of it in the original patch:
>
> > + memset(dev->bitmask, 0,
> > + sizeof(unsigned long) * MAX_INOTIFY_DEV_WATCHERS / BITS_PER_LONG);
>
> might better be written as:
>
> CLEAR_BITMAP(dev->bitmask, MAX_INOTIFY_DEV_WATCHERS);
I think you mean bitmap_zero(), but yah. Agreed.
John, here is a patch to use the bitmap.h functions to manipulate the
bitmap.
Robert Love
On Sun, 2004-09-26 at 21:17 -0700, Andrew Morton wrote:
> Please raise patches against current kernels from
> ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots. This kernel is six
> weeks old.
I have patches against newer kernels at
http://www.kernel.org/pub/linux/kernel/people/rml/inotify
Anyhow, the tidal flood of commentary is appreciated. I have addressed
all of your issues ...
> > +#define INOTIFY_VERSION "0.10.0"
>
> You should plan to remove this - it becomes meainingless once the code
> is merged up, and nobody ever updates it as they patch things.
Agreed. Patch sent.
> +#define INOTIFY_DEV_TIMER_TIME (jiffies + (HZ/4))
>
> ick. Don't hide the logic in a #define.
>
> static inline arm_dev_timer(struct inotify_device *dev)
> {
> mod_timer(&dev->timer, jiffies + HZ/4);
> }
>
> is nicer.
It is used in more than one place, but OK. Patch sent.
> > +/* For debugging */
> > +static int event_object_count;
> > +static int watcher_object_count;
> > +static int inode_ref_count;
>
> OK. These are accessed racily. Either make them atomic_t's or remove them.
They are just statistics. I'd prefer to remove them entirely (I don't
personally think that the debugging code, statistics, etc. should go
into the mainline kernel).
> > +static int find_inode(const char __user *dirname, struct inode **inode)
>
> This can just return an inode*, or an IS_ERR() errno, I think?
Yes, it can. Done, patch sent.
> > +struct inotify_kernel_event *kernel_event(int wd, int mask,
> > + const char *filename)
> > +{
> > + struct inotify_kernel_event *kevent;
> > +
> > + kevent = kmem_cache_alloc(kevent_cache, GFP_ATOMIC);
>
> Try to rearrange things so the allocation mode here can become GFP_KERNEL.
Hrmph.
> Are there ever enough of these objects in flight to justify a standalone
> slab cache?
Yes. There are up to 256*8 possible events and (more importantly than
the net number, I think) they come and go constantly.
> > + watcher = kmem_cache_alloc(watcher_cache, GFP_KERNEL);
Yes. There can be a lot of watches... one of the intended uses of this
is automatic indexing of a user's homedir (think Apple Spotlight). Alan
Cox has also mentioned virus checking, etc.
There are structures are _not_ used as much, and we do not slab cache
them.
> i_lock is documented as an "innermost" lock.
>
> But here, dev->lock is nesting inside it. Not necessarily a bug per-se,
> but it changes and complicates kernel locking rules, and invalidates
> commentary elsewhere.
>
> If possible, please try to avoid using i_lock. Use i_sem instead.
>
> A bug, I think. What happens if another CPU comes in and tries to take these
> two locks in the opposite order?
>
> The same problem applies if you switch to i_sem. The standard fix is to
> take the lowest-addressed lock first. See d_move() for an example.
I'll work on the locking.
> I'll be merging invalidate_inodes-speedup.patch once the 2.6.10 stream
> opens. That will make the above code simpler, faster and quite different.
Sweet.
> > + add_wait_queue(&dev->wait, &wait);
> > +repeat:
> > + if (signal_pending(current)) {
> > + spin_unlock(&dev->lock);
> > + out = -ERESTARTSYS;
> > + set_current_state(TASK_RUNNING);
> > + remove_wait_queue(&dev->wait, &wait);
> > + goto out;
> > + }
> > + set_current_state(TASK_INTERRUPTIBLE);
> > + if (!inotify_dev_has_events(dev)) {
> > + spin_unlock(&dev->lock);
> > + schedule();
> > + spin_lock(&dev->lock);
> > + goto repeat;
> > + }
> > +
> > + set_current_state(TASK_RUNNING);
> > + remove_wait_queue(&dev->wait, &wait);
>
> The above seems a bit clumsy.
John is reworking inotify_read(), which should take care of the various
issues you raised here.
> > +static int inotify_open(struct inode *inode, struct file *file)
> > +{
> > + struct inotify_device *dev;
> > +
> > + if (atomic_read(&watcher_count) == MAX_INOTIFY_DEVS)
> > + return -ENODEV;
> > +
> > + atomic_inc(&watcher_count);
> > +
> > + dev = kmalloc(sizeof(struct inotify_device), GFP_KERNEL);
> > + if (!dev)
> > + return -ENOMEM;
> > +
> > + memset(dev->bitmask, 0,
> > + sizeof(unsigned long) * MAX_INOTIFY_DEV_WATCHERS / BITS_PER_LONG);
>
> What purpose does this bitmask serve, anyway??
Bitmask of allocated/unallocated watcher descriptors.
Patch sent to add a comment. Also sent a patch to use the bitmap.h
functions instead of this open-coded memset().
> > +static void inotify_release_all_watchers(struct inotify_device *dev)
> > +{
> > + struct inotify_watcher *watcher,*next;
> > +
> > + list_for_each_entry_safe(watcher, next, &dev->watchers, d_list)
> > + ignore_helper(watcher, 0);
> > +}
>
> Locking?
>
> > +
> > +static int inotify_release(struct inode *inode, struct file *file)
> > +{
> > + if (file->private_data) {
>
> Why test ->private_data here?
I'm not sure why. I don't think we ought to - release functions are
only called once, when the last ref on the file dies. I asked John and
sent a patch to remove it.
> If it indeed needs testing here, shouldn't it be zeroed out as well, with
> appropriate locking?
>
> > + struct inotify_device *dev;
> > +
> > + dev = (struct inotify_device *) file->private_data;
>
> Please don't typecast when assigning to and from void*'s
Nod. Patch sent.
> > + del_timer_sync(&dev->timer);
> > + inotify_release_all_watchers(dev);
> > + inotify_release_all_events(dev);
> > + kfree(dev);
> > + }
> > +
> > + printk(KERN_ALERT "inotify device released\n");
> > +
> > + atomic_dec(&watcher_count);
>
> If file->private_data was zero, we shouldn't have decremented this?
I don't think we should test file->private_data at all, so after fixing
that, this is fixed.
> +static int inotify_watch(struct inotify_device *dev,
> + struct inotify_watch_request *request)
> > +{
> > ...
> > + spin_lock(&dev->lock);
> > + spin_lock(&inode->i_lock);
>
> Lock ranking.
>
> > +static int inotify_ioctl(struct inode *ip, struct file *fp,
> > + unsigned int cmd, unsigned long arg) {
>
> An errant brace!
Patch sent.
> > +
> > + if (_IOC_DIR(cmd) & _IOC_READ)
> > + err = !access_ok(VERIFY_READ, (void *) arg, _IOC_SIZE(cmd));
> > +
> > + if (err)
> > + err = -EFAULT;
> > + goto out;
> > +
>
> eh? The above is missing braces, and cannot possibly have worked.
That is my fault, just introduced in the latest revision.
> > + if (_IOC_DIR(cmd) & _IOC_WRITE)
> > + err = !access_ok(VERIFY_WRITE, (void *)arg, _IOC_SIZE(cmd));
> > +
> > + if (err) {
> > + err = -EFAULT;
> > + goto out;
> > + }
>
> Why are these access_ok() checks here? I think they can (should) go away.
They should. We should just use copy_{to,from}_user. Will fix.
> We often do:
>
> switch (cmd) {
> case INOTIFY_WATCH:
>
> to save a tabstop.
Yah. I thought I included that in my coding style cleanup. Patch sent.
> > +struct miscdevice inotify_device = {
> > + .minor = MISC_DYNAMIC_MINOR,
> > + .name = "inotify",
> > + .fops = &inotify_fops,
> > +};
>
> Please update devices.txt
Documentation/devices.txt doesn't really have provisions for dynamically
allocated devices, which this is.
We _could_ take a fixed minor...
> > +struct inotify_event {
> > + int wd;
> > + int mask;
> > + int cookie;
> > + char filename[INOTIFY_FILENAME_MAX];
> > +};
>
> yeah, that's not very nice. Better to kmalloc the pathname.
That is the structure that we communicate with to user-space.
We could kmalloc() filename, but it makes the user-space use a bit more
complicated (and right now it is trivial and wonderfully simple).
We've been debating the pros and cons.
> Please add CONFIG_INOTIFY and make all this:
>
> [...]
>
> go away if the user doesn't want inotify. And remember to test with
> CONFIG_INOTIFY=n!
Done. Patch sent.
> > + INIT_LIST_HEAD (&inode->watchers);
>
> Please review the entire patch and ensure that all macros and function
> calls have no space between the identifier and the opening parenthesis.
I thought I caught them all - guess not, patch sent.
Thanks for the feedback.
Robert Love
> I think you mean bitmap_zero(), but yah. Agreed.
>
> John, here is a patch to use the bitmap.h functions to manipulate the
> bitmap.
thanks
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
On Sun, 2004-09-26 at 22:02 -0400, John McCutchan wrote:
> Announcing the release of inotify 0.10.0.
> Attached is a patch to 2.6.8.1.
Hey, John.
Any reason we check file->private_data in inotify_release()?
I don't think there is. Following patch removes the check.
Robert Love
On Sun, 2004-09-26 at 22:02 -0400, John McCutchan wrote:
John,
> Announcing the release of inotify 0.10.0.
> Attached is a patch to 2.6.8.1.
John, as Andrew pointed out, we are missing some braces in an
inopportune location. I think we need to refactor this code entirely,
to remove the access_ok() checks and use copy_{to,from}_user(), but we
can take this in the meantime since the patch is dead without it.
I'll take paper, please. One brown paper bag right over my head.
Robert Love
Robert Love <[email protected]> wrote:
>
> > > + memset(dev->bitmask, 0,
> > > + sizeof(unsigned long) * MAX_INOTIFY_DEV_WATCHERS / BITS_PER_LONG);
> >
> > What purpose does this bitmask serve, anyway??
>
> Bitmask of allocated/unallocated watcher descriptors.
Can you expand on that? Why do we need such a bitmap?
Would an idr tree be more appropriate?
> We _could_ take a fixed minor...
>
> > > +struct inotify_event {
> > > + int wd;
> > > + int mask;
> > > + int cookie;
> > > + char filename[INOTIFY_FILENAME_MAX];
> > > +};
> >
> > yeah, that's not very nice. Better to kmalloc the pathname.
>
> That is the structure that we communicate with to user-space.
In that case it looks rather 64-bit-unfriendly. A 32-bit compiler will lay
that structure out differently from a 64-bit compiler. Or not. Hard to
say. Perhaps something more defensive is needed here.
One other thing: the patch adds 16 bytes to struct inode, for a feature
which many users and most inodes will not use. Unfortunate.
Is it possible to redesign things so that those four new fields are in a
standalone struct which points at the managed inode? Joined at the hip
like journal_head and buffer_head?
Bonus marks for not having a backpointer from the inode to the new struct ;)
(Still wondering what those timers are doing in there, btw)
On Mon, 2004-09-27 at 21:41 -0700, Andrew Morton wrote:
> Can you expand on that? Why do we need such a bitmap?
It is a unique cookie that identifies the exact object being watched
(e.g. /home/rml or /etc/fstab). I suspect John introduced it in lieu of
the (device,inode) tuple when Al bitched, which makes sense. Because we
have only a single fd (this is one of the problems with dnotify, the 1:1
relation between objects and file descriptors consumed) we need some
other object to identify each watched object.
So John introduced watcher descriptors. This bitmask keeps track of
which descriptors are used versus unused.
> Would an idr tree be more appropriate?
Quite possibly. I was originally thinking that idr's were too heavy,
but if we can make the wd <-> inotify_watcher relation then they make
perfect sense.
I'll look at making the conversion.
> In that case it looks rather 64-bit-unfriendly. A 32-bit compiler will lay
> that structure out differently from a 64-bit compiler. Or not. Hard to
> say. Perhaps something more defensive is needed here.
Well, no, since all known architectures are everything-is-32bit or LP64,
as far as I know. And padding would be the same.
And even if not, the only problem would be with 64-bit architectures and
a 32-bit user-space.
Nonetheless, we should probably make the three int types be s32 or
u32's, eh? I will submit a patch.
> One other thing: the patch adds 16 bytes to struct inode, for a feature
> which many users and most inodes will not use. Unfortunate.
>
> Is it possible to redesign things so that those four new fields are in a
> standalone struct which points at the managed inode? Joined at the hip
> like journal_head and buffer_head?
We could probably get away with a single word-sized variable in the
inode structure.
> Bonus marks for not having a backpointer from the inode to the new struct ;)
Don't push your luck. ;-)
In school, I always felt the bonus was just showing off, what with the
perfect score on the normal assignment. But I will investigate.
> (Still wondering what those timers are doing in there, btw)
John? I see what the timer does, but I am wondering why a timer _has_
to do it?
Thanks for the review, Andrew.
Robert Love
On Mon, 2004-09-27 at 22:14, Robert Love wrote:
> On Mon, 2004-09-27 at 21:41 -0700, Andrew Morton wrote:
>
> > Can you expand on that? Why do we need such a bitmap?
>
> It is a unique cookie that identifies the exact object being watched
> (e.g. /home/rml or /etc/fstab). I suspect John introduced it in lieu of
> the (device,inode) tuple when Al bitched, which makes sense. Because we
> have only a single fd (this is one of the problems with dnotify, the 1:1
> relation between objects and file descriptors consumed) we need some
> other object to identify each watched object.
>
> So John introduced watcher descriptors. This bitmask keeps track of
> which descriptors are used versus unused.
>
> > Would an idr tree be more appropriate?
>
> Quite possibly. I was originally thinking that idr's were too heavy,
> but if we can make the wd <-> inotify_watcher relation then they make
> perfect sense.
>
> I'll look at making the conversion.
I only first heard about idr in last weeks LWN, I thought they might be
useful.
>
> > In that case it looks rather 64-bit-unfriendly. A 32-bit compiler will lay
> > that structure out differently from a 64-bit compiler. Or not. Hard to
> > say. Perhaps something more defensive is needed here.
>
> Well, no, since all known architectures are everything-is-32bit or LP64,
> as far as I know. And padding would be the same.
>
> And even if not, the only problem would be with 64-bit architectures and
> a 32-bit user-space.
>
> Nonetheless, we should probably make the three int types be s32 or
> u32's, eh? I will submit a patch.
>
> > One other thing: the patch adds 16 bytes to struct inode, for a feature
> > which many users and most inodes will not use. Unfortunate.
> >
> > Is it possible to redesign things so that those four new fields are in a
> > standalone struct which points at the managed inode? Joined at the hip
> > like journal_head and buffer_head?
>
> We could probably get away with a single word-sized variable in the
> inode structure.
>
Yep, we could toss everything in to a structure and only have a pointer
to it from the inode.
> > Bonus marks for not having a backpointer from the inode to the new struct ;)
>
> Don't push your luck. ;-)
>
> In school, I always felt the bonus was just showing off, what with the
> perfect score on the normal assignment. But I will investigate.
>
> > (Still wondering what those timers are doing in there, btw)
>
> John? I see what the timer does, but I am wondering why a timer _has_
> to do it?
We need a timer to wake up any processes blocking on a read() call. The
reason it has to be a timer is because the code paths that get run when
an event is queued are not safe places to wake up blocked processes (But
I a kernel amateur so I am probably wrong).
John
On Mon, 2004-09-27 at 16:52 -0400, Robert Love wrote:
> > > +struct inotify_event {
> > > + int wd;
> > > + int mask;
> > > + int cookie;
> > > + char filename[INOTIFY_FILENAME_MAX];
> > > +};
> >
> > yeah, that's not very nice. Better to kmalloc the pathname.
>
> That is the structure that we communicate with to user-space.
>
> We could kmalloc() filename, but it makes the user-space use a bit more
> complicated (and right now it is trivial and wonderfully simple).
>
> We've been debating the pros and cons.
The current way pads out the structure unnecessarily, and still doesn't
handle the really long filenames, by your admission. It incurs extra
syscalls, as few filenames are really 256 characters in length. It makes
userspace double-check whether the filename extends all the way to the
boundary of the structure, and if so, then go back to the disk to try to
guess what the kernel really meant to say.
My way, doing a kmalloc of either the entire structure + filename length
+ NUL, or just the filename + NUL makes userspace infinitesimally
trickier to write. There's more effort in just getting the error
handling correct in either version.
The current way viewed by userspace, stolen shamelessly from the beagle
source:
/* FIXME: We need to check for errors, etc. */
remaining = sizeof (struct inotify_event);
event_buffer = (char *) &event;
while (remaining > 0) {
num_bytes = read (fd, event_buffer, remaining);
event_buffer += num_bytes;
remaining -= num_bytes;
}
My way (where struct inotify_event is declared with a char filename[0]
at the end):
if ((n=read(fd, buf, sizeof buf)) > 0) {
unsigned offset=0;
while (n > 0) {
// inotify guarantees complete events
unsigned len = sizeof(struct inotify_event);
dispatch((struct inotify_event *) &buf[offset]);
len += strlen(buf[offset + len]) + 1;
n -= len;
offset += len;
}
}
Doesn't seem all that tricky.
I'm willing to believe that I'm missing something. Help me see what.
Ray
Ray Lee <[email protected]> wrote:
>
> The current way pads out the structure unnecessarily, and still doesn't
> handle the really long filenames, by your admission. It incurs extra
> syscalls, as few filenames are really 256 characters in length.
Why don't you pass a file descriptor into the syscall instead of a pathname?
You can then take a ref on the inode and userspace can close the file.
That gets you permission checking for free.
Andrew Morton wrote:
> Why don't you pass a file descriptor into the syscall instead of a pathname?
> You can then take a ref on the inode and userspace can close the file.
> That gets you permission checking for free.
For passing in the data, that would work. Wouldn't you still need a name or
path when getting data back though?
Chris
On Tue, 2004-09-28 at 10:41 -0600, Chris Friesen wrote:
> Andrew Morton wrote:
>
> > Why don't you pass a file descriptor into the syscall instead of a pathname?
> > You can then take a ref on the inode and userspace can close the file.
> > That gets you permission checking for free.
>
> For passing in the data, that would work. Wouldn't you still need a name or
> path when getting data back though?
Does Andrew mean an fd on the thing being watched?
That is what we are trying to fix with dnotify: the open fd's are pin
the device and prevent unmount, making notification on removable devices
impossible. Such a 1:1 relationship also opens way too many fd's.
Robert Love
On Tue, 2004-09-28 at 12:53 -0400, Robert Love wrote:
> On Tue, 2004-09-28 at 10:41 -0600, Chris Friesen wrote:
> > Andrew Morton wrote:
> >
> > > Why don't you pass a file descriptor into the syscall instead of a pathname?
> > > You can then take a ref on the inode and userspace can close the file.
> > > That gets you permission checking for free.
> >
> > For passing in the data, that would work. Wouldn't you still need a name or
> > path when getting data back though?
>
> Does Andrew mean an fd on the thing being watched?
>
> That is what we are trying to fix with dnotify: the open fd's are pin
> the device and prevent unmount, making notification on removable devices
> impossible.
That's why he said to close the fd right after the syscall. But yeah,
for a case of someone wanting to watch their 1700 directories underneath
~/, thems a lot of open calls.
> Such a 1:1 relationship also opens way too many fd's.
...I'm not sure I follow. If you're talking about the IN_CREATE and
IN_DELETE events available when watching a parent directory, then I
don't think anything would change. IOW, why not do an open(2) on the
directory in question, and pass that fd in?
Regardless, Andrew's point still stands. What do we want the permission
semantics to be? One would think that a normal user account should not
be able to watch the contents of some other user's 0600 directories, for
example. open(2) already does all the correct checks. We should inherit
that work if at all possible.
Another benefit of passing in an fd, by the way, would be to make it
easier to make a write(2) interface to inotify, and get rid of the ioctl
one.
~ ~
As Chris points out, we still need a way to pass the name or path back
to userspace when an event occurs, which is the interface I was harping
on a few messages back.
It seems we're trying to recreate a variant struct dirent for
communicating changes to userspace. Perhaps we can learn something from
already trodden ground? Just sayin'.
Ray
On Mon, 2004-09-27 at 23:44 -0400, John McCutchan wrote:
> We need a timer to wake up any processes blocking on a read() call. The
> reason it has to be a timer is because the code paths that get run when
> an event is queued are not safe places to wake up blocked processes (But
> I a kernel amateur so I am probably wrong).
We probably don't need the timer. wake_up_interruptible() does not
sleep; we can call it from anywhere. Heck, timers are more atomic than
where we probably need to wake stuff up from anyhow.
But it is not easy to tell where that place is, because it looks like
the timer just runs every 250ms? That is no good.
I suspect that we can remove all of the timer stuff and just do
/* wake up! you are going to miss the bus! */
wake_up_interruptible(&dev->wait);
after
list_add_tail(&kevent->list, &dev->events);
in inotify_dev_queue_event().
Robert Love
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
John McCutchan wrote:
|
| --Why Not dnotify and Why inotify (By Robert Love)--
|
| * inotify has an event that says "the filesystem that the item you were
| watching is on was unmounted" (this is particularly cool).
| +++ linux/fs/super.c 2004-09-18 02:24:33.000000000 -0400
| @@ -36,6 +36,7 @@
| #include <linux/writeback.h> /* for the emergency remount stuff */
| #include <linux/idr.h>
| #include <asm/uaccess.h>
| +#include <linux/inotify.h>
|
|
| void get_filesystem(struct file_system_type *fs);
| @@ -204,6 +205,7 @@
|
| if (root) {
| sb->s_root = NULL;
| + inotify_super_block_umount (sb);
| shrink_dcache_parent(root);
| shrink_dcache_anon(&sb->s_anon);
| dput(root);
This doesn't seem right. generic_shutdown_super is only called when the
last instance of a super is released. If a system were to have a
filesystem mounted in two locations (for instance, by creating a new
namespace), then the umount and ignore would not get propagated when one
is unmounted.
How about an approach that somehow referenced vfsmounts (without having
a reference count proper)? That way you could queue messages in
umount_tree and do_umount..
- --
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTICE: The opinions expressed in this email are held by me,
and may not represent the views of Sun Microsystems, Inc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFBWaGHdQs4kOxk3/MRAr22AJ0SHDUiIXKRKE/TBFmTYBL5J7KD9gCbBoso
DYRg+SjnO8urCKLmDbehvkM=
=5fMW
-----END PGP SIGNATURE-----
On Sun, 2004-09-26 at 22:02 -0400, John McCutchan wrote:
John,
> Announcing the release of inotify 0.10.0.
> Attached is a patch to 2.6.8.1.
Attached patch removes dev->timer and instead wakes up waiting tasks in
inotify_dev_queue_event().
I'd like you to go over it, make sure this works the same and makes
sense. It needs a good testing before merged.
Robert Love
On Tue, 2004-09-28 at 01:45, Ray Lee wrote:
> On Mon, 2004-09-27 at 16:52 -0400, Robert Love wrote:
> > > > +struct inotify_event {
> > > > + int wd;
> > > > + int mask;
> > > > + int cookie;
> > > > + char filename[INOTIFY_FILENAME_MAX];
> > > > +};
> > >
> > > yeah, that's not very nice. Better to kmalloc the pathname.
> >
> > That is the structure that we communicate with to user-space.
> >
> > We could kmalloc() filename, but it makes the user-space use a bit more
> > complicated (and right now it is trivial and wonderfully simple).
> >
> > We've been debating the pros and cons.
>
> The current way pads out the structure unnecessarily, and still doesn't
> handle the really long filenames, by your admission. It incurs extra
> syscalls, as few filenames are really 256 characters in length. It makes
> userspace double-check whether the filename extends all the way to the
> boundary of the structure, and if so, then go back to the disk to try to
> guess what the kernel really meant to say.
I thought that filenames where limited to 256 characters? That was the
idea behind the 256 character limit.
John
On Tue, 2004-09-28 at 13:32, Ray Lee wrote:
> On Tue, 2004-09-28 at 12:53 -0400, Robert Love wrote:
> > On Tue, 2004-09-28 at 10:41 -0600, Chris Friesen wrote:
> > > Andrew Morton wrote:
> > >
> > > > Why don't you pass a file descriptor into the syscall instead of a pathname?
> > > > You can then take a ref on the inode and userspace can close the file.
> > > > That gets you permission checking for free.
> > >
> > > For passing in the data, that would work. Wouldn't you still need a name or
> > > path when getting data back though?
> >
> > Does Andrew mean an fd on the thing being watched?
> >
> > That is what we are trying to fix with dnotify: the open fd's are pin
> > the device and prevent unmount, making notification on removable devices
> > impossible.
>
> That's why he said to close the fd right after the syscall. But yeah,
> for a case of someone wanting to watch their 1700 directories underneath
> ~/, thems a lot of open calls.
>
> > Such a 1:1 relationship also opens way too many fd's.
>
> ...I'm not sure I follow. If you're talking about the IN_CREATE and
> IN_DELETE events available when watching a parent directory, then I
> don't think anything would change. IOW, why not do an open(2) on the
> directory in question, and pass that fd in?
>
> Regardless, Andrew's point still stands. What do we want the permission
> semantics to be? One would think that a normal user account should not
> be able to watch the contents of some other user's 0600 directories, for
> example. open(2) already does all the correct checks. We should inherit
> that work if at all possible.
Yes we should, but I think the inotify interface would be cleaner if we
just factored out this permission code and called it from open() and
from the inotify code.
>
> Another benefit of passing in an fd, by the way, would be to make it
> easier to make a write(2) interface to inotify, and get rid of the ioctl
> one.
>
I don't see how passing directories/files to inotify by fd not filename,
makes providing a write(2) interface to inotify any easier. To me they
are mutually exclusive. When you open up /dev/inotify, you get an fd,
you read events from it. We could provide write on that fd instead of
the ioctl() interface.
> ~ ~
>
> As Chris points out, we still need a way to pass the name or path back
> to userspace when an event occurs, which is the interface I was harping
> on a few messages back.
>
> It seems we're trying to recreate a variant struct dirent for
> communicating changes to userspace. Perhaps we can learn something from
> already trodden ground? Just sayin'.
Yes the current method of passing the name back to user space is
definitely sub par. But I don't think passing a full path to user space
is reasonable, as that would require walking the dirent tree for every
event. Really the best we can provide user space is the filename/dirname
(relative to the directory you are currently watching).
John
On Tue, 2004-09-28 at 13:38, Mike Waychison wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> John McCutchan wrote:
> |
> | --Why Not dnotify and Why inotify (By Robert Love)--
> |
>
> | * inotify has an event that says "the filesystem that the item you were
> | watching is on was unmounted" (this is particularly cool).
>
> | +++ linux/fs/super.c 2004-09-18 02:24:33.000000000 -0400
> | @@ -36,6 +36,7 @@
> | #include <linux/writeback.h> /* for the emergency remount stuff */
> | #include <linux/idr.h>
> | #include <asm/uaccess.h>
> | +#include <linux/inotify.h>
> |
> |
> | void get_filesystem(struct file_system_type *fs);
> | @@ -204,6 +205,7 @@
> |
> | if (root) {
> | sb->s_root = NULL;
> | + inotify_super_block_umount (sb);
> | shrink_dcache_parent(root);
> | shrink_dcache_anon(&sb->s_anon);
> | dput(root);
>
> This doesn't seem right. generic_shutdown_super is only called when the
> last instance of a super is released. If a system were to have a
> filesystem mounted in two locations (for instance, by creating a new
> namespace), then the umount and ignore would not get propagated when one
> is unmounted.
>
> How about an approach that somehow referenced vfsmounts (without having
> a reference count proper)? That way you could queue messages in
> umount_tree and do_umount..
I was not aware of this subtlety. You are right, we should make sure
events are sent for every unmount, not just the last.
John
On Tue, 2004-09-28 at 15:08, Andrew Morton wrote:
> Ray Lee <[email protected]> wrote:
> >
> > The current way pads out the structure unnecessarily, and still doesn't
> > handle the really long filenames, by your admission. It incurs extra
> > syscalls, as few filenames are really 256 characters in length.
>
> Why don't you pass a file descriptor into the syscall instead of a pathname?
> You can then take a ref on the inode and userspace can close the file.
> That gets you permission checking for free.
>
I don't think moving inotify to a syscall based interface is worth it.
First off, on startup, this would require about 2k open() calls,
followed by 2k syscalls to inotify. Not as nice as just 2k ioctl()
calls.
The character device interface right now suits it perfectly. If we used
syscalls we would need to provide a syscall that gives user space a FD
that it can read events on, then more 2 more syscalls to provide the
watch and ignore functionality. Switching to the syscall interface would
also require implementing the idea of the inotify device instance
without the assistance of the char device subsystem. If the ioctl()
based interface is so bad, we could change it to a write() based
interface.
John
On Tue, 2004-09-28 at 16:40 -0400, John McCutchan wrote:
> If the ioctl() based interface is so bad, we could change it to a
> write() based interface.
What baffles me is that a write() interface is infinitely more type
unsafe, arbitrary, and uncontrolled than an ioctl() one.
Robert Love
On Tue, 2004-09-28 at 16:26 -0400, John McCutchan wrote:
> On Tue, 2004-09-28 at 01:45, Ray Lee wrote:
> > The current way pads out the structure unnecessarily, and still doesn't
> > handle the really long filenames, by your admission. It incurs extra
> > syscalls, as few filenames are really 256 characters in length. It makes
> > userspace double-check whether the filename extends all the way to the
> > boundary of the structure, and if so, then go back to the disk to try to
> > guess what the kernel really meant to say.
>
> I thought that filenames where limited to 256 characters? That was the
> idea behind the 256 character limit.
I thought so too, as linux/limits.h claims:
#define NAME_MAX 255 /* # chars in a file name */
But Robert earlier said:
> Technically speaking, a single filename can be as large as PATH_MAX-1.
> The comment is just a warning, though, to explain the dreary
> theoretical side of the world.
...where PATH_MAX is 4096.
So, got me. I believe there is some minor confusion going on.
Ray
On Tue, 2004-09-28 at 14:10 -0700, Ray Lee wrote:
> So, got me. I believe there is some minor confusion going on.
I think you are both just talking about different things. John was
confused about what you were saying, I think.
Right now we limit the filename returned to 256 characters.
And file names can be as large as PATH_MAX-1 (minus one for the root
slash).
Robert Love
On Tue, 2004-09-28 at 16:34 -0400, John McCutchan wrote:
> On Tue, 2004-09-28 at 13:32, Ray Lee wrote:
> > On Tue, 2004-09-28 at 12:53 -0400, Robert Love wrote:
> > > On Tue, 2004-09-28 at 10:41 -0600, Chris Friesen wrote:
> > > > Andrew Morton wrote:
> > > >
> > > > > Why don't you pass a file descriptor into the syscall instead of a pathname?
> > > > > You can then take a ref on the inode and userspace can close the file.
> > > > > That gets you permission checking for free.
> > > >
> > > > For passing in the data, that would work. Wouldn't you still need a name or
> > > > path when getting data back though?
> > >
> > > Does Andrew mean an fd on the thing being watched?
> > >
> > > That is what we are trying to fix with dnotify: the open fd's are pin
> > > the device and prevent unmount, making notification on removable devices
> > > impossible.
> >
> > That's why he said to close the fd right after the syscall. But yeah,
> > for a case of someone wanting to watch their 1700 directories underneath
> > ~/, thems a lot of open calls.
> >
> > > Such a 1:1 relationship also opens way too many fd's.
> >
> > ...I'm not sure I follow. If you're talking about the IN_CREATE and
> > IN_DELETE events available when watching a parent directory, then I
> > don't think anything would change. IOW, why not do an open(2) on the
> > directory in question, and pass that fd in?
> >
> > Regardless, Andrew's point still stands. What do we want the permission
> > semantics to be? One would think that a normal user account should not
> > be able to watch the contents of some other user's 0600 directories, for
> > example. open(2) already does all the correct checks. We should inherit
> > that work if at all possible.
>
> Yes we should, but I think the inotify interface would be cleaner if we
> just factored out this permission code and called it from open() and
> from the inotify code.
I don't know enough to have an opinion on this. I'll leave that up to
actual arbiters of taste.
> >
> > Another benefit of passing in an fd, by the way, would be to make it
> > easier to make a write(2) interface to inotify, and get rid of the ioctl
> > one.
> >
>
> I don't see how passing directories/files to inotify by fd not filename,
> makes providing a write(2) interface to inotify any easier.
There seemed to be some unwillingness to deal with variable length items
being passed back and forth, to and fro userspace. Placing an fd in the
watch structure would make it fixed size, rather than variable length.
Frankly, I find either as easy to deal with.
> To me they are mutually exclusive.
To me as well. Robert seems to have issues with making the userspace
interface as simple as possible, hence my suggestion. I'm happy to
withdraw it.
> When you open up /dev/inotify, you get an fd,
> you read events from it. We could provide write on that fd instead of
> the ioctl() interface.
I agree completely.
> > As Chris points out, we still need a way to pass the name or path back
> > to userspace when an event occurs, which is the interface I was harping
> > on a few messages back.
> >
> > It seems we're trying to recreate a variant struct dirent for
> > communicating changes to userspace. Perhaps we can learn something from
> > already trodden ground? Just sayin'.
>
> Yes the current method of passing the name back to user space is
> definitely sub par. But I don't think passing a full path to user space
> is reasonable, as that would require walking the dirent tree for every
> event. Really the best we can provide user space is the filename/dirname
> (relative to the directory you are currently watching).
Apologies, I was unclear. I should not have said that we would pass the
full path back. In fact, we can't, as the same directory may be mounted
in multiple places. I knew this, but I blame lack of sleep for my slip.
Ray
On Tue, 2004-09-28 at 17:10, Ray Lee wrote:
> On Tue, 2004-09-28 at 16:26 -0400, John McCutchan wrote:
> > On Tue, 2004-09-28 at 01:45, Ray Lee wrote:
> > > The current way pads out the structure unnecessarily, and still doesn't
> > > handle the really long filenames, by your admission. It incurs extra
> > > syscalls, as few filenames are really 256 characters in length. It makes
> > > userspace double-check whether the filename extends all the way to the
> > > boundary of the structure, and if so, then go back to the disk to try to
> > > guess what the kernel really meant to say.
> >
> > I thought that filenames where limited to 256 characters? That was the
> > idea behind the 256 character limit.
>
> I thought so too, as linux/limits.h claims:
>
> #define NAME_MAX 255 /* # chars in a file name */
>
> But Robert earlier said:
>
> > Technically speaking, a single filename can be as large as PATH_MAX-1.
> > The comment is just a warning, though, to explain the dreary
> > theoretical side of the world.
>
> ...where PATH_MAX is 4096.
>
> So, got me. I believe there is some minor confusion going on.
A quick test of 'echo "" > XXXX...XXX' the filename seems to be limited
to 256.
John
On Tue, 2004-09-28 at 16:40 -0400, John McCutchan wrote:
> On Tue, 2004-09-28 at 15:08, Andrew Morton wrote:
> > Ray Lee <[email protected]> wrote:
> > >
> > > The current way pads out the structure unnecessarily, and still doesn't
> > > handle the really long filenames, by your admission. It incurs extra
> > > syscalls, as few filenames are really 256 characters in length.
> >
> > Why don't you pass a file descriptor into the syscall instead of a pathname?
> > You can then take a ref on the inode and userspace can close the file.
> > That gets you permission checking for free.
> >
>
> I don't think moving inotify to a syscall based interface is worth it.
>
> First off, on startup, this would require about 2k open() calls,
> followed by 2k syscalls to inotify.
And then 2k close() calls.
> Not as nice as just 2k ioctl() calls.
<shrug> Syscalls aren't free, but they aren't the end of the world.
> The character device interface right now suits it perfectly. If we used
> syscalls we would need to provide a syscall that gives user space a FD
> that it can read events on,
Again, apologies, I should know better than to write email on short
sleep. All I was suggesting was that we pass in an fd that comes from
open(), and that we should look at replacing the ioctl with write(). I
like it as a character device, honest.
Ray
On Tue, 2004-09-28 at 17:21 -0400, John McCutchan wrote:
> A quick test of 'echo "" > XXXX...XXX' the filename seems to be limited
> to 256.
I think ext3 limits filenames to 255 or 256 characters. Other
filesystems probably have other limits.
The POSIX absolute maximum is PATH_MAX for the entire path, which means
that a legal filename could theoretically be PATH_MAX-1 (a file in the
root directory). But maybe in practice this is never an issue.
We still have the issue where 256 is much larger than the average file.
But, as I have said before, I am _for_ keeping the thing static,
although I acknowledge all the points about going dynamic.
Robert Love
OK, I told John I would post this ASAP, as soon as I finished testing
it, but I got backed up, so here it is without much testing. It does
compile fine.
This patch removes our current bitmap-based allocation system and
replaces it with the idr layer. The idr layer allows us to use a radix
tree, rooted at each device instance, to trivially map between watcher
descriptors and watcher structures. In idr terminology, the watcher
descriptor is the id and the watcher structure is the pointer.
Allocating a new watcher descriptor and associating it with a given
watcher structure is done in the same place as before,
inotify_dev_get_wd(). The code for doing this is a bit weird. The idr
layer's interface leaves a bit to be desired.
The function dev_find_wd() is used to map from a given watcher
descriptor to a watcher structure. This used to be our least scalable
function: O(n), but at a small fixed n, so you could call it O(1). Now
it is O(lg n); n is still fixed, so you can still call it O(1).
I also cleaned up some locking and added some comments.
Like I said, I have not tested this, probably won't until tomorrow, but
here it is to play with earlier if anyone so chooses. The idr layer is
rather nice for this sort of thing.
Patch is on top of all of my previous patches.
Robert Love
On Tue, 2004-09-28 at 17:35 -0400, Robert Love wrote:
> The POSIX absolute maximum is PATH_MAX for the entire path, which means
> that a legal filename could theoretically be PATH_MAX-1 (a file in the
> root directory). But maybe in practice this is never an issue.
Ah, okay.
> We still have the issue where 256 is much larger than the average file.
Yes.
> But, as I have said before, I am _for_ keeping the thing static,
> although I acknowledge all the points about going dynamic.
I'm afraid I'm not seeing the complexity argument. Do you have other
concerns regarding dynamic lengths?
Ray
On Tue, 2004-09-28 at 16:47 -0400, Robert Love wrote:
> On Tue, 2004-09-28 at 16:40 -0400, John McCutchan wrote:
>
> > If the ioctl() based interface is so bad, we could change it to a
> > write() based interface.
>
> What baffles me is that a write() interface is infinitely more type
> unsafe, arbitrary, and uncontrolled than an ioctl() one.
The word "Huh?" comes to mind, for some reason.
It's just as easy to pass crap via ioctl() as write(). Regardless, I'm
not the Arbiter Of Taste on this topic. I'm merely following what I
believe to be the prevailing (<cough>Linus</cough>) sentiments are
regarding ioctl()s.
Well, that, and I personally find write(fd, &x, sizeof x) a cleaner
interface. My personal preference is of little importance on this topic,
however.
Ray
+ spin_lock(&dev->lock);
+ ret = idr_get_new(&dev->idr, watcher, &watcher->wd);
+ spin_lock(&dev->lock);
I think you mean spin_unlock on the third line? Other than that I think
it should work.
John
On Tue, 2004-09-28 at 17:46, Robert Love wrote:
> OK, I told John I would post this ASAP, as soon as I finished testing
> it, but I got backed up, so here it is without much testing. It does
> compile fine.
>
> This patch removes our current bitmap-based allocation system and
> replaces it with the idr layer. The idr layer allows us to use a radix
> tree, rooted at each device instance, to trivially map between watcher
> descriptors and watcher structures. In idr terminology, the watcher
> descriptor is the id and the watcher structure is the pointer.
>
> Allocating a new watcher descriptor and associating it with a given
> watcher structure is done in the same place as before,
> inotify_dev_get_wd(). The code for doing this is a bit weird. The idr
> layer's interface leaves a bit to be desired.
>
> The function dev_find_wd() is used to map from a given watcher
> descriptor to a watcher structure. This used to be our least scalable
> function: O(n), but at a small fixed n, so you could call it O(1). Now
> it is O(lg n); n is still fixed, so you can still call it O(1).
>
> I also cleaned up some locking and added some comments.
>
> Like I said, I have not tested this, probably won't until tomorrow, but
> here it is to play with earlier if anyone so chooses. The idr layer is
> rather nice for this sort of thing.
>
> Patch is on top of all of my previous patches.
>
> Robert Love
>
On Tue, 2004-09-28 at 14:50 -0700, Ray Lee wrote:
> I'm afraid I'm not seeing the complexity argument. Do you have other
> concerns regarding dynamic lengths?
You _do_ see the complexity, you just don't that think it matters.
It forces more complexity on user-space (as least as much as you have
shown) and even more in the kernel (due partly to the current very
simple event code in place that needs the buffer to be a multiple of the
struct size, etc.).
Personally, I don't know if it matters. John and I have both said that
we will look into going to a dynamic filename.
Also, I am sure that John will take a patch, if you are so inclined.
Robert Love
On Tue, 2004-09-28 at 17:58 -0400, John McCutchan wrote:
> + spin_lock(&dev->lock);
> + ret = idr_get_new(&dev->idr, watcher, &watcher->wd);
> + spin_lock(&dev->lock);
>
> I think you mean spin_unlock on the third line? Other than that I think
> it should work.
Ech, I have an updated patch.
Robert Love
On Tue, 2004-09-28 at 14:39 -0700, Ray Lee wrote:
> It's just as easy to pass crap via ioctl() as write().
The ioctl() at least forces you to break up your calls into request and
argument pairs. And as the requests are inotify-based preprocessor
defines, at least that half of the equation is impossible to mess up.
In other words, ioctl() is slightly more structured than write(), which
is utterly non-structured.
Robert Love
John McCutchan <[email protected]> wrote:
>
> > ~ ~
> >
> > As Chris points out, we still need a way to pass the name or path back
> > to userspace when an event occurs, which is the interface I was harping
> > on a few messages back.
> >
> > It seems we're trying to recreate a variant struct dirent for
> > communicating changes to userspace. Perhaps we can learn something from
> > already trodden ground? Just sayin'.
>
> Yes the current method of passing the name back to user space is
> definitely sub par. But I don't think passing a full path to user space
> is reasonable, as that would require walking the dirent tree for every
> event. Really the best we can provide user space is the filename/dirname
> (relative to the directory you are currently watching).
Userspace requests that the kernel start monitoring an fd. The kernel
returns an integer cookie which corresponds to that monitor.
When an event occurs, the kernel returns the event identifier and the
cookie to userspace.
Userspace then does a lookup to work out what pathname corresponds to the
cookie.
Or is it the case that you expect a single monitor on /etc will return
"/etc/passwd" if someone modified that file, or "/etc/hosts" if someone
modified that file? If so, perhaps we should take that feature away and
require that userspace rescan the directory?
Because passing pathnames into and back from the kernel from this manner is
really not a nice design.
A halfway point might be to return {cookie-of-/etc,EVENT_MODIFY,"hosts"} to
a monitor on the /etc directory.
On Thu, 2004-09-30 at 00:15, Andrew Morton wrote:
> John McCutchan <[email protected]> wrote:
> >
> > > ~ ~
> > >
> > > As Chris points out, we still need a way to pass the name or path back
> > > to userspace when an event occurs, which is the interface I was harping
> > > on a few messages back.
> > >
> > > It seems we're trying to recreate a variant struct dirent for
> > > communicating changes to userspace. Perhaps we can learn something from
> > > already trodden ground? Just sayin'.
> >
> > Yes the current method of passing the name back to user space is
> > definitely sub par. But I don't think passing a full path to user space
> > is reasonable, as that would require walking the dirent tree for every
> > event. Really the best we can provide user space is the filename/dirname
> > (relative to the directory you are currently watching).
>
> Userspace requests that the kernel start monitoring an fd. The kernel
> returns an integer cookie which corresponds to that monitor.
>
> When an event occurs, the kernel returns the event identifier and the
> cookie to userspace.
>
> Userspace then does a lookup to work out what pathname corresponds to the
> cookie.
That is what inotify does. Except user space asks the kernel to monitor
a particular path (not an fd).
>
>
> Or is it the case that you expect a single monitor on /etc will return
> "/etc/passwd" if someone modified that file, or "/etc/hosts" if someone
> modified that file? If so, perhaps we should take that feature away and
> require that userspace rescan the directory?
>
Yes, this is the case that we are discussing. Rescanning the directory
is a waste, if the kernel can cheaply tell userspace what happened.
>
> Because passing pathnames into and back from the kernel from this manner is
> really not a nice design.
>
> A halfway point might be to return {cookie-of-/etc,EVENT_MODIFY,"hosts"} to
> a monitor on the /etc directory.
>
>
This is what the current inotify code does. You get an event with the
filename relative to your watch descriptor (cookie).
John
On Wed, 2004-09-29 at 21:15 -0700, Andrew Morton wrote:
> Or is it the case that you expect a single monitor on /etc will return
> "/etc/passwd" if someone modified that file, or "/etc/hosts" if someone
> modified that file?
A monitor on "/etc" will return "hosts" if hosts is modified. Which I
think is OK--we don't pass the entire path, nor do we want to if we
could do it easily, for numerous reasons ..
> If so, perhaps we should take that feature away and
> require that userspace rescan the directory?
This is one of the issues with dnotify we want to fix. And to do
generic file notification, it is not just rescanning the directory but
caching file modify timestamps (say, keep all of the stat structures in
memory) and then re-stat'ing everything and comparing. Ugh.
> Because passing pathnames into and back from the kernel from this manner is
> really not a nice design.
Agreed.
> A halfway point might be to return {cookie-of-/etc,EVENT_MODIFY,"hosts"} to
> a monitor on the /etc directory.
This is what inotify does. ;-)
Everything is relative to the object being watched.
Robert Love
John McCutchan <[email protected]> wrote:
>
> On Tue, 2004-09-28 at 15:08, Andrew Morton wrote:
> > Ray Lee <[email protected]> wrote:
> > >
> > > The current way pads out the structure unnecessarily, and still doesn't
> > > handle the really long filenames, by your admission. It incurs extra
> > > syscalls, as few filenames are really 256 characters in length.
> >
> > Why don't you pass a file descriptor into the syscall instead of a pathname?
> > You can then take a ref on the inode and userspace can close the file.
> > That gets you permission checking for free.
> >
>
> I don't think moving inotify to a syscall based interface is worth it.
That wasn't my point.
What I'm trying to get away from is this passing of full pathnames into and
out of the kernel, whether by syscall or ioctl. It is a poor interface
and, less importantly, is slow.
And it is slow on the common (notify) path! It's worth adding additional
setup overhead if we can make he event delivery faster, no?
> First off, on startup, this would require about 2k open() calls,
> followed by 2k syscalls to inotify. Not as nice as just 2k ioctl()
> calls.
There's probably not a lot of difference, even if all those pathnames and
inodes are in cache.
Robert wrote:
> A monitor on "/etc" will return "hosts" if hosts is modified. Which I
> think is OK--we don't pass the entire path, nor do we want to if we
> could do it easily, for numerous reasons ..
How about returning the inode number?
Notice that this can be converted to the name without using stat, using
the d_ino field in the struct dirent returned from an opendir/readdir
loop.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
Paul Jackson wrote:
> Robert wrote:
>
>>A monitor on "/etc" will return "hosts" if hosts is modified. Which I
>>think is OK--we don't pass the entire path, nor do we want to if we
>>could do it easily, for numerous reasons ..
>
>
> How about returning the inode number?
Then I think you need the filesystem as well.
Chris
Chris replied to pj:
> > How about returning the inode number?
>
> Then I think you need the filesystem as well.
They already had that, with the integer cookie representing the
directory ("/etc", in the example), as was a bit obvious, or else
returning the relative path ("hosts" in the example) wouldn't have
worked either.
That is, I was suggesting returning:
{cookie-of-/etc,EVENT_MODIFY,inode-number-of-hosts-file}
instead of:
{cookie-of-/etc,EVENT_MODIFY,"hosts"}
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
On Wed, 2004-09-29 at 20:05 -0700, Paul Jackson wrote:
> Robert wrote:
> > A monitor on "/etc" will return "hosts" if hosts is modified. Which I
> > think is OK--we don't pass the entire path, nor do we want to if we
> > could do it easily, for numerous reasons ..
>
> How about returning the inode number?
>
> Notice that this can be converted to the name without using stat, using
> the d_ino field in the struct dirent returned from an opendir/readdir
> loop.
This changes an O(1) process to O(N), where N is the size of the
directory. That's one of the issues with dnotify that they're trying to
fix.
Bottom line, the kernel has that information available to it, and
cheaply. Forcing userspace to rederive that information for the
advantage of not spewing filenames out of the kernel is a false economy,
when viewed from the entire system.
Ray
> This changes an O(1) process to O(N),
At the microlevel, yes. But:
1) If one takes as the "unit of measurement" the number of
system calls made, it's more like O(N/128), given that
one might average 128 directory entries per getdents()
call.
2) This can be cached, with user code mapping inode->d_ino
to d_name, and then the cached name checked with a single
stat(2) call to ensure it wasn't stale.
Be leary of micro optimizations imposing a poorer design, especially
across major API boundaries. Simply waving "O(N)" in our face may not
be adequately persuasive. You might need a compelling performance
analysis, showing that you can only meet critical goals by passing the
name. Such analysis may already be intuitively obvious to you. If it's
already earlier on this thread, don't hesitate to tell me where to go
back and read it. But right now I am unaware of any such compelling
need.
And there is a long history of pain in Unix dealing with variable length
return values. Much better to deal with that entirely in user space
code under your control, than to have to align kernel and glibc code in
addition to your code, to get something fixed. More often than not,
you will end up with faster code when you control the details, than if
you have to align heaven and earth to make changes.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
On Thu, 2004-09-30 at 09:27 -0700, Paul Jackson wrote:
> > This changes an O(1) process to O(N),
>
> At the microlevel, yes. But:
Apologies, I wasn't trying to shut down the conversation with a magic
O(N) wand.
> 1) If one takes as the "unit of measurement" the number of
> system calls made, it's more like O(N/128), given that
> one might average 128 directory entries per getdents()
> call.
O(N/128) = O(N), but you knew that already. My fully arbitrary units,
however, were in terms of what needs to be done from userspace to
recreate the information that the kernel already had, and that means
basically scanning a directory of N items. Sure, it's less than N
syscalls, but userspace is still going to ask the CPU to do N
comparisons across N items, with all the implications that comes along
with that. (To be explicit: Syscalls, memory usage for caching, blowing
the processor cache by visiting lots of useless items, and speed of just
CPU time for the comparisons.)
> 2) This can be cached, with user code mapping inode->d_ino
> to d_name, and then the cached name checked with a single
> stat(2) call to ensure it wasn't stale.
Yeah, well, dnotify asks userspace to do much the same thing. This is
marginally better, in that we'd know what we're looking for once we see
it. However, dnotify has shown that forcing userspace to cache the
entire contents of all the directories it wishes to watch doesn't scale
well. That's one of the core problems they're trying to address, and
passing an inode number back would reintroduce it.
> Be leary of micro optimizations imposing a poorer design, especially
> across major API boundaries.
Yes.
> Simply waving "O(N)" in our face may not be adequately persuasive.
It shouldn't be persuasive on its own. It should be coupled with the
fact that dnotify already forces an O(N) workload per directory onto
userspace, and that has shown itself to not work well.
> And there is a long history of pain in Unix dealing with variable length
> return values.
Y'know, I keep hearing this, but I'm afraid I'm just too dense to get
it. How about we get it right once, in the kernel and userspace, and we
make a library out of it so that luser space authors like myself don't
have to reinvent the wheel? (getdents(2) seems to work somehow.) I'll
even offer to write the code once I can sit down and spend time with the
kernel again, if no one else is willing to touch this.
> Much better to deal with that entirely in user space
> code under your control, than to have to align kernel and glibc code in
> addition to your code, to get something fixed. More often than not,
> you will end up with faster code when you control the details, than if
> you have to align heaven and earth to make changes.
Sure, there are designs that have a natural fit with the system, and
others that don't. Much like trying to express a trig function in terms
of polynomials. Not a good fit, and you get a massive series to try to
approximate it, if you'll forgive a calculus analogy. Anyway, the kernel
*has* that information already, so we're not shoving the work somewhere
else just to make userspace simpler somehow (which is always a specious
argument).
I think I understand your point. What I'd suggest is reading Robert's
"Why inotify?" that's posted along with the inotify patches (see
http://lkml.org/lkml/2004/9/28/201 for example) to get a small feel for
what issues are trying to be addressed.
Off to do honest work,
Ray
Ray wrote:
> However, dnotify has shown that forcing userspace to cache the
> entire contents of all the directories it wishes to watch doesn't scale
> well.
The dnotify cache isn't just an optional performance tweak, caching
inode to name. It's an essential tracking of much of the stat
information of any file of interest, in order to have _any_ way of
detecting which file changed.
The inotify cache could just have the last few most recently used
entries or some such - it's just a space vs time performance tradeoff.
So passing back an inode number doesn't come close to reintroducing
the forced tracking of all the interesting stat data of every file.
> dnotify already forces an O(N) workload per directory onto
> userspace, and that has shown itself to not work well.
Just because there exists an O(N) solution that has been shown to fail
doesn't mean all O(N) solutions fail. If the constants change enough,
then the actual numbers (how many changes per minute, how many compute
cycles and memory bytes, what's the likely cache hit rate for a given
size cache, ...) need to be re-examined. I see no evidence of that
re-examination -- am I missing something?
> getdents(2) seems to work somehow.
Yeah ... but we had to walk up hill, both ways, through 9 foot snow
drifts, for years, summer and winter, to get there ;).
> the kernel *has* that information already, ...
Frustrating, isn't it. The information is right there, and we're trying
to keep you from getting to it.
Whatever ...
> Off to do honest work,
Good idea. Good luck with this. I rest my case, such as it be.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
Okay, now I'm sure I'm not communicating well. Please let me try again,
and I'll see if I can do better this time.
On Thu, 2004-09-30 at 10:48 -0700, Paul Jackson wrote:
> Ray wrote:
> > However, dnotify has shown that forcing userspace to cache the
> > entire contents of all the directories it wishes to watch doesn't scale
> > well.
>
> The dnotify cache isn't just an optional performance tweak, caching
> inode to name. It's an essential tracking of much of the stat
> information of any file of interest, in order to have _any_ way of
> detecting which file changed.
Agreed.
> The inotify cache could just have the last few most recently used
> entries or some such - it's just a space vs time performance tradeoff.
Not the case.
> So passing back an inode number doesn't come close to reintroducing
> the forced tracking of all the interesting stat data of every file.
It certainly forces userspace to track all file names and inodes, at
least. Userspace wishes to know about deletes and renames. Unless it
caches everything, it won't know what was deleted, or what got renamed.
For that matter, passing back the inode number might also cause
confusion for hardlinked files in the same directory, though I'd have to
think about that for a bit to be sure.
Perhaps I'm missing something. I haven't missed anything yet today,
which means I'm overdue.
> > dnotify already forces an O(N) workload per directory onto
> > userspace, and that has shown itself to not work well.
>
> Just because there exists an O(N) solution that has been shown to fail
> doesn't mean all O(N) solutions fail.
<shrug> Okay, I'll agree with you in theory. This is in fact true for
several classes of multiplication algorithms, for example. (Who's gonna
do an FFT multiply of two two-digit numbers, I ask; no one.) So, yeah, I
get your point. Honest.
> If the constants change enough,
> then the actual numbers (how many changes per minute, how many compute
> cycles and memory bytes, what's the likely cache hit rate for a given
> size cache, ...) need to be re-examined. I see no evidence of that
> re-examination -- am I missing something?
This is one of the parts I'm not communicating well. For any 'event'
that inotify wants to relay, the kernel knows the inode and the name.
You're saying pass the inode number, as it's smaller and makes for an
easier and higher speed interface to get changes to userspace (if I
understand you correctly).
I'm saying, sure, if you look at that corner of the problem, you're
right. That's ignoring how this gets used, however, where userspace
ultimately wants to know the name of the file that's changed. So, by
your method, we scan the entire directory, looking for the inode that
matches via the getdents call.
So, the kernel ended up sending the name of the changed file *anyway*,
but userspace had to ask for it. And, oh, the kernel sent the name of
every other file in that directory too. (Okay, maybe half, but then
again maybe not if you want to catch hardlinked files.)
> > the kernel *has* that information already, ...
>
> Frustrating, isn't it. The information is right there, and we're trying
> to keep you from getting to it.
>
> Whatever ...
No, that's not my point at all. I'm not saying Big Bad Kernel Developers
won't let me have my candy. I'm saying that if the kernel has the
information already, and we're not passing it to userspace, and
userspace *needs* that information, then all we've done is guarantee
another long set of syscalls while it tries to pull the directory
contents to match up item for item against its cache of previously known
file states.
Or, in other words, I don't see how tossing away needed information is
going to make anything more efficient in this case. Userspace, like a
bulldog, will be coming back knocking on the kernel's door to get that
information one way or another.
~ ~
I've been avoiding saying this, as this really will be a bit more
complex than even my suggestions, but perhaps everyone would be happier
if we crammed all this through a netlink socket instead. Got me.
Ray
Ray wrote:
> > So passing back an inode number doesn't come close to reintroducing
> > the forced tracking of all the interesting stat data of every file.
>
> It certainly forces userspace to track all file names and inodes, at
> least. Userspace wishes to know about deletes and renames. Unless it
> caches everything, it won't know what was deleted, or what got renamed.
Aha ... you finally got through my thick skull. Congratulations ;).
If file "foo" goes away, and if the kernel only reports that inode
314159 was unlinked from a given directory, unless some user code
previously remembered that inode 314159 was accessible from the
directory entry named "foo", you can't tell the user that "foo"
disappeared.
Do you have the same problem with directories, if some cookie, not the
full pathname, is passed back after an 'rmdir(2)' event? Or is it just
that it's less onerous to track all the directories, because there's
fewer of them?
> You're saying pass the inode number, as it's smaller and makes for an
> easier and higher speed interface to get changes to userspace (if I
> understand you correctly).
That was the idea, yes. Most of it anyway. That and striving to keep
the API the kernel presents to the user minimal and orthogonal.
> I'm saying that if the kernel has the information already, and we're not
> passing it to userspace, and userspace *needs* that information, then
> all we've done is guarantee another long set of syscalls while it tries
> to pull the directory contents to match up item for item against its
> cache of previously known file states.
If there is a way, any way, for user code to get something from the
kernel, then I don't mind dragging my feet on providing other ways,
until I see a good reason. It's always worth a bit of effort to keep
kernel API's minimal and orthogonal.
Your "delete and rename" point above seems now like a good reason.
> I've been avoiding saying this, as this really will be a bit more
> complex than even my suggestions, but perhaps everyone would be happier
> if we crammed all this through a netlink socket instead. Got me.
No comment ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
On Thu, 2004-09-30 at 21:09 -0700, Paul Jackson wrote:
> If file "foo" goes away, and if the kernel only reports that inode
> 314159 was unlinked from a given directory, unless some user code
> previously remembered that inode 314159 was accessible from the
> directory entry named "foo", you can't tell the user that "foo"
> disappeared.
Right. Follow the same argument for renames, as well. ("Well, what did
it *used* be to named?") Short of caching the entire contents, there's
no way to know if all that's returned is an inode.
> Do you have the same problem with directories, if some cookie, not the
> full pathname, is passed back after an 'rmdir(2)' event? Or is it just
> that it's less onerous to track all the directories, because there's
> fewer of them?
In the case of a directory that's being monitored going away, userspace
has already asked for tracking events for that directory explicitly, and
has to keep track of the cookie ("watch descriptor") to directory name
mapping (or some context, at least) already. That's the only way it
knows the correct context for the events.
For example, I ask for monitoring of two different directories on my
system, both with a final path component of "etc". Since the kernel
can't pass back the full path (cf. mount --bind, or consider an
watch/chdir/watch sequence, if you believe the kernel should cache the
full path passed to it from userspace), it instead needs to pass back
something unique for that directory. The name isn't, and the full
pathname isn't available to the kernel. So, the watch descriptor, being
an arbitrary but unique cookie, steps in.
> > You're saying pass the inode number, as it's smaller and makes for an
> > easier and higher speed interface to get changes to userspace (if I
> > understand you correctly).
>
> That was the idea, yes. Most of it anyway. That and striving to keep
> the API the kernel presents to the user minimal and orthogonal.
<Nod> That's actually a stronger argument with me than the micro-
optimization. Orthogonal interfaces allow for more uses. However, I can
find no way to clean up the interface any more than what's been
suggested so far.
> If there is a way, any way, for user code to get something from the
> kernel, then I don't mind dragging my feet on providing other ways,
> until I see a good reason.
Fair enough.
> It's always worth a bit of effort to keep
> kernel API's minimal and orthogonal.
Agreed.
> Your "delete and rename" point above seems now like a good reason.
It would bring inotify almost down to the level of usefulness of
dnotify, which would be painful.
Ray