2004-09-28 22:34:22

by John McCutchan

Subject: [RFC][PATCH] inotify 0.11.0 [WITH PATCH!]

Hello,

Here is release 0.11.0 of inotify. Attached is a patch against 2.6.8.1.

--New in this version--
-remove timer (rml)
-fix typo (rml)
-remove check for dev->file_private (rml)
-redo find_inode (rml)
-use the bitmap functions (rml)
-modularization (rml)
-misc cleanup (rml,me)
-redo inotify_read (me)

John McCutchan

Release notes:

--Why Not dnotify and Why inotify (By Robert Love)--

Everyone seems quick to deride the blunder known as "dnotify" and applaud a
replacement, any replacement, man anything but that current mess, but in the
name of fairness I present my treatise on why dnotify is what one might call
not good:

* dnotify requires the opening of one fd per directory that you intend to
  watch.
    o The file descriptor pins the directory, disallowing the backing
      device to be unmounted, which absolutely wreaks havoc with removable
      media.
    o Watching many directories results in many open file descriptors,
      possibly hitting a per-process fd limit.
* dnotify is directory-based. You only learn about changes to directories.
  Sure, a change to a file in a directory affects the directory, but you are
  then forced to keep a cache of stat structures around to compare things in
  order to find out which file.
* dnotify's interface to user-space is awful.
    o dnotify uses signals to communicate with user-space.
    o Specifically, dnotify uses SIGIO.
    o But then you can pick a different signal! So by "signals," I really
      meant you need to use real-time signals if you want to queue the
      events.
* dnotify basically ignores any problems that would arise in the VFS from
  hard links.
* Rumor is that the "d" in "dnotify" does not stand for "directory" but for
  "suck."

A suitable replacement is "inotify." And now, my tract on what inotify brings
to the table:

* inotify's interface is a device node, not SIGIO.
    o You open only a single fd, to the device node. No more pinning
      directories or opening a million file descriptors.
    o Usage is nice: open the device, issue simple commands via ioctl(),
      and then block on the device. It returns events when, well, there are
      events to be returned.
    o You can select() on the device node and so it integrates with main
      loops like coffee mixed with vanilla milkshake.
* inotify has an event that says "the filesystem that the item you were
  watching is on was unmounted" (this is particularly cool).
* inotify can watch directories or files.
* The "i" in inotify does not stand for "suck" but for "inode" -- the logical
  choice since inotify is inode-based.


--COMPLEXITY--

I have been asked what the complexity of inotify is. Inotify has
two code paths where complexity could be an issue:

Adding a watcher to a device
This code has to check if the inode is already being watched
by the device. This is O(1) since the maximum number of
devices is limited to 8.


Removing a watch from a device
This code has to search all watches on the device to find the
watch descriptor that is to be removed. This is a linear search,
but it should not really be an issue because it is limited to
8192 entries. If this does turn into a concern, I would replace
the list of watches on the device with a sorted binary tree, so
that the search could be done very quickly.
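
The removal path described above can be sketched in plain C. This is only an illustration: `struct watch` and `remove_watch()` are hypothetical names, and a singly linked list stands in for whatever list type the driver really uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-device watch entry. */
struct watch {
    int wd;              /* watch descriptor handed back to user-space */
    struct watch *next;  /* next watch on the same device */
};

/*
 * Unlink and return the watch whose descriptor is wd, or NULL if the
 * device has no such watch. The worst case walks the whole list, so the
 * cost is O(n) with n capped at 8192 watches per device.
 */
static struct watch *remove_watch(struct watch **head, int wd)
{
    struct watch **pp;

    assert(head != NULL);

    for (pp = head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->wd == wd) {
            struct watch *found = *pp;

            *pp = found->next;   /* splice the entry out of the list */
            found->next = NULL;
            return found;
        }
    }
    return NULL;
}
```

Swapping the list for a tree keyed on wd would turn the walk into a logarithmic lookup, which is the fallback the text mentions.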


The calls to inotify from the VFS code have a complexity of O(1), so
inotify does not affect the speed of VFS operations.

--MEMORY USAGE--

The inotify data structures are lightweight:

inotify watch is 40 bytes
inotify device is 68 bytes
inotify event is 272 bytes

So assuming a device has 8192 watches, the watch structures are only
going to consume 320KB of memory (8192 x 40 bytes). With a maximum of
8 devices allowed to exist at a time, this is still only 2.5 MB.

Each device can also have 256 events queued at a time, which sums to
68KB per device (256 x 272 bytes), and only about 0.5 MB if all
devices are opened and have a full event queue.

So approximately 3 MB of memory are used in the rare case of
everything open and full.

Each inotify watch pins the inode of a directory/file in memory.
The size of an inode differs per file system, but let's assume
that it is 512 bytes.

So assuming the maximum number of global watches are active (8 devices
x 8192 watches), this would pin down 32 MB of inodes in the inode
cache. Again not a problem on a modern system.

On smaller systems, the maximum watches / events could be lowered
to provide a smaller footprint.

Keep in mind that this is an absolute worst case memory analysis.
In reality it will most likely cost approximately 5MB.
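
The worst-case figures above can be rechecked with a few lines of C. The constants are the ones quoted in this section; the enum and function names are made up for this sketch, and the 512-byte inode is the same assumption the text makes.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes and limits as quoted in the release notes above. */
enum {
    WATCH_BYTES  = 40,
    DEVICE_BYTES = 68,
    EVENT_BYTES  = 272,
    MAX_WATCHES  = 8192,  /* per device */
    MAX_EVENTS   = 256,   /* queued per device */
    MAX_DEVICES  = 8,
    INODE_BYTES  = 512    /* assumed, as in the text */
};

/* Compile-time checks of the two per-device figures. */
static_assert(MAX_WATCHES * WATCH_BYTES == 320 * 1024,
              "320KB of watches per device");
static_assert(MAX_EVENTS * EVENT_BYTES == 68 * 1024,
              "68KB of queued events per device");

/* Worst-case bytes of inotify's own structures with every device full. */
static size_t inotify_worst_case_bytes(void)
{
    size_t per_device = DEVICE_BYTES
                      + (size_t)MAX_WATCHES * WATCH_BYTES
                      + (size_t)MAX_EVENTS * EVENT_BYTES;

    return MAX_DEVICES * per_device;
}

/* Worst-case bytes of inodes pinned by watches across all devices. */
static size_t pinned_inode_worst_case_bytes(void)
{
    return (size_t)MAX_DEVICES * MAX_WATCHES * INODE_BYTES;
}
```

With these inputs the first function yields 3,179,040 bytes (about 3 MB) and the second 33,554,432 bytes (32 MB), matching the totals given above.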

--HOWTO USE--
Inotify is a character device that, when opened, offers two ioctls.
(It actually has four, but the other two are used for debugging.)

INOTIFY_WATCH:
Which takes a path and event mask and returns a unique
(to the instance of the driver) integer (wd [watch descriptor]
from here on) that is a 1:1 mapping to the path passed.
What happens is inotify gets the inode (and refs the inode)
for the path and adds an inotify_watcher structure to the inode's
list of watchers. If this instance of the driver is already
watching the path, the event mask will be updated and
the original wd will be returned.

INOTIFY_IGNORE:
Which takes an integer (that you got from INOTIFY_WATCH)
representing a wd that you are not interested in watching
anymore. This will:

send an IGNORE event to the device
remove the inotify_watcher structure from the device and
from the inode and unref the inode.
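
The "already watching" rule of INOTIFY_WATCH can be modelled in user-space as below. This is a toy model, not driver code: `toy_watch`, `toy_inode`, `toy_inotify_watch()` and the per-device `next_wd` counter are all hypothetical, and only the update-or-add behaviour described above is reproduced.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVICES 8   /* device limit quoted in the complexity notes */

/* Hypothetical record of one device instance watching one inode. */
struct toy_watch {
    int dev_id;          /* which open instance of the device owns it */
    int wd;              /* watch descriptor returned to that instance */
    unsigned int mask;   /* events the instance asked for */
};

/* Hypothetical inode with its list of watchers, at most one per device. */
struct toy_inode {
    struct toy_watch watchers[MAX_DEVICES];
    int nr_watchers;
};

/*
 * Model of INOTIFY_WATCH for one inode. If dev_id already watches the
 * inode, only the mask is updated and the original wd is returned.
 * Otherwise a new watch is added using the caller's per-device counter.
 * The loop is bounded by MAX_DEVICES, hence the O(1) claim.
 */
static int toy_inotify_watch(struct toy_inode *inode, int dev_id,
                             unsigned int mask, int *next_wd)
{
    int i;

    assert(inode != NULL && next_wd != NULL);

    for (i = 0; i < inode->nr_watchers; i++) {
        if (inode->watchers[i].dev_id == dev_id) {
            inode->watchers[i].mask = mask;
            return inode->watchers[i].wd;
        }
    }
    if (inode->nr_watchers == MAX_DEVICES)
        return -1;       /* every device already watches this inode */

    inode->watchers[i].dev_id = dev_id;
    inode->watchers[i].wd = (*next_wd)++;
    inode->watchers[i].mask = mask;
    inode->nr_watchers++;
    return inode->watchers[i].wd;
}
```

Calling it twice for the same device and inode returns the same wd both times and leaves a single watcher carrying the second mask.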

After you are watching 1 or more paths, you can read from the fd
and get events. The events are struct inotify_event. If you are
watching a directory and something happens to a file in the directory
the event will contain the filename (just the filename not the full
path).

-- EVENTS --
IN_ACCESS - Sent when file is accessed.
IN_MODIFY - Sent when file is modified.
IN_ATTRIB - Sent when file is chmod'ed.
IN_CLOSE - Sent when file is closed.
IN_OPEN - Sent when file is opened.
IN_MOVED_FROM - Sent to the source folder of a move.
IN_MOVED_TO - Sent to the destination folder of a move.
IN_DELETE_SUBDIR - Sent when a sub directory is deleted. (When watching parent)
IN_DELETE_FILE - Sent when a file is deleted. (When watching parent)
IN_CREATE_SUBDIR - Sent when a sub directory is created. (When watching parent)
IN_CREATE_FILE - Sent when a file is created. (When watching parent)
IN_DELETE_SELF - Sent when file is deleted.
IN_UNMOUNT - Sent when the filesystem is being unmounted.
IN_Q_OVERFLOW - Sent when your event queue has overflowed.

The MOVED_FROM/MOVED_TO events are always sent in pairs. MOVED_FROM/MOVED_TO
is also sent when a file is renamed. The cookie field in the event pairs
up MOVED_FROM/MOVED_TO events. These two events are not guaranteed to be
successive in the event stream. You must rely on the cookie to pair
them up. (Note, the cookie is not sent yet.)

If you aren't watching both the source and destination folders in a MOVE,
you will only get MOVED_TO or MOVED_FROM. In this case, MOVED_TO
is equivalent to a CREATE and MOVED_FROM is equivalent to a DELETE.
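
Once the cookie is actually delivered, a consumer could pair the two halves of a move roughly as follows. The mask values, `struct toy_event` and `find_move_partner()` are placeholders invented for this sketch; the real definitions belong to inotify.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; the real masks are defined in inotify.h. */
enum { TOY_MOVED_FROM = 1, TOY_MOVED_TO = 2, TOY_OTHER = 4 };

/* Hypothetical, trimmed-down event record. */
struct toy_event {
    int wd;
    unsigned int mask;
    unsigned int cookie;
};

/*
 * Return the index of the MOVED_TO event that shares a cookie with the
 * MOVED_FROM event at index from, or -1 if ev[from] is not a MOVED_FROM
 * or has no partner in the buffer. The partner need not be adjacent.
 */
static int find_move_partner(const struct toy_event *ev, size_t n,
                             size_t from)
{
    size_t i;

    assert(ev != NULL && from < n);

    if (!(ev[from].mask & TOY_MOVED_FROM))
        return -1;

    for (i = 0; i < n; i++) {
        if (i != from &&
            (ev[i].mask & TOY_MOVED_TO) &&
            ev[i].cookie == ev[from].cookie)
            return (int)i;
    }
    return -1;
}
```

A MOVED_FROM that comes back with -1 is the unpaired case described above and can be handled like a DELETE.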

--KERNEL CHANGES--
inotify char device driver.

Adding calls to inotify_inode_queue_event and
inotify_dentry_parent_queue_event from VFS operations.
Dnotify has the same function calls. The complexity of the VFS
operations is not affected because inotify_*_queue_event is O(1).


Adding a call to inotify_super_block_umount from
generic_shutdown_superblock.

inotify_super_block_umount consists of this:
- finds all of the inodes that are on the super block being shut down,
- sends each watcher on each inode the UNMOUNT and IGNORED events,
- removes the watcher structures from each instance of the device driver
  and each inode,
- unrefs the inode.


Attachments:
inotify-0.11.0.diff (39.33 kB)

2004-09-30 19:06:08

by Robert Love

Subject: [patch] inotify: locking

I finally got around to reviewing the locking in inotify, again, in
response to Andrew's points.

There are still three TODO items:

- Look into replacing i_lock with i_sem.
- dev->lock nests inode->i_lock. Preferably i_lock should
remain an outermost lock. Maybe i_sem can fix this.
- inotify_inode_is_dead() should always have the inode's
i_lock locked when called. I have not yet reviewed the
VFS paths that call it to ensure this is the case.

Anyhow, this patch does the following

- More locking documentation/comments
- Respect lock ranking when locking two different
inodes' i_lock
- Don't grab dentry->d_lock and just use dget_parent(). [1]
- Respect lock ranking with dev->lock vs. inode->i_lock.
- inotify_release_all_watches() needed dev->lock.
- misc. cleanup
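
Lock ranking between two locks of the same class is commonly done by address. A minimal user-space sketch of that idea follows, assuming address order as the rank (the rule the patch actually uses is not spelled out in this message) and using pthread mutexes with a hypothetical `toy_inode` in place of the kernel's inode and i_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode carrying its own lock. */
struct toy_inode {
    pthread_mutex_t lock;
};

/* Order the pair so that *first is the lower-addressed inode. */
static void rank_pair(struct toy_inode **first, struct toy_inode **second)
{
    assert(first != NULL && second != NULL);

    if ((uintptr_t)*first > (uintptr_t)*second) {
        struct toy_inode *tmp = *first;

        *first = *second;
        *second = tmp;
    }
}

/* Take both locks in rank order; a single inode is locked only once. */
static void lock_two(struct toy_inode *a, struct toy_inode *b)
{
    rank_pair(&a, &b);
    pthread_mutex_lock(&a->lock);
    if (a != b)
        pthread_mutex_lock(&b->lock);
}

/* Release in the reverse of the acquisition order. */
static void unlock_two(struct toy_inode *a, struct toy_inode *b)
{
    rank_pair(&a, &b);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
    pthread_mutex_unlock(&a->lock);
}
```

Because every caller takes the lower-ranked lock first, two threads locking the same pair in opposite argument order cannot deadlock against each other.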

Patch is on top of 0.11.0.

Robert Love


2004-09-30 19:24:11

by Robert Love

Subject: Re: [patch] inotify: locking

On Thu, 2004-09-30 at 15:01 -0400, Robert Love wrote:

> I finally got around to reviewing the locking in inotify, again, in
> response to Andrew's points.

As I was saying. I need to walk down the hall and have the Evolution
hackers add the "Robert intended to add a patch but did not, so let's
automatically add it for him" feature.

Here it is, this time for serious.

Robert Love


Attachments:
inotify-0.11.0-rml-locking-redux-1.patch (6.00 kB)

2004-09-30 21:38:24

by Robert Love

Subject: [patch] inotify: ioctl makeover

Attached patch greatly cleans up inotify's ioctl method. Oh my.

Net loss of 36 lines.

Main thing is removing all of the access_ok() checks and _IOC macros.
We don't need them. We can just fall off the end of the switch and
return -EINVAL if the cmd does not match. Any other access control is
handled by the device node itself or the copy_from_user() calls.
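
The shape of an ioctl handler after this kind of cleanup might look like the sketch below. It is compilable user-space C rather than the patch itself: the command numbers and the `toy_*` helpers are hypothetical, and only the fall-off-the-end-and-return -EINVAL pattern is taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical command numbers; the real ones come from inotify.h. */
enum { TOY_WATCH = 1, TOY_IGNORE = 2 };

/* Stand-ins for the per-command helpers, which would copy in arg. */
static int toy_do_watch(void *arg)  { (void)arg; return 0; }
static int toy_do_ignore(void *arg) { (void)arg; return 0; }

/*
 * No up-front access checks or command decoding: known commands
 * dispatch to a helper, and anything else falls off the end of the
 * switch and returns -EINVAL.
 */
static int toy_ioctl(unsigned int cmd, void *arg)
{
    switch (cmd) {
    case TOY_WATCH:
        return toy_do_watch(arg);
    case TOY_IGNORE:
        return toy_do_ignore(arg);
    }
    return -EINVAL;
}
```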

Also, other misc. changes.

Patch is on top of 0.11.0.

Best,

Robert Love


Attachments:
inotify-0.11.0-rml-ioctl-cleanup-1.patch (2.49 kB)

2004-09-30 22:29:48

by Robert Love

Subject: [patch] inotify: make user visible types portable

Because we have kernels with different types than their own
userspace--specifically 64-bit kernels with a 32-bit userspace--we
should limit all communication between the kernel and userspace to
fixed-size types.

This patch does that for the two user visible structures, inotify_event
and inotify_watcher.
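
The portability problem can be seen by comparing a structure built from native types with one built from fixed-width types. Both structures below are illustrative only and are not the real inotify_event layout; `uint32_t` from <stdint.h> stands in for the kernel's `__u32` so that the sketch compiles in plain user-space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Native types: 'unsigned long' is 4 bytes on a 32-bit ABI and 8 bytes
 * on LP64, so a 64-bit kernel and a 32-bit process would disagree about
 * the size and the field offsets of this structure.
 */
struct toy_event_native {
    int wd;
    unsigned long mask;
    int cookie;
};

/* Fixed-width types: the layout is the same on both sides. */
struct toy_event_fixed {
    int32_t  wd;
    uint32_t mask;
    uint32_t cookie;
};

static_assert(sizeof(struct toy_event_fixed) == 12,
              "three 32-bit fields, no padding");
static_assert(offsetof(struct toy_event_fixed, cookie) == 8,
              "cookie sits at the same offset on every ABI");
```

No such assertion can be written for the native version, because its size depends on which ABI the compiler targets.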

As far as 32-bit or LP64 systems are concerned, the only type changes
are 'mask' and 'cookie', which are now unsigned. No one is using
'cookie' yet, so the only ABI breaker is 'mask' (speaking of which, we
had 'mask' as an 'unsigned long' inside inotify.c, so this change was
needed anyhow).

Unfortunately the stupid fixed sized types are ugly as sin. I mean,
"__u32" just does not have the same ring to it as "unsigned long".
C'est la vie.

Patch is on top of 0.11.0 and my previous bountiful delights.

Best,

Robert Love


Attachments:
inotify-0.11.0-rml-user-portability-1.patch (4.28 kB)

2004-09-30 22:36:41

by Robert Love

Subject: Re: [patch] inotify: make user visible types portable

On Thu, 2004-09-30 at 18:25 -0400, Robert Love wrote:

> (speaking of which, we had 'mask' as an 'unsigned long' inside inotify.c,
> so this change was needed anyhow).

Ugh. We _also_ had mask sprinkled about as an int.

This patch makes those __u32 types, too.

Robert Love


Attachments:
inotify-more-mask-type-changes-1.patch (819.00 B)

2004-09-30 22:44:47

by Robert Love

Subject: [patch] inotify: rename inotify_watcher

s/inotify_watcher/inotify_watch/ per the TODO

I agree: The structures are objects we are watching, not the watchers
themselves.

Robert Love


Attachments:
inotify-rename-inotify_watcher.patch (8.85 kB)

2004-09-30 22:45:40

by Robert Love

Subject: Re: [patch] inotify: rename inotify_watcher

On Thu, 2004-09-30 at 18:43 -0400, Robert Love wrote:

> s/inotify_watcher/inotify_watch/ per the TODO

Oh, BTW, there are many, many instances of "watcher" that probably ought
to be "watch" (same goes for "watchers" and "watches") but the patch was
huge. I settled for at least getting the name of the structure right.

Robert Love


2004-09-30 22:55:23

by Robert Love

Subject: [patch] inotify: rename slab-related stuff

Hey, John.

Following patch renames some slab-related stuff.

First rename the "kevent_cache" variable to "event_cachep". The name
"kevent" sounds too close to the kernel event layer, which is going in.
And the 'p' suffix is the standard for slab cache variables. No idea
why.

Second rename the "watcher_cache" variable to "watch_cachep" as the
thing is now a watch object, not a watcher. Also, same thing with the
'p'.

We do not have to worry about namespace, since the variables are local
to the file.

Finally, give the slab caches more descriptive user-visible names:
"inotify_watch_cache" and "inotify_event_cache".

Patch is on top of 0.11.0 and my past indiscretions.

Robert Love


Attachments:
inotify-rename-slab-stuff-1.patch (2.41 kB)

2004-09-30 22:58:50

by Paul Jackson

Subject: Re: [patch] inotify: make user visible types portable

Robert wrote:
> "__u32" just does not have the same ring to it as "unsigned long".

Why not "u32"?

$ grep '^typedef.* u32;$' include/asm-*/*.h
include/asm-alpha/types.h:typedef unsigned int u32;
include/asm-arm/types.h:typedef unsigned int u32;
include/asm-arm26/types.h:typedef unsigned int u32;
include/asm-cris/types.h:typedef unsigned int u32;
include/asm-h8300/types.h:typedef unsigned int u32;
include/asm-i386/types.h:typedef unsigned int u32;
include/asm-ia64/types.h:typedef __u32 u32;
include/asm-m32r/types.h:typedef unsigned int u32;
include/asm-m68k/types.h:typedef unsigned int u32;
include/asm-mips/types.h:typedef unsigned int u32;
include/asm-parisc/types.h:typedef unsigned int u32;
include/asm-ppc/types.h:typedef unsigned int u32;
include/asm-ppc64/types.h:typedef unsigned int u32;
include/asm-s390/types.h:typedef unsigned int u32;
include/asm-sh/types.h:typedef unsigned int u32;
include/asm-sh64/types.h:typedef unsigned int u32;
include/asm-sparc/types.h:typedef unsigned int u32;
include/asm-sparc64/types.h:typedef unsigned int u32;
include/asm-v850/types.h:typedef unsigned int u32;
include/asm-x86_64/types.h:typedef unsigned int u32;

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-01 05:36:08

by Robert Love

Subject: Re: [patch] inotify: make user visible types portable

On Thu, 2004-09-30 at 15:57 -0700, Paul Jackson wrote:

> Why not "u32"?

The rule is to use the __foo variants for externally viewable types.

Indeed, the examples you gave are wrapped in __KERNEL__.

Best,

Robert Love


2004-10-01 06:45:49

by Paul Jackson

Subject: Re: [patch] inotify: make user visible types portable

Robert wrote:
> The rule is to use the __foo variants for externally viewable types.
> Indeed, the examples you gave are wrapped in __KERNEL__.

I've no doubt you're right here. But I'm a little confused.

Are you saying to use __u32 so user code can compile with these kernel
headers and see your new inotify symbols w/o polluting their name space
with the non-underscored typedef symbols?

I thought such use of kernel headers in compiling user code was
deprecated. I'd have figured this meant while we might not go out of
way to break someone already doing it, we wouldn't make any effort, or
tolerate any ugly as sin __foo names, in order to add to the list of
symbols so accessible.

If you have a few minutes more patience, perhaps you could explain
where my understanding departed from reality.

Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-01 07:39:57

by Robert Love

Subject: Re: [patch] inotify: make user visible types portable

On Thu, 2004-09-30 at 23:44 -0700, Paul Jackson wrote:

> I've no doubt you're right here. But I'm a little confused.
>
> Are you saying to use __u32 so user code can compile with these kernel
> headers and see your new inotify symbols w/o polluting their name space
> with the non-underscored typedef symbols?

I am saying I have to use __u32, because they are user visible and u32
is not. Also, the rule is to use __u32.

> I thought such use of kernel headers in compiling user code was
> deprecated. I'd have figured this meant while we might not go out of
> way to break someone already doing it, we wouldn't make any effort, or
> tolerate any ugly as sin __foo names, in order to add to the list of
> symbols so accessible.
>
> If you have a few minutes more patience, perhaps you could explain
> where my understanding departed from reality.

How else is user-space to know about this structure?

It has always been a no-no for user-space to access __KERNEL__ wrapped
parts of headers, but sharing a header (or at least generating
user-space's version of the header from the kernel header) is the only
way to ensure that both kernel and user-space speak the same language.

And not just structures, but flags, ioctl commands, ...

Robert Love


2004-10-01 15:43:06

by Paul Jackson

Subject: Re: [patch] inotify: make user visible types portable

Robert wrote:
>
> but sharing a header (or at least generating
> user-space's version of the header from the kernel header) is the only
> way to ensure that both kernel and user-space speak the same language.

Ok - your understanding is clearly stated. So be it.

For now, I will remain in the alternative school that says the "other"
way to keep the kernel and user interfaces aligned is to have two
separate header files, one tuned for each space, using the human brain
to keep them aligned, and keeping things simple enough that the brain
can do so reliably. I find that optimizing the human readability of
this code is more valuable than automatable header sharing across the
kernel-user boundary. In some cases, such as RPC or CORBA, automatic
header sharing is damn near essential, but not here.

I have no delusions of having sufficient standing in the community, or
confidence of my position, to cause you to change your understanding.

Good luck. Thanks for replying.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-01 15:50:44

by Robert Love

Subject: Re: [patch] inotify: make user visible types portable

On Fri, 2004-10-01 at 08:40 -0700, Paul Jackson wrote:

> For now, I will remain in the alternative school that says the "other"
> way to keep the kernel and user interfaces aligned is to have two
> separate header files, one tuned for each space, using the human brain
> to keep them aligned, and keeping things simple enough that the brain
> can do so reliably. I find that optimizing the human readability of
> this code is more valuable than automatable header sharing across the
> kernel-user boundary. In some cases, such as RPC or CORBA, automatic
> header sharing is damn near essential, but not here.

I'm not disagreeing with this, at all.

Most distributions ship kernel headers that have somehow been sanitized.

The canonical structure is still the thing located in inotify.h, though,
whether or not it is 'kept aligned by the human brain' or used
wholesale.

The structure needs to be used exactly the same between the kernel and
the user. We both agree to that, right? It is user visible.

Robert Love


2004-10-01 16:17:54

by Paul Jackson

Subject: Re: [patch] inotify: make user visible types portable

Robert wrote:
> The structure needs to be used exactly the same between the kernel and
> the user. We both agree to that, right? It is user visible.

Certainly the ABI, yes. These stubborn beasts called computers that we
labour over just won't work otherwise.

I'd have no objections to the user header spelling "__u32" where the
kernel header spelled "u32".

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-01 16:32:16

by Chris Friesen

Subject: Re: [patch] inotify: make user visible types portable

Paul Jackson wrote:
> Robert wrote:
>
>>The structure needs to be used exactly the same between the kernel and
>>the user. We both agree to that, right? It is user visible.
>
>
> Certainly the ABI, yes. These stubborn beasts called computers that we
> labour over just won't work otherwise.
>
> I'd have no objections to the user header spelling "__u32" where the
> kernel header spelled "u32".

I believe there is a long-term goal to separate out the userspace-visible part
of the kernel headers into a separate header area, and include them into the kernel.

Even without that, the headers are periodically extracted and cleaned up. Why
make that job harder than it needs to be?

Chris

2004-10-01 17:48:45

by Robert Love

Subject: [patch] inotify: misc changes

Hey hey, John!

Following patch is misc. changes between my tree and the last posted
inotify patch.

Most of the changes are cosmetic, coding style and such cleanup.

One non-cosmetic change is an optimization in inotify_dev_queue_event().
I cache the results of what was inotify_dev_get_event() and
list_to_inotify_kernel_event(), which cleans up the code a lot and saves
like seven dereferences.

Patch is on top of 0.11.0 and the previous chicanery I posted.

Robert Love


Attachments:
inotify-rml-misc-cleanup-1.patch (6.26 kB)

2004-10-01 18:05:09

by Paul Jackson

Subject: Re: [patch] inotify: make user visible types portable

Chris wrote:
> Why make that job harder than it needs to be?

Well ... I think my motivations were clear enough ... trading off this
against optimizing readability of kernel source code.

I should probably quit responding on this thread ... I've nothing more
worth saying.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-02 09:28:47

by David Woodhouse

Subject: Re: [patch] inotify: make user visible types portable

On Thu, 2004-09-30 at 18:30 -0400, Robert Love wrote:
> On Thu, 2004-09-30 at 18:25 -0400, Robert Love wrote:
>
> > (speaking of which, we had 'mask' as an 'unsigned long' inside inotify.c,
> > so this change was needed anyhow).
>
> Ugh. We _also_ had mask sprinkled about as an int.
>
> This patch makes those __u32 types, too.

Don't wait for the cleanup of kernel headers to be done by someone else.
Stop polluting them more. Take the user-visible structures and put them
into a separate header file, possibly in a separate directory. Then
include that from your kernel header. Then there's _already_ a
'sanitised' header file for userspace. See the contents of include/mtd/
for an example, although I think there may be one or two things in there
I still need to clean up.

I probably still need to change some __u32 to uint32_t for portability,
for example. You should do that too.

--
dwmw2