2004-09-15 16:00:24

by John McCutchan

Subject: [RFC][PATCH] inotify 0.9

Hello,

I am releasing a new version of inotify. Attached is a patch for
2.6.8.1.

I am interested in getting inotify included in the mm tree.

Inotify is designed as a replacement for dnotify. The key differences
are that inotify does not require the file to be opened to watch it,
that something you are watching can go away (if the path is unmounted)
and you will be sent an event telling you it is gone, and that events
are delivered over a fd rather than by signals.

New in this version:
Driver now supports reading more than one event at a time
Bump maximum number of watches per device from 64 to 8192
Bump maximum number of queued events per device from 64 to 256

--COMPLEXITY--

I have been asked what the complexity of inotify is. Inotify has
two code paths where complexity could be an issue:

Adding a watcher to a device
This code has to check whether the inode is already being watched
by this device. An inode holds at most one watcher per device, and
the maximum number of devices is limited to 8, so the check is O(1).


Removing a watch from a device
This code has to search all watches on the device to find the
watch descriptor it is being asked to remove. This is a linear
search, but it should not really be an issue because it is
limited to 8192 entries. If this does turn into a concern, I
would replace the list of watches on the device with a sorted
binary tree, so that the search could be done very quickly.


The calls into inotify from the VFS code have O(1) complexity, so
inotify does not affect the speed of VFS operations.

--MEMORY USAGE--

The inotify data structures are light weight:

inotify watch is 40 bytes
inotify device is 68 bytes
inotify event is 272 bytes

So assuming a device has 8192 watches, the structures are only going
to consume 320 KB of memory. With a maximum of 8 devices allowed
to exist at a time, this is still only 2.5 MB.

Each device can also have 256 events queued at a time, which sums to
68 KB per device, and only 0.5 MB if all devices are open and have
a full event queue.

So approximately 3 MB of memory are used in the rare case of
everything open and full.

Each inotify watch pins the inode of a directory/file in memory.
The size of an inode differs per file system, but let's assume
that it is 512 bytes.

So assuming the maximum number of global watches is active, this would
pin down 32 MB of inodes in the inode cache. Again, not a problem
on a modern system.

On smaller systems, the maximum watches/events could be lowered
to provide a smaller footprint.

Older release notes:
I am resubmitting inotify for comments and review. Inotify has
changed drastically from the earlier proposal that Al Viro did not
approve of. There is no longer any use of (device number, inode number)
pairs. Please give this version of inotify a fresh view.


Inotify is a character device that, when opened, offers 2 ioctls.
(It actually has 4, but the other 2 are used for debugging.)

INOTIFY_WATCH:
Which takes a path and an event mask and returns a unique
(to the instance of the driver) integer (wd [watcher descriptor]
from here on) that is a 1:1 mapping to the path passed.
What happens is inotify gets the inode (and refs the inode)
for the path and adds an inotify_watcher structure to the inode's
list of watchers. If this instance of the driver is already
watching the path, the event mask will be updated and
the original wd will be returned.

INOTIFY_IGNORE:
Which takes an integer (the one you got from INOTIFY_WATCH)
representing a wd that you are no longer interested in watching.
This will:

send an IGNORE event to the device
remove the inotify_watcher structure from the device and
from the inode, and unref the inode.


After you are watching one or more paths, you can read from the fd
and get events. The events are struct inotify_event. If you are
watching a directory and something happens to a file in the directory,
the event will contain the filename (just the filename, not the full
path).

Aside from the inotify character device driver, the changes to the
kernel are very minor.

The first change is adding calls to inotify_inode_queue_event and
inotify_dentry_parent_queue_event from the various vfs functions. This
is identical to dnotify.

The second change is more serious: it adds a call to
inotify_super_block_umount inside generic_shutdown_superblock.
What inotify_super_block_umount does is:

find all of the inodes that are on the super block being shut down,
send each watcher on each inode the UNMOUNT and IGNORED events,
remove the watcher structures from each instance of the device driver
and each inode,
and unref the inodes.

I have tested this code on my system for over three weeks now and have
not had problems. I would appreciate design review, code review and
testing.

John


Attachments:
inotify-0.9.patch (35.71 kB)

2004-09-15 18:03:41

by Robert Love

Subject: Re: [RFC][PATCH] inotify 0.9

On Wed, 2004-09-15 at 11:52 -0400, John McCutchan wrote:

> I am interested in getting inotify included in the mm tree.
>
> Inotify is designed as a replacement for dnotify. The key differences
> are that inotify does not require the file to be opened to watch it,
> when you are watching something with inotify it can go away (if path
> is unmounted) and you will be sent an event telling you it is gone and
> events are delivered over a fd not by using signals.

I want to expand on why dnotify is awful and why inotify is a great
replacement, because dnotify's limitations are really showing up on
modern desktop systems.

Some technical issues with dnotify and why inotify solves the problem:

- dnotify requires one fd per watched directory. this results
in a lot of file descriptors if you are trying to do anything
creative. inotify solves this by only having one open file
descriptor.

- with dnotify, you open the fd on the directory to watch, which
pins the directory. this makes unmounting the backing
filesystem impossible and means using dnotify on removable
devices is nontrivial. This is a problem with desktop systems.
Not only does inotify solve this problem (by not requiring an
open of each watched directory), but it even sends an "unmount"
event when the watched directory is unmounted.

- Using dnotify is, uh, interesting. I mean, fcntl(2) and
SIGIO? You end up needing to use real-time signals. Gross
gross gross. This does not work well with modern event-
driven applications that use mainloops. You end up needing a
complicated daemon like FAM. We don't want FAM, and in fact we
should not even need a daemon (although we might want one).
Conversely, inotify is trivial to use, integrates well, and is
select()-able.

I have been going over the code for awhile now, and it looks good. I
would really like to hear Al's opinion so we can move on fixing any
possible issues that he has.

Best,

Robert Love


2004-09-16 15:22:01

by Bill Davidsen

Subject: Re: [RFC][PATCH] inotify 0.9

John McCutchan wrote:
> Hello,
>
> I am releasing a new version of inotify. Attached is a patch for
> 2.6.8.1.
>
> I am interested in getting inotify included in the mm tree.
>
> Inotify is designed as a replacement for dnotify. The key differences
> are that inotify does not require the file to be opened to watch it,
> when you are watching something with inotify it can go away (if path
> is unmounted) and you will be sent an event telling you it is gone and
> events are delivered over a fd not by using signals.
>
> New in this version:
> Driver now supports reading more than one event at a time
> Bump maximum number of watches per device from 64 to 8192
> Bump maximum number of queued events per device from 64 to 256
>
> --COMPLEXITY--
>
> I have been asked what the complexity of inotify is. Inotify has
> two code paths where complexity could be an issue:
>
> Adding a watcher to a device
> This code has to check if the inode is already being watched
> by the device, this is O(1) since the maximum number of
> devices is limited to 8.
>
>
> Removing a watch from a device
> This code has to do a search of all watches on the device to
> find the watch descriptor that is being asked to remove.
> This involves a linear search, but should not really be an issue
> because it is limited to 8192 entries. If this does turn into
> a concern, I would replace the list of watches on the device
> with a sorted binary tree, so that the search could be done
> very quickly.
>
>
> The calls to inotify from the VFS code has a complexity of O(1) so
> inotify does not affect the speed of VFS operations.
>
> --MEMORY USAGE--
>
> The inotify data structures are light weight:
>
> inotify watch is 40 bytes
> inotify device is 68 bytes
> inotify event is 272 bytes
>
> So assuming a device has 8192 watches, the structures are only going
> to consume 320KB of memory. With a maximum number of 8 devices allowed
> to exist at a time, this is still only 2.5 MB
>
> Each device can also have 256 events queued at a time, which sums to
> 68KB per device. And only .5 MB if all devices are opened and have
> a full event queue.
>
> So approximately 3 MB of memory are used in the rare case of
> everything open and full.
>
> Each inotify watch pins the inode of a directory/file in memory,
> the size of an inode is different per file system but lets assume
> that it is 512 bytes.
>
> So assuming the maximum number of global watches are active, this would
> pin down 32 MB of inodes in the inode cache. Again not a problem
> on a modern system.

Did you work for Microsoft? Bloat doesn't count? And is this going to be
low memory you pin? And is every file create or delete (or update of
atime) going to blast this mess through cache looking for people to notify?
>
> On smaller systems, the maximum watches / events could be lowered
> to provide a smaller footprint.

Let's rethink this and say the max is zero by default, and by use of proc
or sys or whatever's in vogue today you can enable the feature by setting
a non-zero value.
>
> Older release notes:
> I am resubmitting inotify for comments and review. Inotify has
> changed drastically from the earlier proposal that Al Viro did not
> approve of. There is no longer any use of (device number, inode number)
> pairs. Please give this version of inotify a fresh view.

We are hacking all over the kernel to save 4k in stack size and you want
to pin up to 32MB?
>
>
> Inotify is a character device that when opened offers 2 IOCTL's.
> (It actually has 4 but the other 2 are used for debugging)
>
> INOTIFY_WATCH:
> Which takes a path and event mask and returns a unique
> (to the instance of the driver) integer (wd [watcher descriptor]
> from here on) that is a 1:1 mapping to the path passed.
> What happens is inotify gets the inode (and ref's the inode)
> for the path and adds a inotify_watcher structure to the inodes
> list of watchers. If this instance of the driver is already
> watching the path, the event mask will be updated and
> the original wd will be returned.
>
> INOTIFY_IGNORE:
> Which takes an integer (that you got from INOTIFY_WATCH)
> representing a wd that you are not interested in watching
> anymore. This will:
>
> send an IGNORE event to the device
> remove the inotify_watcher structure from the device and
> from the inode and unref the inode.
>
>
> After you are watching 1 or more paths, you can read from the fd
> and get events. The events are struct inotify_event. If you are
> watching a directory and something happens to a file in the directory
> the event will contain the filename (just the filename not the full
> path).
>
> Aside from the inotify character device driver.
> The changes to the kernel are very minor.
>
> The first change is adding calls to inotify_inode_queue_event and
> inotify_dentry_parent_queue_event from the various vfs functions. This
> is identical to dnotify.
>
> The second change is more serious, it adds a call to
> inotify_super_block_umount
> inside generic_shutdown_superblock. What inotify_super_block_umount does
> is:
>
> find all of the inodes that are on the super block being shut down,
> sends each watcher on each inode the UNMOUNT and IGNORED event
> removes the watcher structures from each instance of the device driver
> and each inode.
> unref's the inode.
>
> I have tested this code on my system for over three weeks now and have
> not had problems. I would appreciate design review, code review and
> testing.
>
> John

If I were doing this, and I admit I may not understand all of the
features, I would have a bitmap per filesystem of inodes being watched,
and anything which did an action which might require notify would check
the bit. If the bit were set, the filesystem and inode info would be
passed to user space, which could do anything it wanted. Use of
netlink is an example of one way to do this.

Then the user program could do whatever it wanted in nice pageable
space, allow as many watchers as it wished, and be flexible to anything
a site wanted, scalable, could use semaphores, fifos, network
monitoring, message queues... in other words low impact, scalable, and
flexible.

Feel free to tell me there is some urgent need for this feature to be
present and fast, I learn new things every day.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-09-16 16:31:17

by Chris Friesen

Subject: Re: [RFC][PATCH] inotify 0.9

Bill Davidsen wrote:

> If I were doing this, and I admit I may not understand all of the
> features, I would have a bitmap per filesystem of inodes being watched,
> and anything which did an action which might require notify would check
> the bit. If the bit were set the filesystem and inode info would be
> passed to user space which could do anything it wanted.

How do you identify the filesystem? Whose mount namespace do you use if you
have multiple processes in different namespaces watching what is really the same
file?

Chris

2004-09-16 16:41:19

by Robert Love

Subject: Re: [RFC][PATCH] inotify 0.9

On Thu, 2004-09-16 at 11:07 -0400, Bill Davidsen wrote:

> Did you work for Microsoft? Bloat doesn't count? And is this going to be
> low memory you pin? And is every file create or delete (or update of
> atime) going to blast this mess through cache looking for people to notify?

No. I suggest looking at the source.

We are pinning the very inodes we are using. So,

(a) There are no cache effects because the inodes are already
in use. So when you go to, say, write to a file, the kernel
already has the inode handy, and we just check in O(1) to
see if the inode has a watcher on it. We never walk a list
of inodes (why would you ever do that? how would you do
that?).
(b) Many of the pinned inodes are already in memory, cached,
since the overlap between used inodes and watched inodes
is high. Right now, on a system without inotify, I have
60MB of inodes in memory.
(c) The inodes are pinned to prevent races. Or, don't even
look at it like this. Just look at it as elevating the
ref count on the data structure while we are using it.

But here is the kicker: I don't think this pinning behavior is any
different than dnotify. So this is a total utter nonissue.

> > Older release notes:
> > I am resubmitting inotify for comments and review. Inotify has
> > changed drastically from the earlier proposal that Al Viro did not
> > approve of. There is no longer any use of (device number, inode number)
> > pairs. Please give this version of inotify a fresh view.
>
> We are hacking all over the kernel to save 4k in stack size and you want
> to pin up to 32MB?

The 4K is 4K per process, and it is done not to save 4K once (or even
4K * number of processes) but because first-order allocations (8KB on x86)
become nontrivial as memory becomes fragmented.

I bet on most modern systems there is already much more than 32MB of
inodes in memory, and you have to explicitly add watches anyhow.

> If I were doing this, and I admit I may not understand all of the
> features, I would have a bitmap per filesystem of inodes being watched,
> and anything which did an action which might require notify would check
> the bit. If the bit were set the filesystem and inode info would be
> passed to user space which could do anything it wanted. Use of the
> netlink is an example of ways to do this.

Race, race, race, if even possible to implement "a bitmap per filesystem
of inodes" in a sane way.

> Then the user program could do whatever it wanted in nice pageable
> space, allow as many watchers as it wished, and be flexible to anything
> a site wanted, scalable, could use semaphores, fifos, network
> monitoring, message queues... in other words low impact, scalable, and
> flexible.

If you assume that you have to pin the inodes while you watch them (and
you do), then inotify really is this minimum abstraction that you talk
of.

> Feel free to tell me there is some urgent need for this feature to be
> present and fast, I learn new things every day.

You act like file notification is something new. Every operating system
provides this feature. Linux currently does, too: dnotify.

But dnotify sucks, and modern systems are hitting its numerous limits.
So, enter inotify.

Fondest regards,

Robert Love


2004-09-16 16:48:46

by Jan Kara

Subject: Re: [RFC][PATCH] inotify 0.9

> John McCutchan wrote:
> >Hello,
> >
> >I am releasing a new version of inotify. Attached is a patch for
> >2.6.8.1.
<snip>

> >--MEMORY USAGE--
> >
> >The inotify data structures are light weight:
> >
> >inotify watch is 40 bytes
> >inotify device is 68 bytes
> >inotify event is 272 bytes
> >
> >So assuming a device has 8192 watches, the structures are only going
> >to consume 320KB of memory. With a maximum number of 8 devices allowed
> >to exist at a time, this is still only 2.5 MB
> >
> >Each device can also have 256 events queued at a time, which sums to
> >68KB per device. And only .5 MB if all devices are opened and have
> >a full event queue.
> >
> >So approximately 3 MB of memory are used in the rare case of
> >everything open and full.
> >
> >Each inotify watch pins the inode of a directory/file in memory,
> >the size of an inode is different per file system but lets assume
> >that it is 512 bytes.
> >
> >So assuming the maximum number of global watches are active, this would
> >pin down 32 MB of inodes in the inode cache. Again not a problem
> >on a modern system.
>
> Did you work for Microsoft? Bloat doesn't count? And is this going to be
> low memory you pin? And is every file create or delete (or update of
> atime) going to blast this mess through cache looking for people to notify?
> >
> >On smaller systems, the maximum watches / events could be lowered
> >to provide a smaller footprint.
>
> Let's rethink this and say the max is by default and by use of proc or
> sys or whatever's in vogue today you can enable the feature by setting a
> non-zero value.
As I understand the patch, it won't have any nontrivial memory
footprint if you don't use inotify. Only when someone wants to
watch an inode is the appropriate structure allocated, the inode
pinned, etc. The numbers above are for the case where you watch
the maximum possible number of inodes.
Maybe you should not be so fast in using your flamethrower;)

Bye
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2004-09-16 22:42:52

by Bill Davidsen

Subject: Re: [RFC][PATCH] inotify 0.9

On Thu, 16 Sep 2004, Jan Kara wrote:

> > John McCutchan wrote:
> > >Hello,
> > >
> > >I am releasing a new version of inotify. Attached is a patch for
> > >2.6.8.1.
> <snip>
>
> > >--MEMORY USAGE--
> > >
> > >The inotify data structures are light weight:
> > >
> > >inotify watch is 40 bytes
> > >inotify device is 68 bytes
> > >inotify event is 272 bytes
> > >
> > >So assuming a device has 8192 watches, the structures are only going
> > >to consume 320KB of memory. With a maximum number of 8 devices allowed
> > >to exist at a time, this is still only 2.5 MB
> > >
> > >Each device can also have 256 events queued at a time, which sums to
> > >68KB per device. And only .5 MB if all devices are opened and have
> > >a full event queue.
> > >
> > >So approximately 3 MB of memory are used in the rare case of
> > >everything open and full.
> > >
> > >Each inotify watch pins the inode of a directory/file in memory,
> > >the size of an inode is different per file system but lets assume
> > >that it is 512 bytes.
> > >
> > >So assuming the maximum number of global watches are active, this would
> > >pin down 32 MB of inodes in the inode cache. Again not a problem
> > >on a modern system.
> >
> > Did you work for Microsoft? Bloat doesn't count? And is this going to be
> > low memory you pin? And is every file create or delete (or update of
> > atime) going to blast this mess through cache looking for people to notify?
> > >
> > >On smaller systems, the maximum watches / events could be lowered
> > >to provide a smaller footprint.
> >
> > Let's rethink this and say the max is by default and by use of proc or
> > sys or whatever's in vogue today you can enable the feature by setting a
> > non-zero value.
> As I understand the patch, it won't have any nontrivial memory
> footprint if you don't use inotify. Only when someone wants to
> watch an inode is the appropriate structure allocated, the inode
> pinned, etc. The numbers above are for the case where you watch
> the maximum possible number of inodes.

The point I was making is that this doesn't scale well, because it eats
resources which may be unavailable on many systems, and which others are
trying to conserve. Since this may limit where it can be used, it
presents a usefulness problem.

> Maybe you should not be so fast in using your flamethrower;)

I didn't intend this as a flame, but I do feel this implementation doesn't
scale. I offered another approach off the top of my head, which appears to
me to be more scalable. I claimed no expertise; I just made a suggestion
based on my first thought on how I would attack the problem.

If we are going to 4k stacks because larger memory blocks are hard to find,
I have to suspect that anything which locks up blocks sized in MB is going
to cause problems. I didn't even ask what would happen on NUMA machines,
because that's not my usual concern.

I'm still horrified by the memory requirements :-(

> --
> Jan Kara <[email protected]>
> SuSE CR Labs
>

Thanks for taking the time to note that my tone may have been harsh even
if my point was valid.


--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2004-09-16 22:55:56

by Bill Davidsen

Subject: Re: [RFC][PATCH] inotify 0.9

On Thu, 16 Sep 2004, Chris Friesen wrote:

> Bill Davidsen wrote:
>
> > If I were doing this, and I admit I may not understand all of the
> > features, I would have a bitmap per filesystem of inodes being watched,
> > and anything which did an action which might require notify would check
> > the bit. If the bit were set the filesystem and inode info would be
> > passed to user space which could do anything it wanted.
>
> How do you identify the filesystem? Whose mount namespace do you use if you
> have multiple processes in different namespaces watching what is really the same
> file?

You're asking for implementation details on something I threw out off the
top of my head? My first thought is "not by name", since if this is an
unmount that's not going to work well. Since I'm making this up, let's say
a filesystem number and inode number. Then when the watch is set, the
system just has to have a unique "filesystem number" identifier which is
shared by every watch request against the f/s.

I haven't looked at how the original proposal handles things like the same
f/s mounted multiple times, etc, so I wouldn't venture to improve on it.
If I were actually going to write something like this, I'd want to start
with a description of functional requirements and response time, and go
from there, trying to move as much as possible out of unpageable memory.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2004-09-16 22:59:35

by David Lang

Subject: Re: [RFC][PATCH] inotify 0.9

On Thu, 16 Sep 2004, Bill Davidsen wrote:

> On Thu, 16 Sep 2004, Jan Kara wrote:
>
>>> John McCutchan wrote:
>>>> Hello,
>>>>
>>>> I am releasing a new version of inotify. Attached is a patch for
>>>> 2.6.8.1.
>> <snip>
>>
>>>> --MEMORY USAGE--
>>>>
>>>> The inotify data structures are light weight:
>>>>
>>>> inotify watch is 40 bytes
>>>> inotify device is 68 bytes
>>>> inotify event is 272 bytes
>>>>
>>>> So assuming a device has 8192 watches, the structures are only going
>>>> to consume 320KB of memory. With a maximum number of 8 devices allowed
>>>> to exist at a time, this is still only 2.5 MB
>>>>
>>>> Each device can also have 256 events queued at a time, which sums to
>>>> 68KB per device. And only .5 MB if all devices are opened and have
>>>> a full event queue.
>>>>
>>>> So approximately 3 MB of memory are used in the rare case of
>>>> everything open and full.
>>>>
>>>> Each inotify watch pins the inode of a directory/file in memory,
>>>> the size of an inode is different per file system but lets assume
>>>> that it is 512 bytes.
>>>>
>>>> So assuming the maximum number of global watches are active, this would
>>>> pin down 32 MB of inodes in the inode cache. Again not a problem
>>>> on a modern system.
>>>
>>> Did you work for Microsoft? Bloat doesn't count? And is this going to be
>>> low memory you pin? And is every file create or delete (or update of
>>> atime) going to blast this mess through cache looking for people to notify?
>>>>
>>>> On smaller systems, the maximum watches / events could be lowered
>>>> to provide a smaller footprint.
>>>
>>> Let's rethink this and say the max is by default and by use of proc or
>>> sys or whatever's in vogue today you can enable the feature by setting a
>>> non-zero value.
>> As I understand the patch, it won't have any nontrivial memory
>> footprint if you don't use inotify. Only when someone wants to
>> watch an inode is the appropriate structure allocated, the inode
>> pinned, etc. The numbers above are for the case where you watch
>> the maximum possible number of inodes.
>
> The point I was making is that this doesn't scale well, because it eats
> resources which may be unavailable on many systems, and which others are
> trying to conserve. Since this may limit the use it presents a problem
> with usefulness.
>
>> Maybe you should not be so fast in using your flamethrower;)
>
> I didn't intend this as a flame, but I do feel this implementation doesn't
> scale. I offered another approach off the top of my head, which appears to
> me to be more scalable. I claimed no expertise, I just made a suggestion,
> based on my first thought on how I would attack the problem in a way which
> appears more scalable.

IIRC you suggested a bitmap of all the inodes on a filesystem.

on my desktop this is what I see for inodes
dlang@dlang:~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda3 1048576 168266 880310 17% /
/dev/sda5 2097152 158128 1939024 8% /home
/dev/sda1 524288 41797 482491 8% /mnt

so at 8 bits per byte you are talking about ~500K just to store the info
about which inodes someone is interested in watching (and note that this
is only a 9GB drive; think about what happens on multi-TB systems). Then
you have to have another structure to track the events and which inode
each event goes to (and what programs are interested in watching which
inodes)

I don't think that a bitmap of all possible inodes is going to be the
right thing either.

Now it's very possible that you meant something else, but it's not
clear what, so please try again and restate your idea.

> If we are going to 4k stack because larger memory blocks are hard to find,
> I have to suspect that anything which locks up blocks size in MB is going
> to cause problems. I didn't even ask what would happen on NUMA machines,
> because that's not my usual concern.

actually the memory for this doesn't need to be a contiguous block, so it
doesn't run into this problem

> I'm still horrified by the memory requirements :-(

David Lang

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2004-09-16 23:32:53

by Robert Love

Subject: Re: [RFC][PATCH] inotify 0.9

On Thu, 2004-09-16 at 18:34 -0400, Bill Davidsen wrote:

> > Maybe you should not be so fast in using your flamethrower;)
>
> I didn't intend this as a flame, but I do feel this implementation doesn't
> scale. I offered another approach off the top of my head, which appears to
> me to be more scalable. I claimed no expertise, I just made a suggestion,
> based on my first thought on how I would attack the problem in a way which
> appears more scalable.

The thing you are missing is that you absolutely have to pin something
or you have multiple VFS races. Your bitmap suggestion, while cute,
really shows a lack of understanding of the problem space.

dnotify had to do it, inotify has to do it.

Do you want to go down the lets-find-a-race path with Al Viro? ;-)

> If we are going to 4k stack because larger memory blocks are hard to find,
> I have to suspect that anything which locks up blocks size in MB is going
> to cause problems. I didn't even ask what would happen on NUMA machines,
> because that's not my usual concern.

It is not the total size that is the concern, but the per-allocation
size, which has to be contiguous. A first order allocation is hard to
do. You can only find two contiguous free pages in physical memory so
often.

Inodes come from the slabcache. NONE of this is an issue there.

Plus, as I have said, the slabcache is probably caching much of what you
are pinning. So memory consumption is not changed. Finally, these
numbers are WORST case. Watch only a handful of files and you have a
handful of hundreds of bytes pinned.

Robert Love


2004-09-17 00:38:34

by Alan

Subject: Re: [RFC][PATCH] inotify 0.9

On Gwe, 2004-09-17 at 00:22, Robert Love wrote:
> The thing you are missing is that you absolutely have to pin something
> or you have multiple VFS races. Your bitmap suggestion, while cute,
> really shows a lack of understanding of the problem space.

How many of the races matter? There seem to be several different
problems here, and mixing them up might be a mistake.

1. I absolutely need to get the right file at the right moment; please
pass me a descriptor to the file as the user closes it so I always get
it right (indexer, virus checker)

2. If something happens, bug me and I'll have a look (e.g. file manager)

Also it varies between "this file" and "everything in this subtree".
An indexer for example really wants to know "this file, this path" for
entire subtrees and to index the right object (if the path changes that's
less of an issue).

Alan

2004-09-17 02:29:39

by Robert Love

Subject: Re: [RFC][PATCH] inotify 0.9

On Fri, 2004-09-17 at 00:35 +0100, Alan Cox wrote:

> How many of the races matter. There seem to be several different
> problems here and mixing them up might be a mistake.
>
> 1. I absolutely need to get the right file at the right moment, please
> pass me a descriptor to the file as the user closes it so I always get
> it right (indexer, virus checker)
>
> 2. If something happens bug me and I'll have a look (eg file manager)

I think we want a solution that works well for both cases.

E.g., we have a few different needs:

- Stuff like Spotlight-esque automatic Indexers.
- File manager notifications
- Other GUI notifications (desktop, menus, etc.)
- To prevent polling (e.g. /proc/mtab)
- Existing dnotify users

dnotify is pretty lame for any of the above situations. Even for
something as trivial as watching the current open directory in Nautilus,
look at the hoops we have to jump through with FAM.

And dnotify utterly falls apart on removable media or for any "large"
sort of job, e.g. indexing.

Robert Love


2004-09-17 03:08:24

by Nicholas Miell

[permalink] [raw]
Subject: Re: [RFC][PATCH] inotify 0.9

On Thu, 2004-09-16 at 19:29, Robert Love wrote:
> I think we want a solution that works well for both cases.
>
> E.g., we have a few different needs:
>
> - Stuff like Spotlight-esque automatic Indexers.
> - File manager notifications
> - Other GUI notifications (desktop, menus, etc.)
> - To prevent polling (e.g. /proc/mtab)
> - Existing dnotify users
>
> dnotify is pretty lame for any of the above situations. Even for
> something as trivial as watching the current open directory in Nautilus,
> look at the hoops we have to jump through with FAM.
>
> And dnotify utterly falls apart on removable media or for any "large"
> sort of job, e.g. indexing.

Isn't this the problem that XDSM/DMAPI is supposed to solve? Or is that
one of those specs that's too ugly to be implemented?

2004-09-17 15:44:08

by Alan

[permalink] [raw]
Subject: Re: [RFC][PATCH] inotify 0.9

On Gwe, 2004-09-17 at 03:29, Robert Love wrote:
> I think we want a solution that works well for both cases.

Why does it have to be "a" solution and not different things for
different tasks?

> And dnotify utterly falls apart on removable media or for any "large"
> sort of job, e.g. indexing.

Agreed

2004-09-17 15:57:29

by Alan

[permalink] [raw]
Subject: Re: [RFC][PATCH] inotify 0.9

On Gwe, 2004-09-17 at 16:48, Robert Love wrote:
> I've looked into more "indexing" specific solutions and you see both
> races and security issues when you move away from the subscribe-to-
> watch-each-inode model.

For the file change case I'm unconvinced, although it looks like it
could be done with the security module hooks and without kernel mods
beyond that.


2004-09-17 16:02:22

by Robert Love

[permalink] [raw]
Subject: Re: [RFC][PATCH] inotify 0.9

On Fri, 2004-09-17 at 15:51 +0100, Alan Cox wrote:

> For the file change case I'm unconvinced, although it looks like it
> could be done with the security module hooks and without kernel mods
> beyond that.

Everyone keeps telling me this. I am unconvinced, too. ;-)

It should get more attention, though..

Robert Love


2004-09-17 16:42:00

by Robert Love

[permalink] [raw]
Subject: Re: [RFC][PATCH] inotify 0.9

On Fri, 2004-09-17 at 15:39 +0100, Alan Cox wrote:

> Why does it have to be "a" solution and not different things for
> different tasks?

I have hopes that a single solution can happily solve all the cases. At
their core, all of these tasks are essentially the same - file change
notification - and it seems redundant to implement multiple file change
systems in the kernel.

I've looked into more "indexing" specific solutions and you see both
races and security issues when you move away from the subscribe-to-
watch-each-inode model.

That said, I personally don't have any reason for wanting a single
solution, except that because it is cleaner/simpler/smaller/etc it has a
better chance of success. If you have code that speaks differently, then
great!

Robert Love


2004-09-20 20:16:24

by Bill Davidsen

[permalink] [raw]
Subject: Re: [RFC][PATCH] inotify 0.9

Robert Love wrote:
> On Thu, 2004-09-16 at 11:07 -0400, Bill Davidsen wrote:
>
>
>>Did you work for Microsoft? Bloat doesn't count? And is this going to be
>> low memory you pin? And is every file create or delete (or update of
>>atime) going to blast this mess through cache looking for people to notify?
>
>
> No. I suggest looking at the source.
>
> We are pinning the very inodes we are using. So,

Well, I guess I misread the intent, I was assuming an inode could be
watched even if it wasn't (at the time of watch) being used. So while I
may want to know when any inode in a directory is used, I don't
particularly desire to have them all pinned in memory.

If you say that's the only way, then clearly only huge systems will be
able to do that type of monitoring.
>
> (a) There is no cache effects because the inodes are already
> in use. So when you go to, say, write to a file the kernel
> already has the inode handy, and we just check in O(1) to
> see if the inode has a watcher on it. We never walk a list
> of inodes (why would you ever do that? how would you do
> that?).
> (b) Many of the pinned inodes are already in memory, cached,
> since the probability of overlap between used inodes and watched inodes
> is high. Right now, on a system without inotify, I have
> 60MB of inodes in memory.
> (c) The inodes are pinned to prevent races. Or, don't even
> look at it like this. Just look at it as elevating the
> ref count on the data structure while we are using it.

I'm not clear on what race you would get sending a notify to a user mode
process that an inode had changed, but if you say there could be one I
can't argue.
>
> But here is the kicker: I don't think this pinning behavior is any
> different than dnotify. So this is a total utter nonissue.

If you assume you are going to create the same resource demands doing
one thing as another then it becomes a non-issue. I was suggesting that
it would be desirable not to use as many resources.
>
>
>>>Older release notes:
>>>I am resubmitting inotify for comments and review. Inotify has
>>>changed drastically from the earlier proposal that Al Viro did not
>>>approve of. There is no longer any use of (device number, inode number)
>>>pairs. Please give this version of inotify a fresh view.
>>
>>We are hacking all over the kernel to save 4k in stack size and you want
>>to pin up to 32MB?
>
>
> The 4K is 4K per process, and it is done not to save 4K once (or even
> 4K*number of processes) but because first order allocations (8KB on x86)
> become nontrivial as memory becomes fragmented.
>
> I bet on most modern systems there is already much more than 32MB of
> inodes in memory, and you have to explicitly add watches anyhow.

If by modern you mean huge memory servers, you are right. If you mean
modest desktops which might be able to identify problems by watching a
set of inodes, I suspect the inode usage is lower.
>
>
>>If I were doing this, and I admit I may not understand all of the
>>features, I would have a bitmap per filesystem of inodes being watched,
>>and anything which did an action which might require notify would check
>>the bit. If the bit were set the filesystem and inode info would be
>>passed to user space which could do anything it wanted. Use of the
>>netlink is an example of ways to do this.
>
>
> Race, race, race, if even possible to implement "a bitmap per filesystem
> of inodes" in a sane way.
>
>
>>Then the user program could do whatever it wanted in nice pageable
>>space, allow as many watchers as it wished, and be flexible to anything
>>a site wanted, scalable, could use semaphores, fifos, network
>>monitoring, message queues... in other words low impact, scalable, and
>>flexible.
>
>
> If you assume that you have to pin the inodes while you watch them (and
> you do), then inotify really is this minimum abstraction that you talk
> of.

As I said, if you assume pinning the inodes you can't make any
significant reduction in memory use.
>
>
>>Feel free to tell me there is some urgent need for this feature to be
>>present and fast, I learn new things every day.
>
>
> You act like file notification is something new. Every operating system
> provides this feature. Linux currently does, too: dnotify.
>
> But dnotify sucks, and modern systems are hitting its numerous limits.
> So, enter inotify.

I guess all of us running laptops and the like with memory in MB rather
than GB just aren't modern... the limit we hit is mostly memory size.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-09-20 21:06:17

by Robert Love

[permalink] [raw]
Subject: Re: [RFC][PATCH] inotify 0.9

On Mon, 2004-09-20 at 16:16 -0400, Bill Davidsen wrote:

> Well, I guess I misread the intent, I was assuming an inode could be
> watched even if it wasn't (at the time of watch) being used. So while I
> may want to know when any inode in a directory is used, I don't
> particularly desire to have them all pinned in memory.
>
> If you say that's the only way, then clearly only huge system will be
> able to do that type of monitoring.

You can pin just a directory and retrieve all of the events therein.
You do not need to pin every single inode on your machine. This is the
same as dnotify - except inotify also allows you to watch individual
files.

> I'm not clear on what race you would get sending a notify to a user mode
> process that an inode had changed, but if you say there could be one I
> can't argue.

If you cannot track the lifecycle of the object being watched, you
essentially cannot watch it. To track the lifetime of an inode, you
need to ensure that it remains in the icache.

> If by modern you mean huge memory servers, you are right. If you mean
> modest desktops which might be able to identify problems by watching a
> set of inodes, I suspect the inode usage is lower.
>
> I guess all of us running laptops and the like with memory in MB rather
> than GB just aren't modern... the limit we hit is mostly memory size.

John showed that the absolute worst case is ~30MB in your icache. I
have 77MB of ext3 inodes in cache right now on my desktop. Assuming a
decent overlap between watched and cached inodes, there is little
change.

But the 30MB is worst case. Expect something in the single digits.

Look, Bill: Conjecturing about a potential problem in a space you are
unfamiliar with does nothing but obstruct Linux development and act as
Stop Energy. Constructive, well-informed opinions are money.
Everything else is just liking the sound of your voice.

Thanks!

Best,

Robert Love


2004-09-20 23:06:47

by Bill Davidsen

[permalink] [raw]
Subject: Re: [RFC][PATCH] inotify 0.9

On Mon, 20 Sep 2004, Robert Love wrote:

> On Mon, 2004-09-20 at 16:16 -0400, Bill Davidsen wrote:

> You can pin just a directory and retrieve all of the events therein.
> You do not need to pin every single inode on your machine. This is the
> same as dnotify - except inotify also allows you to watch individual
> files.
>
> > I'm not clear on what race you would get sending a notify to a user mode
> > process that an inode had changed, but if you say there could be one I
> > can't argue.
>
> If you cannot track the lifecycle of the object being watched, you
> essentially cannot watch it. To track the lifetime of an inode, you
> need to ensure that it remains in the icache.

What I proposed as a possible implementation was to have anything which
did a trackable operation on the inode send a notify to user space. And
that isn't the same as dnotify although it might address some of the same
uses. For instance, when an open is done the open code sends a notify,
and until that time it's not obvious that the inode must be pinned. By
having a single user program accept the notify and decide what to do, the
kernel can do less of it. Yes, that could mean passing out a lot of
information which would be dropped by the user program. That's what I had
in mind when I asked if the process needed to be real time.


> Look, Bill: Conjecturing about a potential problem in a space you are
> unfamiliar with does nothing but obstruct Linux development and act as
> Stop Energy. Constructive, well-informed opinions are money.
> Everything else is just liking the sound of your voice.

If an idea is so tenuous that one person noting that the memory overhead
of a feature is or could be very high, and asking "could it be done thus",
provides Stop Energy, then there may be a lack of conviction.

I personally don't mind being questioned on an idea, it points out flaws,
it lets me be confident that I have it right, or at least avoid putting a
lot of effort into something and then having someone say "oh here's a
better way." I don't go with "how dare you question me?"

I have virtually no experience with dnotify, but a lot with putting Linux
on old small systems to give to low-income kids who can't buy "modern"
machines, and if I see a feature which won't run on a small machine I want
to suggest that there might be a better way. Sorry, a lower resource cost
way.

Developers don't typically have small slow machines, and don't think about
the old, embedded, or laptop uses unless someone mentions it. I'm sorry
you think I'm talking to hear myself talk, the point I'm making is
valid to me.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2004-09-21 00:02:32

by Robert Love

[permalink] [raw]
Subject: Re: [RFC][PATCH] inotify 0.9

On Mon, 2004-09-20 at 18:59 -0400, Bill Davidsen wrote:

> I'm sorry you think I'm talking to hear myself talk, the
> point I'm making is valid to me.

Judgment suggests I should drop this, but the problem is that you never
made a valid or well-informed point.

You started off with "Did you work for Microsoft?" and you followed up
with questions and critiques demonstrating no understanding whatsoever
of the way that Linux dcache or inode management works, and further that
you did not even read the patch.

My reply "well dnotify has this same issue" is not a rallying behind the
status quo (I mean, I want dnotify dead myself) but that no one
complains about the size issue with dnotify. John and I want to address
the issues with dnotify.

If we can do something about space consumption, and if it turns out to
be an issue, I am all for it. I do not yet see a way around it, and no
one has shown that normal use in the real world suffers from any issue.

Thanks,

Robert Love