2009-07-24 20:15:17

by Eric Paris

Subject: fanotify - overall design before I start sending patches

I plan to start sending patches for fanotify in the next week or two.
I'd like to see more comments on the design, interface, and capabilities
in case there is a recognized need for major rework or if I'm not
meeting some users' needs (other than those noted at the end).

git://git.infradead.org/users/eparis/notify.git fanotify-experimental

should have working code to test what I'm talking about.

What is fanotify?

It is a new notification system that has a limited set of events (open,
close, read, write) in which notification not only comes with metadata
that describes what happened, it also comes with an open file descriptor
to the object in question. fanotify will also allow the listener to
make access decisions on open and read events. This allows the
implementation of hierarchical storage management systems, on-access
file scanning, or integrity checking.

fanotify comes in two flavors: 'directed' and 'global'. 'Directed' is
like inotify or dnotify in that you register specific inodes of interest
and only get events pertaining to those inodes. 'Global' means you are
registering interest in event types system-wide. In global mode the
listener program can later exclude objects from future events.

fanotify kernel/userspace interaction is over a new socket protocol. A
listener opens a new socket in the new PF_FANOTIFY family. The socket
is then bound to an address using the following struct:

struct fanotify_addr {
        sa_family_t family;
        __u32 priority;
        __u32 group_num;
        __u32 mask;
        __u32 f_flags;
        __u32 unused[16];
} __attribute__((packed));

The priority field indicates the order in which fanotify listeners will
get events. Since 2 fanotify listeners would otherwise 'hear' each
other's events on the new fds they create, fanotify listeners will not
hear events generated by other fanotify listeners with a lower priority
number.

The group_num is at the moment not used, but the plan was to allow 2
processes to bind to the same fanotify group and share the load of
processing events.

The f_flags field holds the flags which the fanotify listener wishes to
use when opening its notification fds. On-access scanners would want to
use O_RDONLY, whereas HSM systems would need to use O_WRONLY.

The mask is the indication of the events this group is interested in.
It is the set of events of interest if FAN_GLOBAL_LISTENER is set at
bind time. If FAN_GLOBAL_LISTENER is not set, this field is meaningless
as the registration of events on individual inodes will dictate the
reception of events.

* FAN_ACCESS: every file access.
* FAN_MODIFY: file modifications.
* FAN_CLOSE: files are closed.
* FAN_OPEN: open() calls.
* FAN_ACCESS_PERM: like FAN_ACCESS, except that the process trying to
access the file is put on hold while the fanotify client decides whether
to allow the operation.
* FAN_OPEN_PERM: like FAN_OPEN, but with the permission check.
* FAN_EVENT_ON_CHILD: receive notification of events on inodes inside
this subdirectory. (this is not a full recursive notification of all
descendants, only direct children)
* FAN_GLOBAL_LISTENER: notify for events on all files in the system.
* FAN_SURVIVE_MODIFY: special flag indicating that ignores should
survive inode modification. Discussed below.
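
A minimal sketch of how a global listener might fill in the bind address
described above. The numeric values of PF_FANOTIFY_PLACEHOLDER and of the
FAN_* bits are placeholders chosen for illustration (the post does not
give them), make_global_addr() is a hypothetical helper, and uint32_t
stands in for __u32 so the fragment builds outside the kernel tree.

```c
#include <fcntl.h>        /* O_RDONLY */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>   /* sa_family_t */

/* ASSUMPTION: placeholder values, not taken from the fanotify tree. */
#define PF_FANOTIFY_PLACEHOLDER 0xfa
#define FAN_ACCESS              0x00000001u
#define FAN_MODIFY              0x00000002u
#define FAN_CLOSE               0x00000004u
#define FAN_OPEN                0x00000008u
#define FAN_GLOBAL_LISTENER     0x00000080u

/* Layout as given in the post, with uint32_t in place of __u32. */
struct fanotify_addr {
        sa_family_t family;
        uint32_t priority;
        uint32_t group_num;
        uint32_t mask;
        uint32_t f_flags;
        uint32_t unused[16];
} __attribute__((packed));

/* Hypothetical helper: build the address for a global listener. */
static struct fanotify_addr make_global_addr(uint32_t priority,
                                             uint32_t events,
                                             uint32_t f_flags)
{
        struct fanotify_addr addr;

        memset(&addr, 0, sizeof(addr));   /* group_num and unused[] stay 0 */
        addr.family = PF_FANOTIFY_PLACEHOLDER;
        addr.priority = priority;
        addr.mask = events | FAN_GLOBAL_LISTENER;
        addr.f_flags = f_flags;           /* flags for the per-event fds */
        return addr;
}
```

An on-access scanner would presumably call something like
make_global_addr(10, FAN_OPEN | FAN_CLOSE, O_RDONLY) and pass the result
to bind(); the bind() call itself is left out because the socket family
only exists in the fanotify-experimental tree.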

After the socket is bound, events are obtained using the read() syscall
(recv* probably also works; I haven't tested it). This will result in
the buffer being filled with one or more events like this:

struct fanotify_event_metadata {
        __u32 event_len;
        __s32 fd;
        __u32 mask;
        __u32 f_flags;
        __s32 pid;
        __s32 tgid;
        __u64 cookie;
} __attribute__((packed));

fd specifies the new file descriptor that was created in the context of
the listener (readlink of /proc/self/fd/<fd> will give you A pathname).
mask indicates the event type (bitwise OR of the event types listed
above). f_flags here is the f_flags the ORIGINAL process has the file
open with. pid and tgid are from the original process. cookie is used
when the listener needs to allow, deny, or delay the operation.
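
A sketch of how a listener might walk a buffer returned by read() when
it holds more than one event. It assumes that event_len is the total
size of each record and that records are packed back to back; the post
implies this but does not say so. for_each_event(), event_fn and
or_masks() are hypothetical names, and uint32_t and friends stand in for
the kernel __u32 types.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout as given in the post, with <stdint.h> types. */
struct fanotify_event_metadata {
        uint32_t event_len;
        int32_t  fd;
        uint32_t mask;
        uint32_t f_flags;
        int32_t  pid;
        int32_t  tgid;
        uint64_t cookie;
} __attribute__((packed));

/* Hypothetical callback type invoked once per event. */
typedef void (*event_fn)(const struct fanotify_event_metadata *ev, void *ctx);

/* Example callback: OR every event mask into a uint32_t accumulator. */
static void or_masks(const struct fanotify_event_metadata *ev, void *ctx)
{
        *(uint32_t *)ctx |= ev->mask;
}

/* Walk buf[0..len) and return the number of complete events seen. */
static size_t for_each_event(const void *buf, size_t len,
                             event_fn fn, void *ctx)
{
        const unsigned char *p = buf;
        size_t off = 0, count = 0;

        while (len - off >= sizeof(struct fanotify_event_metadata)) {
                struct fanotify_event_metadata ev;

                memcpy(&ev, p + off, sizeof(ev));  /* avoid unaligned access */
                /* ASSUMPTION: stop on a record that is short or overruns. */
                if (ev.event_len < sizeof(ev) || ev.event_len > len - off)
                        break;
                if (fn)
                        fn(&ev, ctx);
                off += ev.event_len;
                count++;
        }
        return count;
}
```

A real listener would also close ev.fd once it has finished with each
event, since every event hands it a new open file descriptor.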

If a FAN_ACCESS_PERM or FAN_OPEN_PERM event is received, the listener
must send a response before the 5 second timeout. If no response is
sent before the timeout, the original operation is allowed. If this
happens too many times (10 in a row), the fanotify group is evicted
from the kernel and will not get any new events. Sending a response is
done using the setsockopt() call with the socket option set to
FANOTIFY_ACCESS_RESPONSE. The buffer should contain a structure like:

struct fanotify_so_access {
        __u64 cookie;
        __u32 response;
} __attribute__((packed));

Where cookie is the cookie from the notification and response is one of:

FAN_ALLOW: allow the original operation
FAN_DENY: deny the original operation
FAN_RESET_TIMEOUT: reset the timeout.
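
A sketch of how a scanner might build the response structure for a
permission event. The numeric values of FAN_ALLOW, FAN_DENY and
FAN_RESET_TIMEOUT are placeholders (the post does not give them), and
make_response() with its scan_done/scan_clean inputs is a hypothetical
policy, not part of the proposed interface.

```c
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: placeholder values, not taken from the fanotify tree. */
#define FAN_ALLOW          1u
#define FAN_DENY           2u
#define FAN_RESET_TIMEOUT  3u

/* Layout as given in the post, with <stdint.h> types. */
struct fanotify_so_access {
        uint64_t cookie;
        uint32_t response;
} __attribute__((packed));

/*
 * Hypothetical policy: while the scan is still running, ask for more
 * time; once it finishes, allow clean files and deny everything else.
 */
static struct fanotify_so_access make_response(uint64_t cookie,
                                               int scan_done,
                                               int scan_clean)
{
        struct fanotify_so_access r;

        memset(&r, 0, sizeof(r));
        r.cookie = cookie;              /* copied from the event metadata */
        if (!scan_done)
                r.response = FAN_RESET_TIMEOUT;
        else
                r.response = scan_clean ? FAN_ALLOW : FAN_DENY;
        return r;
}
```

The result would then be handed to setsockopt() with the
FANOTIFY_ACCESS_RESPONSE option; the socket level to use is not stated
in the post, so that call is not shown.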

The last main interface is the 'marking' of inodes. The purpose of
inode marks differs between 'directed' and 'global' listeners. Directed
fanotify listeners need to mark inodes of interest. They also do that
using setsockopt(), this time with the FANOTIFY_SET_MARK option and the
buffer containing a structure like:

struct fanotify_so_inode_mark {
        __s32 fd;
        __u32 mask;
        __u32 ignored_mask;
} __attribute__((packed));

Where fd is backed by the inode in question, mask is the set of events
of interest (only used in directed mode), and ignored_mask is the mask
of events which should be ignored.

The ignored_mask is cleared every time an inode receives a modification
event unless FAN_SURVIVE_MODIFY is also set. The ignored_mask is
mainly used for 2 purposes. Global listeners may just have no interest
in lots of events, so they should spam inodes with an ignored mask. The
ignored mask is also used to 'cache' access decisions. If the listener
sets FAN_ACCESS_PERM in the ignored mask, all access operations will be
permitted without the call out to userspace. If the inode is modified,
the ignored_mask will be cleared and userspace will again have to
approve the access. If userspace REALLY doesn't ever care, it can use
the special FAN_SURVIVE_MODIFY flag inside the ignored_mask.
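
The ignored_mask rules above can be modelled in a few lines. This is a
userspace model of the described behaviour, not the kernel code: struct
mark_model, mark_wants_event() and mark_on_modify() are hypothetical
names, the FAN_* values are placeholders, and "mask" here means the
effective interest mask (the inode's mask in directed mode, the group's
mask in global mode).

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: placeholder values, not taken from the fanotify tree. */
#define FAN_MODIFY          0x00000002u
#define FAN_ACCESS_PERM     0x00000010u
#define FAN_SURVIVE_MODIFY  0x00000100u

/* Hypothetical model of one listener's mark on one inode. */
struct mark_model {
        uint32_t mask;          /* events of interest */
        uint32_t ignored_mask;  /* events to suppress */
};

/* Would this event be delivered to the listener? */
static bool mark_wants_event(const struct mark_model *m, uint32_t event)
{
        return (event & m->mask & ~m->ignored_mask) != 0;
}

/* The inode was modified: drop the ignores unless they are sticky. */
static void mark_on_modify(struct mark_model *m)
{
        if (!(m->ignored_mask & FAN_SURVIVE_MODIFY))
                m->ignored_mask = 0;
}
```

With FAN_ACCESS_PERM in ignored_mask, the model suppresses access
permission events until mark_on_modify() runs, which mirrors the
'cached decision' use described above.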

The only other current interface is the ability to ignore events by
superblock magic number. This makes it easy to ignore all events
in /proc, which can be difficult to accomplish by firing
FANOTIFY_SET_MARK with ignored_masks over and over as processes are
created and destroyed.
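
The post does not show the call used to register a magic number, so the
sketch below only covers the userspace side: looking up the superblock
magic of a mounted filesystem with statfs(2). path_has_magic() is a
hypothetical helper; 0x9fa0 is the procfs magic as defined in
<linux/magic.h>.

```c
#include <sys/vfs.h>    /* statfs(), struct statfs */

#ifndef PROC_SUPER_MAGIC
#define PROC_SUPER_MAGIC 0x9fa0   /* same value as in <linux/magic.h> */
#endif

/*
 * Hypothetical helper: returns 1 if the filesystem holding 'path' has
 * the given superblock magic, 0 if it does not, -1 if statfs() fails.
 */
static int path_has_magic(const char *path, unsigned long magic)
{
        struct statfs sfs;

        if (statfs(path, &sfs) != 0)
                return -1;
        return (unsigned long)sfs.f_type == magic;
}
```

A listener could use path_has_magic("/proc", PROC_SUPER_MAGIC) as a
sanity check before asking the kernel to ignore that magic number.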

***********

Future direction:
There are 2 things I'm interested in adding.
- Rename events.
The updatedb/mlocate people are interested in fanotify as a means to
not thrash the harddrive every night. They could instead update the db
in real time as files are moved.

- subtree notification.
Currently to only watch /home and all of its descendants one must
either register a directed watch on every directory or use a global
listener. The global listener with ignored_mask is not as bad as it
sounds in my testing, but decent subtree registration and notification
would be a big win in a lot of people's minds.

***********

Please, complaints? shortcomings? design flaws? issues? failures? How
can it be tweaked to suit your needs?

-Eric


2009-07-24 20:50:31

by David Lang

Subject: Re: fanotify - overall design before I start sending patches

getting an open fd to the file is good for things like content scanning,
but for other things like an HSM re-populating the file, you would need to
pass the path used to open the file at open time. is this in the metadata
you are passing?

this currently does not give you a good way to listen for info on a
specific directory tree, you imply at the end that you could go global,
but without a path (which you will not have in some cases) it's impossible
to decide if you care about this file or not.

to avoid race conditions, you may want some way that a listener on a
directory can flag that it wants to also be a listener for all new
directories created under the one it is listening on.

with the PERM checks, can a checker respond within the 5 second window
with 'I need more time'? or does it need to complete all its work in <5
seconds?

David Lang



2009-07-24 21:01:00

by Andreas Dilger

Subject: Re: fanotify - overall design before I start sending patches

On Jul 24, 2009 16:13 -0400, Eric Paris wrote:
> fanotify kernel/userspace interaction is over a new socket protocol. A
> listener opens a new socket in the new PF_FANOTIFY family. The socket
> is then bound to an address. Using the following struct:

Would it make sense to use existing netlink?

> struct fanotify_addr {
> sa_family_t family;
> __u32 priority;
> __u32 group_num;
> __u32 mask;
> __u32 f_flags;
> __u32 unused[16];
> } __attribute__((packed));
>
> The mask is the indication of the events this group is interested in.
> The set of events of interest if FAN_GLOBAL_LISTENER is set at bind
> time. If FAN_GLOBAL_LISTENER is not set, this field is meaningless as
> the registration of events on individual inodes will dictate the
> reception of events.
>
> * FAN_ACCESS: every file access.
> * FAN_MODIFY: file modifications.
> * FAN_CLOSE: files are closed.
> * FAN_OPEN: open() calls.
> * FAN_ACCESS_PERM: like FAN_ACCESS, except that the process trying to
> access the file is put on hold while the fanotify client decides whether
> to allow the operation.
> * FAN_OPEN_PERM: like FAN_OPEN, but with the permission check.
> * FAN_EVENT_ON_CHILD: receive notification of events on inodes inside
> this subdirectory. (this is not a full recursive notification of all
> descendants, only direct children)
> * FAN_GLOBAL_LISTENER: notify for events on all files in the system.
> * FAN_SURVIVE_MODIFY: special flag that ignores should survive inode
> modification. Discussed below.

It seems like a 32-bit mask might not be enough; it wouldn't be hard
at this stage to add a 64-bit mask. Lustre has a similar mechanism
(changelog) that allows tracking all different kinds of filesystem
events (create/unlink/symlink/link/rename/mkdir/setxattr/etc), instead
of just open/close, and is also used by HSM, enhanced rsync, etc.

> struct fanotify_event_metadata {
> __u32 event_len;
> __s32 fd;
> __u32 mask;
> __u32 f_flags;
> __s32 pid;
> __s32 tgid;
> __u64 cookie;
> } __attribute__((packed));

Getting the attributes that have changed into this message is also
useful, as it avoids a continual stream of "stat" calls on the inodes.

The other thing that is important for HSM is that this log is atomic
and persistent, otherwise there may be files that are missed if the
node crashes. This involves creating atomic update records as part
of the filesystem operation, and then userspace consumes them and
tells the kernel that it is finished with records up to X. Otherwise
you risk inconsistencies between rsync/HSM/updatedb for files that
are updated just before a crash.

> If a FAN_ACCESS_PERM or FAN_OPEN_PERM event is received the listener
> must send a response before the 5 second timeout. If no response is
> sent before the 5 second timeout the original operation is allowed. If
> this happens too many times (10 in a row) the fanotify group is evicted
> from the kernel and will not get any new events.

This should be a tunable, since if the intent is to monitor PERM checks
it would be possible for users to DoS the machine, delay the userspace
programs, and access files they shouldn't be able to.
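
To make the suggestion concrete, here is a userspace model of the
timeout and eviction rule with both limits pulled out as tunables. The
struct names, record_outcome() and response_timed_out() are hypothetical;
the defaults of 5 seconds and 10 consecutive timeouts come from the
proposal, and evicting as soon as the count reaches the limit is an
assumption about how "10 in a row" is counted.

```c
#include <stdbool.h>

/* Hypothetical tunables; defaults taken from the proposal. */
struct perm_policy {
        unsigned timeout_secs;
        unsigned max_consecutive_timeouts;
};

static const struct perm_policy default_policy = { 5, 10 };

/* Hypothetical per-group bookkeeping. */
struct group_state {
        unsigned consecutive_timeouts;
        bool evicted;
};

/* Has the listener taken too long to answer this permission event? */
static bool response_timed_out(const struct perm_policy *p,
                               unsigned elapsed_secs)
{
        return elapsed_secs >= p->timeout_secs;
}

/*
 * Record the outcome of one permission event. Returns true while the
 * group would still receive events, false once it has been evicted.
 */
static bool record_outcome(struct group_state *g,
                           const struct perm_policy *p, bool timed_out)
{
        if (g->evicted)
                return false;
        if (!timed_out) {
                g->consecutive_timeouts = 0;  /* any answer resets the run */
                return true;
        }
        if (++g->consecutive_timeouts >= p->max_consecutive_timeouts)
                g->evicted = true;
        return !g->evicted;
}
```

An administrator worried about the scenario above could then raise
timeout_secs or max_consecutive_timeouts instead of living with the
hard-coded values.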

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-07-24 21:02:32

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Fri, 2009-07-24 at 13:48 -0700, [email protected] wrote:
> getting an open fd to the file is good for things like content scanning,
> but for other things like a HSM re-populating the file, you would need to
> pass the path used to open the file at open time. is this in the metadata
> you are passing?

No, I will NOT EVER pass a pathname. Period. End of story. I stated
that if userspace wants to deal with pathnames (and they understand the
system setup well enough to know if pathnames even make sense to them)
they can use readlink(2) on /proc/self/fd/<fd>.

> to avoid race conditions, you may want some way that a listener on a
> directory can flag that it wants to also be a listener for all new
> directories created under the one it is listening on.

Interesting way to get the subtree checking people want, you do the
registration yourself the first time on the entire hierarchy and new
directories will be automagically added. I could probably do that, I'll
have to look.

> with the PERM checks, can a checker respond within the 5 second window
> with 'I need more time'? or does it need to complete all it's work in <5
> seconds?

FAN_RESET_TIMEOUT

2009-07-24 21:22:46

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Fri, 2009-07-24 at 15:00 -0600, Andreas Dilger wrote:
> On Jul 24, 2009 16:13 -0400, Eric Paris wrote:
> > fanotify kernel/userspace interaction is over a new socket protocol. A
> > listener opens a new socket in the new PF_FANOTIFY family. The socket
> > is then bound to an address. Using the following struct:
>
> Would it make sense to use existing netlink?

I looked at netlink, but because fd creation has to be done in the
listener's context I couldn't figure out how to make it suitable.

> > struct fanotify_addr {
> > sa_family_t family;
> > __u32 priority;
> > __u32 group_num;
> > __u32 mask;
> > __u32 f_flags;
> > __u32 unused[16];
> > } __attribute__((packed));
> >
> > The mask is the indication of the events this group is interested in.
> > The set of events of interest if FAN_GLOBAL_LISTENER is set at bind
> > time. If FAN_GLOBAL_LISTENER is not set, this field is meaningless as
> > the registration of events on individual inodes will dictate the
> > reception of events.
> >
> > * FAN_ACCESS: every file access.
> > * FAN_MODIFY: file modifications.
> > * FAN_CLOSE: files are closed.
> > * FAN_OPEN: open() calls.
> > * FAN_ACCESS_PERM: like FAN_ACCESS, except that the process trying to
> > access the file is put on hold while the fanotify client decides whether
> > to allow the operation.
> > * FAN_OPEN_PERM: like FAN_OPEN, but with the permission check.
> > * FAN_EVENT_ON_CHILD: receive notification of events on inodes inside
> > this subdirectory. (this is not a full recursive notification of all
> > descendants, only direct children)
> > * FAN_GLOBAL_LISTENER: notify for events on all files in the system.
> > * FAN_SURVIVE_MODIFY: special flag that ignores should survive inode
> > modification. Discussed below.
>
> It seems like a 32-bit mask might not be enough, it wouldn't be hard
> at this stage to add a 64-bit mask. Lustre has a similar mechanism
> (changelog) that allows tracking all different kinds of filesystem
> events (create/unlink/symlink/link/rename/mkdir/setxattr/etc), instead
> of just open/close, also use by HSM, enhanced rsync, etc.

I had a 64 bit mask, but Al Viro asked me to go back to a 32 bit mask
because of i386 register pressure. The bitmask operations are on VERY
hot paths inside the kernel.

> > struct fanotify_event_metadata {
> > __u32 event_len;
> > __s32 fd;
> > __u32 mask;
> > __u32 f_flags;
> > __s32 pid;
> > __s32 tgid;
> > __u64 cookie;
> > } __attribute__((packed));
>
> Getting the attributes that have changed into this message is also
> useful, as it avoids a continual stream of "stat" calls on the inodes.

Hmmm, I'll take a look. Do you have a good example of what you would
want to see? I don't think we know in the notification hooks what
actually is being changed :(

> The other thing that is important for HSM is that this log is atomic
> and persistent, otherwise there may be files that are missed if the
> node crashes. This involves creating atomic update records as part
> of the filesystem operation, and then userspace consumes them and
> tells the kernel that it is finished with records up to X. Otherwise
> you risk inconsistencies between rsync/HSM/updatedb for files that
> are updated just before a crash.

Uhhh, persistent across a crash? Nope, don't have that. Notification
is all in memory. Can't I just put the onus on userspace to recheck
things maybe? Sounds like a user for i_version....

> > If a FAN_ACCESS_PERM or FAN_OPEN_PERM event is received the listener
> > must send a response before the 5 second timeout. If no response is
> > sent before the 5 second timeout the original operation is allowed. If
> > this happens too many times (10 in a row) the fanotify group is evicted
> > from the kernel and will not get any new events.
>
> This should be a tunable, since if the intent is to monitor PERM checks
> it would be possible for users to DOS the machine and delay the userspace
> programs and access files they shouldn't be able to.

At the moment I cheat and say root only to bind. I do plan to open it
up to non-root users after it's in and working, but I'm seriously
considering leaving _PERM events as root only. It's hard to map the
security implications between the original process and the listener,
so making sure the listener is always root is easy :)

Userspace would never be able to access a file it shouldn't be allowed
to (the new fd is created in the context of the listener and EPERM is
possible.)

2009-07-24 21:44:21

by Jamie Lokier

Subject: Re: fanotify - overall design before I start sending patches

Eric Paris wrote:
> On Fri, 2009-07-24 at 13:48 -0700, [email protected] wrote:
> > getting an open fd to the file is good for things like content scanning,
> > but for other things like a HSM re-populating the file, you would need to
> > pass the path used to open the file at open time. is this in the metadata
> > you are passing?
>
> No, I will NOT EVER pass a pathname. Period. End of story. I stated
> the if userspace wants to deal with pathnames (and they understand the
> system setup well enough to know if pathnames even make sense to them)
> they can use readlink(2) on /proc/self/fd

That makes sense.

In most cases where events trigger userspace cache or index updates,
userspace already has enough information to calculate the path (and
any derived data) from the inode number (in the case of non-hard-link
files) or from the inode number of the parent directory and the name
(not full path).

So it wouldn't even need to call readlink(2), provided those bits of
information are passed in the event.

That is one thing which inotify _nearly_ gets right. Nearly, because
it doesn't pass the inode number when you're watching a directory, and
watching every inode is too expensive.

> > to avoid race conditions, you may want some way that a listener on a
> > directory can flag that it wants to also be a listener for all new
> > directories created under the one it is listening on.
>
> Interesting way to get the subtree checking people want, you do the
> registration yourself the first time on the entire hierarchy and new
> directories will be automagically added. I could probably do that, I'll
> have to look.

Yes, automagically adding directories is essential, otherwise they can
be added, and someone can populate them with files that have some
effect before userspace gets a chance to scan them.

The other part of useful subtree notification is getting notifications
for a subtree without having to initially scan the whole hierarchy,
which can take a long time as well as a huge amount of unnecessary
seeking and I/O.

The third part, which by the way is really recommended for security
applications, is persistence across umount/reboot/mount. That can be
done either by assuming there are no filesystem changes when userspace
isn't watching it, or by the simple expedient of letting userspace add
an xattr to things it has indexed, with a specially recognised name
that is automatically removed whenever the file/directory is changed.

-- Jamie

2009-07-24 22:42:55

by Andreas Dilger

Subject: Re: fanotify - overall design before I start sending patches

On Jul 24, 2009 17:21 -0400, Eric Paris wrote:
> On Fri, 2009-07-24 at 15:00 -0600, Andreas Dilger wrote:
> > On Jul 24, 2009 16:13 -0400, Eric Paris wrote:
> > It seems like a 32-bit mask might not be enough, it wouldn't be hard
> > at this stage to add a 64-bit mask. Lustre has a similar mechanism
> > (changelog) that allows tracking all different kinds of filesystem
> > events (create/unlink/symlink/link/rename/mkdir/setxattr/etc), instead
> > of just open/close, also use by HSM, enhanced rsync, etc.
>
> I had a 64 bit mask, but Al Viro asked me to go back to a 32 bit mask
> because of i386 register pressure. The bitmask operations are on VERY
> hot paths inside the kernel.

How about adding a spare "__u32 mask_hi" for future use, so that it can
be changed directly into a __u64 on LE machines? That preserves the
extensibility for the future, without hitting performance on 32-bit
machines before it is needed.
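
A small sketch of why the spare field helps on little-endian machines.
struct mask_pair and the three helpers are hypothetical names used only
for illustration: with mask placed first and mask_hi directly after it,
the 8 bytes already form the 64-bit value in little-endian byte order,
so the pair could later be redeclared as a single 64-bit field without
changing the layout.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: low half first, spare high half after it. */
struct mask_pair {
        uint32_t mask;
        uint32_t mask_hi;
};

/* Portable way to combine the halves, valid on any byte order. */
static uint64_t mask_pair_value(const struct mask_pair *m)
{
        return (uint64_t)m->mask | ((uint64_t)m->mask_hi << 32);
}

/* Reinterpret the same 8 bytes as one 64-bit integer. */
static uint64_t mask_pair_raw(const struct mask_pair *m)
{
        uint64_t v;

        memcpy(&v, m, sizeof(v));
        return v;
}

/* Runtime byte-order check. */
static int host_is_little_endian(void)
{
        const uint16_t one = 1;
        unsigned char first;

        memcpy(&first, &one, 1);
        return first == 1;
}
```

On a little-endian host mask_pair_raw() and mask_pair_value() agree; on
a big-endian host the two halves would have to be swapped first, which
is why the suggestion is limited to LE machines.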

> > > struct fanotify_event_metadata {
> > > __u32 event_len;
> > > __s32 fd;
> > > __u32 mask;
> > > __u32 f_flags;
> > > __s32 pid;
> > > __s32 tgid;
> > > __u64 cookie;
> > > } __attribute__((packed));
> >
> > Getting the attributes that have changed into this message is also
> > useful, as it avoids a continual stream of "stat" calls on the inodes.
>
> Hmmm, I'll take a look. Do you have a good example of what you would
> want to see? I don't think we know in the notification hooks what
> actually is being changed :(

Well, I'm thinking there will be a lot of events that some applications
will not care about (e.g. PERM checks where the user is only changing
the file mode, vs. PERM checks where the owner of the file is changing).
Even if the old attributes are not available, having a mask of which
fields in the inode changed, and struct stat64 would be very useful.

> > The other thing that is important for HSM is that this log is atomic
> > and persistent, otherwise there may be files that are missed if the
> > node crashes. This involves creating atomic update records as part
> > of the filesystem operation, and then userspace consumes them and
> > tells the kernel that it is finished with records up to X. Otherwise
> > you risk inconsistencies between rsync/HSM/updatedb for files that
> > are updated just before a crash.
>
> Uhhh, persistent across a crash? Nope, don't have that. Notification
> is all in memory. Can't I just put the onus on userspace to recheck
> things maybe? Sounds like a user for i_version....

Well, if new files are created then userspace won't have any idea which
inodes need to be checked, and it will also need to keep a persistent
database of all file i_version values. If you are trying to hook a
backup tool onto such an interface and files created persistently on
disk before a crash are not handled, then they may never be backed up.

Tools like inotify are fine for desktop window refresh and similar uses,
but for applications which require robust handling they also need to
work over a crash.

The other issue is that you might get quite a large queue of operations
in memory, and if this can't be saved to the filesystem then it might
result in OOMing itself.

> > > If a FAN_ACCESS_PERM or FAN_OPEN_PERM event is received the listener
> > > must send a response before the 5 second timeout. If no response is
> > > sent before the 5 second timeout the original operation is allowed. If
> > > this happens too many times (10 in a row) the fanotify group is evicted
> > > from the kernel and will not get any new events.
> >
> > This should be a tunable, since if the intent is to monitor PERM checks
> > it would be possible for users to DOS the machine and delay the userspace
> > programs and access files they shouldn't be able to.
>
> At the moment I cheat and say root only to bind. I do plan to open it
> up to non-root users after it's in and working, but I'm seriously
> considering leaving _PERM events as root only. It's hard to map the
> original to listener security implications. So making sure the listener
> is always root is easy :)

My comment has nothing to do with non-root access. It has to do with
how long the userspace watcher has to handle an event. If a regular user
is running a 50-thread iozone with a 1M file directory you can imagine it
will create a lot of events to watch, along with a lot of seeking to slow
down the processing of events. If the user then does "open(secretfile)"
(where your _PERM check is doing something useful) it is possible
that the userspace listener will time out and miss some events.

> Userspace would never be able to access a file it shouldn't be allowed
> to (the new fd is created in the context of the listener and EPERM is
> possible.)

Ah, so the _PERM check is only intended to grant extra access, instead
of restricting it? The documentation should make clear that
doing the opposite is an easily-bypassed security vulnerability.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-07-24 22:48:33

by Jamie Lokier

Subject: Re: fanotify - overall design before I start sending patches

Eric Paris wrote:
> It is a new notification system that has a limited set of events (open,
> close, read, write) in which notification not only comes with metadata
> the describes what happened it also comes with an open file descriptor
> to the object in question. fanotify will also allow the listener to
> make access decisions on open and read events. This allows the
> implementation of hierarchical storage management systems or an access
> file scanning or integrity checking.

My first thought was to wonder, why not make it the same set of events
that inotify and dnotify provide? That is: open, close, read, write,
create, delete, rename, attribute change? In other words, I don't see
a good reason for it to be a subset of events.

Apart from aesthetics (which is my first thought), creating, renaming
and deleting files and symlinks also has security implications on
typical Linux systems. Since fanotify is motivated by security
applications among other things, surely those types of events are of
interest too?

For example, just as you have the power to block a file open request
from some application, you may also need the power to block a
symlink(2) request.

> fanotify comes in two flavors 'directed' and 'global.' 'Directed' is
> like inotify or dnotify in that you register specific inodes of interest
> and only get events pertaining to those inodes. Global means you are
> registering interest for event types system wide. With global mode the
> listener program can later exclude objects from future events.

On a large multi-user system with, say, 10k users in /home and 100
logged in at any time, if you want to monitor the files in
/var/lib/ftp/some.ftp.site/, neither 'directed' nor 'global' is going
to be efficient.

Similarly, if you have 'enhanced rsync' as someone else has mentioned
(good example), it will want to monitor /home/me/kernels/2.6 only,
without slowing down the system when any of the other 4 million files
in /home/me are accessed.

I appreciate fanotify does not try to be perfect for every
application. But if we can make it handle a few more things in a more
scalable way without much code, and a clean interface too, that can
only be good.

> fanotify kernel/userspace interaction is over a new socket protocol. A
> listener opens a new socket in the new PF_FANOTIFY family. The socket
> is then bound to an address. Using the following struct:
>
> struct fanotify_addr {
> sa_family_t family;
> __u32 priority;
> __u32 group_num;
> __u32 mask;
> __u32 f_flags;
> __u32 unused[16];
> } __attribute__((packed));
>
> The priority field indicates in which order fanotify listeners will get
> events. Since 2 fanotify listeners would 'hear' each others events on
> the new fd they create fanotify listeners will not hear events generated
> by other fanotify listeners with a lower priority number.

I'm not sure if I understand the priority mechanism. If it means that
events are only delivered to the highest priority listener, that makes
the fanotify subsystem virtually useless for things like 'enhanced
rsync' which someone else has mentioned. Those programs need to know
they will receive all events, not miss some events when another
program is running.

But maybe I misunderstood the priority mechanism?

> The group_num is at the moment not used, but the plan was to allow 2
> processes to bind to the same fanotify group and share the load of
> processing events.

That's an interesting idea. I like it.

Couldn't both processes simply read from the same socket, so you
wouldn't need group_num? I think that would be cleaner and simpler.

For example, look at how Apache waits for incoming connections:
multiple processes call accept() on the same socket, and exactly one
process is woken with each new connection. This is quite efficient.

You could do the same: have each process read from the same socket,
blocking until there is an event, and only send the event to one of
the waiting processes.

It is important that the kernel code to handle reads dequeues events
in each process efficiently, without the "thundering herd" problem
(look it up, Apache used to have it with accept()).
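
A rough illustration of the shared-socket idea, using an ordinary AF_UNIX datagram socketpair in place of the proposed fanotify socket (which is an assumption made purely so the sketch can run). Two forked workers read from the same descriptor; each queued event is consumed by exactly one of them. worker() and run_shared_readers() are made-up names.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_WORKERS 2

/* Worker: read events from the shared socket until a stop marker (0)
 * arrives; report the number of events handled via the exit status. */
static void worker(int fd)
{
	unsigned char ev;
	int handled = 0;

	while (read(fd, &ev, 1) == 1 && ev != 0)
		handled++;
	_exit(handled);
}

/* Queue nr_events one-byte datagrams, let NR_WORKERS processes read the
 * same socket, and return how many events were handled in total.
 * Returns -1 on error. Keep nr_events small: an exit status is 8 bits. */
static int run_shared_readers(int nr_events)
{
	int sv[2], total = 0;
	pid_t pids[NR_WORKERS];

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	for (int i = 0; i < NR_WORKERS; i++) {
		pids[i] = fork();
		if (pids[i] < 0)
			return -1;
		if (pids[i] == 0) {
			close(sv[0]);
			worker(sv[1]);	/* does not return */
		}
	}

	for (int i = 0; i < nr_events; i++) {
		unsigned char ev = 1;

		if (write(sv[0], &ev, 1) != 1)
			return -1;
	}
	for (int i = 0; i < NR_WORKERS; i++) {	/* one stop marker each */
		unsigned char stop = 0;

		if (write(sv[0], &stop, 1) != 1)
			return -1;
	}

	for (int i = 0; i < NR_WORKERS; i++) {
		int status;

		if (waitpid(pids[i], &status, 0) < 0 || !WIFEXITED(status))
			return -1;
		total += WEXITSTATUS(status);
	}
	close(sv[0]);
	close(sv[1]);
	return total;
}
```

The per-worker counts vary from run to run, but the total always equals the number of events queued, so no event is delivered twice or lost.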

> The f_flags is the flags which the fanotify listener wishes to use when
> opening their notification fds. On access scanners would want to use
> O_RDONLY, whereas HSM systems would need to use O_WRONLY.

Interesting. An option for file change trackers who don't care about
the open file descriptor would be good too. Perhaps they are just
logging.

> The mask is the indication of the events this group is interested in.
> The set of events of interest if FAN_GLOBAL_LISTENER is set at bind
> time. If FAN_GLOBAL_LISTENER is not set, this field is meaningless as
> the registration of events on individual inodes will dictate the
> reception of events.
>
> * FAN_ACCESS: every file access.
> * FAN_MODIFY: file modifications.
> * FAN_CLOSE: files are closed.
> * FAN_OPEN: open() calls.
> * FAN_ACCESS_PERM: like FAN_ACCESS, except that the process trying to
> access the file is put on hold while the fanotify client decides whether
> to allow the operation.
> * FAN_OPEN_PERM: like FAN_OPEN, but with the permission check.
> * FAN_EVENT_ON_CHILD: receive notification of events on inodes inside
> this subdirectory. (this is not a full recursive notification of all
> descendants, only direct children)
> * FAN_GLOBAL_LISTENER: notify for events on all files in the system.
> * FAN_SURVIVE_MODIFY: special flag that ignores should survive inode
> modification. Discussed below.
>
> After the socket is bound events are attained using the read() syscall
> (recv* probably also works haven't tested). This will result in the
> buffer being filled with one or more events like this:
>
> struct fanotify_event_metadata {
> __u32 event_len;
> __s32 fd;
> __u32 mask;
> __u32 f_flags;
> __s32 pid;
> __s32 tgid;
> __u64 cookie;
> } __attribute__((packed));
>
> fd specifies the new file descriptor that was created in the context of
> the listener. (readlink of /proc/self/fd will give you A pathname)
> mask indicates the events type (bitwise OR of the event types listed
> above). f_flags here is the f_flags the ORIGINAL process has the file
> open with. pid and tgid are from the original process. cookie is used
> when the listener needs to allow, deny, or delay the operation.

So far it looks quite similar to inotify, with some differences.
Some things taken away:

- Very similar events, but missing a few like renames (which you
are thinking of adding).
- No file name for things that happen in a subdirectory.
Application expected to call readlink("/proc/self/fd") if it
cares about the file name. But that won't work for every kind of
event!

Some things (useful I agree) added:

- Returns an open file descriptor to the affected file.
- Returns some other attributes, like accessing pid/tgid (uid though?).
- Can block the process trying to access the file.

API-wise, is there a particular reason for using a new socket
interface, rather than extending the inotify interface with a few more
flags and a different event structure?

By the way, you may not know the history of inotify originally. It
used a device, /dev/inotify, when it was a third-party patch. To get
into the mainline kernel, it was requested that it be changed to use
system calls. The same happened to epoll. So you may have better
luck with a system call interface than using a socket. That shouldn't
affect discussions of any other technical aspect, though.

> If a FAN_ACCESS_PERM or FAN_OPEN_PERM event is received the listener
> must send a response before the 5 second timeout. If no response is
> sent before the 5 second timeout the original operation is allowed. If
> this happens too many times (10 in a row) the fanotify group is evicted
> from the kernel and will not get any new events. Sending a response is
> done using the setsockopt() call with the socket options set to
> FANOTIFY_ACCESS_RESPONSE. The buffer should contain a structure like:
>
> struct fanotify_so_access {
> __u64 cookie;
> __u32 response;
> } __attribute__((packed));
>
> Where cookie is the cookie from the notification and response is one of:

What happens when a process sends a cookie that it did not receive,
but another process received it?

> FAN_ALLOW: allow the original operation
> FAN_DENY: deny the original operation
> FAN_RESET_TIMEOUT: reset the timeout.
>
> The last main interface is the 'marking' of inodes. The purpose of
> inode marks differ between 'directed' and 'global' listeners. Directed
> fanotify listeners need to mark inodes of interest. They do that also
> using setsockopt() of type FANOTIFY_SET_MARK with the buffer containing
> a structure like:
>
> struct fanotify_so_inode_mark {
> __s32 fd;
> __u32 mask;
> __u32 ignored_mask;
> } __attribute__((packed));
>
> Where fd is backed by the inode in question. Mask is the events of
> interest (only used in directed mode) and ignored_mask is the mask of
> events which should be ignored.

It's hard to see how this differs much from inotify_add_watch, except
- is this mark global to all processes, or local to the process
setting the mark?

> The ignored_mask is cleared every time an inode receives a modification
> events unless FAN_SURVIVE_MODIFY is also set. The ignored_mask is
> mainly used for 2 purposes. Global listeners may just have no interest
> in lots of events, so they should spam inodes with an ignored mask. The
> ignored mask is also used to 'cache' access decisions. If the listener
> sets FAN_ACCESS_PERM in the ignored mask all access operations will be
> permitted without the call out to userspace. If the inode is modified
> the ignored_mask will be cleared and userspace will again have to
> approve the access. If userspace REALLY doesn't care ever they can use
> the special FAN_SURVIVE_MODIFY flag inside the ignored_mask.

I do like the idea of caching access decisions. Are these flags
global to the whole system, or local to the listening process setting
the flags (or to the specific listener's socket)?

> The only other current interface is the ability to ignore events by
> superblock magic number. This makes it easy to ignore all events
> in /proc which can be difficult to accomplish firing FANOTIFY_SET_MARK
> with ignored_masks over and over as processes are created and destroyed.
>
> ***********
>
> Future direction:

Here's one more thing which may be needed to make hard guarantees for
security applications:

- Mount events, which it would be natural for fanotify to block
temporarily while it assesses the impact and/or synchronises its
map of the mounts against the change. Mounts do change the set
of visible files, after all.

> There are 2 things I'm interested in adding.
> - Rename events.
> The updatedb/mlocate people are interested in fanotify as a means to
> not thrash the harddrive every night. They could instead update the db
> in real time as files are moved.

Great!

I'm interested in the same thing on narrower (but still large)
subdirectories, for things like enhanced rsync, make, git, indexing,
and complex caching of compiled things. You get the idea: it has a
lot of uses.

> - subtree notification.
> Currently to only watch /home and all of it's descendants one must
> either register a directed watch on every directory or use a global
> listener. The global listener with ignored_mask is not as bad as it
> sounds in my testing, but decent subtree registration and notification
> would be a big win in a lot of people's mind.

I believe we've talked about one suggestion for how to do this, on
lwn.net. I'll repeat it here.

Efficient recursive notifications method:

- You register for event on a directory with a RECURSIVE flag "give
me events for this directory and all paths below it".

- That listener gets events for any access of the appropriate type
whose path is via that directory, *using the specific run-time
path used for the access*.

- That _doesn't_ mean hard-link files need to know all their parent
directories, which would be silly and impossible. The event path
is just the one used at run-time for access, by the application
attempting to open/write/whatever.

- If a listener needs to track all accesses to a particular
hard-linked file, it's the responsibility of the listener to
ensure it listens to enough directories to cover every path to
that file - or listen to the file directly. It knows from
i_nlink and the mount map when it has enough directories.

- Notifying just the access path may seem counterintuitive, but in
fact it's what inotify and dnotify do already, and it does
actually work. Often a listener is maintaining a cache or index
of some kind, in which case it will already have sufficient
knowledge about where the hard-linked files are (or know that it
needs an initial indexing), and whether it has covered enough
parent directories to see all accesses to them.

- In practice it means each access traverses the path, following
parent directories until reaching a mount point, broadcasting
events on each one where there's a recursive listener. That's
not as inefficient as it looks, because paths don't usually have
a large number of components.

- I'm not sure exactly how fast/slow it is, though, and it may need a
few thoughtfully cached flags in each dentry to elide traversals.
I won't discuss the details here, for fear of complicating the
discussion too much. They might well mesh with the 'access
decision cache' flags you mentioned.

- It is necessary that link(2) create an attribute-change event
(for i_nlink!) on the source path of the link. dnotify/inotify
don't do that now (unless they changed recently), but they should
to make this work.
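
The upward walk in the list above could look roughly like this. It is a userspace sketch only: struct node is a made-up stand-in for a directory entry (not the kernel's dentry), and the cached-flag optimisation is left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for a directory entry; not the kernel's dentry. */
struct node {
	struct node *parent;		/* NULL at the filesystem root */
	int is_mount_point;		/* stop the upward walk here */
	int recursive_listeners;	/* listeners registered with RECURSIVE */
	int events_seen;		/* events broadcast to this directory */
};

/* Walk from the accessed object up the path actually used for the
 * access, notifying each directory that has a recursive listener.
 * Returns the number of directories notified. */
static int notify_access_path(struct node *accessed)
{
	int notified = 0;

	for (struct node *d = accessed->parent; d; d = d->parent) {
		if (d->recursive_listeners > 0) {
			d->events_seen++;
			notified++;
		}
		if (d->is_mount_point)
			break;
	}
	return notified;
}
```

The cost per access is bounded by the number of path components between the object and its mount point.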

Please shoot down the idea. I think it is good enough
for reliable subtree notifications, but I'd love to be proven wrong.

-- Jamie

2009-07-24 23:01:58

by Jamie Lokier

Subject: Re: fanotify - overall design before I start sending patches

Andreas Dilger wrote:
> On Jul 24, 2009 17:21 -0400, Eric Paris wrote:
> > On Fri, 2009-07-24 at 15:00 -0600, Andreas Dilger wrote:
> > > On Jul 24, 2009 16:13 -0400, Eric Paris wrote:
> > > It seems like a 32-bit mask might not be enough, it wouldn't be hard
> > > at this stage to add a 64-bit mask. Lustre has a similar mechanism
> > > (changelog) that allows tracking all different kinds of filesystem
> > > events (create/unlink/symlink/link/rename/mkdir/setxattr/etc), instead
> > > of just open/close, also use by HSM, enhanced rsync, etc.
> >
> > I had a 64 bit mask, but Al Viro ask me to go back to a 32 bit mask
> > because of i386 register pressure. The bitmask operations are on VERY
> > hot paths inside the kernel.
>
> How about adding a spare "__u32 mask_hi" for future use, so that it can
> be changed directly into a __u64 on LE machines? That preserves the
> extensibility for the future, without hitting performance on 32-bit
> machines before it is needed.

If so, remember to put "__u32 mask_hi" *before* "__u32 mask" on BE
32-bit machines. Then it can be changed to __u64 on those too, if
needed.
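
A small sketch of that ordering, assuming a compiler that predefines __BYTE_ORDER__ (GCC and Clang do). struct mask_pair and mask_pair_as_u64() are illustrative names, not part of the proposed interface.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: two 32-bit halves ordered so that together they
 * occupy the same bytes as a single 64-bit mask would. */
struct mask_pair {
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	uint32_t mask_hi;	/* high word first on big-endian */
	uint32_t mask;
#else
	uint32_t mask;		/* low word first on little-endian */
	uint32_t mask_hi;
#endif
};

/* Reinterpret the pair as the 64-bit value it could later become. */
static uint64_t mask_pair_as_u64(const struct mask_pair *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

With this layout, replacing the two fields by one 64-bit field later would not move any bits that existing users already set in mask.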

> Well, if new files are created then userspace won't have any idea which
> inodes need to be checked, and it will also need to keep a persistent
> database of all file i_version values. If you are trying to hook a
> backup tool onto such an interface and files created persistently on
> disk before a crash are not handled, then they may never be backed up.
>
> Tools like inotify are fine for desktop window refresh and similar uses,
> but for applications which require robust handling they also need to
> work over a crash.

I see two ways to handle that:

- Simply assert that the monitoring program is running whenever
there are any changes to a particular filesystem, or the program
is told that it must reindex as a matter of policy.

For example you might run the program before mounting and after
unmounting, so you know there are no changes at other times.

That's not hard security, but then neither is i_version or any
other check, as root (which is responsible for the mount/umount
sequence after all) can also bypass filesystems.

- Have a well-known extended attribute (xattr) or set of them which
are _always deleted_ whenever files are modified. For example
"system.indexing.*". An application called Foo Monitor would
create "system.indexing.foo" xattrs on files prior to indexing
each one.

Each time a MODIFY event occurs on a file, whether it's being
watched or not, the kernel would remove all attributes whose
names match "system.indexing.*".

That includes recursively doing the same to parent directories
via the path used for that access, all the way to the filesystem
root (even if the root isn't visible due to mounting). (See my
other recent mail for why a single path works for hard-linked files).

Tools can look at those xattrs to determine if their indexing
information is up to date - persistently across crashes, unmounts
and reboots.
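
To make the second option concrete, here is an in-memory model of the invalidation step. Nothing here touches real xattrs; struct obj and the helper names are made up for illustration, and only the "system.indexing." prefix comes from the text above.

```c
#include <string.h>

#define INDEXING_PREFIX "system.indexing."
#define MAX_XATTRS 4

/* Hypothetical in-memory model of a file or directory with xattr names. */
struct obj {
	struct obj *parent;			/* NULL at the filesystem root */
	const char *xattrs[MAX_XATTRS];		/* NULL slots are empty */
};

static int is_indexing_xattr(const char *name)
{
	return strncmp(name, INDEXING_PREFIX, strlen(INDEXING_PREFIX)) == 0;
}

/* On MODIFY: drop every "system.indexing.*" name from the object and
 * from each ancestor on the access path. Returns names removed. */
static int invalidate_indexing(struct obj *o)
{
	int removed = 0;

	for (; o; o = o->parent)
		for (int i = 0; i < MAX_XATTRS; i++)
			if (o->xattrs[i] && is_indexing_xattr(o->xattrs[i])) {
				o->xattrs[i] = NULL;
				removed++;
			}
	return removed;
}

/* A tool's indexing data is current only while its marker is present. */
static int has_xattr(const struct obj *o, const char *name)
{
	for (int i = 0; i < MAX_XATTRS; i++)
		if (o->xattrs[i] && strcmp(o->xattrs[i], name) == 0)
			return 1;
	return 0;
}
```

After a crash or reboot, a tool would only need to descend into directories whose marker is missing.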

> The other issue is that you might get quite a large queue of operations
> in memory, and if this can't be saved to the filesystem then it might
> result in OOMing itself.

:-) What does inotify do in this scenario?

-- Jamie

2009-07-24 23:27:13

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Fri, 2009-07-24 at 23:48 +0100, Jamie Lokier wrote:
> Eric Paris wrote:
> > It is a new notification system that has a limited set of events (open,
> > close, read, write) in which notification not only comes with metadata
> > the describes what happened it also comes with an open file descriptor
> > to the object in question. fanotify will also allow the listener to
> > make access decisions on open and read events. This allows the
> > implementation of hierarchical storage management systems or an access
> > file scanning or integrity checking.
>
> My first thought was to wonder, why not make it the same set of events
> that inotify and dnotify provide? That is: open, close, read, write,
> create, delete, rename, attribute change? In other words, I don't see
> a good reason for it to be a subset of events.

The two real reasons?

1) These were the only 4 my original use case cared about.
2) These are the only 4 where the notification hook has enough
information to open a fd in the context of the listener.

In the kernel most notification is done with either an inode or a dentry
as that is enough for inotify, dnotify, audit_watch and audit_tree.
Opening a file descriptor, and thus fanotify, requires a dentry and a
vfsmnt, which is much harder to come by in the kernel.

Maybe as future work I'll try to convince Al to allow me to have that
information in more places, but for today, those 4 are the only ones I
can probably slip past him...

2009-07-24 23:46:20

by Jamie Lokier

Subject: Re: fanotify - overall design before I start sending patches

Eric Paris wrote:
> On Fri, 2009-07-24 at 23:48 +0100, Jamie Lokier wrote:
> > Eric Paris wrote:
> > > It is a new notification system that has a limited set of events (open,
> > > close, read, write) in which notification not only comes with metadata
> > > the describes what happened it also comes with an open file descriptor
> > > to the object in question. fanotify will also allow the listener to
> > > make access decisions on open and read events. This allows the
> > > implementation of hierarchical storage management systems or an access
> > > file scanning or integrity checking.
> >
> > My first thought was to wonder, why not make it the same set of events
> > that inotify and dnotify provide? That is: open, close, read, write,
> > create, delete, rename, attribute change? In other words, I don't see
> > a good reason for it to be a subset of events.
>
> The two real reasons?
>
> 1) These were the only 4 my original use case cared about.
> 2) These are the only 4 where the notification hook has enough
> information to open a fd in the context of the listener.
>
> In the kernel most notification is done with either an inode or a dentry
> as that is enough for inotify, dnotify, audit_watch and audit_tree.
> Opening a file descriptor, and thus fanotify, requires a dentry and a
> vfsmnt, which is much harder to come by in the kernel.
>
> Maybe as future work I'll try to convince Al to allow me to have that
> information in more places, but for today, those 4 are the only ones I
> can probably slip past him...

For the other events, maybe there is no need for a file descriptor
anyway.

-- Jamie

2009-07-24 23:50:44

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Fri, 2009-07-24 at 23:48 +0100, Jamie Lokier wrote:
> Eric Paris wrote:

> > fanotify kernel/userspace interaction is over a new socket protocol. A
> > listener opens a new socket in the new PF_FANOTIFY family. The socket
> > is then bound to an address. Using the following struct:
> >
> > struct fanotify_addr {
> > sa_family_t family;
> > __u32 priority;
> > __u32 group_num;
> > __u32 mask;
> > __u32 f_flags;
> > __u32 unused[16];
> > } __attribute__((packed));
> >
> > The priority field indicates in which order fanotify listeners will get
> > events. Since 2 fanotify listeners would 'hear' each others events on
> > the new fd they create fanotify listeners will not hear events generated
> > by other fanotify listeners with a lower priority number.
>
> I'm not sure if I understand the priority mechanism. If it means that
> events are only delivered to the highest priority listener, that makes
> the fanotify subsystem virtually useless for things like 'enhanced
> rsync' which someone else has mentioned. Those programs need to know
> they will receive all events, not miss some events when another
> program is running.
>
> But maybe I misunderstood the priority mechanism?

The priority mechanism ONLY excludes events generated by processes which
have an open fanotify listener. (someone is going to barf when they see
that patch)

The problem was basically:

Process A opens [file]
Listener 1 gets the event about A and opens [file]
Listener 2 gets the event about A and opens [file]
Listener 1 gets the event about 2 and opens [file]
Listener 2 gets the event about 1 and opens [file]
Listener 1 gets the event about 2 and opens [file]
..... shit, see how this keeps going forever?

Now that I type it out there might be some horribleness left since my
solution was to only send events caused by 3 to listeners with priority
< 3. So given 3 listeners and the above situation we get

Process A opens [file]
Listener 1 gets the event about A and opens [file]
Listener 2 gets the event about A and opens [file]
Listener 3 gets the event about A and opens [file]
Listener 1 gets the event about 2 and opens [file]
Listener 1 gets the event about 3 and opens [file]
Listener 2 gets the event about 3 and opens [file]
Listener 1 gets the event about 2 and opens [file]
done.
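
The rule in that trace can be checked with a tiny model. It assumes every listener reopens the file on each event it receives, and that an ordinary process behaves like a source ranked above all listeners; the function names are made up.

```c
/* Count deliveries when an event caused by a source of priority `src`
 * is sent only to listeners with a strictly lower priority number, and
 * each receiving listener opens the file again (causing a new event). */
static unsigned count_deliveries(unsigned src)
{
	unsigned total = 0;

	for (unsigned q = 1; q < src; q++)
		total += 1 + count_deliveries(q);
	return total;
}

/* An ordinary process is treated as if it ranked above every listener. */
static unsigned deliveries_for_listeners(unsigned nr_listeners)
{
	return count_deliveries(nr_listeners + 1);
}
```

For three listeners the model counts seven deliveries, the same as the trace. It always terminates, but in this model the count is 2^n - 1 for n listeners, which may be the remaining "horribleness".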

But maybe I should just do the 'if you have fanotify open, you don't
create other fanotify events'... so everyone gets what they expect...




>
> > The f_flags is the flags which the fanotify listener wishes to use when
> > opening their notification fds. On access scanners would want to use
> > O_RDONLY, whereas HSM systems would need to use O_WRONLY.
>
> Interesting. An option for file change trackers who don't care about
> the open file descriptor would be good too. Perhaps they are just
> logging.

No open fd would be pretty worthless, you'd know 'some file opened' but
you wouldn't know what file :) The open fd is the whole point of
fanotify.

> > fd specifies the new file descriptor that was created in the context of
> > the listener. (readlink of /proc/self/fd will give you A pathname)
> > mask indicates the events type (bitwise OR of the event types listed
> > above). f_flags here is the f_flags the ORIGINAL process has the file
> > open with. pid and tgid are from the original process. cookie is used
> > when the listener needs to allow, deny, or delay the operation.
>
> So far it looks quite similar to inotify, with some differences.
> Some things taken away:
>
> - Very similar events, but missing a few like renames (which you
> are thinking of adding).
> - No file name for things that happen in a subdirectory.

Actually I should be more clear about that. If you call
setsockopt(FANOTIFY_ADD_MARK) where

struct fanotify_so_inode_mark {
__s32 fd; = "/tmp/"
__u32 mask; = (FAN_OPEN | FAN_EVENT_ON_CHILD);
__u32 ignored_mask; = 0
};

and someone opens /tmp/file1 you are going to get an open fd
for /tmp/file1 NOT for /tmp. This is different than inotify.

> Application expected to call readlink("/proc/self/fd") if it
> cares about the file name. But that won't work for every kind of
> event!

It does, since I only give you events where it works :)
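
For reference, the lookup in question is a readlink() on the per-descriptor entry under /proc/self/fd. A minimal sketch, assuming /proc is mounted; fd_to_path() is a made-up helper name.

```c
#include <fcntl.h>	/* open() in callers */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Resolve an open descriptor to a pathname through /proc/self/fd/<fd>.
 * Returns the path length, or -1 on error or possible truncation. As
 * the proposal puts it, this is "A pathname", not necessarily the only
 * one. */
static ssize_t fd_to_path(int fd, char *buf, size_t len)
{
	char link[64];
	ssize_t n;

	if (len == 0)
		return -1;
	snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
	n = readlink(link, buf, len - 1);
	if (n < 0 || (size_t)n >= len - 1)
		return -1;
	buf[n] = '\0';	/* readlink() does not terminate the string */
	return n;
}
```

A listener would call this on the fd field of each event it reads.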

> Some things (useful I agree) added:
>
> - Returns an open file descriptor to the affected file.
> - Returns some other attributes, like accessing pid/tgid (uid though?).
> - Can block the process trying to access the file.
>
> API-wise, is there a particular reason for using a new socket
> interface, rather than extending the inotify interface with a few more
> flags and a different event structure?

Since they are syscalls, they pretty much suck to change (setsockopt is
SOOOOO much more versionable)


> So you may have better
> luck with a system call interface than using a socket.

I'm going on the suggestion of Alan Cox, but honestly the interface is
clearly segregated, so it can be changed if there is a better idea....



> > struct fanotify_so_access {
> > __u64 cookie;
> > __u32 response;
> > } __attribute__((packed));
> >
> > Where cookie is the cookie from the notification and response is one of:
>
> What happens when a process sends a cookie that it did not receive,
> but another process received it?

Cookies are specific to your fanotify socket. If you respond with an
invalid cookie the setsockopt() call returns -EINVAL.
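
A userspace model of that per-socket scoping, for illustration only. struct fan_socket, add_pending() and respond() are hypothetical names; only the -EINVAL result for an unknown cookie comes from the description above.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_PENDING 8

/* Hypothetical per-socket table of permission events awaiting an answer. */
struct fan_socket {
	uint64_t pending[MAX_PENDING];
	int in_use[MAX_PENDING];
};

/* Record a cookie handed out with a permission event on this socket. */
static int add_pending(struct fan_socket *s, uint64_t cookie)
{
	for (int i = 0; i < MAX_PENDING; i++)
		if (!s->in_use[i]) {
			s->pending[i] = cookie;
			s->in_use[i] = 1;
			return 0;
		}
	return -ENOSPC;
}

/* Answer a cookie: only cookies pending on *this* socket are accepted. */
static int respond(struct fan_socket *s, uint64_t cookie)
{
	for (int i = 0; i < MAX_PENDING; i++)
		if (s->in_use[i] && s->pending[i] == cookie) {
			s->in_use[i] = 0;
			return 0;
		}
	return -EINVAL;
}
```

A cookie received on one socket is rejected when sent on another, and answering the same cookie twice is rejected as well.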

>
> > FAN_ALLOW: allow the original operation
> > FAN_DENY: deny the original operation
> > FAN_RESET_TIMEOUT: reset the timeout.
> >
> > The last main interface is the 'marking' of inodes. The purpose of
> > inode marks differ between 'directed' and 'global' listeners. Directed
> > fanotify listeners need to mark inodes of interest. They do that also
> > using setsockopt() of type FANOTIFY_SET_MARK with the buffer containing
> > a structure like:
> >
> > struct fanotify_so_inode_mark {
> > __s32 fd;
> > __u32 mask;
> > __u32 ignored_mask;
> > } __attribute__((packed));
> >
> > Where fd is backed by the inode in question. Mask is the events of
> > interest (only used in directed mode) and ignored_mask is the mask of
> > events which should be ignored.
>
> It's hard to see how this differs much from inotify_add_watch, except
> - is this mark global to all processes, or local to the process
> setting the mark?

there are a LOT of similarities. bind() is a lot like inotify_init().
adding a mark is a lot like inotify_add_watch().....

As in inotify, setting a mark only applies to the socket it was
associated with....

> > The ignored_mask is cleared every time an inode receives a modification
> > events unless FAN_SURVIVE_MODIFY is also set. The ignored_mask is
> > mainly used for 2 purposes. Global listeners may just have no interest
> > in lots of events, so they should spam inodes with an ignored mask. The
> > ignored mask is also used to 'cache' access decisions. If the listener
> > sets FAN_ACCESS_PERM in the ignored mask all access operations will be
> > permitted without the call out to userspace. If the inode is modified
> > the ignored_mask will be cleared and userspace will again have to
> > approve the access. If userspace REALLY doesn't care ever they can use
> > the special FAN_SURVIVE_MODIFY flag inside the ignored_mask.
>
> I do like the idea of caching access decisions. Are these flags
> global to the whole system, or local to the listening process setting
> the flags (or to the specific listener's socket)?

socket.
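
The ignored_mask behaviour for one socket's mark can be modelled as below. The bit values are placeholders (the proposal does not list the numbers), and struct mark is a made-up name.

```c
#include <stdint.h>

/* Hypothetical bit values; the proposal does not give the numbers. */
#define FAN_ACCESS_PERM     0x10u
#define FAN_SURVIVE_MODIFY  0x80u

/* One socket's mark on one inode. */
struct mark {
	uint32_t ignored_mask;
};

/* Should this event be reported to the socket that owns the mark? */
static int should_report(const struct mark *m, uint32_t event)
{
	return (m->ignored_mask & event) == 0;
}

/* The inode was modified: cached decisions are dropped unless the
 * listener asked for them to survive. */
static void inode_modified(struct mark *m)
{
	if (!(m->ignored_mask & FAN_SURVIVE_MODIFY))
		m->ignored_mask = 0;
}
```

Because the mark belongs to one socket, another listener's marks on the same inode are unaffected by this cache.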

> > The only other current interface is the ability to ignore events by
> > superblock magic number. This makes it easy to ignore all events
> > in /proc which can be difficult to accomplish firing FANOTIFY_SET_MARK
> > with ignored_masks over and over as processes are created and destroyed.
> >
> > ***********
> >
> > Future direction:
>
> Here's one more thing which may be needed to make hard guarantees for
> security applications:
>
> - Mount events, which it would be natural for fanotify to block
> temporarily while it assesses the impact and/or synchronises its
> map of the mounts against the change. Mounts do change the set
> of visible files, after all.
>
> > There are 2 things I'm interested in adding.
> > - Rename events.
> > The updatedb/mlocate people are interested in fanotify as a means to
> > not thrash the harddrive every night. They could instead update the db
> > in real time as files are moved.
>
> Great!
>
> I'm interested in the same thing on narrower (but still large)
> subdirectories, for things like enhanced rsync, make, git, indexing,
> and complex caching of compiled things. You get the idea: it has a
> lot of uses.
>
> > - subtree notification.
> > Currently to only watch /home and all of it's descendants one must
> > either register a directed watch on every directory or use a global
> > listener. The global listener with ignored_mask is not as bad as it
> > sounds in my testing, but decent subtree registration and notification
> > would be a big win in a lot of people's mind.
>
> I believe we've talked about one suggestion for how to do this, on
> lwn.net. I'll repeat it here.
>
> Efficient recursive notifications method:
>
> - You register for event on a directory with a RECURSIVE flag "give
> me events for this directory and all paths below it".
>
> - That listener gets events for any access of the appropriate type
> whose path is via that directory, *using the specific run-time
> path used for the access*.
>
> - That _doesn't_ mean hard-link files need to know all their parent
> directories, which would be silly and impossible. The event path
> is just the one used at run-time for access, by the application
> attempting to open/write/whatever.
>
> - If a listener needs to track all accesses to a particular
> hard-linked file, it's the responsibility of the listener to
> ensure it listens to enough directories to cover every path to
> that file - or listen to the file directly. It knows from
> i_nlink and the mount map when it has enough directories.
>
> - Notifying just the access path may seem counterintuitive, but in
> fact it's what inotify and dnotify do already, and it does
> actually work. Often a listener is maintaining a cache or index
> of some kind, in which case it will already have sufficient
> knowledge about where the hard-linked files are (or know that it
> needs an initial indexing), and whether it has covered enough
> parent directories to see all accesses to them.
>
> - In practice it means each access traverses the path, following
> parent directories until reaching a mount point, broadcasting
> events on each one where there's a recursive listener. That's
> not as inefficient as it looks, because paths don't usually have
> a large number of components.
>
> - I'm not sure exactly how fast/slow it is, though, and it may need a
> few thoughtfully cached flags in each dentry to elide traversals.
> I won't discuss the details here, for fear of complicating the
> discussion too much. They might well mesh with the 'access
> decision cache' flags you mentioned.
>
> - It is necessary that link(2) create an attribute-change event
> (for i_nlink!) on the source path of the link. dnotify/inotify
> don't do that now (unless they changed recently), but they should
> to make this work.
>
> Please shoot down the idea. I think it is good enough
> for reliable subtree notifications, but I'd love to be proven wrong.
>
> -- Jamie
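
A minimal userspace sketch of the path walk in the quoted proposal may help make it concrete. It only models the idea with strings: the watch table, the listener ids and the mount_root argument (standing in for the mount point) are assumptions for illustration, not anything from the fanotify patches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of directories registered with a RECURSIVE flag. */
struct recursive_watch {
    const char *dir;    /* absolute path, no trailing slash (except "/") */
    int listener_id;
};

/*
 * Walk from the accessed path up through its parent directories, using the
 * path the application used at run time, and record every listener whose
 * watched directory lies on that path. The walk stops at mount_root.
 * Returns the number of listener ids written to out[].
 */
size_t notify_recursive(const char *access_path, const char *mount_root,
                        const struct recursive_watch *watches, size_t nwatches,
                        int *out, size_t out_cap)
{
    char buf[4096];
    size_t n = 0;

    assert(access_path[0] == '/');
    if (strlen(access_path) >= sizeof(buf))
        return 0;
    strcpy(buf, access_path);

    for (;;) {
        /* Strip the last component to get the parent directory. */
        char *slash = strrchr(buf, '/');
        if (slash == buf)
            buf[1] = '\0';          /* the parent is the root directory */
        else
            *slash = '\0';

        for (size_t i = 0; i < nwatches && n < out_cap; i++)
            if (strcmp(watches[i].dir, buf) == 0)
                out[n++] = watches[i].listener_id;

        if (strcmp(buf, mount_root) == 0 || strcmp(buf, "/") == 0)
            break;
    }
    return n;
}
```

The cost is one pass per path component, which is the "not as inefficient as it looks" argument above. A hard link reached through a different directory would simply walk a different chain of parents.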

2009-07-25 00:29:30

by Jamie Lokier

Subject: Re: fanotify - overall design before I start sending patches

Eric Paris wrote:
> But maybe I should just do the 'if you have fanotify open, you don't
> create other fanotify events'... so everyone gets what they expect...

O_NONOTIFY. Similar security concerns, more control.

The security concern is clear: If you allow a process with fanotify
open to not create events, then any (root) process can open a fanotify
socket to hide its behaviour.

> > - No file name for things that happen in a subdirectory.
>
> Actually I should be more clear about that. If you call
> setsockopt(FANOTIFY_ADD_MARK) where
>
> struct fanotify_so_inode_mark {
> __s32 fd; = "/tmp/"
> __u32 mask; = (FAN_OPEN | FAN_EVENT_ON_CHILD);
> __u32 ignored_mask; = 0
> };
>
> and someone opens /tmp/file1 you are going to get an open fd
> for /tmp/file1 NOT for /tmp. This is different than inotify.

If you use inotify to watch for IN_OPEN on /tmp, it returns an event when you
open /tmp/file1, with the name "file1" and the directory /tmp. It's
not so different. The main difference is that inotify returns a name and no
inode number (so it's racy, sigh), whereas fanotify returns an open file.
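
For comparison, here is a small sketch of the inotify side using the real inotify(7) calls. The helper names watch_opens() and next_open_name() are made up for this example, and error handling is kept to a minimum.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Start watching dir for IN_OPEN; returns an inotify fd or -1 on error. */
int watch_opens(const char *dir)
{
    int ifd = inotify_init();

    if (ifd < 0)
        return -1;
    if (inotify_add_watch(ifd, dir, IN_OPEN) < 0) {
        close(ifd);
        return -1;
    }
    return ifd;
}

/*
 * Block until an IN_OPEN event for a child of the watched directory
 * arrives and copy the child's name into name. Any other events read in
 * the same batch are dropped, which is acceptable only for a sketch.
 * Returns 0 on success, -1 on error.
 */
int next_open_name(int ifd, char *name, size_t len)
{
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));

    assert(len > 0);
    for (;;) {
        ssize_t got = read(ifd, buf, sizeof(buf));

        if (got <= 0)
            return -1;
        for (char *p = buf; p < buf + got; ) {
            const struct inotify_event *ev = (const struct inotify_event *)p;

            /* len == 0 means the watched directory itself was opened. */
            if ((ev->mask & IN_OPEN) && ev->len > 0) {
                snprintf(name, len, "%s", ev->name);
                return 0;
            }
            p += sizeof(*ev) + ev->len;
        }
    }
}
```

The listener gets only a name relative to the watched directory. By the time it opens that name itself, the name may refer to a different file, which is the race mentioned above.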

Do I see right that you need to open the directory before you can set
the mark on it?

The main reason behind inotify's design wasn't the API (although it is
better than dnotify); it was to avoid having to open thousands of
directories, and to allow a filesystem to be unmounted while it is
being watched.

Does a fanotify mark stop a filesystem from being unmounted? If not,
if the filesystem is unmounted and remounted, is the mark lost?

-- Jamie

2009-07-25 14:22:50

by Niraj kumar

Subject: Re: fanotify - overall design before I start sending patches

>
> - subtree notification.
> Currently to only watch /home and all of its descendants one must
> either register a directed watch on every directory or use a global
> listener. The global listener with ignored_mask is not as bad as it
> sounds in my testing, but decent subtree registration and notification
> would be a big win in a lot of people's mind.

Unless it's already covered in some way, I would also be interested
in notification based on "process subtree". What this means is that
for whatever notification a particular process requests, it's only
interested in events generated by itself and its children.
This is useful for auditing file-system-related access.

-Niraj

2009-07-27 16:54:38

by Jan Kara

Subject: Re: fanotify - overall design before I start sending patches

On Fri 24-07-09 23:48:14, Jamie Lokier wrote:
> > - subtree notification.
> > Currently to only watch /home and all of its descendants one must
> > either register a directed watch on every directory or use a global
> > listener. The global listener with ignored_mask is not as bad as it
> > sounds in my testing, but decent subtree registration and notification
> > would be a big win in a lot of people's mind.
>
> I believe we've talked about one suggestion for how to do this, on
> lwn.net. I'll repeat it here.
>
> Efficient recursive notifications method:
>
> - You register for event on a directory with a RECURSIVE flag "give
> me events for this directory and all paths below it".
>
> - That listener gets events for any access of the appropriate type
> whose path is via that directory, *using the specific run-time
> path used for the access*.
>
> - That _doesn't_ mean hard-link files need to know all their parent
> directories, which would be silly and impossible. The event path
> is just the one used at run-time for access, by the application
> attempting to open/write/whatever.
>
> - If a listener needs to track all accesses to a particular
> hard-linked file, it's the responsibility of the listener to
> ensure it listens to enough directories to cover every path to
> that file - or listen to the file directly. It knows from
> i_nlink and the mount map when it has enough directories.
>
> - Notifying just the access path may seem counterintuitive, but in
> fact it's what inotify and dnotify do already, and it does
> actually work. Often a listener is maintaining a cache or index
> of some kind, in which case it will already have sufficient
> knowledge about where the hard-linked files are (or know that it
> needs an initial indexing), and whether it has covered enough
> parent directories to see all accesses to them.
>
> - In practice it means each access traverses the path, following
> parent directories until reaching a mount point, broadcasting
> events on each one where there's a recursive listener. That's
> not as inefficient as it looks, because paths don't usually have
> a large number of components.
>
> - I'm not sure exactly how fast/slow it is, though, and it may need a
> few thoughtfully cached flags in each dentry to elide traversals.
> I won't discuss the details here, for fear of complicating the
> discussion too much. They might well mesh with the 'access
> decision cache' flags you mentioned.
>
> - It is necessary that link(2) create an attribute-change event
> (for i_nlink!) on the source path of the link. dnotify/inotify
> don't do that now (unless they changed recently), but they should
> to make this work.

About two years ago, I had a similar idea for lightweight persistent
recursive modification tracking. I even have a proof-of-concept patch
against 2.6.23 (attached to give an idea) which works nicely. I've aimed at
things like efficient backup or desktop indexing, which are interested in
processing lots of changes in a batch once in a longer period of time...
Actually, I believe this kind of use is quite different from the kind of use
fanotify aims at, and maybe different approaches even make sense here... My
approach is only able to give the information "something in the subtree has
changed" via an inode flag in the directory inode, and the application has
to track down what exactly it was (by recursively looking at the flags of
the subdirectories and stat'ing regular files). The benefit is that it's
rather scalable, I believe.

Generally, the trouble with this approach is that one has to handle
hardlinks, bind mounts, and filesystems which don't support persistent
storage of your attributes. It's all doable but tricky, and I'm still
trying to get all the details right in a shared library wrapping up the
kernel feature (well, one of the problems also is that I get to this for
only a few days a year :().
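
To illustrate the flag idea, here is an in-memory model only, not the attached ext3 patches. The struct node layout, the function names and the early stop at an already-flagged ancestor are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy directory tree; subtree_changed models the per-directory inode flag. */
struct node {
    struct node *parent;        /* NULL for the top of the tree */
    struct node **children;
    size_t nchildren;
    bool subtree_changed;
};

/*
 * Record a modification somewhere in dir: flag dir and its ancestors.
 * Assumption of this sketch: once a flagged ancestor is found, everything
 * above it is flagged already, so the walk can stop early.
 */
void mark_changed(struct node *dir)
{
    for (; dir != NULL && !dir->subtree_changed; dir = dir->parent)
        dir->subtree_changed = true;
}

/*
 * What the application would do in its batch run: descend only into
 * flagged directories, clearing each flag before looking below it.
 * Returns the number of directories that had to be visited.
 */
size_t scan_changed(struct node *dir)
{
    size_t visited = 1;

    assert(dir != NULL);
    if (!dir->subtree_changed)
        return 0;
    dir->subtree_changed = false;
    for (size_t i = 0; i < dir->nchildren; i++)
        visited += scan_changed(dir->children[i]);
    return visited;
}
```

The scan touches only directories on a path to some change, so its cost follows the number of changes rather than the size of the tree. In this model hard links and bind mounts are exactly where the single-parent assumption breaks down.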

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (4.01 kB)
ext3-2.6.23-1-i_flags_atomicity.diff (9.26 kB)
ext3-2.6.23-2-make_frags_unused.diff (5.54 kB)
ext3-2.6.23-3-recursive_mtime.diff (14.65 kB)

2009-07-27 17:52:35

by Evgeniy Polyakov

Subject: Re: fanotify - overall design before I start sending patches

Hi.

On Fri, Jul 24, 2009 at 10:44:01PM +0100, Jamie Lokier ([email protected]) wrote:
> > No, I will NOT EVER pass a pathname. Period. End of story. I stated
> > that if userspace wants to deal with pathnames (and they understand the
> > system setup well enough to know if pathnames even make sense to them)
> > they can use readlink(2) on /proc/self/fd
>
> That makes sense.
>
> In most cases where events trigger userspace cache or index updates,
> userspace already has enough information to calculate the path (and
> any derived data) from the inode number (in the case of non-hard-link
> files) or from the inode number of the parent directory and the name
> (not full path).

Except that rlimits may forbid opening a new file descriptor, while the
queue may still have enough room to hold another event carrying the full
or partial path name.

I will read the initial mail next, but if it is not described there: how
is the rlimit problem handled?

--
Evgeniy Polyakov

2009-07-27 18:34:48

by Andreas Dilger

Subject: Re: fanotify - overall design before I start sending patches

On Jul 25, 2009 01:29 +0100, Jamie Lokier wrote:
> Eric Paris wrote:
> > But maybe I should just do the 'if you have fanotify open, you don't
> > create other fanotify events'... so everyone gets what they expect...
>
> O_NONOTIFY. Similar security concerns, more control.
>
> The security concern is clear: If you allow a process with fanotify
> open to not create events, then any (root) process can open a fanotify
> socket to hide its behaviour.

I think the "fanotify doesn't generate more fanotify events" makes the
most sense. Given that the open will be done in the kernel specifically
due to fanotify, this doesn't actually allow the listener to open files
without detection (unlike the "O_NONOTIFY" flag would). The fanotify
"opens" would only be in response to other processes opening the file.

It might also make sense to verify that the process doing the open has
at least permission to open the file in question (i.e. root) so that
some unauthorized process cannot just get file handles to arbitrary files.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-07-27 19:24:05

by Jamie Lokier

Subject: Re: fanotify - overall design before I start sending patches

Andreas Dilger wrote:
> On Jul 25, 2009 01:29 +0100, Jamie Lokier wrote:
> > Eric Paris wrote:
> > > But maybe I should just do the 'if you have fanotify open, you don't
> > > create other fanotify events'... so everyone gets what they expect...
> >
> > O_NONOTIFY. Similar security concerns, more control.
> >
> > The security concern is clear: If you allow a process with fanotify
> > open to not create events, then any (root) process can open a fanotify
> > socket to hide its behaviour.
>
> I think the "fanotify doesn't generate more fanotify events" makes the
> most sense. Given that the open will be done in the kernel specifically
> due to fanotify, this doesn't actually allow the listener to open files
> without detection (unlike the "O_NONOTIFY" flag would). The fanotify
> "opens" would only be in response to other processes opening the file.

Nice idea (if I understand it) - the file descriptors opened by
fanotify wouldn't cause fanotify events? Effectively having an
in-kernel O_NONOTIFY flag which can't be set from userspace?

'if you have fanotify open, you don't create other fanotify events' is
too severe - it means you can circumvent fanotify entirely for
everything your process does by just opening a fanotify socket and not
using it, which is clearly worse than having an O_NONOTIFY flag.

All ways to avoid creating fanotify events introduce security and
cache integrity problems though.

Why should processes which simply _watch_ the filesystem, without
modifying any files, ever fail to receive information about file
changes? Why should they be unable to claim they provide integrity
guarantees because of the loophole?

So before we implement that loophole: let's ask ourselves if we really
need it at all.

What's wrong with fanotify-using applications generating events when
they modify files themselves?

An example was given where app A gets an event and modifies the file,
then app B gets an event and modifies the file, and app A... cycling.

But if you have two "virus checker" style applications fighting over
modifying the same file without locking, you have much bigger problems
already. There's nothing to gain by fixing the fanotify cycle.

Maybe there's no need to suppress events after all.

Programs monitoring for writes to maintain caches or integrity checks
will be much happier if they know they're getting all file modifications.

-- Jamie

2009-07-28 11:48:41

by Jon Masters

Subject: Re: fanotify - overall design before I start sending patches

On Fri, 2009-07-24 at 16:13 -0400, Eric Paris wrote:

> I plan to start sending patches for fanotify in the next week or two.

Generally, I appreciate your effort (as I'm sure does everyone else).

I agree with Jamie that it's good to consider extending inotify and also
that the special socket idea probably won't work for mainline. Also:

1). Ability to watch only certain mount-points, not just directories. Or
directories and block on mount operations as Jamie suggested. Or both :)

2). Add event on mmap perhaps. Future theoretical cloud cuckoo land
ideas include forcing all mmap operations to be read-only and then
having the page fault handler fire an event for every write so that the
anti-malware thing can monitor every single touched page...joke.

3). Sounds a lot like netlink could be close enough. Kay and others have
been playing with in-kernel multiplexing and re-broadcasting of netlink
events, and I'm pretty sure most of the rest is doable.

I'm looking forward to updatedb using this. Let's try up-playing the use
cases outside malware for this stuff. I think the average person is
going to get more excited to see "Beagle done right" or "something like
Microsoft indexer service"[0] than 1970s updatedb. It's certainly a nice
and compelling reason to get this into mainline IMO.

Jon.

[0] Except anything but as crap as their version. Seriously, the last
time I used a Windows system and looked at it, the indexer was consuming
more CPU than Beagle ever did. And I liked the Beagle concept.

2009-07-28 18:00:20

by Andreas Dilger

Subject: Re: fanotify - overall design before I start sending patches

On Jul 27, 2009 20:23 +0100, Jamie Lokier wrote:
> Andreas Dilger wrote:
> > I think the "fanotify doesn't generate more fanotify events" makes the
> > most sense. Given that the open will be done in the kernel specifically
> > due to fanotify, this doesn't actually allow the listener to open files
> > without detection (unlike the "O_NONOTIFY" flag would). The fanotify
> > "opens" would only be in response to other processes opening the file.
>
> Nice idea (if I understand it) - the file descriptors opened by
> fanotify wouldn't cause fanotify events? Effectively having an
> in-kernel O_NONOTIFY flag which can't be set from userspace?

Right.

> 'if you have fanotify open, you don't create other fanotify events' is
> too severe - it means you can circumvent fanotify entirely for
> everything your process does by just opening a fanotify socket and not
> using it, which is clearly worse than having an O_NONOTIFY flag.

That isn't what I meant... It was only "operations on fanotify file
descriptors don't produce further fanotify events", otherwise madness
ensues.

> What's wrong with fanotify-using applications generating events when
> they modify files themselves?
>
> An example was given where app A gets an event and modifies the file,
> then app B gets an event and modifies the file, and app A... cycling.
>
> But if you have two "virus checker" style applications fighting over
> modifying the same file without locking, you have much bigger problems
> already. There's nothing to gain by fixing the fanotify cycle.

I don't think they are fighting over modifying the file, they are
generating an endless stream of events for each other to process.
I don't think locking will help.

It's the same as e.g. having verbose kernel debug messages related
to filesystem/block IO going to /var/log/messages on the filesystem
you are monitoring. After the first (external) IO, it is logged
to disk, which causes an IO to the filesystem, which is logged to
disk, which causes an IO...

There has to be some way to NOT generate any events on the files
that you are monitoring.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-07-29 20:09:16

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Sat, 2009-07-25 at 19:52 +0530, Niraj kumar wrote:
> >
> > - subtree notification.
> > Currently to only watch /home and all of its descendants one must
> > either register a directed watch on every directory or use a global
> > listener. The global listener with ignored_mask is not as bad as it
> > sounds in my testing, but decent subtree registration and notification
> > would be a big win in a lot of people's mind.
>
> Unless it's already covered in some way, I would also be interested
> in notification based on "process subtree". What this means is that
> for whatever notification a particular process requests, it's only
> interested in events generated by itself and its children.
> This is useful in doing auditing for file system related access.

Has not been considered. You'd have to really flesh out the use case
before I could.....

-Eric

2009-07-29 20:09:25

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Sat, 2009-07-25 at 01:29 +0100, Jamie Lokier wrote:
> Eric Paris wrote:
> > But maybe I should just do the 'if you have fanotify open, you don't
> > create other fanotify events'... so everyone gets what they expect...
>
> O_NONOTIFY. Similar security concerns, more control.
>
> The security concern is clear: If you allow a process with fanotify
> open to not create events, then any (root) process can open a fanotify
> socket to hide its behaviour.

Let me first say I'm not sure how many useful 'security' goals can be
met using fanotify, mainly because I haven't focused on any. You can do
data integrity checking before access, but I'm saying up front that
malicious programs can almost certainly sneak information past these checks.

The 'problem' is mainly with open and open_perm, but there are problems
with read_perm as well. With just 2 groups listening to 'open' for the
same file we get an infinite event loop: each listener opens an fd on the
object in question, and each sees the open from the other listener. If
both groups are listening to open_perm you obviously have a deadlock....

The same applies if 2 fanotify groups need read_perm as they both may
need to block waiting for the decision of the other.

I have an easy solution to the 'open' problems: just don't fire the open
and open_perm events when it is fanotify doing the open. I guess I
could use file->private_data when the fanotify listener does I/O on a
fanotify-opened file to see if it was a fanotify-opened fd and ignore
only those events.... I'll try this afternoon.

So really, my plan now is to withhold from other fanotify listeners only
those events generated on fanotify-opened fds.

> Do I see right that you need to open the directory before you can set
> the mark on it?

Correct. But unlike dnotify, you don't have to keep it open.

> The main reason behind inotify's design wasn't the API (although it is
> better than dnotify); it was to avoid having to open thousands of
> directories, and to allow a filesystem to be unmounted while it is
> being watched.

I could make fanotify_so_mark pathname-based rather than fd-based. But
fd-based works very nicely in global mode, as the fd you get in the event
metadata can be used in setting marks.

You need to open the directory so you have an fd, so you can set the
mark. You can then close the directory. If the directory is deleted,
your mark is gone (forever.)

> Does a fanotify mark stop a filesystem from being unmounted? If not,
> if the filesystem is unmounted and remounted, is the mark lost?

No, it does not prevent unmounting. Yes, if it is unmounted and
remounted the marks are lost forever.

-Eric

2009-07-29 20:12:26

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Mon, 2009-07-27 at 21:52 +0400, Evgeniy Polyakov wrote:
> Hi.
>
> On Fri, Jul 24, 2009 at 10:44:01PM +0100, Jamie Lokier ([email protected]) wrote:
> > > No, I will NOT EVER pass a pathname. Period. End of story. I stated
> > > that if userspace wants to deal with pathnames (and they understand the
> > > system setup well enough to know if pathnames even make sense to them)
> > > they can use readlink(2) on /proc/self/fd
> >
> > That makes sense.
> >
> > In most cases where events trigger userspace cache or index updates,
> > userspace already has enough information to calculate the path (and
> > any derived data) from the inode number (in the case of non-hard-link
> > files) or from the inode number of the parent directory and the name
> > (not full path).
>
> Except that rlimits may forbid to open new file descriptor while queue
> length is enough to put another event with the full or partial path
> name.
>
> I will read initial mail next, but if it is not described there, how
> rlimit problem is handled?

At the moment, if you run out of rlimit fds you start getting (useless)
events with the fd equal to some errno (I don't remember offhand which
errno hitting the rlimit gives).
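
A listener therefore has to check the fd before using it. The sketch below assumes the convention described in this thread, where a negative fd carries -errno (such as the -EPERM case mentioned elsewhere), and resolves a valid fd to a path with readlink(2) on /proc/self/fd. The function name event_fd_to_path() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Resolve the path an event fd refers to.
 * event_fd >= 0: an open descriptor delivered with the event.
 * event_fd <  0: the open failed on the listener's behalf; value is -errno.
 * Returns the path length, or a negative errno value.
 */
ssize_t event_fd_to_path(int event_fd, char *path, size_t len)
{
    char link[64];
    ssize_t n;

    assert(len > 0);
    if (event_fd < 0)
        return event_fd;            /* nothing to resolve, pass -errno on */

    snprintf(link, sizeof(link), "/proc/self/fd/%d", event_fd);
    n = readlink(link, path, len - 1);
    if (n < 0)
        return -errno;
    path[n] = '\0';                 /* readlink() does not NUL-terminate */
    return n;
}
```

The returned name is only what /proc reports at that moment. It is useful for logging or cache lookups, but the descriptor itself remains the reliable handle on the object.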

-Eric

2009-07-29 20:14:22

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Mon, 2009-07-27 at 12:33 -0600, Andreas Dilger wrote:
> On Jul 25, 2009 01:29 +0100, Jamie Lokier wrote:

> It might also make sense to verify that the process doing the open has
> at least permission to open the file in question (i.e. root) so that
> some unauthorized process cannot just get file handles to arbitrary files.

All the usual permission checks between the listener process and the
object are performed. It's quite possible to get fanotify events where
the fd = -EPERM.

-Eric

2009-07-29 20:16:52

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Mon, 2009-07-27 at 20:23 +0100, Jamie Lokier wrote:
> Andreas Dilger wrote:
> > On Jul 25, 2009 01:29 +0100, Jamie Lokier wrote:
> > > Eric Paris wrote:

> What's wrong with fanotify-using applications generating events when
> they modify files themselves?
>
> An example was given where app A gets an event and modifies the file,
> then app B gets an event and modifies the file, and app A... cycling.

No, the example was the 'open' which the kernel does on behalf of the
listener. I'm thinking now I should only exclude

OPEN
OPEN_PERM
ACCESS_PERM

as those are the only 3 event types I can see deadlock/recursion
problems with.

2009-07-29 20:22:26

by Eric Paris

Subject: Re: fanotify - overall design before I start sending patches

On Tue, 2009-07-28 at 07:48 -0400, Jon Masters wrote:
> On Fri, 2009-07-24 at 16:13 -0400, Eric Paris wrote:
>
> > I plan to start sending patches for fanotify in the next week or two.
>
> Generally, I appreciate your effort (as I'm sure does everyone else).
>
> I agree with Jamie that it's good to consider extending inotify and also
> that the special socket idea probably won't work for mainline. Also:

The special socket idea was Alan Cox's idea and I haven't heard a usable
alternative.

> 1). Ability to watch only certain mount-points, not just directories. Or
> directories and block on mount operations as Jamie suggested. Or both :)

Show me the user and I'll consider it.

> 2). Add event on mmap perhaps. Future theoretical cloud cuckoo land
> ideas include forcing all mmap operations to be read-only and then
> having the page fault handler fire an event for every write so that the
> anti-malware thing can monitor every single touched page...joke.
>
> 3). Sounds a lot like netlink could be close enough. Kay and others have
> been playing with in-kernel multiplexing and re-broadcasting of netlink
> events, and I'm pretty sure most of the rest is doable.

Jeez, about the 50th time someone has said netlink. I need to do the fd
open in the context of the receiving process. How do I do that with
netlink? It cannot be done on the netlink message send side (which is the
context of the original process accessing the file).

> I'm looking forward to updatedb using this.

Well, that's still in the future work, as all updatedb cares about is
rename events, and the kernel does have enough information to send
fanotify events during rename.

> Let's try up-playing the use
> cases outside malware for this stuff.

I'm not playing any spin or bullshit. I've got HSM users who want it.
Readahead profiling wants it. I'm told that wine can properly do
windows style notification with it instead of some hack they have now.
I've also got a need to get people who want to run integrity
checkers/virus scanners to stop binary patching/hacking my kernels. I'd
say I've found plenty of use cases today even if you don't find them as
sexy as beagle :)

-Eric