LinuxLists.cc - fanotify: the fscking all notification system

2009-06-29 20:09:16

Subject: fanotify: the fscking all notification system

So it's back to that time. I'm not quite sure how to present fanotify.
I can start sending patches (they are available), but this message is
just going to be a re-intro: what questions and problems are still out
there?

Long ago the anti-malware vendors started asking the community for a
reasonable way to do on-access file scanning. Historically they have
used syscall table rewrites and binary LSM hook hacks to get their
information. Customers and Linux users keep demanding this stuff, and in
an effort to give them a supportable method to use these products I have
been working to develop fanotify.

fanotify provides two things:
1) a new notification system, sorta like inotify, only instead of an
arbitrary 'watch descriptor' which userspace has to know how to map back
to an object on the filesystem, fanotify provides an open read-only fd
back to the original object. It should be noted that the set of
fanotify events is much smaller than the set of inotify events.

2) an access system in which processes may be blocked until the fanotify
userspace listener has decided if the operation should be allowed.

There was a long discussion in which I was asked to define the security
model being implemented and at the end of the day the answer is that
there is no security model here. This is NOT an LSM. This is not
intended to provide system security. fanotify is intended to provide an
interface for on access file scanning and permissions gating based on
the results of those scans. fanotify does not prevent, nor does it
attempt to prevent, malicious code running on the Linux machine. Read
that again: once malicious code is running on the Linux machine this
interface (along with whatever magic someone creates in userspace) is
not intended to prevent malicious actions. There is some hope in that
if userspace can identify the malicious code it could prevent it from
ever being executed by a normal program, and so there is clearly a
security benefit possible, but it is a very, very weak assurance. Those
long discussions can be found at:
http://thread.gmane.org/gmane.linux.kernel.malware/22
http://thread.gmane.org/gmane.linux.kernel/716539

fanotify is close to working, although some of the 'features' are
completely untested and a couple are unimplemented but it's pretty
close. It's currently implemented over 34 patches which hopefully are
each small enough for good review, I'll be sending them a couple or so
at a time for review but first I want to make sure we are all on the
same page....

fanotify has two basic 'modes': directed and global. fanotify directed
works much like inotify in that userspace marks inodes it is interested
in and gets events from those inodes. fanotify global instead indicates
that it wants everything on the system and then individually marks
inodes that it doesn't care about. They both have the same userspace
interface and rely on the same fsnotify in-kernel infrastructure
(although the infrastructure did have to be modified to support the
global listener concept).

In either case the fanotify userspace interface is based on socket calls
loosely of this format.

1) open an fanotify socket
2) bind the socket; here you define yourself as directed or global, and
if global, define all the events you want.
2.5) if directed call setsockattr to attach marks to inodes you care
about.
3) call getsockattr on the socket to get back data about events that
took place and to get fd's opened in your context

At the very end of the message is a small program which might even
build and which will printf for every single open that takes place on
the system, as a reference for a brief understanding of the interface
(although it does not provide an example of access decisions).

fanotify has a limited set of events: open, close, access (read),
modify (write), and a permissions event for open and modify. fanotify
provides no means to notice mv/rename. This is something I plan to look
into to simplify fanotify's use by file indexers, but at this time
the requisite information is not available in the right places in the
kernel.

When userspace gets an event it comes in the form of one or more struct
fanotify_event_metadata in the getsockopt buffer.

struct fanotify_event_metadata {
	__u32 event_len;
	__s32 fd;
	__u32 mask;
	__u32 f_flags;
	pid_t pid;
	pid_t tgid;
	__u64 cookie;
} __attribute__((packed));

This provides information about the event including the type, the
location of the new fd that was opened pointing to the object in
question, and it provides information about the process which triggered
the event.
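
As a rough illustration of how a listener might walk a buffer holding
several of these records, here is a minimal sketch. The struct is a local
copy of the layout above using fixed-width types. The FAN_EVENT_OK and
FAN_EVENT_NEXT definitions, the FAN_ACCESS/FAN_OPEN values, and the
count_events() helper are assumptions made for illustration; the actual
definitions live in the patch set and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder event bits; the numeric values are assumptions. */
#define FAN_ACCESS 0x01u
#define FAN_OPEN   0x20u

/* Local copy of the layout from the message, with fixed-width types. */
struct fanotify_event_metadata {
	uint32_t event_len;
	int32_t  fd;
	uint32_t mask;
	uint32_t f_flags;
	int32_t  pid;
	int32_t  tgid;
	uint64_t cookie;
} __attribute__((packed));

#define FAN_EVENT_METADATA_LEN ((long)sizeof(struct fanotify_event_metadata))

/* Assumed definition: a record is usable if it fits in the remaining bytes. */
#define FAN_EVENT_OK(meta, len) \
	((long)(len) >= FAN_EVENT_METADATA_LEN && \
	 (long)(meta)->event_len >= FAN_EVENT_METADATA_LEN && \
	 (long)(meta)->event_len <= (long)(len))

/* Assumed definition: consume one record and step to the next. */
#define FAN_EVENT_NEXT(meta, len) \
	((len) -= (meta)->event_len, \
	 (struct fanotify_event_metadata *)((char *)(meta) + (meta)->event_len))

/* Hypothetical helper: count records whose mask intersects `wanted`. */
int count_events(void *buf, long len, uint32_t wanted)
{
	struct fanotify_event_metadata *meta = buf;
	int n = 0;

	assert(buf != NULL);
	while (FAN_EVENT_OK(meta, len)) {
		if (meta->mask & wanted)
			n++;
		meta = FAN_EVENT_NEXT(meta, len);
	}
	return n;
}
```

Because each record carries its own event_len, a loop like this keeps
working if later versions append fields to the record.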

If the event was a permissions gating event type (FAN_ACCESS_PERM |
FAN_OPEN_PERM) then cookie will be non-zero and userspace will need to
tell the kernel if the original calling process should be allowed or
denied. This is done with a setsockopt() call passing the

struct fanotify_so_access {
	__u64 cookie;
	__u32 response;
} __attribute__((packed));

The answer carries the cookie from the event in question and the
response (allow/deny).
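
A minimal sketch of how a listener might fill in that answer, assuming
only what is stated above: a zero cookie means no answer is expected, and
a non-zero cookie must be echoed back with a verdict. The FAN_ALLOW and
FAN_DENY values and the build_response() helper are placeholders invented
for this example; the message does not give the real constants or the
setsockopt() option name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the layout from the message, with fixed-width types. */
struct fanotify_so_access {
	uint64_t cookie;
	uint32_t response;
} __attribute__((packed));

/* Placeholder verdict values; the real ones come from the patch set. */
enum { FAN_ALLOW = 1, FAN_DENY = 2 };

/*
 * Hypothetical helper. Returns 1 and fills *out when the event needs an
 * answer, or 0 for a plain notification (cookie == 0).
 */
int build_response(uint64_t event_cookie, int allow,
		   struct fanotify_so_access *out)
{
	assert(out != NULL);
	if (event_cookie == 0)
		return 0;

	memset(out, 0, sizeof(*out));
	out->cookie = event_cookie;	/* ties the answer to the blocked event */
	out->response = allow ? FAN_ALLOW : FAN_DENY;
	return 1;
}
```

The filled-in struct would then be handed to setsockopt() on the fanotify
socket, using whatever option name the patches define.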

The third type of message, the inode mark, is done by passing

struct fanotify_so_inode_mark {
	__s32 fd;
	__u32 mask;
	__u32 ignored_mask;
} __attribute__((packed));

to a setsockopt() call. If using fanotify in a 'directed' manner, this
marks an inode, and mask lists the events on it that we are interested
in. The ignored mask is used to indicate events we no longer want to
hear, although the ignored mask is cleared on inode modification. So if
one were to register FAN_ACCESS and, after the first one, send
FAN_ACCESS in the ignored_mask, userspace would not get any more
FAN_ACCESS events until after the inode was next modified.

fanotify global groups use these similarly, only they are unable to set
anything in the mask and can only use the ignored_mask.
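
The mask and ignored_mask rules for both modes can be captured in a small
userspace model. This is only a sketch of the semantics as described in
this message, not kernel code: the toy_* names and the numeric event
values are invented, and clearing the ignored mask before the modify
event itself is evaluated is a modeling assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder event bits; the numeric values are assumptions. */
#define FAN_ACCESS 0x01u
#define FAN_MODIFY 0x02u
#define FAN_OPEN   0x20u

/* A listener: global groups carry their event mask on the group itself. */
struct toy_group {
	bool     global;
	uint32_t mask;		/* only meaningful when global is true */
};

/* Per-inode mark: directed groups use mask, both modes use ignored_mask. */
struct toy_mark {
	uint32_t mask;
	uint32_t ignored_mask;
};

/* Would the listener be told about `event` on the inode carrying `mark`? */
bool toy_deliver(const struct toy_group *g, struct toy_mark *mark,
		 uint32_t event)
{
	uint32_t interested;

	assert(g != NULL && mark != NULL);

	/* A modification resets whatever was being ignored on this inode. */
	if (event & FAN_MODIFY)
		mark->ignored_mask = 0;

	interested = g->global ? g->mask : mark->mask;
	return (interested & event) && !(mark->ignored_mask & event);
}
```

In this model a directed listener that sets FAN_ACCESS in ignored_mask
stops seeing accesses until the next FAN_MODIFY on that inode, and a
global listener sees everything in its group mask except on inodes where
it has set a matching ignored_mask.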

So what problems do people have? What complaints? What questions?
What do you want to know? What do you wish it could do? How could this
interface be better? What other information do you want?

Later today a 'working' set of fanotify patches should be available at
git://git.infradead.org/users/eparis/notify.git fanotify-experimental
THIS BRANCH WILL REGULARLY REBASE, I'm not trying to work nicely with
downstream trees! Patches gladly accepted, merge requests? not so much.

[paris@paris kernel-2]$ git diff f82c9a712458d835 | diffstat -p1
fs/compat.c | 5
fs/exec.c | 7
fs/nfsd/vfs.c | 4
fs/notify/Kconfig | 13
fs/notify/Makefile | 2
fs/notify/dnotify/dnotify.c | 7
fs/notify/fanotify/Kconfig | 27 +
fs/notify/fanotify/Makefile | 1
fs/notify/fanotify/af_fanotify.c | 694 +++++++++++++++++++++++++++++++++++
fs/notify/fanotify/af_fanotify.h | 21 +
fs/notify/fanotify/fanotify.c | 364 ++++++++++++++++++
fs/notify/fanotify/fanotify.h | 38 +
fs/notify/fsnotify.c | 86 +++-
fs/notify/fsnotify.h | 9
fs/notify/group.c | 128 +++++-
fs/notify/inode_mark.c | 16
fs/notify/inotify/inotify_fsnotify.c | 50 ++
fs/notify/inotify/inotify_user.c | 4
fs/notify/notification.c | 167 +++++---
fs/notify/second_q.c | 128 ++++++
fs/open.c | 2
fs/read_write.c | 8
include/linux/Kbuild | 1
include/linux/fanotify.h | 134 ++++++
include/linux/fsnotify.h | 60 ++-
include/linux/fsnotify_backend.h | 80 +++-
include/linux/init_task.h | 8
include/linux/sched.h | 4
include/linux/security.h | 5
include/linux/socket.h | 5
kernel/audit_tree.c | 7
kernel/audit_watch.c | 7
kernel/fork.c | 5
net/core/sock.c | 6
security/security.c | 18
35 files changed, 1955 insertions(+), 166 deletions(-)

Example program to printf for every open on a system!

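#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/fanotify.h>	/* added by the fanotify patches */

int main(void)
{
	int fan_fd, len;
	struct fanotify_addr addr;
	socklen_t socklen;
	char buf[4096];
	struct fanotify_event_metadata *metadata;

	/* describe ourselves: a global listener that wants every open */
	memset(&addr, 0, sizeof(addr));
	addr.family = AF_FANOTIFY;
	addr.group_num = 123456;
	addr.priority = 32768;
	addr.mask = FAN_OPEN | FAN_GLOBAL_LISTENER;

	fan_fd = socket(PF_FANOTIFY, SOCK_RAW, 0);
	if (fan_fd < 0) {
		perror("socket");
		return 1;
	}
	if (bind(fan_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind");
		return 1;
	}

	while (1) {
		/* fetch as many events as fit in buf */
		socklen = sizeof(buf);
		if (getsockopt(fan_fd, SOL_FANOTIFY, FANOTIFY_GET_EVENT,
			       buf, &socklen) < 0) {
			perror("getsockopt");
			return 1;
		}
		metadata = (struct fanotify_event_metadata *)buf;
		len = socklen;
		while (FAN_EVENT_OK(metadata, len)) {
			printf("got event!\n");
			/* the fd was opened in our context; don't leak it */
			close(metadata->fd);
			metadata = FAN_EVENT_NEXT(metadata, len);
		}
	}
}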

2009-06-30 14:36:49

by Valdis Klētnieks

Subject: Re: fanotify: the fscking all notification system

On Mon, 29 Jun 2009 16:08:45 EDT, Eric Paris said:

> fanotify provides two things:
> 1) a new notification system, sorta like inotify, only instead of an
> arbitrary 'watch descriptor' which userspace has to know how to map back
> to an object on the filesystem, fanotify provides an open read-only fd
> back to the original object. It should be noted that the set of
> fanotify events is much smaller than the set of inotify events.
>
> 2) an access system in which processes may be blocked until the fanotify
> userspace listener has decided if the operation should be allowed.

I don't care much about virus scanners - but some of us with petabytes of
disk space to manage could use this for HSM applications. The HSM daemon
could fanotify on the file, notice that the file accessed referred to a
special "I've been archived" stub/token, and put the file back before
giving the go-ahead to the process.

The only sticky question - does this happen early enough that the accessing
process, when un-blocked, will continue through open() and get the *new*
version of the newly restored file?

2009-06-30 17:13:33

by Bryan Donlan

Subject: Re: fanotify: the fscking all notification system

On Tue, Jun 30, 2009 at 9:22 AM, <[email protected]> wrote:
> On Mon, 29 Jun 2009 16:08:45 EDT, Eric Paris said:
>
>> fanotify provides two things:
>> 1) a new notification system, sorta like inotify, only instead of an
>> arbitrary 'watch descriptor' which userspace has to know how to map back
>> to an object on the filesystem, fanotify provides an open read-only fd
>> back to the original object. It should be noted that the set of
>> fanotify events is much smaller than the set of inotify events.
>>
>> 2) an access system in which processes may be blocked until the fanotify
>> userspace listener has decided if the operation should be allowed.
>
> I don't care much about virus scanners - but some of us with petabytes of
> disk space to manage could use tis for HSM applications. The HSM daemon
> could fanotify on the file, notice that the file accessed referred to a
> special "I've been archived" stub/token, and put the file back before
> giving the go-ahead to the process.
>
> The only sticky question - does this happen early enough that the accessing
> process, when un-blocked, will continue through open() and get the *new*
> version of the newly restored file?
>

In that particular case, why not just make it a sparse file (so the
size fields in stat are correct), and put back the original data in
the very same inode?

2009-06-30 17:26:58

by Eric Paris

Subject: Re: fanotify: the fscking all notification system

On Tue, 2009-06-30 at 09:22 -0400, [email protected] wrote:
> On Mon, 29 Jun 2009 16:08:45 EDT, Eric Paris said:
>
> > fanotify provides two things:
> > 1) a new notification system, sorta like inotify, only instead of an
> > arbitrary 'watch descriptor' which userspace has to know how to map back
> > to an object on the filesystem, fanotify provides an open read-only fd
> > back to the original object. It should be noted that the set of
> > fanotify events is much smaller than the set of inotify events.
> >
> > 2) an access system in which processes may be blocked until the fanotify
> > userspace listener has decided if the operation should be allowed.
>
> I don't care much about virus scanners - but some of us with petabytes of
> disk space to manage could use tis for HSM applications. The HSM daemon
> could fanotify on the file, notice that the file accessed referred to a
> special "I've been archived" stub/token, and put the file back before
> giving the go-ahead to the process.
>
> The only sticky question - does this happen early enough that the accessing
> process, when un-blocked, will continue through open() and get the *new*
> version of the newly restored file?

Yes. Although it's going to take just a tiny bit of work from me, and
you are limited in how you can use it.

You'll need to request FAN_OPEN_PERM on the file in question. When the
file is opened you will need to bring in all the data and write all that
data into the file. There is no mechanism to only bring in a piece of a
file since you have no way of knowing what exactly the requesting
process is requesting. So it's a whole file or nothing type of deal.

Current problems:

1) the fd the fanotify listener gets is O_RDONLY. I think I'll add an
"f_flags" option which if 0 defaults to O_RDONLY|O_LARGEFILE like we
have today but which you could use to indicate O_RDWR or O_WRONLY. We
currently give O_RDONLY since you can't request O_WR* on files/libraries
mapped exec, so this won't work for executables.....

2) Right now you have 5 seconds to answer an fanotify permissions
request; if you don't answer within 5 seconds you are done and the
original process gets an allow. But I have a half-finished patch which
would allow you to delay them infinitely. As long as you keep them
delayed you can modify the file they are about to access however you
like.

You cannot rename over the fd you are given; you have to actually copy
the data into that fd, since the inode behind the fd you got from
fanotify is the same inode the original process is going to get access
to after your listener sends an allow.
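
To illustrate the "copy into the same inode" requirement, here is a
minimal sketch of what an HSM listener might do while the opener is held,
assuming it has been handed a writable fd (which, per problem 1, the
current code does not yet provide). restore_in_place() is a hypothetical
helper using only standard POSIX calls; where the archived data comes
from is left out.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the contents of the inode behind dst_fd
 * with everything readable from src_fd. The inode itself is kept, so a
 * process blocked in open() sees the restored data once it is allowed
 * to continue. Returns 0 on success, -1 on error.
 */
int restore_in_place(int dst_fd, int src_fd)
{
	char buf[65536];
	ssize_t n;

	assert(dst_fd >= 0 && src_fd >= 0);

	if (lseek(src_fd, 0, SEEK_SET) < 0 || lseek(dst_fd, 0, SEEK_SET) < 0)
		return -1;
	if (ftruncate(dst_fd, 0) < 0)	/* throw away the archive stub */
		return -1;

	while ((n = read(src_fd, buf, sizeof(buf))) != 0) {
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		/* write() may be short, so loop until the chunk is out */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(dst_fd, buf + off, (size_t)(n - off));
			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
	}
	return 0;
}
```

Only after this returns successfully would the listener send its allow
for the pending FAN_OPEN_PERM event.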

Does this sound good to you? I don't know your details, but I believe
that fanotify directed mode with access permissions is going to be
almost exactly what you want....

-Eric

2009-07-01 04:10:45

by Matt Helsley

Subject: Re: fanotify: the fscking all notification system

On Mon, Jun 29, 2009 at 04:08:45PM -0400, Eric Paris wrote:
> So it's back to that time. I'm not quite sure how to present fanotify.
> I can start sending patches (they are available), but this message is
> just going to be a re-into, what questions and problems are still out
> there?
>
> Long ago the anti-malware vendors started asking the community for a
> reasonable way to do on access file scanning, historically they have
> used syscall table rewrites and binary LSM hook hacks to get their
> information. Customers and Linux users keep demanding this stuff and in
> an effort give them a supportable method to use these products I have
> been working to develop fanotify.
>
> fanotify provides two things:
> 1) a new notification system, sorta like inotify, only instead of an
> arbitrary 'watch descriptor' which userspace has to know how to map back
> to an object on the filesystem, fanotify provides an open read-only fd
> back to the original object. It should be noted that the set of
> fanotify events is much smaller than the set of inotify events.
>
> 2) an access system in which processes may be blocked until the fanotify
> userspace listener has decided if the operation should be allowed.
>
> There was a long discussion in which I was asked to define the security
> model being implemented and at the end of the day the answer is that
> there is no security model here. This is NOT an LSM. This is not
> intended to provide system security. fanotify is intended to provide an
> interface for on access file scanning and permissions gating based on
> the results of those scans. fanotify does not prevent, nor does it
> attempt to prevent, malicious code running on the Linux machine. Read
> that again, once malicious code is running on the Linux machine this
> interface (along with whatever magic someone creates in userspace) is
> not intended to prevent malicious actions. There is some hope in that
> if userspace can identify the malicious code it could prevent it from
> every being executed by a normal program and so there is clearly
> security benefit possible, but it is a very very weak assurance. Those
> long discussion can be found at:
> http://thread.gmane.org/gmane.linux.kernel.malware/22
> http://thread.gmane.org/gmane.linux.kernel/716539
>
> fanotify is close to working, although some of the 'features' are
> completely untested and a couple are unimplemented but it's pretty
> close. It's currently implemented over 34 patches which hopefully are
> each small enough for good review, I'll be sending them a couple or so
> at a time for review but first I want to make sure we are all on the
> same page....
>
> fanotify has two basic 'modes' directed and global. fanotify directed
> works much like inotify in that userspace marks inodes it is interested
> in and gets events from those inodes. fanotify global instead indicates
> that it wants everything on the system and then individually marks
> inodes that it doesn't care about. They both have the same userspace
> interface and rely on the same fsnotify in kernel infrastrucute
> (although the infrastructure did have to modified to support the global
> listener concept)
>
> In either case the fanotify userspace interface is based on socket calls
> loosely of this format.
>
> 1) open an fanotify socket
> 2) bind the socket here you define yourself and directed or global and
> if global define all the events you want.
> 2.5) if directed call setsockattr to attach marks to inodes you care
> about.
> 3) call getsockattr on the socket to get back data about events that
> took place and to get fd's opened in your context
>
> At the very end of the message is a small program which, might even
> build, and will printf for every single open that takes place on the
> system as a reference for a brief understanding of the interface.
> (although it does not provide an example of access decisions)
>
> fanotify has a limited set of events, open, close, access(read),
> modify(write) and a permissions event for open and modify. fanotify
> provides no means to notice mv/rename. This is something I plan to look
> into to simplify fanotify's use for use file indexers, but at this time
> the requisite information is not available in the right places in the
> kernel.
>
> When userspace gets an event it comes in the form of one or more struct
> fanotify_event_metadata in the getsockopt buffer.
>
> struct fanotify_event_metadata {
> __u32 event_len;
> __s32 fd;
> __u32 mask;
> __u32 f_flags;
> pid_t pid;
> pid_t tgid;
> __u64 cookie;
> } __attribute__((packed));

Since it passes pids from the kernel to userspace via a socket I suspect
this needs input from the folks working on pid namespaces. The events may
need to be dropped if the pid namespace of the event's origin doesn't
match that of the destination. Otherwise the pid would be ambiguous and
this interface will only work for tasks in the initial pid namespace.

<snip>

> Later today a 'working' set of fanotify patches should be available at
> git://git.infradead.org/users/eparis/notify.git fanotify-experimental
> THIS BRANCH WILL REGULARLY REBASE, I'm not trying to work nicely with
> downstream trees! Patches gladly accepted, merge requests? not so much.

Cheers,
-Matt Helsley

2009-07-02 15:02:53

by Jan Engelhardt

Subject: Re: fanotify: the fscking all notification system

On Monday 2009-06-29 22:08, Eric Paris wrote:
>
>struct fanotify_event_metadata {
> __u32 event_len;
> __s32 fd;
> __u32 mask;
> __u32 f_flags;
> pid_t pid;
> pid_t tgid;
> __u64 cookie;
>} __attribute__((packed));

Why pid_t and not s32 like the other fields are?

>3) call getsockattr on the socket to get back data about events that
>took place and to get fd's opened in your context
>[..].
>This provides information about the event including the type, the
>location of the new fd that was opened pointing to the object in
>question, and it provides information about the process which triggered
>the event.

Now the question would be whether getsockattr (meant getsockopt?)
returns data about one event or multiple events.

Why is not read() used? Sure, it is not as easy, but if you require
the size argument to read() being at least sizeof(event_metadata),
things should fit, should not they?
Using getsockopt feels like an ioctl - and they for one are
more or less discouraged in light of netlink for example.

>fanotify has two basic 'modes' directed and global. fanotify directed
>works much like inotify in that userspace marks inodes it is interested
>in and gets events from those inodes. fanotify global instead indicates
>that it wants everything on the system and then individually marks
>inodes that it doesn't care about.

So indexers and such only need a single fd for getting notifications
about touched files? Sounds great.

2009-07-02 15:22:32

by Eric Paris

Subject: Re: fanotify: the fscking all notification system

On Thu, 2009-07-02 at 17:02 +0200, Jan Engelhardt wrote:
> On Monday 2009-06-29 22:08, Eric Paris wrote:
> >
> >struct fanotify_event_metadata {
> > __u32 event_len;
> > __s32 fd;
> > __u32 mask;
> > __u32 f_flags;
> > pid_t pid;
> > pid_t tgid;
> > __u64 cookie;
> >} __attribute__((packed));
>
> Why pid_t and not s32 like the other fields are?

I'll change them, just to be sure userspace and kernel space stay
consistent.
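
For illustration, a sketch of what the record could look like with only
fixed-width fields, written with the userspace <stdint.h> equivalents of
__u32/__s32/__u64. The struct name is made up for this example; the point
is that with every field fixed-width and the struct packed, the size and
offsets can be checked at compile time and no longer depend on how a
given C library defines pid_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width variant of the event record. */
struct fanotify_event_metadata_fixed {
	uint32_t event_len;
	int32_t  fd;
	uint32_t mask;
	uint32_t f_flags;
	int32_t  pid;		/* was pid_t */
	int32_t  tgid;		/* was pid_t */
	uint64_t cookie;
} __attribute__((packed));

/* Six 32-bit fields plus one 64-bit field: 6 * 4 + 8 = 32 bytes. */
static_assert(sizeof(struct fanotify_event_metadata_fixed) == 32,
	      "event record must be 32 bytes in every build");
static_assert(offsetof(struct fanotify_event_metadata_fixed, pid) == 16,
	      "pid must sit at byte offset 16");
static_assert(offsetof(struct fanotify_event_metadata_fixed, cookie) == 24,
	      "cookie must sit at byte offset 24");
```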

> >3) call getsockattr on the socket to get back data about events that
> >took place and to get fd's opened in your context
> >[..].
> >This provides information about the event including the type, the
> >location of the new fd that was opened pointing to the object in
> >question, and it provides information about the process which triggered
> >the event.
>
> Now the question would be whether getsockattr (meant getsockopt?)
> returns data about one event or multiple events.

yes, getsockopt(). Currently it fills your buffer with as many of these
as it can. The FAN_EVENT_OK and FAN_EVENT_NEXT macros are intended to
make it easy to process the whole buffer full.

> Why is not read() used? Sure, it is not as easy, but if you require
> the size argument to read() being at least sizeof(event_metadata),
> things should fit, should not they?
> Using getsockopt feels like an ioctl - and they for one are
> more or less discouraged in light of netlink for example.

I can't think of a reason why the event polling process couldn't be done
using read() on the socket instead of getsockopt(). I'll try to code
something up today to make sure it works like I think it will.
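
A rough sketch of what read()-based retrieval could look like from the
listener's side, assuming the rule Jan suggests: the buffer must be able
to hold at least one event record. read_events() is a hypothetical
wrapper that mirrors that rule in userspace, and the struct is a local
fixed-width copy of the layout; nothing here is taken from the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the event layout, with fixed-width types. */
struct fanotify_event_metadata {
	uint32_t event_len;
	int32_t  fd;
	uint32_t mask;
	uint32_t f_flags;
	int32_t  pid;
	int32_t  tgid;
	uint64_t cookie;
} __attribute__((packed));

/*
 * Hypothetical wrapper: refuse buffers too small for a single record
 * (errno = EINVAL), otherwise read once, retrying only on EINTR.
 * Returns the number of bytes read, 0 at end of stream, or -1 on error.
 */
ssize_t read_events(int fd, void *buf, size_t buflen)
{
	ssize_t n;

	assert(buf != NULL);
	if (buflen < sizeof(struct fanotify_event_metadata)) {
		errno = EINVAL;
		return -1;
	}
	do {
		n = read(fd, buf, buflen);
	} while (n < 0 && errno == EINTR);
	return n;
}
```

A side effect of using read() is that the fd also fits naturally into
poll()/select() loops alongside other descriptors.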

> >fanotify has two basic 'modes' directed and global. fanotify directed
> >works much like inotify in that userspace marks inodes it is interested
> >in and gets events from those inodes. fanotify global instead indicates
> >that it wants everything on the system and then individually marks
> >inodes that it doesn't care about.
>
> So indexers and such only need a single fd for getting notifications
> about touched files? Sounds great.

That's the gist, then again, that's how inotify is supposed to work too
(although the indexer in inotify has to use the wd to open the file that
just got touched, fanotify gives you the open fd with the event)

-Eric

2009-07-07 19:42:31

by Valdis Klētnieks

Subject: Re: fanotify: the fscking all notification system

On Tue, 30 Jun 2009 13:26:37 EDT, Eric Paris said:

(Sorry for the late reply...)

> Yes. Although it going to take just a tiny bit of work from me and you
> are limited in how to use it.
>
> You'll need to request FAN_OPEN_PERM on the file in question. When the
> file is opened you will need to bring in all the data and write all that
> data into the file. There is no mechanism to only bring in a piece of a
> file since you have no way of knowing what exactly the requesting
> process is requesting. So it's a whole file or nothing type of deal.

That's OK - most current HSM implementations work that way already - if they
archive a file, they write out a little stub that basically says "The data is
on tape 948324, file 3342", and on access, the entire file is retrieved.

The only problem is that when we finally return control to the user program,
we need to have wiped out that stub and replaced it with the user's data.

> Current problems:
>
> 1) the fd the fanotify listener gets is O_RDONLY. I think I'll add an
> "f_flags" option which if 0 defaults to O_RDONLY|O_LARGEFILE like we
> have today but which you could use to indicate O_RDWR or O_WRONLY. We
> currently give O_RDONLY since you can't request O_WR* on files/libraries
> mapped exec, so this won't work for executables.....

Having only O_RDONLY is a show-stopper, because then we can't replace the file
contents before continuing. For many places, the executables and shared
libraries aren't the problem, so "You can't HSM an executable" as a
semi-permanent restriction isn't too bad.

It's the terabytes of data files that are usually the issue - everything from
4-year-old Word documents to video (actual use case - a transportation safety
group here is instrumenting several hundred tractor-trailers with a dozen
cameras that are all rolling at all times the engine is on. That's a lot of
archival video). An HSM would want O_RDWR for those - RD so it can check and
see if there's an archival stub present, and then WR to restore the file with.

> 2) Right now you have 5 seconds to answer an fanotify permissions
> request, if you don't get it in 5 seconds you are done and the original
> process gets an allow. But I have a half finished patch which would
> allow you to delay them infinitely. As long as you keep them delayed
> you can modify the file they are about to access however you like.

That would work fine.

> You can not rename over the fd you are given, you have to actually copy
> the data into that fd, since the inode behind the fd you got from
> fanotify is the same inode the original process is going to get access
> to after your listener sends an allow.

That's an acceptable restriction as well.

> This sound good to you? I don't know your details, but I believe that
> fanotify directed mode with access permissions is going to be almost
> exactly what you want....

Yes, it's sounding like it. If that's in mainstream, then I can deploy an HSM
without needing kernel hackery like most do currently.

2009-07-07 20:57:50

by Eric Paris

Subject: Re: fanotify: the fscking all notification system

On Tue, 2009-07-07 at 15:41 -0400, [email protected] wrote:
> On Tue, 30 Jun 2009 13:26:37 EDT, Eric Paris said:

> > 1) the fd the fanotify listener gets is O_RDONLY. I think I'll add an
> > "f_flags" option which if 0 defaults to O_RDONLY|O_LARGEFILE like we
> > have today but which you could use to indicate O_RDWR or O_WRONLY. We
> > currently give O_RDONLY since you can't request O_WR* on files/libraries
> > mapped exec, so this won't work for executables.....
>
> Having only O_RDONLY is a show-stopper, because then we can't replace the file
> contents before continuing. For many places, the executables and shared
> libraries aren't the problem, so "You can't HSM an executable" as a
> semi-permanent restriction isn't too bad.

My current devel work allows you to set the open flags. So an HSM would
use O_RDWR|O_LARGEFILE whereas a file scanner/indexer which cares about
executables/libraries would use O_RDONLY|O_LARGEFILE.

> > 2) Right now you have 5 seconds to answer an fanotify permissions
> > request, if you don't get it in 5 seconds you are done and the original
> > process gets an allow. But I have a half finished patch which would
> > allow you to delay them infinitely. As long as you keep them delayed
> > you can modify the file they are about to access however you like.
>
> That would work fine.

I've added the ability to delay indefinitely. Maybe your tape
robot/intern is slow/busy. I'll point you to code when I want you to
see it :)

> Yes, it's sounding like it. If that's in mainstream, then I can deploy an HSM
> without needing kernel hackery like most do currently.

I'm trying to be useful!

-Eric

2009-07-11 19:17:57

by Sukadev Bhattiprolu

Subject: Re: fanotify: the fscking all notification system

| > struct fanotify_event_metadata {
| > __u32 event_len;
| > __s32 fd;
| > __u32 mask;
| > __u32 f_flags;
| > pid_t pid;
| > pid_t tgid;
| > __u64 cookie;
| > } __attribute__((packed));
|
| Since it passes pids from the kernel to userspace via a socket I suspect
| this needs input from the folks working on pid namespaces. The events may
| need to be dropped if the pid namespace of the event's origin doesn't
| match that of the destination. Otherwise the pid would be ambiguous and
| this interface will only work for tasks in the initial pid namespace.

If we do have the destination pid namespace, we could translate the pid
of the process modifying the file, into the pid namespace of the process
receiving the notification, like we do in mq_notify().

So when a process in an ancestor or unrelated pid namespace modifies the
file, the si_pid would get set to 0.
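
The translation rule can be shown with a toy userspace model, loosely
following the idea that a task has one pid number per namespace level
from the root down to its own namespace. All toy_* names are invented for
this sketch and it is not kernel code; it only demonstrates why a
listener would see 0 for tasks in an ancestor or unrelated namespace.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 4

/* A pid namespace; level 0 is the initial (root) namespace. */
struct toy_ns {
	int level;
};

/* One number per namespace the task is visible in, indexed by level. */
struct toy_pid {
	int level;	/* level of the namespace the task lives in */
	struct {
		int nr;
		const struct toy_ns *ns;
	} numbers[TOY_MAX_LEVEL];
};

/* The task's pid as seen from `ns`, or 0 if it is not visible there. */
int toy_pid_nr_ns(const struct toy_pid *pid, const struct toy_ns *ns)
{
	assert(pid != NULL && ns != NULL);
	if (ns->level <= pid->level && pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}
```

With a listener in a child namespace, a task in the root namespace or in
a sibling namespace translates to 0, while a listener in the root
namespace can name every task.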

Sukadev