2010-08-23 00:37:38

by Eric Paris

[permalink] [raw]
Subject: [PATCH 1/4] fanotify: flush outstanding perm requests on group destroy

When an fanotify listener is closing it may cause a deadlock between the
listener and the original task doing an fs operation. If the original task
is waiting for a permissions response it will be holding the srcu lock. The
listener cannot clean up and exit until after that srcu lock is syncronized.
Thus deadlock. The fix introduced here is to stop accepting new permissions
events when a listener is shutting down and to grant permission for all
outstanding events. Thus the original task will eventually release the srcu
lock and the listener can complete shutdown.

Reported-by: Andreas Gruenbacher <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Signed-off-by: Eric Paris <[email protected]>
---
fs/notify/fanotify/fanotify_user.c | 27 +++++++++++++++++++++++++++
include/linux/fanotify.h | 7 -------
include/linux/fsnotify_backend.h | 1 +
3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 032b837..b966b72 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -195,6 +195,14 @@ static int prepare_for_access_response(struct fsnotify_group *group,
re->fd = fd;

mutex_lock(&group->fanotify_data.access_mutex);
+
+ if (group->fanotify_data.bypass_perm) {
+ mutex_unlock(&group->fanotify_data.access_mutex);
+ kmem_cache_free(fanotify_response_event_cache, re);
+ event->response = FAN_ALLOW;
+ return 0;
+ }
+
list_add_tail(&re->list, &group->fanotify_data.access_list);
mutex_unlock(&group->fanotify_data.access_mutex);

@@ -364,9 +372,28 @@ static ssize_t fanotify_write(struct file *file, const char __user *buf, size_t
static int fanotify_release(struct inode *ignored, struct file *file)
{
struct fsnotify_group *group = file->private_data;
+ struct fanotify_response_event *re, *lre;

pr_debug("%s: file=%p group=%p\n", __func__, file, group);

+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+ mutex_lock(&group->fanotify_data.access_mutex);
+
+ group->fanotify_data.bypass_perm = true;
+
+ list_for_each_entry_safe(re, lre, &group->fanotify_data.access_list, list) {
+ pr_debug("%s: found group=%p re=%p event=%p\n", __func__, group,
+ re, re->event);
+
+ list_del_init(&re->list);
+ re->event->response = FAN_ALLOW;
+
+ kmem_cache_free(fanotify_response_event_cache, re);
+ }
+ mutex_unlock(&group->fanotify_data.access_mutex);
+
+ wake_up(&group->fanotify_data.access_waitq);
+#endif
/* matches the fanotify_init->fsnotify_alloc_group */
fsnotify_put_group(group);

diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index f0949a5..9854356 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -95,11 +95,4 @@ struct fanotify_response {
(long)(meta)->event_len >= (long)FAN_EVENT_METADATA_LEN && \
(long)(meta)->event_len <= (long)(len))

-#ifdef __KERNEL__
-
-struct fanotify_wait {
- struct fsnotify_event *event;
- __s32 fd;
-};
-#endif /* __KERNEL__ */
#endif /* _LINUX_FANOTIFY_H */
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index ed36fb5..e40190d 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -156,6 +156,7 @@ struct fsnotify_group {
struct mutex access_mutex;
struct list_head access_list;
wait_queue_head_t access_waitq;
+ bool bypass_perm; /* protected by access_mutex */
#endif /* CONFIG_FANOTIFY_ACCESS_PERMISSIONS */
int f_flags;
} fanotify_data;
--
1.6.5.3


2010-08-23 00:37:25

by Eric Paris

[permalink] [raw]
Subject: [PATCH 2/4] fanotify: drop duplicate pr_debug statement

From: Tvrtko Ursulin <[email protected]>

This reminded me... you have two pr_debugs in fanotify_should_send_event
which output redundant information. Maybe you intended it like that so
it is selectable how much log spam you want, or if not you may want to
apply this patch.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Signed-off-by: Eric Paris <[email protected]>
---
fs/notify/fanotify/fanotify.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 756566f..85366c7 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -165,9 +165,6 @@ static bool fanotify_should_send_event(struct fsnotify_group *group,
"mask=%x data=%p data_type=%d\n", __func__, group, to_tell,
inode_mark, vfsmnt_mark, event_mask, data, data_type);

- pr_debug("%s: group=%p vfsmount_mark=%p inode_mark=%p mask=%x\n",
- __func__, group, vfsmnt_mark, inode_mark, event_mask);
-
/* sorry, fanotify only gives a damn about files and dirs */
if (!S_ISREG(to_tell->i_mode) &&
!S_ISDIR(to_tell->i_mode))
--
1.6.5.3

2010-08-23 00:37:37

by Eric Paris

[permalink] [raw]
Subject: [PATCH 4/4] fanotify: drops the packed attribute from userspace event metadata

The userspace event metadata structure was packed so when sent from a kernel
with a certain set of alignment rules to a userspace listener with a different
set of alignment rules the userspace process would be able to use the
structure. On some arches just using packed, even if it doesn't do anything
to the alignment can cause a severe performance hit. From now on we are
not going to set the packed attribute and will just need to be very careful
to make sure the structure is naturally aligned and that explicit padding is
used when necessary. To make sure noone gets this wrong in the future, we
enforce this fact, at build time, using a similar structure that is packed
and comparing their sizes.

Cc: Andreas Dilger <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Signed-off-by: Eric Paris <[email protected]>
---
fs/notify/fanotify/fanotify_user.c | 2 ++
include/linux/fanotify.h | 22 ++++++++++++++++++++--
2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index b966b72..c3a5742 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -778,6 +778,8 @@ SYSCALL_ALIAS(sys_fanotify_mark, SyS_fanotify_mark);
*/
static int __init fanotify_user_setup(void)
{
+ BUILD_BUG_ON(sizeof(struct fanotify_event_metadata) !=
+ sizeof(struct fan_event_meta_packed));
fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
fanotify_response_event_cache = KMEM_CACHE(fanotify_response_event,
SLAB_PANIC);
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 0535461..3d7a9b5 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -66,14 +66,21 @@
FAN_Q_OVERFLOW)

#define FANOTIFY_METADATA_VERSION 1
-
+/*
+ * This structue must be naturally aligned so that a 32 bit userspace process
+ * will find the offsets the same as a 64bit process. If there would be padding
+ * in the structure it must be added explictly by hand. Please note that
+ * anything added to this structure must also be added to the fan_event_meta_packed
+ * struct, which is used to enforce the alignment and padding rules at build
+ * time.
+ */
struct fanotify_event_metadata {
__u32 event_len;
__u32 vers;
__u64 mask;
__s32 fd;
__s32 pid;
-} __attribute__ ((packed));
+};

struct fanotify_response {
__s32 fd;
@@ -95,4 +102,15 @@ struct fanotify_response {
(long)(meta)->event_len >= (long)FAN_EVENT_METADATA_LEN && \
(long)(meta)->event_len <= (long)(len))

+#ifdef __KERNEL__
+/* see struct fanotify_event_metadata for the reason this exists */
+struct fan_event_meta_packed {
+ __u32 event_len;
+ __u32 vers;
+ __u64 mask;
+ __s32 fd;
+ __s32 pid;
+} __attribute__ ((packed));
+
+#endif /* __KERNEL__ */
#endif /* _LINUX_FANOTIFY_H */
--
1.6.5.3

2010-08-23 00:37:35

by Eric Paris

[permalink] [raw]
Subject: [PATCH 3/4] fanotify: resize pid and reorder structure

From: Tvrtko Ursulin <[email protected]>

resize pid and reorder the fanotify_event_metadata so it is naturally
aligned and we can work towards dropping the packed attributed

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Andreas Dilger <[email protected]>
Signed-off-by: Eric Paris <[email protected]>
---
include/linux/fanotify.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 9854356..0535461 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -70,9 +70,9 @@
struct fanotify_event_metadata {
__u32 event_len;
__u32 vers;
- __s32 fd;
__u64 mask;
- __s64 pid;
+ __s32 fd;
+ __s32 pid;
} __attribute__ ((packed));

struct fanotify_response {
--
1.6.5.3

2010-08-23 16:19:17

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [PATCH 4/4] fanotify: drops the packed attribute from userspace event metadata

On Monday 23 August 2010 02:37:10 Eric Paris wrote:
> #define FANOTIFY_METADATA_VERSION 1

FANOTIFY_METADATA_VERSION should be incremented, too.

> struct fanotify_event_metadata {
> __u32 event_len;
> __u32 vers;
> __u64 mask;

We don't actually care if there are any holes in this structure; all we
care about is that the structure has the same alignment on 32-bit and
64-bit architectures.

Using type aligned_u64 here instead of __u64 will do the trick.

> +#ifdef __KERNEL__
> +/* see struct fanotify_event_metadata for the reason this exists */
> +struct fan_event_meta_packed {
> + __u32 event_len;
> + __u32 vers;
> + __u64 mask;
> + __s32 fd;
> + __s32 pid;
> +} __attribute__ ((packed));

This does not add much value; the two structures can still go out of sync
A note to be careful about changes to struct fanotify_event_metadata
should really be warning enough.

Andreas

2010-08-24 01:13:36

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [PATCH 1/4] fanotify: flush outstanding perm requests on group destroy

On Monday 23 August 2010 02:37:07 Eric Paris wrote:
> When an fanotify listener is closing it may cause a deadlock between the
> listener and the original task doing an fs operation. If the original task
> is waiting for a permissions response it will be holding the srcu lock.
> The listener cannot clean up and exit until after that srcu lock is
> syncronized. Thus deadlock. The fix introduced here is to stop accepting
> new permissions events when a listener is shutting down and to grant
> permission for all outstanding events. Thus the original task will
> eventually release the srcu lock and the listener can complete shutdown.

This seems to work now.

The one remaining issue is that processes blocked on perm events cannot be
killed even with SIGKILL. I don't know how hard it will be to fix this.

Thanks,
Andreas

2010-08-24 08:49:52

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [PATCH 1/4] fanotify: flush outstanding perm requests on group destroy

On Tuesday 24 Aug 2010 02:13:11 Andreas Gruenbacher wrote:
> On Monday 23 August 2010 02:37:07 Eric Paris wrote:
> > When an fanotify listener is closing it may cause a deadlock between the
> > listener and the original task doing an fs operation. If the original
> > task is waiting for a permissions response it will be holding the srcu
> > lock. The listener cannot clean up and exit until after that srcu lock
> > is syncronized. Thus deadlock. The fix introduced here is to stop
> > accepting new permissions events when a listener is shutting down and to
> > grant permission for all outstanding events. Thus the original task
> > will eventually release the srcu lock and the listener can complete
> > shutdown.
>
> This seems to work now.
>
> The one remaining issue is that processes blocked on perm events cannot be
> killed even with SIGKILL. I don't know how hard it will be to fix this.

I think just switching to interruptible sleep in
fanotify_get_response_from_access should be fine. And it should probably deny
the current event when signal is received.

Tvrtko

Sophos Plc, The Pentagon, Abingdon Science Park, Abingdon, OX14 3YP, United Kingdom.
Company Reg No 2096520. VAT Reg No GB 348 3873 20.

2010-08-24 09:36:59

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [PATCH 1/4] fanotify: flush outstanding perm requests on group destroy

On Tuesday 24 August 2010 10:49:45 Tvrtko Ursulin wrote:
> I think just switching to interruptible sleep in
> fanotify_get_response_from_access should be fine. And it should probably
> deny the current event when signal is received.

Well the result would be -EINTR from the system call that blocked on the
perm event, the same as with an interruptible nfs mount. The process would
never get -EPERM. Processes may
not be prepared to handle -EINTR in all cases, and so it may make more sense
to use the same behavior as NFS and only allow SIGKILL to kill a process
blocked on a perm event (which the blocked process will never see).

>From the nfs man page:
> intr / nointr
>
> Selects whether to allow signals to interrupt file operations on this mount
> point. If neither option is specified (or if nointr is specified),
> signals do not interrupt NFS file operations. If intr is specified,
> system calls return EINTR if an in-progress NFS operation is interrupted
> by a signal.
>
> Using the intr option is preferred to using the soft option because it is
> significantly less likely to result in data corruption. The intr / nointr
> mount option is deprecated after kernel 2.6.25. Only SIGKILL can
> interrupt a pending NFS operation on these kernels, and if specified,
> this mount option is ignored to provide backwards compatibility with
> older kernels.

Andreas

2010-08-24 09:51:35

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [PATCH 1/4] fanotify: flush outstanding perm requests on group destroy

On Tuesday 24 Aug 2010 10:36:29 Andreas Gruenbacher wrote:
> On Tuesday 24 August 2010 10:49:45 Tvrtko Ursulin wrote:
> > I think just switching to interruptible sleep in
> > fanotify_get_response_from_access should be fine. And it should probably
> > deny the current event when signal is received.
>
> Well the result would be -EINTR from the system call that blocked on the
> perm event, the same as with an interruptible nfs mount. The process would
> never get -EPERM. Processes may
> not be prepared to handle -EINTR in all cases, and so it may make more
> sense to use the same behavior as NFS and only allow SIGKILL to kill a
> process blocked on a perm event (which the blocked process will never
> see).

That would be wait_event_killable then, even simpler change.

I agree that hiding -EINTR from open, from userspace code is probably a more
compatible way of doing it. If I remember correctly POSIX does allow EINTR
from open but form our experience there are indeed applications which do not
handle it. Some version of X immediately spring to mind who used to set up a
periodic signal delivery (something like internal jiffy, I think they called
it smart scheduler) to themselves and would not handle -EINTR from open. Samba
also had a problem here in specific circumstances.

Tvrtko


Sophos Plc, The Pentagon, Abingdon Science Park, Abingdon, OX14 3YP, United Kingdom.
Company Reg No 2096520. VAT Reg No GB 348 3873 20.