2013-06-09 12:41:05

by Lino Sanfilippo

[permalink] [raw]
Subject: filesystem notification fixes/cleanup

Hi,

this patch series fixes a couple of races (fanotify/inotify) or simplifies code
(dnotify) by means of the fsnotify groups mark mutex. The last patch updates the
comments in fs/notify/mark.c concerning the changed fsnotify locking implementation.

Regards,
Lino

[PATCH 1/5] fanotify: Fix races when adding/removing marks
[PATCH 2/5] fanotify: Put duplicate code for adding vfsmount/inode
[PATCH 3/5] dnotify: replace dnotify_mark_mutex with mark mutex of
[PATCH 4/5] inotify: fix race when adding a new watch
[PATCH 5/5] fsnotify: update comments concerning locking scheme


2013-06-09 12:41:48

by Lino Sanfilippo

[permalink] [raw]
Subject: [PATCH 1/5] fanotify: Fix races when adding/removing marks

For both, adding an event to an existing mark and destroying a mark we first
have to find it via fsnotify_find_[inode|vfsmount]_mark(). But getting the mark
and adding an event (or destroying it) is not done atomically. This opens a
race where a thread is about to destroy a mark while another thread still finds
the same mark and adds an event to its mask although it will be destroyed.

Another race exists concerning the excess of a groups number of marks limit:
When a mark is added the number of group marks is checked against the max number
of marks per group and increased afterwards. Since check and increment is also
not done atomically, this may result in 2 or more processes passing the check
at the same time and increasing the number of group marks above the allowed limit.

With this patch both races are avoided by doing the concerning operations with
the groups mark mutex locked.

Signed-off-by: Lino Sanfilippo <[email protected]>
---
fs/notify/fanotify/fanotify_user.c | 49 +++++++++++++++++++++++++++---------
1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 5d84442..213f0a1 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -525,14 +525,18 @@ static int fanotify_remove_vfsmount_mark(struct fsnotify_group *group,
__u32 removed;
int destroy_mark;

+ mutex_lock(&group->mark_mutex);
fsn_mark = fsnotify_find_vfsmount_mark(group, mnt);
- if (!fsn_mark)
+ if (!fsn_mark) {
+ mutex_unlock(&group->mark_mutex);
return -ENOENT;
+ }

removed = fanotify_mark_remove_from_mask(fsn_mark, mask, flags,
&destroy_mark);
if (destroy_mark)
- fsnotify_destroy_mark(fsn_mark, group);
+ fsnotify_destroy_mark_locked(fsn_mark, group);
+ mutex_unlock(&group->mark_mutex);

fsnotify_put_mark(fsn_mark);
if (removed & real_mount(mnt)->mnt_fsnotify_mask)
@@ -549,14 +553,19 @@ static int fanotify_remove_inode_mark(struct fsnotify_group *group,
__u32 removed;
int destroy_mark;

+ mutex_lock(&group->mark_mutex);
fsn_mark = fsnotify_find_inode_mark(group, inode);
- if (!fsn_mark)
+ if (!fsn_mark) {
+ mutex_unlock(&group->mark_mutex);
return -ENOENT;
+ }

removed = fanotify_mark_remove_from_mask(fsn_mark, mask, flags,
&destroy_mark);
if (destroy_mark)
- fsnotify_destroy_mark(fsn_mark, group);
+ fsnotify_destroy_mark_locked(fsn_mark, group);
+ mutex_unlock(&group->mark_mutex);
+
/* matches the fsnotify_find_inode_mark() */
fsnotify_put_mark(fsn_mark);
if (removed & inode->i_fsnotify_mask)
@@ -600,21 +609,29 @@ static int fanotify_add_vfsmount_mark(struct fsnotify_group *group,
__u32 added;
int ret = 0;

+ mutex_lock(&group->mark_mutex);
fsn_mark = fsnotify_find_vfsmount_mark(group, mnt);
if (!fsn_mark) {
- if (atomic_read(&group->num_marks) > group->fanotify_data.max_marks)
+ if (atomic_read(&group->num_marks) > group->fanotify_data.max_marks) {
+ mutex_unlock(&group->mark_mutex);
return -ENOSPC;
+ }

fsn_mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
- if (!fsn_mark)
+ if (!fsn_mark) {
+ mutex_unlock(&group->mark_mutex);
return -ENOMEM;
+ }

fsnotify_init_mark(fsn_mark, fanotify_free_mark);
- ret = fsnotify_add_mark(fsn_mark, group, NULL, mnt, 0);
- if (ret)
+ ret = fsnotify_add_mark_locked(fsn_mark, group, NULL, mnt, 0);
+ if (ret) {
+ mutex_unlock(&group->mark_mutex);
goto err;
+ }
}
added = fanotify_mark_add_to_mask(fsn_mark, mask, flags);
+ mutex_unlock(&group->mark_mutex);

if (added & ~real_mount(mnt)->mnt_fsnotify_mask)
fsnotify_recalc_vfsmount_mask(mnt);
@@ -643,21 +660,29 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,
(atomic_read(&inode->i_writecount) > 0))
return 0;

+ mutex_lock(&group->mark_mutex);
fsn_mark = fsnotify_find_inode_mark(group, inode);
if (!fsn_mark) {
- if (atomic_read(&group->num_marks) > group->fanotify_data.max_marks)
+ if (atomic_read(&group->num_marks) > group->fanotify_data.max_marks) {
+ mutex_unlock(&group->mark_mutex);
return -ENOSPC;
+ }

fsn_mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
- if (!fsn_mark)
+ if (!fsn_mark) {
+ mutex_unlock(&group->mark_mutex);
return -ENOMEM;
+ }

fsnotify_init_mark(fsn_mark, fanotify_free_mark);
- ret = fsnotify_add_mark(fsn_mark, group, inode, NULL, 0);
- if (ret)
+ ret = fsnotify_add_mark_locked(fsn_mark, group, inode, NULL, 0);
+ if (ret) {
+ mutex_unlock(&group->mark_mutex);
goto err;
+ }
}
added = fanotify_mark_add_to_mask(fsn_mark, mask, flags);
+ mutex_unlock(&group->mark_mutex);

if (added & ~inode->i_fsnotify_mask)
fsnotify_recalc_inode_mask(inode);
--
1.7.9.5

2013-06-09 12:41:17

by Lino Sanfilippo

[permalink] [raw]
Subject: [PATCH 2/5] fanotify: Put duplicate code for adding vfsmount/inode marks into an own function

The code under the groups mark_mutex in fanotify_add_inode_mark() and
fanotify_add_vfsmount_mark() is almost identical. So put it into a seperate
function.

Signed-off-by: Lino Sanfilippo <[email protected]>
---
fs/notify/fanotify/fanotify_user.c | 71 ++++++++++++++++++------------------
1 file changed, 35 insertions(+), 36 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 213f0a1..6c307d7 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -601,33 +601,45 @@ static __u32 fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
return mask & ~oldmask;
}

+static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
+ struct inode *inode,
+ struct vfsmount *mnt)
+{
+ struct fsnotify_mark *mark;
+ int ret;
+
+ if (atomic_read(&group->num_marks) > group->fanotify_data.max_marks)
+ return ERR_PTR(-ENOSPC);
+
+ mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
+ if (!mark)
+ return ERR_PTR(-ENOMEM);
+
+ fsnotify_init_mark(mark, fanotify_free_mark);
+ ret = fsnotify_add_mark_locked(mark, group, inode, mnt, 0);
+ if (ret) {
+ fsnotify_put_mark(mark);
+ return ERR_PTR(ret);
+ }
+
+ return mark;
+}
+
+
static int fanotify_add_vfsmount_mark(struct fsnotify_group *group,
struct vfsmount *mnt, __u32 mask,
unsigned int flags)
{
struct fsnotify_mark *fsn_mark;
__u32 added;
- int ret = 0;

mutex_lock(&group->mark_mutex);
fsn_mark = fsnotify_find_vfsmount_mark(group, mnt);
if (!fsn_mark) {
- if (atomic_read(&group->num_marks) > group->fanotify_data.max_marks) {
- mutex_unlock(&group->mark_mutex);
- return -ENOSPC;
- }
-
- fsn_mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
- if (!fsn_mark) {
- mutex_unlock(&group->mark_mutex);
- return -ENOMEM;
- }
-
- fsnotify_init_mark(fsn_mark, fanotify_free_mark);
- ret = fsnotify_add_mark_locked(fsn_mark, group, NULL, mnt, 0);
- if (ret) {
+ fsn_mark = fanotify_add_new_mark(group, NULL, mnt);
+ if (IS_ERR(fsn_mark)) {
mutex_unlock(&group->mark_mutex);
- goto err;
+ return PTR_ERR(fsn_mark);
}
}
added = fanotify_mark_add_to_mask(fsn_mark, mask, flags);
@@ -635,9 +647,9 @@ static int fanotify_add_vfsmount_mark(struct fsnotify_group *group,

if (added & ~real_mount(mnt)->mnt_fsnotify_mask)
fsnotify_recalc_vfsmount_mask(mnt);
-err:
+
fsnotify_put_mark(fsn_mark);
- return ret;
+ return 0;
}

static int fanotify_add_inode_mark(struct fsnotify_group *group,
@@ -646,7 +658,6 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,
{
struct fsnotify_mark *fsn_mark;
__u32 added;
- int ret = 0;

pr_debug("%s: group=%p inode=%p\n", __func__, group, inode);

@@ -663,22 +674,10 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,
mutex_lock(&group->mark_mutex);
fsn_mark = fsnotify_find_inode_mark(group, inode);
if (!fsn_mark) {
- if (atomic_read(&group->num_marks) > group->fanotify_data.max_marks) {
+ fsn_mark = fanotify_add_new_mark(group, inode, NULL);
+ if (IS_ERR(fsn_mark)) {
mutex_unlock(&group->mark_mutex);
- return -ENOSPC;
- }
-
- fsn_mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
- if (!fsn_mark) {
- mutex_unlock(&group->mark_mutex);
- return -ENOMEM;
- }
-
- fsnotify_init_mark(fsn_mark, fanotify_free_mark);
- ret = fsnotify_add_mark_locked(fsn_mark, group, inode, NULL, 0);
- if (ret) {
- mutex_unlock(&group->mark_mutex);
- goto err;
+ return PTR_ERR(fsn_mark);
}
}
added = fanotify_mark_add_to_mask(fsn_mark, mask, flags);
@@ -686,9 +685,9 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,

if (added & ~inode->i_fsnotify_mask)
fsnotify_recalc_inode_mask(inode);
-err:
+
fsnotify_put_mark(fsn_mark);
- return ret;
+ return 0;
}

/* fanotify syscalls */
--
1.7.9.5

2013-06-09 12:41:45

by Lino Sanfilippo

[permalink] [raw]
Subject: [PATCH 5/5] fsnotify: update comments concerning locking scheme

There has been changes in the locking scheme of fsnotify but the comments in the
source code have not been updated yet. This patch corrects this.

Signed-off-by: Lino Sanfilippo <[email protected]>
---
fs/notify/mark.c | 50 ++++++++++++++++++++++----------------------------
1 file changed, 22 insertions(+), 28 deletions(-)

diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index fc6b49b..923fe4a 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -20,28 +20,29 @@
* fsnotify inode mark locking/lifetime/and refcnting
*
* REFCNT:
- * The mark->refcnt tells how many "things" in the kernel currently are
- * referencing this object. The object typically will live inside the kernel
- * with a refcnt of 2, one for each list it is on (i_list, g_list). Any task
- * which can find this object holding the appropriete locks, can take a reference
- * and the object itself is guaranteed to survive until the reference is dropped.
+ * The group->recnt and mark->refcnt tell how many "things" in the kernel
+ * currently are referencing the objects. Both kind of objects typically will
+ * live inside the kernel with a refcnt of 2, one for its creation and one for
+ * the reference a group and a mark hold to each other.
+ * If you are holding the appropriate locks, you can take a reference and the
+ * object itself is guaranteed to survive until the reference is dropped.
*
* LOCKING:
- * There are 3 spinlocks involved with fsnotify inode marks and they MUST
- * be taken in order as follows:
+ * There are 3 locks involved with fsnotify inode marks and they MUST be taken
+ * in order as follows:
*
+ * group->mark_mutex
* mark->lock
- * group->mark_lock
* inode->i_lock
*
- * mark->lock protects 2 things, mark->group and mark->inode. You must hold
- * that lock to dereference either of these things (they could be NULL even with
- * the lock)
- *
- * group->mark_lock protects the marks_list anchored inside a given group
- * and each mark is hooked via the g_list. It also sorta protects the
- * free_g_list, which when used is anchored by a private list on the stack of the
- * task which held the group->mark_lock.
+ * group->mark_mutex protects the marks_list anchored inside a given group and
+ * each mark is hooked via the g_list. It also protects the groups private
+ * data (i.e group limits).
+
+ * mark->lock protects the marks attributes like its masks and flags.
+ * Furthermore it protects the access to a reference of the group that the mark
+ * is assigned to as well as the access to a reference of the inode/vfsmount
+ * that is being watched by the mark.
*
* inode->i_lock protects the i_fsnotify_marks list anchored inside a
* given inode and each mark is hooked via the i_list. (and sorta the
@@ -64,18 +65,11 @@
* inode. We take i_lock and walk the i_fsnotify_marks safely. For each
* mark on the list we take a reference (so the mark can't disappear under us).
* We remove that mark form the inode's list of marks and we add this mark to a
- * private list anchored on the stack using i_free_list; At this point we no
- * longer fear anything finding the mark using the inode's list of marks.
- *
- * We can safely and locklessly run the private list on the stack of everything
- * we just unattached from the original inode. For each mark on the private list
- * we grab the mark-> and can thus dereference mark->group and mark->inode. If
- * we see the group and inode are not NULL we take those locks. Now holding all
- * 3 locks we can completely remove the mark from other tasks finding it in the
- * future. Remember, 10 things might already be referencing this mark, but they
- * better be holding a ref. We drop our reference we took before we unhooked it
- * from the inode. When the ref hits 0 we can free the mark.
- *
+ * private list anchored on the stack using i_free_list; we walk i_free_list
+ * and before we destroy the mark we make sure that we dont race with a
+ * concurrent destroy_group by getting a ref to the marks group and taking the
+ * groups mutex.
+
* Very similarly for freeing by group, except we use free_g_list.
*
* This has the very interesting property of being able to run concurrently with
--
1.7.9.5

2013-06-09 12:41:44

by Lino Sanfilippo

[permalink] [raw]
Subject: [PATCH 4/5] inotify: fix race when adding a new watch

In inotify_new_watch() the number of watches for a group is compared against
the max number of allowed watches and increased afterwards.
The check and incrementation is not done atomically, so it is possible for
multiple concurrent threads to pass the check and increment the number of marks
above the allowed max.
This patch uses an inotify groups mark_lock to ensure that both check and
incrementation are done atomic.
Furthermore we dont have to worry about the race that allows a concurrent
thread to add a watch just after inotify_update_existing_watch() returned with
-ENOENT anymore, since this is also synchronized by the groups mark mutex now.

Signed-off-by: Lino Sanfilippo <[email protected]>
---
fs/notify/inotify/inotify_user.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index e0f7c12..14d7786 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -644,7 +644,8 @@ static int inotify_new_watch(struct fsnotify_group *group,
goto out_err;

/* we are on the idr, now get on the inode */
- ret = fsnotify_add_mark(&tmp_i_mark->fsn_mark, group, inode, NULL, 0);
+ ret = fsnotify_add_mark_locked(&tmp_i_mark->fsn_mark, group, inode,
+ NULL, 0);
if (ret) {
/* we failed to get on the inode, get off the idr */
inotify_remove_from_idr(group, tmp_i_mark);
@@ -668,19 +669,13 @@ static int inotify_update_watch(struct fsnotify_group *group, struct inode *inod
{
int ret = 0;

-retry:
+ mutex_lock(&group->mark_mutex);
/* try to update and existing watch with the new arg */
ret = inotify_update_existing_watch(group, inode, arg);
/* no mark present, try to add a new one */
if (ret == -ENOENT)
ret = inotify_new_watch(group, inode, arg);
- /*
- * inotify_new_watch could race with another thread which did an
- * inotify_new_watch between the update_existing and the add watch
- * here, go back and try to update an existing mark again.
- */
- if (ret == -EEXIST)
- goto retry;
+ mutex_unlock(&group->mark_mutex);

return ret;
}
--
1.7.9.5

2013-06-09 12:41:42

by Lino Sanfilippo

[permalink] [raw]
Subject: [PATCH 3/5] dnotify: replace dnotify_mark_mutex with mark mutex of dnotify_group

There is no need to use a special mutex to protect against the fcntl/close race
(see dnotify.c for a description of this race). Instead the dnotify_groups mark
mutex can be used.

Signed-off-by: Lino Sanfilippo <[email protected]>
---
fs/notify/dnotify/dnotify.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 2bfe6dc..1fedd5f 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -31,7 +31,6 @@ int dir_notify_enable __read_mostly = 1;
static struct kmem_cache *dnotify_struct_cache __read_mostly;
static struct kmem_cache *dnotify_mark_cache __read_mostly;
static struct fsnotify_group *dnotify_group __read_mostly;
-static DEFINE_MUTEX(dnotify_mark_mutex);

/*
* dnotify will attach one of these to each inode (i_fsnotify_marks) which
@@ -183,7 +182,7 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
return;
dn_mark = container_of(fsn_mark, struct dnotify_mark, fsn_mark);

- mutex_lock(&dnotify_mark_mutex);
+ mutex_lock(&dnotify_group->mark_mutex);

spin_lock(&fsn_mark->lock);
prev = &dn_mark->dn;
@@ -199,11 +198,12 @@ void dnotify_flush(struct file *filp, fl_owner_t id)

spin_unlock(&fsn_mark->lock);

- /* nothing else could have found us thanks to the dnotify_mark_mutex */
+ /* nothing else could have found us thanks to the dnotify_groups
+ mark_mutex */
if (dn_mark->dn == NULL)
- fsnotify_destroy_mark(fsn_mark, dnotify_group);
+ fsnotify_destroy_mark_locked(fsn_mark, dnotify_group);

- mutex_unlock(&dnotify_mark_mutex);
+ mutex_unlock(&dnotify_group->mark_mutex);

fsnotify_put_mark(fsn_mark);
}
@@ -326,7 +326,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
new_dn_mark->dn = NULL;

/* this is needed to prevent the fcntl/close race described below */
- mutex_lock(&dnotify_mark_mutex);
+ mutex_lock(&dnotify_group->mark_mutex);

/* add the new_fsn_mark or find an old one. */
fsn_mark = fsnotify_find_inode_mark(dnotify_group, inode);
@@ -334,7 +334,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
dn_mark = container_of(fsn_mark, struct dnotify_mark, fsn_mark);
spin_lock(&fsn_mark->lock);
} else {
- fsnotify_add_mark(new_fsn_mark, dnotify_group, inode, NULL, 0);
+ fsnotify_add_mark_locked(new_fsn_mark, dnotify_group, inode,
+ NULL, 0);
spin_lock(&new_fsn_mark->lock);
fsn_mark = new_fsn_mark;
dn_mark = new_dn_mark;
@@ -348,9 +349,9 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)

/* if (f != filp) means that we lost a race and another task/thread
* actually closed the fd we are still playing with before we grabbed
- * the dnotify_mark_mutex and fsn_mark->lock. Since closing the fd is the
- * only time we clean up the marks we need to get our mark off
- * the list. */
+ * the dnotify_groups mark_mutex and fsn_mark->lock. Since closing the
+ * fd is the only time we clean up the marks we need to get our mark
+ * off the list. */
if (f != filp) {
/* if we added ourselves, shoot ourselves, it's possible that
* the flush actually did shoot this fsn_mark. That's fine too
@@ -385,9 +386,9 @@ out:
spin_unlock(&fsn_mark->lock);

if (destroy)
- fsnotify_destroy_mark(fsn_mark, dnotify_group);
+ fsnotify_destroy_mark_locked(fsn_mark, dnotify_group);

- mutex_unlock(&dnotify_mark_mutex);
+ mutex_unlock(&dnotify_group->mark_mutex);
fsnotify_put_mark(fsn_mark);
out_err:
if (new_fsn_mark)
--
1.7.9.5