2021-08-12 22:56:29

by Gabriel Krisman Bertazi

Subject: [PATCH v6 00/21] File system wide monitoring

Hi,

This is the 6th version of the FAN_FS_ERROR patches. It applies the
feedback from the last version (thanks Amir, Jan).

There are important changes in this version, some of which bring us
back to previous versions of this series. I did my best to avoid
problems that were mentioned during earlier revisions, and I think I
covered everything. But I apologize if this requires reviewers to repeat
some comments.

First of all, instead of initializing the error event from inside the
insert callback and abusing the merge logic for the err_count update,
this version reverts to simple insertion code and configures the event
before sending it to be queued by fsnotify. This makes the submission
code less trivial, but addresses the potential problem of encoding the
FH while holding the group->notification_lock.

This version also drops the slot replacement code when dequeueing, and
reverts to the copy-to-stack mechanism. This simplifies the code a
lot.

The way we report superblock errors also changed. Now, the handle is
omitted and we return the handle_bytes as 0.

Finally, we no longer play games with predicting the file handle size
beforehand. Now, the code just allocates space for the largest handle
possible, and assumes that is enough.

On another note, this also restores the mark reference owned by the
error event while it is queued. As Amir explained, this is required to
prevent the mark from going away while the event is queued.

This was tested with LTP for regressions and also using the sample code
in the last patch, with a corrupted image. I wrote a new LTP test for
this feature which is being reviewed and is available at:

https://gitlab.collabora.com/krisman/ltp -b fan-fs-error

In addition, I wrote a man-page that can be pulled from:

https://gitlab.collabora.com/krisman/man-pages.git -b fan-fs-error

It is being reviewed on the list.

I also pushed this full series to:

https://gitlab.collabora.com/krisman/linux -b fanotify-notifications-single-slot

Thank you

Original cover letter
---------------------
Hi,

This series follows up on my previous proposal [1] to support file system
wide monitoring. As suggested by Amir, this proposal drops the ring
buffer in favor of a single slot associated with each mark. This
simplifies the implementation a bit, as you can see in the code.

As a reminder, this proposal is limited to an interface for
administrators to monitor the health of a file system, instead of a
generic interface for file errors. Therefore, this doesn't solve the
problem of writeback errors or the need to watch a specific subtree.

In comparison to the previous RFC, this implementation also drops the
per-fs data and location, and leaves those as future extensions.

* Implementation

The feature is implemented on top of fanotify, as a new type of fanotify
mark, FAN_ERROR, which a file system monitoring tool can register to
receive error notifications. When an error occurs, a new notification is
generated, followed by this info field:

- FS generic data: A file system agnostic structure that has a generic
  error code and identifies the filesystem. Basically, it lets
  userspace know something happened on a monitored filesystem. Since
  only the first error since the last read is recorded, this also
  includes a counter of errors that happened since the last read.
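
The "first error plus counter" behaviour described above can be sketched
in plain C. This is only an illustration under stated assumptions:
struct error_slot, slot_record() and slot_read() are hypothetical names,
the counter saturates by choice, and none of this is the layout or the
locking used by the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-slot error record; not the kernel's actual layout. */
struct error_slot {
	int32_t  first_error;	/* code of the first unread error */
	uint32_t err_count;	/* errors seen since the last read */
};

/* Record an error: only the first code is kept, later ones bump the count. */
static void slot_record(struct error_slot *slot, int32_t error)
{
	assert(slot);

	if (slot->err_count == 0)
		slot->first_error = error;

	/* Saturate instead of wrapping, so a full slot never looks empty. */
	if (slot->err_count < UINT32_MAX)
		slot->err_count++;
}

/* Copy the slot out and reset it; returns false if nothing was recorded. */
static bool slot_read(struct error_slot *slot, struct error_slot *out)
{
	assert(slot && out);

	if (slot->err_count == 0)
		return false;

	*out = *slot;
	slot->first_error = 0;
	slot->err_count = 0;
	return true;
}
```

A reader that calls slot_read() after two slot_record() calls sees the
first error code and a count of 2, and an immediate second read reports
nothing.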

* Testing

This was tested by watching notifications flowing from an intentionally
corrupted filesystem in different places. In addition, other events
were watched in an attempt to detect regressions.

Is there a specific testsuite for fanotify I should be running?

* Patches

This patchset is divided as follows: Patches 1 through 5 are refactoring
of fsnotify/fanotify in preparation for FS_ERROR/FAN_ERROR; patches 6 and
7 implement the FS_ERROR API for filesystems to report errors; patch 8
adds support for FAN_ERROR in fanotify; patch 9 is an example
implementation for ext4; patches 10 and 11 provide sample userspace code
and documentation.

I also pushed the full series to:

https://gitlab.collabora.com/krisman/linux -b fanotify-notifications-single-slot

[1] https://lwn.net/Articles/854545/
[2] https://lwn.net/Articles/856916/

Cc: Darrick J. Wong <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: [email protected]
To: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Gabriel Krisman Bertazi (21):
fsnotify: Don't insert unmergeable events in hashtable
fanotify: Fold event size calculation to its own function
fanotify: Split fsid check from other fid mode checks
fsnotify: Reserve mark flag bits for backends
fanotify: Split superblock marks out to a new cache
inotify: Don't force FS_IN_IGNORED
fsnotify: Add helper to detect overflow_event
fsnotify: Add wrapper around fsnotify_add_event
fsnotify: Allow events reported with an empty inode
fsnotify: Support FS_ERROR event type
fanotify: Allow file handle encoding for unhashed events
fanotify: Encode invalid file handle when no inode is provided
fanotify: Require fid_mode for any non-fd event
fanotify: Reserve UAPI bits for FAN_FS_ERROR
fanotify: Preallocate per superblock mark error event
fanotify: Handle FAN_FS_ERROR events
fanotify: Report fid info for file related file system errors
fanotify: Emit generic error info type for error event
ext4: Send notifications on error
samples: Add fs error monitoring example
docs: Document the FAN_FS_ERROR event

.../admin-guide/filesystem-monitoring.rst | 70 +++++
Documentation/admin-guide/index.rst | 1 +
fs/ext4/super.c | 8 +
fs/notify/fanotify/fanotify.c | 139 +++++++++-
fs/notify/fanotify/fanotify.h | 69 ++++-
fs/notify/fanotify/fanotify_user.c | 256 ++++++++++++++----
fs/notify/fsnotify.c | 19 +-
fs/notify/inotify/inotify_fsnotify.c | 2 +-
fs/notify/inotify/inotify_user.c | 6 +-
fs/notify/notification.c | 12 +-
include/linux/fanotify.h | 9 +-
include/linux/fsnotify.h | 13 +
include/linux/fsnotify_backend.h | 64 ++++-
include/uapi/linux/fanotify.h | 8 +
samples/Kconfig | 9 +
samples/Makefile | 1 +
samples/fanotify/Makefile | 5 +
samples/fanotify/fs-monitor.c | 138 ++++++++++
18 files changed, 740 insertions(+), 89 deletions(-)
create mode 100644 Documentation/admin-guide/filesystem-monitoring.rst
create mode 100644 samples/fanotify/Makefile
create mode 100644 samples/fanotify/fs-monitor.c

--
2.32.0


2021-08-12 22:56:34

by Gabriel Krisman Bertazi

Subject: [PATCH v6 01/21] fsnotify: Don't insert unmergeable events in hashtable

Some events, like the overflow event, are not mergeable, so they are not
hashed. But, when failing inside fsnotify_add_event() for lack of space,
fsnotify_add_event() still calls the insert hook, which adds the
overflow event to the merge list. Add a check to prevent any kind of
unmergeable event from being inserted in the hashtable.

Fixes: 94e00d28a680 ("fsnotify: use hash table for faster events merge")
Reviewed-by: Amir Goldstein <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v2:
- Do check for hashed events inside the insert hook (Amir)
---
fs/notify/fanotify/fanotify.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 057abd2cf887..310246f8d3f1 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -702,6 +702,9 @@ static void fanotify_insert_event(struct fsnotify_group *group,

assert_spin_locked(&group->notification_lock);

+ if (!fanotify_is_hashed_event(event->mask))
+ return;
+
pr_debug("%s: group=%p event=%p bucket=%u\n", __func__,
group, event, bucket);

@@ -779,8 +782,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,

fsn_event = &event->fse;
ret = fsnotify_add_event(group, fsn_event, fanotify_merge,
- fanotify_is_hashed_event(mask) ?
- fanotify_insert_event : NULL);
+ fanotify_insert_event);
if (ret) {
/* Permission events shouldn't be merged */
BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
--
2.32.0
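
As a standalone illustration of the guard added in patch 01/21, the
sketch below moves the "is this event hashed?" decision into the insert
hook itself, so the caller can pass the hook unconditionally. The EV_*
constants, struct event and both functions are stand-ins invented for
this example, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in mask bits, chosen for illustration only. */
#define EV_MODIFY	0x00000002u	/* an ordinary, mergeable event */
#define EV_Q_OVERFLOW	0x00004000u	/* queue overflow marker */
#define EV_PERM		0x00010000u	/* a permission event */

struct event {
	uint32_t mask;
	bool hashed;	/* stands in for membership in the merge table */
};

/* Permission and overflow events are never merged, so never hashed. */
static bool is_hashed_event(uint32_t mask)
{
	return !(mask & (EV_PERM | EV_Q_OVERFLOW));
}

/* Insert hook: always called by the queueing code, filters on its own. */
static void insert_event(struct event *ev)
{
	assert(ev);

	if (!is_hashed_event(ev->mask))
		return;

	ev->hashed = true;
}
```

With the check inside the hook, a queueing path that falls back to the
overflow event cannot accidentally hash it, whichever caller supplied
the hook.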

2021-08-12 22:58:06

by Gabriel Krisman Bertazi

Subject: [PATCH v6 04/21] fsnotify: Reserve mark flag bits for backends

Split out the final bits of struct fsnotify_mark->flags for use by a
backend.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

Changes since v1:
- turn consts into defines (jan)
---
include/linux/fsnotify_backend.h | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 1ce66748a2d2..ae1bd9f06808 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -363,6 +363,20 @@ struct fsnotify_mark_connector {
struct hlist_head list;
};

+enum fsnotify_mark_bits {
+ FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY,
+ FSN_MARK_FL_BIT_ALIVE,
+ FSN_MARK_FL_BIT_ATTACHED,
+ FSN_MARK_PRIVATE_FLAGS,
+};
+
+#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY \
+ (1 << FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY)
+#define FSNOTIFY_MARK_FLAG_ALIVE \
+ (1 << FSN_MARK_FL_BIT_ALIVE)
+#define FSNOTIFY_MARK_FLAG_ATTACHED \
+ (1 << FSN_MARK_FL_BIT_ATTACHED)
+
/*
* A mark is simply an object attached to an in core inode which allows an
* fsnotify listener to indicate they are either no longer interested in events
@@ -398,9 +412,7 @@ struct fsnotify_mark {
struct fsnotify_mark_connector *connector;
/* Events types to ignore [mark->lock, group->mark_mutex] */
__u32 ignored_mask;
-#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY 0x01
-#define FSNOTIFY_MARK_FLAG_ALIVE 0x02
-#define FSNOTIFY_MARK_FLAG_ATTACHED 0x04
+ /* Upper bits [31:PRIVATE_FLAGS] are reserved for backend usage */
unsigned int flags; /* flags [mark->lock] */
};

--
2.32.0
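
The bit-numbering scheme from patch 04/21 can be tried out in
isolation. In this sketch the core layer numbers its flag bits with an
enum whose last entry marks the first bit a backend may use. All CORE_*
and BACKEND_* names are hypothetical; only the pattern follows the
patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Core flag bits; the last enumerator is the first backend-owned bit. */
enum core_mark_bits {
	CORE_BIT_IGNORED_SURV_MODIFY,
	CORE_BIT_ALIVE,
	CORE_BIT_ATTACHED,
	CORE_PRIVATE_FLAGS,
};

#define CORE_FLAG_IGNORED_SURV_MODIFY	(1U << CORE_BIT_IGNORED_SURV_MODIFY)
#define CORE_FLAG_ALIVE			(1U << CORE_BIT_ALIVE)
#define CORE_FLAG_ATTACHED		(1U << CORE_BIT_ATTACHED)
#define CORE_FLAGS_MASK			((1U << CORE_PRIVATE_FLAGS) - 1)

/* A backend starts numbering its own bits where the core stops. */
enum backend_mark_bits {
	BACKEND_BIT_SB_MARK = CORE_PRIVATE_FLAGS,
};

#define BACKEND_FLAG_SB_MARK	(1U << BACKEND_BIT_SB_MARK)

/* Compile-time proof that the backend flag cannot collide with core flags. */
static_assert((BACKEND_FLAG_SB_MARK & CORE_FLAGS_MASK) == 0,
	      "backend flag overlaps core flags");

static inline bool mark_is_sb_mark(unsigned int flags)
{
	return flags & BACKEND_FLAG_SB_MARK;
}
```

Adding a new core flag before CORE_PRIVATE_FLAGS shifts the backend bits
automatically, which is the main advantage over hard-coded hex values.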

2021-08-12 22:58:31

by Gabriel Krisman Bertazi

Subject: [PATCH v6 03/21] fanotify: Split fsid check from other fid mode checks

FAN_FS_ERROR will require an fsid, but will not necessarily require the
filesystem to expose a file handle. Split those checks into different
functions, so they can be used separately when setting up an event.

While there, update a comment about tmpfs having 0 fsid, which is no
longer true.

Reviewed-by: Amir Goldstein <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v2:
- FAN_ERROR -> FAN_FS_ERROR (Amir)
- Update comment (Amir)

Changes since v1:
(Amir)
- Sort hunks to simplify diff.
Changes since RFC:
(Amir)
- Rename fanotify_check_path_fsid -> fanotify_test_fsid.
- Use dentry directly instead of path.
---
fs/notify/fanotify/fanotify_user.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 68a53d3534f8..67b18dfe0025 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1192,16 +1192,15 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
return fd;
}

-/* Check if filesystem can encode a unique fid */
-static int fanotify_test_fid(struct path *path, __kernel_fsid_t *fsid)
+static int fanotify_test_fsid(struct dentry *dentry, __kernel_fsid_t *fsid)
{
__kernel_fsid_t root_fsid;
int err;

/*
- * Make sure path is not in filesystem with zero fsid (e.g. tmpfs).
+ * Make sure dentry is not of a filesystem with zero fsid (e.g. fuse).
*/
- err = vfs_get_fsid(path->dentry, fsid);
+ err = vfs_get_fsid(dentry, fsid);
if (err)
return err;

@@ -1209,10 +1208,10 @@ static int fanotify_test_fid(struct path *path, __kernel_fsid_t *fsid)
return -ENODEV;

/*
- * Make sure path is not inside a filesystem subvolume (e.g. btrfs)
+ * Make sure dentry is not of a filesystem subvolume (e.g. btrfs)
* which uses a different fsid than sb root.
*/
- err = vfs_get_fsid(path->dentry->d_sb->s_root, &root_fsid);
+ err = vfs_get_fsid(dentry->d_sb->s_root, &root_fsid);
if (err)
return err;

@@ -1220,6 +1219,12 @@ static int fanotify_test_fid(struct path *path, __kernel_fsid_t *fsid)
root_fsid.val[1] != fsid->val[1])
return -EXDEV;

+ return 0;
+}
+
+/* Check if filesystem can encode a unique fid */
+static int fanotify_test_fid(struct dentry *dentry)
+{
/*
* We need to make sure that the file system supports at least
* encoding a file handle so user can use name_to_handle_at() to
@@ -1227,8 +1232,8 @@ static int fanotify_test_fid(struct path *path, __kernel_fsid_t *fsid)
* objects. However, name_to_handle_at() requires that the
* filesystem also supports decoding file handles.
*/
- if (!path->dentry->d_sb->s_export_op ||
- !path->dentry->d_sb->s_export_op->fh_to_dentry)
+ if (!dentry->d_sb->s_export_op ||
+ !dentry->d_sb->s_export_op->fh_to_dentry)
return -EOPNOTSUPP;

return 0;
@@ -1379,7 +1384,11 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
}

if (fid_mode) {
- ret = fanotify_test_fid(&path, &__fsid);
+ ret = fanotify_test_fsid(path.dentry, &__fsid);
+ if (ret)
+ goto path_put_and_out;
+
+ ret = fanotify_test_fid(path.dentry);
if (ret)
goto path_put_and_out;

--
2.32.0
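
To show why splitting the two checks in patch 03/21 is useful, here is a
small userspace model. struct fs_info, test_fsid(), test_fid() and
check_mark() are hypothetical simplifications; the real functions
operate on dentries and superblocks. The error codes mirror the ones
visible in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified view of what the two checks look at. */
struct fs_info {
	int  fsid[2];		/* fsid of the object being marked */
	int  root_fsid[2];	/* fsid of the superblock root */
	bool can_decode_fh;	/* stands in for ->fh_to_dentry being set */
};

/* fsid must be non-zero and match the fsid of the superblock root. */
static int test_fsid(const struct fs_info *fs)
{
	assert(fs);

	if (!fs->fsid[0] && !fs->fsid[1])
		return -ENODEV;

	if (fs->fsid[0] != fs->root_fsid[0] ||
	    fs->fsid[1] != fs->root_fsid[1])
		return -EXDEV;

	return 0;
}

/* The filesystem must be able to decode file handles. */
static int test_fid(const struct fs_info *fs)
{
	assert(fs);

	return fs->can_decode_fh ? 0 : -EOPNOTSUPP;
}

/* A caller that only needs the fsid can now skip the handle check. */
static int check_mark(const struct fs_info *fs, bool need_fh)
{
	int ret = test_fsid(fs);

	if (ret || !need_fh)
		return ret;

	return test_fid(fs);
}
```

In this model, a filesystem without handle decoding still passes when
need_fh is false, which is the case the commit message anticipates for
FAN_FS_ERROR.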

2021-08-12 22:58:47

by Gabriel Krisman Bertazi

Subject: [PATCH v6 05/21] fanotify: Split superblock marks out to a new cache

FAN_FS_ERROR will require an error structure to be stored per mark.
But, since FAN_FS_ERROR doesn't apply to inode/mount marks, it should
suffice to only expose this information for superblock marks. Therefore,
wrap this kind of mark into a container and plumb it for the future.

Reviewed-by: Amir Goldstein <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v5:
- turn the flag bits into defines (jan)
- don't use zalloc for consistency (jan)
Changes since v2:
- Move mark initialization to fanotify_alloc_mark (Amir)

Changes since v1:
- Only extend superblock marks (Amir)
---
fs/notify/fanotify/fanotify.c | 10 ++++++--
fs/notify/fanotify/fanotify.h | 20 ++++++++++++++++
fs/notify/fanotify/fanotify_user.c | 38 ++++++++++++++++++++++++++++--
3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 310246f8d3f1..c3eefe3f6494 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -869,9 +869,15 @@ static void fanotify_freeing_mark(struct fsnotify_mark *mark,
dec_ucount(group->fanotify_data.ucounts, UCOUNT_FANOTIFY_MARKS);
}

-static void fanotify_free_mark(struct fsnotify_mark *fsn_mark)
+static void fanotify_free_mark(struct fsnotify_mark *mark)
{
- kmem_cache_free(fanotify_mark_cache, fsn_mark);
+ if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
+ struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);
+
+ kmem_cache_free(fanotify_sb_mark_cache, fa_mark);
+ } else {
+ kmem_cache_free(fanotify_mark_cache, mark);
+ }
}

const struct fsnotify_ops fanotify_fsnotify_ops = {
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 4a5e555dc3d2..3b11dd03df59 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -6,6 +6,7 @@
#include <linux/hashtable.h>

extern struct kmem_cache *fanotify_mark_cache;
+extern struct kmem_cache *fanotify_sb_mark_cache;
extern struct kmem_cache *fanotify_fid_event_cachep;
extern struct kmem_cache *fanotify_path_event_cachep;
extern struct kmem_cache *fanotify_perm_event_cachep;
@@ -129,6 +130,25 @@ static inline void fanotify_info_copy_name(struct fanotify_info *info,
name->name);
}

+enum fanotify_mark_bits {
+ FANOTIFY_MARK_FLAG_BIT_SB_MARK = FSN_MARK_PRIVATE_FLAGS,
+};
+
+#define FANOTIFY_MARK_FLAG_SB_MARK \
+ (1 << FANOTIFY_MARK_FLAG_BIT_SB_MARK)
+
+struct fanotify_sb_mark {
+ struct fsnotify_mark fsn_mark;
+};
+
+static inline
+struct fanotify_sb_mark *FANOTIFY_SB_MARK(struct fsnotify_mark *mark)
+{
+ WARN_ON(!(mark->flags & FANOTIFY_MARK_FLAG_SB_MARK));
+
+ return container_of(mark, struct fanotify_sb_mark, fsn_mark);
+}
+
/*
* Common structure for fanotify events. Concrete structs are allocated in
* fanotify_handle_event() and freed when the information is retrieved by
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 67b18dfe0025..c47a5a45c0d3 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -99,6 +99,7 @@ struct ctl_table fanotify_table[] = {
extern const struct fsnotify_ops fanotify_fsnotify_ops;

struct kmem_cache *fanotify_mark_cache __read_mostly;
+struct kmem_cache *fanotify_sb_mark_cache __read_mostly;
struct kmem_cache *fanotify_fid_event_cachep __read_mostly;
struct kmem_cache *fanotify_path_event_cachep __read_mostly;
struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
@@ -915,6 +916,38 @@ static __u32 fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
return mask & ~oldmask;
}

+static struct fsnotify_mark *fanotify_alloc_mark(struct fsnotify_group *group,
+ unsigned int type)
+{
+ struct fanotify_sb_mark *sb_mark;
+ struct fsnotify_mark *mark;
+
+ switch (type) {
+ case FSNOTIFY_OBJ_TYPE_SB:
+ sb_mark = kmem_cache_alloc(fanotify_sb_mark_cache, GFP_KERNEL);
+ if (!sb_mark)
+ return NULL;
+ mark = &sb_mark->fsn_mark;
+ break;
+
+ case FSNOTIFY_OBJ_TYPE_INODE:
+ case FSNOTIFY_OBJ_TYPE_PARENT:
+ case FSNOTIFY_OBJ_TYPE_VFSMOUNT:
+ mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
+ break;
+ default:
+ WARN_ON(1);
+ return NULL;
+ }
+
+ fsnotify_init_mark(mark, group);
+
+ if (type == FSNOTIFY_OBJ_TYPE_SB)
+ mark->flags |= FANOTIFY_MARK_FLAG_SB_MARK;
+
+ return mark;
+}
+
static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
fsnotify_connp_t *connp,
unsigned int type,
@@ -933,13 +966,12 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
!inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_FANOTIFY_MARKS))
return ERR_PTR(-ENOSPC);

- mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
+ mark = fanotify_alloc_mark(group, type);
if (!mark) {
ret = -ENOMEM;
goto out_dec_ucounts;
}

- fsnotify_init_mark(mark, group);
ret = fsnotify_add_mark_locked(mark, connp, type, 0, fsid);
if (ret) {
fsnotify_put_mark(mark);
@@ -1497,6 +1529,8 @@ static int __init fanotify_user_setup(void)

fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
SLAB_PANIC|SLAB_ACCOUNT);
+ fanotify_sb_mark_cache = KMEM_CACHE(fanotify_sb_mark,
+ SLAB_PANIC|SLAB_ACCOUNT);
fanotify_fid_event_cachep = KMEM_CACHE(fanotify_fid_event,
SLAB_PANIC);
fanotify_path_event_cachep = KMEM_CACHE(fanotify_path_event,
--
2.32.0
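
The container pattern used by patch 05/21 can be modelled in userspace
as follows. This is a sketch under assumptions: struct mark, struct
sb_mark, MARK_FLAG_SB and the helpers are hypothetical, calloc()/free()
replace the two kmem caches, and the payload field is placed before the
embedded mark only to make the pointer adjustment visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define MARK_FLAG_SB	(1U << 3)	/* hypothetical backend-private flag */

struct mark {
	unsigned int flags;
};

/* Wrapper allocated only for superblock marks. */
struct sb_mark {
	int payload;		/* placeholder for future per-mark data */
	struct mark base;
};

enum mark_type { MARK_INODE, MARK_SB };

/* Allocate the right object for the type and tag superblock marks. */
static struct mark *alloc_mark(enum mark_type type)
{
	if (type == MARK_SB) {
		struct sb_mark *sb = calloc(1, sizeof(*sb));

		if (!sb)
			return NULL;
		sb->base.flags |= MARK_FLAG_SB;
		return &sb->base;
	}

	return calloc(1, sizeof(struct mark));
}

/* Recover the wrapper; only valid for marks tagged as superblock marks. */
static struct sb_mark *to_sb_mark(struct mark *m)
{
	assert(m->flags & MARK_FLAG_SB);

	return container_of(m, struct sb_mark, base);
}

/* Free through the same allocation the mark came from. */
static void free_mark(struct mark *m)
{
	if (!m)
		return;

	if (m->flags & MARK_FLAG_SB)
		free(to_sb_mark(m));
	else
		free(m);
}
```

The flag is what lets free_mark() pick the matching allocation without
the caller having to remember which kind of mark it holds.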

2021-08-12 22:58:53

by Gabriel Krisman Bertazi

Subject: [PATCH v6 06/21] inotify: Don't force FS_IN_IGNORED

According to Amir:

"FS_IN_IGNORED is completely internal to inotify and there is no need
to set it in i_fsnotify_mask at all, so if we remove the bit from the
output of inotify_arg_to_mask() no functionality will change and we will
be able to overload the event bit for FS_ERROR."

This is done in preparation to overload FS_ERROR with the notification
mechanism in fanotify.

Suggested-by: Amir Goldstein <[email protected]>
Reviewed-by: Amir Goldstein <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/inotify/inotify_user.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 98f61b31745a..4d17be6dd58d 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -89,10 +89,10 @@ static inline __u32 inotify_arg_to_mask(struct inode *inode, u32 arg)
__u32 mask;

/*
- * Everything should accept their own ignored and should receive events
- * when the inode is unmounted. All directories care about children.
+ * Everything should receive events when the inode is unmounted.
+ * All directories care about children.
*/
- mask = (FS_IN_IGNORED | FS_UNMOUNT);
+ mask = (FS_UNMOUNT);
if (S_ISDIR(inode->i_mode))
mask |= FS_EVENT_ON_CHILD;

--
2.32.0

2021-08-12 22:59:29

by Gabriel Krisman Bertazi

Subject: [PATCH v6 07/21] fsnotify: Add helper to detect overflow_event

Similarly to fanotify_is_perm_event and friends, provide a helper
predicate to say whether a mask is of an overflow event.

Suggested-by: Amir Goldstein <[email protected]>
Reviewed-by: Amir Goldstein <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify.h | 3 ++-
include/linux/fsnotify_backend.h | 5 +++++
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 3b11dd03df59..b3ab620822c2 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -335,7 +335,8 @@ static inline struct path *fanotify_event_path(struct fanotify_event *event)
*/
static inline bool fanotify_is_hashed_event(u32 mask)
{
- return !fanotify_is_perm_event(mask) && !(mask & FS_Q_OVERFLOW);
+ return !(fanotify_is_perm_event(mask) ||
+ fsnotify_is_overflow_event(mask));
}

static inline unsigned int fanotify_event_hash_bucket(
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index ae1bd9f06808..cb75fb6d130a 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -506,6 +506,11 @@ static inline void fsnotify_queue_overflow(struct fsnotify_group *group)
fsnotify_add_event(group, group->overflow_event, NULL, NULL);
}

+static inline bool fsnotify_is_overflow_event(u32 mask)
+{
+ return mask & FS_Q_OVERFLOW;
+}
+
static inline bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group)
{
assert_spin_locked(&group->notification_lock);
--
2.32.0

2021-08-12 22:59:39

by Gabriel Krisman Bertazi

Subject: [PATCH v6 08/21] fsnotify: Add wrapper around fsnotify_add_event

fsnotify_add_event is growing in number of parameters, which in most
cases are just passed a NULL pointer. So, split out a new
fsnotify_insert_event function to clean things up for users who don't
need an insert hook.

Suggested-by: Amir Goldstein <[email protected]>
Reviewed-by: Amir Goldstein <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify.c | 4 ++--
fs/notify/inotify/inotify_fsnotify.c | 2 +-
fs/notify/notification.c | 12 ++++++------
include/linux/fsnotify_backend.h | 23 ++++++++++++++++-------
4 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index c3eefe3f6494..acf78c0ed219 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -781,8 +781,8 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
}

fsn_event = &event->fse;
- ret = fsnotify_add_event(group, fsn_event, fanotify_merge,
- fanotify_insert_event);
+ ret = fsnotify_insert_event(group, fsn_event, fanotify_merge,
+ fanotify_insert_event);
if (ret) {
/* Permission events shouldn't be merged */
BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index d1a64daa0171..a96582cbfad1 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -116,7 +116,7 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
if (len)
strcpy(event->name, name->name);

- ret = fsnotify_add_event(group, fsn_event, inotify_merge, NULL);
+ ret = fsnotify_add_event(group, fsn_event, inotify_merge);
if (ret) {
/* Our event wasn't used in the end. Free it. */
fsnotify_destroy_event(group, fsn_event);
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 32f45543b9c6..44bb10f50715 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -78,12 +78,12 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
* 2 if the event was not queued - either the queue of events has overflown
* or the group is shutting down.
*/
-int fsnotify_add_event(struct fsnotify_group *group,
- struct fsnotify_event *event,
- int (*merge)(struct fsnotify_group *,
- struct fsnotify_event *),
- void (*insert)(struct fsnotify_group *,
- struct fsnotify_event *))
+int fsnotify_insert_event(struct fsnotify_group *group,
+ struct fsnotify_event *event,
+ int (*merge)(struct fsnotify_group *,
+ struct fsnotify_event *),
+ void (*insert)(struct fsnotify_group *,
+ struct fsnotify_event *))
{
int ret = 0;
struct list_head *list = &group->notification_list;
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index cb75fb6d130a..e027af3cd8dd 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -494,16 +494,25 @@ extern int fsnotify_fasync(int fd, struct file *file, int on);
extern void fsnotify_destroy_event(struct fsnotify_group *group,
struct fsnotify_event *event);
/* attach the event to the group notification queue */
-extern int fsnotify_add_event(struct fsnotify_group *group,
- struct fsnotify_event *event,
- int (*merge)(struct fsnotify_group *,
- struct fsnotify_event *),
- void (*insert)(struct fsnotify_group *,
- struct fsnotify_event *));
+extern int fsnotify_insert_event(struct fsnotify_group *group,
+ struct fsnotify_event *event,
+ int (*merge)(struct fsnotify_group *,
+ struct fsnotify_event *),
+ void (*insert)(struct fsnotify_group *,
+ struct fsnotify_event *));
+
+static inline int fsnotify_add_event(struct fsnotify_group *group,
+ struct fsnotify_event *event,
+ int (*merge)(struct fsnotify_group *,
+ struct fsnotify_event *))
+{
+ return fsnotify_insert_event(group, event, merge, NULL);
+}
+
/* Queue overflow event to a notification group */
static inline void fsnotify_queue_overflow(struct fsnotify_group *group)
{
- fsnotify_add_event(group, group->overflow_event, NULL, NULL);
+ fsnotify_add_event(group, group->overflow_event, NULL);
}

static inline bool fsnotify_is_overflow_event(u32 mask)
--
2.32.0
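
The shape of the refactoring in patch 08/21 is a full entry point that
takes every hook, plus a thin inline wrapper for callers that do not
need the insert hook. The following sketch reproduces that shape with
hypothetical types (struct queue, struct item) and example hooks; it
does not model the real queue, locking or overflow handling.

```c
#include <assert.h>
#include <stddef.h>

struct queue {
	int queued;	/* items appended to the queue */
	int inserted;	/* times the insert hook ran */
};

struct item {
	int id;
};

typedef int  (*merge_fn)(struct queue *, struct item *);
typedef void (*insert_fn)(struct queue *, struct item *);

/* Full entry point: both hooks are optional. Returns 1 if merged. */
static int queue_insert_item(struct queue *q, struct item *it,
			     merge_fn merge, insert_fn insert)
{
	assert(q && it);

	if (merge && merge(q, it))
		return 1;

	q->queued++;
	if (insert)
		insert(q, it);

	return 0;
}

/* Convenience wrapper for callers without an insert hook. */
static inline int queue_add_item(struct queue *q, struct item *it,
				 merge_fn merge)
{
	return queue_insert_item(q, it, merge, NULL);
}

/* Example hooks, used only to exercise the two entry points. */
static int always_merge(struct queue *q, struct item *it)
{
	(void)q;
	(void)it;
	return 1;
}

static void count_insert(struct queue *q, struct item *it)
{
	(void)it;
	q->inserted++;
}
```

Most call sites shrink to the three-argument form, and only the caller
that maintains a hash table keeps using the four-argument one.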

2021-08-12 23:00:05

by Gabriel Krisman Bertazi

Subject: [PATCH v6 09/21] fsnotify: Allow events reported with an empty inode

Some file system events (e.g. FS_ERROR) might not be associated with an
inode. For these, it makes sense to associate them directly with the
super block of the file system they apply to. This patch allows the
event to be reported with a NULL inode, by recovering the superblock
directly from the data field, if needed.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

--
Changes since v5:
- add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
---
fs/notify/fsnotify.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 30d422b8c0fc..536db02cb26e 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
fsnotify_clear_marks_by_sb(sb);
}

+static struct super_block *fsnotify_data_sb(const void *data, int data_type)
+{
+ struct inode *inode = fsnotify_data_inode(data, data_type);
+ struct super_block *sb = inode ? inode->i_sb : NULL;
+
+ return sb;
+}
+
/*
* Given an inode, first check if we care what happens to our children. Inotify
* and dnotify both tell their parents about events. If we care about any event
@@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
* @file_name is relative to
* @file_name: optional file name associated with event
* @inode: optional inode associated with event -
- * either @dir or @inode must be non-NULL.
- * if both are non-NULL event may be reported to both.
+ * If @dir and @inode are NULL, @data must have a type that
+ * allows retrieving the file system associated with this
+ * event. if both are non-NULL event may be reported to
+ * both.
* @cookie: inotify rename cookie
*/
int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
@@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
*/
parent = dir;
}
- sb = inode->i_sb;
+ sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);

/*
* Optimization: srcu_read_lock() has a memory barrier which can
--
2.32.0

2021-08-12 23:00:18

by Gabriel Krisman Bertazi

Subject: [PATCH v6 10/21] fsnotify: Support FS_ERROR event type

Expose a new type of fsnotify event for filesystems to report errors for
userspace monitoring tools. fanotify will send this type of
notification for FAN_FS_ERROR events. This also introduces a helper for
generating the new event.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v5:
- pass sb inside data field (jan)
Changes since v3:
- Squash patch ("fsnotify: Introduce helpers to send error_events")
- Drop reviewed-bys!

Changes since v2:
- FAN_ERROR->FAN_FS_ERROR (Amir)

Changes since v1:
- Overload FS_ERROR with FS_IN_IGNORED
- Implement support for this type on fsnotify_data_inode (Amir)
---
fs/notify/fsnotify.c | 3 +++
include/linux/fsnotify.h | 13 +++++++++++++
include/linux/fsnotify_backend.h | 18 +++++++++++++++++-
3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 536db02cb26e..6d3b3de4f8ee 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -103,6 +103,9 @@ static struct super_block *fsnotify_data_sb(const void *data, int data_type)
struct inode *inode = fsnotify_data_inode(data, data_type);
struct super_block *sb = inode ? inode->i_sb : NULL;

+ if (!sb && data_type == FSNOTIFY_EVENT_ERROR)
+ sb = ((struct fs_error_report *) data)->sb;
+
return sb;
}

diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index f8acddcf54fb..521234af1827 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -317,4 +317,17 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
fsnotify_dentry(dentry, mask);
}

+static inline int fsnotify_sb_error(struct super_block *sb, struct inode *inode,
+ int error)
+{
+ struct fs_error_report report = {
+ .error = error,
+ .inode = inode,
+ .sb = sb,
+ };
+
+ return fsnotify(FS_ERROR, &report, FSNOTIFY_EVENT_ERROR,
+ NULL, NULL, NULL, 0);
+}
+
#endif /* _LINUX_FS_NOTIFY_H */
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index e027af3cd8dd..277b6f3e0998 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -42,6 +42,12 @@

#define FS_UNMOUNT 0x00002000 /* inode on umount fs */
#define FS_Q_OVERFLOW 0x00004000 /* Event queued overflowed */
+#define FS_ERROR 0x00008000 /* Filesystem Error (fanotify) */
+
+/*
+ * FS_IN_IGNORED overloads FS_ERROR. It is only used internally by inotify
+ * which does not support FS_ERROR.
+ */
#define FS_IN_IGNORED 0x00008000 /* last inotify event here */

#define FS_OPEN_PERM 0x00010000 /* open event in an permission hook */
@@ -95,7 +101,8 @@
#define ALL_FSNOTIFY_EVENTS (ALL_FSNOTIFY_DIRENT_EVENTS | \
FS_EVENTS_POSS_ON_CHILD | \
FS_DELETE_SELF | FS_MOVE_SELF | FS_DN_RENAME | \
- FS_UNMOUNT | FS_Q_OVERFLOW | FS_IN_IGNORED)
+ FS_UNMOUNT | FS_Q_OVERFLOW | FS_IN_IGNORED | \
+ FS_ERROR)

/* Extra flags that may be reported with event or control handling of events */
#define ALL_FSNOTIFY_FLAGS (FS_EXCL_UNLINK | FS_ISDIR | FS_IN_ONESHOT | \
@@ -248,6 +255,13 @@ enum fsnotify_data_type {
FSNOTIFY_EVENT_NONE,
FSNOTIFY_EVENT_PATH,
FSNOTIFY_EVENT_INODE,
+ FSNOTIFY_EVENT_ERROR,
+};
+
+struct fs_error_report {
+ int error;
+ struct inode *inode;
+ struct super_block *sb;
};

static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
@@ -257,6 +271,8 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
return (struct inode *)data;
case FSNOTIFY_EVENT_PATH:
return d_inode(((const struct path *)data)->dentry);
+ case FSNOTIFY_EVENT_ERROR:
+ return ((struct fs_error_report *)data)->inode;
default:
return NULL;
}
--
2.32.0

2021-08-12 23:01:31

by Gabriel Krisman Bertazi

Subject: [PATCH v6 11/21] fanotify: Allow file handle encoding for unhashed events

FAN_FS_ERROR will report a file handle, but it is an unhashed event.
Allow passing a NULL hash to fanotify_encode_fh and avoid calculating
the hash if not needed.
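
As an illustration of the optional out-parameter pattern described above,
here is a minimal userspace sketch. The names model_hash_fh() and
model_encode_fh() are hypothetical stand-ins and not kernel functions; the
FNV-1a hash is only a placeholder for whatever fanotify_hash_fh() computes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for fanotify_hash_fh(): FNV-1a over the handle. */
static uint32_t model_hash_fh(const unsigned char *buf, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Model of the encode step: returns the number of handle bytes and, only
 * when the caller supplied a merge key, mixes the handle hash into it.
 */
static size_t model_encode_fh(const unsigned char *buf, size_t len,
			      uint32_t *hash)
{
	if (hash)
		*hash ^= model_hash_fh(buf, len);
	return len;
}
```

An unhashed caller, such as the error event path, would pass NULL and skip
the hash computation entirely.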

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/notify/fanotify/fanotify.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index acf78c0ed219..50fce4fec0d6 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -403,8 +403,12 @@ static int fanotify_encode_fh(struct fanotify_fh *fh, struct inode *inode,
fh->type = type;
fh->len = fh_len;

- /* Mix fh into event merge key */
- *hash ^= fanotify_hash_fh(fh);
+ /*
+ * Mix fh into event merge key. Hash might be NULL in case of
+ * unhashed FID events (i.e. FAN_FS_ERROR).
+ */
+ if (hash)
+ *hash ^= fanotify_hash_fh(fh);

return FANOTIFY_FH_HDR_LEN + fh_len;

--
2.32.0

2021-08-12 23:01:35

by Gabriel Krisman Bertazi

Subject: [PATCH v6 12/21] fanotify: Encode invalid file handle when no inode is provided

Instead of failing, encode an invalid file handle in fanotify_encode_fh
if no inode is provided. This bogus file handle will be reported by
FAN_FS_ERROR for non-inode errors.

When reported to userspace, the length information is reset and the
handle is cleared, so that userspace has no visibility into the
internal kernel representation of this null handle.

Also adjust the single caller that might rely on failure after passing
a NULL inode.
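
The behaviour can be sketched in plain C as follows. This is a simplified
model, not the kernel code: the model_* names are hypothetical, a flag
replaces the inode pointer, and the zero handle_bytes for the null handle
follows what the cover letter describes.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_FILEID_INVALID	0xff	/* same value as the kernel's FILEID_INVALID */
#define MODEL_NULL_FH_LEN	4	/* mirrors FANOTIFY_NULL_FH_LEN from this patch */

struct model_fh {
	unsigned char type;
	unsigned char len;
};

/* What userspace sees in struct file_handle. */
struct model_reported_handle {
	unsigned int handle_bytes;
	int handle_type;
};

/* has_inode == 0 models an error that is not linked to any inode. */
static void model_encode(struct model_fh *fh, int has_inode,
			 unsigned char real_type, unsigned char real_len)
{
	if (!has_inode) {
		fh->type = MODEL_FILEID_INVALID;
		fh->len = MODEL_NULL_FH_LEN;
		return;
	}
	fh->type = real_type;
	fh->len = real_len;
}

/* The internal length of a null handle is hidden from userspace. */
static struct model_reported_handle model_report(const struct model_fh *fh)
{
	struct model_reported_handle h = { 0, fh->type };

	if (fh->type != MODEL_FILEID_INVALID)
		h.handle_bytes = fh->len;
	return h;
}
```

Internally the null handle still occupies MODEL_NULL_FH_LEN bytes, but the
reported record carries only the invalid type and a zero length.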

Suggested-by: Amir Goldstein <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v5:
- Preserve flags initialization (jan)
- Add BUILD_BUG_ON (amir)
- Require minimum of FANOTIFY_NULL_FH_LEN for fh_len(amir)
- Improve comment to explain the null FH length (jan)
- Simplify logic
---
fs/notify/fanotify/fanotify.c | 27 ++++++++++++++++++-----
fs/notify/fanotify/fanotify_user.c | 35 +++++++++++++++++-------------
2 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 50fce4fec0d6..2b1ab031fbe5 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -334,6 +334,8 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group,
return test_mask & user_mask;
}

+#define FANOTIFY_NULL_FH_LEN 4
+
/*
* Check size needed to encode fanotify_fh.
*
@@ -345,7 +347,7 @@ static int fanotify_encode_fh_len(struct inode *inode)
int dwords = 0;

if (!inode)
- return 0;
+ return FANOTIFY_NULL_FH_LEN;

exportfs_encode_inode_fh(inode, NULL, &dwords, NULL);

@@ -367,11 +369,23 @@ static int fanotify_encode_fh(struct fanotify_fh *fh, struct inode *inode,
void *buf = fh->buf;
int err;

- fh->type = FILEID_ROOT;
- fh->len = 0;
+ BUILD_BUG_ON(FANOTIFY_NULL_FH_LEN < 4 ||
+ FANOTIFY_NULL_FH_LEN > FANOTIFY_INLINE_FH_LEN);
+
fh->flags = 0;
- if (!inode)
- return 0;
+
+ if (!inode) {
+ /*
+ * Invalid FHs are used on FAN_FS_ERROR for errors not
+ * linked to any inode. The f_handle won't be reported
+ * back to userspace. The extra bytes are cleared prior
+ * to reporting.
+ */
+ type = FILEID_INVALID;
+ fh_len = FANOTIFY_NULL_FH_LEN;
+
+ goto success;
+ }

/*
* !gpf means preallocated variable size fh, but fh_len could
@@ -400,6 +414,7 @@ static int fanotify_encode_fh(struct fanotify_fh *fh, struct inode *inode,
if (!type || type == FILEID_INVALID || fh_len != dwords << 2)
goto out_err;

+success:
fh->type = type;
fh->len = fh_len;

@@ -529,7 +544,7 @@ static struct fanotify_event *fanotify_alloc_name_event(struct inode *id,
struct fanotify_info *info;
struct fanotify_fh *dfh, *ffh;
unsigned int dir_fh_len = fanotify_encode_fh_len(id);
- unsigned int child_fh_len = fanotify_encode_fh_len(child);
+ unsigned int child_fh_len = child ? fanotify_encode_fh_len(child) : 0;
unsigned int size;

size = sizeof(*fne) + FANOTIFY_FH_HDR_LEN + dir_fh_len;
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index c47a5a45c0d3..4cacea5fcaca 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -360,7 +360,10 @@ static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
return -EFAULT;

handle.handle_type = fh->type;
- handle.handle_bytes = fh_len;
+
+ /* FILEID_INVALID handle type is reported without its f_handle. */
+ if (fh->type != FILEID_INVALID)
+ handle.handle_bytes = fh_len;
if (copy_to_user(buf, &handle, sizeof(handle)))
return -EFAULT;

@@ -369,20 +372,22 @@ static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
if (WARN_ON_ONCE(len < fh_len))
return -EFAULT;

- /*
- * For an inline fh and inline file name, copy through stack to exclude
- * the copy from usercopy hardening protections.
- */
- fh_buf = fanotify_fh_buf(fh);
- if (fh_len <= FANOTIFY_INLINE_FH_LEN) {
- memcpy(bounce, fh_buf, fh_len);
- fh_buf = bounce;
+ if (fh->type != FILEID_INVALID) {
+ /*
+ * For an inline fh and inline file name, copy through
+ * stack to exclude the copy from usercopy hardening
+ * protections.
+ */
+ fh_buf = fanotify_fh_buf(fh);
+ if (fh_len <= FANOTIFY_INLINE_FH_LEN) {
+ memcpy(bounce, fh_buf, fh_len);
+ fh_buf = bounce;
+ }
+ if (copy_to_user(buf, fh_buf, fh_len))
+ return -EFAULT;
+ buf += fh_len;
+ len -= fh_len;
}
- if (copy_to_user(buf, fh_buf, fh_len))
- return -EFAULT;
-
- buf += fh_len;
- len -= fh_len;

if (name_len) {
/* Copy the filename with terminating null */
@@ -398,7 +403,7 @@ static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
}

/* Pad with 0's */
- WARN_ON_ONCE(len < 0 || len >= FANOTIFY_EVENT_ALIGN);
+ WARN_ON_ONCE(len < 0);
if (len > 0 && clear_user(buf, len))
return -EFAULT;

--
2.32.0

2021-08-12 23:03:49

by Gabriel Krisman Bertazi

Subject: [PATCH v6 13/21] fanotify: Require fid_mode for any non-fd event

Like inode events, FAN_FS_ERROR will require fid mode. Therefore,
convert the verification during fanotify_mark(2) to require fid for any
non-fd event. This means fid_mode will not only be required for inode
events, but for any event that doesn't provide a descriptor.
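
To make the new condition concrete, here is a small standalone model of the
check. The M_* macros are a hypothetical, illustrative subset of the event
bits (the values match the fanotify UAPI, but the real FANOTIFY_FD_EVENTS
and FANOTIFY_EVENT_FLAGS masks contain more bits than shown here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative subset of event bits. */
#define M_FAN_MODIFY	0x00000002u	/* path event, can report event->fd */
#define M_FAN_OPEN	0x00000020u	/* path event, can report event->fd */
#define M_FAN_CREATE	0x00000100u	/* dirent event, no fd */
#define M_FAN_FS_ERROR	0x00008000u	/* error event, no fd */
#define M_FAN_OPEN_PERM	0x00010000u	/* permission event, reports event->fd */
#define M_FAN_ONDIR	0x40000000u	/* flag, not an event */

#define M_FD_EVENTS	(M_FAN_MODIFY | M_FAN_OPEN | M_FAN_OPEN_PERM)
#define M_EVENT_FLAGS	(M_FAN_ONDIR)

/* Returns true if the mask is acceptable for this group and mark type. */
static bool model_mask_allowed(uint32_t mask, bool fid_mode, bool mount_mark)
{
	bool has_non_fd_event = mask & ~(M_FD_EVENTS | M_EVENT_FLAGS);

	return !(has_non_fd_event && (!fid_mode || mount_mark));
}
```

With this shape, any event added later that cannot report a descriptor is
rejected without fid mode by default, instead of having to be listed
explicitly.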

Suggested-by: Amir Goldstein <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
changes since v5:
- Fix condition to include FANOTIFY_EVENT_FLAGS. (me)
- Fix comment indentation (jan)
---
fs/notify/fanotify/fanotify_user.c | 12 ++++++------
include/linux/fanotify.h | 3 +++
2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 4cacea5fcaca..54107f1533d5 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1387,14 +1387,14 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
goto fput_and_out;

/*
- * Events with data type inode do not carry enough information to report
- * event->fd, so we do not allow setting a mask for inode events unless
- * group supports reporting fid.
- * inode events are not supported on a mount mark, because they do not
- * carry enough information (i.e. path) to be filtered by mount point.
+ * Events that do not carry enough information to report
+ * event->fd require a group that supports reporting fid. Those
+ * events are not supported on a mount mark, because they do not
+ * carry enough information (i.e. path) to be filtered by mount
+ * point.
*/
fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
- if (mask & FANOTIFY_INODE_EVENTS &&
+ if (mask & ~(FANOTIFY_FD_EVENTS|FANOTIFY_EVENT_FLAGS) &&
(!fid_mode || mark_type == FAN_MARK_MOUNT))
goto fput_and_out;

diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index a16dbeced152..c05d45bde8b8 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -81,6 +81,9 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
*/
#define FANOTIFY_DIRENT_EVENTS (FAN_MOVE | FAN_CREATE | FAN_DELETE)

+/* Events that can be reported with event->fd */
+#define FANOTIFY_FD_EVENTS (FANOTIFY_PATH_EVENTS | FANOTIFY_PERM_EVENTS)
+
/* Events that can only be reported with data type FSNOTIFY_EVENT_INODE */
#define FANOTIFY_INODE_EVENTS (FANOTIFY_DIRENT_EVENTS | \
FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)
--
2.32.0

2021-08-12 23:03:56

by Gabriel Krisman Bertazi

Subject: [PATCH v6 14/21] fanotify: Reserve UAPI bits for FAN_FS_ERROR

FAN_FS_ERROR allows reporting of event type FS_ERROR to userspace, which
is a mechanism to report file system wide problems via fanotify. This
commit preallocates userspace visible bits to match the FS_ERROR event.

Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify.c | 1 +
include/uapi/linux/fanotify.h | 1 +
2 files changed, 2 insertions(+)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 2b1ab031fbe5..ebb6c557cea1 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -760,6 +760,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
BUILD_BUG_ON(FAN_OPEN_EXEC != FS_OPEN_EXEC);
BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
+ BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);

BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19);

diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index fbf9c5c7dd59..16402037fc7a 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -20,6 +20,7 @@
#define FAN_OPEN_EXEC 0x00001000 /* File was opened for exec */

#define FAN_Q_OVERFLOW 0x00004000 /* Event queued overflowed */
+#define FAN_FS_ERROR 0x00008000 /* Filesystem error */

#define FAN_OPEN_PERM 0x00010000 /* File open in perm check */
#define FAN_ACCESS_PERM 0x00020000 /* File accessed in perm check */
--
2.32.0

2021-08-12 23:05:09

by Gabriel Krisman Bertazi

Subject: [PATCH v6 15/21] fanotify: Preallocate per superblock mark error event

Error reporting needs to be done in an atomic context. This patch
introduces a single error slot for superblock marks that report the
FAN_FS_ERROR event, to be used during event submission.
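
A rough userspace model of the slot lifetime, assuming hypothetical model_*
names and calloc()/free() in place of the kernel allocators: the slot is
allocated once, from a context that may sleep, and is released only together
with the mark.

```c
#include <assert.h>
#include <stdlib.h>

struct model_error_event {
	unsigned int err_count;
};

struct model_sb_mark {
	/* NULL until FAN_FS_ERROR is requested on this mark. */
	struct model_error_event *fee_slot;
};

/* Called from process context, where allocation may sleep. */
static int model_add_mark(struct model_sb_mark *mark, int wants_fs_error)
{
	if (!wants_fs_error || mark->fee_slot)
		return 0;

	mark->fee_slot = calloc(1, sizeof(*mark->fee_slot));
	return mark->fee_slot ? 0 : -1;
}

/* The slot is released together with the mark, never per event. */
static void model_free_mark(struct model_sb_mark *mark)
{
	free(mark->fee_slot);
	mark->fee_slot = NULL;
}
```

Because the memory already exists when an error is reported, the submission
path never needs to allocate.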

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v5:
- Restore mark references. (jan)
- Tie fee slot to the mark lifetime.(jan)
- Don't reallocate event(jan)
---
fs/notify/fanotify/fanotify.c | 12 ++++++++++++
fs/notify/fanotify/fanotify.h | 13 +++++++++++++
fs/notify/fanotify/fanotify_user.c | 31 ++++++++++++++++++++++++++++--
3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index ebb6c557cea1..3bf6fd85c634 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -855,6 +855,14 @@ static void fanotify_free_name_event(struct fanotify_event *event)
kfree(FANOTIFY_NE(event));
}

+static void fanotify_free_error_event(struct fanotify_event *event)
+{
+ /*
+ * The actual event is tied to a mark, and is released on mark
+ * removal
+ */
+}
+
static void fanotify_free_event(struct fsnotify_event *fsn_event)
{
struct fanotify_event *event;
@@ -877,6 +885,9 @@ static void fanotify_free_event(struct fsnotify_event *fsn_event)
case FANOTIFY_EVENT_TYPE_OVERFLOW:
kfree(event);
break;
+ case FANOTIFY_EVENT_TYPE_FS_ERROR:
+ fanotify_free_error_event(event);
+ break;
default:
WARN_ON_ONCE(1);
}
@@ -894,6 +905,7 @@ static void fanotify_free_mark(struct fsnotify_mark *mark)
if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);

+ kfree(fa_mark->fee_slot);
kmem_cache_free(fanotify_sb_mark_cache, fa_mark);
} else {
kmem_cache_free(fanotify_mark_cache, mark);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index b3ab620822c2..3f03333df32f 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -139,6 +139,7 @@ enum fanotify_mark_bits {

struct fanotify_sb_mark {
struct fsnotify_mark fsn_mark;
+ struct fanotify_error_event *fee_slot;
};

static inline
@@ -161,6 +162,7 @@ enum fanotify_event_type {
FANOTIFY_EVENT_TYPE_PATH,
FANOTIFY_EVENT_TYPE_PATH_PERM,
FANOTIFY_EVENT_TYPE_OVERFLOW, /* struct fanotify_event */
+ FANOTIFY_EVENT_TYPE_FS_ERROR, /* struct fanotify_error_event */
__FANOTIFY_EVENT_TYPE_NUM
};

@@ -216,6 +218,17 @@ FANOTIFY_NE(struct fanotify_event *event)
return container_of(event, struct fanotify_name_event, fae);
}

+struct fanotify_error_event {
+ struct fanotify_event fae;
+ struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
+};
+
+static inline struct fanotify_error_event *
+FANOTIFY_EE(struct fanotify_event *event)
+{
+ return container_of(event, struct fanotify_error_event, fae);
+}
+
static inline __kernel_fsid_t *fanotify_event_fsid(struct fanotify_event *event)
{
if (event->type == FANOTIFY_EVENT_TYPE_FID)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 54107f1533d5..b77030386d7f 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -947,8 +947,10 @@ static struct fsnotify_mark *fanotify_alloc_mark(struct fsnotify_group *group,

fsnotify_init_mark(mark, group);

- if (type == FSNOTIFY_OBJ_TYPE_SB)
+ if (type == FSNOTIFY_OBJ_TYPE_SB) {
mark->flags |= FANOTIFY_MARK_FLAG_SB_MARK;
+ sb_mark->fee_slot = NULL;
+ }

return mark;
}
@@ -999,6 +1001,7 @@ static int fanotify_add_mark(struct fsnotify_group *group,
{
struct fsnotify_mark *fsn_mark;
__u32 added;
+ int ret = 0;

mutex_lock(&group->mark_mutex);
fsn_mark = fsnotify_find_mark(connp, group);
@@ -1009,13 +1012,37 @@ static int fanotify_add_mark(struct fsnotify_group *group,
return PTR_ERR(fsn_mark);
}
}
+
+ /*
+ * Error events are allocated per super-block mark only if
+ * strictly needed (i.e. FAN_FS_ERROR was requested).
+ */
+ if (type == FSNOTIFY_OBJ_TYPE_SB && !(flags & FAN_MARK_IGNORED_MASK) &&
+ (mask & FAN_FS_ERROR)) {
+ struct fanotify_sb_mark *sb_mark = FANOTIFY_SB_MARK(fsn_mark);
+
+ if (!sb_mark->fee_slot) {
+ struct fanotify_error_event *fee =
+ kzalloc(sizeof(*fee), GFP_KERNEL_ACCOUNT);
+ if (!fee) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ fanotify_init_event(&fee->fae, 0, FS_ERROR);
+ fee->sb_mark = sb_mark;
+ sb_mark->fee_slot = fee;
+ }
+ }
+
added = fanotify_mark_add_to_mask(fsn_mark, mask, flags);
if (added & ~fsnotify_conn_mask(fsn_mark->connector))
fsnotify_recalc_mask(fsn_mark->connector);
+
+out:
mutex_unlock(&group->mark_mutex);

fsnotify_put_mark(fsn_mark);
- return 0;
+ return ret;
}

static int fanotify_add_vfsmount_mark(struct fsnotify_group *group,
--
2.32.0

2021-08-12 23:05:30

by Gabriel Krisman Bertazi

Subject: [PATCH v6 16/21] fanotify: Handle FAN_FS_ERROR events

Wire up FAN_FS_ERROR in the fanotify_mark syscall. The event can only
be requested for the entire filesystem, thus it requires the
FAN_MARK_FILESYSTEM.

FAN_FS_ERROR has to be handled slightly differently from other events
because it needs to be submitted in an atomic context, using
preallocated memory. This patch implements the submission path by only
storing the first error event that happened in the slot (userspace
resets the slot by reading the event).

Extra error events happening while the slot is occupied are merged into
the original report, and the only information kept for these extra
errors is an accumulator counting the number of events, which is part of
the record reported back to userspace.

Reporting only the first event should be fine, since when a FS error
happens, a cascade of errors usually follows, but the most meaningful
information is (usually) in the first error.
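
The single-slot policy can be modelled in a few lines of C. This is only a
sketch with hypothetical names; in the kernel the counter is updated under
group->notification_lock, and the error code itself is added to the record
by a separate patch in this series.

```c
#include <assert.h>
#include <stdbool.h>

struct model_error_slot {
	int error;		/* first error seen since the last read */
	unsigned int err_count;	/* all errors seen since the last read */
};

/*
 * Returns true when the caller must queue the event, i.e. the slot was
 * empty. Later errors only bump the counter.
 */
static bool model_report_error(struct model_error_slot *slot, int error)
{
	if (slot->err_count++)
		return false;	/* slot busy: merge into the pending report */

	slot->error = error;
	return true;
}
```

After three reports against an empty slot, the slot still holds the first
error code and a count of three.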

The event dequeueing is also a bit special to avoid losing events. Since
event merging only happens while the event is queued, there is a window
between when an error event is dequeued (notification_lock is dropped)
and when it is reset (.free_event()) where the slot is full, but no
merges can happen.

The proposed solution is to copy the event to the stack prior to
dropping the lock. This way, if a new error arrives between the time the
event is dequeued and the time it would have been reset, the new error
is still logged and merged into the freshly reset slot.
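
A minimal sketch of the copy-then-reset step, assuming a hypothetical
struct model_fee in place of struct fanotify_error_event. In the kernel both
statements run while group->notification_lock is still held.

```c
#include <assert.h>

struct model_fee {
	int error;
	unsigned int err_count;
};

/*
 * Snapshot the slot for the reader and immediately free it for new
 * errors. The reader only ever touches the on-stack copy afterwards.
 */
static void model_dequeue(struct model_fee *slot, struct model_fee *on_stack)
{
	*on_stack = *slot;
	slot->err_count = 0;
}
```

Errors arriving after the snapshot land in the reset slot and do not alter
the copy that is being reported.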

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v5:
- Copy to stack instead of replacing the fee slot(jan)
- prepare error slot outside of the notification lock(jan)
Changes since v4:
- Split parts to earlier patches (amir)
- Simplify fanotify entry replacement
- Update handle size prediction on overflow
Changes since v3:
- Convert WARN_ON to pr_warn (amir)
- Remove unnecessary READ/WRITE_ONCE (amir)
- Alloc with GFP_KERNEL_ACCOUNT(amir)
- Simplify flags on mark allocation (amir)
- Avoid atomic set of error_count (amir)
- Simplify rules when merging error_event (amir)
- Allocate new error_event on get_one_event (amir)
- Report superblock error with invalid FH (amir,jan)

Changes since v2:
- Support and require FID mode (amir)
- Goto error path instead of early return (amir)
- Simplify get_one_event (me)
- Base merging on error_count
- drop fanotify_queue_error_event

Changes since v1:
- Pass dentry to fanotify_check_fsid (Amir)
- FANOTIFY_EVENT_TYPE_ERROR -> FANOTIFY_EVENT_TYPE_FS_ERROR
- Merge previous patch into it
- Use a single slot
- Move fanotify_mark.error_event definition to this commit
- Rename FAN_ERROR -> FAN_FS_ERROR
- Restrict FAN_FS_ERROR to FAN_MARK_FILESYSTEM
---
fs/notify/fanotify/fanotify.c | 57 +++++++++++++++++++++++++++++-
fs/notify/fanotify/fanotify.h | 21 +++++++++++
fs/notify/fanotify/fanotify_user.c | 39 ++++++++++++++++++--
include/linux/fanotify.h | 6 +++-
4 files changed, 119 insertions(+), 4 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 3bf6fd85c634..0c7667d3f5d1 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -709,6 +709,55 @@ static __kernel_fsid_t fanotify_get_fsid(struct fsnotify_iter_info *iter_info)
return fsid;
}

+static void fanotify_insert_error_event(struct fsnotify_group *group,
+ struct fsnotify_event *fsn_event)
+
+{
+ struct fanotify_event *event = FANOTIFY_E(fsn_event);
+
+ if (!fanotify_is_error_event(event->mask))
+ return;
+
+ /*
+ * Prevent the mark from going away while an outstanding error
+ * event is queued. The reference is released by
+ * fanotify_dequeue_first_event.
+ */
+ fsnotify_get_mark(&FANOTIFY_EE(event)->sb_mark->fsn_mark);
+
+}
+
+static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
+ struct fsnotify_group *group,
+ const struct fs_error_report *report)
+{
+ struct fanotify_sb_mark *sb_mark =
+ FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
+ struct fanotify_error_event *fee = sb_mark->fee_slot;
+
+ spin_lock(&group->notification_lock);
+ if (fee->err_count++) {
+ spin_unlock(&group->notification_lock);
+ return 0;
+ }
+ spin_unlock(&group->notification_lock);
+
+ fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
+
+ if (fsnotify_insert_event(group, &fee->fae.fse,
+ NULL, fanotify_insert_error_event)) {
+ /*
+ * Even if an error occurred, an overflow event is
+ * queued. Just reset the error count and succeed.
+ */
+ spin_lock(&group->notification_lock);
+ fanotify_reset_error_slot(fee);
+ spin_unlock(&group->notification_lock);
+ }
+
+ return 0;
+}
+
/*
* Add an event to hash table for faster merge.
*/
@@ -762,7 +811,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);

- BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 20);

mask = fanotify_group_event_mask(group, iter_info, mask, data,
data_type, dir);
@@ -787,6 +836,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
return 0;
}

+ if (fanotify_is_error_event(mask))
+ return fanotify_handle_error_event(iter_info, group, data);
+
event = fanotify_alloc_event(group, mask, data, data_type, dir,
file_name, &fsid);
ret = -ENOMEM;
@@ -857,10 +909,13 @@ static void fanotify_free_name_event(struct fanotify_event *event)

static void fanotify_free_error_event(struct fanotify_event *event)
{
+ struct fanotify_error_event *fee = FANOTIFY_EE(event);
+
/*
* The actual event is tied to a mark, and is released on mark
* removal
*/
+ fsnotify_put_mark(&fee->sb_mark->fsn_mark);
}

static void fanotify_free_event(struct fsnotify_event *fsn_event)
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 3f03333df32f..eeb4a85af74e 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -220,6 +220,8 @@ FANOTIFY_NE(struct fanotify_event *event)

struct fanotify_error_event {
struct fanotify_event fae;
+ u32 err_count; /* Suppressed errors count */
+
struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
};

@@ -320,6 +322,11 @@ static inline struct fanotify_event *FANOTIFY_E(struct fsnotify_event *fse)
return container_of(fse, struct fanotify_event, fse);
}

+static inline bool fanotify_is_error_event(u32 mask)
+{
+ return mask & FAN_FS_ERROR;
+}
+
static inline bool fanotify_event_has_path(struct fanotify_event *event)
{
return event->type == FANOTIFY_EVENT_TYPE_PATH ||
@@ -349,6 +356,7 @@ static inline struct path *fanotify_event_path(struct fanotify_event *event)
static inline bool fanotify_is_hashed_event(u32 mask)
{
return !(fanotify_is_perm_event(mask) ||
+ fanotify_is_error_event(mask) ||
fsnotify_is_overflow_event(mask));
}

@@ -358,3 +366,16 @@ static inline unsigned int fanotify_event_hash_bucket(
{
return event->hash & FANOTIFY_HTABLE_MASK;
}
+
+/*
+ * Reset the FAN_FS_ERROR event slot
+ *
+ * This is used to restore the error event slot to a zeroed state,
+ * where it can be used for a new incoming error. It does not
+ * initialize the event, but clears only the data required to free the
+ * slot.
+ */
+static inline void fanotify_reset_error_slot(struct fanotify_error_event *fee)
+{
+ fee->err_count = 0;
+}
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index b77030386d7f..3fff0c994dc8 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -167,6 +167,19 @@ static void fanotify_unhash_event(struct fsnotify_group *group,
hlist_del_init(&event->merge_list);
}

+static struct fanotify_event *fanotify_dup_error_to_stack(
+ struct fanotify_error_event *fee,
+ struct fanotify_error_event *error_on_stack)
+{
+ fanotify_init_event(&error_on_stack->fae, 0, FS_ERROR);
+
+ error_on_stack->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
+ error_on_stack->err_count = fee->err_count;
+ error_on_stack->sb_mark = fee->sb_mark;
+
+ return &error_on_stack->fae;
+}
+
/*
* Get an fanotify notification event if one exists and is small
* enough to fit in "count". Return an error pointer if the count
@@ -174,7 +187,9 @@ static void fanotify_unhash_event(struct fsnotify_group *group,
* updated accordingly.
*/
static struct fanotify_event *get_one_event(struct fsnotify_group *group,
- size_t count)
+ size_t count,
+ struct fanotify_error_event *error_on_stack)
+
{
size_t event_size;
struct fanotify_event *event = NULL;
@@ -205,6 +220,16 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
FANOTIFY_PERM(event)->state = FAN_EVENT_REPORTED;
if (fanotify_is_hashed_event(event->mask))
fanotify_unhash_event(group, event);
+
+ if (fanotify_is_error_event(event->mask)) {
+ /*
+ * Error events are returned as a copy of the error
+ * slot. The actual error slot is reused.
+ */
+ fanotify_dup_error_to_stack(FANOTIFY_EE(event), error_on_stack);
+ fanotify_reset_error_slot(FANOTIFY_EE(event));
+ event = &error_on_stack->fae;
+ }
out:
spin_unlock(&group->notification_lock);
return event;
@@ -564,6 +589,7 @@ static __poll_t fanotify_poll(struct file *file, poll_table *wait)
static ssize_t fanotify_read(struct file *file, char __user *buf,
size_t count, loff_t *pos)
{
+ struct fanotify_error_event error_on_stack;
struct fsnotify_group *group;
struct fanotify_event *event;
char __user *start;
@@ -582,7 +608,7 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
* in case there are lots of available events.
*/
cond_resched();
- event = get_one_event(group, count);
+ event = get_one_event(group, count, &error_on_stack);
if (IS_ERR(event)) {
ret = PTR_ERR(event);
break;
@@ -1031,6 +1057,10 @@ static int fanotify_add_mark(struct fsnotify_group *group,
fanotify_init_event(&fee->fae, 0, FS_ERROR);
fee->sb_mark = sb_mark;
sb_mark->fee_slot = fee;
+
+ /* Mark the error slot ready to receive events. */
+ fanotify_reset_error_slot(fee);
+
}
}

@@ -1459,6 +1489,11 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
fsid = &__fsid;
}

+ if (mask & FAN_FS_ERROR && mark_type != FAN_MARK_FILESYSTEM) {
+ ret = -EINVAL;
+ goto path_put_and_out;
+ }
+
/* inode held in place by reference to path; group by fget on fd */
if (mark_type == FAN_MARK_INODE)
inode = path.dentry->d_inode;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index c05d45bde8b8..c4d49308b2d0 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -88,9 +88,13 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
#define FANOTIFY_INODE_EVENTS (FANOTIFY_DIRENT_EVENTS | \
FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)

+/* Events that can only be reported with data type FSNOTIFY_EVENT_ERROR */
+#define FANOTIFY_ERROR_EVENTS (FAN_FS_ERROR)
+
/* Events that user can request to be notified on */
#define FANOTIFY_EVENTS (FANOTIFY_PATH_EVENTS | \
- FANOTIFY_INODE_EVENTS)
+ FANOTIFY_INODE_EVENTS | \
+ FANOTIFY_ERROR_EVENTS)

/* Events that require a permission response from user */
#define FANOTIFY_PERM_EVENTS (FAN_OPEN_PERM | FAN_ACCESS_PERM | \
--
2.32.0

2021-08-12 23:05:57

by Gabriel Krisman Bertazi

Subject: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

The error info type is a record sent to users on FAN_FS_ERROR events
documenting the type of error. It also carries an error count,
documenting how many errors were observed since the last report.
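
For illustration, a userspace consumer could locate this record by walking
the info records that follow the event metadata. The sketch below uses local
mirrors of the UAPI layouts (the model_* names are hypothetical) so that it
builds without headers carrying this patch; the field layout is copied from
struct fanotify_event_info_error as defined here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_INFO_TYPE_ERROR 4	/* FAN_EVENT_INFO_TYPE_ERROR in this patch */

/* Local mirror of struct fanotify_event_info_header. */
struct model_info_header {
	uint8_t info_type;
	uint8_t pad;
	uint16_t len;
};

/* Local mirror of struct fanotify_event_info_error. */
struct model_info_error {
	struct model_info_header hdr;
	int32_t error;
	uint32_t error_count;
};

/*
 * Walk the info records in buf and copy out the error record.
 * Returns 1 if found, 0 if absent or if a record is malformed.
 */
static int model_find_error_info(const unsigned char *buf, size_t len,
				 struct model_info_error *out)
{
	size_t off = 0;

	while (len - off >= sizeof(struct model_info_header)) {
		struct model_info_header hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > len - off)
			return 0;	/* malformed record */

		if (hdr.info_type == MODEL_INFO_TYPE_ERROR &&
		    hdr.len >= sizeof(*out)) {
			memcpy(out, buf + off, sizeof(*out));
			return 1;
		}
		off += hdr.len;
	}
	return 0;
}
```

The walk relies only on hdr.len, so it does not depend on where the error
record sits relative to the FID records.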

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v5:
- Move error code here
---
fs/notify/fanotify/fanotify.c | 1 +
fs/notify/fanotify/fanotify.h | 1 +
fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
include/uapi/linux/fanotify.h | 7 ++++++
4 files changed, 45 insertions(+)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index f5c16ac37835..b49a474c1d7f 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -745,6 +745,7 @@ static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
spin_unlock(&group->notification_lock);

fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
+ fee->error = report->error;
fee->fsid = fee->sb_mark->fsn_mark.connector->fsid;

fh_len = fanotify_encode_fh_len(inode);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 158cf0c4b0bd..0cfe376c6fd9 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -220,6 +220,7 @@ FANOTIFY_NE(struct fanotify_event *event)

struct fanotify_error_event {
struct fanotify_event fae;
+ s32 error; /* Error reported by the Filesystem. */
u32 err_count; /* Suppressed errors count */

struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 1ab8f9d8b3ac..ca53159ce673 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -107,6 +107,8 @@ struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
#define FANOTIFY_EVENT_ALIGN 4
#define FANOTIFY_INFO_HDR_LEN \
(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
+#define FANOTIFY_INFO_ERROR_LEN \
+ (sizeof(struct fanotify_event_info_error))

static int fanotify_fid_info_len(int fh_len, int name_len)
{
@@ -130,6 +132,9 @@ static size_t fanotify_event_len(struct fanotify_event *event,
if (!fid_mode)
return event_len;

+ if (fanotify_is_error_event(event->mask))
+ event_len += FANOTIFY_INFO_ERROR_LEN;
+
info = fanotify_event_info(event);
dir_fh_len = fanotify_event_dir_fh_len(event);
fh_len = fanotify_event_object_fh_len(event);
@@ -176,6 +181,7 @@ static struct fanotify_event *fanotify_dup_error_to_stack(
error_on_stack->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
error_on_stack->err_count = fee->err_count;
error_on_stack->sb_mark = fee->sb_mark;
+ error_on_stack->error = fee->error;

error_on_stack->fsid = fee->fsid;

@@ -342,6 +348,28 @@ static int process_access_response(struct fsnotify_group *group,
return -ENOENT;
}

+static int copy_error_info_to_user(struct fanotify_event *event,
+ char __user *buf, int count)
+{
+ struct fanotify_event_info_error info;
+ struct fanotify_error_event *fee = FANOTIFY_EE(event);
+
+ info.hdr.info_type = FAN_EVENT_INFO_TYPE_ERROR;
+ info.hdr.pad = 0;
+ info.hdr.len = FANOTIFY_INFO_ERROR_LEN;
+
+ if (WARN_ON(count < info.hdr.len))
+ return -EFAULT;
+
+ info.error = fee->error;
+ info.error_count = fee->err_count;
+
+ if (copy_to_user(buf, &info, sizeof(info)))
+ return -EFAULT;
+
+ return info.hdr.len;
+}
+
static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
int info_type, const char *name, size_t name_len,
char __user *buf, size_t count)
@@ -505,6 +533,14 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
if (f)
fd_install(fd, f);

+ if (fanotify_is_error_event(event->mask)) {
+ ret = copy_error_info_to_user(event, buf, count);
+ if (ret < 0)
+ goto out_close_fd;
+ buf += ret;
+ count -= ret;
+ }
+
/* Event info records order is: dir fid + name, child fid */
if (fanotify_event_dir_fh_len(event)) {
info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 16402037fc7a..80040a92e9d9 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -124,6 +124,7 @@ struct fanotify_event_metadata {
#define FAN_EVENT_INFO_TYPE_FID 1
#define FAN_EVENT_INFO_TYPE_DFID_NAME 2
#define FAN_EVENT_INFO_TYPE_DFID 3
+#define FAN_EVENT_INFO_TYPE_ERROR 4

/* Variable length info record following event metadata */
struct fanotify_event_info_header {
@@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
unsigned char handle[0];
};

+struct fanotify_event_info_error {
+ struct fanotify_event_info_header hdr;
+ __s32 error;
+ __u32 error_count;
+};
+
struct fanotify_response {
__s32 fd;
__u32 response;
--
2.32.0

2021-08-12 23:06:04

by Gabriel Krisman Bertazi

Subject: [PATCH v6 17/21] fanotify: Report fid info for file related file system errors

Plumb the pieces to add a FID report to error records. Since all error
event memory must be pre-allocated, the error slot reserves room for the
largest possible file handle (MAX_HANDLE_SZ). If an encoded handle would
ever exceed that, which should never happen, the error is reported
against the superblock instead.

Errors that are not associated with an inode are reported with an
invalid FID.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v5:
- Use preallocated MAX_HANDLE_SZ FH buffer
- Report superblock errors with a zero-length INVALID FID (jan, amir)
---
fs/notify/fanotify/fanotify.c | 15 +++++++++++++++
fs/notify/fanotify/fanotify.h | 11 +++++++++++
fs/notify/fanotify/fanotify_user.c | 7 +++++++
3 files changed, 33 insertions(+)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 0c7667d3f5d1..f5c16ac37835 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -734,6 +734,8 @@ static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
struct fanotify_sb_mark *sb_mark =
FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
struct fanotify_error_event *fee = sb_mark->fee_slot;
+ struct inode *inode = report->inode;
+ int fh_len;

spin_lock(&group->notification_lock);
if (fee->err_count++) {
@@ -743,6 +745,19 @@ static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
spin_unlock(&group->notification_lock);

fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
+ fee->fsid = fee->sb_mark->fsn_mark.connector->fsid;
+
+ fh_len = fanotify_encode_fh_len(inode);
+ if (WARN_ON(fh_len > MAX_HANDLE_SZ)) {
+ /*
+ * Fallback to reporting the error against the super
+ * block. It should never happen.
+ */
+ inode = NULL;
+ fh_len = fanotify_encode_fh_len(NULL);
+ }
+
+ fanotify_encode_fh(&fee->object_fh, inode, fh_len, NULL, 0);

if (fsnotify_insert_event(group, &fee->fae.fse,
NULL, fanotify_insert_error_event)) {
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index eeb4a85af74e..158cf0c4b0bd 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -223,6 +223,13 @@ struct fanotify_error_event {
u32 err_count; /* Suppressed errors count */

struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
+
+ __kernel_fsid_t fsid; /* FSID this error refers to. */
+
+ /* object_fh must be followed by the inline handle buffer. */
+ struct fanotify_fh object_fh;
+ /* Reserve space in object_fh.buf[] - access with fanotify_fh_buf() */
+ unsigned char _inline_fh_buf[MAX_HANDLE_SZ];
};

static inline struct fanotify_error_event *
@@ -237,6 +244,8 @@ static inline __kernel_fsid_t *fanotify_event_fsid(struct fanotify_event *event)
return &FANOTIFY_FE(event)->fsid;
else if (event->type == FANOTIFY_EVENT_TYPE_FID_NAME)
return &FANOTIFY_NE(event)->fsid;
+ else if (event->type == FANOTIFY_EVENT_TYPE_FS_ERROR)
+ return &FANOTIFY_EE(event)->fsid;
else
return NULL;
}
@@ -248,6 +257,8 @@ static inline struct fanotify_fh *fanotify_event_object_fh(
return &FANOTIFY_FE(event)->object_fh;
else if (event->type == FANOTIFY_EVENT_TYPE_FID_NAME)
return fanotify_info_file_fh(&FANOTIFY_NE(event)->info);
+ else if (event->type == FANOTIFY_EVENT_TYPE_FS_ERROR)
+ return &FANOTIFY_EE(event)->object_fh;
else
return NULL;
}
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 3fff0c994dc8..1ab8f9d8b3ac 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -177,6 +177,13 @@ static struct fanotify_event *fanotify_dup_error_to_stack(
error_on_stack->err_count = fee->err_count;
error_on_stack->sb_mark = fee->sb_mark;

+ error_on_stack->fsid = fee->fsid;
+
+ memcpy(&error_on_stack->object_fh, &fee->object_fh,
+ sizeof(fee->object_fh));
+ memcpy(error_on_stack->object_fh.buf, fee->object_fh.buf,
+ fee->object_fh.len);
+
return &error_on_stack->fae;
}

--
2.32.0
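
As an illustration of the fallback described in this patch, here is a
minimal userspace sketch, not kernel code. The struct err_fh_slot type and
slot_prepare() are hypothetical names; MAX_HANDLE_SZ and the 4-byte null
handle length mirror the values used in the series. It only models the
decision of whether to keep the inode handle or to report against the
superblock.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLE_SZ 128 /* largest handle the slot can hold */
#define NULL_FH_LEN   4   /* stand-in for FANOTIFY_NULL_FH_LEN */

/* Hypothetical model of the preallocated handle area of an error slot. */
struct err_fh_slot {
	int has_inode;                    /* 0: reported against the superblock */
	int fh_len;                       /* handle bytes used by this event */
	unsigned char buf[MAX_HANDLE_SZ]; /* room for the largest handle */
};

/*
 * Pick the handle length for an event. needed_len is the size the file
 * system asks for when encoding the inode. If there is no inode, or the
 * handle would not fit, fall back to a null handle of NULL_FH_LEN bytes.
 */
int slot_prepare(struct err_fh_slot *slot, int has_inode, int needed_len)
{
	assert(slot != NULL);

	if (!has_inode || needed_len > MAX_HANDLE_SZ) {
		slot->has_inode = 0;
		slot->fh_len = NULL_FH_LEN;
	} else {
		slot->has_inode = 1;
		slot->fh_len = needed_len;
	}
	return slot->fh_len;
}
```

With this model, slot_prepare(&s, 1, 8) keeps the 8-byte inode handle, while
a request larger than MAX_HANDLE_SZ degrades to the superblock-level report.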

2021-08-12 23:06:19

by Gabriel Krisman Bertazi

Subject: [PATCH v6 19/21] ext4: Send notifications on error

Send a FS_ERROR message via fsnotify to a userspace monitoring tool
whenever an ext4 error condition is triggered. This follows the existing
error conditions in ext4, so it is hooked to the ext4_error* functions.

It also follows the format of the current dmesg reporting. The
filesystem message is composed mostly of the string that would
otherwise be printed in dmesg.

A new ext4 specific record format is exposed in the uapi, such that a
monitoring tool knows what to expect when listening for errors of an
ext4 filesystem.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Reviewed-by: Amir Goldstein <[email protected]>
---
fs/ext4/super.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index dfa09a277b56..b9ecd43678d7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -46,6 +46,7 @@
#include <linux/part_stat.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
+#include <linux/fsnotify.h>

#include "ext4.h"
#include "ext4_extents.h" /* Needed for trace points definition */
@@ -762,6 +763,8 @@ void __ext4_error(struct super_block *sb, const char *function,
sb->s_id, function, line, current->comm, &vaf);
va_end(args);
}
+ fsnotify_sb_error(sb, NULL, error);
+
ext4_handle_error(sb, force_ro, error, 0, block, function, line);
}

@@ -792,6 +795,8 @@ void __ext4_error_inode(struct inode *inode, const char *function,
current->comm, &vaf);
va_end(args);
}
+ fsnotify_sb_error(inode->i_sb, inode, error);
+
ext4_handle_error(inode->i_sb, false, error, inode->i_ino, block,
function, line);
}
@@ -830,6 +835,8 @@ void __ext4_error_file(struct file *file, const char *function,
current->comm, path, &vaf);
va_end(args);
}
+ fsnotify_sb_error(inode->i_sb, inode, EFSCORRUPTED);
+
ext4_handle_error(inode->i_sb, false, EFSCORRUPTED, inode->i_ino, block,
function, line);
}
@@ -897,6 +904,7 @@ void __ext4_std_error(struct super_block *sb, const char *function,
printk(KERN_CRIT "EXT4-fs error (device %s) in %s:%d: %s\n",
sb->s_id, function, line, errstr);
}
+ fsnotify_sb_error(sb, sb->s_root->d_inode, errno);

ext4_handle_error(sb, false, -errno, 0, 0, function, line);
}
--
2.32.0

2021-08-12 23:06:44

by Gabriel Krisman Bertazi

Subject: [PATCH v6 20/21] samples: Add fs error monitoring example

Introduce an example of a FAN_FS_ERROR fanotify user to track filesystem
errors.

Reviewed-by: Amir Goldstein <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes since v4:
- Protect file_handle defines with ifdef guards

Changes since v1:
- minor fixes
---
samples/Kconfig | 9 +++
samples/Makefile | 1 +
samples/fanotify/Makefile | 5 ++
samples/fanotify/fs-monitor.c | 138 ++++++++++++++++++++++++++++++++++
4 files changed, 153 insertions(+)
create mode 100644 samples/fanotify/Makefile
create mode 100644 samples/fanotify/fs-monitor.c

diff --git a/samples/Kconfig b/samples/Kconfig
index b0503ef058d3..88353b8eac0b 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -120,6 +120,15 @@ config SAMPLE_CONNECTOR
with it.
See also Documentation/driver-api/connector.rst

+config SAMPLE_FANOTIFY_ERROR
+ bool "Build fanotify error monitoring sample"
+ depends on FANOTIFY
+ help
+ When enabled, this builds example code that uses the
+ FAN_FS_ERROR fanotify mechanism to monitor filesystem
+ errors.
+ See also Documentation/admin-guide/filesystem-monitoring.rst.
+
config SAMPLE_HIDRAW
bool "hidraw sample"
depends on CC_CAN_LINK && HEADERS_INSTALL
diff --git a/samples/Makefile b/samples/Makefile
index 087e0988ccc5..931a81847c48 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -5,6 +5,7 @@ subdir-$(CONFIG_SAMPLE_AUXDISPLAY) += auxdisplay
subdir-$(CONFIG_SAMPLE_ANDROID_BINDERFS) += binderfs
obj-$(CONFIG_SAMPLE_CONFIGFS) += configfs/
obj-$(CONFIG_SAMPLE_CONNECTOR) += connector/
+obj-$(CONFIG_SAMPLE_FANOTIFY_ERROR) += fanotify/
subdir-$(CONFIG_SAMPLE_HIDRAW) += hidraw
obj-$(CONFIG_SAMPLE_HW_BREAKPOINT) += hw_breakpoint/
obj-$(CONFIG_SAMPLE_KDB) += kdb/
diff --git a/samples/fanotify/Makefile b/samples/fanotify/Makefile
new file mode 100644
index 000000000000..e20db1bdde3b
--- /dev/null
+++ b/samples/fanotify/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0-only
+userprogs-always-y += fs-monitor
+
+userccflags += -I usr/include -Wall
+
diff --git a/samples/fanotify/fs-monitor.c b/samples/fanotify/fs-monitor.c
new file mode 100644
index 000000000000..e115053382be
--- /dev/null
+++ b/samples/fanotify/fs-monitor.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2021, Collabora Ltd.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <err.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <sys/fanotify.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <stdint.h>
+
+#ifndef FAN_FS_ERROR
+#define FAN_FS_ERROR 0x00008000
+#define FAN_EVENT_INFO_TYPE_ERROR 4
+
+struct fanotify_event_info_error {
+ struct fanotify_event_info_header hdr;
+ __s32 error;
+ __u32 error_count;
+};
+#endif
+
+#ifndef FILEID_INO32_GEN
+#define FILEID_INO32_GEN 1
+#endif
+
+#ifndef FILEID_INVALID
+#define FILEID_INVALID 0xff
+#endif
+
+static void print_fh(struct file_handle *fh)
+{
+ unsigned int i;
+ uint32_t *h = (uint32_t *) fh->f_handle;
+
+ printf("\tfh: ");
+ for (i = 0; i < fh->handle_bytes; i++)
+ printf("%02hhx", fh->f_handle[i]);
+ printf("\n");
+
+ printf("\tdecoded fh: ");
+ if (fh->handle_type == FILEID_INO32_GEN)
+ printf("inode=%u gen=%u\n", h[0], h[1]);
+ else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
+ printf("Type %d (Superblock error)\n", fh->handle_type);
+ else
+ printf("Type %d (Unknown)\n", fh->handle_type);
+
+}
+
+static void handle_notifications(char *buffer, int len)
+{
+ struct fanotify_event_metadata *metadata;
+ struct fanotify_event_info_error *error;
+ struct fanotify_event_info_fid *fid;
+ char *next;
+
+ for (metadata = (struct fanotify_event_metadata *) buffer;
+ FAN_EVENT_OK(metadata, len);
+ metadata = FAN_EVENT_NEXT(metadata, len)) {
+ next = (char *)metadata + metadata->event_len;
+ if (metadata->mask != FAN_FS_ERROR) {
+ printf("Unexpected event mask: %llx\n", (unsigned long long) metadata->mask);
+ goto next_event;
+ } else if (metadata->fd != FAN_NOFD) {
+ printf("Unexpected fd (!= FAN_NOFD)\n");
+ goto next_event;
+ }
+
+ printf("FAN_FS_ERROR found len=%d\n", metadata->event_len);
+
+ error = (struct fanotify_event_info_error *) (metadata+1);
+ if (error->hdr.info_type != FAN_EVENT_INFO_TYPE_ERROR) {
+ printf("unknown record: %d (Expecting TYPE_ERROR)\n",
+ error->hdr.info_type);
+ goto next_event;
+ }
+
+ printf("\tGeneric Error Record: len=%d\n", error->hdr.len);
+ printf("\terror: %d\n", error->error);
+ printf("\terror_count: %u\n", error->error_count);
+
+ fid = (struct fanotify_event_info_fid *) (error + 1);
+ if ((char *) fid >= next) {
+ printf("Event doesn't have FID\n");
+ goto next_event;
+ }
+ printf("FID record found\n");
+
+ if (fid->hdr.info_type != FAN_EVENT_INFO_TYPE_FID) {
+ printf("unknown record: %d (Expecting TYPE_FID)\n",
+ fid->hdr.info_type);
+ goto next_event;
+ }
+ printf("\tfsid: %x%x\n", fid->fsid.val[0], fid->fsid.val[1]);
+ print_fh((struct file_handle *) &fid->handle);
+
+next_event:
+ printf("---\n\n");
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int fd;
+
+ char buffer[BUFSIZ];
+
+ if (argc < 2) {
+ printf("Missing path argument\n");
+ return 1;
+ }
+
+ fd = fanotify_init(FAN_CLASS_NOTIF|FAN_REPORT_FID, O_RDONLY);
+ if (fd < 0)
+ err(1, "fanotify_init");
+
+ if (fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM,
+ FAN_FS_ERROR, AT_FDCWD, argv[1])) {
+ err(1, "fanotify_mark");
+ }
+
+ while (1) {
+ int n = read(fd, buffer, BUFSIZ);
+
+ if (n < 0)
+ err(1, "read");
+
+ handle_notifications(buffer, n);
+ }
+
+ return 0;
+}
--
2.32.0
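
The sample above assumes the error record comes first and the FID record
second. A more defensive way to read the payload is to walk the info
records by their hdr.len field. The sketch below is only an illustration
under that assumption: struct info_hdr is a local mirror of
struct fanotify_event_info_header, and find_info_record() is a hypothetical
helper, not part of the sample or of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct fanotify_event_info_header (uapi layout). */
struct info_hdr {
	uint8_t  info_type;
	uint8_t  pad;
	uint16_t len; /* length of the whole record, header included */
};

/*
 * Find the first info record of type 'wanted' in [buf, buf + size).
 * Returns its byte offset, or -1 if it is absent or a record is malformed.
 */
long find_info_record(const unsigned char *buf, size_t size, uint8_t wanted)
{
	size_t off = 0;

	assert(buf != NULL);

	while (size - off >= sizeof(struct info_hdr)) {
		struct info_hdr hdr;

		/* Copy the header out to avoid unaligned accesses. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > size - off)
			return -1; /* truncated or corrupt record */
		if (hdr.info_type == wanted)
			return (long)off;
		off += hdr.len;
	}
	return -1;
}
```

In a reader like fs-monitor.c, buf would point at
(char *)metadata + metadata->metadata_len and size would be
metadata->event_len - metadata->metadata_len, with wanted set to
FAN_EVENT_INFO_TYPE_ERROR (4) or FAN_EVENT_INFO_TYPE_FID (1).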

2021-08-12 23:07:35

by Gabriel Krisman Bertazi

Subject: [PATCH v6 21/21] docs: Document the FAN_FS_ERROR event

Document the FAN_FS_ERROR event for system administrators and user space
developers.

Reviewed-by: Amir Goldstein <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

---
Changes Since v4:
- Update documentation about reporting non-file error.
Changes Since v3:
- Move FAN_FS_ERROR notification into a subsection of the file.
Changes Since v2:
- NTR
Changes since v1:
- Drop references to location record
- Explain that the inode field is optional
- Explain we are reporting only the first error
---
.../admin-guide/filesystem-monitoring.rst | 70 +++++++++++++++++++
Documentation/admin-guide/index.rst | 1 +
2 files changed, 71 insertions(+)
create mode 100644 Documentation/admin-guide/filesystem-monitoring.rst

diff --git a/Documentation/admin-guide/filesystem-monitoring.rst b/Documentation/admin-guide/filesystem-monitoring.rst
new file mode 100644
index 000000000000..b03093567a93
--- /dev/null
+++ b/Documentation/admin-guide/filesystem-monitoring.rst
@@ -0,0 +1,70 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+File system Monitoring with fanotify
+====================================
+
+File system Error Reporting
+===========================
+
+fanotify supports the FAN_FS_ERROR event type for file system-wide error
+reporting. It is meant to be used by file system health monitoring
+daemons, which listen for these events and take actions (notify sysadmin,
+start recovery) when a file system problem is detected by the kernel.
+
+By design, a FAN_FS_ERROR notification exposes sufficient information for a
+monitoring tool to know a problem in the file system has happened. It
+doesn't necessarily provide a user space application with semantics to
+verify an IO operation was successfully executed. That is outside the
+scope of this feature. Instead, it is only meant as a framework for
+early file system problem detection and reporting to recovery tools.
+
+When a file system operation fails, it is common for dozens of kernel
+errors to cascade after the initial failure, hiding the original failure
+log, which is usually the most useful debug data to troubleshoot the
+problem. For this reason, FAN_FS_ERROR only reports the first error that
+occurred since the last notification, and it simply counts additional
+errors. This ensures that the most important piece of error information
+is never lost.
+
+FAN_FS_ERROR requires the fanotify group to be set up with the
+FAN_REPORT_FID flag.
+
+At the time of this writing, the only file system that emits FAN_FS_ERROR
+notifications is Ext4.
+
+Example user space code is provided at ``samples/fanotify/fs-monitor.c``.
+
+A FAN_FS_ERROR Notification has the following format::
+
+ [ Notification Metadata (Mandatory) ]
+ [ Generic Error Record (Mandatory) ]
+ [ FID record (Mandatory) ]
+
+Generic error record
+--------------------
+
+The generic error record provides enough information for a file system
+agnostic tool to learn about a problem in the file system, without
+providing any additional details about the problem. This record is
+identified by ``struct fanotify_event_info_header.info_type`` being set
+to FAN_EVENT_INFO_TYPE_ERROR::
+
+ struct fanotify_event_info_error {
+ struct fanotify_event_info_header hdr;
+ __s32 error;
+ __u32 error_count;
+ };
+
+The ``error`` field identifies the type of error. ``error_count``
+tracks the number of errors that occurred and were suppressed to
+preserve the original error, since the last notification.
+
+FID record
+----------
+
+The FID record can be used to uniquely identify the inode that triggered
+the error through the combination of fsid and file handle. A file system
+specific application can use that information to attempt a recovery
+procedure. Errors that are not related to an inode are reported with an
+empty file handle, with type FILEID_INVALID.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index dc00afcabb95..1bedab498104 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -82,6 +82,7 @@ configure specific aspects of kernel behavior to your liking.
edid
efi-stub
ext4
+ filesystem-monitoring
nfs/index
gpio/index
highuid
--
2.32.0

2021-08-13 07:41:43

by Amir Goldstein

Subject: Re: [PATCH v6 04/21] fsnotify: Reserve mark flag bits for backends

On Fri, Aug 13, 2021 at 12:40 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Split out the final bits of struct fsnotify_mark->flags for use by a
> backend.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> Changes since v1:
> - turn consts into defines (jan)
> ---
> include/linux/fsnotify_backend.h | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 1ce66748a2d2..ae1bd9f06808 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -363,6 +363,20 @@ struct fsnotify_mark_connector {
> struct hlist_head list;
> };
>
> +enum fsnotify_mark_bits {
> + FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY,
> + FSN_MARK_FL_BIT_ALIVE,
> + FSN_MARK_FL_BIT_ATTACHED,
> + FSN_MARK_PRIVATE_FLAGS,
> +};
> +
> +#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY \
> + (1 << FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY)
> +#define FSNOTIFY_MARK_FLAG_ALIVE \
> + (1 << FSN_MARK_FL_BIT_ALIVE)
> +#define FSNOTIFY_MARK_FLAG_ATTACHED \
> + (1 << FSN_MARK_FL_BIT_ATTACHED)
> +
> /*
> * A mark is simply an object attached to an in core inode which allows an
> * fsnotify listener to indicate they are either no longer interested in events
> @@ -398,9 +412,7 @@ struct fsnotify_mark {
> struct fsnotify_mark_connector *connector;
> /* Events types to ignore [mark->lock, group->mark_mutex] */
> __u32 ignored_mask;
> -#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY 0x01
> -#define FSNOTIFY_MARK_FLAG_ALIVE 0x02
> -#define FSNOTIFY_MARK_FLAG_ATTACHED 0x04
> + /* Upper bits [31:PRIVATE_FLAGS] are reserved for backend usage */

I don't understand what [31:PRIVATE_FLAGS] means

Otherwise:

Reviewed-by: Amir Goldstein <[email protected]>

Thanks,
Amir.
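
One possible reading of the [31:PRIVATE_FLAGS] notation questioned above is
a bit range: bits FSN_MARK_PRIVATE_FLAGS through 31 of the 32-bit flags
word. The sketch below spells that reading out. The enum is copied from the
patch; backend_flags_mask() and backend_flag() are hypothetical helpers
added purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions, copied from the patch under review. */
enum fsnotify_mark_bits {
	FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY,
	FSN_MARK_FL_BIT_ALIVE,
	FSN_MARK_FL_BIT_ATTACHED,
	FSN_MARK_PRIVATE_FLAGS, /* first bit left to the backend (value 3) */
};

/* Hypothetical helper: mask covering bits [31:FSN_MARK_PRIVATE_FLAGS]. */
uint32_t backend_flags_mask(void)
{
	return ~UINT32_C(0) << FSN_MARK_PRIVATE_FLAGS;
}

/* Hypothetical helper: the n-th backend-private flag, counted from 0. */
uint32_t backend_flag(unsigned int n)
{
	assert(n + FSN_MARK_PRIVATE_FLAGS < 32);

	return UINT32_C(1) << (FSN_MARK_PRIVATE_FLAGS + n);
}
```

Under this reading the three generic flags occupy 0x01, 0x02 and 0x04, and
a backend such as fanotify could define its own flags starting at 0x08.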

2021-08-13 07:55:48

by Amir Goldstein

Subject: Re: [PATCH v6 10/21] fsnotify: Support FS_ERROR event type

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Expose a new type of fsnotify event for filesystems to report errors for
> userspace monitoring tools. fanotify will send this type of
> notification for FAN_FS_ERROR events. This also introduces a helper for
> generating the new event.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

Reviewed-by: Amir Goldstein <[email protected]>

>
> ---
> Changes since v5:
> - pass sb inside data field (jan)
> Changes since v3:
> - Squash patch ("fsnotify: Introduce helpers to send error_events")
> - Drop reviewed-bys!
>
> Changes since v2:
> - FAN_ERROR->FAN_FS_ERROR (Amir)
>
> Changes since v1:
> - Overload FS_ERROR with FS_IN_IGNORED
> - Implement support for this type on fsnotify_data_inode (Amir)
> ---
> fs/notify/fsnotify.c | 3 +++
> include/linux/fsnotify.h | 13 +++++++++++++
> include/linux/fsnotify_backend.h | 18 +++++++++++++++++-
> 3 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 536db02cb26e..6d3b3de4f8ee 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -103,6 +103,9 @@ static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> struct inode *inode = fsnotify_data_inode(data, data_type);
> struct super_block *sb = inode ? inode->i_sb : NULL;
>
> + if (!sb && data_type == FSNOTIFY_EVENT_ERROR)
> + sb = ((struct fs_error_report *) data)->sb;
> +
> return sb;
> }
>
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index f8acddcf54fb..521234af1827 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -317,4 +317,17 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
> fsnotify_dentry(dentry, mask);
> }
>
> +static inline int fsnotify_sb_error(struct super_block *sb, struct inode *inode,
> + int error)
> +{
> + struct fs_error_report report = {
> + .error = error,
> + .inode = inode,
> + .sb = sb,
> + };
> +
> + return fsnotify(FS_ERROR, &report, FSNOTIFY_EVENT_ERROR,
> + NULL, NULL, NULL, 0);
> +}
> +
> #endif /* _LINUX_FS_NOTIFY_H */
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index e027af3cd8dd..277b6f3e0998 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -42,6 +42,12 @@
>
> #define FS_UNMOUNT 0x00002000 /* inode on umount fs */
> #define FS_Q_OVERFLOW 0x00004000 /* Event queued overflowed */
> +#define FS_ERROR 0x00008000 /* Filesystem Error (fanotify) */
> +
> +/*
> + * FS_IN_IGNORED overloads FS_ERROR. It is only used internally by inotify
> + * which does not support FS_ERROR.
> + */
> #define FS_IN_IGNORED 0x00008000 /* last inotify event here */
>
> #define FS_OPEN_PERM 0x00010000 /* open event in an permission hook */
> @@ -95,7 +101,8 @@
> #define ALL_FSNOTIFY_EVENTS (ALL_FSNOTIFY_DIRENT_EVENTS | \
> FS_EVENTS_POSS_ON_CHILD | \
> FS_DELETE_SELF | FS_MOVE_SELF | FS_DN_RENAME | \
> - FS_UNMOUNT | FS_Q_OVERFLOW | FS_IN_IGNORED)
> + FS_UNMOUNT | FS_Q_OVERFLOW | FS_IN_IGNORED | \
> + FS_ERROR)
>
> /* Extra flags that may be reported with event or control handling of events */
> #define ALL_FSNOTIFY_FLAGS (FS_EXCL_UNLINK | FS_ISDIR | FS_IN_ONESHOT | \
> @@ -248,6 +255,13 @@ enum fsnotify_data_type {
> FSNOTIFY_EVENT_NONE,
> FSNOTIFY_EVENT_PATH,
> FSNOTIFY_EVENT_INODE,
> + FSNOTIFY_EVENT_ERROR,
> +};
> +
> +struct fs_error_report {
> + int error;
> + struct inode *inode;
> + struct super_block *sb;
> };
>
> static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
> @@ -257,6 +271,8 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
> return (struct inode *)data;
> case FSNOTIFY_EVENT_PATH:
> return d_inode(((const struct path *)data)->dentry);
> + case FSNOTIFY_EVENT_ERROR:
> + return ((struct fs_error_report *)data)->inode;
> default:
> return NULL;
> }
> --
> 2.32.0
>

2021-08-13 08:01:09

by Amir Goldstein

Subject: Re: [PATCH v6 09/21] fsnotify: Allow events reported with an empty inode

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Some file system events (e.g. FS_ERROR) might not be associated with an
> inode. For these, it makes sense to associate them directly with the
> super block of the file system they apply to. This patch allows the
> event to be reported with a NULL inode, by recovering the superblock
> directly from the data field, if needed.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> --
> Changes since v5:
> - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
> ---
> fs/notify/fsnotify.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 30d422b8c0fc..536db02cb26e 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
> fsnotify_clear_marks_by_sb(sb);
> }
>
> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> +{
> + struct inode *inode = fsnotify_data_inode(data, data_type);
> + struct super_block *sb = inode ? inode->i_sb : NULL;
> +
> + return sb;
> +}
> +
> /*
> * Given an inode, first check if we care what happens to our children. Inotify
> * and dnotify both tell their parents about events. If we care about any event
> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> * @file_name is relative to
> * @file_name: optional file name associated with event
> * @inode: optional inode associated with event -
> - * either @dir or @inode must be non-NULL.
> - * if both are non-NULL event may be reported to both.
> + * If @dir and @inode are NULL, @data must have a type that
> + * allows retrieving the file system associated with this

Irrelevant comment. sb must always be available from @data.

> + * event. if both are non-NULL event may be reported to
> + * both.
> * @cookie: inotify rename cookie
> */
> int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> */
> parent = dir;
> }
> - sb = inode->i_sb;
> + sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);

const struct path *path = fsnotify_data_path(data, data_type);
+ const struct super_block *sb = fsnotify_data_sb(data, data_type);

All the games with @data @inode and @dir args are irrelevant to this.
sb should always be available from @data and it does not matter
if fsnotify_data_inode() is the same as @inode, @dir or neither.
All those inodes are anyway on the same sb.

Thanks,
Amir.
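
The suggestion above, that the superblock should always be derivable from
@data regardless of @inode and @dir, can be sketched in userspace as a
dispatch on the data type. Everything here is a stand-in: the struct
definitions carry only the fields needed, enum data_type and data_sb() are
hypothetical names, and the path case is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel types; only the fields used here. */
struct super_block { int id; };
struct inode { struct super_block *i_sb; };
struct fs_error_report {
	int error;
	struct inode *inode;    /* may be NULL for superblock errors */
	struct super_block *sb; /* always set by the reporter */
};

enum data_type { DATA_NONE, DATA_INODE, DATA_ERROR };

/* Resolve the superblock from the event payload alone. */
struct super_block *data_sb(const void *data, enum data_type type)
{
	assert(data != NULL || type == DATA_NONE);

	switch (type) {
	case DATA_INODE:
		return ((const struct inode *)data)->i_sb;
	case DATA_ERROR:
		/* Valid even when the report carries no inode. */
		return ((const struct fs_error_report *)data)->sb;
	default:
		return NULL;
	}
}
```

With a helper of this shape, the caller never needs to look at the inode or
directory arguments to find the file system an event belongs to.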

2021-08-13 08:01:21

by Amir Goldstein

Subject: Re: [PATCH v6 11/21] fanotify: Allow file handle encoding for unhashed events

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> FAN_FS_ERROR will report a file handle, but it is an unhashed event.
> Allow passing a NULL hash to fanotify_encode_fh and avoid calculating
> the hash if not needed.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>

Reviewed-by: Amir Goldstein <[email protected]>

> ---
> fs/notify/fanotify/fanotify.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index acf78c0ed219..50fce4fec0d6 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -403,8 +403,12 @@ static int fanotify_encode_fh(struct fanotify_fh *fh, struct inode *inode,
> fh->type = type;
> fh->len = fh_len;
>
> - /* Mix fh into event merge key */
> - *hash ^= fanotify_hash_fh(fh);
> + /*
> + * Mix fh into event merge key. Hash might be NULL in case of
> + * unhashed FID events (i.e. FAN_FS_ERROR).
> + */
> + if (hash)
> + *hash ^= fanotify_hash_fh(fh);
>
> return FANOTIFY_FH_HDR_LEN + fh_len;
>
> --
> 2.32.0
>

2021-08-13 08:29:37

by Amir Goldstein

Subject: Re: [PATCH v6 12/21] fanotify: Encode invalid file handle when no inode is provided

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Instead of failing, encode an invalid file handle in fanotify_encode_fh
> if no inode is provided. This bogus file handle will be reported by
> FAN_FS_ERROR for non-inode errors.
>
> When being reported to userspace, the length information is actually
> reset and the handle cleaned up, such that userspace doesn't have
> visibility of the internal kernel representation of this null handle.
>
> Also adjust the single caller that might rely on failure after passing
> an empty inode.
>
> Suggested-by: Amir Goldstein <[email protected]>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> ---
> Changes since v5:
> - Preserve flags initialization (jan)
> - Add BUILD_BUG_ON (amir)
> - Require minimum of FANOTIFY_NULL_FH_LEN for fh_len(amir)
> - Improve comment to explain the null FH length (jan)
> - Simplify logic
> ---
> fs/notify/fanotify/fanotify.c | 27 ++++++++++++++++++-----
> fs/notify/fanotify/fanotify_user.c | 35 +++++++++++++++++-------------
> 2 files changed, 41 insertions(+), 21 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 50fce4fec0d6..2b1ab031fbe5 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -334,6 +334,8 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group,
> return test_mask & user_mask;
> }
>
> +#define FANOTIFY_NULL_FH_LEN 4
> +
> /*
> * Check size needed to encode fanotify_fh.
> *
> @@ -345,7 +347,7 @@ static int fanotify_encode_fh_len(struct inode *inode)
> int dwords = 0;
>
> if (!inode)
> - return 0;
> + return FANOTIFY_NULL_FH_LEN;
>
> exportfs_encode_inode_fh(inode, NULL, &dwords, NULL);
>
> @@ -367,11 +369,23 @@ static int fanotify_encode_fh(struct fanotify_fh *fh, struct inode *inode,
> void *buf = fh->buf;
> int err;
>
> - fh->type = FILEID_ROOT;
> - fh->len = 0;
> + BUILD_BUG_ON(FANOTIFY_NULL_FH_LEN < 4 ||
> + FANOTIFY_NULL_FH_LEN > FANOTIFY_INLINE_FH_LEN);
> +
> fh->flags = 0;
> - if (!inode)
> - return 0;
> +
> + if (!inode) {
> + /*
> + * Invalid FHs are used on FAN_FS_ERROR for errors not
> + * linked to any inode. The f_handle won't be reported
> + * back to userspace. The extra bytes are cleared prior
> + * to reporting.
> + */
> + type = FILEID_INVALID;
> + fh_len = FANOTIFY_NULL_FH_LEN;

Please memset() the NULL_FH buffer to zero.

> +
> + goto success;
> + }
>
> /*
> * !gpf means preallocated variable size fh, but fh_len could
> @@ -400,6 +414,7 @@ static int fanotify_encode_fh(struct fanotify_fh *fh, struct inode *inode,
> if (!type || type == FILEID_INVALID || fh_len != dwords << 2)
> goto out_err;
>
> +success:
> fh->type = type;
> fh->len = fh_len;
>
> @@ -529,7 +544,7 @@ static struct fanotify_event *fanotify_alloc_name_event(struct inode *id,
> struct fanotify_info *info;
> struct fanotify_fh *dfh, *ffh;
> unsigned int dir_fh_len = fanotify_encode_fh_len(id);
> - unsigned int child_fh_len = fanotify_encode_fh_len(child);
> + unsigned int child_fh_len = child ? fanotify_encode_fh_len(child) : 0;
> unsigned int size;
>
> size = sizeof(*fne) + FANOTIFY_FH_HDR_LEN + dir_fh_len;
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index c47a5a45c0d3..4cacea5fcaca 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -360,7 +360,10 @@ static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> return -EFAULT;
>
> handle.handle_type = fh->type;
> - handle.handle_bytes = fh_len;
> +
> + /* FILEID_INVALID handle type is reported without its f_handle. */
> + if (fh->type != FILEID_INVALID)
> + handle.handle_bytes = fh_len;

I know I suggested those exact lines, but looking at the patch,
I think it would be better to do:
+ if (fh->type == FILEID_INVALID)
+ fh_len = 0;
handle.handle_bytes = fh_len;

> if (copy_to_user(buf, &handle, sizeof(handle)))
> return -EFAULT;
>
> @@ -369,20 +372,22 @@ static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> if (WARN_ON_ONCE(len < fh_len))
> return -EFAULT;
>
> - /*
> - * For an inline fh and inline file name, copy through stack to exclude
> - * the copy from usercopy hardening protections.
> - */
> - fh_buf = fanotify_fh_buf(fh);
> - if (fh_len <= FANOTIFY_INLINE_FH_LEN) {
> - memcpy(bounce, fh_buf, fh_len);
> - fh_buf = bounce;
> + if (fh->type != FILEID_INVALID) {

... and here: if (fh_len) {

> + /*
> + * For an inline fh and inline file name, copy through
> + * stack to exclude the copy from usercopy hardening
> + * protections.
> + */
> + fh_buf = fanotify_fh_buf(fh);
> + if (fh_len <= FANOTIFY_INLINE_FH_LEN) {
> + memcpy(bounce, fh_buf, fh_len);
> + fh_buf = bounce;
> + }
> + if (copy_to_user(buf, fh_buf, fh_len))
> + return -EFAULT;
> + buf += fh_len;
> + len -= fh_len;
> }
> - if (copy_to_user(buf, fh_buf, fh_len))
> - return -EFAULT;
> -
> - buf += fh_len;
> - len -= fh_len;
>
> if (name_len) {
> /* Copy the filename with terminating null */
> @@ -398,7 +403,7 @@ static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> }
>
> /* Pad with 0's */
> - WARN_ON_ONCE(len < 0 || len >= FANOTIFY_EVENT_ALIGN);
> + WARN_ON_ONCE(len < 0);

According to my calculations, FAN_FS_ERROR event with NULL_FH is expected
to get here with len == 4, so you can change this to:
WARN_ON_ONCE(len < 0 || len > FANOTIFY_EVENT_ALIGN);

But first, I would like to get Jan's feedback on this concept of keeping
unneeded 4 bytes zero padding in reported event in case of NULL_FH
in order to keep the FID reporting code simpler.

Thanks,
Amir.
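
The "len == 4" figure above can be checked with a little arithmetic. The
sketch below assumes the usual sizes on common ABIs: 12 bytes for
struct fanotify_event_info_fid, 8 bytes for struct file_handle without its
payload, and an event alignment of 4. The function names are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes assumed from the uapi structs on common ABIs. */
#define INFO_FID_SIZE    12 /* sizeof(struct fanotify_event_info_fid) */
#define FILE_HANDLE_SIZE  8 /* sizeof(struct file_handle), no payload */
#define EVENT_ALIGN       4 /* FANOTIFY_EVENT_ALIGN */
#define NULL_FH_LEN       4 /* FANOTIFY_NULL_FH_LEN from this patch */

/* Record length reserved for the event, rounded up to the alignment. */
size_t fid_record_len(size_t fh_len)
{
	size_t len = INFO_FID_SIZE + FILE_HANDLE_SIZE + fh_len;

	return (len + EVENT_ALIGN - 1) / EVENT_ALIGN * EVENT_ALIGN;
}

/* Bytes left to zero-pad once both headers and the handle are copied. */
size_t fid_record_padding(size_t reserved_fh_len, size_t copied_fh_len)
{
	assert(copied_fh_len <= reserved_fh_len);

	return fid_record_len(reserved_fh_len) - INFO_FID_SIZE -
	       FILE_HANDLE_SIZE - copied_fh_len;
}
```

For a null handle, 4 bytes are reserved but none are copied, so
fid_record_padding(NULL_FH_LEN, 0) is 4. That equals the alignment, which
is why a check of the form len >= FANOTIFY_EVENT_ALIGN would trigger there.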

2021-08-13 08:30:21

by Amir Goldstein

Subject: Re: [PATCH v6 13/21] fanotify: Require fid_mode for any non-fd event

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Like inode events, FAN_FS_ERROR will require fid mode. Therefore,
> convert the verification during fanotify_mark(2) to require fid for any
> non-fd event. This means fid_mode will not only be required for inode
> events, but for any event that doesn't provide a descriptor.
>
> Suggested-by: Amir Goldstein <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

Reviewed-by: Amir Goldstein <[email protected]>

>
> ---
> changes since v5:
> - Fix condition to include FANOTIFY_EVENT_FLAGS. (me)
> - Fix comment indentation (jan)
> ---
> fs/notify/fanotify/fanotify_user.c | 12 ++++++------
> include/linux/fanotify.h | 3 +++
> 2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 4cacea5fcaca..54107f1533d5 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1387,14 +1387,14 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> goto fput_and_out;
>
> /*
> - * Events with data type inode do not carry enough information to report
> - * event->fd, so we do not allow setting a mask for inode events unless
> - * group supports reporting fid.
> - * inode events are not supported on a mount mark, because they do not
> - * carry enough information (i.e. path) to be filtered by mount point.
> + * Events that do not carry enough information to report
> + * event->fd require a group that supports reporting fid. Those
> + * events are not supported on a mount mark, because they do not
> + * carry enough information (i.e. path) to be filtered by mount
> + * point.
> */
> fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
> - if (mask & FANOTIFY_INODE_EVENTS &&
> + if (mask & ~(FANOTIFY_FD_EVENTS|FANOTIFY_EVENT_FLAGS) &&
> (!fid_mode || mark_type == FAN_MARK_MOUNT))
> goto fput_and_out;
>
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index a16dbeced152..c05d45bde8b8 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -81,6 +81,9 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
> */
> #define FANOTIFY_DIRENT_EVENTS (FAN_MOVE | FAN_CREATE | FAN_DELETE)
>
> +/* Events that can be reported with event->fd */
> +#define FANOTIFY_FD_EVENTS (FANOTIFY_PATH_EVENTS | FANOTIFY_PERM_EVENTS)
> +
> /* Events that can only be reported with data type FSNOTIFY_EVENT_INODE */
> #define FANOTIFY_INODE_EVENTS (FANOTIFY_DIRENT_EVENTS | \
> FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)
> --
> 2.32.0
>
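
The new condition can be illustrated outside the kernel. The sketch below is only an approximation of the check in do_fanotify_mark(): the EX_ macros are hypothetical local copies of a few mask bits, EX_FD_EVENTS covers only a subset of the real FANOTIFY_FD_EVENTS, and ex_mask_needs_fid() and ex_mark_rejected() are names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative copies of a few fanotify mask bits. The EX_ names are
 * hypothetical; real code would use the kernel's own definitions.
 */
#define EX_FAN_ACCESS         0x00000001u
#define EX_FAN_MODIFY         0x00000002u
#define EX_FAN_ATTRIB         0x00000004u
#define EX_FAN_OPEN           0x00000020u
#define EX_FAN_CREATE         0x00000100u
#define EX_FAN_FS_ERROR       0x00008000u
#define EX_FAN_OPEN_PERM      0x00010000u
#define EX_FAN_EVENT_ON_CHILD 0x08000000u
#define EX_FAN_ONDIR          0x40000000u

/* A subset of the events that can be reported with event->fd. */
#define EX_FD_EVENTS \
	(EX_FAN_ACCESS | EX_FAN_MODIFY | EX_FAN_OPEN | EX_FAN_OPEN_PERM)
/* Modifier flags; these are not events in their own right. */
#define EX_EVENT_FLAGS (EX_FAN_EVENT_ON_CHILD | EX_FAN_ONDIR)

static_assert((EX_FD_EVENTS & EX_FAN_FS_ERROR) == 0,
	      "the error event must not be classified as an fd event");

/* True if the mask contains any event that cannot be reported by fd. */
bool ex_mask_needs_fid(uint32_t mask)
{
	return (mask & ~(EX_FD_EVENTS | EX_EVENT_FLAGS)) != 0;
}

/* Mirrors the shape of the check: non-fd events need fid mode and
 * are not allowed on a mount mark. */
bool ex_mark_rejected(uint32_t mask, bool fid_mode, bool mount_mark)
{
	return ex_mask_needs_fid(mask) && (!fid_mode || mount_mark);
}
```

Masking out the flag bits first is what the v5 -> v6 fix addresses: without it, a request such as FAN_OPEN | FAN_ONDIR would be treated as a non-fd event.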

2021-08-13 08:30:21

by Amir Goldstein

Subject: Re: [PATCH v6 14/21] fanotify: Reserve UAPI bits for FAN_FS_ERROR

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> FAN_FS_ERROR allows reporting of event type FS_ERROR to userspace, which
> is a mechanism to report file system wide problems via fanotify. This
> commit preallocates userspace visible bits to match the FS_ERROR event.
>
> Reviewed-by: Jan Kara <[email protected]>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

Reviewed-by: Amir Goldstein <[email protected]>

> ---
> fs/notify/fanotify/fanotify.c | 1 +
> include/uapi/linux/fanotify.h | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 2b1ab031fbe5..ebb6c557cea1 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -760,6 +760,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
> BUILD_BUG_ON(FAN_OPEN_EXEC != FS_OPEN_EXEC);
> BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
> + BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
>
> BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19);
>
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index fbf9c5c7dd59..16402037fc7a 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -20,6 +20,7 @@
> #define FAN_OPEN_EXEC 0x00001000 /* File was opened for exec */
>
> #define FAN_Q_OVERFLOW 0x00004000 /* Event queued overflowed */
> +#define FAN_FS_ERROR 0x00008000 /* Filesystem error */
>
> #define FAN_OPEN_PERM 0x00010000 /* File open in perm check */
> #define FAN_ACCESS_PERM 0x00020000 /* File accessed in perm check */
> --
> 2.32.0
>
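
The two compile-time checks touched by this patch (the UAPI bit must equal the internal bit, and the number of event bits is counted) can be mimicked in userspace. This is a sketch under assumptions: the EX_ constants are hypothetical stand-ins for FAN_FS_ERROR and FS_ERROR, and ex_hweight32() is a plain C replacement for the kernel's HWEIGHT32().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the userspace-visible and internal bits. */
#define EX_UAPI_FS_ERROR   0x00008000u
#define EX_KERNEL_FS_ERROR 0x00008000u

/* Same idea as BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR). */
static_assert(EX_UAPI_FS_ERROR == EX_KERNEL_FS_ERROR,
	      "UAPI and internal error bits must match");

/* Count the set bits in a 32-bit word (Kernighan's method). */
unsigned int ex_hweight32(uint32_t w)
{
	unsigned int n = 0;

	while (w) {
		w &= w - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}
```

Adding one new event bit to a mask raises its weight by exactly one, which is why the HWEIGHT32() check is only bumped once the bit joins ALL_FANOTIFY_EVENT_BITS.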

2021-08-13 09:04:06

by Amir Goldstein

Subject: Re: [PATCH v6 15/21] fanotify: Preallocate per superblock mark error event

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Error reporting needs to be done in an atomic context. This patch
> introduces a single error slot for superblock marks that report the
> FAN_FS_ERROR event, to be used during event submission.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> ---
> Changes v5:
> - Restore mark references. (jan)
> - Tie fee slot to the mark lifetime.(jan)
> - Don't reallocate event(jan)
> ---
> fs/notify/fanotify/fanotify.c | 12 ++++++++++++
> fs/notify/fanotify/fanotify.h | 13 +++++++++++++
> fs/notify/fanotify/fanotify_user.c | 31 ++++++++++++++++++++++++++++--
> 3 files changed, 54 insertions(+), 2 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index ebb6c557cea1..3bf6fd85c634 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -855,6 +855,14 @@ static void fanotify_free_name_event(struct fanotify_event *event)
> kfree(FANOTIFY_NE(event));
> }
>
> +static void fanotify_free_error_event(struct fanotify_event *event)
> +{
> + /*
> + * The actual event is tied to a mark, and is released on mark
> + * removal
> + */
> +}
> +
> static void fanotify_free_event(struct fsnotify_event *fsn_event)
> {
> struct fanotify_event *event;
> @@ -877,6 +885,9 @@ static void fanotify_free_event(struct fsnotify_event *fsn_event)
> case FANOTIFY_EVENT_TYPE_OVERFLOW:
> kfree(event);
> break;
> + case FANOTIFY_EVENT_TYPE_FS_ERROR:
> + fanotify_free_error_event(event);
> + break;
> default:
> WARN_ON_ONCE(1);
> }
> @@ -894,6 +905,7 @@ static void fanotify_free_mark(struct fsnotify_mark *mark)
> if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
> struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);
>
> + kfree(fa_mark->fee_slot);
> kmem_cache_free(fanotify_sb_mark_cache, fa_mark);
> } else {
> kmem_cache_free(fanotify_mark_cache, mark);
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index b3ab620822c2..3f03333df32f 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -139,6 +139,7 @@ enum fanotify_mark_bits {
>
> struct fanotify_sb_mark {
> struct fsnotify_mark fsn_mark;
> + struct fanotify_error_event *fee_slot;
> };
>
> static inline
> @@ -161,6 +162,7 @@ enum fanotify_event_type {
> FANOTIFY_EVENT_TYPE_PATH,
> FANOTIFY_EVENT_TYPE_PATH_PERM,
> FANOTIFY_EVENT_TYPE_OVERFLOW, /* struct fanotify_event */
> + FANOTIFY_EVENT_TYPE_FS_ERROR, /* struct fanotify_error_event */
> __FANOTIFY_EVENT_TYPE_NUM
> };
>
> @@ -216,6 +218,17 @@ FANOTIFY_NE(struct fanotify_event *event)
> return container_of(event, struct fanotify_name_event, fae);
> }
>
> +struct fanotify_error_event {
> + struct fanotify_event fae;
> + struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
> +};
> +
> +static inline struct fanotify_error_event *
> +FANOTIFY_EE(struct fanotify_event *event)
> +{
> + return container_of(event, struct fanotify_error_event, fae);
> +}
> +
> static inline __kernel_fsid_t *fanotify_event_fsid(struct fanotify_event *event)
> {
> if (event->type == FANOTIFY_EVENT_TYPE_FID)
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 54107f1533d5..b77030386d7f 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -947,8 +947,10 @@ static struct fsnotify_mark *fanotify_alloc_mark(struct fsnotify_group *group,
>
> fsnotify_init_mark(mark, group);
>
> - if (type == FSNOTIFY_OBJ_TYPE_SB)
> + if (type == FSNOTIFY_OBJ_TYPE_SB) {
> mark->flags |= FANOTIFY_MARK_FLAG_SB_MARK;
> + sb_mark->fee_slot = NULL;
> + }
>
> return mark;
> }
> @@ -999,6 +1001,7 @@ static int fanotify_add_mark(struct fsnotify_group *group,
> {
> struct fsnotify_mark *fsn_mark;
> __u32 added;
> + int ret = 0;
>
> mutex_lock(&group->mark_mutex);
> fsn_mark = fsnotify_find_mark(connp, group);
> @@ -1009,13 +1012,37 @@ static int fanotify_add_mark(struct fsnotify_group *group,
> return PTR_ERR(fsn_mark);
> }
> }
> +
> + /*
> + * Error events are allocated per super-block mark only if
> + * strictly needed (i.e. FAN_FS_ERROR was requested).
> + */
> + if (type == FSNOTIFY_OBJ_TYPE_SB && !(flags & FAN_MARK_IGNORED_MASK) &&
> + (mask & FAN_FS_ERROR)) {
> + struct fanotify_sb_mark *sb_mark = FANOTIFY_SB_MARK(fsn_mark);
> +
> + if (!sb_mark->fee_slot) {
> + struct fanotify_error_event *fee =
> + kzalloc(sizeof(*fee), GFP_KERNEL_ACCOUNT);
> + if (!fee) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + fanotify_init_event(&fee->fae, 0, FS_ERROR);
> + fee->sb_mark = sb_mark;

I think Jan wanted to avoid kzalloc()?
Please use kmalloc() and initialize the rest of the fee-> members explicitly.
We do not need to fill the entire fh buf with zeroes.

Thanks,
Amir.
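
A userspace analogue of the request above, as a rough sketch: malloc() stands in for kmalloc(), and struct ex_error_slot, ex_alloc_slot() and ex_set_handle() are hypothetical names, not part of the series. The 128-byte buffer size is likewise an assumption made for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define EX_MAX_HANDLE_SZ 128	/* assumed size of the handle buffer */

/* Hypothetical, simplified error slot with a large inline buffer. */
struct ex_error_slot {
	uint32_t err_count;
	int32_t error;
	uint8_t fh_len;
	unsigned char fh_buf[EX_MAX_HANDLE_SZ];
};

/* Allocate a slot and initialize only the small header fields. */
struct ex_error_slot *ex_alloc_slot(void)
{
	struct ex_error_slot *slot = malloc(sizeof(*slot));

	if (!slot)
		return NULL;

	slot->err_count = 0;
	slot->error = 0;
	slot->fh_len = 0;
	/*
	 * fh_buf is deliberately left uninitialized: readers only ever
	 * look at the first fh_len bytes, so zeroing it is wasted work.
	 */
	return slot;
}

/* Store a handle; fh_len tracks how much of fh_buf is valid. */
void ex_set_handle(struct ex_error_slot *slot, const void *fh, uint8_t len)
{
	assert(len <= EX_MAX_HANDLE_SZ);
	memcpy(slot->fh_buf, fh, len);
	slot->fh_len = len;
}
```

The trade-off is that every reader must honor fh_len; anything that copies the whole buffer out would leak uninitialized bytes.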

2021-08-13 09:04:43

by Amir Goldstein

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> The Error info type is a record sent to users on FAN_FS_ERROR events
> documenting the type of error. It also carries an error count,
> documenting how many errors were observed since the last reporting.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>

Reviewed-by: Amir Goldstein <[email protected]>

> ---
> Changes since v5:
> - Move error code here
> ---
> fs/notify/fanotify/fanotify.c | 1 +
> fs/notify/fanotify/fanotify.h | 1 +
> fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
> include/uapi/linux/fanotify.h | 7 ++++++
> 4 files changed, 45 insertions(+)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index f5c16ac37835..b49a474c1d7f 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -745,6 +745,7 @@ static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
> spin_unlock(&group->notification_lock);
>
> fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> + fee->error = report->error;
> fee->fsid = fee->sb_mark->fsn_mark.connector->fsid;
>
> fh_len = fanotify_encode_fh_len(inode);
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index 158cf0c4b0bd..0cfe376c6fd9 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -220,6 +220,7 @@ FANOTIFY_NE(struct fanotify_event *event)
>
> struct fanotify_error_event {
> struct fanotify_event fae;
> + s32 error; /* Error reported by the Filesystem. */
> u32 err_count; /* Suppressed errors count */
>
> struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 1ab8f9d8b3ac..ca53159ce673 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -107,6 +107,8 @@ struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
> #define FANOTIFY_EVENT_ALIGN 4
> #define FANOTIFY_INFO_HDR_LEN \
> (sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
> +#define FANOTIFY_INFO_ERROR_LEN \
> + (sizeof(struct fanotify_event_info_error))
>
> static int fanotify_fid_info_len(int fh_len, int name_len)
> {
> @@ -130,6 +132,9 @@ static size_t fanotify_event_len(struct fanotify_event *event,
> if (!fid_mode)
> return event_len;
>
> + if (fanotify_is_error_event(event->mask))
> + event_len += FANOTIFY_INFO_ERROR_LEN;
> +
> info = fanotify_event_info(event);
> dir_fh_len = fanotify_event_dir_fh_len(event);
> fh_len = fanotify_event_object_fh_len(event);
> @@ -176,6 +181,7 @@ static struct fanotify_event *fanotify_dup_error_to_stack(
> error_on_stack->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> error_on_stack->err_count = fee->err_count;
> error_on_stack->sb_mark = fee->sb_mark;
> + error_on_stack->error = fee->error;
>
> error_on_stack->fsid = fee->fsid;
>
> @@ -342,6 +348,28 @@ static int process_access_response(struct fsnotify_group *group,
> return -ENOENT;
> }
>
> +static size_t copy_error_info_to_user(struct fanotify_event *event,
> + char __user *buf, int count)
> +{
> + struct fanotify_event_info_error info;
> + struct fanotify_error_event *fee = FANOTIFY_EE(event);
> +
> + info.hdr.info_type = FAN_EVENT_INFO_TYPE_ERROR;
> + info.hdr.pad = 0;
> + info.hdr.len = FANOTIFY_INFO_ERROR_LEN;
> +
> + if (WARN_ON(count < info.hdr.len))
> + return -EFAULT;
> +
> + info.error = fee->error;
> + info.error_count = fee->err_count;
> +
> + if (copy_to_user(buf, &info, sizeof(info)))
> + return -EFAULT;
> +
> + return info.hdr.len;
> +}
> +
> static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> int info_type, const char *name, size_t name_len,
> char __user *buf, size_t count)
> @@ -505,6 +533,14 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> if (f)
> fd_install(fd, f);
>
> + if (fanotify_is_error_event(event->mask)) {
> + ret = copy_error_info_to_user(event, buf, count);
> + if (ret < 0)
> + goto out_close_fd;
> + buf += ret;
> + count -= ret;
> + }
> +
> /* Event info records order is: dir fid + name, child fid */
> if (fanotify_event_dir_fh_len(event)) {
> info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index 16402037fc7a..80040a92e9d9 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
> #define FAN_EVENT_INFO_TYPE_FID 1
> #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
> #define FAN_EVENT_INFO_TYPE_DFID 3
> +#define FAN_EVENT_INFO_TYPE_ERROR 4
>
> /* Variable length info record following event metadata */
> struct fanotify_event_info_header {
> @@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
> unsigned char handle[0];
> };
>
> +struct fanotify_event_info_error {
> + struct fanotify_event_info_header hdr;
> + __s32 error;
> + __u32 error_count;
> +};
> +
> struct fanotify_response {
> __s32 fd;
> __u32 response;
> --
> 2.32.0
>
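
For readers on the consuming side, the record added here could be picked out of an event's info records roughly as follows. This is a sketch, not the sample code from the series: the ex_ structures are local mirrors of the layouts in the patch, and ex_find_error_info() is a hypothetical helper. It does not depend on where the error record sits relative to the FID records.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_INFO_TYPE_ERROR 4	/* mirrors FAN_EVENT_INFO_TYPE_ERROR */

/* Local mirrors of the record layouts shown in the patch. */
struct ex_info_header {
	uint8_t info_type;
	uint8_t pad;
	uint16_t len;
};

struct ex_info_error {
	struct ex_info_header hdr;
	int32_t error;
	uint32_t error_count;
};

static_assert(sizeof(struct ex_info_error) == 12,
	      "error record is expected to be 12 bytes");

/*
 * Walk the info records in buf[0..len) and copy out the first error
 * record. Returns 1 if one was found, 0 otherwise (including when a
 * malformed record length is seen).
 */
int ex_find_error_info(const unsigned char *buf, size_t len,
		       struct ex_info_error *out)
{
	size_t off = 0;

	while (len - off >= sizeof(struct ex_info_header)) {
		struct ex_info_header hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > len - off)
			return 0;	/* malformed record */

		if (hdr.info_type == EX_INFO_TYPE_ERROR &&
		    hdr.len >= sizeof(*out)) {
			memcpy(out, buf + off, sizeof(*out));
			return 1;
		}
		off += hdr.len;	/* skip to the next record */
	}
	return 0;
}
```

In a real reader, buf would point just past the fanotify_event_metadata and len would be event_len minus metadata_len.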

2021-08-13 09:06:54

by Amir Goldstein

Subject: Re: [PATCH v6 17/21] fanotify: Report fid info for file related file system errors

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Plumb the pieces to add a FID report to error records. Since all error
> event memory must be pre-allocated, we estimate a file handle size and
> if it is insufficient, we report an invalid FID and increase the
> prediction for the next error slot allocation.
>
> For errors that don't expose a file handle report it with an invalid
> FID.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> ---
> Changes since v5:
> - Use preallocated MAX_HANDLE_SZ FH buffer
> - Report superblock errors with a zerolength INVALID FID (jan, amir)
> ---
> fs/notify/fanotify/fanotify.c | 15 +++++++++++++++
> fs/notify/fanotify/fanotify.h | 11 +++++++++++
> fs/notify/fanotify/fanotify_user.c | 7 +++++++
> 3 files changed, 33 insertions(+)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 0c7667d3f5d1..f5c16ac37835 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -734,6 +734,8 @@ static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
> struct fanotify_sb_mark *sb_mark =
> FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
> struct fanotify_error_event *fee = sb_mark->fee_slot;
> + struct inode *inode = report->inode;
> + int fh_len;
>
> spin_lock(&group->notification_lock);
> if (fee->err_count++) {
> @@ -743,6 +745,19 @@ static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
> spin_unlock(&group->notification_lock);
>
> fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> + fee->fsid = fee->sb_mark->fsn_mark.connector->fsid;
> +
> + fh_len = fanotify_encode_fh_len(inode);
> + if (WARN_ON(fh_len > MAX_HANDLE_SZ)) {
> + /*
> + * Fallback to reporting the error against the super
> + * block. It should never happen.
> + */
> + inode = NULL;
> + fh_len = fanotify_encode_fh_len(NULL);
> + }
> +
> + fanotify_encode_fh(&fee->object_fh, inode, fh_len, NULL, 0);
>
> if (fsnotify_insert_event(group, &fee->fae.fse,
> NULL, fanotify_insert_error_event)) {
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index eeb4a85af74e..158cf0c4b0bd 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -223,6 +223,13 @@ struct fanotify_error_event {
> u32 err_count; /* Suppressed errors count */
>
> struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
> +
> + __kernel_fsid_t fsid; /* FSID this error refers to. */
> +
> + /* object_fh must be followed by the inline handle buffer. */
> + struct fanotify_fh object_fh;
> + /* Reserve space in object_fh.buf[] - access with fanotify_fh_buf() */
> + unsigned char _inline_fh_buf[MAX_HANDLE_SZ];
> };
>
> static inline struct fanotify_error_event *
> @@ -237,6 +244,8 @@ static inline __kernel_fsid_t *fanotify_event_fsid(struct fanotify_event *event)
> return &FANOTIFY_FE(event)->fsid;
> else if (event->type == FANOTIFY_EVENT_TYPE_FID_NAME)
> return &FANOTIFY_NE(event)->fsid;
> + else if (event->type == FANOTIFY_EVENT_TYPE_FS_ERROR)
> + return &FANOTIFY_EE(event)->fsid;
> else
> return NULL;
> }
> @@ -248,6 +257,8 @@ static inline struct fanotify_fh *fanotify_event_object_fh(
> return &FANOTIFY_FE(event)->object_fh;
> else if (event->type == FANOTIFY_EVENT_TYPE_FID_NAME)
> return fanotify_info_file_fh(&FANOTIFY_NE(event)->info);
> + else if (event->type == FANOTIFY_EVENT_TYPE_FS_ERROR)
> + return &FANOTIFY_EE(event)->object_fh;
> else
> return NULL;
> }
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 3fff0c994dc8..1ab8f9d8b3ac 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -177,6 +177,13 @@ static struct fanotify_event *fanotify_dup_error_to_stack(
> error_on_stack->err_count = fee->err_count;
> error_on_stack->sb_mark = fee->sb_mark;
>
> + error_on_stack->fsid = fee->fsid;
> +
> + memcpy(&error_on_stack->object_fh, &fee->object_fh,
> + sizeof(fee->object_fh));
> + memcpy(error_on_stack->object_fh.buf, fee->object_fh.buf,
> + fee->object_fh.len);
> +

I would go with:

size_t len = offsetof(struct fanotify_error_event, _inline_fh_buf)
+ fee->object_fh.len;

memcpy(error_on_stack, fee, len);

But maybe it's just me, so I don't insist.

Thanks,
Amir.
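
The single-memcpy idea can be tried on a simplified structure. The sketch below assumes a hypothetical struct ex_error_event whose handle buffer is the last member; it is not the kernel's struct fanotify_error_event, and ex_dup_error() is an invented name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_MAX_HANDLE_SZ 128	/* assumed size of the handle buffer */

/* Hypothetical event layout: fixed fields first, handle bytes last. */
struct ex_error_event {
	uint32_t err_count;
	int32_t error;
	uint8_t fh_len;
	unsigned char fh_buf[EX_MAX_HANDLE_SZ];	/* must stay last */
};

/*
 * Copy the fixed part of the event plus only the valid handle bytes,
 * using one memcpy() sized with offsetof().
 */
void ex_dup_error(struct ex_error_event *dst,
		  const struct ex_error_event *src)
{
	size_t len;

	assert(src->fh_len <= EX_MAX_HANDLE_SZ);
	len = offsetof(struct ex_error_event, fh_buf) + src->fh_len;
	memcpy(dst, src, len);
}
```

This only works while the buffer stays at the end of the structure, and it copies every field in front of the buffer, including any that the caller would otherwise re-initialize on the copy.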

2021-08-13 09:08:45

by Amir Goldstein

Subject: Re: [PATCH v6 17/21] fanotify: Report fid info for file related file system errors

On Fri, Aug 13, 2021 at 12:00 PM Amir Goldstein <[email protected]> wrote:
>
> On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> <[email protected]> wrote:
> >
> > Plumb the pieces to add a FID report to error records. Since all error
> > event memory must be pre-allocated, we estimate a file handle size and
> > > if it is insufficient, we report an invalid FID and increase the
> > prediction for the next error slot allocation.

This commit message is outdated.

Thanks,
Amir.

2021-08-13 09:56:45

by Amir Goldstein

Subject: Re: [PATCH v6 16/21] fanotify: Handle FAN_FS_ERROR events

On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Wire up FAN_FS_ERROR in the fanotify_mark syscall. The event can only
> be requested for the entire filesystem, thus it requires the
> FAN_MARK_FILESYSTEM.

Please split the wire-up to the fanotify_mark syscall into a separate patch,
applied after the patches that implement reporting of the event info records.

>
> FAN_FS_ERROR has to be handled slightly differently from other events
> because it needs to be submitted in an atomic context, using
> preallocated memory. This patch implements the submission path by only
> storing the first error event that happened in the slot (userspace
> resets the slot by reading the event).
>
> Extra error events happening when the slot is occupied are merged to the
> original report, and the only information kept for these extra errors is
> an accumulator counting the number of events, which is part of the
> record reported back to userspace.
>
> Reporting only the first event should be fine, since when a FS error
> happens, a cascade of errors usually follows, but the most meaningful
> information is (usually) on the first error.
>
> The event dequeueing is also a bit special to avoid losing events. Since
> event merging only happens while the event is queued, there is a window
> between when an error event is dequeued (notification_lock is dropped)
> until it is reset (.free_event()) where the slot is full, but no merges
> can happen.
>
> The proposed solution is to copy the event to the stack prior to
> dropping the lock. This way, if a new error arrives between the time
> the event is dequeued and the time it is reset, the new error will still
> be logged and merged in the recently freed slot.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> ---
> Changes since v5:
> - Copy to stack instead of replacing the fee slot(jan)
> - prepare error slot outside of the notification lock(jan)
> Changes since v4:
> - Split parts to earlier patches (amir)
> - Simplify fanotify entry replacement
> - Update handle size prediction on overflow
> Changes since v3:
> - Convert WARN_ON to pr_warn (amir)
> - Remove unnecessary READ/WRITE_ONCE (amir)
> - Alloc with GFP_KERNEL_ACCOUNT(amir)
> - Simplify flags on mark allocation (amir)
> - Avoid atomic set of error_count (amir)
> - Simplify rules when merging error_event (amir)
> - Allocate new error_event on get_one_event (amir)
> - Report superblock error with invalid FH (amir,jan)
>
> Changes since v2:
> - Support and require FID mode (amir)
> - Goto error path instead of early return (amir)
> - Simplify get_one_event (me)
> - Base merging on error_count
> - drop fanotify_queue_error_event
>
> Changes since v1:
> - Pass dentry to fanotify_check_fsid (Amir)
> - FANOTIFY_EVENT_TYPE_ERROR -> FANOTIFY_EVENT_TYPE_FS_ERROR
> - Merge previous patch into it
> - Use a single slot
> - Move fanotify_mark.error_event definition to this commit
> - Rename FAN_ERROR -> FAN_FS_ERROR
> - Restrict FAN_FS_ERROR to FAN_MARK_FILESYSTEM
> ---
> fs/notify/fanotify/fanotify.c | 57 +++++++++++++++++++++++++++++-
> fs/notify/fanotify/fanotify.h | 21 +++++++++++
> fs/notify/fanotify/fanotify_user.c | 39 ++++++++++++++++++--
> include/linux/fanotify.h | 6 +++-
> 4 files changed, 119 insertions(+), 4 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 3bf6fd85c634..0c7667d3f5d1 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -709,6 +709,55 @@ static __kernel_fsid_t fanotify_get_fsid(struct fsnotify_iter_info *iter_info)
> return fsid;
> }
>
> +static void fanotify_insert_error_event(struct fsnotify_group *group,
> + struct fsnotify_event *fsn_event)
> +
> +{
> + struct fanotify_event *event = FANOTIFY_E(fsn_event);
> +
> + if (!fanotify_is_error_event(event->mask))
> + return;
> +
> + /*
> + * Prevent the mark from going away while an outstanding error
> + * event is queued. The reference is released by
> + * fanotify_dequeue_first_event.
> + */
> + fsnotify_get_mark(&FANOTIFY_EE(event)->sb_mark->fsn_mark);
> +
> +}
> +
> +static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
> + struct fsnotify_group *group,
> + const struct fs_error_report *report)
> +{
> + struct fanotify_sb_mark *sb_mark =
> + FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
> + struct fanotify_error_event *fee = sb_mark->fee_slot;
> +
> + spin_lock(&group->notification_lock);
> + if (fee->err_count++) {
> + spin_unlock(&group->notification_lock);
> + return 0;
> + }

Please add commentary to explain why this logic runs before merge()/insert().

> + spin_unlock(&group->notification_lock);
> +
> + fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> +
> + if (fsnotify_insert_event(group, &fee->fae.fse,
> + NULL, fanotify_insert_error_event)) {
> + /*
> + * Even if an error occurred, an overflow event is
> + * queued. Just reset the error count and succeed.
> + */
> + spin_lock(&group->notification_lock);
> + fanotify_reset_error_slot(fee);
> + spin_unlock(&group->notification_lock);

This feels racy.
I think that fanotify_reset_error_slot() should WARN about
trying to reset a queued error event, and here we need to
check that fee was not queued while we dropped the lock.

And I am not convinced about the correctness of incrementing
err_count while the lock is dropped.
I need to see the commentary.

> + }
> +
> + return 0;
> +}
> +
> /*
> * Add an event to hash table for faster merge.
> */
> @@ -762,7 +811,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
> BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
>
> - BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19);
> + BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 20);
>
> mask = fanotify_group_event_mask(group, iter_info, mask, data,
> data_type, dir);
> @@ -787,6 +836,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> return 0;
> }
>
> + if (fanotify_is_error_event(mask))
> + return fanotify_handle_error_event(iter_info, group, data);
> +
> event = fanotify_alloc_event(group, mask, data, data_type, dir,
> file_name, &fsid);
> ret = -ENOMEM;
> @@ -857,10 +909,13 @@ static void fanotify_free_name_event(struct fanotify_event *event)
>
> static void fanotify_free_error_event(struct fanotify_event *event)
> {
> + struct fanotify_error_event *fee = FANOTIFY_EE(event);
> +
> /*
> * The actual event is tied to a mark, and is released on mark
> * removal
> */
> + fsnotify_put_mark(&fee->sb_mark->fsn_mark);
> }
>
> static void fanotify_free_event(struct fsnotify_event *fsn_event)
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index 3f03333df32f..eeb4a85af74e 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -220,6 +220,8 @@ FANOTIFY_NE(struct fanotify_event *event)
>
> struct fanotify_error_event {
> struct fanotify_event fae;
> + u32 err_count; /* Suppressed errors count */
> +
> struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
> };
>
> @@ -320,6 +322,11 @@ static inline struct fanotify_event *FANOTIFY_E(struct fsnotify_event *fse)
> return container_of(fse, struct fanotify_event, fse);
> }
>
> +static inline bool fanotify_is_error_event(u32 mask)
> +{
> + return mask & FAN_FS_ERROR;
> +}
> +
> static inline bool fanotify_event_has_path(struct fanotify_event *event)
> {
> return event->type == FANOTIFY_EVENT_TYPE_PATH ||
> @@ -349,6 +356,7 @@ static inline struct path *fanotify_event_path(struct fanotify_event *event)
> static inline bool fanotify_is_hashed_event(u32 mask)
> {
> return !(fanotify_is_perm_event(mask) ||
> + fanotify_is_error_event(mask) ||
> fsnotify_is_overflow_event(mask));
> }
>
> @@ -358,3 +366,16 @@ static inline unsigned int fanotify_event_hash_bucket(
> {
> return event->hash & FANOTIFY_HTABLE_MASK;
> }
> +
> +/*
> + * Reset the FAN_FS_ERROR event slot
> + *
> + * This is used to restore the error event slot to a a zeroed state,
> + * where it can be used for a new incoming error. It does not
> + * initialize the event, but clear only the required data to free the
> + * slot.
> + */
> +static inline void fanotify_reset_error_slot(struct fanotify_error_event *fee)
> +{
> + fee->err_count = 0;

Makes sense that it should also zero the error field. No?


> +}
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index b77030386d7f..3fff0c994dc8 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -167,6 +167,19 @@ static void fanotify_unhash_event(struct fsnotify_group *group,
> hlist_del_init(&event->merge_list);
> }
>
> +static struct fanotify_event *fanotify_dup_error_to_stack(
> + struct fanotify_error_event *fee,
> + struct fanotify_error_event *error_on_stack)
> +{
> + fanotify_init_event(&error_on_stack->fae, 0, FS_ERROR);
> +
> + error_on_stack->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> + error_on_stack->err_count = fee->err_count;
> + error_on_stack->sb_mark = fee->sb_mark;
> +
> + return &error_on_stack->fae;
> +}
> +
> /*
> * Get an fanotify notification event if one exists and is small
> * enough to fit in "count". Return an error pointer if the count
> @@ -174,7 +187,9 @@ static void fanotify_unhash_event(struct fsnotify_group *group,
> * updated accordingly.
> */
> static struct fanotify_event *get_one_event(struct fsnotify_group *group,
> - size_t count)
> + size_t count,
> + struct fanotify_error_event *error_on_stack)
> +
> {
> size_t event_size;
> struct fanotify_event *event = NULL;
> @@ -205,6 +220,16 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
> FANOTIFY_PERM(event)->state = FAN_EVENT_REPORTED;
> if (fanotify_is_hashed_event(event->mask))
> fanotify_unhash_event(group, event);
> +
> + if (fanotify_is_error_event(event->mask)) {
> + /*
> + * Error events are returned as a copy of the error
> + * slot. The actual error slot is reused.
> + */
> + fanotify_dup_error_to_stack(FANOTIFY_EE(event), error_on_stack);
> + fanotify_reset_error_slot(FANOTIFY_EE(event));
> + event = &error_on_stack->fae;
> + }
> out:
> spin_unlock(&group->notification_lock);
> return event;
> @@ -564,6 +589,7 @@ static __poll_t fanotify_poll(struct file *file, poll_table *wait)
> static ssize_t fanotify_read(struct file *file, char __user *buf,
> size_t count, loff_t *pos)
> {
> + struct fanotify_error_event error_on_stack;
> struct fsnotify_group *group;
> struct fanotify_event *event;
> char __user *start;
> @@ -582,7 +608,7 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
> * in case there are lots of available events.
> */
> cond_resched();
> - event = get_one_event(group, count);
> + event = get_one_event(group, count, &error_on_stack);
> if (IS_ERR(event)) {
> ret = PTR_ERR(event);
> break;
> @@ -1031,6 +1057,10 @@ static int fanotify_add_mark(struct fsnotify_group *group,
> fanotify_init_event(&fee->fae, 0, FS_ERROR);
> fee->sb_mark = sb_mark;
> sb_mark->fee_slot = fee;
> +
> + /* Mark the error slot ready to receive events. */
> + fanotify_reset_error_slot(fee);
> +
> }
> }
>
> @@ -1459,6 +1489,11 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> fsid = &__fsid;
> }
>
> + if (mask & FAN_FS_ERROR && mark_type != FAN_MARK_FILESYSTEM) {
> + ret = -EINVAL;
> + goto path_put_and_out;
> + }
> +

Split to Wire-up patch please.

> /* inode held in place by reference to path; group by fget on fd */
> if (mark_type == FAN_MARK_INODE)
> inode = path.dentry->d_inode;
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index c05d45bde8b8..c4d49308b2d0 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -88,9 +88,13 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
> #define FANOTIFY_INODE_EVENTS (FANOTIFY_DIRENT_EVENTS | \
> FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)
>
> +/* Events that can only be reported with data type FSNOTIFY_EVENT_ERROR */
> +#define FANOTIFY_ERROR_EVENTS (FAN_FS_ERROR)
> +
> /* Events that user can request to be notified on */
> #define FANOTIFY_EVENTS (FANOTIFY_PATH_EVENTS | \
> - FANOTIFY_INODE_EVENTS)
> + FANOTIFY_INODE_EVENTS | \
> + FANOTIFY_ERROR_EVENTS)
>

Split to Wire-up patch please.

Thanks,
Amir.
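
The single-slot scheme described in the commit message (keep the first error, count the rest, copy out and reset on read) can be modelled in a few lines. This is a single-threaded sketch under assumptions: struct ex_error_slot and both functions are hypothetical, the error code is stored here for illustration, and in the kernel the counter is only touched while holding group->notification_lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single error slot; err_count == 0 means "free". */
struct ex_error_slot {
	uint32_t err_count;	/* errors seen since the last read */
	int32_t error;		/* code of the first error */
};

/*
 * Record an error. Returns true when this was the first error and
 * the caller has to queue the slot; later errors are merged by
 * bumping the counter only.
 */
bool ex_report_error(struct ex_error_slot *slot, int32_t error)
{
	if (slot->err_count++)
		return false;	/* slot occupied: merged */

	slot->error = error;
	return true;
}

/*
 * Dequeue: hand the reader a private copy and free the slot right
 * away, so errors arriving during the read start a fresh report.
 */
void ex_dequeue_error(struct ex_error_slot *slot,
		      struct ex_error_slot *copy)
{
	assert(slot->err_count > 0);
	*copy = *slot;
	slot->err_count = 0;
	slot->error = 0;
}
```

Resetting the error field together with the counter follows the review suggestion above; the series as posted only clears err_count.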

2021-08-16 13:18:25

by Jan Kara

Subject: Re: [PATCH v6 04/21] fsnotify: Reserve mark flag bits for backends

On Fri 13-08-21 10:28:27, Amir Goldstein wrote:
> On Fri, Aug 13, 2021 at 12:40 AM Gabriel Krisman Bertazi
> <[email protected]> wrote:
> >
> > Split out the final bits of struct fsnotify_mark->flags for use by a
> > backend.
> >
> > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> >
> > Changes since v1:
> > - turn consts into defines (jan)
> > ---
> > include/linux/fsnotify_backend.h | 18 +++++++++++++++---
> > 1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> > index 1ce66748a2d2..ae1bd9f06808 100644
> > --- a/include/linux/fsnotify_backend.h
> > +++ b/include/linux/fsnotify_backend.h
> > @@ -363,6 +363,20 @@ struct fsnotify_mark_connector {
> > struct hlist_head list;
> > };
> >
> > +enum fsnotify_mark_bits {
> > + FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY,
> > + FSN_MARK_FL_BIT_ALIVE,
> > + FSN_MARK_FL_BIT_ATTACHED,
> > + FSN_MARK_PRIVATE_FLAGS,
> > +};
> > +
> > +#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY \
> > + (1 << FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY)
> > +#define FSNOTIFY_MARK_FLAG_ALIVE \
> > + (1 << FSN_MARK_FL_BIT_ALIVE)
> > +#define FSNOTIFY_MARK_FLAG_ATTACHED \
> > + (1 << FSN_MARK_FL_BIT_ATTACHED)
> > +
> > /*
> > * A mark is simply an object attached to an in core inode which allows an
> > * fsnotify listener to indicate they are either no longer interested in events
> > @@ -398,9 +412,7 @@ struct fsnotify_mark {
> > struct fsnotify_mark_connector *connector;
> > /* Events types to ignore [mark->lock, group->mark_mutex] */
> > __u32 ignored_mask;
> > -#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY 0x01
> > -#define FSNOTIFY_MARK_FLAG_ALIVE 0x02
> > -#define FSNOTIFY_MARK_FLAG_ATTACHED 0x04
> > + /* Upper bits [31:PRIVATE_FLAGS] are reserved for backend usage */
>
> I don't understand what [31:PRIVATE_FLAGS] means

I think it should be [FSN_MARK_PRIVATE_FLAGS:31] (identifying a range of
bits). I'd maybe write just "Bits starting from FSN_MARK_PRIVATE_FLAGS are
reserved for backend usage". With this fixed feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
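
The bit range under discussion can be illustrated with a small userspace sketch. It assumes the enum from the patch as quoted above; the BACKEND_* names and the fsn_mark_private_mask() helper are hypothetical and only show how a backend could claim the first reserved bit.

```c
#include <stdint.h>

/* Generic flag bits; the last enumerator is the first bit left to backends. */
enum fsnotify_mark_bits {
	FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY,
	FSN_MARK_FL_BIT_ALIVE,
	FSN_MARK_FL_BIT_ATTACHED,
	FSN_MARK_PRIVATE_FLAGS,	/* == 3: bits 3..31 are for backend usage */
};

#define FSNOTIFY_MARK_FLAG_ALIVE	(1U << FSN_MARK_FL_BIT_ALIVE)

/* Hypothetical backend flag placed on the first reserved bit. */
enum backend_mark_bits {
	BACKEND_MARK_FL_BIT_EXAMPLE = FSN_MARK_PRIVATE_FLAGS,
};
#define BACKEND_MARK_FLAG_EXAMPLE	(1U << BACKEND_MARK_FL_BIT_EXAMPLE)

/* Hypothetical helper: mask of every bit from FSN_MARK_PRIVATE_FLAGS to 31. */
static inline uint32_t fsn_mark_private_mask(void)
{
	return ~((1U << FSN_MARK_PRIVATE_FLAGS) - 1U);
}
```

Because the backend enum starts at FSN_MARK_PRIVATE_FLAGS, adding a new generic bit before it shifts the private range automatically.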

2021-08-16 13:21:24

by Jan Kara

Subject: Re: [PATCH v6 05/21] fanotify: Split superblock marks out to a new cache

On Thu 12-08-21 17:39:54, Gabriel Krisman Bertazi wrote:
> FAN_FS_ERROR will require an error structure to be stored per mark.
> But, since FAN_FS_ERROR doesn't apply to inode/mount marks, it should
> suffice to only expose this information for superblock marks. Therefore,
> wrap this kind of marks into a container and plumb it for the future.
>
> Reviewed-by: Amir Goldstein <[email protected]>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> Changes since v5:
> - turn the flag bits into defines (jan)
> - don't use zalloc for consistency (jan)
> Changes since v2:
> - Move mark initialization to fanotify_alloc_mark (Amir)
>
> Changes since v1:
> - Only extend superblock marks (Amir)
> ---
> fs/notify/fanotify/fanotify.c | 10 ++++++--
> fs/notify/fanotify/fanotify.h | 20 ++++++++++++++++
> fs/notify/fanotify/fanotify_user.c | 38 ++++++++++++++++++++++++++++--
> 3 files changed, 64 insertions(+), 4 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 310246f8d3f1..c3eefe3f6494 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -869,9 +869,15 @@ static void fanotify_freeing_mark(struct fsnotify_mark *mark,
> dec_ucount(group->fanotify_data.ucounts, UCOUNT_FANOTIFY_MARKS);
> }
>
> -static void fanotify_free_mark(struct fsnotify_mark *fsn_mark)
> +static void fanotify_free_mark(struct fsnotify_mark *mark)
> {
> - kmem_cache_free(fanotify_mark_cache, fsn_mark);
> + if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
> + struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);
> +
> + kmem_cache_free(fanotify_sb_mark_cache, fa_mark);
> + } else {
> + kmem_cache_free(fanotify_mark_cache, mark);
> + }
> }
>
> const struct fsnotify_ops fanotify_fsnotify_ops = {
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index 4a5e555dc3d2..3b11dd03df59 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -6,6 +6,7 @@
> #include <linux/hashtable.h>
>
> extern struct kmem_cache *fanotify_mark_cache;
> +extern struct kmem_cache *fanotify_sb_mark_cache;
> extern struct kmem_cache *fanotify_fid_event_cachep;
> extern struct kmem_cache *fanotify_path_event_cachep;
> extern struct kmem_cache *fanotify_perm_event_cachep;
> @@ -129,6 +130,25 @@ static inline void fanotify_info_copy_name(struct fanotify_info *info,
> name->name);
> }
>
> +enum fanotify_mark_bits {
> + FANOTIFY_MARK_FLAG_BIT_SB_MARK = FSN_MARK_PRIVATE_FLAGS,
> +};
> +
> +#define FANOTIFY_MARK_FLAG_SB_MARK \
> + (1 << FANOTIFY_MARK_FLAG_BIT_SB_MARK)
> +
> +struct fanotify_sb_mark {
> + struct fsnotify_mark fsn_mark;
> +};
> +
> +static inline
> +struct fanotify_sb_mark *FANOTIFY_SB_MARK(struct fsnotify_mark *mark)
> +{
> + WARN_ON(!(mark->flags & FANOTIFY_MARK_FLAG_SB_MARK));
> +
> + return container_of(mark, struct fanotify_sb_mark, fsn_mark);
> +}
> +
> /*
> * Common structure for fanotify events. Concrete structs are allocated in
> * fanotify_handle_event() and freed when the information is retrieved by
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 67b18dfe0025..c47a5a45c0d3 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -99,6 +99,7 @@ struct ctl_table fanotify_table[] = {
> extern const struct fsnotify_ops fanotify_fsnotify_ops;
>
> struct kmem_cache *fanotify_mark_cache __read_mostly;
> +struct kmem_cache *fanotify_sb_mark_cache __read_mostly;
> struct kmem_cache *fanotify_fid_event_cachep __read_mostly;
> struct kmem_cache *fanotify_path_event_cachep __read_mostly;
> struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
> @@ -915,6 +916,38 @@ static __u32 fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
> return mask & ~oldmask;
> }
>
> +static struct fsnotify_mark *fanotify_alloc_mark(struct fsnotify_group *group,
> + unsigned int type)
> +{
> + struct fanotify_sb_mark *sb_mark;
> + struct fsnotify_mark *mark;
> +
> + switch (type) {
> + case FSNOTIFY_OBJ_TYPE_SB:
> + sb_mark = kmem_cache_alloc(fanotify_sb_mark_cache, GFP_KERNEL);
> + if (!sb_mark)
> + return NULL;
> + mark = &sb_mark->fsn_mark;
> + break;
> +
> + case FSNOTIFY_OBJ_TYPE_INODE:
> + case FSNOTIFY_OBJ_TYPE_PARENT:
> + case FSNOTIFY_OBJ_TYPE_VFSMOUNT:
> + mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
> + break;
> + default:
> + WARN_ON(1);
> + return NULL;
> + }
> +
> + fsnotify_init_mark(mark, group);
> +
> + if (type == FSNOTIFY_OBJ_TYPE_SB)
> + mark->flags |= FANOTIFY_MARK_FLAG_SB_MARK;
> +
> + return mark;
> +}
> +
> static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
> fsnotify_connp_t *connp,
> unsigned int type,
> @@ -933,13 +966,12 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
> !inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_FANOTIFY_MARKS))
> return ERR_PTR(-ENOSPC);
>
> - mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
> + mark = fanotify_alloc_mark(group, type);
> if (!mark) {
> ret = -ENOMEM;
> goto out_dec_ucounts;
> }
>
> - fsnotify_init_mark(mark, group);
> ret = fsnotify_add_mark_locked(mark, connp, type, 0, fsid);
> if (ret) {
> fsnotify_put_mark(mark);
> @@ -1497,6 +1529,8 @@ static int __init fanotify_user_setup(void)
>
> fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> SLAB_PANIC|SLAB_ACCOUNT);
> + fanotify_sb_mark_cache = KMEM_CACHE(fanotify_sb_mark,
> + SLAB_PANIC|SLAB_ACCOUNT);
> fanotify_fid_event_cachep = KMEM_CACHE(fanotify_fid_event,
> SLAB_PANIC);
> fanotify_path_event_cachep = KMEM_CACHE(fanotify_path_event,
> --
> 2.32.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
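
The container pattern in this patch can be sketched in userspace C roughly as follows. The struct and function names here (mark, sb_mark, alloc_mark, to_sb_mark, free_mark) are simplified stand-ins, calloc()/free() replace the kmem caches, and to_sb_mark() returns NULL where the patch uses WARN_ON(); none of this is the kernel code itself.

```c
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define MARK_FLAG_SB_MARK (1U << 3)	/* first backend-private bit in the series */

struct mark {			/* stand-in for struct fsnotify_mark */
	unsigned int flags;
};

struct sb_mark {		/* stand-in for struct fanotify_sb_mark */
	struct mark fsn_mark;
	int extra;		/* placeholder for per-superblock data */
};

/* Allocate either a plain mark or the larger container, tagging the latter. */
static struct mark *alloc_mark(int is_sb)
{
	if (is_sb) {
		struct sb_mark *sbm = calloc(1, sizeof(*sbm));

		if (!sbm)
			return NULL;
		sbm->fsn_mark.flags |= MARK_FLAG_SB_MARK;
		return &sbm->fsn_mark;
	}
	return calloc(1, sizeof(struct mark));
}

/* Recover the container; NULL when the mark is not a superblock mark. */
static struct sb_mark *to_sb_mark(struct mark *m)
{
	if (!(m->flags & MARK_FLAG_SB_MARK))
		return NULL;
	return container_of(m, struct sb_mark, fsn_mark);
}

/* Free through the pointer type that matches the original allocation. */
static void free_mark(struct mark *m)
{
	struct sb_mark *sbm = to_sb_mark(m);

	if (sbm)
		free(sbm);
	else
		free(m);
}
```

The flag is what lets the free path pick the right allocator, which is why the patch sets it at allocation time rather than later.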

2021-08-16 13:26:51

by Jan Kara

Subject: Re: [PATCH v6 10/21] fsnotify: Support FS_ERROR event type

On Thu 12-08-21 17:39:59, Gabriel Krisman Bertazi wrote:
> Expose a new type of fsnotify event for filesystems to report errors for
> userspace monitoring tools. fanotify will send this type of
> notification for FAN_FS_ERROR events. This also introduces a helper for
> generating the new event.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

>
> ---
> Changes since v5:
> - pass sb inside data field (jan)
> Changes since v3:
> - Squash patch ("fsnotify: Introduce helpers to send error_events")
> - Drop reviewed-bys!
>
> Changes since v2:
> - FAN_ERROR->FAN_FS_ERROR (Amir)
>
> Changes since v1:
> - Overload FS_ERROR with FS_IN_IGNORED
> - Implement support for this type on fsnotify_data_inode (Amir)
> ---
> fs/notify/fsnotify.c | 3 +++
> include/linux/fsnotify.h | 13 +++++++++++++
> include/linux/fsnotify_backend.h | 18 +++++++++++++++++-
> 3 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 536db02cb26e..6d3b3de4f8ee 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -103,6 +103,9 @@ static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> struct inode *inode = fsnotify_data_inode(data, data_type);
> struct super_block *sb = inode ? inode->i_sb : NULL;
>
> + if (!sb && data_type == FSNOTIFY_EVENT_ERROR)
> + sb = ((struct fs_error_report *) data)->sb;
> +
> return sb;
> }
>
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index f8acddcf54fb..521234af1827 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -317,4 +317,17 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
> fsnotify_dentry(dentry, mask);
> }
>
> +static inline int fsnotify_sb_error(struct super_block *sb, struct inode *inode,
> + int error)
> +{
> + struct fs_error_report report = {
> + .error = error,
> + .inode = inode,
> + .sb = sb,
> + };
> +
> + return fsnotify(FS_ERROR, &report, FSNOTIFY_EVENT_ERROR,
> + NULL, NULL, NULL, 0);
> +}
> +
> #endif /* _LINUX_FS_NOTIFY_H */
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index e027af3cd8dd..277b6f3e0998 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -42,6 +42,12 @@
>
> #define FS_UNMOUNT 0x00002000 /* inode on umount fs */
> #define FS_Q_OVERFLOW 0x00004000 /* Event queued overflowed */
> +#define FS_ERROR 0x00008000 /* Filesystem Error (fanotify) */
> +
> +/*
> + * FS_IN_IGNORED overloads FS_ERROR. It is only used internally by inotify
> + * which does not support FS_ERROR.
> + */
> #define FS_IN_IGNORED 0x00008000 /* last inotify event here */
>
> #define FS_OPEN_PERM 0x00010000 /* open event in an permission hook */
> @@ -95,7 +101,8 @@
> #define ALL_FSNOTIFY_EVENTS (ALL_FSNOTIFY_DIRENT_EVENTS | \
> FS_EVENTS_POSS_ON_CHILD | \
> FS_DELETE_SELF | FS_MOVE_SELF | FS_DN_RENAME | \
> - FS_UNMOUNT | FS_Q_OVERFLOW | FS_IN_IGNORED)
> + FS_UNMOUNT | FS_Q_OVERFLOW | FS_IN_IGNORED | \
> + FS_ERROR)
>
> /* Extra flags that may be reported with event or control handling of events */
> #define ALL_FSNOTIFY_FLAGS (FS_EXCL_UNLINK | FS_ISDIR | FS_IN_ONESHOT | \
> @@ -248,6 +255,13 @@ enum fsnotify_data_type {
> FSNOTIFY_EVENT_NONE,
> FSNOTIFY_EVENT_PATH,
> FSNOTIFY_EVENT_INODE,
> + FSNOTIFY_EVENT_ERROR,
> +};
> +
> +struct fs_error_report {
> + int error;
> + struct inode *inode;
> + struct super_block *sb;
> };
>
> static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
> @@ -257,6 +271,8 @@ static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
> return (struct inode *)data;
> case FSNOTIFY_EVENT_PATH:
> return d_inode(((const struct path *)data)->dentry);
> + case FSNOTIFY_EVENT_ERROR:
> + return ((struct fs_error_report *)data)->inode;
> default:
> return NULL;
> }
> --
> 2.32.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
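
The superblock lookup added by this patch can be sketched in userspace as below. The types are reduced mock-ups and the names data_inode(), data_sb() and the EVENT_* enumerators are hypothetical shorthand for the kernel helpers quoted above.

```c
#include <stddef.h>

struct super_block { int id; };
struct inode { struct super_block *i_sb; };

enum data_type { EVENT_NONE, EVENT_INODE, EVENT_ERROR };

struct fs_error_report {
	int error;
	struct inode *inode;		/* may be NULL for sb-wide errors */
	struct super_block *sb;
};

/* Extract the inode, if any, from the typed event payload. */
static struct inode *data_inode(const void *data, enum data_type type)
{
	switch (type) {
	case EVENT_INODE:
		return (struct inode *)data;
	case EVENT_ERROR:
		return ((const struct fs_error_report *)data)->inode;
	default:
		return NULL;
	}
}

/* Prefer the inode's superblock; fall back to the report's own sb field. */
static struct super_block *data_sb(const void *data, enum data_type type)
{
	struct inode *inode = data_inode(data, type);
	struct super_block *sb = inode ? inode->i_sb : NULL;

	if (!sb && type == EVENT_ERROR)
		sb = ((const struct fs_error_report *)data)->sb;
	return sb;
}
```

The fallback is what makes errors without an inode routable to superblock marks.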

2021-08-16 14:07:53

by Jan Kara

Subject: Re: [PATCH v6 12/21] fanotify: Encode invalid file handle when no inode is provided

On Fri 13-08-21 11:27:48, Amir Goldstein wrote:
> On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> <[email protected]> wrote:
> >
> > Instead of failing, encode an invalid file handle in fanotify_encode_fh
> > if no inode is provided. This bogus file handle will be reported by
> > FAN_FS_ERROR for non-inode errors.
> >
> > When being reported to userspace, the length information is actually
> > reset and the handle cleaned up, such that userspace doesn't have the
> > visibility of the internal kernel representation of this null handle.
> >
> > Also adjust the single caller that might rely on failure after passing
> > an empty inode.
> >
> > Suggested-by: Amir Goldstein <[email protected]>
> > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> >
> > ---
> > Changes since v5:
> > - Preserve flags initialization (jan)
> > - Add BUILD_BUG_ON (amir)
> > - Require minimum of FANOTIFY_NULL_FH_LEN for fh_len(amir)
> > - Improve comment to explain the null FH length (jan)
> > - Simplify logic
> > ---
> > fs/notify/fanotify/fanotify.c | 27 ++++++++++++++++++-----
> > fs/notify/fanotify/fanotify_user.c | 35 +++++++++++++++++-------------
> > 2 files changed, 41 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > index 50fce4fec0d6..2b1ab031fbe5 100644
> > --- a/fs/notify/fanotify/fanotify.c
> > +++ b/fs/notify/fanotify/fanotify.c
> > @@ -334,6 +334,8 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group,
> > return test_mask & user_mask;
> > }
> >
> > +#define FANOTIFY_NULL_FH_LEN 4
> > +
> > /*
> > * Check size needed to encode fanotify_fh.
> > *
> > @@ -345,7 +347,7 @@ static int fanotify_encode_fh_len(struct inode *inode)
> > int dwords = 0;
> >
> > if (!inode)
> > - return 0;
> > + return FANOTIFY_NULL_FH_LEN;
> >
> > exportfs_encode_inode_fh(inode, NULL, &dwords, NULL);
> >
> > @@ -367,11 +369,23 @@ static int fanotify_encode_fh(struct fanotify_fh *fh, struct inode *inode,
> > void *buf = fh->buf;
> > int err;
> >
> > - fh->type = FILEID_ROOT;
> > - fh->len = 0;
> > + BUILD_BUG_ON(FANOTIFY_NULL_FH_LEN < 4 ||
> > + FANOTIFY_NULL_FH_LEN > FANOTIFY_INLINE_FH_LEN);
> > +
> > fh->flags = 0;
> > - if (!inode)
> > - return 0;
> > +
> > + if (!inode) {
> > + /*
> > + * Invalid FHs are used on FAN_FS_ERROR for errors not
> > + * linked to any inode. The f_handle won't be reported
> > + * back to userspace. The extra bytes are cleared prior
> > + * to reporting.
> > + */
> > + type = FILEID_INVALID;
> > + fh_len = FANOTIFY_NULL_FH_LEN;
>
> Please memset() the NULL_FH buffer to zero.
>
> > +
> > + goto success;
> > + }
> >
> > /*
> > * !gpf means preallocated variable size fh, but fh_len could
> > @@ -400,6 +414,7 @@ static int fanotify_encode_fh(struct fanotify_fh *fh, struct inode *inode,
> > if (!type || type == FILEID_INVALID || fh_len != dwords << 2)
> > goto out_err;
> >
> > +success:
> > fh->type = type;
> > fh->len = fh_len;
> >
> > @@ -529,7 +544,7 @@ static struct fanotify_event *fanotify_alloc_name_event(struct inode *id,
> > struct fanotify_info *info;
> > struct fanotify_fh *dfh, *ffh;
> > unsigned int dir_fh_len = fanotify_encode_fh_len(id);
> > - unsigned int child_fh_len = fanotify_encode_fh_len(child);
> > + unsigned int child_fh_len = child ? fanotify_encode_fh_len(child) : 0;
> > unsigned int size;
> >
> > size = sizeof(*fne) + FANOTIFY_FH_HDR_LEN + dir_fh_len;
> > diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> > index c47a5a45c0d3..4cacea5fcaca 100644
> > --- a/fs/notify/fanotify/fanotify_user.c
> > +++ b/fs/notify/fanotify/fanotify_user.c
> > @@ -360,7 +360,10 @@ static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> > return -EFAULT;
> >
> > handle.handle_type = fh->type;
> > - handle.handle_bytes = fh_len;
> > +
> > + /* FILEID_INVALID handle type is reported without its f_handle. */
> > + if (fh->type != FILEID_INVALID)
> > + handle.handle_bytes = fh_len;
>
> I know I suggested those exact lines, but looking at the patch,
> I think it would be better to do:
> + if (fh->type == FILEID_INVALID)
> + fh_len = 0;
> handle.handle_bytes = fh_len;
>
> > if (copy_to_user(buf, &handle, sizeof(handle)))
> > return -EFAULT;
> >
> > @@ -369,20 +372,22 @@ static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> > if (WARN_ON_ONCE(len < fh_len))
> > return -EFAULT;
> >
> > - /*
> > - * For an inline fh and inline file name, copy through stack to exclude
> > - * the copy from usercopy hardening protections.
> > - */
> > - fh_buf = fanotify_fh_buf(fh);
> > - if (fh_len <= FANOTIFY_INLINE_FH_LEN) {
> > - memcpy(bounce, fh_buf, fh_len);
> > - fh_buf = bounce;
> > + if (fh->type != FILEID_INVALID) {
>
> ... and here: if (fh_len) {
>
> > + /*
> > + * For an inline fh and inline file name, copy through
> > + * stack to exclude the copy from usercopy hardening
> > + * protections.
> > + */
> > + fh_buf = fanotify_fh_buf(fh);
> > + if (fh_len <= FANOTIFY_INLINE_FH_LEN) {
> > + memcpy(bounce, fh_buf, fh_len);
> > + fh_buf = bounce;
> > + }
> > + if (copy_to_user(buf, fh_buf, fh_len))
> > + return -EFAULT;
> > + buf += fh_len;
> > + len -= fh_len;
> > }
> > - if (copy_to_user(buf, fh_buf, fh_len))
> > - return -EFAULT;
> > -
> > - buf += fh_len;
> > - len -= fh_len;
> >
> > if (name_len) {
> > /* Copy the filename with terminating null */
> > @@ -398,7 +403,7 @@ static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> > }
> >
> > /* Pad with 0's */
> > - WARN_ON_ONCE(len < 0 || len >= FANOTIFY_EVENT_ALIGN);
> > + WARN_ON_ONCE(len < 0);
>
> According to my calculations, FAN_FS_ERROR event with NULL_FH is expected
> to get here with len == 4, so you can change this to:
> WARN_ON_ONCE(len < 0 || len > FANOTIFY_EVENT_ALIGN);
>
> But first, I would like to get Jan's feedback on this concept of keeping
> unneeded 4 bytes zero padding in reported event in case of NULL_FH
> in order to keep the FID reporting code simpler.

Dunno, it still seems like quite some complications (simple ones but
non-trivial amount of them) for what is rather a corner case. What if we
*internally* propagated the information that there's no inode info with
FILEID_ROOT fh? That means: No changes to fanotify_encode_fh_len(),
fanotify_encode_fh(), or fanotify_alloc_name_event(). In
copy_info_to_user() we just mangle FILEID_ROOT to FILEID_INVALID and that's
all. No useless padding, no specialcasing of copying etc. Am I missing
something?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
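
The mangling idea above might look roughly like this in a userspace mock. The struct layouts and fill_user_handle() are hypothetical simplifications; only the FILEID_ROOT and FILEID_INVALID values mirror the kernel's definitions.

```c
#define FILEID_ROOT	0	/* kernel-internal "no handle" type */
#define FILEID_INVALID	0xff	/* type reported to userspace */

struct fh {			/* stand-in for struct fanotify_fh */
	unsigned char type;
	unsigned char len;
};

struct user_handle {		/* stand-in for the struct file_handle header */
	unsigned int handle_bytes;
	int handle_type;
};

/* Fill the header that would be copied to userspace. */
static void fill_user_handle(const struct fh *fh, struct user_handle *out)
{
	out->handle_type = fh->type;
	out->handle_bytes = fh->len;

	/* A zero-length FILEID_ROOT handle means "no inode": report it as invalid. */
	if (fh->type == FILEID_ROOT)
		out->handle_type = FILEID_INVALID;
}
```

With this approach the encoder keeps its existing zero-length behaviour and only the reporting path changes.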

2021-08-16 15:56:29

by Amir Goldstein

Subject: Re: [PATCH v6 12/21] fanotify: Encode invalid file handle when no inode is provided

On Mon, Aug 16, 2021 at 5:07 PM Jan Kara <[email protected]> wrote:
>
> Dunno, it still seems like quite some complications (simple ones but
> non-trivial amount of them) for what is rather a corner case. What if we
> *internally* propagated the information that there's no inode info with
> FILEID_ROOT fh? That means: No changes to fanotify_encode_fh_len(),
> fanotify_encode_fh(), or fanotify_alloc_name_event(). In
> copy_info_to_user() we just mangle FILEID_ROOT to FILEID_INVALID and that's
> all. No useless padding, no specialcasing of copying etc. Am I missing
> something?

I am perfectly fine with encoding "no inode" with FILEID_ROOT internally.
It's already the value used by fanotify_encode_fh() in upstream.

However, if we use zero len internally, we need to pass fh_type to
fanotify_fid_info_len() and special case FILEID_ROOT in order to
take FANOTIFY_FID_INFO_HDR_LEN into account.

And special case fanotify_event_object_fh_len() in
fanotify_event_info_len() and in copy_info_records_to_user().

Or maybe I am missing something....

Thanks,
Amir.

2021-08-16 15:58:23

by Jan Kara

Subject: Re: [PATCH v6 15/21] fanotify: Preallocate per superblock mark error event

On Thu 12-08-21 17:40:04, Gabriel Krisman Bertazi wrote:
> Error reporting needs to be done in an atomic context. This patch
> introduces a single error slot for superblock marks that report the
> FAN_FS_ERROR event, to be used during event submission.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> ---
> Changes since v5:
> - Restore mark references. (jan)
> - Tie fee slot to the mark lifetime.(jan)
> - Don't reallocate event(jan)
> ---
> fs/notify/fanotify/fanotify.c | 12 ++++++++++++
> fs/notify/fanotify/fanotify.h | 13 +++++++++++++
> fs/notify/fanotify/fanotify_user.c | 31 ++++++++++++++++++++++++++++--
> 3 files changed, 54 insertions(+), 2 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index ebb6c557cea1..3bf6fd85c634 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -855,6 +855,14 @@ static void fanotify_free_name_event(struct fanotify_event *event)
> kfree(FANOTIFY_NE(event));
> }
>
> +static void fanotify_free_error_event(struct fanotify_event *event)
> +{
> + /*
> + * The actual event is tied to a mark, and is released on mark
> + * removal
> + */
> +}
> +

I was pondering about the lifetime rules some more. This is also related to
patch 16/21 but I'll comment here. When we hold mark ref from queued event,
we introduce a subtle race into group destruction logic. There we first
evict all marks, wait for them to be destroyed by worker thread after SRCU
period expires, and then we remove queued events. When we hold a mark
reference from an event, we break this: the mark will exist until the event
is dequeued, so the group can get freed before we actually free the mark,
and mark freeing can then hit use-after-free issues.

So we'll have to do this a bit differently. I have two options:

1) Instead of preallocating events explicitly like this, we could set up a
mempool to allocate error events from for each notification group. We would
resize the mempool when adding an error mark so that it has as many reserved
events as error marks. The upside is that error events will be much less
special - no special lifetime rules. We'd just need to set up & resize the
mempool. We would also have to provide a proper merge function for error
events (to merge events from the same sb). Also there will be a limit on the
number of error marks per group because mempools use kmalloc() for an array
tracking reserved events. But we could certainly manage 512, likely 1024
error marks per notification group.

2) We would keep attaching event to mark as currently. As far as I have
checked the event doesn't actually need a back-ref to sb_mark. It is
really only used for mark reference taking (and then to get to sb from
fanotify_handle_error_event() but we can certainly get to sb by easier
means there). So I would just remove that. What we still need to know in
fanotify_free_error_event() though is whether the sb_mark is still alive or
not. If it is alive, we leave the event alone, otherwise we need to free it.
So we need a mark_alive flag in the error event and then do in ->freeing_mark
callback something like:

	if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
		struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);

###		/* Maybe we could use mark->lock for this? */
		spin_lock(&group->notification_lock);
		if (fa_mark->fee_slot) {
			if (list_empty(&fa_mark->fee_slot->fae.fse.list)) {
				kfree(fa_mark->fee_slot);
				fa_mark->fee_slot = NULL;
			} else {
				fa_mark->fee_slot->mark_alive = 0;
			}
		}
		spin_unlock(&group->notification_lock);
	}

And then when queueing and dequeueing event we would have to carefully
check what is the mark & event state under appropriate lock (because
->handle_event() callbacks can see marks on the way to be destroyed as they
are protected just by SRCU).
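
The ownership hand-off in option 2 can be modelled in userspace as a rough sketch. It assumes a boolean "queued" in place of the list_empty() check and omits the locking entirely; freeing_mark() and free_error_event() are hypothetical stand-ins for the two callbacks, not the proposed kernel code.

```c
#include <stdbool.h>
#include <stdlib.h>

struct error_event {
	bool queued;		/* stands in for !list_empty(&fae.fse.list) */
	bool mark_alive;	/* cleared once the owning mark is being freed */
};

struct sb_mark {
	struct error_event *fee_slot;
};

/* Mark teardown: free an idle slot, or orphan a slot that is still queued. */
static void freeing_mark(struct sb_mark *m)
{
	/* The real code would hold a lock here (notification_lock or mark->lock). */
	if (!m->fee_slot)
		return;
	if (!m->fee_slot->queued) {
		free(m->fee_slot);
		m->fee_slot = NULL;
	} else {
		m->fee_slot->mark_alive = false;
	}
}

/* Event release after dequeue: returns true if the event memory was freed. */
static bool free_error_event(struct error_event *ev)
{
	ev->queued = false;
	if (ev->mark_alive)
		return false;	/* still owned by the mark; the slot is reused */
	free(ev);
	return true;
}
```

Whichever side runs last frees the event, so neither the mark nor the queue needs to hold a reference on the other.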


> @@ -1009,13 +1012,37 @@ static int fanotify_add_mark(struct fsnotify_group *group,
> return PTR_ERR(fsn_mark);
> }
> }
> +
> + /*
> + * Error events are allocated per super-block mark only if
> + * strictly needed (i.e. FAN_FS_ERROR was requested).
> + */
> + if (type == FSNOTIFY_OBJ_TYPE_SB && !(flags & FAN_MARK_IGNORED_MASK) &&
> + (mask & FAN_FS_ERROR)) {
> + struct fanotify_sb_mark *sb_mark = FANOTIFY_SB_MARK(fsn_mark);
> +
> + if (!sb_mark->fee_slot) {
> + struct fanotify_error_event *fee =
> + kzalloc(sizeof(*fee), GFP_KERNEL_ACCOUNT);

As Amir mentioned, no need for kzalloc() here.

> + if (!fee) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + fanotify_init_event(&fee->fae, 0, FS_ERROR);
> + fee->sb_mark = sb_mark;
> + sb_mark->fee_slot = fee;

Careful here. The 'sb_mark' can be already attached to sb and events can
walk it. So we should make sure these readers don't see a half-initialized
'fee' due to the CPU reordering stores. So this needs to be protected by the
same lock that we use when generating error event.
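
The ordering concern can be sketched in userspace C as follows, assuming a pthread mutex in place of whichever kernel lock ends up protecting event generation. attach_slot() and report_error() are hypothetical names; the point is only that the slot is fully initialized before it is published, and that readers dereference it under the same lock.

```c
#include <pthread.h>
#include <stdlib.h>

struct error_event {
	int type;
	int err_count;
};

struct sb_mark {
	pthread_mutex_t lock;		/* stands in for the event-generation lock */
	struct error_event *fee_slot;
};

/* Writer: finish initializing the slot, then publish it under the lock. */
static int attach_slot(struct sb_mark *m)
{
	struct error_event *fee = malloc(sizeof(*fee));

	if (!fee)
		return -1;
	fee->type = 1;
	fee->err_count = 0;

	pthread_mutex_lock(&m->lock);
	if (!m->fee_slot) {
		m->fee_slot = fee;
		fee = NULL;
	}
	pthread_mutex_unlock(&m->lock);

	free(fee);	/* non-NULL only if a slot was already attached */
	return 0;
}

/* Reader: only dereference the slot while holding the same lock. */
static int report_error(struct sb_mark *m)
{
	int counted = 0;

	pthread_mutex_lock(&m->lock);
	if (m->fee_slot) {
		m->fee_slot->err_count++;
		counted = m->fee_slot->err_count;
	}
	pthread_mutex_unlock(&m->lock);
	return counted;
}
```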

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-16 16:13:17

by Jan Kara

Subject: Re: [PATCH v6 12/21] fanotify: Encode invalid file handle when no inode is provided

On Mon 16-08-21 18:54:58, Amir Goldstein wrote:
> On Mon, Aug 16, 2021 at 5:07 PM Jan Kara <[email protected]> wrote:
> > Dunno, it still seems like quite some complications (simple ones but
> > non-trivial amount of them) for what is rather a corner case. What if we
> > *internally* propagated the information that there's no inode info with
> > FILEID_ROOT fh? That means: No changes to fanotify_encode_fh_len(),
> > fanotify_encode_fh(), or fanotify_alloc_name_event(). In
> > copy_info_to_user() we just mangle FILEID_ROOT to FILEID_INVALID and that's
> > all. No useless padding, no specialcasing of copying etc. Am I missing
> > something?
>
> I am perfectly fine with encoding "no inode" with FILEID_ROOT internally.
> It's already the value used by fanotify_encode_fh() in upstream.
>
> However, if we use zero len internally, we need to pass fh_type to
> fanotify_fid_info_len() and special case FILEID_ROOT in order to
> take FANOTIFY_FID_INFO_HDR_LEN into account.
>
> And special case fanotify_event_object_fh_len() in
> fanotify_event_info_len() and in copy_info_records_to_user().

Right, this will need some tweaking. I would actually leave
fanotify_fid_info_len() alone, just have in fanotify_event_info_len()
something like:

- if (fh_len)
+ if (fh_len || fanotify_event_needs_fsid(event))

and similarly in copy_info_records_to_user():

- if (fanotify_event_object_fh_len(event)) {
+ if (fanotify_event_object_fh_len(event) ||
+ fanotify_event_needs_fsid(event)) {

And that should be all that's needed as far as I'm reading the code.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-16 16:19:28

by Jan Kara

Subject: Re: [PATCH v6 17/21] fanotify: Report fid info for file related file system errors

On Thu 12-08-21 17:40:06, Gabriel Krisman Bertazi wrote:
> Plumb the pieces to add a FID report to error records. Since all error
> event memory must be pre-allocated, we estimate a file handle size and
> if it is insufficient, we report an invalid FID and increase the
> prediction for the next error slot allocation.

This needs updating. The code now uses MAX_HANDLE_SZ...

> For errors that don't expose a file handle report it with an invalid
> FID.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> ---
> Changes since v5:
> - Use preallocated MAX_HANDLE_SZ FH buffer
> - Report superblock errors with a zerolength INVALID FID (jan, amir)
> ---
> fs/notify/fanotify/fanotify.c | 15 +++++++++++++++
> fs/notify/fanotify/fanotify.h | 11 +++++++++++
> fs/notify/fanotify/fanotify_user.c | 7 +++++++
> 3 files changed, 33 insertions(+)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 0c7667d3f5d1..f5c16ac37835 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -734,6 +734,8 @@ static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
> struct fanotify_sb_mark *sb_mark =
> FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
> struct fanotify_error_event *fee = sb_mark->fee_slot;
> + struct inode *inode = report->inode;
> + int fh_len;
>
> spin_lock(&group->notification_lock);
> if (fee->err_count++) {
> @@ -743,6 +745,19 @@ static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
> spin_unlock(&group->notification_lock);
>
> fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> + fee->fsid = fee->sb_mark->fsn_mark.connector->fsid;

Why don't you use sb_mark directly?

Otherwise the patch looks good to me.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-16 16:24:01

by Jan Kara

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Thu 12-08-21 17:40:07, Gabriel Krisman Bertazi wrote:
> The Error info type is a record sent to users on FAN_FS_ERROR events
> documenting the type of error. It also carries an error count,
> documenting how many errors were observed since the last reporting.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

>
> ---
> Changes since v5:
> - Move error code here
> ---
> fs/notify/fanotify/fanotify.c | 1 +
> fs/notify/fanotify/fanotify.h | 1 +
> fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
> include/uapi/linux/fanotify.h | 7 ++++++
> 4 files changed, 45 insertions(+)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index f5c16ac37835..b49a474c1d7f 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -745,6 +745,7 @@ static int fanotify_handle_error_event(struct fsnotify_iter_info *iter_info,
> spin_unlock(&group->notification_lock);
>
> fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> + fee->error = report->error;
> fee->fsid = fee->sb_mark->fsn_mark.connector->fsid;
>
> fh_len = fanotify_encode_fh_len(inode);
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index 158cf0c4b0bd..0cfe376c6fd9 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -220,6 +220,7 @@ FANOTIFY_NE(struct fanotify_event *event)
>
> struct fanotify_error_event {
> struct fanotify_event fae;
> + s32 error; /* Error reported by the Filesystem. */
> u32 err_count; /* Suppressed errors count */
>
> struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 1ab8f9d8b3ac..ca53159ce673 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -107,6 +107,8 @@ struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
> #define FANOTIFY_EVENT_ALIGN 4
> #define FANOTIFY_INFO_HDR_LEN \
> (sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
> +#define FANOTIFY_INFO_ERROR_LEN \
> + (sizeof(struct fanotify_event_info_error))
>
> static int fanotify_fid_info_len(int fh_len, int name_len)
> {
> @@ -130,6 +132,9 @@ static size_t fanotify_event_len(struct fanotify_event *event,
> if (!fid_mode)
> return event_len;
>
> + if (fanotify_is_error_event(event->mask))
> + event_len += FANOTIFY_INFO_ERROR_LEN;
> +
> info = fanotify_event_info(event);
> dir_fh_len = fanotify_event_dir_fh_len(event);
> fh_len = fanotify_event_object_fh_len(event);
> @@ -176,6 +181,7 @@ static struct fanotify_event *fanotify_dup_error_to_stack(
> error_on_stack->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> error_on_stack->err_count = fee->err_count;
> error_on_stack->sb_mark = fee->sb_mark;
> + error_on_stack->error = fee->error;
>
> error_on_stack->fsid = fee->fsid;
>
> @@ -342,6 +348,28 @@ static int process_access_response(struct fsnotify_group *group,
> return -ENOENT;
> }
>
> +static size_t copy_error_info_to_user(struct fanotify_event *event,
> + char __user *buf, int count)
> +{
> + struct fanotify_event_info_error info;
> + struct fanotify_error_event *fee = FANOTIFY_EE(event);
> +
> + info.hdr.info_type = FAN_EVENT_INFO_TYPE_ERROR;
> + info.hdr.pad = 0;
> + info.hdr.len = FANOTIFY_INFO_ERROR_LEN;
> +
> + if (WARN_ON(count < info.hdr.len))
> + return -EFAULT;
> +
> + info.error = fee->error;
> + info.error_count = fee->err_count;
> +
> + if (copy_to_user(buf, &info, sizeof(info)))
> + return -EFAULT;
> +
> + return info.hdr.len;
> +}
> +
> static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> int info_type, const char *name, size_t name_len,
> char __user *buf, size_t count)
> @@ -505,6 +533,14 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> if (f)
> fd_install(fd, f);
>
> + if (fanotify_is_error_event(event->mask)) {
> + ret = copy_error_info_to_user(event, buf, count);
> + if (ret < 0)
> + goto out_close_fd;
> + buf += ret;
> + count -= ret;
> + }
> +
> /* Event info records order is: dir fid + name, child fid */
> if (fanotify_event_dir_fh_len(event)) {
> info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index 16402037fc7a..80040a92e9d9 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
> #define FAN_EVENT_INFO_TYPE_FID 1
> #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
> #define FAN_EVENT_INFO_TYPE_DFID 3
> +#define FAN_EVENT_INFO_TYPE_ERROR 4
>
> /* Variable length info record following event metadata */
> struct fanotify_event_info_header {
> @@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
> unsigned char handle[0];
> };
>
> +struct fanotify_event_info_error {
> + struct fanotify_event_info_header hdr;
> + __s32 error;
> + __u32 error_count;
> +};
> +
> struct fanotify_response {
> __s32 fd;
> __u32 response;
> --
> 2.32.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-16 16:27:01

by Jan Kara

Subject: Re: [PATCH v6 19/21] ext4: Send notifications on error

On Thu 12-08-21 17:40:08, Gabriel Krisman Bertazi wrote:
> Send a FS_ERROR message via fsnotify to a userspace monitoring tool
> whenever a ext4 error condition is triggered. This follows the existing
> error conditions in ext4, so it is hooked to the ext4_error* functions.
>
> It also follows the current dmesg reporting in the format. The
> filesystem message is composed mostly by the string that would be
> otherwise printed in dmesg.
>
> A new ext4 specific record format is exposed in the uapi, such that a
> monitoring tool knows what to expect when listening errors of an ext4
> filesystem.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> Reviewed-by: Amir Goldstein <[email protected]>
> ---
> fs/ext4/super.c | 8 ++++++++
> 1 file changed, 8 insertions(+)

<snip>

> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index dfa09a277b56..b9ecd43678d7 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -897,6 +904,7 @@ void __ext4_std_error(struct super_block *sb, const char *function,
> printk(KERN_CRIT "EXT4-fs error (device %s) in %s:%d: %s\n",
> sb->s_id, function, line, errstr);
> }
> + fsnotify_sb_error(sb, sb->s_root->d_inode, errno);
>
> ext4_handle_error(sb, false, -errno, 0, 0, function, line);
> }

Does it make sense to report root inode here? ext4_std_error() gets
generally used for filesystem-wide errors.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-16 16:41:24

by Jan Kara

Subject: Re: [PATCH v6 21/21] docs: Document the FAN_FS_ERROR event

On Thu 12-08-21 17:40:10, Gabriel Krisman Bertazi wrote:
> Document the FAN_FS_ERROR event for user administrators and user space
> developers.
>
> Reviewed-by: Amir Goldstein <[email protected]>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

<snip>

> diff --git a/Documentation/admin-guide/filesystem-monitoring.rst b/Documentation/admin-guide/filesystem-monitoring.rst
> new file mode 100644
> index 000000000000..b03093567a93
> --- /dev/null
> +++ b/Documentation/admin-guide/filesystem-monitoring.rst
> @@ -0,0 +1,70 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +====================================
> +File system Monitoring with fanotify
> +====================================
> +
> +File system Error Reporting
> +===========================
> +
> +fanotify supports the FAN_FS_ERROR mark for file system-wide error
^ Capital 'F' ("Fanotify"). ^^^ Instead of "mark", I'd rather write "event type".

> +reporting. It is meant to be used by file system health monitoring
> +daemons who listen on that interface and take actions (notify sysadmin,
^^^ "which" instead of "who"; "for these events" instead of "on that interface"

> +start recovery) when a file system problem is detected by the kernel.
> +
> +By design, A FAN_FS_ERROR notification exposes sufficient information for a
> +monitoring tool to know a problem in the file system has happened. It
> +doesn't necessarily provide a user space application with semantics to
> +verify an IO operation was successfully executed. That is outside of
> +scope of this feature. Instead, it is only meant as a framework for
> +early file system problem detection and reporting recovery tools.
> +
> +When a file system operation fails, it is common for dozens of kernel
> +errors to cascade after the initial failure, hiding the original failure
> +log, which is usually the most useful debug data to troubleshoot the
> +problem. For this reason, FAN_FS_ERROR only reports the first error that
> +occurred since the last notification, and it simply counts addition
^^^ additional

> +errors. This ensures that the most important piece of error information
> +is never lost.
> +
> +FAN_FS_ERROR requires the fanotify group to be setup with the
> +FAN_REPORT_FID flag.
> +
> +At the time of this writing, the only file system that emits FAN_FS_ERROR
> +notifications is Ext4.
> +
> +A user space example code is provided at ``samples/fanotify/fs-monitor.c``.
> +
> +A FAN_FS_ERROR Notification has the following format::
> +
> + [ Notification Metadata (Mandatory) ]
> + [ Generic Error Record (Mandatory) ]
> + [ FID record (Mandatory) ]
> +
> +Generic error record
> +--------------------
> +
> +The generic error record provides enough information for a file system
> +agnostic tool to learn about a problem in the file system, without
> +providing any additional details about the problem. This record is
> +identified by ``struct fanotify_event_info_header.info_type`` being set
> +to FAN_EVENT_INFO_TYPE_ERROR.
> +
> + struct fanotify_event_info_error {
> + struct fanotify_event_info_header hdr;
> + __s32 error;
> + __u32 error_count;
> + };
> +
> +The `error` field identifies the type of error. `error_count` count
> +tracks the number of errors that occurred and were suppressed to
> +preserve the original error, since the last notification.

So is 'error' expected to be errno? Or is that some fs-specific error
identifier? Will it be positive (i.e. real errno) or negative (as errno is
usually passed in the kernel)? I think it should be specified here.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-16 21:42:28

by Darrick J. Wong

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Thu, Aug 12, 2021 at 05:40:07PM -0400, Gabriel Krisman Bertazi wrote:
> The Error info type is a record sent to users on FAN_FS_ERROR events
> documenting the type of error. It also carries an error count,
> documenting how many errors were observed since the last reporting.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> ---
> Changes since v5:
> - Move error code here
> ---
> fs/notify/fanotify/fanotify.c | 1 +
> fs/notify/fanotify/fanotify.h | 1 +
> fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
> include/uapi/linux/fanotify.h | 7 ++++++
> 4 files changed, 45 insertions(+)

<snip>

> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index 16402037fc7a..80040a92e9d9 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
> #define FAN_EVENT_INFO_TYPE_FID 1
> #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
> #define FAN_EVENT_INFO_TYPE_DFID 3
> +#define FAN_EVENT_INFO_TYPE_ERROR 4
>
> /* Variable length info record following event metadata */
> struct fanotify_event_info_header {
> @@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
> unsigned char handle[0];
> };
>
> +struct fanotify_event_info_error {
> + struct fanotify_event_info_header hdr;
> + __s32 error;
> + __u32 error_count;
> +};

My apologies for not having time to review this patchset since it was
redesigned to use fanotify. Someday it would be helpful to be able to
export more detailed error reports from XFS, but as I'm not ready to
move forward and write that today, I'll try to avoid derailing this at
the last minute.

Eventually, XFS might want to be able to report errors in file data,
file metadata, allocation group metadata, and whole-filesystem metadata.
Userspace can already gather reports from XFS about corruptions reported
by the online fsck code (see xfs_health.c).

I /think/ we could subclass the file error structure that you've
provided like so:

struct fanotify_event_info_xfs_filesystem_error {
struct fanotify_event_info_error base;

__u32 magic; /* 0x58465342 to identify xfs */
__u32 type; /* quotas, realtime bitmap, etc. */
};

struct fanotify_event_info_xfs_perag_error {
struct fanotify_event_info_error base;

__u32 magic; /* 0x58465342 to identify xfs */
__u32 type; /* agf, agi, agfl, bno btree, ino btree, etc. */
__u32 agno; /* allocation group number */
};

struct fanotify_event_info_xfs_file_error {
struct fanotify_event_info_error base;

__u32 magic; /* 0x58465342 to identify xfs */
__u32 type; /* extent map, dir, attr, etc. */
__u64 offset; /* file data offset, if applicable */
__u64 length; /* file data length, if applicable */
};

(A real XFS implementation might have one structure with the type code
providing for a tagged union or something; I split it into three
separate structs here to avoid confusing things.)

I have three questions at this point:

1) What's the maximum size of a fanotify event structure? None of these
structures exceed 36 bytes, which I hope will fit in whatever size
constraints?

2) If a program written for today's notification events sees a
fanotify_event_info_header from future-XFS with a header length that is
larger than FANOTIFY_INFO_ERROR_LEN, will it be able to react
appropriately? Which is to say, ignore it on the grounds that the
length is unexpectedly large?

It /looks/ like this is the case; really I'm just fishing around here
to make sure nothing in the design of /this/ patchset would make it Very
Difficult(tm) to add more information later.

3) Once we let filesystem implementations create their own extended
error notifications, should we have a "u32 magic" to aid in decoding?
Or even add it to fanotify_event_info_error now?

--D

> +
> struct fanotify_response {
> __s32 fd;
> __u32 response;
> --
> 2.32.0
>

2021-08-17 09:06:56

by Jan Kara

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Mon 16-08-21 14:41:03, Darrick J. Wong wrote:
> On Thu, Aug 12, 2021 at 05:40:07PM -0400, Gabriel Krisman Bertazi wrote:
> > The Error info type is a record sent to users on FAN_FS_ERROR events
> > documenting the type of error. It also carries an error count,
> > documenting how many errors were observed since the last reporting.
> >
> > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> >
> > ---
> > Changes since v5:
> > - Move error code here
> > ---
> > fs/notify/fanotify/fanotify.c | 1 +
> > fs/notify/fanotify/fanotify.h | 1 +
> > fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
> > include/uapi/linux/fanotify.h | 7 ++++++
> > 4 files changed, 45 insertions(+)
>
> <snip>
>
> > diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> > index 16402037fc7a..80040a92e9d9 100644
> > --- a/include/uapi/linux/fanotify.h
> > +++ b/include/uapi/linux/fanotify.h
> > @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
> > #define FAN_EVENT_INFO_TYPE_FID 1
> > #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
> > #define FAN_EVENT_INFO_TYPE_DFID 3
> > +#define FAN_EVENT_INFO_TYPE_ERROR 4
> >
> > /* Variable length info record following event metadata */
> > struct fanotify_event_info_header {
> > @@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
> > unsigned char handle[0];
> > };
> >
> > +struct fanotify_event_info_error {
> > + struct fanotify_event_info_header hdr;
> > + __s32 error;
> > + __u32 error_count;
> > +};
>
> My apologies for not having time to review this patchset since it was
> redesigned to use fanotify. Someday it would be helpful to be able to
> export more detailed error reports from XFS, but as I'm not ready to
> move forward and write that today, I'll try to avoid derailling this at
> the last minute.

I think we are not quite there and tweaking the passed structure is easy
enough so no worries. Eventually, passing some filesystem-specific blob
together with the event was the plan AFAIR. You're right that now is a good
moment to think how exactly we want that passed.

> Eventually, XFS might want to be able to report errors in file data,
> file metadata, allocation group metadata, and whole-filesystem metadata.
> Userspace can already gather reports from XFS about corruptions reported
> by the online fsck code (see xfs_health.c).

Yes, although note that the current plan is that we have only one error
event queued at a time; further errors are just added to error_count until the event is
fetched by userspace (on the grounds that the first error is usually the
most meaningful, the others are usually just cascading problems). But I'm
not sure if this scheme would be suitable for online fsck usecase since we
may discard even valid independent errors this way.
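
As a rough illustration of that scheme, here is a userspace model and not the kernel implementation. struct error_slot_model, report_error() and fetch_error() are hypothetical names, locking is left out, and it assumes (as the quoted fee->err_count++ check suggests) that the counter includes the first error:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the single pre-allocated error slot; names are illustrative. */
struct error_slot_model {
	int32_t error;       /* first error seen since the last fetch */
	uint32_t err_count;  /* all errors seen since the last fetch */
};

/*
 * Record an error. Returns true when the caller would have to queue the
 * event (first error), false when it was only counted.
 */
static bool report_error(struct error_slot_model *slot, int32_t error)
{
	assert(slot != NULL);
	if (slot->err_count++)
		return false;    /* already pending: keep the original error */
	slot->error = error;
	return true;
}

/* Copy the pending state out for the reader and re-arm the slot. */
static struct error_slot_model fetch_error(struct error_slot_model *slot)
{
	struct error_slot_model copy = *slot;

	slot->error = 0;
	slot->err_count = 0;
	return copy;
}
```

With this model, two independent errors arriving before a read are indistinguishable from one error followed by a cascade, which is exactly the limitation for the online fsck use case.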

> I /think/ we could subclass the file error structure that you've
> provided like so:
>
> struct fanotify_event_info_xfs_filesystem_error {
> struct fanotify_event_info_error base;
>
> __u32 magic; /* 0x58465342 to identify xfs */
> __u32 type; /* quotas, realtime bitmap, etc. */
> };
>
> struct fanotify_event_info_xfs_perag_error {
> struct fanotify_event_info_error base;
>
> __u32 magic; /* 0x58465342 to identify xfs */
> __u32 type; /* agf, agi, agfl, bno btree, ino btree, etc. */
> __u32 agno; /* allocation group number */
> };
>
> struct fanotify_event_info_xfs_file_error {
> struct fanotify_event_info_error base;
>
> __u32 magic; /* 0x58465342 to identify xfs */
> __u32 type; /* extent map, dir, attr, etc. */
> __u64 offset; /* file data offset, if applicable */
> __u64 length; /* file data length, if applicable */
> };
>
> (A real XFS implementation might have one structure with the type code
> providing for a tagged union or something; I split it into three
> separate structs here to avoid confusing things.)

The structure of fanotify event as passed to userspace generally is:

struct fanotify_event_metadata {
__u32 event_len;
__u8 vers;
__u8 reserved;
__u16 metadata_len;
__aligned_u64 mask;
__s32 fd;
__s32 pid;
};

If event_len is > sizeof(struct fanotify_event_metadata), userspace is
expected to look for struct fanotify_event_info_header after struct
fanotify_event_metadata. struct fanotify_event_info_header looks like:

struct fanotify_event_info_header {
__u8 info_type;
__u8 pad;
__u16 len;
};

Again if the end of this info (defined by 'len') is smaller than
'event_len', there is next header with next payload of data. So for example
error event will have:

struct fanotify_event_metadata
struct fanotify_event_info_error
struct fanotify_event_info_fid

Now either we could add fs specific blob into fanotify_event_info_error
(but then it would be good to add 'magic' to fanotify_event_info_error now
and define that if 'len' is larger, fs-specific blob follows after fixed
data) or we can add another info type FAN_EVENT_INFO_TYPE_ERROR_FS_DATA
(i.e., attach another structure into the event) which would contain the
'magic' and then blob of data. I don't have strong preference.

> I have three questions at this point:
>
> 1) What's the maximum size of a fanotify event structure? None of these
> structures exceed 36 bytes, which I hope will fit in whatever size
> constraints?

Whole event must fit into 4G, each event info needs to fit in 64k. At least
these are the limits of the interface. Practically, it would be difficult
and inefficient to manipulate such huge events...

> 2) If a program written for today's notification events sees a
> fanotify_event_info_header from future-XFS with a header length that is
> larger than FANOTIFY_INFO_ERROR_LEN, will it be able to react
> appropriately? Which is to say, ignore it on the grounds that the
> length is unexpectedly large?

That is the expected behavior :). But I guess separate info type for
fs-specific blob might be more foolproof in this sense - when parsing
events, you are expected to just skip info_types you don't understand
(based on 'len' and 'type' in the common header) and generally different
events have different sets of infos attached to them so you mostly have to
implement this logic to be able to process events.
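
A minimal sketch of that parsing loop, for illustration only. To stay self-contained it uses local mirrors of the structures quoted in this thread rather than <linux/fanotify.h>; struct info_header, struct info_error, INFO_TYPE_ERROR and walk_info_records() are names invented for this example. It walks the info records that follow the event metadata, picks up the error record, and skips anything it does not understand by trusting only 'len':

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INFO_TYPE_ERROR 4  /* value of FAN_EVENT_INFO_TYPE_ERROR in the patch */

/* Local mirrors of the UAPI layouts quoted in this thread. */
struct info_header {
	uint8_t info_type;
	uint8_t pad;
	uint16_t len;
};

struct info_error {
	struct info_header hdr;
	int32_t error;
	uint32_t error_count;
};

/*
 * Walk the info records in buf[0..len). Returns the number of records
 * seen, or -1 if a record is malformed. If an error record is present,
 * it is copied to *out and *found is set to 1.
 */
static int walk_info_records(const unsigned char *buf, size_t len,
			     struct info_error *out, int *found)
{
	size_t off = 0;
	int n = 0;

	assert(buf != NULL && out != NULL && found != NULL);
	*found = 0;
	while (off < len) {
		struct info_header hdr;

		if (len - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		/* A record must cover its own header and fit in the event. */
		if (hdr.len < sizeof(hdr) || hdr.len > len - off)
			return -1;

		if (hdr.info_type == INFO_TYPE_ERROR &&
		    hdr.len >= sizeof(*out)) {
			memcpy(out, buf + off, sizeof(*out));
			*found = 1;
		}
		/* Unknown types and trailing bytes are skipped via 'len'. */
		off += hdr.len;
		n++;
	}
	return n;
}
```

Because the loop advances by hdr.len and not by sizeof(struct info_error), a longer error record from a future kernel would still be stepped over correctly.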

> It /looks/ like this is the case; really I'm just fishing around here
> to make sure nothing in the design of /this/ patchset would make it Very
> Difficult(tm) to add more information later.
>
> 3) Once we let filesystem implementations create their own extended
> error notifications, should we have a "u32 magic" to aid in decoding?
> Or even add it to fanotify_event_info_error now?

If we go via the 'separate info type' route, then the magic can go into
that structure and there's no great use for 'magic' in
fanotify_event_info_error.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-17 10:09:42

by Amir Goldstein

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Tue, Aug 17, 2021 at 12:05 PM Jan Kara <[email protected]> wrote:
>
> On Mon 16-08-21 14:41:03, Darrick J. Wong wrote:
> > On Thu, Aug 12, 2021 at 05:40:07PM -0400, Gabriel Krisman Bertazi wrote:
> > > The Error info type is a record sent to users on FAN_FS_ERROR events
> > > documenting the type of error. It also carries an error count,
> > > documenting how many errors were observed since the last reporting.
> > >
> > > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> > >
> > > ---
> > > Changes since v5:
> > > - Move error code here
> > > ---
> > > fs/notify/fanotify/fanotify.c | 1 +
> > > fs/notify/fanotify/fanotify.h | 1 +
> > > fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
> > > include/uapi/linux/fanotify.h | 7 ++++++
> > > 4 files changed, 45 insertions(+)
> >
> > <snip>
> >
> > > diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> > > index 16402037fc7a..80040a92e9d9 100644
> > > --- a/include/uapi/linux/fanotify.h
> > > +++ b/include/uapi/linux/fanotify.h
> > > @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
> > > #define FAN_EVENT_INFO_TYPE_FID 1
> > > #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
> > > #define FAN_EVENT_INFO_TYPE_DFID 3
> > > +#define FAN_EVENT_INFO_TYPE_ERROR 4
> > >
> > > /* Variable length info record following event metadata */
> > > struct fanotify_event_info_header {
> > > @@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
> > > unsigned char handle[0];
> > > };
> > >
> > > +struct fanotify_event_info_error {
> > > + struct fanotify_event_info_header hdr;
> > > + __s32 error;
> > > + __u32 error_count;
> > > +};
> >
> > My apologies for not having time to review this patchset since it was
> > redesigned to use fanotify. Someday it would be helpful to be able to
> > export more detailed error reports from XFS, but as I'm not ready to
> > move forward and write that today, I'll try to avoid derailling this at
> > the last minute.
>
> I think we are not quite there and tweaking the passed structure is easy
> enough so no worries. Eventually, passing some filesystem-specific blob
> together with the event was the plan AFAIR. You're right now is a good
> moment to think how exactly we want that passed.
>
> > Eventually, XFS might want to be able to report errors in file data,
> > file metadata, allocation group metadata, and whole-filesystem metadata.
> > Userspace can already gather reports from XFS about corruptions reported
> > by the online fsck code (see xfs_health.c).
>
> Yes, although note that the current plan is that we currently have only one
> error event queue, others are just added to error_count until the event is
> fetched by userspace (on the grounds that the first error is usually the
> most meaningful, the others are usually just cascading problems). But I'm
> not sure if this scheme would be suitable for online fsck usecase since we
> may discard even valid independent errors this way.
>
> > I /think/ we could subclass the file error structure that you've
> > provided like so:
> >
> > struct fanotify_event_info_xfs_filesystem_error {
> > struct fanotify_event_info_error base;
> >
> > __u32 magic; /* 0x58465342 to identify xfs */
> > __u32 type; /* quotas, realtime bitmap, etc. */
> > };
> >
> > struct fanotify_event_info_xfs_perag_error {
> > struct fanotify_event_info_error base;
> >
> > __u32 magic; /* 0x58465342 to identify xfs */
> > __u32 type; /* agf, agi, agfl, bno btree, ino btree, etc. */
> > __u32 agno; /* allocation group number */
> > };
> >
> > struct fanotify_event_info_xfs_file_error {
> > struct fanotify_event_info_error base;
> >
> > __u32 magic; /* 0x58465342 to identify xfs */
> > __u32 type; /* extent map, dir, attr, etc. */
> > __u64 offset; /* file data offset, if applicable */
> > __u64 length; /* file data length, if applicable */
> > };
> >
> > (A real XFS implementation might have one structure with the type code
> > providing for a tagged union or something; I split it into three
> > separate structs here to avoid confusing things.)
>
> The structure of fanotify event as passed to userspace generally is:
>
> struct fanotify_event_metadata {
> __u32 event_len;
> __u8 vers;
> __u8 reserved;
> __u16 metadata_len;
> __aligned_u64 mask;
> __s32 fd;
> __s32 pid;
> };
>
> If event_len is > sizeof(struct fanotify_event_metadata), userspace is
> expected to look for struct fanotify_event_info_header after struct
> fanotify_event_metadata. struct fanotify_event_info_header looks like:
>
> struct fanotify_event_info_header {
> __u8 info_type;
> __u8 pad;
> __u16 len;
> };
>
> Again if the end of this info (defined by 'len') is smaller than
> 'event_len', there is next header with next payload of data. So for example
> error event will have:
>
> struct fanotify_event_metadata
> struct fanotify_event_info_error
> struct fanotify_event_info_fid
>
> Now either we could add fs specific blob into fanotify_event_info_error
> (but then it would be good to add 'magic' to fanotify_event_info_error now
> and define that if 'len' is larger, fs-specific blob follows after fixed
> data) or we can add another info type FAN_EVENT_INFO_TYPE_ERROR_FS_DATA
> (i.e., attach another structure into the event) which would contain the
> 'magic' and then blob of data. I don't have strong preference.
>
> > I have three questions at this point:
> >
> > 1) What's the maximum size of a fanotify event structure? None of these
> > structures exceed 36 bytes, which I hope will fit in whatever size
> > constraints?
>
> Whole event must fit into 4G, each event info needs to fit in 64k. At least
> these are the limits of the interface. Practically, it would be difficult
> and inefficient to manipulate such huge events...
>

Just keep in mind that the current scheme pre-allocates the single event slot
on fanotify_mark() time and (I think) we agreed to pre-allocate
sizeof(fsnotify_error_event) + MAX_HANDLE_SZ.
If filesystems would want to store some variable length fs specific info,
a future implementation will have to take that into account.

> > 2) If a program written for today's notification events sees a
> > fanotify_event_info_header from future-XFS with a header length that is
> > larger than FANOTIFY_INFO_ERROR_LEN, will it be able to react
> > appropriately? Which is to say, ignore it on the grounds that the
> > length is unexpectedly large?
>
> That is the expected behavior :). But I guess separate info type for
> fs-specific blob might be more foolproof in this sense - when parsing
> events, you are expected to just skip info_types you don't understand
> (based on 'len' and 'type' in the common header) and generally different
> events have different sets of infos attached to them so you mostly have to
> implement this logic to be able to process events.
>
> > It /looks/ like this is the case; really I'm just fishing around here
> > to make sure nothing in the design of /this/ patchset would make it Very
> > Difficult(tm) to add more information later.
> >
> > 3) Once we let filesystem implementations create their own extended
> > error notifications, should we have a "u32 magic" to aid in decoding?
> > Or even add it to fanotify_event_info_error now?
>
> If we go via the 'separate info type' route, then the magic can go into
> that structure and there's no great use for 'magic' in
> fanotify_event_info_error.

My 0.02$:
With current patch set, filesystem reports error using:
fsnotify_sb_error(sb, inode, error)

The optional @inode argument is encoded to a filesystem opaque
blob using exportfs_encode_inode_fh(), recorded in the event
as a blob and reported to userspace as a blob.

If filesystem would like to report a different type of opaque blob
(e.g. xfs_perag_info), the interface should be extended to:
fsnotify_sb_error(sb, inode, error, info, info_len)
and the 'separate info type' route seems like the best and most natural
way to deal with the case of information that is only emitted from
a specific filesystem with a specific feature enabled (online fsck).

IOW, there is no need for fanotify_event_info_xfs_perag_error
in fanotify UAPI if you ask me.

Regarding 'magic' in fanotify_event_info_error, I also don't see the
need for that, because the event already has fsid which can be
used to identify the filesystem in question.

Keep in mind that the value of handle_type inside struct file_handle
inside struct fanotify_event_info_fid is not a universal classifier.
Specifically, the type 0x81 means "XFS_FILEID_INO64_GEN"
only in the context of XFS and it can mean something else in the
context of another type of filesystem.

If we add a new info record fanotify_event_info_fs_private
it could even be an alias to fanotify_event_info_fid with the only
difference that the handle[0] member is not expected to be
struct file_handle, but some other fs private struct.

Thanks,
Amir.

2021-08-18 00:11:37

by Darrick J. Wong

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Tue, Aug 17, 2021 at 11:05:38AM +0200, Jan Kara wrote:
> On Mon 16-08-21 14:41:03, Darrick J. Wong wrote:
> > On Thu, Aug 12, 2021 at 05:40:07PM -0400, Gabriel Krisman Bertazi wrote:
> > > The Error info type is a record sent to users on FAN_FS_ERROR events
> > > documenting the type of error. It also carries an error count,
> > > documenting how many errors were observed since the last reporting.
> > >
> > > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> > >
> > > ---
> > > Changes since v5:
> > > - Move error code here
> > > ---
> > > fs/notify/fanotify/fanotify.c | 1 +
> > > fs/notify/fanotify/fanotify.h | 1 +
> > > fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
> > > include/uapi/linux/fanotify.h | 7 ++++++
> > > 4 files changed, 45 insertions(+)
> >
> > <snip>
> >
> > > diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> > > index 16402037fc7a..80040a92e9d9 100644
> > > --- a/include/uapi/linux/fanotify.h
> > > +++ b/include/uapi/linux/fanotify.h
> > > @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
> > > #define FAN_EVENT_INFO_TYPE_FID 1
> > > #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
> > > #define FAN_EVENT_INFO_TYPE_DFID 3
> > > +#define FAN_EVENT_INFO_TYPE_ERROR 4
> > >
> > > /* Variable length info record following event metadata */
> > > struct fanotify_event_info_header {
> > > @@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
> > > unsigned char handle[0];
> > > };
> > >
> > > +struct fanotify_event_info_error {
> > > + struct fanotify_event_info_header hdr;
> > > + __s32 error;
> > > + __u32 error_count;
> > > +};
> >
> > My apologies for not having time to review this patchset since it was
> > redesigned to use fanotify. Someday it would be helpful to be able to
> > export more detailed error reports from XFS, but as I'm not ready to
> > > move forward and write that today, I'll try to avoid derailing this at
> > the last minute.
>
> I think we are not quite there and tweaking the passed structure is easy
> enough so no worries. Eventually, passing some filesystem-specific blob
> together with the event was the plan AFAIR. You're right now is a good
> moment to think how exactly we want that passed.
>
> > Eventually, XFS might want to be able to report errors in file data,
> > file metadata, allocation group metadata, and whole-filesystem metadata.
> > Userspace can already gather reports from XFS about corruptions reported
> > by the online fsck code (see xfs_health.c).
>
> Yes, although note that the current plan is that we currently have only one
> error event queue, others are just added to error_count until the event is
> fetched by userspace (on the grounds that the first error is usually the
> most meaningful, the others are usually just cascading problems). But I'm
> not sure if this scheme would be suitable for online fsck usecase since we
> may discard even valid independent errors this way.

<nod> The use-cases might split here -- we probably don't want online
fsck to be generating fs error events since the only tool that can do
anything about the broken metadata is the online fsck tool itself.

However, for random errors found by regular reader/writer threads, I
have a patchset in djwong-dev that adds recording of those errors;
that's the place where I think I'd want to add the ability to send
notification blobs to userspace.

Hmm. For handling accumulated errors, can we still access the
fanotify_event_info_* object once we've handed it to fanotify? If the
user hasn't picked up the event yet, it might be acceptable to set more
bits in the type mask and bump the error count. In other words, every
time userspace actually reads the event, it'll get the latest error
state. I /think/ that's where the design of this patchset is going,
right?

> > I /think/ we could subclass the file error structure that you've
> > provided like so:
> >
> > struct fanotify_event_info_xfs_filesystem_error {
> > struct fanotify_event_info_error base;
> >
> > __u32 magic; /* 0x58465342 to identify xfs */
> > __u32 type; /* quotas, realtime bitmap, etc. */
> > };
> >
> > struct fanotify_event_info_xfs_perag_error {
> > struct fanotify_event_info_error base;
> >
> > __u32 magic; /* 0x58465342 to identify xfs */
> > __u32 type; /* agf, agi, agfl, bno btree, ino btree, etc. */
> > __u32 agno; /* allocation group number */
> > };
> >
> > struct fanotify_event_info_xfs_file_error {
> > struct fanotify_event_info_error base;
> >
> > __u32 magic; /* 0x58465342 to identify xfs */
> > __u32 type; /* extent map, dir, attr, etc. */
> > __u64 offset; /* file data offset, if applicable */
> > __u64 length; /* file data length, if applicable */
> > };
> >
> > (A real XFS implementation might have one structure with the type code
> > providing for a tagged union or something; I split it into three
> > separate structs here to avoid confusing things.)
>
> The structure of fanotify event as passed to userspace generally is:
>
> struct fanotify_event_metadata {
> __u32 event_len;
> __u8 vers;
> __u8 reserved;
> __u16 metadata_len;
> __aligned_u64 mask;
> __s32 fd;
> __s32 pid;
> };
>
> If event_len is > sizeof(struct fanotify_event_metadata), userspace is
> expected to look for struct fanotify_event_info_header after struct
> fanotify_event_metadata. struct fanotify_event_info_header looks like:
>
> struct fanotify_event_info_header {
> __u8 info_type;
> __u8 pad;
> __u16 len;
> };
>
> Again if the end of this info (defined by 'len') is smaller than
> 'event_len', there is next header with next payload of data. So for example
> error event will have:
>
> struct fanotify_event_metadata
> struct fanotify_event_info_error
> struct fanotify_event_info_fid
>
> Now either we could add fs specific blob into fanotify_event_info_error
> (but then it would be good to add 'magic' to fanotify_event_info_error now
> and define that if 'len' is larger, fs-specific blob follows after fixed
> data) or we can add another info type FAN_EVENT_INFO_TYPE_ERROR_FS_DATA
> (i.e., attach another structure into the event) which would contain the
> 'magic' and then blob of data. I don't have strong preference.

I have a slight preference for the second. It doesn't make much sense
to have a magic value in fanotify_event_info_error to decode a totally
separate structure.

> > I have three questions at this point:
> >
> > 1) What's the maximum size of a fanotify event structure? None of these
> > structures exceed 36 bytes, which I hope will fit in whatever size
> > constraints?
>
> Whole event must fit into 4G, each event info needs to fit in 64k. At least
> these are the limits of the interface. Practically, it would be difficult
> and inefficient to manipulate such huge events...

Ok. I doubt we'll ever get close to a 4k page for a single fs object.

> > 2) If a program written for today's notification events sees a
> > fanotify_event_info_header from future-XFS with a header length that is
> > larger than FANOTIFY_INFO_ERROR_LEN, will it be able to react
> > appropriately? Which is to say, ignore it on the grounds that the
> > length is unexpectedly large?
>
> That is the expected behavior :). But I guess separate info type for
> fs-specific blob might be more foolproof in this sense - when parsing
> events, you are expected to just skip info_types you don't understand
> (based on 'len' and 'type' in the common header) and generally different
> events have different sets of infos attached to them so you mostly have to
> implement this logic to be able to process events.

Ok, good to hear this. :)
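
For illustration only, a minimal userspace sketch of the skipping rule
described above: walk the info records by 'len', pick out the error
record, and step over any type the parser does not know. The names
info_hdr, info_error, INFO_TYPE_ERROR and find_error_info() are local
stand-ins that mirror the layouts quoted in this thread; they are
assumptions, not the final UAPI. An error record whose 'len' is larger
than the fixed part is still accepted, and only the fixed part is read.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the layouts quoted in this thread (illustrative names). */
struct info_hdr {
	uint8_t  info_type;
	uint8_t  pad;
	uint16_t len;		/* length of this record, header included */
};

struct info_error {
	struct info_hdr hdr;
	int32_t  error;
	uint32_t error_count;
};

#define INFO_TYPE_ERROR 4	/* value of FAN_EVENT_INFO_TYPE_ERROR in the patch */

/*
 * Walk the info records stored in buf[0..total).
 * Returns 1 and fills *out when an error record is found, 0 when there
 * is none, and -1 when a record length is inconsistent with the buffer.
 */
static int find_error_info(const unsigned char *buf, size_t total,
			   struct info_error *out)
{
	size_t off = 0;

	while (total - off >= sizeof(struct info_hdr)) {
		struct info_hdr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > total - off)
			return -1;

		if (hdr.info_type == INFO_TYPE_ERROR &&
		    hdr.len >= sizeof(*out)) {
			/* Copy only the fixed part; extra bytes are ignored. */
			memcpy(out, buf + off, sizeof(*out));
			return 1;
		}

		/* Unknown or uninteresting record: skip it using 'len'. */
		off += hdr.len;
	}
	return 0;
}
```

A caller would pass the bytes that follow struct fanotify_event_metadata,
i.e. event_len minus metadata_len bytes.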

> > It /looks/ like this is the case; really I'm just fishing around here
> > to make sure nothing in the design of /this/ patchset would make it Very
> > Difficult(tm) to add more information later.
> >
> > 3) Once we let filesystem implementations create their own extended
> > error notifications, should we have a "u32 magic" to aid in decoding?
> > Or even add it to fanotify_event_info_error now?
>
> If we go via the 'separate info type' route, then the magic can go into
> that structure and there's no great use for 'magic' in
> fanotify_event_info_error.

Ok. So far so good; now on to Amir's email...

--D

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2021-08-18 00:19:05

by Darrick J. Wong

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Tue, Aug 17, 2021 at 01:08:06PM +0300, Amir Goldstein wrote:
> On Tue, Aug 17, 2021 at 12:05 PM Jan Kara <[email protected]> wrote:
> >
> > On Mon 16-08-21 14:41:03, Darrick J. Wong wrote:
> > > On Thu, Aug 12, 2021 at 05:40:07PM -0400, Gabriel Krisman Bertazi wrote:
> > > > The Error info type is a record sent to users on FAN_FS_ERROR events
> > > > documenting the type of error. It also carries an error count,
> > > > documenting how many errors were observed since the last reporting.
> > > >
> > > > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> > > >
> > > > ---
> > > > Changes since v5:
> > > > - Move error code here
> > > > ---
> > > > fs/notify/fanotify/fanotify.c | 1 +
> > > > fs/notify/fanotify/fanotify.h | 1 +
> > > > fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
> > > > include/uapi/linux/fanotify.h | 7 ++++++
> > > > 4 files changed, 45 insertions(+)
> > >
> > > <snip>
> > >
> > > > diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> > > > index 16402037fc7a..80040a92e9d9 100644
> > > > --- a/include/uapi/linux/fanotify.h
> > > > +++ b/include/uapi/linux/fanotify.h
> > > > @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
> > > > #define FAN_EVENT_INFO_TYPE_FID 1
> > > > #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
> > > > #define FAN_EVENT_INFO_TYPE_DFID 3
> > > > +#define FAN_EVENT_INFO_TYPE_ERROR 4
> > > >
> > > > /* Variable length info record following event metadata */
> > > > struct fanotify_event_info_header {
> > > > @@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
> > > > unsigned char handle[0];
> > > > };
> > > >
> > > > +struct fanotify_event_info_error {
> > > > + struct fanotify_event_info_header hdr;
> > > > + __s32 error;
> > > > + __u32 error_count;
> > > > +};
> > >
> > > My apologies for not having time to review this patchset since it was
> > > redesigned to use fanotify. Someday it would be helpful to be able to
> > > export more detailed error reports from XFS, but as I'm not ready to
> > > > move forward and write that today, I'll try to avoid derailing this at
> > > the last minute.
> >
> > I think we are not quite there and tweaking the passed structure is easy
> > enough so no worries. Eventually, passing some filesystem-specific blob
> > together with the event was the plan AFAIR. You're right now is a good
> > moment to think how exactly we want that passed.
> >
> > > Eventually, XFS might want to be able to report errors in file data,
> > > file metadata, allocation group metadata, and whole-filesystem metadata.
> > > Userspace can already gather reports from XFS about corruptions reported
> > > by the online fsck code (see xfs_health.c).
> >
> > Yes, although note that the current plan is that we currently have only one
> > error event queue, others are just added to error_count until the event is
> > fetched by userspace (on the grounds that the first error is usually the
> > most meaningful, the others are usually just cascading problems). But I'm
> > not sure if this scheme would be suitable for online fsck usecase since we
> > may discard even valid independent errors this way.
> >
> > > I /think/ we could subclass the file error structure that you've
> > > provided like so:
> > >
> > > struct fanotify_event_info_xfs_filesystem_error {
> > > struct fanotify_event_info_error base;
> > >
> > > __u32 magic; /* 0x58465342 to identify xfs */
> > > __u32 type; /* quotas, realtime bitmap, etc. */
> > > };
> > >
> > > struct fanotify_event_info_xfs_perag_error {
> > > struct fanotify_event_info_error base;
> > >
> > > __u32 magic; /* 0x58465342 to identify xfs */
> > > __u32 type; /* agf, agi, agfl, bno btree, ino btree, etc. */
> > > __u32 agno; /* allocation group number */
> > > };
> > >
> > > struct fanotify_event_info_xfs_file_error {
> > > struct fanotify_event_info_error base;
> > >
> > > __u32 magic; /* 0x58465342 to identify xfs */
> > > __u32 type; /* extent map, dir, attr, etc. */
> > > __u64 offset; /* file data offset, if applicable */
> > > __u64 length; /* file data length, if applicable */
> > > };
> > >
> > > (A real XFS implementation might have one structure with the type code
> > > providing for a tagged union or something; I split it into three
> > > separate structs here to avoid confusing things.)
> >
> > The structure of fanotify event as passed to userspace generally is:
> >
> > struct fanotify_event_metadata {
> > __u32 event_len;
> > __u8 vers;
> > __u8 reserved;
> > __u16 metadata_len;
> > __aligned_u64 mask;
> > __s32 fd;
> > __s32 pid;
> > };
> >
> > If event_len is > sizeof(struct fanotify_event_metadata), userspace is
> > expected to look for struct fanotify_event_info_header after struct
> > fanotify_event_metadata. struct fanotify_event_info_header looks like:
> >
> > struct fanotify_event_info_header {
> > __u8 info_type;
> > __u8 pad;
> > __u16 len;
> > };
> >
> > Again if the end of this info (defined by 'len') is smaller than
> > 'event_len', there is next header with next payload of data. So for example
> > error event will have:
> >
> > struct fanotify_event_metadata
> > struct fanotify_event_info_error
> > struct fanotify_event_info_fid
> >
> > Now either we could add fs specific blob into fanotify_event_info_error
> > (but then it would be good to add 'magic' to fanotify_event_info_error now
> > and define that if 'len' is larger, fs-specific blob follows after fixed
> > data) or we can add another info type FAN_EVENT_INFO_TYPE_ERROR_FS_DATA
> > (i.e., attach another structure into the event) which would contain the
> > 'magic' and then blob of data. I don't have strong preference.
> >
> > > I have three questions at this point:
> > >
> > > 1) What's the maximum size of a fanotify event structure? None of these
> > > structures exceed 36 bytes, which I hope will fit in whatever size
> > > constraints?
> >
> > Whole event must fit into 4G, each event info needs to fit in 64k. At least
> > these are the limits of the interface. Practically, it would be difficult
> > and inefficient to manipulate such huge events...
> >
>
> Just keep in mind that the current scheme pre-allocates the single event slot
> on fanotify_mark() time and (I think) we agreed to pre-allocate
> sizeof(fsnotify_error_event) + MAX_HANDLE_SZ.
> If filesystems would want to store some variable length fs specific info,
> a future implementation will have to take that into account.

<nod> I /think/ for the fs and AG metadata we could preallocate these,
so long as fsnotify doesn't free them out from under us. For inodes...
there are many more of those, so they'd have to be allocated
dynamically.

> > > 2) If a program written for today's notification events sees a
> > > fanotify_event_info_header from future-XFS with a header length that is
> > > larger than FANOTIFY_INFO_ERROR_LEN, will it be able to react
> > > appropriately? Which is to say, ignore it on the grounds that the
> > > length is unexpectedly large?
> >
> > That is the expected behavior :). But I guess separate info type for
> > fs-specific blob might be more foolproof in this sense - when parsing
> > events, you are expected to just skip info_types you don't understand
> > (based on 'len' and 'type' in the common header) and generally different
> > events have different sets of infos attached to them so you mostly have to
> > implement this logic to be able to process events.
> >
> > > It /looks/ like this is the case; really I'm just fishing around here
> > > to make sure nothing in the design of /this/ patchset would make it Very
> > > Difficult(tm) to add more information later.
> > >
> > > 3) Once we let filesystem implementations create their own extended
> > > error notifications, should we have a "u32 magic" to aid in decoding?
> > > Or even add it to fanotify_event_info_error now?
> >
> > If we go via the 'separate info type' route, then the magic can go into
> > that structure and there's no great use for 'magic' in
> > fanotify_event_info_error.
>
> My 0.02$:
> With current patch set, filesystem reports error using:
> fsnotify_sb_error(sb, inode, error)
>
> The optional @inode argument is encoded to a filesystem opaque
> blob using exportfs_encode_inode_fh(), recorded in the event
> as a blob and reported to userspace as a blob.
>
> If filesystem would like to report a different type of opaque blob
> (e.g. xfs_perag_info), the interface should be extended to:
> fsnotify_sb_error(sb, inode, error, info, info_len)
> and the 'separate info type' route seems like the best and most natural
> way to deal with the case of information that is only emitted from
> a specific filesystem with a specific feature enabled (online fsck).

<nod> This seems reasonable to me.

> IOW, there is no need for fanotify_event_info_xfs_perag_error
> in fanotify UAPI if you ask me.
>
> Regarding 'magic' in fanotify_event_info_error, I also don't see the
> need for that, because the event already has fsid which can be
> used to identify the filesystem in question.
>
> Keep in mind that the value of handle_type inside struct file_handle
> inside struct fanotify_event_info_fid is not a universal classifier.
> Specifically, the type 0x81 means "XFS_FILEID_INO64_GEN"
> only in the context of XFS and it can mean something else in the
> context of another type of filesystem.

Can you pass the handle into the kernel to open a fd to file mentioned
in the report? I don't think userspace is supposed to know what's
inside a file handle, and it would be helpful if it didn't matter here
either. :)

> If we add a new info record fanotify_event_info_fs_private
> it could even be an alias to fanotify_event_info_fid with the only
> difference that the handle[0] member is not expected to be
> struct file_handle, but some other fs private struct.

I ... think I prefer it being a separate info blob.

--D

>
> Thanks,
> Amir.

2021-08-18 03:25:29

by Amir Goldstein

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

[...]

> > Just keep in mind that the current scheme pre-allocates the single event slot
> > on fanotify_mark() time and (I think) we agreed to pre-allocate
> > sizeof(fsnotify_error_event) + MAX_HANDLE_SZ.
> > If filesystems would want to store some variable length fs specific info,
> > a future implementation will have to take that into account.
>
> <nod> I /think/ for the fs and AG metadata we could preallocate these,
> so long as fsnotify doesn't free them out from under us.

The fs won't get notified when the event is freed, so fsnotify must
take ownership of the data structure.
I was thinking more along the lines of limiting maximum size for fs
specific info and pre-allocating that size for the event.

> For inodes...
> there are many more of those, so they'd have to be allocated
> dynamically.

The current scheme is that the size of the queue for error events
is one and the single slot is pre-allocated.
The reason for pre-allocating is the assumption that fsnotify_error()
could be called from contexts where memory allocation would be
inconvenient.
Therefore, we can store the encoded file handle of the first erroneous
inode, but we do not store any more events until the user reads this
one event.

> Hmm. For handling accumulated errors, can we still access the
> fanotify_event_info_* object once we've handed it to fanotify? If the
> user hasn't picked up the event yet, it might be acceptable to set more
> bits in the type mask and bump the error count. In other words, every
> time userspace actually reads the event, it'll get the latest error
> state. I /think/ that's where the design of this patchset is going,
> right?

Sort of.
fsnotify does have a concept of "merging" new event with an event
already in queue.

With most fsnotify events, merge only happens if the info related
to the new event (e.g. sb,inode) is the same as that of the queued
event and the "merge" is only in the event mask
(e.g. FS_OPEN|FS_CLOSE).

However, the current scheme for "merge" of an FS_ERROR event is only
bumping err_count, even if the new reported error or inode do not
match the error/inode in the queued event.
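
As a rough userspace model of that behavior (not the kernel code), the
single pre-allocated slot can be sketched as below. The struct
error_slot and the helpers slot_report() and slot_read() are
hypothetical names, and counting the first error as 1 is an assumption
of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-slot error queue; field names are illustrative. */
struct error_slot {
	bool     queued;	/* an event is waiting to be read */
	int32_t  error;		/* first error since the last read */
	uint32_t err_count;	/* errors observed since the last read */
};

/* Report an error: the first one fills the slot, later ones only count. */
static void slot_report(struct error_slot *s, int32_t error)
{
	if (!s->queued) {
		s->queued = true;
		s->error = error;
		s->err_count = 1;
		return;
	}
	/* "Merge": the details of later errors are dropped. */
	s->err_count++;
}

/* Read the event, copying it out so the slot is free for the next error. */
static bool slot_read(struct error_slot *s, struct error_slot *out)
{
	if (!s->queued)
		return false;
	*out = *s;
	s->queued = false;
	return true;
}
```

Reporting -EIO, then -ENOSPC, then -EIO before a read yields one event
carrying the first error and a count of 3; the -ENOSPC is only visible
through the count.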

If we define error event subtypes (e.g. FS_ERROR_WRITEBACK,
FS_ERROR_METADATA), then the error event could contain
a field for subtype mask and user could read the subtype mask
along with the accumulated error count, but this cannot be
done by providing the filesystem access to modify an internal
fsnotify event, so those have to be generic UAPI defined subtypes.

If you think that would be useful, then we may want to consider
reserving the subtype mask field in fanotify_event_info_error in
advance.
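
A small sketch of what such a subtype mask could look like when merged
alongside the count. The bit values chosen for FS_ERROR_WRITEBACK and
FS_ERROR_METADATA, the struct error_summary and summary_merge() are all
assumptions made for illustration; nothing like this exists in the
patch set.

```c
#include <stdint.h>

/* Hypothetical generic subtypes; the bit values are made up here. */
#define FS_ERROR_WRITEBACK	(1u << 0)
#define FS_ERROR_METADATA	(1u << 1)

/* What a reader would see: union of subtypes plus the running count. */
struct error_summary {
	uint32_t subtype_mask;
	uint32_t err_count;
};

/* Fold one more reported error into the queued summary. */
static void summary_merge(struct error_summary *s, uint32_t subtype)
{
	s->subtype_mask |= subtype;
	s->err_count++;
}
```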

>
> > > > 2) If a program written for today's notification events sees a
> > > > fanotify_event_info_header from future-XFS with a header length that is
> > > > larger than FANOTIFY_INFO_ERROR_LEN, will it be able to react
> > > > appropriately? Which is to say, ignore it on the grounds that the
> > > > length is unexpectedly large?
> > >
> > > That is the expected behavior :). But I guess separate info type for
> > > fs-specific blob might be more foolproof in this sense - when parsing
> > > events, you are expected to just skip info_types you don't understand
> > > (based on 'len' and 'type' in the common header) and generally different
> > > events have different sets of infos attached to them so you mostly have to
> > > implement this logic to be able to process events.
> > >
> > > > It /looks/ like this is the case; really I'm just fishing around here
> > > > to make sure nothing in the design of /this/ patchset would make it Very
> > > > Difficult(tm) to add more information later.
> > > >
> > > > 3) Once we let filesystem implementations create their own extended
> > > > error notifications, should we have a "u32 magic" to aid in decoding?
> > > > Or even add it to fanotify_event_info_error now?
> > >
> > > If we go via the 'separate info type' route, then the magic can go into
> > > that structure and there's no great use for 'magic' in
> > > fanotify_event_info_error.
> >
> > My 0.02$:
> > With current patch set, filesystem reports error using:
> > fsnotify_sb_error(sb, inode, error)
> >
> > The optional @inode argument is encoded to a filesystem opaque
> > blob using exportfs_encode_inode_fh(), recorded in the event
> > as a blob and reported to userspace as a blob.
> >
> > If filesystem would like to report a different type of opaque blob
> > (e.g. xfs_perag_info), the interface should be extended to:
> > fsnotify_sb_error(sb, inode, error, info, info_len)
> > and the 'separate info type' route seems like the best and most natural
> > way to deal with the case of information that is only emitted from
> > a specific filesystem with a specific feature enabled (online fsck).
>
> <nod> This seems reasonable to me.
>
> > IOW, there is no need for fanotify_event_info_xfs_perag_error
> > in fanotify UAPI if you ask me.
> >
> > Regarding 'magic' in fanotify_event_info_error, I also don't see the
> > need for that, because the event already has fsid which can be
> > used to identify the filesystem in question.
> >
> > Keep in mind that the value of handle_type inside struct file_handle
> > inside struct fanotify_event_info_fid is not a universal classifier.
> > Specifically, the type 0x81 means "XFS_FILEID_INO64_GEN"
> > only in the context of XFS and it can mean something else in the
> > context of another type of filesystem.
>
> Can you pass the handle into the kernel to open a fd to file mentioned
> in the report? I don't think userspace is supposed to know what's
> inside a file handle, and it would be helpful if it didn't matter here
> either. :)
>

User gets a file handle and can do whatever users can do with file
handles... that is, open_by_handle_at() (if filesystem and inode are
still alive and healthy) and for less privileged users, compare with
result of name_to_handle_at() of another object.

Obviously, filesystem specialized tools could parse the file handle
to extract more information.

> > If we add a new info record fanotify_event_info_fs_private
> > it could even be an alias to fanotify_event_info_fid with the only
> > difference that the handle[0] member is not expected to be
> > struct file_handle, but some other fs private struct.
>
> I ... think I prefer it being a separate info blob.
>

Yes. That is what I meant.
Separate info record INFO_TYPE_ERROR_FS_DATA, whose info record
format is quite the same as that of INFO_TYPE_FID, but the blob is a
different type of blob.

Thanks,
Amir.

2021-08-18 09:58:50

by Jan Kara

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Wed 18-08-21 06:24:26, Amir Goldstein wrote:
> [...]
>
> > > Just keep in mind that the current scheme pre-allocates the single event slot
> > > on fanotify_mark() time and (I think) we agreed to pre-allocate
> > > sizeof(fsnotify_error_event) + MAX_HANDLE_SZ.
> > > If filesystems would want to store some variable length fs specific info,
> > > a future implementation will have to take that into account.
> >
> > <nod> I /think/ for the fs and AG metadata we could preallocate these,
> > so long as fsnotify doesn't free them out from under us.
>
> The fs won't get notified when the event is freed, so fsnotify must
> take ownership of the data structure.
> I was thinking more along the lines of limiting maximum size for fs
> specific info and pre-allocating that size for the event.

Agreed. If there's a sensible upper bound, then preallocating this inside
fsnotify is likely the least problematic solution.

> > For inodes...
> > there are many more of those, so they'd have to be allocated
> > dynamically.
>
> The current scheme is that the size of the queue for error events
> is one and the single slot is pre-allocated.
> The reason for pre-allocating is the assumption that fsnotify_error()
> could be called from contexts where memory allocation would be
> inconvenient.
> Therefore, we can store the encoded file handle of the first erroneous
> inode, but we do not store any more events until the user reads this
> one event.

Right. OTOH I can imagine allowing GFP_NOFS allocations in the error
context. At least for ext4 it would be workable (after all ext4 manages to
lock & modify superblock in its error handlers, GFP_NOFS allocation isn't
harder). But then if events are dynamically allocated there's still the
inconvenient question what are you going to do if you need to report fs
error and you hit ENOMEM. Just not sending the notification may have nasty
consequences and in the world of containerization and virtualization
tightly packed machines where ENOMEM happens aren't that unlikely. It is
just difficult to make assumptions about filesystems overall so we decided
to be better safe and preallocate the event.

Or, we could leave the allocation troubles for the filesystem and
fsnotify_sb_error() would be passed already allocated event (this way
attaching of fs-specific blobs to the event is handled as well) which it
would just queue. Plus we'd need to provide some helper to fill in generic
part of the event...

The disadvantage is that if there are filesystems / callsites needing
preallocated events, it would be painful for them. OTOH current two users -
ext4 & xfs - can handle allocation in the error path AFAIU.

Thinking about this some more, maybe we could have event preallocated (like
a "rescue event"). Normally we would dynamically allocate (or get passed
from fs) the event and only if the allocation fails, we would queue the
rescue event to indicate to listeners that something bad happened, there
was error but we could not fully report it.
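
To make the rescue-event idea concrete, here is a plain C model of the
fallback, with the allocator passed in so the failure path can be
exercised. Everything here (struct err_event, rescue_event,
get_event(), failing_alloc()) is a hypothetical illustration and not
code from the series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative event; 'rescue' tells listeners the report is incomplete. */
struct err_event {
	int32_t error;
	bool    rescue;
};

/* Preallocated once, used only when dynamic allocation fails. */
static struct err_event rescue_event;

typedef void *(*alloc_fn)(size_t);

/* Allocator that always fails, standing in for an ENOMEM situation. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/*
 * Try to allocate a full event; on failure hand back the rescue event,
 * which only says "there was an error we could not fully report".
 */
static struct err_event *get_event(alloc_fn alloc, int32_t error)
{
	struct err_event *ev = alloc(sizeof(*ev));

	if (ev) {
		ev->error = error;
		ev->rescue = false;
		return ev;
	}

	rescue_event.error = 0;		/* details are not available */
	rescue_event.rescue = true;
	return &rescue_event;
}
```

The caller must not free the rescue event; in this sketch it can be
recognized by the 'rescue' flag or by comparing against &rescue_event.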

But then, even if we'd go for dynamic event allocation by default, we need
to efficiently merge events since some fs failures (e.g. resulting in
journal abort in ext4) lead to basically all operations with the filesystem
to fail and that could easily swamp the notification system with useless
events. Current system with preallocated event nicely handles this
situation, it is questionable how to extend it for online fsck usecase
where we need to queue more than one event (but even there probably needs
to be some sensible upper-bound). I'll think about it...

> > Hmm. For handling accumulated errors, can we still access the
> > fanotify_event_info_* object once we've handed it to fanotify? If the
> > user hasn't picked up the event yet, it might be acceptable to set more
> > bits in the type mask and bump the error count. In other words, every
> > time userspace actually reads the event, it'll get the latest error
> > state. I /think/ that's where the design of this patchset is going,
> > right?
>
> Sort of.
> fsnotify does have a concept of "merging" new event with an event
> already in queue.
>
> With most fsnotify events, merge only happens if the info related
> to the new event (e.g. sb,inode) is the same as that of the queued
> event and the "merge" is only in the event mask
> (e.g. FS_OPEN|FS_CLOSE).
>
> However, the current scheme for "merge" of an FS_ERROR event is only
> bumping err_count, even if the new reported error or inode do not
> match the error/inode in the queued event.
>
> If we define error event subtypes (e.g. FS_ERROR_WRITEBACK,
> FS_ERROR_METADATA), then the error event could contain
> a field for subtype mask and user could read the subtype mask
> along with the accumulated error count, but this cannot be
> done by providing the filesystem access to modify an internal
> fsnotify event, so those have to be generic UAPI defined subtypes.
>
> If you think that would be useful, then we may want to consider
> reserving the subtype mask field in fanotify_event_info_error in
> advance.

It depends on what exactly Darrick has in mind but I suspect we'd need a
fs-specific merge helper that would look at fs-specific blobs in the event
and decide whether events can be merged or not, possibly also handling the
merge by updating the blob. From the POV of fsnotify that would probably
mean merge callback in the event itself. But I guess this needs more
details from Darrick and maybe we don't need to decide this at this moment
since nobody is close to the point of having code needing to pass fs-blobs
with events.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-18 13:04:01

by Jan Kara

Subject: Re: [PATCH v6 20/21] samples: Add fs error monitoring example

On Thu 12-08-21 17:40:09, Gabriel Krisman Bertazi wrote:
> Introduce an example of a FAN_FS_ERROR fanotify user to track filesystem
> errors.
>
> Reviewed-by: Amir Goldstein <[email protected]>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

<snip>

> diff --git a/samples/fanotify/fs-monitor.c b/samples/fanotify/fs-monitor.c
> new file mode 100644
> index 000000000000..e115053382be
> --- /dev/null
> +++ b/samples/fanotify/fs-monitor.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2021, Collabora Ltd.
> + */
> +
> +#define _GNU_SOURCE
> +#include <errno.h>
> +#include <err.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <sys/fanotify.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +#include <sys/types.h>
> +
> +#ifndef FAN_FS_ERROR
> +#define FAN_FS_ERROR 0x00008000
> +#define FAN_EVENT_INFO_TYPE_ERROR 4
> +
> +struct fanotify_event_info_error {
> + struct fanotify_event_info_header hdr;
> + __s32 error;
> + __u32 error_count;
> +};
> +#endif

Shouldn't we get these from uapi headers? But I guess the problem is that
you want this sample to work before glibc picks up the new headers? Is this
meant as a sample code for userspace to copy from or more as a testcase?

> +#ifndef FILEID_INO32_GEN
> +#define FILEID_INO32_GEN 1
> +#endif
> +
> +#ifndef FILEID_INVALID
> +#define FILEID_INVALID 0xff
> +#endif
> +
> +static void print_fh(struct file_handle *fh)
> +{
> + int i;
> + uint32_t *h = (uint32_t *) fh->f_handle;
> +
> + printf("\tfh: ");
> + for (i = 0; i < fh->handle_bytes; i++)
> + printf("%hhx", fh->f_handle[i]);
> + printf("\n");
> +
> + printf("\tdecoded fh: ");
> + if (fh->handle_type == FILEID_INO32_GEN)
> + printf("inode=%u gen=%u\n", h[0], h[1]);
> + else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
> + printf("Type %d (Superblock error)\n", fh->handle_type);
> + else
> + printf("Type %d (Unknown)\n", fh->handle_type);
> +
> +}
> +
> +static void handle_notifications(char *buffer, int len)
> +{
> + struct fanotify_event_metadata *metadata;
> + struct fanotify_event_info_error *error;
> + struct fanotify_event_info_fid *fid;
> + char *next;
> +
> + for (metadata = (struct fanotify_event_metadata *) buffer;
> + FAN_EVENT_OK(metadata, len);
> + metadata = FAN_EVENT_NEXT(metadata, len)) {
> + next = (char *)metadata + metadata->event_len;
> + if (metadata->mask != FAN_FS_ERROR) {
> + printf("unexpected FAN MARK: %llx\n", metadata->mask);
> + goto next_event;
> + } else if (metadata->fd != FAN_NOFD) {
> + printf("Unexpected fd (!= FAN_NOFD)\n");
> + goto next_event;
> + }
> +
> + printf("FAN_FS_ERROR found len=%d\n", metadata->event_len);
> +
> + error = (struct fanotify_event_info_error *) (metadata+1);
> + if (error->hdr.info_type != FAN_EVENT_INFO_TYPE_ERROR) {
> + printf("unknown record: %d (Expecting TYPE_ERROR)\n",
> + error->hdr.info_type);
> + goto next_event;
> + }

The ordering of additional info records is undefined. Your code must not
rely on FAN_EVENT_INFO_TYPE_ERROR coming first and
FAN_EVENT_INFO_TYPE_FID second. Also, you should ignore unexpected info
types (maybe just print type and len in this sample code), as later
additions to the API may add further info records.

> +
> + printf("\tGeneric Error Record: len=%d\n", error->hdr.len);
> + printf("\terror: %d\n", error->error);
> + printf("\terror_count: %d\n", error->error_count);
> +
> + fid = (struct fanotify_event_info_fid *) (error + 1);
> + if ((char *) fid >= next) {
> + printf("Event doesn't have FID\n");
> + goto next_event;
> + }
> + printf("FID record found\n");
> +
> + if (fid->hdr.info_type != FAN_EVENT_INFO_TYPE_FID) {
> + printf("unknown record: %d (Expecting TYPE_FID)\n",
> + fid->hdr.info_type);
> + goto next_event;
> + }
> + printf("\tfsid: %x%x\n", fid->fsid.val[0], fid->fsid.val[1]);
> + print_fh((struct file_handle *) &fid->handle);
> +
> +next_event:
> + printf("---\n\n");
> + }

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-08-19 03:59:32

by Darrick J. Wong

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Wed, Aug 18, 2021 at 11:58:18AM +0200, Jan Kara wrote:
> On Wed 18-08-21 06:24:26, Amir Goldstein wrote:
> > [...]
> >
> > > > Just keep in mind that the current scheme pre-allocates the single event slot
> > > > on fanotify_mark() time and (I think) we agreed to pre-allocate
> > > > sizeof(fsnotify_error_event) + MAX_HANDLE_SZ.
> > > > If filesystems would want to store some variable length fs specific info,
> > > > a future implementation will have to take that into account.
> > >
> > > <nod> I /think/ for the fs and AG metadata we could preallocate these,
> > > so long as fsnotify doesn't free them out from under us.
> >
> > fs won't get notified when the event is freed, so fsnotify must
> > take ownership on the data structure.
> > I was thinking more along the lines of limiting maximum size for fs
> > specific info and pre-allocating that size for the event.
>
> Agreed. If there's a sensible upper bound then preallocating this inside
> fsnotify is likely the least problematic solution.
>
> > > For inodes...
> > > there are many more of those, so they'd have to be allocated
> > > dynamically.
> >
> > The current scheme is that the size of the queue for error events
> > is one and the single slot is pre-allocated.
> > The reason for pre-allocate is that the assumption is that fsnotify_error()
> > could be called from contexts where memory allocation would be
> > inconvenient.
> > Therefore, we can store the encoded file handle of the first erroneous
> > inode, but we do not store any more events until user read this
> > one event.
>
> Right. OTOH I can imagine allowing GFP_NOFS allocations in the error
> context. At least for ext4 it would be workable (after all ext4 manages to
> lock & modify superblock in its error handlers, GFP_NOFS allocation isn't
> harder). But then if events are dynamically allocated there's still the
> inconvenient question what are you going to do if you need to report fs
> error and you hit ENOMEM. Just not sending the notification may have nasty
> consequences and in the world of containerization and virtualization
> tightly packed machines where ENOMEM happens aren't that unlikely. It is
> just difficult to make assumptions about filesystems overall so we decided
> to be better safe and preallocate the event.
>
> Or, we could leave the allocation troubles for the filesystem and
> fsnotify_sb_error() would be passed already allocated event (this way
> attaching of fs-specific blobs to the event is handled as well) which it
> would just queue. Plus we'd need to provide some helper to fill in generic
> part of the event...
>
> The disadvantage is that if there are filesystems / callsites needing
> preallocated events, it would be painful for them. OTOH current two users -
> ext4 & xfs - can handle allocation in the error path AFAIU.
>
> Thinking about this some more, maybe we could have event preallocated (like
> a "rescue event"). Normally we would dynamically allocate (or get passed
> from fs) the event and only if the allocation fails, we would queue the
> rescue event to indicate to listeners that something bad happened, there
> was error but we could not fully report it.

Yes.
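
A rough userspace sketch of the "rescue event" idea quoted above, assuming the event is normally allocated dynamically and a single preallocated event is used only when allocation fails. The `demo_` names and the `is_rescue` flag are hypothetical; the allocator is passed in so that failure can be simulated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model; not the fsnotify data structures. */
struct demo_event {
	int32_t error;
	bool is_rescue; /* true: allocation failed, details may be missing */
	bool in_use;
};

struct demo_group {
	struct demo_event rescue; /* preallocated when the mark is created */
};

typedef void *(*demo_alloc_fn)(size_t size);

/* Stand-in allocator that simulates ENOMEM. */
static void *demo_alloc_fail(size_t size)
{
	(void)size;
	return NULL;
}

/*
 * Try a dynamic allocation first. If that fails, fall back to the
 * rescue event; if the rescue event is already queued, give up.
 */
static struct demo_event *demo_event_get(struct demo_group *g,
					 demo_alloc_fn alloc, int32_t error)
{
	struct demo_event *ev;

	assert(g != NULL && alloc != NULL);

	ev = alloc(sizeof(*ev));
	if (!ev) {
		if (g->rescue.in_use)
			return NULL; /* rescue event already queued */
		ev = &g->rescue;
		ev->is_rescue = true;
	} else {
		ev->is_rescue = false;
	}
	ev->error = error;
	ev->in_use = true;
	return ev;
}

/* Release an event once the listener has consumed it. */
static void demo_event_put(struct demo_group *g, struct demo_event *ev)
{
	if (ev == &g->rescue)
		ev->in_use = false;
	else
		free(ev);
}
```

The sketch leaves the flood problem untouched: with a working allocator nothing here limits how many events get queued.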

> But then, even if we'd go for dynamic event allocation by default, we need
> to efficiently merge events since some fs failures (e.g. resulting in
> journal abort in ext4) lead to basically all operations with the filesystem
> to fail and that could easily swamp the notification system with useless
> events.

Hm. Going out on a limb, I would guess that the majority of fs error
flood events happen if the storage fails catastrophically. Assuming
that a catastrophic failure will quickly take the filesystem offline, I
would say that for XFS we should probably send one last "and then we
died" event and stop reporting after that.

> Current system with preallocated event nicely handles this
> situation, it is questionable how to extend it for online fsck usecase
> where we need to queue more than one event (but even there probably needs
> to be some sensible upper-bound). I'll think about it...

At least for XFS, I was figuring that xfs_scrub errors wouldn't be
reported via fsnotify since the repair tool is already running anyway.

> > > Hmm. For handling accumulated errors, can we still access the
> > > fanotify_event_info_* object once we've handed it to fanotify? If the
> > > user hasn't picked up the event yet, it might be acceptable to set more
> > > bits in the type mask and bump the error count. In other words, every
> > > time userspace actually reads the event, it'll get the latest error
> > > state. I /think/ that's where the design of this patchset is going,
> > > right?
> >
> > Sort of.
> > fsnotify does have a concept of "merging" new event with an event
> > already in queue.
> >
> > With most fsnotify events, merge only happens if the info related
> > to the new event (e.g. sb,inode) is the same as that of the queued
> > event and the "merge" is only in the event mask
> > (e.g. FS_OPEN|FS_CLOSE).
> >
> > However, the current scheme for "merge" of an FS_ERROR event is only
> > bumping err_count, even if the new reported error or inode do not
> > match the error/inode in the queued event.
> >
> > If we define error event subtypes (e.g. FS_ERROR_WRITEBACK,
> > FS_ERROR_METADATA), then the error event could contain
> > a field for subtype mask and user could read the subtype mask
> > along with the accumulated error count, but this cannot be
> > done by providing the filesystem access to modify an internal
> > fsnotify event, so those have to be generic UAPI defined subtypes.
> >
> > If you think that would be useful, then we may want to consider
> > reserving the subtype mask field in fanotify_event_info_error in
> > advance.
>
> It depends on what exactly Darrick has in mind but I suspect we'd need a
> fs-specific merge helper that would look at fs-specific blobs in the event
> and decide whether events can be merged or not, possibly also handling the
> merge by updating the blob.

Yes. If the filesystem itself were allowed to manage the lifespan of
the fsnotify error event object then this would be trivial -- we'll own
the object, keep it updated as needed, and fsnotify can copy the
contents to userspace whenever convenient.

(This might be a naïve view of fsnotify...)

> From the POV of fsnotify that would probably
> mean merge callback in the event itself. But I guess this needs more
> details from Darrick and maybe we don't need to decide this at this moment
> since nobody is close to the point of having code needing to pass fs-blobs
> with events.

<nod> We ... probably don't need to decide this now.

--D

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2021-08-23 14:37:48

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v6 04/21] fsnotify: Reserve mark flag bits for backends

Jan Kara <[email protected]> writes:

> On Fri 13-08-21 10:28:27, Amir Goldstein wrote:
>> On Fri, Aug 13, 2021 at 12:40 AM Gabriel Krisman Bertazi
>> <[email protected]> wrote:
>> >
>> > Split out the final bits of struct fsnotify_mark->flags for use by a
>> > backend.
>> >
>> > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>> >
>> > Changes since v1:
>> > - turn consts into defines (jan)
>> > ---
>> > include/linux/fsnotify_backend.h | 18 +++++++++++++++---
>> > 1 file changed, 15 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
>> > index 1ce66748a2d2..ae1bd9f06808 100644
>> > --- a/include/linux/fsnotify_backend.h
>> > +++ b/include/linux/fsnotify_backend.h
>> > @@ -363,6 +363,20 @@ struct fsnotify_mark_connector {
>> > struct hlist_head list;
>> > };
>> >
>> > +enum fsnotify_mark_bits {
>> > + FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY,
>> > + FSN_MARK_FL_BIT_ALIVE,
>> > + FSN_MARK_FL_BIT_ATTACHED,
>> > + FSN_MARK_PRIVATE_FLAGS,
>> > +};
>> > +
>> > +#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY \
>> > + (1 << FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY)
>> > +#define FSNOTIFY_MARK_FLAG_ALIVE \
>> > + (1 << FSN_MARK_FL_BIT_ALIVE)
>> > +#define FSNOTIFY_MARK_FLAG_ATTACHED \
>> > + (1 << FSN_MARK_FL_BIT_ATTACHED)
>> > +
>> > /*
>> > * A mark is simply an object attached to an in core inode which allows an
>> > * fsnotify listener to indicate they are either no longer interested in events
>> > @@ -398,9 +412,7 @@ struct fsnotify_mark {
>> > struct fsnotify_mark_connector *connector;
>> > /* Events types to ignore [mark->lock, group->mark_mutex] */
>> > __u32 ignored_mask;
>> > -#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY 0x01
>> > -#define FSNOTIFY_MARK_FLAG_ALIVE 0x02
>> > -#define FSNOTIFY_MARK_FLAG_ATTACHED 0x04
>> > + /* Upper bits [31:PRIVATE_FLAGS] are reserved for backend usage */
>>
>> I don't understand what [31:PRIVATE_FLAGS] means
>
> I think it should be [FSN_MARK_PRIVATE_FLAGS:31] (identifying a range of
> bits). I'd maybe write just "Bits starting from FSN_MARK_PRIVATE_FLAGS are
> reserved for backend usage". With this fixed feel free to add:

Thank you, I will address the comment and add your reviewed-by tags.
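
For illustration, a small sketch of how a backend could place its own flags at and above FSN_MARK_PRIVATE_FLAGS. The enum is copied from the quoted hunk; the `DEMO_` macros and the helper are hypothetical and not part of the patch. The sketch uses `1U` rather than `1` so that a shift into bit 31 stays well defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Copied from the quoted hunk. */
enum fsnotify_mark_bits {
	FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY,
	FSN_MARK_FL_BIT_ALIVE,
	FSN_MARK_FL_BIT_ATTACHED,
	FSN_MARK_PRIVATE_FLAGS, /* first bit available to backends */
};

#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY \
	(1U << FSN_MARK_FL_BIT_IGNORED_SURV_MODIFY)
#define FSNOTIFY_MARK_FLAG_ALIVE \
	(1U << FSN_MARK_FL_BIT_ALIVE)
#define FSNOTIFY_MARK_FLAG_ATTACHED \
	(1U << FSN_MARK_FL_BIT_ATTACHED)

/* Hypothetical backend-private flags, not part of the patch. */
#define DEMO_BACKEND_MARK_FLAG_A (1U << (FSN_MARK_PRIVATE_FLAGS + 0))
#define DEMO_BACKEND_MARK_FLAG_B (1U << (FSN_MARK_PRIVATE_FLAGS + 1))

/* All bits below FSN_MARK_PRIVATE_FLAGS belong to the fsnotify core. */
#define DEMO_MARK_CORE_MASK ((1U << FSN_MARK_PRIVATE_FLAGS) - 1)

static_assert(FSN_MARK_PRIVATE_FLAGS < 32, "no flag bits left for backends");

/* True if 'flags' is non-empty and uses only backend-reserved bits. */
static bool demo_flag_is_backend_private(uint32_t flags)
{
	return flags != 0 && (flags & DEMO_MARK_CORE_MASK) == 0;
}
```

Adding a new core bit before FSN_MARK_PRIVATE_FLAGS in the enum shifts every backend flag automatically, which is presumably the reason for deriving the defines from the enum.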

--
Gabriel Krisman Bertazi

2021-08-23 14:50:09

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v6 20/21] samples: Add fs error monitoring example

Jan Kara <[email protected]> writes:

> On Thu 12-08-21 17:40:09, Gabriel Krisman Bertazi wrote:
>> Introduce an example of a FAN_FS_ERROR fanotify user to track filesystem
>> errors.
>>
>> Reviewed-by: Amir Goldstein <[email protected]>
>> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>
> <snip>
>
>> diff --git a/samples/fanotify/fs-monitor.c b/samples/fanotify/fs-monitor.c
>> new file mode 100644
>> index 000000000000..e115053382be
>> --- /dev/null
>> +++ b/samples/fanotify/fs-monitor.c
>> @@ -0,0 +1,138 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright 2021, Collabora Ltd.
>> + */
>> +
>> +#define _GNU_SOURCE
>> +#include <errno.h>
>> +#include <err.h>
>> +#include <stdlib.h>
>> +#include <stdio.h>
>> +#include <fcntl.h>
>> +#include <sys/fanotify.h>
>> +#include <sys/types.h>
>> +#include <unistd.h>
>> +#include <sys/types.h>
>> +
>> +#ifndef FAN_FS_ERROR
>> +#define FAN_FS_ERROR 0x00008000
>> +#define FAN_EVENT_INFO_TYPE_ERROR 4
>> +
>> +struct fanotify_event_info_error {
>> + struct fanotify_event_info_header hdr;
>> + __s32 error;
>> + __u32 error_count;
>> +};
>> +#endif
>
> Shouldn't we get these from uapi headers? But I guess the problem is that
> you want this sample to work before glibc picks up the new headers? Is this
> meant as a sample code for userspace to copy from or more as a
> testcase?

Hi,

Yes, these will be picked up from the uapi headers, but the guards
protect against building with an older libc. They also have the side
effect of silencing the kernel test robot about this patch... :)

This is meant as sample code for users to copy from. It was also used
for testing in the beginning, but now I have a proper LTP testcase in a
different series.

>> +#ifndef FILEID_INO32_GEN
>> +#define FILEID_INO32_GEN 1
>> +#endif
>> +
>> +#ifndef FILEID_INVALID
>> +#define FILEID_INVALID 0xff
>> +#endif
>> +
>> +static void print_fh(struct file_handle *fh)
>> +{
>> + int i;
>> + uint32_t *h = (uint32_t *) fh->f_handle;
>> +
>> + printf("\tfh: ");
>> + for (i = 0; i < fh->handle_bytes; i++)
>> + printf("%hhx", fh->f_handle[i]);
>> + printf("\n");
>> +
>> + printf("\tdecoded fh: ");
>> + if (fh->handle_type == FILEID_INO32_GEN)
>> + printf("inode=%u gen=%u\n", h[0], h[1]);
>> + else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
>> + printf("Type %d (Superblock error)\n", fh->handle_type);
>> + else
>> + printf("Type %d (Unknown)\n", fh->handle_type);
>> +
>> +}
>> +
>> +static void handle_notifications(char *buffer, int len)
>> +{
>> + struct fanotify_event_metadata *metadata;
>> + struct fanotify_event_info_error *error;
>> + struct fanotify_event_info_fid *fid;
>> + char *next;
>> +
>> + for (metadata = (struct fanotify_event_metadata *) buffer;
>> + FAN_EVENT_OK(metadata, len);
>> + metadata = FAN_EVENT_NEXT(metadata, len)) {
>> + next = (char *)metadata + metadata->event_len;
>> + if (metadata->mask != FAN_FS_ERROR) {
>> + printf("unexpected FAN MARK: %llx\n", metadata->mask);
>> + goto next_event;
>> + } else if (metadata->fd != FAN_NOFD) {
>> + printf("Unexpected fd (!= FAN_NOFD)\n");
>> + goto next_event;
>> + }
>> +
>> + printf("FAN_FS_ERROR found len=%d\n", metadata->event_len);
>> +
>> + error = (struct fanotify_event_info_error *) (metadata+1);
>> + if (error->hdr.info_type != FAN_EVENT_INFO_TYPE_ERROR) {
>> + printf("unknown record: %d (Expecting TYPE_ERROR)\n",
>> + error->hdr.info_type);
>> + goto next_event;
>> + }
>
> The ordering of additional infos is undefined. Your code must not rely on
> the fact that FAN_EVENT_INFO_TYPE_ERROR comes first and
> FAN_EVENT_INFO_TYPE_FID second. Also you should ignore (maybe just print
> type and len in this sample code) when you see unexpected info types as
> later additions to the API may add additional info records

Ah, I was really wondering whether the order is guaranteed or not.
Even though the current code forces it that way, I couldn't find the man
page explicitly saying whether it is guaranteed. Thanks, I will fix it up.
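
One possible shape for the fix is a loop that walks the records by `hdr.len` and dispatches on `info_type`, so that neither the order nor the set of records is assumed. This is only a sketch: the structs below are local mirrors of the layouts quoted in this thread (so it builds without the new uapi headers), and the `demo_` names are made up. In the sample, `buf` would be `(char *)metadata + metadata->metadata_len` and `len` would be `metadata->event_len - metadata->metadata_len`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local mirrors of the uapi layouts quoted in this thread. */
struct demo_info_header {
	uint8_t info_type;
	uint8_t pad;
	uint16_t len; /* length of the whole record, header included */
};

struct demo_info_error {
	struct demo_info_header hdr;
	int32_t error;
	uint32_t error_count;
};

#define DEMO_INFO_TYPE_FID   1 /* FAN_EVENT_INFO_TYPE_FID */
#define DEMO_INFO_TYPE_ERROR 4 /* FAN_EVENT_INFO_TYPE_ERROR */

struct demo_walk_result {
	int saw_error;
	int saw_fid;
	int unknown; /* records skipped because the type is unknown */
	int32_t error;
	uint32_t error_count;
};

/* Walk all info records in [buf, buf + len). Returns -1 if malformed. */
static int demo_walk_info(const unsigned char *buf, size_t len,
			  struct demo_walk_result *res)
{
	size_t off = 0;

	assert(buf != NULL && res != NULL);
	memset(res, 0, sizeof(*res));

	while (len - off >= sizeof(struct demo_info_header)) {
		struct demo_info_header hdr;
		struct demo_info_error err;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > len - off)
			return -1;

		switch (hdr.info_type) {
		case DEMO_INFO_TYPE_ERROR:
			if (hdr.len < sizeof(err))
				return -1;
			memcpy(&err, buf + off, sizeof(err));
			res->saw_error = 1;
			res->error = err.error;
			res->error_count = err.error_count;
			break;
		case DEMO_INFO_TYPE_FID:
			res->saw_fid = 1; /* decode fsid and handle here */
			break;
		default:
			/* Later kernels may add records: report and skip. */
			printf("skipping info record: type=%u len=%u\n",
			       (unsigned)hdr.info_type, (unsigned)hdr.len);
			res->unknown++;
			break;
		}
		off += hdr.len;
	}
	return 0;
}
```

The copies through `memcpy` avoid relying on the alignment of the read buffer.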

--
Gabriel Krisman Bertazi

2021-08-24 16:54:00

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

Jan Kara <[email protected]> writes:

> On Mon 16-08-21 14:41:03, Darrick J. Wong wrote:
>> On Thu, Aug 12, 2021 at 05:40:07PM -0400, Gabriel Krisman Bertazi wrote:
>> > The Error info type is a record sent to users on FAN_FS_ERROR events
>> > documenting the type of error. It also carries an error count,
>> > documenting how many errors were observed since the last reporting.
>> >
>> > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>> >
>> > ---
>> > Changes since v5:
>> > - Move error code here
>> > ---
>> > fs/notify/fanotify/fanotify.c | 1 +
>> > fs/notify/fanotify/fanotify.h | 1 +
>> > fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
>> > include/uapi/linux/fanotify.h | 7 ++++++
>> > 4 files changed, 45 insertions(+)
>>
>> <snip>
>>
>> > diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
>> > index 16402037fc7a..80040a92e9d9 100644
>> > --- a/include/uapi/linux/fanotify.h
>> > +++ b/include/uapi/linux/fanotify.h
>> > @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
>> > #define FAN_EVENT_INFO_TYPE_FID 1
>> > #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
>> > #define FAN_EVENT_INFO_TYPE_DFID 3
>> > +#define FAN_EVENT_INFO_TYPE_ERROR 4
>> >
>> > /* Variable length info record following event metadata */
>> > struct fanotify_event_info_header {
>> > @@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
>> > unsigned char handle[0];
>> > };
>> >
>> > +struct fanotify_event_info_error {
>> > + struct fanotify_event_info_header hdr;
>> > + __s32 error;
>> > + __u32 error_count;
>> > +};
>>
>> My apologies for not having time to review this patchset since it was
>> redesigned to use fanotify. Someday it would be helpful to be able to
>> export more detailed error reports from XFS, but as I'm not ready to
> >> move forward and write that today, I'll try to avoid derailing this at
>> the last minute.
>
> I think we are not quite there and tweaking the passed structure is easy
> enough so no worries. Eventually, passing some filesystem-specific blob
> together with the event was the plan AFAIR. You're right now is a good
> moment to think how exactly we want that passed.
>
>> Eventually, XFS might want to be able to report errors in file data,
>> file metadata, allocation group metadata, and whole-filesystem metadata.
>> Userspace can already gather reports from XFS about corruptions reported
>> by the online fsck code (see xfs_health.c).
>
> Yes, although note that the current plan is that we currently have only one
> error event queue, others are just added to error_count until the event is
> fetched by userspace (on the grounds that the first error is usually the
> most meaningful, the others are usually just cascading problems). But I'm
> not sure if this scheme would be suitable for online fsck usecase since we
> may discard even valid independent errors this way.
>
>> I /think/ we could subclass the file error structure that you've
>> provided like so:
>>
>> struct fanotify_event_info_xfs_filesystem_error {
>> struct fanotify_event_info_error base;
>>
>> __u32 magic; /* 0x58465342 to identify xfs */
>> __u32 type; /* quotas, realtime bitmap, etc. */
>> };
>>
>> struct fanotify_event_info_xfs_perag_error {
>> struct fanotify_event_info_error base;
>>
>> __u32 magic; /* 0x58465342 to identify xfs */
>> __u32 type; /* agf, agi, agfl, bno btree, ino btree, etc. */
>> __u32 agno; /* allocation group number */
>> };
>>
>> struct fanotify_event_info_xfs_file_error {
>> struct fanotify_event_info_error base;
>>
>> __u32 magic; /* 0x58465342 to identify xfs */
>> __u32 type; /* extent map, dir, attr, etc. */
>> __u64 offset; /* file data offset, if applicable */
>> __u64 length; /* file data length, if applicable */
>> };
>>
>> (A real XFS implementation might have one structure with the type code
>> providing for a tagged union or something; I split it into three
>> separate structs here to avoid confusing things.)
>
> The structure of fanotify event as passed to userspace generally is:
>
> struct fanotify_event_metadata {
> __u32 event_len;
> __u8 vers;
> __u8 reserved;
> __u16 metadata_len;
> __aligned_u64 mask;
> __s32 fd;
> __s32 pid;
> };
>
> If event_len is > sizeof(struct fanotify_event_metadata), userspace is
> expected to look for struct fanotify_event_info_header after struct
> fanotify_event_metadata. struct fanotify_event_info_header looks like:
>
> struct fanotify_event_info_header {
> __u8 info_type;
> __u8 pad;
> __u16 len;
> };
>
> Again if the end of this info (defined by 'len') is smaller than
> 'event_len', there is next header with next payload of data. So for example
> error event will have:
>
> struct fanotify_event_metadata
> struct fanotify_event_info_error
> struct fanotify_event_info_fid
>
> Now either we could add fs specific blob into fanotify_event_info_error
> (but then it would be good to add 'magic' to fanotify_event_info_error now
> and define that if 'len' is larger, fs-specific blob follows after fixed
> data) or we can add another info type FAN_EVENT_INFO_TYPE_ERROR_FS_DATA
> (i.e., attach another structure into the event) which would contain the
> 'magic' and then blob of data. I don't have strong preference.

In v1 of this patchset [1] I implemented the latter option, a new
info type that the filesystem could provide as a blob. It was dropped
at Amir's request to leave it out of the discussion at that moment. Should I
resuscitate it for the next iteration? I believe it would address XFS's needs.

[1] https://lwn.net/ml/linux-fsdevel/[email protected]/
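
To make the second option concrete (a separate info record holding a magic followed by a blob), userspace parsing could look roughly like the sketch below. Only the name FAN_EVENT_INFO_TYPE_ERROR_FS_DATA and the XFS magic 0x58465342 come from this thread; the numeric type value, the `demo_` names and the 8-byte fixed prefix are assumptions made for the example and are not the v1 layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct fanotify_event_info_header as quoted above. */
struct demo_info_header {
	uint8_t info_type;
	uint8_t pad;
	uint16_t len; /* length of the whole record, header included */
};

/* Hypothetical value; no such info type is defined by this series. */
#define DEMO_INFO_TYPE_ERROR_FS_DATA 0x7f

/* Magic used in the XFS examples earlier in the thread. */
#define DEMO_XFS_MAGIC 0x58465342u

/* Assumed fixed part of the record: header followed by a 32-bit magic. */
#define DEMO_FS_DATA_FIXED_LEN (sizeof(struct demo_info_header) + sizeof(uint32_t))

/*
 * Return a pointer to the fs-specific blob inside 'rec' and store its
 * length in *blob_len, or return NULL if the record is of another type,
 * is truncated, or carries a different magic.
 */
static const unsigned char *demo_fs_data_blob(const unsigned char *rec,
					      size_t avail,
					      uint32_t want_magic,
					      size_t *blob_len)
{
	struct demo_info_header hdr;
	uint32_t magic;

	assert(rec != NULL && blob_len != NULL);

	if (avail < DEMO_FS_DATA_FIXED_LEN)
		return NULL;
	memcpy(&hdr, rec, sizeof(hdr));
	if (hdr.info_type != DEMO_INFO_TYPE_ERROR_FS_DATA)
		return NULL;
	if (hdr.len < DEMO_FS_DATA_FIXED_LEN || hdr.len > avail)
		return NULL;
	memcpy(&magic, rec + sizeof(hdr), sizeof(magic));
	if (magic != want_magic)
		return NULL;

	*blob_len = hdr.len - DEMO_FS_DATA_FIXED_LEN;
	return rec + DEMO_FS_DATA_FIXED_LEN;
}
```

A listener that does not know the magic would get NULL and could skip the record by `hdr.len`, so generic tools keep working when a filesystem attaches data they do not understand.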

--
Gabriel Krisman Bertazi

2021-08-25 04:11:43

by Darrick J. Wong

Subject: Re: [PATCH v6 18/21] fanotify: Emit generic error info type for error event

On Tue, Aug 24, 2021 at 12:53:24PM -0400, Gabriel Krisman Bertazi wrote:
> Jan Kara <[email protected]> writes:
>
> > On Mon 16-08-21 14:41:03, Darrick J. Wong wrote:
> >> On Thu, Aug 12, 2021 at 05:40:07PM -0400, Gabriel Krisman Bertazi wrote:
> >> > The Error info type is a record sent to users on FAN_FS_ERROR events
> >> > documenting the type of error. It also carries an error count,
> >> > documenting how many errors were observed since the last reporting.
> >> >
> >> > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> >> >
> >> > ---
> >> > Changes since v5:
> >> > - Move error code here
> >> > ---
> >> > fs/notify/fanotify/fanotify.c | 1 +
> >> > fs/notify/fanotify/fanotify.h | 1 +
> >> > fs/notify/fanotify/fanotify_user.c | 36 ++++++++++++++++++++++++++++++
> >> > include/uapi/linux/fanotify.h | 7 ++++++
> >> > 4 files changed, 45 insertions(+)
> >>
> >> <snip>
> >>
> >> > diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> >> > index 16402037fc7a..80040a92e9d9 100644
> >> > --- a/include/uapi/linux/fanotify.h
> >> > +++ b/include/uapi/linux/fanotify.h
> >> > @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
> >> > #define FAN_EVENT_INFO_TYPE_FID 1
> >> > #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
> >> > #define FAN_EVENT_INFO_TYPE_DFID 3
> >> > +#define FAN_EVENT_INFO_TYPE_ERROR 4
> >> >
> >> > /* Variable length info record following event metadata */
> >> > struct fanotify_event_info_header {
> >> > @@ -149,6 +150,12 @@ struct fanotify_event_info_fid {
> >> > unsigned char handle[0];
> >> > };
> >> >
> >> > +struct fanotify_event_info_error {
> >> > + struct fanotify_event_info_header hdr;
> >> > + __s32 error;
> >> > + __u32 error_count;
> >> > +};
> >>
> >> My apologies for not having time to review this patchset since it was
> >> redesigned to use fanotify. Someday it would be helpful to be able to
> >> export more detailed error reports from XFS, but as I'm not ready to
> > >> move forward and write that today, I'll try to avoid derailing this at
> >> the last minute.
> >
> > I think we are not quite there and tweaking the passed structure is easy
> > enough so no worries. Eventually, passing some filesystem-specific blob
> > together with the event was the plan AFAIR. You're right now is a good
> > moment to think how exactly we want that passed.
> >
> >> Eventually, XFS might want to be able to report errors in file data,
> >> file metadata, allocation group metadata, and whole-filesystem metadata.
> >> Userspace can already gather reports from XFS about corruptions reported
> >> by the online fsck code (see xfs_health.c).
> >
> > Yes, although note that the current plan is that we currently have only one
> > error event queue, others are just added to error_count until the event is
> > fetched by userspace (on the grounds that the first error is usually the
> > most meaningful, the others are usually just cascading problems). But I'm
> > not sure if this scheme would be suitable for online fsck usecase since we
> > may discard even valid independent errors this way.
> >
> >> I /think/ we could subclass the file error structure that you've
> >> provided like so:
> >>
> >> struct fanotify_event_info_xfs_filesystem_error {
> >> struct fanotify_event_info_error base;
> >>
> >> __u32 magic; /* 0x58465342 to identify xfs */
> >> __u32 type; /* quotas, realtime bitmap, etc. */
> >> };
> >>
> >> struct fanotify_event_info_xfs_perag_error {
> >> struct fanotify_event_info_error base;
> >>
> >> __u32 magic; /* 0x58465342 to identify xfs */
> >> __u32 type; /* agf, agi, agfl, bno btree, ino btree, etc. */
> >> __u32 agno; /* allocation group number */
> >> };
> >>
> >> struct fanotify_event_info_xfs_file_error {
> >> struct fanotify_event_info_error base;
> >>
> >> __u32 magic; /* 0x58465342 to identify xfs */
> >> __u32 type; /* extent map, dir, attr, etc. */
> >> __u64 offset; /* file data offset, if applicable */
> >> __u64 length; /* file data length, if applicable */
> >> };
> >>
> >> (A real XFS implementation might have one structure with the type code
> >> providing for a tagged union or something; I split it into three
> >> separate structs here to avoid confusing things.)
> >
> > The structure of fanotify event as passed to userspace generally is:
> >
> > struct fanotify_event_metadata {
> > __u32 event_len;
> > __u8 vers;
> > __u8 reserved;
> > __u16 metadata_len;
> > __aligned_u64 mask;
> > __s32 fd;
> > __s32 pid;
> > };
> >
> > If event_len is > sizeof(struct fanotify_event_metadata), userspace is
> > expected to look for struct fanotify_event_info_header after struct
> > fanotify_event_metadata. struct fanotify_event_info_header looks like:
> >
> > struct fanotify_event_info_header {
> > __u8 info_type;
> > __u8 pad;
> > __u16 len;
> > };
> >
> > Again if the end of this info (defined by 'len') is smaller than
> > 'event_len', there is next header with next payload of data. So for example
> > error event will have:
> >
> > struct fanotify_event_metadata
> > struct fanotify_event_info_error
> > struct fanotify_event_info_fid
> >
> > Now either we could add fs specific blob into fanotify_event_info_error
> > (but then it would be good to add 'magic' to fanotify_event_info_error now
> > and define that if 'len' is larger, fs-specific blob follows after fixed
> > data) or we can add another info type FAN_EVENT_INFO_TYPE_ERROR_FS_DATA
> > (i.e., attach another structure into the event) which would contain the
> > 'magic' and then blob of data. I don't have strong preference.
>
> In v1 of this patchset [1] I implemented the latter option, a new
> info type that the filesystem could provide as a blob. It was dropped
> at Amir's request to leave it out of the discussion at that moment. Should I
> resuscitate it for the next iteration? I believe it would address XFS's needs.

I don't think it's necessary at this time. We (XFS community) would
have a bit more work to do before we get to the point of needing those
sorts of hooks in upstream. :)

--D

>
> [1] https://lwn.net/ml/linux-fsdevel/[email protected]/
>
> --
> Gabriel Krisman Bertazi

2021-08-25 21:06:31

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v6 09/21] fsnotify: Allow events reported with an empty inode

Amir Goldstein <[email protected]> writes:

> On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> <[email protected]> wrote:
>>
>> Some file system events (i.e. FS_ERROR) might not be associated with an
>> inode. For these, it makes sense to associate them directly with the
>> super block of the file system they apply to. This patch allows the
>> event to be reported with a NULL inode, by recovering the superblock
>> directly from the data field, if needed.
>>
>> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>>
>> --
>> Changes since v5:
>> - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
>> ---
>> fs/notify/fsnotify.c | 16 +++++++++++++---
>> 1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
>> index 30d422b8c0fc..536db02cb26e 100644
>> --- a/fs/notify/fsnotify.c
>> +++ b/fs/notify/fsnotify.c
>> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
>> fsnotify_clear_marks_by_sb(sb);
>> }
>>
>> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
>> +{
>> + struct inode *inode = fsnotify_data_inode(data, data_type);
>> + struct super_block *sb = inode ? inode->i_sb : NULL;
>> +
>> + return sb;
>> +}
>> +
>> /*
>> * Given an inode, first check if we care what happens to our children. Inotify
>> * and dnotify both tell their parents about events. If we care about any event
>> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
>> * @file_name is relative to
>> * @file_name: optional file name associated with event
>> * @inode: optional inode associated with event -
>> - * either @dir or @inode must be non-NULL.
>> - * if both are non-NULL event may be reported to both.
>> + * If @dir and @inode are NULL, @data must have a type that
>> + * allows retrieving the file system associated with this
>
> Irrelevant comment. sb must always be available from @data.
>
>> + * event. if both are non-NULL event may be reported to
>> + * both.
>> * @cookie: inotify rename cookie
>> */
>> int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>> */
>> parent = dir;
>> }
>> - sb = inode->i_sb;
>> + sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
>
> const struct path *path = fsnotify_data_path(data, data_type);
> + const struct super_block *sb = fsnotify_data_sb(data, data_type);
>
> All the games with @data @inode and @dir args are irrelevant to this.
> sb should always be available from @data and it does not matter
> if fsnotify_data_inode() is the same as @inode, @dir or neither.
> All those inodes are anyway on the same sb.

Hi Amir,

I think this is actually necessary. I could identify at least one event
(FS_CREATE | FS_ISDIR) where fsnotify() is invoked with a NULL data field.
In that case, fsnotify_dirent() is called with a negative dentry from
vfs_mkdir(). I'm not sure exactly why the dentry is negative after the
mkdir; perhaps we are racing with a removal? It might be a bug in
fsnotify that this case happens at all, but I'm not sure yet.

The easiest way around it is what I proposed: just use inode->i_sb if
data is NULL. Since, as you said, data, inode and dir are all on the
same superblock, it should work, I think.
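
To make the proposed fallback concrete, here is a minimal userspace
sketch. It is not the kernel code: the model_* types and helpers are
hypothetical stand-ins, and only the order of the checks mirrors the
hunk quoted above (prefer the inode argument, then fall back to data).

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct model_sb { int id; };
struct model_inode { struct model_sb *i_sb; };

/* Same shape as fsnotify_data_sb() in the patch: NULL data gives a NULL sb. */
static struct model_sb *model_data_sb(const struct model_inode *data)
{
	return data ? data->i_sb : NULL;
}

/*
 * Prefer the inode argument, fall back to the directory for dirent
 * events, and consult data only when neither inode is available.
 */
static struct model_sb *model_resolve_sb(const struct model_inode *inode,
					 const struct model_inode *dir,
					 const struct model_inode *data)
{
	if (!inode)
		inode = dir;	/* dirent event: report on the directory */
	return inode ? inode->i_sb : model_data_sb(data);
}
```

With a NULL data field and a valid directory, model_resolve_sb() still
returns the directory's superblock, which is the case described above.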

--
Gabriel Krisman Bertazi

2021-08-25 21:10:27

by Amir Goldstein

Subject: Re: [PATCH v6 09/21] fsnotify: Allow events reported with an empty inode

On Wed, Aug 25, 2021 at 9:40 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Amir Goldstein <[email protected]> writes:
>
> > On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> > <[email protected]> wrote:
> >>
> >> Some file system events (i.e. FS_ERROR) might not be associated with an
> >> inode. For these, it makes sense to associate them directly with the
> >> super block of the file system they apply to. This patch allows the
> >> event to be reported with a NULL inode, by recovering the superblock
> >> directly from the data field, if needed.
> >>
> >> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> >>
> >> --
> >> Changes since v5:
> >> - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
> >> ---
> >> fs/notify/fsnotify.c | 16 +++++++++++++---
> >> 1 file changed, 13 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> >> index 30d422b8c0fc..536db02cb26e 100644
> >> --- a/fs/notify/fsnotify.c
> >> +++ b/fs/notify/fsnotify.c
> >> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
> >> fsnotify_clear_marks_by_sb(sb);
> >> }
> >>
> >> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> >> +{
> >> + struct inode *inode = fsnotify_data_inode(data, data_type);
> >> + struct super_block *sb = inode ? inode->i_sb : NULL;
> >> +
> >> + return sb;
> >> +}
> >> +
> >> /*
> >> * Given an inode, first check if we care what happens to our children. Inotify
> >> * and dnotify both tell their parents about events. If we care about any event
> >> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> >> * @file_name is relative to
> >> * @file_name: optional file name associated with event
> >> * @inode: optional inode associated with event -
> >> - * either @dir or @inode must be non-NULL.
> >> - * if both are non-NULL event may be reported to both.
> >> + * If @dir and @inode are NULL, @data must have a type that
> >> + * allows retrieving the file system associated with this
> >
> > Irrelevant comment. sb must always be available from @data.
> >
> >> + * event. if both are non-NULL event may be reported to
> >> + * both.
> >> * @cookie: inotify rename cookie
> >> */
> >> int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >> */
> >> parent = dir;
> >> }
> >> - sb = inode->i_sb;
> >> + sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
> >
> > const struct path *path = fsnotify_data_path(data, data_type);
> > + const struct super_block *sb = fsnotify_data_sb(data, data_type);
> >
> > All the games with @data @inode and @dir args are irrelevant to this.
> > sb should always be available from @data and it does not matter
> > if fsnotify_data_inode() is the same as @inode, @dir or neither.
> > All those inodes are anyway on the same sb.
>
> Hi Amir,
>
> I think this is actually necessary. I could identify at least one event
> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
> In that case, fsnotify_dirent is called with a negative dentry from
> vfs_mkdir(). I'm not sure why exactly the dentry is negative after the

That doesn't sound right at all.
Are you sure about this?
Which filesystem was this mkdir called on?

> mkdir, but it would be possible we are racing with the file removal, I

No. Both vfs_mkdir() and vfs_rmdir() hold the dir inode lock (on parent).

> guess? It might be a bug in fsnotify that this case even happen, but
> I'm not sure yet.

fsnotify_data_inode() should not be NULL.
fsnotify_handle_inode_event() passes fsnotify_data_inode() to
callbacks like audit_watch_handle_event() that checks
WARN_ON_ONCE(!inode)

>
> The easiest way around it is what I proposed: just use inode->i_sb if
> data is NULL. Since, as you said, data, inode and dir are all on the
> same superblock, it should work, I think.
>

It would be papering over another issue.
I don't mind if we use inode->i_sb as long as we understand the reason
for what you are reporting.

Please provide more information.

Thanks,
Amir.

2021-08-25 21:52:28

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v6 09/21] fsnotify: Allow events reported with an empty inode

Amir Goldstein <[email protected]> writes:

> On Wed, Aug 25, 2021 at 9:40 PM Gabriel Krisman Bertazi
> <[email protected]> wrote:
>>
>> Amir Goldstein <[email protected]> writes:
>>
>> > On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
>> > <[email protected]> wrote:
>> >>
>> >> Some file system events (i.e. FS_ERROR) might not be associated with an
>> >> inode. For these, it makes sense to associate them directly with the
>> >> super block of the file system they apply to. This patch allows the
>> >> event to be reported with a NULL inode, by recovering the superblock
>> >> directly from the data field, if needed.
>> >>
>> >> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>> >>
>> >> --
>> >> Changes since v5:
>> >> - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
>> >> ---
>> >> fs/notify/fsnotify.c | 16 +++++++++++++---
>> >> 1 file changed, 13 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
>> >> index 30d422b8c0fc..536db02cb26e 100644
>> >> --- a/fs/notify/fsnotify.c
>> >> +++ b/fs/notify/fsnotify.c
>> >> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
>> >> fsnotify_clear_marks_by_sb(sb);
>> >> }
>> >>
>> >> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
>> >> +{
>> >> + struct inode *inode = fsnotify_data_inode(data, data_type);
>> >> + struct super_block *sb = inode ? inode->i_sb : NULL;
>> >> +
>> >> + return sb;
>> >> +}
>> >> +
>> >> /*
>> >> * Given an inode, first check if we care what happens to our children. Inotify
>> >> * and dnotify both tell their parents about events. If we care about any event
>> >> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
>> >> * @file_name is relative to
>> >> * @file_name: optional file name associated with event
>> >> * @inode: optional inode associated with event -
>> >> - * either @dir or @inode must be non-NULL.
>> >> - * if both are non-NULL event may be reported to both.
>> >> + * If @dir and @inode are NULL, @data must have a type that
>> >> + * allows retrieving the file system associated with this
>> >
>> > Irrelevant comment. sb must always be available from @data.
>> >
>> >> + * event. if both are non-NULL event may be reported to
>> >> + * both.
>> >> * @cookie: inotify rename cookie
>> >> */
>> >> int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>> >> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>> >> */
>> >> parent = dir;
>> >> }
>> >> - sb = inode->i_sb;
>> >> + sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
>> >
>> > const struct path *path = fsnotify_data_path(data, data_type);
>> > + const struct super_block *sb = fsnotify_data_sb(data, data_type);
>> >
>> > All the games with @data @inode and @dir args are irrelevant to this.
>> > sb should always be available from @data and it does not matter
>> > if fsnotify_data_inode() is the same as @inode, @dir or neither.
>> > All those inodes are anyway on the same sb.
>>
>> Hi Amir,
>>
>> I think this is actually necessary. I could identify at least one event
>> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
>> In that case, fsnotify_dirent is called with a negative dentry from
>> vfs_mkdir(). I'm not sure why exactly the dentry is negative after the
>
> That doesn't sound right at all.
> Are you sure about this?
> Which filesystem was this mkdir called on?

You should be able to reproduce it on top of mainline if you pick only this
patch and do the change you suggested:

- sb = inode->i_sb;
+ sb = fsnotify_data_sb(data, data_type);

And then boot Debian stable with systemd. The notification happens on
the cgroup pseudo-filesystem (/sys/fs/cgroup), which is being monitored
by systemd itself. The event that arrives with NULL data is telling the
directory /sys/fs/cgroup/*/ about the creation of the directory
`init.scope`.

The change above triggers the following NULL dereference of struct
super_block, since data is NULL.

I will keep looking but you might be able to answer it immediately...

fsnotify was called with:
data_type=2
mask=40000100
data=0
name=init.scope

The code looks like this:

fsnotify_mkdir(dir, dentry) {
	fsnotify_dirent(dir, dentry, FS_CREATE | FS_ISDIR) {
		fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0) {
			fsnotify(mask, child, FSNOTIFY_EVENT_INODE, dir, name, NULL, cookie);
		}
	}
}

The entire log:

BUG: kernel NULL pointer dereference, address: 00000000000003a8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 3 PID: 1 Comm: systemd Not tainted 5.14.0-rc5- #279
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
RIP: 0010:fsnotify+0x103/0x5e0
Code: 84 c2 04 00 00 83 f8 02 0f 85 c6 04 00 00 48 8b 04 24 31 db 48 85 c0 0f 84 b9 04 00 00 48 8b 48 28 48 85 c9 0f 84 ac 04 00 00 <48> 83 b9 a8 03 00 00 00 0f 84 e3 03 00 00 8b 81 a0 03 00 00 48 85
RSP: 0018:ffffaff800013e18 EFLAGS: 00010246
RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffaff800013ca8 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000ffffefff
R10: 0000000000000001 R11: ffffaff800013c48 R12: ffff9bbb80467778
R13: 0000000040000100 R14: 00000000000001ed R15: 0000000000000000
FS: 00007f2e4d2c4900(0000) GS:ffff9bbbbbd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000003a8 CR3: 0000000100054000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? cgroup_kn_unlock+0x33/0x80
? cgroup_mkdir+0x13e/0x410
vfs_mkdir+0x16e/0x1d0
do_mkdirat+0x8c/0x100
do_syscall_64+0x3a/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f2e4da91b07
Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc02877ad8 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
RAX: ffffffffffffffda RBX: 00007ffc02877b50 RCX: 00007f2e4da91b07
RDX: 00007ffc02877980 RSI: 00000000000001ed RDI: 00005646df5f01d0
RBP: 00005646de9ed770 R08: 0000000000000001 R09: 0000000000000000
R10: 00005646df5f01c0 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000000 R14: 00007ffc02877b50 R15: 00007ffc02877c60
Modules linked in:
CR2: 00000000000003a8
---[ end trace 4642e1d1df9669cb ]---

>
>> mkdir, but it would be possible we are racing with the file removal, I
>
> No. Both vfs_mkdir() and vfs_rmdir() hold the dir inode lock (on
> parent).
>
>> guess? It might be a bug in fsnotify that this case even happen, but
>> I'm not sure yet.
>
> fsnotify_data_inode() should not be NULL.
> fsnotify_handle_inode_event() passes fsnotify_data_inode() to
> callbacks like audit_watch_handle_event() that checks
> WARN_ON_ONCE(!inode)
>
>>
>> The easiest way around it is what I proposed: just use inode->i_sb if
>> data is NULL. Since, as you said, data, inode and dir are all on the
>> same superblock, it should work, I think.
>>
>
> It would be papering over another issue.
> I don't mind if we use inode->i_sb as long as we understand the reason
> for what you are reporting.
>
> Please provide more information.



--
Gabriel Krisman Bertazi

2021-08-26 10:46:31

by Amir Goldstein

Subject: Re: [PATCH v6 09/21] fsnotify: Allow events reported with an empty inode

On Thu, Aug 26, 2021 at 12:50 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Amir Goldstein <[email protected]> writes:
>
> > On Wed, Aug 25, 2021 at 9:40 PM Gabriel Krisman Bertazi
> > <[email protected]> wrote:
> >>
> >> Amir Goldstein <[email protected]> writes:
> >>
> >> > On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> >> > <[email protected]> wrote:
> >> >>
> >> >> Some file system events (i.e. FS_ERROR) might not be associated with an
> >> >> inode. For these, it makes sense to associate them directly with the
> >> >> super block of the file system they apply to. This patch allows the
> >> >> event to be reported with a NULL inode, by recovering the superblock
> >> >> directly from the data field, if needed.
> >> >>
> >> >> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> >> >>
> >> >> --
> >> >> Changes since v5:
> >> >> - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
> >> >> ---
> >> >> fs/notify/fsnotify.c | 16 +++++++++++++---
> >> >> 1 file changed, 13 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> >> >> index 30d422b8c0fc..536db02cb26e 100644
> >> >> --- a/fs/notify/fsnotify.c
> >> >> +++ b/fs/notify/fsnotify.c
> >> >> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
> >> >> fsnotify_clear_marks_by_sb(sb);
> >> >> }
> >> >>
> >> >> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> >> >> +{
> >> >> + struct inode *inode = fsnotify_data_inode(data, data_type);
> >> >> + struct super_block *sb = inode ? inode->i_sb : NULL;
> >> >> +
> >> >> + return sb;
> >> >> +}
> >> >> +
> >> >> /*
> >> >> * Given an inode, first check if we care what happens to our children. Inotify
> >> >> * and dnotify both tell their parents about events. If we care about any event
> >> >> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> >> >> * @file_name is relative to
> >> >> * @file_name: optional file name associated with event
> >> >> * @inode: optional inode associated with event -
> >> >> - * either @dir or @inode must be non-NULL.
> >> >> - * if both are non-NULL event may be reported to both.
> >> >> + * If @dir and @inode are NULL, @data must have a type that
> >> >> + * allows retrieving the file system associated with this
> >> >
> >> > Irrelevant comment. sb must always be available from @data.
> >> >
> >> >> + * event. if both are non-NULL event may be reported to
> >> >> + * both.
> >> >> * @cookie: inotify rename cookie
> >> >> */
> >> >> int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >> >> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >> >> */
> >> >> parent = dir;
> >> >> }
> >> >> - sb = inode->i_sb;
> >> >> + sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
> >> >
> >> > const struct path *path = fsnotify_data_path(data, data_type);
> >> > + const struct super_block *sb = fsnotify_data_sb(data, data_type);
> >> >
> >> > All the games with @data @inode and @dir args are irrelevant to this.
> >> > sb should always be available from @data and it does not matter
> >> > if fsnotify_data_inode() is the same as @inode, @dir or neither.
> >> > All those inodes are anyway on the same sb.
> >>
> >> Hi Amir,
> >>
> >> I think this is actually necessary. I could identify at least one event
> >> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
> >> In that case, fsnotify_dirent is called with a negative dentry from
> >> vfs_mkdir(). I'm not sure why exactly the dentry is negative after the
> >
> > That doesn't sound right at all.
> > Are you sure about this?
> > Which filesystem was this mkdir called on?
>
> You should be able to reproduce it on top of mainline if you pick only this
> patch and do the change you suggested:
>
> - sb = inode->i_sb;
> + sb = fsnotify_data_sb(data, data_type);
>
> And then boot a Debian stable with systemd. The notification happens on
> the cgroup pseudo-filesystem (/sys/fs/cgroup), which is being monitored
> by systemd itself. The event that arrives with a NULL data is telling the
> directory /sys/fs/cgroup/*/ about the creation of directory
> `init.scope`.
>
> The change above triggers the following null dereference of struct
> super_block, since data is NULL.
>
> I will keep looking but you might be able to answer it immediately...

Yes, I see what is going on.

cgroupfs is a sort of kernfs and kernfs_iop_mkdir() does not instantiate
the negative dentry. Instead, kernfs_dop_revalidate() always invalidates
negative dentries to force re-lookup to find the inode.
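
A minimal userspace sketch of the difference, assuming simplified
stand-in types (the model_* names are hypothetical, not kernel API):
one mkdir attaches the new inode to the dentry, as d_instantiate()
would, and the other leaves the dentry negative, as described for
kernfs above. The hook then sees NULL data in the second case.

```c
#include <assert.h>
#include <stddef.h>

struct model_inode { unsigned long i_ino; };
/* A NULL d_inode models a negative dentry. */
struct model_dentry { struct model_inode *d_inode; };

static struct model_inode *model_d_inode(const struct model_dentry *d)
{
	return d->d_inode;
}

/* Typical filesystem: attach the new inode before returning. */
static int model_mkdir_instantiate(struct model_dentry *d,
				   struct model_inode *new_inode)
{
	d->d_inode = new_inode;	/* stands in for d_instantiate() */
	return 0;
}

/* kernfs-like: the directory exists, but the dentry stays negative. */
static int model_mkdir_lazy(struct model_dentry *d,
			    struct model_inode *new_inode)
{
	(void)d;
	(void)new_inode;	/* found again by a later lookup */
	return 0;
}

/* What a create hook would pass as the event data. */
static const void *model_event_data(const struct model_dentry *d)
{
	return model_d_inode(d);
}
```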

Documentation/filesystems/vfs.rst says on create() and friends:
"...you will probably call d_instantiate() with the dentry and the
newly created inode..."

So this behavior seems legit.
Meaning that we have made a wrong assumption in fsnotify_create()
and fsnotify_mkdir().
Please note the comment above fsnotify_link() which anticipates
negative dentries.

I've audited the fsnotify backends and it seems that the
WARN_ON(!inode) in kernel/audit_* is the only immediate implication
of negative dentry with FS_CREATE.
I am the one who added these WARN_ON(), so I will remove them.
I think that missing inode in an FS_CREATE event really breaks
audit on kernfs, but not sure if that is a valid use case (Paul?).

Anyway, regarding your patch, I still prefer the solution proposed by Jan,
but with a different implementation of fsnotify_data_sb().

Please see branch fsnotify_data_sb[1] with the proposed fixes.
The fixes assert the statement that "sb should always be available
from @data", regardless of kernfs anomaly.

If this works for you, please prepend those patches to your next
submission.

Regarding the state of this patch set in general, I must admit that
I wasn't able to follow whether a conclusion was reached about the lifetime
management of fsnotify_error_event and the associated sb mark.
Jan is going on vacation and I think there is little point in spinning
another patch set revision before this issue is settled with Jan.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fsnotify_data_sb

2021-08-27 02:29:46

by Paul Moore

Subject: Re: [PATCH v6 09/21] fsnotify: Allow events reported with an empty inode

On Thu, Aug 26, 2021 at 6:45 AM Amir Goldstein <[email protected]> wrote:
> On Thu, Aug 26, 2021 at 12:50 AM Gabriel Krisman Bertazi
> <[email protected]> wrote:
> >
> > Amir Goldstein <[email protected]> writes:
> >
> > > On Wed, Aug 25, 2021 at 9:40 PM Gabriel Krisman Bertazi
> > > <[email protected]> wrote:
> > >>
> > >> Amir Goldstein <[email protected]> writes:
> > >>
> > >> > On Fri, Aug 13, 2021 at 12:41 AM Gabriel Krisman Bertazi
> > >> > <[email protected]> wrote:
> > >> >>
> > >> >> Some file system events (i.e. FS_ERROR) might not be associated with an
> > >> >> inode. For these, it makes sense to associate them directly with the
> > >> >> super block of the file system they apply to. This patch allows the
> > >> >> event to be reported with a NULL inode, by recovering the superblock
> > >> >> directly from the data field, if needed.
> > >> >>
> > >> >> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> > >> >>
> > >> >> --
> > >> >> Changes since v5:
> > >> >> - add fsnotify_data_sb handle to retrieve sb from the data field. (jan)
> > >> >> ---
> > >> >> fs/notify/fsnotify.c | 16 +++++++++++++---
> > >> >> 1 file changed, 13 insertions(+), 3 deletions(-)
> > >> >>
> > >> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> > >> >> index 30d422b8c0fc..536db02cb26e 100644
> > >> >> --- a/fs/notify/fsnotify.c
> > >> >> +++ b/fs/notify/fsnotify.c
> > >> >> @@ -98,6 +98,14 @@ void fsnotify_sb_delete(struct super_block *sb)
> > >> >> fsnotify_clear_marks_by_sb(sb);
> > >> >> }
> > >> >>
> > >> >> +static struct super_block *fsnotify_data_sb(const void *data, int data_type)
> > >> >> +{
> > >> >> + struct inode *inode = fsnotify_data_inode(data, data_type);
> > >> >> + struct super_block *sb = inode ? inode->i_sb : NULL;
> > >> >> +
> > >> >> + return sb;
> > >> >> +}
> > >> >> +
> > >> >> /*
> > >> >> * Given an inode, first check if we care what happens to our children. Inotify
> > >> >> * and dnotify both tell their parents about events. If we care about any event
> > >> >> @@ -455,8 +463,10 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> > >> >> * @file_name is relative to
> > >> >> * @file_name: optional file name associated with event
> > >> >> * @inode: optional inode associated with event -
> > >> >> - * either @dir or @inode must be non-NULL.
> > >> >> - * if both are non-NULL event may be reported to both.
> > >> >> + * If @dir and @inode are NULL, @data must have a type that
> > >> >> + * allows retrieving the file system associated with this
> > >> >
> > >> > Irrelevant comment. sb must always be available from @data.
> > >> >
> > >> >> + * event. if both are non-NULL event may be reported to
> > >> >> + * both.
> > >> >> * @cookie: inotify rename cookie
> > >> >> */
> > >> >> int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> > >> >> @@ -483,7 +493,7 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> > >> >> */
> > >> >> parent = dir;
> > >> >> }
> > >> >> - sb = inode->i_sb;
> > >> >> + sb = inode ? inode->i_sb : fsnotify_data_sb(data, data_type);
> > >> >
> > >> > const struct path *path = fsnotify_data_path(data, data_type);
> > >> > + const struct super_block *sb = fsnotify_data_sb(data, data_type);
> > >> >
> > >> > All the games with @data @inode and @dir args are irrelevant to this.
> > >> > sb should always be available from @data and it does not matter
> > >> > if fsnotify_data_inode() is the same as @inode, @dir or neither.
> > >> > All those inodes are anyway on the same sb.
> > >>
> > >> Hi Amir,
> > >>
> > >> I think this is actually necessary. I could identify at least one event
> > >> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
> > >> In that case, fsnotify_dirent is called with a negative dentry from
> > >> vfs_mkdir(). I'm not sure why exactly the dentry is negative after the
> > >
> > > That doesn't sound right at all.
> > > Are you sure about this?
> > > Which filesystem was this mkdir called on?
> >
> > You should be able to reproduce it on top of mainline if you pick only this
> > patch and do the change you suggested:
> >
> > - sb = inode->i_sb;
> > + sb = fsnotify_data_sb(data, data_type);
> >
> > And then boot a Debian stable with systemd. The notification happens on
> > the cgroup pseudo-filesystem (/sys/fs/cgroup), which is being monitored
> > by systemd itself. The event that arrives with a NULL data is telling the
> > directory /sys/fs/cgroup/*/ about the creation of directory
> > `init.scope`.
> >
> > The change above triggers the following null dereference of struct
> > super_block, since data is NULL.
> >
> > I will keep looking but you might be able to answer it immediately...
>
> Yes, I see what is going on.
>
> cgroupfs is a sort of kernfs and kernfs_iop_mkdir() does not instantiate
> the negative dentry. Instead, kernfs_dop_revalidate() always invalidates
> negative dentries to force re-lookup to find the inode.
>
> Documentation/filesystems/vfs.rst says on create() and friends:
> "...you will probably call d_instantiate() with the dentry and the
> newly created inode..."
>
> So this behavior seems legit.
> Meaning that we have made a wrong assumption in fsnotify_create()
> and fsnotify_mkdir().
> Please note the comment above fsnotify_link() which anticipates
> negative dentries.
>
> I've audited the fsnotify backends and it seems that the
> WARN_ON(!inode) in kernel/audit_* is the only immediate implication
> of negative dentry with FS_CREATE.
> I am the one who added these WARN_ON(), so I will remove them.
> I think that missing inode in an FS_CREATE event really breaks
> audit on kernfs, but not sure if that is a valid use case (Paul?).

While it is tempting to ignore kernfs from an audit filesystem watch
perspective, I can see admins potentially wanting to watch
kernfs/cgroupfs/other-config-pseudofs to detect who is potentially
playing with the system config. Arguably the most important config
changes would already be audited if they were security relevant, but I
could also see an admin wanting to watch for *any* changes so it's
probably best not to rule out a kernfs based watch right now.

I'm sure I'm missing some details, but from what I gather from the
portion of the thread that I'm seeing, it looks like the audit issue
lies in audit_mark_handle_event() and audit_watch_handle_event(). In
both cases it looks like the functions are at least safe with a NULL
inode pointer, even with the WARN_ON() removed; the problem being that
the mark and watch will not be updated with the device and inode
number which means the audit filters based on those marks/watches will
not trigger. Is that about right or did I read the thread and code a
bit too quickly?
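
To illustrate the reading above, here is a small userspace sketch. The
model_* names and MODEL_UNSET are hypothetical and do not come from the
audit code; the sketch only shows why a handler that tolerates a NULL
inode can still leave a watch that never matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_UNSET ((unsigned long)-1)	/* placeholder for "not known yet" */

struct model_inode { unsigned long dev; unsigned long ino; };
struct model_watch { unsigned long dev; unsigned long ino; };

/* Create handler: safe with a NULL inode, but then nothing is recorded. */
static void model_watch_on_create(struct model_watch *w,
				  const struct model_inode *inode)
{
	if (!inode)
		return;		/* watch keeps its placeholder values */
	w->dev = inode->dev;
	w->ino = inode->ino;
}

/* A filter only triggers when device and inode number both match. */
static bool model_watch_matches(const struct model_watch *w,
				const struct model_inode *inode)
{
	return w->dev == inode->dev && w->ino == inode->ino;
}
```

A watch updated with a NULL inode never matches the created object; the
same watch updated with the real inode does.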

Working under the assumption that the above is close enough to
correct, that is a bit of a problem as it means audit can't
effectively watch kernfs based filesystems, although it sounds like it
wasn't really working properly to begin with, yes? Before I start
thinking too hard about this, does anyone already have a great idea to
fix this that they want to share?

--
paul moore
http://www.paul-moore.com

2021-08-27 18:18:55

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v6 15/21] fanotify: Preallocate per superblock mark error event

Jan Kara <[email protected]> writes:

> On Thu 12-08-21 17:40:04, Gabriel Krisman Bertazi wrote:
>> Error reporting needs to be done in an atomic context. This patch
>> introduces a single error slot for superblock marks that report the
>> FAN_FS_ERROR event, to be used during event submission.
>>
>> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>>
>> ---
>> Changes v5:
>> - Restore mark references. (jan)
>> - Tie fee slot to the mark lifetime.(jan)
>> - Don't reallocate event(jan)
>> ---
>> fs/notify/fanotify/fanotify.c | 12 ++++++++++++
>> fs/notify/fanotify/fanotify.h | 13 +++++++++++++
>> fs/notify/fanotify/fanotify_user.c | 31 ++++++++++++++++++++++++++++--
>> 3 files changed, 54 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
>> index ebb6c557cea1..3bf6fd85c634 100644
>> --- a/fs/notify/fanotify/fanotify.c
>> +++ b/fs/notify/fanotify/fanotify.c
>> @@ -855,6 +855,14 @@ static void fanotify_free_name_event(struct fanotify_event *event)
>> kfree(FANOTIFY_NE(event));
>> }
>>
>> +static void fanotify_free_error_event(struct fanotify_event *event)
>> +{
>> + /*
>> + * The actual event is tied to a mark, and is released on mark
>> + * removal
>> + */
>> +}
>> +
>
> I was pondering about the lifetime rules some more. This is also related to
> patch 16/21 but I'll comment here. When we hold mark ref from queued event,
> we introduce a subtle race into group destruction logic. There we first
> evict all marks, wait for them to be destroyed by worker thread after SRCU
> period expires, and then we remove queued events. When we hold mark
> reference from an event we break this as mark will exist until the event is
> dequeued and then group can get freed before we actually free the mark and
> so mark freeing can hit use-after-free issues.
>
> So we'll have to do this a bit differently. I have two options:
>
> 1) Instead of preallocating events explicitly like this, we could set up a
> mempool to allocate error events from for each notification group. We would
> resize the mempool when adding error mark so that it has as many reserved
> events as error marks. Upside is error events will be much less special -
> no special lifetime rules. We'd just need to setup & resize the mempool. We
> would also have to provide proper merge function for error events (to merge
> events from the same sb). Also there will be limitation of number of error
> marks per group because mempools use kmalloc() for an array tracking
> reserved events. But we could certainly manage 512, likely 1024 error marks
> per notification group.
>
> 2) We would keep attaching event to mark as currently. As far as I have
> checked the event doesn't actually need a back-ref to sb_mark. It is
> really only used for mark reference taking (and then to get to sb from
> fanotify_handle_error_event() but we can certainly get to sb by easier
> means there). So I would just remove that. What we still need to know in
> fanotify_free_error_event() though is whether the sb_mark is still alive or
> not. If it is alive, we leave the event alone, otherwise we need to free it.
> So we need a mark_alive flag in the error event and then do in ->freeing_mark
> callback something like:
>
> 	if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
> 		struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);
>
> ###		/* Maybe we could use mark->lock for this? */
> 		spin_lock(&group->notification_lock);
> 		if (fa_mark->fee_slot) {
> 			if (list_empty(&fa_mark->fee_slot->fae.fse.list)) {
> 				kfree(fa_mark->fee_slot);
> 				fa_mark->fee_slot = NULL;
> 			} else {
> 				fa_mark->fee_slot->mark_alive = 0;
> 			}
> 		}
> 		spin_unlock(&group->notification_lock);
> 	}
>
> And then when queueing and dequeueing event we would have to carefully
> check what is the mark & event state under appropriate lock (because
> ->handle_event() callbacks can see marks on the way to be destroyed as they
> are protected just by SRCU).

Thanks for the review. That is indeed a subtle race that I hadn't
noticed.

Option 2 is much more straightforward. Since the uABI won't change if we
later decide to switch to option 1, I gave option 2 a try. I should be
able to prepare a new version that leaves the error event with a weak
association to the mark, without the back reference, and lets it be
freed by whichever comes last, the dequeue or ->freeing_mark, as you
suggested.
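
As a rough illustration of that rule, here is a single-threaded
userspace sketch. All model_* names are hypothetical, and the locking
the real code would need (for example group->notification_lock) is
left out; clearing the slot in both branches is a modeling choice, not
something taken from the snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_event { bool queued; bool mark_alive; };
struct model_mark { struct model_event *slot; };

/* Models ->freeing_mark. Returns true if the event was freed here. */
static bool model_freeing_mark(struct model_mark *mark)
{
	struct model_event *ev = mark->slot;

	mark->slot = NULL;	/* no back reference survives the mark */
	if (!ev)
		return false;
	if (!ev->queued) {
		free(ev);	/* nothing else can still see the event */
		return true;
	}
	ev->mark_alive = false;	/* the dequeue side frees it later */
	return false;
}

/* Models the end of a dequeue. Returns true if the event was freed here. */
static bool model_dequeue_done(struct model_event *ev)
{
	ev->queued = false;
	if (!ev->mark_alive) {
		free(ev);	/* mark already gone: dequeue came last */
		return true;
	}
	return false;		/* mark still owns the slot for reuse */
}
```

Whichever of the two calls runs last is the one that frees the event,
so it is released exactly once in either order.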

--
Gabriel Krisman Bertazi

2021-09-02 21:27:23

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v6 15/21] fanotify: Preallocate per superblock mark error event

Gabriel Krisman Bertazi <[email protected]> writes:

> Jan Kara <[email protected]> writes:
>
>> On Thu 12-08-21 17:40:04, Gabriel Krisman Bertazi wrote:
>>> Error reporting needs to be done in an atomic context. This patch
>>> introduces a single error slot for superblock marks that report the
>>> FAN_FS_ERROR event, to be used during event submission.
>>>
>>> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>>>
>>> ---
>>> Changes v5:
>>> - Restore mark references. (jan)
>>> - Tie fee slot to the mark lifetime.(jan)
>>> - Don't reallocate event(jan)
>>> ---
>>> fs/notify/fanotify/fanotify.c | 12 ++++++++++++
>>> fs/notify/fanotify/fanotify.h | 13 +++++++++++++
>>> fs/notify/fanotify/fanotify_user.c | 31 ++++++++++++++++++++++++++++--
>>> 3 files changed, 54 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
>>> index ebb6c557cea1..3bf6fd85c634 100644
>>> --- a/fs/notify/fanotify/fanotify.c
>>> +++ b/fs/notify/fanotify/fanotify.c
>>> @@ -855,6 +855,14 @@ static void fanotify_free_name_event(struct fanotify_event *event)
>>> kfree(FANOTIFY_NE(event));
>>> }
>>>
>>> +static void fanotify_free_error_event(struct fanotify_event *event)
>>> +{
>>> + /*
>>> + * The actual event is tied to a mark, and is released on mark
>>> + * removal
>>> + */
>>> +}
>>> +
>>
>> I was pondering about the lifetime rules some more. This is also related to
>> patch 16/21 but I'll comment here. When we hold mark ref from queued event,
>> we introduce a subtle race into group destruction logic. There we first
>> evict all marks, wait for them to be destroyed by worker thread after SRCU
>> period expires, and then we remove queued events. When we hold mark
>> reference from an event we break this as mark will exist until the event is
>> dequeued and then group can get freed before we actually free the mark and
>> so mark freeing can hit use-after-free issues.
>>
>> So we'll have to do this a bit differently. I have two options:
>>
>> 1) Instead of preallocating events explicitely like this, we could setup a
>> mempool to allocate error events from for each notification group. We would
>> resize the mempool when adding error mark so that it has as many reserved
>> events as error marks. Upside is error events will be much less special -
>> no special lifetime rules. We'd just need to setup & resize the mempool. We
>> would also have to provide proper merge function for error events (to merge
>> events from the same sb). Also there will be limitation of number of error
>> marks per group because mempools use kmalloc() for an array tracking
>> reserved events. But we could certainly manage 512, likely 1024 error marks
>> per notification group.
>>
>> 2) We would keep attaching event to mark as currently. As far as I have
>> checked the event doesn't actually need a back-ref to sb_mark. It is
>> really only used for mark reference taking (and then to get to sb from
>> fanotify_handle_error_event() but we can certainly get to sb by easier
>> means there). So I would just remove that. What we still need to know in
>> fanotify_free_error_event() though is whether the sb_mark is still alive or
>> not. If it is alive, we leave the event alone, otherwise we need to free it.
>> So we need a mark_alive flag in the error event and then do in ->freeing_mark
>> callback something like:
>>
>> if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
>> struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);
>>
>> ### /* Maybe we could use mark->lock for this? */
>> spin_lock(&group->notification_lock);
>> if (fa_mark->fee_slot) {
>> if (list_empty(&fa_mark->fee_slot->fae.fse.list)) {
>> kfree(fa_mark->fee_slot);
>> fa_mark->fee_slot = NULL;
>> } else {
>> fa_mark->fee_slot->mark_alive = 0;
>> }
>> }
>> spin_unlock(&group->notification_lock);
>> }
>>
>> And then when queueing and dequeueing event we would have to carefully
>> check what is the mark & event state under appropriate lock (because
>> ->handle_event() callbacks can see marks on the way to be destroyed as they
>> are protected just by SRCU).
>
> Thanks for the review. That is indeed a subtle race that I hadn't
> noticed.
>
> Option 2 is much more straightforward. And considering the uABI won't
> be changed if we decide to change to option 1 later, I gave that a try
> and should be able to prepare a new version that leaves the error event
> with a weak association to the mark, without the back reference, and
> allowing it to be deleted by the latest between dequeue and
> ->freeing_mark, as you suggested.

Actually, I don't think this will work for insertion unless we keep a
bounce buffer for the file_handle, because we need to keep the
group->notification_lock to ensure the fee doesn't go away with the mark
(since it is not yet enqueued) but, as discussed before, we don't want
to hold that lock when generating the FH.

I think the correct way is to have some sort of refcount of the error
event slot. We could use err_count for that and change the suggestion
above to:

	if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
		struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);

		spin_lock(&group->notification_lock);
		if (fa_mark->fee_slot) {
			if (!fa_mark->fee_slot->err_count) {
				kfree(fa_mark->fee_slot);
				fa_mark->fee_slot = NULL;
			} else {
				fa_mark->fee_slot->mark_alive = 0;
			}
		}
		spin_unlock(&group->notification_lock);
	}

And insertion would look like this:

static int fanotify_handle_error_event(....) {

	spin_lock(&group->notification_lock);

	if (!mark->fee || mark->fee->err_count++) {
		spin_unlock(&group->notification_lock);
		return 0;
	}

	spin_unlock(&group->notification_lock);

	mark->fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;

	/* ... Write report data to error event ... */

	fanotify_encode_fh(&mark->fee->object_fh, fanotify_encode_fh_len(inode),
			   NULL, 0);

	fsnotify_add_event(group, &mark->fee->fae.fse, NULL);
}

Unless you think this is too hack-ish.

To be fair, I think it is hack-ish. I would add a proper refcount_t
to the error event, and let the mark own a reference to it, which is
dropped when the mark goes away. Enqueue and Dequeue will acquire and
drop references, respectively. In this case, err_count is not
overloaded.

Will it work?

--
Gabriel Krisman Bertazi
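
The refcount proposal at the end of the message above can be sketched as a
small userspace model. This is only an illustration under stated
assumptions, not code from the series: every name here (err_slot,
slot_report, slot_dequeue, slot_put, slots_freed) is hypothetical, a plain
int stands in for the kernel's refcount_t, and the model is single-threaded,
so the locking that the real code would need is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names come from the patch. */
struct err_slot {
	int refs;	/* plain int standing in for refcount_t */
	int err_count;	/* errors reported since the last dequeue */
	bool queued;	/* true while the slot is on the notification queue */
};

static int slots_freed;	/* bookkeeping so the final free is observable */

/* Allocate a slot; the initial reference belongs to the mark. */
static struct err_slot *slot_alloc(void)
{
	struct err_slot *s = calloc(1, sizeof(*s));

	if (s)
		s->refs = 1;
	return s;
}

/* Drop one reference; the last holder frees the slot. */
static void slot_put(struct err_slot *s)
{
	assert(s->refs > 0);
	if (--s->refs == 0) {
		free(s);
		slots_freed++;
	}
}

/*
 * Report an error. Only the first report since the last dequeue queues
 * the slot and takes the queue's reference; later reports just bump
 * err_count. Returns true when the slot was newly queued.
 */
static bool slot_report(struct err_slot *s)
{
	if (s->err_count++ > 0)
		return false;
	s->refs++;
	s->queued = true;
	return true;
}

/* Consume the queued event: return the error count, drop the queue's ref. */
static int slot_dequeue(struct err_slot *s)
{
	int n = s->err_count;

	assert(s->queued);
	s->err_count = 0;
	s->queued = false;
	slot_put(s);	/* may free the slot if the mark is already gone */
	return n;
}
```

In this model the mark going away is just slot_put() on the mark's
reference. If an event is still queued, the slot survives until
slot_dequeue() drops the last reference, and err_count keeps its original
meaning instead of doubling as a lifetime flag.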

2021-09-03 04:19:07

by Amir Goldstein

Subject: Re: [PATCH v6 15/21] fanotify: Preallocate per superblock mark error event

On Fri, Sep 3, 2021 at 12:24 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Gabriel Krisman Bertazi <[email protected]> writes:
>
> > Jan Kara <[email protected]> writes:
> >
> >> On Thu 12-08-21 17:40:04, Gabriel Krisman Bertazi wrote:
> >>> Error reporting needs to be done in an atomic context. This patch
> >>> introduces a single error slot for superblock marks that report the
> >>> FAN_FS_ERROR event, to be used during event submission.
> >>>
> >>> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> >>>
> >>> ---
> >>> Changes v5:
> >>> - Restore mark references. (jan)
> >>> - Tie fee slot to the mark lifetime.(jan)
> >>> - Don't reallocate event(jan)
> >>> ---
> >>> fs/notify/fanotify/fanotify.c | 12 ++++++++++++
> >>> fs/notify/fanotify/fanotify.h | 13 +++++++++++++
> >>> fs/notify/fanotify/fanotify_user.c | 31 ++++++++++++++++++++++++++++--
> >>> 3 files changed, 54 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> >>> index ebb6c557cea1..3bf6fd85c634 100644
> >>> --- a/fs/notify/fanotify/fanotify.c
> >>> +++ b/fs/notify/fanotify/fanotify.c
> >>> @@ -855,6 +855,14 @@ static void fanotify_free_name_event(struct fanotify_event *event)
> >>> kfree(FANOTIFY_NE(event));
> >>> }
> >>>
> >>> +static void fanotify_free_error_event(struct fanotify_event *event)
> >>> +{
> >>> + /*
> >>> + * The actual event is tied to a mark, and is released on mark
> >>> + * removal
> >>> + */
> >>> +}
> >>> +
> >>
> >> I was pondering about the lifetime rules some more. This is also related to
> >> patch 16/21 but I'll comment here. When we hold mark ref from queued event,
> >> we introduce a subtle race into group destruction logic. There we first
> >> evict all marks, wait for them to be destroyed by worker thread after SRCU
> >> period expires, and then we remove queued events. When we hold mark
> >> reference from an event we break this as mark will exist until the event is
> >> dequeued and then group can get freed before we actually free the mark and
> >> so mark freeing can hit use-after-free issues.
> >>
> >> So we'll have to do this a bit differently. I have two options:
> >>
> >> 1) Instead of preallocating events explicitely like this, we could setup a
> >> mempool to allocate error events from for each notification group. We would
> >> resize the mempool when adding error mark so that it has as many reserved
> >> events as error marks. Upside is error events will be much less special -
> >> no special lifetime rules. We'd just need to setup & resize the mempool. We
> >> would also have to provide proper merge function for error events (to merge
> >> events from the same sb). Also there will be limitation of number of error
> >> marks per group because mempools use kmalloc() for an array tracking
> >> reserved events. But we could certainly manage 512, likely 1024 error marks
> >> per notification group.
> >>
> >> 2) We would keep attaching event to mark as currently. As far as I have
> >> checked the event doesn't actually need a back-ref to sb_mark. It is
> >> really only used for mark reference taking (and then to get to sb from
> >> fanotify_handle_error_event() but we can certainly get to sb by easier
> >> means there). So I would just remove that. What we still need to know in
> >> fanotify_free_error_event() though is whether the sb_mark is still alive or
> >> not. If it is alive, we leave the event alone, otherwise we need to free it.
> >> So we need a mark_alive flag in the error event and then do in ->freeing_mark
> >> callback something like:
> >>
> >> if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
> >> struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);
> >>
> >> ### /* Maybe we could use mark->lock for this? */
> >> spin_lock(&group->notification_lock);
> >> if (fa_mark->fee_slot) {
> >> if (list_empty(&fa_mark->fee_slot->fae.fse.list)) {
> >> kfree(fa_mark->fee_slot);
> >> fa_mark->fee_slot = NULL;
> >> } else {
> >> fa_mark->fee_slot->mark_alive = 0;
> >> }
> >> }
> >> spin_unlock(&group->notification_lock);
> >> }
> >>
> >> And then when queueing and dequeueing event we would have to carefully

"would have to carefully..." oh oh! there are not words that I like to
read unless I have to.
I think that fs error events are a rare enough case and not performance
sensitive at all, so we should strive for the KISS design principle here.

> >> check what is the mark & event state under appropriate lock (because
> >> ->handle_event() callbacks can see marks on the way to be destroyed as they
> >> are protected just by SRCU).
> >
> > Thanks for the review. That is indeed a subtle race that I hadn't
> > noticed.
> >
> > Option 2 is much more straightforward. And considering the uABI won't
> > be changed if we decide to change to option 1 later, I gave that a try
> > and should be able to prepare a new version that leaves the error event
> > with a weak association to the mark, without the back reference, and
> > allowing it to be deleted by the latest between dequeue and
> > ->freeing_mark, as you suggested.
>
> Actually, I don't think this will work for insertion unless we keep a
> bounce buffer for the file_handle, because we need to keep the
> group->notification_lock to ensure the fee doesn't go away with the mark
> (since it is not yet enqueued) but, as discussed before, we don't want
> to hold that lock when generating the FH.
>
> I think the correct way is to have some sort of refcount of the error
> event slot. We could use err_count for that and change the suggestion
> above to:
>
> if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
> struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);
>
> spin_lock(&group->notification_lock);
> if (fa_mark->fee_slot) {
> if (!fee->err_count) {
> kfree(fa_mark->fee_slot);
> fa_mark->fee_slot = NULL;
> } else {
> fa_mark->fee_slot->mark_alive = 0;
> }
> }
> spin_unlock(&group->notification_lock);
> }
>
> And insertion would look like this:
>
> static int fanotify_handle_error_event(....) {
>
> spin_lock(&group->notification_lock);
>
> if (!mark->fee || (mark->fee->err_count++) {
> spin_unlock(&group->notification_lock);
> return 0;
> }
>
> spin_unlock(&group->notification_lock);
>
> mark->fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
>
> /* ... Write report data to error event ... */
>
> fanotify_encode_fh(&fee->object_fh, fanotify_encode_fh_len(inode),
> NULL, 0);
>
> fsnotify_add_event(group, &fee->fae.fse, NULL);
> }
>
> Unless you think this is too hack-ish.
>
> To be fair, I think it is hack-ish.

Actually, I wouldn't mind the hack-ish-ness if it would simplify things,
but I do not see how this is the case here.
I still cannot wrap my head around the semantics, which is a big red light.
First of all a suggestion should start with the lifetime rules:
- Possible states
- State transition rules

Speaking for myself, I simply cannot review a proposal without these
documented rules.

> I would add a proper refcount_t
> to the error event, and let the mark own a reference to it, which is
> dropped when the mark goes away. Enqueue and Dequeue will acquire and
> drop references, respectively. In this case, err_count is not
> overloaded.
>
> Will it work?

Maybe, I still don't see the full picture, but if this can get us to a state
where error events handling is simpler then it's a good idea.
Saving the space of refcount_t in error event struct is not important at all.

But if Jan's option #1 (mempool) brings us to less special casing
of enqueue/dequeue of error events, then I think that would be
my preference.

In any case, I suggest waiting for Jan's input before you continue.

Thanks,
Amir.
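
The lifetime rules requested in the message above (possible states, state
transition rules) could be written down as a transition table. The sketch
below is one hypothetical reading of the refcounted variant, not something
agreed on in the thread: the state and event names are invented for
illustration, and treating a report against an already-removed mark as a
no-op is an assumption.

```c
#include <assert.h>

/* Hypothetical states of a per-mark error event slot. */
enum slot_state {
	SLOT_IDLE,	/* mark alive, event not queued */
	SLOT_QUEUED,	/* mark alive, event queued */
	SLOT_ORPHANED,	/* mark gone, event still queued */
	SLOT_FREED,	/* slot memory released */
	SLOT_INVALID	/* transition that must never happen */
};

/* Hypothetical inputs that move the slot between states. */
enum slot_event {
	EV_REPORT,	/* filesystem reports an error */
	EV_DEQUEUE,	/* reader consumes the queued event */
	EV_MARK_FREE	/* ->freeing_mark runs for the owning mark */
};

/* Look up the next state; anything outside the table is invalid. */
static enum slot_state slot_next(enum slot_state s, enum slot_event e)
{
	static const enum slot_state table[4][3] = {
		/*                  REPORT         DEQUEUE       MARK_FREE */
		[SLOT_IDLE]     = { SLOT_QUEUED,   SLOT_INVALID, SLOT_FREED },
		[SLOT_QUEUED]   = { SLOT_QUEUED,   SLOT_IDLE,    SLOT_ORPHANED },
		[SLOT_ORPHANED] = { SLOT_ORPHANED, SLOT_FREED,   SLOT_INVALID },
		[SLOT_FREED]    = { SLOT_INVALID,  SLOT_INVALID, SLOT_INVALID },
	};

	if (s >= SLOT_INVALID || e > EV_MARK_FREE)
		return SLOT_INVALID;
	return table[s][e];
}
```

Writing the rules this way makes the two paths to SLOT_FREED explicit: the
mark is removed while nothing is queued, or the last queued event is
dequeued after the mark is gone.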

2021-09-15 10:38:04

by Jan Kara

Subject: Re: [PATCH v6 15/21] fanotify: Preallocate per superblock mark error event

On Fri 03-09-21 07:16:33, Amir Goldstein wrote:
> On Fri, Sep 3, 2021 at 12:24 AM Gabriel Krisman Bertazi
> <[email protected]> wrote:
> > Actually, I don't think this will work for insertion unless we keep a
> > bounce buffer for the file_handle, because we need to keep the
> > group->notification_lock to ensure the fee doesn't go away with the mark
> > (since it is not yet enqueued) but, as discussed before, we don't want
> > to hold that lock when generating the FH.
> >
> > I think the correct way is to have some sort of refcount of the error
> > event slot. We could use err_count for that and change the suggestion
> > above to:
> >
> > if (mark->flags & FANOTIFY_MARK_FLAG_SB_MARK) {
> > struct fanotify_sb_mark *fa_mark = FANOTIFY_SB_MARK(mark);
> >
> > spin_lock(&group->notification_lock);
> > if (fa_mark->fee_slot) {
> > if (!fee->err_count) {
> > kfree(fa_mark->fee_slot);
> > fa_mark->fee_slot = NULL;
> > } else {
> > fa_mark->fee_slot->mark_alive = 0;
> > }
> > }
> > spin_unlock(&group->notification_lock);
> > }
> >
> > And insertion would look like this:
> >
> > static int fanotify_handle_error_event(....) {
> >
> > spin_lock(&group->notification_lock);
> >
> > if (!mark->fee || (mark->fee->err_count++) {
> > spin_unlock(&group->notification_lock);
> > return 0;
> > }
> >
> > spin_unlock(&group->notification_lock);
> >
> > mark->fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> >
> > /* ... Write report data to error event ... */
> >
> > fanotify_encode_fh(&fee->object_fh, fanotify_encode_fh_len(inode),
> > NULL, 0);
> >
> > fsnotify_add_event(group, &fee->fae.fse, NULL);
> > }
> >
> > Unless you think this is too hack-ish.
> >
> > To be fair, I think it is hack-ish.
>
> Actually, I wouldn't mind the hack-ish-ness if it would simplify things,
> but I do not see how this is the case here.
> I still cannot wrap my head around the semantics, which is a big red light.
> First of all a suggestion should start with the lifetime rules:
> - Possible states
> - State transition rules
>
> Speaking for myself, I simply cannot review a proposal without these
> documented rules.

Hum, getting back up to speed on this after vacation is tough which
suggests maybe we've indeed overengineered this :) So let's try to simplify
things.

> > I would add a proper refcount_t
> > to the error event, and let the mark own a reference to it, which is
> > dropped when the mark goes away. Enqueue and Dequeue will acquire and
> > drop references, respectively. In this case, err_count is not
> > overloaded.
> >
> > Will it work?
>
> Maybe, I still don't see the full picture, but if this can get us to a state
> where error events handling is simpler then it's a good idea.
> Saving the space of refcount_t in error event struct is not important at all.
>
> But if Jan's option #1 (mempool) brings us to less special casing
> of enqueue/dequeue of error events, then I think that would be
> my preference.

Yes, I think mempools would result in simpler code overall (the
complexity of recycling events would be handled by the mempool for us). Maybe
we would not even need to play tricks with mempool resizing? We could just
make sure it has a couple of events reserved, and if it ever happens that
mempool_alloc() cannot give us an event, we'd report queue overflow (like
we already do for other event types if that happens). I think we could
require that callers generating error events are in a context where a GFP_NOFS
allocation is OK - this should be an achievable target for filesystems, and
allocation failures should be rare with such a mask.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
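
The reserve-then-overflow idea in the message above can be modelled in
userspace as follows. This is a toy sketch, not the kernel mempool API: the
names (event_pool, pool_alloc, pool_free, RESERVED_EVENTS) and the
fail_malloc knob are hypothetical, and malloc() stands in for the regular
allocator. It only mirrors the general shape of the behaviour: try the
normal allocation first, fall back to a small reserve, and treat an empty
reserve as the signal to report queue overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define RESERVED_EVENTS 2	/* hypothetical reserve size */

struct err_event {
	int error;
};

struct event_pool {
	struct err_event *reserve[RESERVED_EVENTS];
	int nr_reserved;
	bool fail_malloc;	/* knob simulating memory pressure */
};

/* Release whatever is still held in the reserve. */
static void pool_destroy(struct event_pool *p)
{
	while (p->nr_reserved > 0)
		free(p->reserve[--p->nr_reserved]);
}

/* Fill the reserve up front, while allocation is still allowed to fail. */
static bool pool_init(struct event_pool *p)
{
	p->nr_reserved = 0;
	p->fail_malloc = false;
	while (p->nr_reserved < RESERVED_EVENTS) {
		struct err_event *ev = malloc(sizeof(*ev));

		if (!ev) {
			pool_destroy(p);
			return false;
		}
		p->reserve[p->nr_reserved++] = ev;
	}
	return true;
}

/* Try the regular allocator first, then fall back to the reserve. */
static struct err_event *pool_alloc(struct event_pool *p)
{
	struct err_event *ev = p->fail_malloc ? NULL : malloc(sizeof(*ev));

	if (!ev && p->nr_reserved > 0)
		ev = p->reserve[--p->nr_reserved];
	return ev;	/* NULL: the caller would report queue overflow */
}

/* Refill the reserve before handing memory back to the allocator. */
static void pool_free(struct event_pool *p, struct err_event *ev)
{
	if (p->nr_reserved < RESERVED_EVENTS)
		p->reserve[p->nr_reserved++] = ev;
	else
		free(ev);
}
```

With this shape, error events need no special lifetime rules: each one is
an ordinary allocation that is returned to the pool once it has been
dequeued.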