2021-04-26 18:43:03

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 00/15] File system wide monitoring

Hi,

In an attempt to consolidate some of the feedback from the previous
proposals, I wrote a new attempt to solve the file system error reporting
problem. Before I spend more time polishing it, I'd like to hear your
feedback if I'm going in the wrong direction, in particular with the
modifications to fsnotify.

This RFC follows up on my previous proposals which attempted to leverage
watch_queue[1] and fsnotify[2] to provide a mechanism for file systems
to push error notifications to user space. This proposal starts by, as
suggested by Darrick, limiting the scope of what I'm trying to do to an
interface for administrators to monitor the health of a file system,
instead of a generic inteface for file errors. Therefore, this doesn't
solve the problem of writeback errors or the need to watch a specific
subsystem.

* Format

The feature is implemented on top of fanotify, as a new type of fanotify
mark, FAN_ERROR, which a file system monitoring tool can register to
receive notifications. A notification is split in three parts, and only
the first is guaranteed to exist for any given error event:

- FS generic data: A file system agnostic structure that has a generic
error code and identifies the filesystem. Basically, it let's
userspace know something happen on a monitored filesystem.

- FS location data: Identifies where in the code the problem
happened. (This is important for the use case of analysing frequent
error points that we discussed earlier).

- FS specific data: A detailed error report in a filesystem specific
format that details what the error is. Ideally, a capable monitoring
tool can use the information here for error recovery. For instance,
xfs can put the xfs_scrub structures here, ext4 can send its error
reports, etc. An example of usage is done in the ext4 patch of this
series.

More details on the information in each record can be found on the
documentation introduced in patch 15.

* Using fanotify

Using fanotify for this kind of thing is slightly tricky because we want
to guarantee delivery in some complicated conditions, for instance, the
file system might want to send an error while holding several locks.

Instead of working around file system constraints at the file system
level, this proposal tries to make the FAN_ERROR submission safe in
those contexts. This is done with a new mode in fsnotify that
preallocates the memory at group creation to be used for the
notification submission.

This new mode in fsnotify introduces a ring buffer to queue
notifications, which eliminates the allocation path in fsnotify. From
what I saw, the allocation is the only problem in fsnotify for
filesystems to submit errors in constrained situations.

* Visibility

Since the usecase is limited to a tool for whole file system monitoring,
errors are associated with the superblock and visible filesystem-wide.
It is assumed and required that userspace has CAP_SYS_ADMIN.

* Testing

This was tested with corrupted ext4 images in a few scenarios, which
caused errors to be triggered and monitored with the sample tool
provided in the next to final patch.

* patches

Patches 1-4 massage fanotify attempt to refactor fanotify a bit for
the patches to come. Patch 5 introduce the ring buffer interface to
fsnotify, while patch 6 enable this support in fanotify. Patch 7, 8 wire
the FS_ERROR event type, which will be used by filesystems. In
sequennce, patches 9-12 implement the FAN_ERROR record types and create
the new event. Patch 13 is an ext4 example implementation supporting
this feature. Finally, patches 14 and 15 document and provide examples
of a userspace tool that uses this feature.

I also pushed the full series to:

https://gitlab.collabora.com/krisman/linux -b fanotify-notifications

[1] https://lwn.net/Articles/839310/
[2] https://www.spinics.net/lists/linux-fsdevel/msg187075.html

Gabriel Krisman Bertazi (15):
fanotify: Fold event size calculation to its own function
fanotify: Split fsid check from other fid mode checks
fsnotify: Wire flags field on group allocation
fsnotify: Wire up group information on event initialization
fsnotify: Support event submission through ring buffer
fanotify: Support submission through ring buffer
fsnotify: Support FS_ERROR event type
fsnotify: Introduce helpers to send error_events
fanotify: Introduce generic error record
fanotify: Introduce code location record
fanotify: Introduce filesystem specific data record
fanotify: Introduce the FAN_ERROR mark
ext4: Send notifications on error
samples: Add fs error monitoring example
Documentation: Document the FAN_ERROR framework

.../admin-guide/filesystem-monitoring.rst | 103 ++++++
Documentation/admin-guide/index.rst | 1 +
fs/ext4/super.c | 60 +++-
fs/notify/Makefile | 2 +-
fs/notify/dnotify/dnotify.c | 2 +-
fs/notify/fanotify/fanotify.c | 127 +++++--
fs/notify/fanotify/fanotify.h | 35 +-
fs/notify/fanotify/fanotify_user.c | 319 ++++++++++++++----
fs/notify/fsnotify.c | 2 +-
fs/notify/group.c | 25 +-
fs/notify/inotify/inotify_fsnotify.c | 2 +-
fs/notify/inotify/inotify_user.c | 4 +-
fs/notify/notification.c | 10 +
fs/notify/ring.c | 199 +++++++++++
include/linux/fanotify.h | 12 +-
include/linux/fsnotify.h | 15 +
include/linux/fsnotify_backend.h | 63 +++-
include/uapi/linux/ext4-notify.h | 17 +
include/uapi/linux/fanotify.h | 26 ++
kernel/audit_fsnotify.c | 2 +-
kernel/audit_tree.c | 2 +-
kernel/audit_watch.c | 2 +-
samples/Kconfig | 7 +
samples/Makefile | 1 +
samples/fanotify/Makefile | 3 +
samples/fanotify/fs-monitor.c | 135 ++++++++
26 files changed, 1034 insertions(+), 142 deletions(-)
create mode 100644 Documentation/admin-guide/filesystem-monitoring.rst
create mode 100644 fs/notify/ring.c
create mode 100644 include/uapi/linux/ext4-notify.h
create mode 100644 samples/fanotify/Makefile
create mode 100644 samples/fanotify/fs-monitor.c

--
2.31.0


2021-04-26 18:43:04

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 01/15] fanotify: Fold event size calculation to its own function

Every time this function is invoked, it is immediately added to
FAN_EVENT_METADATA_LEN, since there is no need to just calculate the
length of info records. This minor clean up folds the rest of the
calculation into the function, which now operates in terms of events,
returning the size of the entire event, including metadata.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify_user.c | 40 +++++++++++++++++-------------
1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9e0c1afac8bd..0332c4afeec3 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -64,17 +64,24 @@ static int fanotify_fid_info_len(int fh_len, int name_len)
return roundup(FANOTIFY_INFO_HDR_LEN + info_len, FANOTIFY_EVENT_ALIGN);
}

-static int fanotify_event_info_len(unsigned int fid_mode,
- struct fanotify_event *event)
+static size_t fanotify_event_len(struct fanotify_event *event,
+ unsigned int fid_mode)
{
- struct fanotify_info *info = fanotify_event_info(event);
- int dir_fh_len = fanotify_event_dir_fh_len(event);
- int fh_len = fanotify_event_object_fh_len(event);
- int info_len = 0;
+ size_t event_len = FAN_EVENT_METADATA_LEN;
+ struct fanotify_info *info;
+ int dir_fh_len;
+ int fh_len;
int dot_len = 0;

+ if (!fid_mode)
+ return event_len;
+
+ info = fanotify_event_info(event);
+ dir_fh_len = fanotify_event_dir_fh_len(event);
+ fh_len = fanotify_event_object_fh_len(event);
+
if (dir_fh_len) {
- info_len += fanotify_fid_info_len(dir_fh_len, info->name_len);
+ event_len += fanotify_fid_info_len(dir_fh_len, info->name_len);
} else if ((fid_mode & FAN_REPORT_NAME) && (event->mask & FAN_ONDIR)) {
/*
* With group flag FAN_REPORT_NAME, if name was not recorded in
@@ -84,9 +91,9 @@ static int fanotify_event_info_len(unsigned int fid_mode,
}

if (fh_len)
- info_len += fanotify_fid_info_len(fh_len, dot_len);
+ event_len += fanotify_fid_info_len(fh_len, dot_len);

- return info_len;
+ return event_len;
}

/*
@@ -98,7 +105,8 @@ static int fanotify_event_info_len(unsigned int fid_mode,
static struct fanotify_event *get_one_event(struct fsnotify_group *group,
size_t count)
{
- size_t event_size = FAN_EVENT_METADATA_LEN;
+ size_t event_size;
+ struct fsnotify_event *fse;
struct fanotify_event *event = NULL;
unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);

@@ -108,16 +116,15 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
if (fsnotify_notify_queue_is_empty(group))
goto out;

- if (fid_mode) {
- event_size += fanotify_event_info_len(fid_mode,
- FANOTIFY_E(fsnotify_peek_first_event(group)));
- }
+ fse = fsnotify_peek_first_event(group);
+ event = FANOTIFY_E(fse);
+ event_size = fanotify_event_len(event, fid_mode);

if (event_size > count) {
event = ERR_PTR(-EINVAL);
goto out;
}
- event = FANOTIFY_E(fsnotify_remove_first_event(group));
+ fsnotify_remove_queued_event(group, fse);
if (fanotify_is_perm_event(event->mask))
FANOTIFY_PERM(event)->state = FAN_EVENT_REPORTED;
out:
@@ -334,8 +341,7 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,

pr_debug("%s: group=%p event=%p\n", __func__, group, event);

- metadata.event_len = FAN_EVENT_METADATA_LEN +
- fanotify_event_info_len(fid_mode, event);
+ metadata.event_len = fanotify_event_len(event, fid_mode);
metadata.metadata_len = FAN_EVENT_METADATA_LEN;
metadata.vers = FANOTIFY_METADATA_VERSION;
metadata.reserved = 0;
--
2.31.0

2021-04-26 18:43:12

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 02/15] fanotify: Split fsid check from other fid mode checks

FAN_ERROR will require fsid, but not necessarily require the filesystem
to expose a file handle. Split those checks into different functions, so
they can be used separately when creating a mark.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify_user.c | 35 +++++++++++++++++++-----------
1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 0332c4afeec3..e0d113e3b65c 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1055,7 +1055,23 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
}

/* Check if filesystem can encode a unique fid */
-static int fanotify_test_fid(struct path *path, __kernel_fsid_t *fsid)
+static int fanotify_test_fid(struct path *path)
+{
+ /*
+ * We need to make sure that the file system supports at least
+ * encoding a file handle so user can use name_to_handle_at() to
+ * compare fid returned with event to the file handle of watched
+ * objects. However, name_to_handle_at() requires that the
+ * filesystem also supports decoding file handles.
+ */
+ if (!path->dentry->d_sb->s_export_op ||
+ !path->dentry->d_sb->s_export_op->fh_to_dentry)
+ return -EOPNOTSUPP;
+
+ return 0;
+}
+
+static int fanotify_check_path_fsid(struct path *path, __kernel_fsid_t *fsid)
{
__kernel_fsid_t root_fsid;
int err;
@@ -1082,17 +1098,6 @@ static int fanotify_test_fid(struct path *path, __kernel_fsid_t *fsid)
root_fsid.val[1] != fsid->val[1])
return -EXDEV;

- /*
- * We need to make sure that the file system supports at least
- * encoding a file handle so user can use name_to_handle_at() to
- * compare fid returned with event to the file handle of watched
- * objects. However, name_to_handle_at() requires that the
- * filesystem also supports decoding file handles.
- */
- if (!path->dentry->d_sb->s_export_op ||
- !path->dentry->d_sb->s_export_op->fh_to_dentry)
- return -EOPNOTSUPP;
-
return 0;
}

@@ -1230,7 +1235,11 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
}

if (fid_mode) {
- ret = fanotify_test_fid(&path, &__fsid);
+ ret = fanotify_check_path_fsid(&path, &__fsid);
+ if (ret)
+ goto path_put_and_out;
+
+ ret = fanotify_test_fid(&path);
if (ret)
goto path_put_and_out;

--
2.31.0

2021-04-26 18:43:16

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 03/15] fsnotify: Wire flags field on group allocation

Introduce a flags field in fsnotify_group to track the mode of
submission this group has.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/dnotify/dnotify.c | 2 +-
fs/notify/fanotify/fanotify_user.c | 4 ++--
fs/notify/group.c | 13 ++++++++-----
fs/notify/inotify/inotify_user.c | 2 +-
include/linux/fsnotify_backend.h | 7 +++++--
kernel/audit_fsnotify.c | 2 +-
kernel/audit_tree.c | 2 +-
kernel/audit_watch.c | 2 +-
8 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index e85e13c50d6d..37960c8750e4 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -383,7 +383,7 @@ static int __init dnotify_init(void)
SLAB_PANIC|SLAB_ACCOUNT);
dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);

- dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops);
+ dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops, 0);
if (IS_ERR(dnotify_group))
panic("unable to allocate fsnotify group for dnotify\n");
return 0;
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index e0d113e3b65c..f50c4ab721e3 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -929,7 +929,7 @@ static struct fsnotify_event *fanotify_alloc_overflow_event(void)
SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
{
struct fsnotify_group *group;
- int f_flags, fd;
+ int f_flags, fd, fsn_flags = 0;
struct user_struct *user;
unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
unsigned int class = flags & FANOTIFY_CLASS_BITS;
@@ -982,7 +982,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
f_flags |= O_NONBLOCK;

/* fsnotify_alloc_group takes a ref. Dropped in fanotify_release */
- group = fsnotify_alloc_user_group(&fanotify_fsnotify_ops);
+ group = fsnotify_alloc_user_group(&fanotify_fsnotify_ops, fsn_flags);
if (IS_ERR(group)) {
free_uid(user);
return PTR_ERR(group);
diff --git a/fs/notify/group.c b/fs/notify/group.c
index ffd723ffe46d..08acb1afc0c2 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -112,7 +112,7 @@ void fsnotify_put_group(struct fsnotify_group *group)
EXPORT_SYMBOL_GPL(fsnotify_put_group);

static struct fsnotify_group *__fsnotify_alloc_group(
- const struct fsnotify_ops *ops, gfp_t gfp)
+ const struct fsnotify_ops *ops, unsigned int flags, gfp_t gfp)
{
struct fsnotify_group *group;

@@ -134,6 +134,7 @@ static struct fsnotify_group *__fsnotify_alloc_group(
INIT_LIST_HEAD(&group->marks_list);

group->ops = ops;
+ group->flags = flags;

return group;
}
@@ -141,18 +142,20 @@ static struct fsnotify_group *__fsnotify_alloc_group(
/*
* Create a new fsnotify_group and hold a reference for the group returned.
*/
-struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops)
+struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops,
+ unsigned int flags)
{
- return __fsnotify_alloc_group(ops, GFP_KERNEL);
+ return __fsnotify_alloc_group(ops, flags, GFP_KERNEL);
}
EXPORT_SYMBOL_GPL(fsnotify_alloc_group);

/*
* Create a new fsnotify_group and hold a reference for the group returned.
*/
-struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops)
+struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops,
+ unsigned int flags)
{
- return __fsnotify_alloc_group(ops, GFP_KERNEL_ACCOUNT);
+ return __fsnotify_alloc_group(ops, flags, GFP_KERNEL_ACCOUNT);
}
EXPORT_SYMBOL_GPL(fsnotify_alloc_user_group);

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index c71be4fb7dc5..f2687267bc15 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -632,7 +632,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
struct fsnotify_group *group;
struct inotify_event_info *oevent;

- group = fsnotify_alloc_user_group(&inotify_fsnotify_ops);
+ group = fsnotify_alloc_user_group(&inotify_fsnotify_ops, 0);
if (IS_ERR(group))
return group;

diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index e5409b83e731..ef4352563ede 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -221,6 +221,7 @@ struct fsnotify_group {
* full */

struct mem_cgroup *memcg; /* memcg to charge allocations */
+ unsigned int flags;

/* groups can define private fields here or use the void *private */
union {
@@ -469,8 +470,10 @@ static inline void fsnotify_update_flags(struct dentry *dentry)
/* called from fsnotify listeners, such as fanotify or dnotify */

/* create a new group */
-extern struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops);
-extern struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops);
+extern struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops,
+ unsigned int flags);
+extern struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops,
+ unsigned int flags);
/* get reference to a group */
extern void fsnotify_get_group(struct fsnotify_group *group);
/* drop reference on a group from fsnotify_alloc_group */
diff --git a/kernel/audit_fsnotify.c b/kernel/audit_fsnotify.c
index 60739d5e3373..dce6a6212f8f 100644
--- a/kernel/audit_fsnotify.c
+++ b/kernel/audit_fsnotify.c
@@ -182,7 +182,7 @@ static const struct fsnotify_ops audit_mark_fsnotify_ops = {

static int __init audit_fsnotify_init(void)
{
- audit_fsnotify_group = fsnotify_alloc_group(&audit_mark_fsnotify_ops);
+ audit_fsnotify_group = fsnotify_alloc_group(&audit_mark_fsnotify_ops, 0);
if (IS_ERR(audit_fsnotify_group)) {
audit_fsnotify_group = NULL;
audit_panic("cannot create audit fsnotify group");
diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
index 6c91902f4f45..3d045fc791f2 100644
--- a/kernel/audit_tree.c
+++ b/kernel/audit_tree.c
@@ -1077,7 +1077,7 @@ static int __init audit_tree_init(void)

audit_tree_mark_cachep = KMEM_CACHE(audit_tree_mark, SLAB_PANIC);

- audit_tree_group = fsnotify_alloc_group(&audit_tree_ops);
+ audit_tree_group = fsnotify_alloc_group(&audit_tree_ops, 0);
if (IS_ERR(audit_tree_group))
audit_panic("cannot initialize fsnotify group for rectree watches");

diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
index 2acf7ca49154..80a8c14de961 100644
--- a/kernel/audit_watch.c
+++ b/kernel/audit_watch.c
@@ -493,7 +493,7 @@ static const struct fsnotify_ops audit_watch_fsnotify_ops = {

static int __init audit_watch_init(void)
{
- audit_watch_group = fsnotify_alloc_group(&audit_watch_fsnotify_ops);
+ audit_watch_group = fsnotify_alloc_group(&audit_watch_fsnotify_ops, 0);
if (IS_ERR(audit_watch_group)) {
audit_watch_group = NULL;
audit_panic("cannot create audit fsnotify group");
--
2.31.0

2021-04-26 18:43:20

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 04/15] fsnotify: Wire up group information on event initialization

This is used by following patches when deciding about which fields to
initialize.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify.c | 2 +-
fs/notify/fanotify/fanotify.h | 5 +++--
fs/notify/fanotify/fanotify_user.c | 6 +++---
fs/notify/inotify/inotify_fsnotify.c | 2 +-
fs/notify/inotify/inotify_user.c | 2 +-
include/linux/fsnotify_backend.h | 3 ++-
6 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 1192c9953620..e3669d8a4a64 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -601,7 +601,7 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
* event queue, so event reported on parent is merged with event
* reported on child when both directory and child watches exist.
*/
- fanotify_init_event(event, (unsigned long)id, mask);
+ fanotify_init_event(group, event, (unsigned long)id, mask);
if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
event->pid = get_pid(task_pid(current));
else
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 896c819a1786..47299e3d6efd 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -144,10 +144,11 @@ struct fanotify_event {
struct pid *pid;
};

-static inline void fanotify_init_event(struct fanotify_event *event,
+static inline void fanotify_init_event(struct fsnotify_group *group,
+ struct fanotify_event *event,
unsigned long id, u32 mask)
{
- fsnotify_init_event(&event->fse, id);
+ fsnotify_init_event(group, &event->fse, id);
event->mask = mask;
event->pid = NULL;
}
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index f50c4ab721e3..fe605359af88 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -911,7 +911,7 @@ static int fanotify_add_inode_mark(struct fsnotify_group *group,
FSNOTIFY_OBJ_TYPE_INODE, mask, flags, fsid);
}

-static struct fsnotify_event *fanotify_alloc_overflow_event(void)
+static struct fsnotify_event *fanotify_alloc_overflow_event(struct fsnotify_group *group)
{
struct fanotify_event *oevent;

@@ -919,7 +919,7 @@ static struct fsnotify_event *fanotify_alloc_overflow_event(void)
if (!oevent)
return NULL;

- fanotify_init_event(oevent, 0, FS_Q_OVERFLOW);
+ fanotify_init_event(group, oevent, 0, FS_Q_OVERFLOW);
oevent->type = FANOTIFY_EVENT_TYPE_OVERFLOW;

return &oevent->fse;
@@ -993,7 +993,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
atomic_inc(&user->fanotify_listeners);
group->memcg = get_mem_cgroup_from_mm(current->mm);

- group->overflow_event = fanotify_alloc_overflow_event();
+ group->overflow_event = fanotify_alloc_overflow_event(group);
if (unlikely(!group->overflow_event)) {
fd = -ENOMEM;
goto out_destroy_group;
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 1901d799909b..c6eceb663ac3 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -107,7 +107,7 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
mask &= ~IN_ISDIR;

fsn_event = &event->fse;
- fsnotify_init_event(fsn_event, 0);
+ fsnotify_init_event(group, fsn_event, 0);
event->mask = mask;
event->wd = i_mark->wd;
event->sync_cookie = cookie;
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index f2687267bc15..fb9a62b988e5 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -642,7 +642,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
return ERR_PTR(-ENOMEM);
}
group->overflow_event = &oevent->fse;
- fsnotify_init_event(group->overflow_event, 0);
+ fsnotify_init_event(group, group->overflow_event, 0);
oevent->mask = FS_Q_OVERFLOW;
oevent->wd = -1;
oevent->sync_cookie = 0;
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index ef4352563ede..190c6a402e98 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -579,7 +579,8 @@ extern void fsnotify_put_mark(struct fsnotify_mark *mark);
extern void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info);
extern bool fsnotify_prepare_user_wait(struct fsnotify_iter_info *iter_info);

-static inline void fsnotify_init_event(struct fsnotify_event *event,
+static inline void fsnotify_init_event(struct fsnotify_group *group,
+ struct fsnotify_event *event,
unsigned long objectid)
{
INIT_LIST_HEAD(&event->list);
--
2.31.0

2021-04-26 18:43:22

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 05/15] fsnotify: Support event submission through ring buffer

In order to support file system health/error reporting over fanotify,
fsnotify needs to expose a submission path that doesn't allow sleeping.
The only problem I identified with the current submission path is the
need to dynamically allocate memory for the event queue.

This patch avoids the problem by introducing a new mode in fsnotify,
where a ring buffer is used to submit events for a group. Each group
has its own ring buffer, and error notifications are submitted
exclusively through it.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/Makefile | 2 +-
fs/notify/group.c | 12 +-
fs/notify/notification.c | 10 ++
fs/notify/ring.c | 199 +++++++++++++++++++++++++++++++
include/linux/fsnotify_backend.h | 37 +++++-
5 files changed, 255 insertions(+), 5 deletions(-)
create mode 100644 fs/notify/ring.c

diff --git a/fs/notify/Makefile b/fs/notify/Makefile
index 63a4b8828df4..61dae1e90f2d 100644
--- a/fs/notify/Makefile
+++ b/fs/notify/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o mark.o \
- fdinfo.o
+ fdinfo.o ring.o

obj-y += dnotify/
obj-y += inotify/
diff --git a/fs/notify/group.c b/fs/notify/group.c
index 08acb1afc0c2..b99b3de36696 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -81,7 +81,10 @@ void fsnotify_destroy_group(struct fsnotify_group *group)
* notification against this group. So clearing the notification queue
* of all events is reliable now.
*/
- fsnotify_flush_notify(group);
+ if (group->flags & FSN_SUBMISSION_RING_BUFFER)
+ fsnotify_free_ring_buffer(group);
+ else
+ fsnotify_flush_notify(group);

/*
* Destroy overflow event (we cannot use fsnotify_destroy_event() as
@@ -136,6 +139,13 @@ static struct fsnotify_group *__fsnotify_alloc_group(
group->ops = ops;
group->flags = flags;

+ if (group->flags & FSN_SUBMISSION_RING_BUFFER) {
+ if (fsnotify_create_ring_buffer(group)) {
+ kfree(group);
+ return ERR_PTR(-ENOMEM);
+ }
+ }
+
return group;
}

diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 75d79d6d3ef0..32f97e7b7a80 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -51,6 +51,10 @@ EXPORT_SYMBOL_GPL(fsnotify_get_cookie);
bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group)
{
assert_spin_locked(&group->notification_lock);
+
+ if (group->flags & FSN_SUBMISSION_RING_BUFFER)
+ return fsnotify_ring_notify_queue_is_empty(group);
+
return list_empty(&group->notification_list) ? true : false;
}

@@ -132,6 +136,9 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group,
struct fsnotify_event *event)
{
assert_spin_locked(&group->notification_lock);
+
+ if (group->flags & FSN_SUBMISSION_RING_BUFFER)
+ return;
/*
* We need to init list head for the case of overflow event so that
* check in fsnotify_add_event() works
@@ -166,6 +173,9 @@ struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group)
{
assert_spin_locked(&group->notification_lock);

+ if (group->flags & FSN_SUBMISSION_RING_BUFFER)
+ return fsnotify_ring_peek_first_event(group);
+
return list_first_entry(&group->notification_list,
struct fsnotify_event, list);
}
diff --git a/fs/notify/ring.c b/fs/notify/ring.c
new file mode 100644
index 000000000000..75e8af1f8d80
--- /dev/null
+++ b/fs/notify/ring.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/fsnotify.h>
+#include <linux/memcontrol.h>
+
+#define INVALID_RING_SLOT -1
+
+#define FSNOTIFY_RING_PAGES 16
+
+#define NEXT_SLOT(cur, len, ring_size) ((cur + len) & (ring_size-1))
+#define NEXT_PAGE(cur, ring_size) (round_up(cur, PAGE_SIZE) & (ring_size-1))
+
+bool fsnotify_ring_notify_queue_is_empty(struct fsnotify_group *group)
+{
+ assert_spin_locked(&group->notification_lock);
+
+ if (group->ring_buffer.tail == group->ring_buffer.head)
+ return true;
+ return false;
+}
+
+struct fsnotify_event *fsnotify_ring_peek_first_event(struct fsnotify_group *group)
+{
+ u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
+ struct fsnotify_event *fsn;
+ char *kaddr;
+ u64 tail;
+
+ assert_spin_locked(&group->notification_lock);
+
+again:
+ tail = group->ring_buffer.tail;
+
+ if ((PAGE_SIZE - (tail & (PAGE_SIZE-1))) < sizeof(struct fsnotify_event)) {
+ group->ring_buffer.tail = NEXT_PAGE(tail, ring_size);
+ goto again;
+ }
+
+ kaddr = kmap_atomic(group->ring_buffer.pages[tail / PAGE_SIZE]);
+ if (!kaddr)
+ return NULL;
+ fsn = (struct fsnotify_event *) (kaddr + (tail & (PAGE_SIZE-1)));
+
+ if (fsn->slot_len == INVALID_RING_SLOT) {
+ group->ring_buffer.tail = NEXT_PAGE(tail, ring_size);
+ kunmap_atomic(kaddr);
+ goto again;
+ }
+
+ /* will be unmapped when entry is consumed. */
+ return fsn;
+}
+
+void fsnotify_ring_buffer_consume_event(struct fsnotify_group *group,
+ struct fsnotify_event *event)
+{
+ u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
+ u64 new_tail = NEXT_SLOT(group->ring_buffer.tail, event->slot_len, ring_size);
+
+ kunmap_atomic(event);
+
+ pr_debug("%s: group=%p tail=%llx->%llx ring_size=%llu\n", __func__,
+ group, group->ring_buffer.tail, new_tail, ring_size);
+
+ WRITE_ONCE(group->ring_buffer.tail, new_tail);
+}
+
+struct fsnotify_event *fsnotify_ring_alloc_event_slot(struct fsnotify_group *group,
+ size_t size)
+ __acquires(&group->notification_lock)
+{
+ struct fsnotify_event *fsn;
+ u64 head, tail;
+ u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
+ u64 new_head;
+ void *kaddr;
+
+ if (WARN_ON(!(group->flags & FSN_SUBMISSION_RING_BUFFER) || size > PAGE_SIZE))
+ return ERR_PTR(-EINVAL);
+
+ pr_debug("%s: start group=%p ring_size=%llu, requested=%lu\n", __func__, group,
+ ring_size, size);
+
+ spin_lock(&group->notification_lock);
+again:
+ head = group->ring_buffer.head;
+ tail = group->ring_buffer.tail;
+ new_head = NEXT_SLOT(head, size, ring_size);
+
+ /* head would catch up to tail, corrupting an entry. */
+ if ((head < tail && new_head > tail) || (head > new_head && new_head > tail)) {
+ fsn = ERR_PTR(-ENOMEM);
+ goto err;
+ }
+
+ /*
+ * Not event a skip message fits in the page. We can detect the
+ * lack of space. Move on to the next page.
+ */
+ if ((PAGE_SIZE - (head & (PAGE_SIZE-1))) < sizeof(struct fsnotify_event)) {
+ /* Start again on next page */
+ group->ring_buffer.head = NEXT_PAGE(head, ring_size);
+ goto again;
+ }
+
+ kaddr = kmap_atomic(group->ring_buffer.pages[head / PAGE_SIZE]);
+ if (!kaddr) {
+ fsn = ERR_PTR(-EFAULT);
+ goto err;
+ }
+
+ fsn = (struct fsnotify_event *) (kaddr + (head & (PAGE_SIZE-1)));
+
+ if ((head >> PAGE_SHIFT) != (new_head >> PAGE_SHIFT)) {
+ /*
+ * No room in the current page. Add a fake entry
+ * consuming the end the page to avoid splitting event
+ * structure.
+ */
+ fsn->slot_len = INVALID_RING_SLOT;
+ kunmap_atomic(kaddr);
+ /* Start again on the next page */
+ group->ring_buffer.head = NEXT_PAGE(head, ring_size);
+
+ goto again;
+ }
+ fsn->slot_len = size;
+
+ return fsn;
+
+err:
+ spin_unlock(&group->notification_lock);
+ return fsn;
+}
+
+void fsnotify_ring_commit_slot(struct fsnotify_group *group, struct fsnotify_event *fsn)
+ __releases(&group->notification_lock)
+{
+ u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
+ u64 head = group->ring_buffer.head;
+ u64 new_head = NEXT_SLOT(head, fsn->slot_len, ring_size);
+
+ pr_debug("%s: group=%p head=%llx->%llx ring_size=%llu\n", __func__,
+ group, head, new_head, ring_size);
+
+ kunmap_atomic(fsn);
+ group->ring_buffer.head = new_head;
+
+ spin_unlock(&group->notification_lock);
+
+ wake_up(&group->notification_waitq);
+ kill_fasync(&group->fsn_fa, SIGIO, POLL_IN);
+
+}
+
+void fsnotify_free_ring_buffer(struct fsnotify_group *group)
+{
+ int i;
+
+ for (i = 0; i < group->ring_buffer.nr_pages; i++)
+ __free_page(group->ring_buffer.pages[i]);
+ kfree(group->ring_buffer.pages);
+ group->ring_buffer.nr_pages = 0;
+}
+
+int fsnotify_create_ring_buffer(struct fsnotify_group *group)
+{
+ int nr_pages = FSNOTIFY_RING_PAGES;
+ int i;
+
+ pr_debug("%s: group=%p pages=%d\n", __func__, group, nr_pages);
+
+ group->ring_buffer.pages = kmalloc_array(nr_pages, sizeof(struct pages *),
+ GFP_KERNEL);
+ if (!group->ring_buffer.pages)
+ return -ENOMEM;
+
+ group->ring_buffer.head = 0;
+ group->ring_buffer.tail = 0;
+
+ for (i = 0; i < nr_pages; i++) {
+ group->ring_buffer.pages[i] = alloc_pages(GFP_KERNEL, 1);
+ if (!group->ring_buffer.pages)
+ goto err_dealloc;
+ }
+
+ group->ring_buffer.nr_pages = nr_pages;
+
+ return 0;
+
+err_dealloc:
+ for (--i; i >= 0; i--)
+ __free_page(group->ring_buffer.pages[i]);
+ kfree(group->ring_buffer.pages);
+ group->ring_buffer.nr_pages = 0;
+ return -ENOMEM;
+}
+
+
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 190c6a402e98..a1a4dd69e5ed 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -74,6 +74,8 @@
#define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
FS_OPEN_EXEC_PERM)

+#define FSN_SUBMISSION_RING_BUFFER 0x00000080
+
/*
* This is a list of all events that may get sent to a parent that is watching
* with flag FS_EVENT_ON_CHILD based on fs event on a child of that directory.
@@ -166,7 +168,11 @@ struct fsnotify_ops {
* listener this structure is where you need to be adding fields.
*/
struct fsnotify_event {
- struct list_head list;
+ union {
+ struct list_head list;
+ int slot_len;
+ };
+
unsigned long objectid; /* identifier for queue merges */
};

@@ -191,7 +197,21 @@ struct fsnotify_group {

/* needed to send notification to userspace */
spinlock_t notification_lock; /* protect the notification_list */
- struct list_head notification_list; /* list of event_holder this group needs to send to userspace */
+
+ union {
+ /*
+ * list of event_holder this group needs to send to
+ * userspace. Either a linked list (default), or a ring
+ * buffer(FSN_SUBMISSION_RING_BUFFER).
+ */
+ struct list_head notification_list;
+ struct {
+ struct page **pages;
+ int nr_pages;
+ u64 head;
+ u64 tail;
+ } ring_buffer;
+ };
wait_queue_head_t notification_waitq; /* read() on the notification file blocks on this waitq */
unsigned int q_len; /* events on the queue */
unsigned int max_events; /* maximum events allowed on the list */
@@ -492,6 +512,16 @@ extern int fsnotify_add_event(struct fsnotify_group *group,
struct fsnotify_event *event,
int (*merge)(struct list_head *,
struct fsnotify_event *));
+
+extern int fsnotify_create_ring_buffer(struct fsnotify_group *group);
+extern void fsnotify_free_ring_buffer(struct fsnotify_group *group);
+extern struct fsnotify_event *fsnotify_ring_alloc_event_slot(struct fsnotify_group *group,
+ size_t size);
+extern void fsnotify_ring_buffer_consume_event(struct fsnotify_group *group,
+ struct fsnotify_event *event);
+extern bool fsnotify_ring_notify_queue_is_empty(struct fsnotify_group *group);
+struct fsnotify_event *fsnotify_ring_peek_first_event(struct fsnotify_group *group);
+extern void fsnotify_ring_commit_slot(struct fsnotify_group *group, struct fsnotify_event *fsn);
/* Queue overflow event to a notification group */
static inline void fsnotify_queue_overflow(struct fsnotify_group *group)
{
@@ -583,7 +613,8 @@ static inline void fsnotify_init_event(struct fsnotify_group *group,
struct fsnotify_event *event,
unsigned long objectid)
{
- INIT_LIST_HEAD(&event->list);
+ if (!(group->flags & FSN_SUBMISSION_RING_BUFFER))
+ INIT_LIST_HEAD(&event->list);
event->objectid = objectid;
}

--
2.31.0

2021-04-26 18:43:27

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 06/15] fanotify: Support submission through ring buffer

This adds support for the ring buffer mode in fanotify. It is enabled
by a new flag FAN_PREALLOC_QUEUE passed to fanotify_init. If this flag
is enabled, the group only allows marks that support the ring buffer
submission. In a following patch, FAN_ERROR will make use of this
mechanism.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify.c | 77 +++++++++++++++++++---------
fs/notify/fanotify/fanotify_user.c | 81 ++++++++++++++++++------------
include/linux/fanotify.h | 5 +-
include/uapi/linux/fanotify.h | 1 +
4 files changed, 105 insertions(+), 59 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index e3669d8a4a64..98591a8155a7 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -612,6 +612,26 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
return event;
}

+static struct fanotify_event *fanotify_ring_get_slot(struct fsnotify_group *group,
+ u32 mask, const void *data,
+ int data_type)
+{
+ size_t size = 0;
+
+ pr_debug("%s: group=%p mask=%x size=%lu\n", __func__, group, mask, size);
+
+ return FANOTIFY_E(fsnotify_ring_alloc_event_slot(group, size));
+}
+
+static void fanotify_ring_write_event(struct fsnotify_group *group,
+ struct fanotify_event *event, u32 mask,
+ const void *data, __kernel_fsid_t *fsid)
+{
+ fanotify_init_event(group, event, 0, mask);
+
+ event->pid = get_pid(task_tgid(current));
+}
+
/*
* Get cached fsid of the filesystem containing the object from any connector.
* All connectors are supposed to have the same fsid, but we do not verify that
@@ -701,31 +721,38 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
return 0;
}

- event = fanotify_alloc_event(group, mask, data, data_type, dir,
- file_name, &fsid);
- ret = -ENOMEM;
- if (unlikely(!event)) {
- /*
- * We don't queue overflow events for permission events as
- * there the access is denied and so no event is in fact lost.
- */
- if (!fanotify_is_perm_event(mask))
- fsnotify_queue_overflow(group);
- goto finish;
- }
-
- fsn_event = &event->fse;
- ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
- if (ret) {
- /* Permission events shouldn't be merged */
- BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
- /* Our event wasn't used in the end. Free it. */
- fsnotify_destroy_event(group, fsn_event);
-
- ret = 0;
- } else if (fanotify_is_perm_event(mask)) {
- ret = fanotify_get_response(group, FANOTIFY_PERM(event),
- iter_info);
+ if (group->flags & FSN_SUBMISSION_RING_BUFFER) {
+ event = fanotify_ring_get_slot(group, mask, data, data_type);
+ if (IS_ERR(event))
+ return PTR_ERR(event);
+ fanotify_ring_write_event(group, event, mask, data, &fsid);
+ fsnotify_ring_commit_slot(group, &event->fse);
+ } else {
+ event = fanotify_alloc_event(group, mask, data, data_type, dir,
+ file_name, &fsid);
+ ret = -ENOMEM;
+ if (unlikely(!event)) {
+ /*
+ * We don't queue overflow events for permission events as
+ * there the access is denied and so no event is in fact lost.
+ */
+ if (!fanotify_is_perm_event(mask))
+ fsnotify_queue_overflow(group);
+ goto finish;
+ }
+ fsn_event = &event->fse;
+ ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
+ if (ret) {
+ /* Permission events shouldn't be merged */
+ BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
+ /* Our event wasn't used in the end. Free it. */
+ fsnotify_destroy_event(group, fsn_event);
+
+ ret = 0;
+ } else if (fanotify_is_perm_event(mask)) {
+ ret = fanotify_get_response(group, FANOTIFY_PERM(event),
+ iter_info);
+ }
}
finish:
if (fanotify_is_perm_event(mask))
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index fe605359af88..5031198bf7db 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -521,7 +521,9 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
* Permission events get queued to wait for response. Other
* events can be destroyed now.
*/
- if (!fanotify_is_perm_event(event->mask)) {
+ if (group->fanotify_data.flags & FAN_PREALLOC_QUEUE) {
+ fsnotify_ring_buffer_consume_event(group, &event->fse);
+ } else if (!fanotify_is_perm_event(event->mask)) {
fsnotify_destroy_event(group, &event->fse);
} else {
if (ret <= 0) {
@@ -587,40 +589,39 @@ static int fanotify_release(struct inode *ignored, struct file *file)
*/
fsnotify_group_stop_queueing(group);

- /*
- * Process all permission events on access_list and notification queue
- * and simulate reply from userspace.
- */
- spin_lock(&group->notification_lock);
- while (!list_empty(&group->fanotify_data.access_list)) {
- struct fanotify_perm_event *event;
-
- event = list_first_entry(&group->fanotify_data.access_list,
- struct fanotify_perm_event, fae.fse.list);
- list_del_init(&event->fae.fse.list);
- finish_permission_event(group, event, FAN_ALLOW);
+ if (!(group->flags & FSN_SUBMISSION_RING_BUFFER)) {
+ /*
+ * Process all permission events on access_list and notification queue
+ * and simulate reply from userspace.
+ */
spin_lock(&group->notification_lock);
- }
-
- /*
- * Destroy all non-permission events. For permission events just
- * dequeue them and set the response. They will be freed once the
- * response is consumed and fanotify_get_response() returns.
- */
- while (!fsnotify_notify_queue_is_empty(group)) {
- struct fanotify_event *event;
-
- event = FANOTIFY_E(fsnotify_remove_first_event(group));
- if (!(event->mask & FANOTIFY_PERM_EVENTS)) {
- spin_unlock(&group->notification_lock);
- fsnotify_destroy_event(group, &event->fse);
- } else {
- finish_permission_event(group, FANOTIFY_PERM(event),
- FAN_ALLOW);
+ while (!list_empty(&group->fanotify_data.access_list)) {
+ struct fanotify_perm_event *event;
+ event = list_first_entry(&group->fanotify_data.access_list,
+ struct fanotify_perm_event, fae.fse.list);
+ list_del_init(&event->fae.fse.list);
+ finish_permission_event(group, event, FAN_ALLOW);
+ spin_lock(&group->notification_lock);
}
- spin_lock(&group->notification_lock);
+ /*
+ * Destroy all non-permission events. For permission events just
+ * dequeue them and set the response. They will be freed once the
+ * response is consumed and fanotify_get_response() returns.
+ */
+ while (!fsnotify_notify_queue_is_empty(group)) {
+ struct fanotify_event *event;
+ event = FANOTIFY_E(fsnotify_remove_first_event(group));
+ if (!(event->mask & FANOTIFY_PERM_EVENTS)) {
+ spin_unlock(&group->notification_lock);
+ fsnotify_destroy_event(group, &event->fse);
+ } else {
+ finish_permission_event(group, FANOTIFY_PERM(event),
+ FAN_ALLOW);
+ }
+ spin_lock(&group->notification_lock);
+ }
+ spin_unlock(&group->notification_lock);
}
- spin_unlock(&group->notification_lock);

/* Response for all permission events it set, wakeup waiters */
wake_up(&group->fanotify_data.access_waitq);
@@ -981,6 +982,16 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
if (flags & FAN_NONBLOCK)
f_flags |= O_NONBLOCK;

+ if (flags & FAN_PREALLOC_QUEUE) {
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (flags & FAN_UNLIMITED_QUEUE)
+ return -EINVAL;
+
+ fsn_flags = FSN_SUBMISSION_RING_BUFFER;
+ }
+
/* fsnotify_alloc_group takes a ref. Dropped in fanotify_release */
group = fsnotify_alloc_user_group(&fanotify_fsnotify_ops, fsn_flags);
if (IS_ERR(group)) {
@@ -1223,6 +1234,10 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
goto fput_and_out;
}

+ if ((group->flags & FSN_SUBMISSION_RING_BUFFER) &&
+ (mask & ~FANOTIFY_SUBMISSION_BUFFER_EVENTS))
+ goto fput_and_out;
+
ret = fanotify_find_path(dfd, pathname, &path, flags,
(mask & ALL_FSNOTIFY_EVENTS), obj_type);
if (ret)
@@ -1327,7 +1342,7 @@ SYSCALL32_DEFINE6(fanotify_mark,
*/
static int __init fanotify_user_setup(void)
{
- BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 10);
+ BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 11);
BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 9);

fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 3e9c56ee651f..5a4cefb4b1c3 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -23,7 +23,8 @@
#define FANOTIFY_INIT_FLAGS (FANOTIFY_CLASS_BITS | FANOTIFY_FID_BITS | \
FAN_REPORT_TID | \
FAN_CLOEXEC | FAN_NONBLOCK | \
- FAN_UNLIMITED_QUEUE | FAN_UNLIMITED_MARKS)
+ FAN_UNLIMITED_QUEUE | FAN_UNLIMITED_MARKS | \
+ FAN_PREALLOC_QUEUE)

#define FANOTIFY_MARK_TYPE_BITS (FAN_MARK_INODE | FAN_MARK_MOUNT | \
FAN_MARK_FILESYSTEM)
@@ -71,6 +72,8 @@
FANOTIFY_PERM_EVENTS | \
FAN_Q_OVERFLOW | FAN_ONDIR)

+#define FANOTIFY_SUBMISSION_BUFFER_EVENTS 0
+
#define ALL_FANOTIFY_EVENT_BITS (FANOTIFY_OUTGOING_EVENTS | \
FANOTIFY_EVENT_FLAGS)

diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index fbf9c5c7dd59..b283531549f1 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -49,6 +49,7 @@
#define FAN_UNLIMITED_QUEUE 0x00000010
#define FAN_UNLIMITED_MARKS 0x00000020
#define FAN_ENABLE_AUDIT 0x00000040
+#define FAN_PREALLOC_QUEUE 0x00000080

/* Flags to determine fanotify event format */
#define FAN_REPORT_TID 0x00000100 /* event->pid is thread id */
--
2.31.0

2021-04-26 18:43:38

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 07/15] fsnotify: Support FS_ERROR event type

Expose a new type of fsnotify event for filesystems to report errors for
userspace monitoring tools. fanotify will send this type of
notification for FAN_ERROR marks.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fsnotify.c | 2 +-
include/linux/fsnotify_backend.h | 16 ++++++++++++++++
2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 30d422b8c0fc..9fff35e67b37 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -558,7 +558,7 @@ static __init int fsnotify_init(void)
{
int ret;

- BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 25);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 26);

ret = init_srcu_struct(&fsnotify_mark_srcu);
if (ret)
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index a1a4dd69e5ed..f850bfbe30d4 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -48,6 +48,8 @@
#define FS_ACCESS_PERM 0x00020000 /* access event in a permissions hook */
#define FS_OPEN_EXEC_PERM 0x00040000 /* open/exec event in a permission hook */

+#define FS_ERROR 0x00100000 /* Used for filesystem error reporting */
+
#define FS_EXCL_UNLINK 0x04000000 /* do not send events if object is unlinked */
/*
* Set on inode mark that cares about things that happen to its children.
@@ -74,6 +76,8 @@
#define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
FS_OPEN_EXEC_PERM)

+#define ALL_FSNOTIFY_ERROR_EVENTS FS_ERROR
+
#define FSN_SUBMISSION_RING_BUFFER 0x00000080

/*
@@ -95,6 +99,7 @@

/* Events that can be reported to backends */
#define ALL_FSNOTIFY_EVENTS (ALL_FSNOTIFY_DIRENT_EVENTS | \
+ ALL_FSNOTIFY_ERROR_EVENTS | \
FS_EVENTS_POSS_ON_CHILD | \
FS_DELETE_SELF | FS_MOVE_SELF | FS_DN_RENAME | \
FS_UNMOUNT | FS_Q_OVERFLOW | FS_IN_IGNORED)
@@ -272,6 +277,17 @@ enum fsnotify_data_type {
FSNOTIFY_EVENT_NONE,
FSNOTIFY_EVENT_PATH,
FSNOTIFY_EVENT_INODE,
+ FSNOTIFY_EVENT_ERROR,
+};
+
+struct fs_error_report {
+ int error;
+
+ int line;
+ const char *function;
+
+ size_t fs_data_size;
+ void *fs_data;
};

static inline struct inode *fsnotify_data_inode(const void *data, int data_type)
--
2.31.0

2021-04-26 18:45:24

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 09/15] fanotify: Introduce generic error record

This record describes a fs error in a fs agnostic way. It will be send
back to userspace in response to a FSNOTIFY_EVENT_ERROR for groups with
the FAN_ERROR mark.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify.h | 16 ++++++++++++++++
fs/notify/fanotify/fanotify_user.c | 28 ++++++++++++++++++++++++++++
include/uapi/linux/fanotify.h | 10 ++++++++++
3 files changed, 54 insertions(+)

diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 47299e3d6efd..4cb9dd31f084 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -179,6 +179,22 @@ FANOTIFY_NE(struct fanotify_event *event)
return container_of(event, struct fanotify_name_event, fae);
}

+struct fanotify_error_event {
+ struct fanotify_event fae;
+ int error;
+ __kernel_fsid_t fsid;
+
+ int fs_data_size;
+ /* Must be the last item in the structure */
+ char fs_data[0];
+};
+
+static inline struct fanotify_error_event *
+FANOTIFY_EE(struct fanotify_event *event)
+{
+ return container_of(event, struct fanotify_error_event, fae);
+}
+
static inline __kernel_fsid_t *fanotify_event_fsid(struct fanotify_event *event)
{
if (event->type == FANOTIFY_EVENT_TYPE_FID)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 5031198bf7db..21162d347bd1 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -64,6 +64,11 @@ static int fanotify_fid_info_len(int fh_len, int name_len)
return roundup(FANOTIFY_INFO_HDR_LEN + info_len, FANOTIFY_EVENT_ALIGN);
}

+static size_t fanotify_error_info_len(struct fanotify_error_event *fee)
+{
+ return sizeof(struct fanotify_event_info_error);
+}
+
static size_t fanotify_event_len(struct fanotify_event *event,
unsigned int fid_mode)
{
@@ -232,6 +237,29 @@ static int process_access_response(struct fsnotify_group *group,
return -ENOENT;
}

+static size_t copy_error_info_to_user(struct fanotify_error_event *fee,
+ char __user *buf, int count)
+{
+ struct fanotify_event_info_error info;
+
+ info.hdr.info_type = FAN_EVENT_INFO_TYPE_ERROR;
+ info.hdr.pad = 0;
+ info.hdr.len = fanotify_error_info_len(fee);
+
+ if (WARN_ON(count < info.hdr.len))
+ return -EFAULT;
+
+ info.version = FANOTIFY_EVENT_INFO_ERROR_VERS_1;
+ info.error = fee->error;
+ info.fsid = fee->fsid;
+
+ if (copy_to_user(buf, &info, sizeof(info)))
+ return -EFAULT;
+
+ return info.hdr.len;
+
+}
+
static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
int info_type, const char *name, size_t name_len,
char __user *buf, size_t count)
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index b283531549f1..cc9a1fa80e30 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -124,6 +124,7 @@ struct fanotify_event_metadata {
#define FAN_EVENT_INFO_TYPE_FID 1
#define FAN_EVENT_INFO_TYPE_DFID_NAME 2
#define FAN_EVENT_INFO_TYPE_DFID 3
+#define FAN_EVENT_INFO_TYPE_ERROR 4

/* Variable length info record following event metadata */
struct fanotify_event_info_header {
@@ -149,6 +150,15 @@ struct fanotify_event_info_fid {
unsigned char handle[0];
};

+#define FANOTIFY_EVENT_INFO_ERROR_VERS_1 1
+
+struct fanotify_event_info_error {
+ struct fanotify_event_info_header hdr;
+ int version;
+ int error;
+ __kernel_fsid_t fsid;
+};
+
struct fanotify_response {
__s32 fd;
__u32 response;
--
2.31.0

2021-04-26 18:45:24

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 08/15] fsnotify: Introduce helpers to send error_events

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/fsnotify.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index f8acddcf54fb..b3ac1a9d0d4d 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -317,4 +317,19 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
fsnotify_dentry(dentry, mask);
}

+static inline void fsnotify_error_event(int error, struct inode *dir,
+ const char *function, int line,
+ void *fs_data, int fs_data_size)
+{
+ struct fs_error_report report = {
+ .error = error,
+ .line = line,
+ .function = function,
+ .fs_data_size = fs_data_size,
+ .fs_data = fs_data,
+ };
+
+ fsnotify(FS_ERROR, &report, FSNOTIFY_EVENT_ERROR, dir, NULL, NULL, 0);
+}
+
#endif /* _LINUX_FS_NOTIFY_H */
--
2.31.0

2021-04-26 18:45:31

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 11/15] fanotify: Introduce filesystem specific data record

Allow a FS_ERROR_TYPE notification to send a filesystem provided blob
back to userspace. This is useful for filesystems who want to provide
debug information for recovery tools.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify_user.c | 27 +++++++++++++++++++++++++++
include/uapi/linux/fanotify.h | 10 ++++++++--
2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index cb79a4a8e6fb..e2f4599dfc25 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -69,6 +69,14 @@ static size_t fanotify_error_info_len(struct fanotify_error_event *fee)
return sizeof(struct fanotify_event_info_error);
}

+static size_t fanotify_error_fsdata_len(struct fanotify_error_event *fee)
+{
+ if (!fee->fs_data_size)
+ return 0;
+
+ return sizeof(struct fanotify_event_info_fsdata) + fee->fs_data_size;
+}
+
static size_t fanotify_location_info_len(const struct fanotify_event_location *loc)
{
if (!loc->function)
@@ -295,6 +303,25 @@ static size_t copy_location_info_to_user(struct fanotify_event_location *locatio
return info.hdr.len;
}

+static ssize_t copy_error_fsdata_info_to_user(struct fanotify_error_event *fee,
+ char __user *buf, int count)
+{
+ struct fanotify_event_info_fsdata info;
+
+ info.hdr.info_type = FAN_EVENT_INFO_TYPE_FSDATA;
+ info.hdr.len = fanotify_error_fsdata_len(fee);
+
+ if (copy_to_user(buf, &info, sizeof(info)))
+ return -EFAULT;
+
+ buf += sizeof(info);
+
+ if (copy_to_user(buf, fee->fs_data, fee->fs_data_size))
+ return -EFAULT;
+
+ return info.hdr.len;
+}
+
static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
int info_type, const char *name, size_t name_len,
char __user *buf, size_t count)
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 67217756dac9..49808c857ee1 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -124,8 +124,9 @@ struct fanotify_event_metadata {
#define FAN_EVENT_INFO_TYPE_FID 1
#define FAN_EVENT_INFO_TYPE_DFID_NAME 2
#define FAN_EVENT_INFO_TYPE_DFID 3
-#define FAN_EVENT_INFO_TYPE_ERROR 4
-#define FAN_EVENT_INFO_TYPE_LOCATION 5
+#define FAN_EVENT_INFO_TYPE_LOCATION 4
+#define FAN_EVENT_INFO_TYPE_ERROR 5
+#define FAN_EVENT_INFO_TYPE_FSDATA 6

/* Variable length info record following event metadata */
struct fanotify_event_info_header {
@@ -166,6 +167,11 @@ struct fanotify_event_info_location {
char function[0];
};

+struct fanotify_event_info_fsdata {
+ struct fanotify_event_info_header hdr;
+ char data[0];
+};
+
struct fanotify_response {
__s32 fd;
__u32 response;
--
2.31.0

2021-04-26 18:45:31

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 14/15] samples: Add fs error monitoring example

Introduce an example of a FAN_ERROR fanotify user to track filesystem
errors.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
samples/Kconfig | 7 ++
samples/Makefile | 1 +
samples/fanotify/Makefile | 3 +
samples/fanotify/fs-monitor.c | 135 ++++++++++++++++++++++++++++++++++
4 files changed, 146 insertions(+)
create mode 100644 samples/fanotify/Makefile
create mode 100644 samples/fanotify/fs-monitor.c

diff --git a/samples/Kconfig b/samples/Kconfig
index e76cdfc50e25..a2968338517f 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -120,6 +120,13 @@ config SAMPLE_CONNECTOR
with it.
See also Documentation/driver-api/connector.rst

+config SAMPLE_FANOTIFY_ERROR
+ bool "Build fanotify error monitoring sample"
+ depends on FANOTIFY
+ help
+ When enabled, this builds an example code that uses the FAN_ERROR
+ fanotify mechanism to monitor filesystem errors.
+
config SAMPLE_HIDRAW
bool "hidraw sample"
depends on CC_CAN_LINK && HEADERS_INSTALL
diff --git a/samples/Makefile b/samples/Makefile
index c3392a595e4b..93e2d64bc9a7 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -5,6 +5,7 @@ subdir-$(CONFIG_SAMPLE_AUXDISPLAY) += auxdisplay
subdir-$(CONFIG_SAMPLE_ANDROID_BINDERFS) += binderfs
obj-$(CONFIG_SAMPLE_CONFIGFS) += configfs/
obj-$(CONFIG_SAMPLE_CONNECTOR) += connector/
+obj-$(CONFIG_SAMPLE_FANOTIFY_ERROR) += fanotify/
subdir-$(CONFIG_SAMPLE_HIDRAW) += hidraw
obj-$(CONFIG_SAMPLE_HW_BREAKPOINT) += hw_breakpoint/
obj-$(CONFIG_SAMPLE_KDB) += kdb/
diff --git a/samples/fanotify/Makefile b/samples/fanotify/Makefile
new file mode 100644
index 000000000000..b3d5cc826e6f
--- /dev/null
+++ b/samples/fanotify/Makefile
@@ -0,0 +1,3 @@
+userprogs-always-y += fs-monitor
+
+userccflags += -I usr/include
diff --git a/samples/fanotify/fs-monitor.c b/samples/fanotify/fs-monitor.c
new file mode 100644
index 000000000000..cdece8344c20
--- /dev/null
+++ b/samples/fanotify/fs-monitor.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2021, Collabora Ltd.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <err.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <sys/fanotify.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#ifndef FAN_ERROR
+
+#define FAN_ERROR 0x00100000
+#define FAN_PREALLOC_QUEUE 0x00000080
+
+#define FAN_EVENT_INFO_TYPE_LOCATION 4
+#define FAN_EVENT_INFO_TYPE_ERROR 5
+#define FAN_EVENT_INFO_TYPE_FSDATA 6
+
+struct fanotify_event_info_error {
+ struct fanotify_event_info_header hdr;
+ int version;
+ int error;
+ long long unsigned fsid;
+};
+
+struct fanotify_event_info_location {
+ struct fanotify_event_info_header hdr;
+ int line;
+ char function[0];
+};
+
+struct fanotify_event_info_fsdata {
+ struct fanotify_event_info_header hdr;
+ char data[0];
+};
+
+struct ext4_error_inode_report {
+ unsigned long long inode;
+ unsigned long long block;
+ char desc[40];
+};
+#endif
+
+static void handle_notifications(char *buffer, int len)
+{
+ struct fanotify_event_metadata *metadata;
+ struct fanotify_event_info_header *hdr = 0;
+ char *off, *next;
+
+ for (metadata = (struct fanotify_event_metadata *) buffer;
+ FAN_EVENT_OK(metadata, len); metadata = FAN_EVENT_NEXT(metadata, len)) {
+ next = (char*)metadata + metadata->event_len;
+ if (!(metadata->mask == FAN_ERROR)) {
+ printf("unexpected FAN MARK: %llx\n", metadata->mask);
+ continue;
+ }
+ if (metadata->fd != FAN_NOFD) {
+ printf("bizar fd != FAN_NOFD\n");
+ continue;;
+ }
+
+ printf("FAN_ERROR found len=%d\n", metadata->event_len);
+
+ for (off = (char*)(metadata+1); off < next; off = off + hdr->len) {
+ hdr = (struct fanotify_event_info_header*)(off);
+
+ if (hdr->info_type == FAN_EVENT_INFO_TYPE_ERROR) {
+ struct fanotify_event_info_error *error =
+ (struct fanotify_event_info_error*) hdr;
+
+ printf(" Generic Error Record: len=%d\n", hdr->len);
+ printf(" version: %d\n", error->version);
+ printf(" error: %d\n", error->error);
+ printf(" fsid: %llx\n", error->fsid);
+
+ } else if(hdr->info_type == FAN_EVENT_INFO_TYPE_LOCATION) {
+ struct fanotify_event_info_location *loc =
+ (struct fanotify_event_info_location*) hdr;
+
+ printf(" Location Record Size = %d\n", loc->hdr.len);
+ printf(" loc=%s:%d\n", loc->function, loc->line);
+
+ } else if(hdr->info_type == FAN_EVENT_INFO_TYPE_FSDATA) {
+ struct fanotify_event_info_fsdata *data =
+ (struct fanotify_event_info_fsdata *)hdr;
+ struct ext4_error_inode_report *fsdata =
+ (struct ext4_error_inode_report*) ((char*)data->data);
+
+ printf(" Fsdata Record: len=%d\n", hdr->len);
+ printf(" inode=%llu\n", fsdata->inode);
+ if (fsdata->block != -1L)
+ printf(" block=%llu\n", fsdata->block);
+ printf(" desc=%s\n", fsdata->desc);
+ }
+ }
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int fd;
+ char buffer[BUFSIZ];
+
+ if (argc < 2) {
+ printf("Missing path argument\n");
+ return 1;
+ }
+
+ fd = fanotify_init(FAN_CLASS_NOTIF|FAN_PREALLOC_QUEUE, O_RDONLY);
+ if (fd < 0)
+ errx(1, "fanotify_init");
+
+ if (fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM,
+ FAN_ERROR, AT_FDCWD, argv[1])) {
+ errx(1, "fanotify_mark");
+ }
+
+ while (1) {
+ int n = read(fd, buffer, BUFSIZ);
+ if (n < 0)
+ errx(1, "read");
+
+ handle_notifications(buffer, n);
+ }
+
+ return 0;
+}
--
2.31.0

2021-04-26 18:45:31

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 10/15] fanotify: Introduce code location record

This patch introduces an optional info record that describes the
source (as in the region of the source-code where an event was
initiated). This record is not produced for other type of existing
notification, but it is optionally enabled for FAN_ERROR notifications.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify.h | 6 +++++
fs/notify/fanotify/fanotify_user.c | 35 ++++++++++++++++++++++++++++++
include/uapi/linux/fanotify.h | 7 ++++++
3 files changed, 48 insertions(+)

diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 4cb9dd31f084..0d1b4cb8b005 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -161,6 +161,11 @@ struct fanotify_fid_event {
unsigned char _inline_fh_buf[FANOTIFY_INLINE_FH_LEN];
};

+struct fanotify_event_location {
+ int line;
+ const char *function;
+};
+
static inline struct fanotify_fid_event *
FANOTIFY_FE(struct fanotify_event *event)
{
@@ -183,6 +188,7 @@ struct fanotify_error_event {
struct fanotify_event fae;
int error;
__kernel_fsid_t fsid;
+ struct fanotify_event_location loc;

int fs_data_size;
/* Must be the last item in the structure */
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 21162d347bd1..cb79a4a8e6fb 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -69,6 +69,16 @@ static size_t fanotify_error_info_len(struct fanotify_error_event *fee)
return sizeof(struct fanotify_event_info_error);
}

+static size_t fanotify_location_info_len(const struct fanotify_event_location *loc)
+{
+ if (!loc->function)
+ return 0;
+
+ /* Includes NULL byte at end of loc->function */
+ return (sizeof(struct fanotify_event_info_location) +
+ strlen(loc->function) + 1);
+}
+
static size_t fanotify_event_len(struct fanotify_event *event,
unsigned int fid_mode)
{
@@ -260,6 +270,31 @@ static size_t copy_error_info_to_user(struct fanotify_error_event *fee,

}

+static size_t copy_location_info_to_user(struct fanotify_event_location *location,
+ char __user *buf, int count)
+{
+ size_t len = fanotify_location_info_len(location);
+ size_t tail = len - sizeof(struct fanotify_event_info_location);
+ struct fanotify_event_info_location info;
+
+ if (!len)
+ return 0;
+
+ info.hdr.info_type = FAN_EVENT_INFO_TYPE_LOCATION;
+ info.hdr.len = len;
+ info.line = location->line;
+
+ if (copy_to_user(buf, &info, sizeof(info)))
+ return -EFAULT;
+
+ buf += sizeof(info);
+
+ if (copy_to_user(buf, location->function, tail))
+ return -EFAULT;
+
+ return info.hdr.len;
+}
+
static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
int info_type, const char *name, size_t name_len,
char __user *buf, size_t count)
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index cc9a1fa80e30..67217756dac9 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -125,6 +125,7 @@ struct fanotify_event_metadata {
#define FAN_EVENT_INFO_TYPE_DFID_NAME 2
#define FAN_EVENT_INFO_TYPE_DFID 3
#define FAN_EVENT_INFO_TYPE_ERROR 4
+#define FAN_EVENT_INFO_TYPE_LOCATION 5

/* Variable length info record following event metadata */
struct fanotify_event_info_header {
@@ -159,6 +160,12 @@ struct fanotify_event_info_error {
__kernel_fsid_t fsid;
};

+struct fanotify_event_info_location {
+ struct fanotify_event_info_header hdr;
+ int line;
+ char function[0];
+};
+
struct fanotify_response {
__s32 fd;
__u32 response;
--
2.31.0

2021-04-26 18:45:34

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 12/15] fanotify: Introduce the FAN_ERROR mark

The FAN_ERROR mark is used by filesystem wide monitoring tools to
receive notifications of type FS_ERROR_EVENT, emited by filesystems when
a problem is detected. The error notification includes a generic error
descriptor, an optional location record and a filesystem specific blob.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/notify/fanotify/fanotify.c | 48 +++++++++++++++++++----
fs/notify/fanotify/fanotify.h | 8 ++++
fs/notify/fanotify/fanotify_user.c | 63 ++++++++++++++++++++++++++++++
include/linux/fanotify.h | 9 ++++-
include/uapi/linux/fanotify.h | 2 +
5 files changed, 120 insertions(+), 10 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 98591a8155a7..6bae23d42e5e 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -240,12 +240,14 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group,
__func__, iter_info->report_mask, event_mask, data, data_type);

if (!fid_mode) {
- /* Do we have path to open a file descriptor? */
- if (!path)
- return 0;
- /* Path type events are only relevant for files and dirs */
- if (!d_is_reg(path->dentry) && !d_can_lookup(path->dentry))
- return 0;
+ if (!fanotify_is_error_event(event_mask)) {
+ /* Do we have path to open a file descriptor? */
+ if (!path)
+ return 0;
+ /* Path type events are only relevant for files and dirs */
+ if (!d_is_reg(path->dentry) && !d_can_lookup(path->dentry))
+ return 0;
+ }
} else if (!(fid_mode & FAN_REPORT_FID)) {
/* Do we have a directory inode to report? */
if (!dir && !(event_mask & FS_ISDIR))
@@ -458,6 +460,25 @@ static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
return &pevent->fae;
}

+static void fanotify_init_error_event(struct fanotify_event *fae,
+ const struct fs_error_report *report,
+ __kernel_fsid_t *fsid)
+{
+ struct fanotify_error_event *fee;
+
+ fae->type = FANOTIFY_EVENT_TYPE_ERROR;
+ fee = FANOTIFY_EE(fae);
+ fee->error = report->error;
+ fee->fsid = *fsid;
+
+ fee->loc.line = report->line;
+ fee->loc.function = report->function;
+
+ fee->fs_data_size = report->fs_data_size;
+
+ memcpy(&fee->fs_data, report->fs_data, report->fs_data_size);
+}
+
static struct fanotify_event *fanotify_alloc_fid_event(struct inode *id,
__kernel_fsid_t *fsid,
gfp_t gfp)
@@ -618,6 +639,13 @@ static struct fanotify_event *fanotify_ring_get_slot(struct fsnotify_group *grou
{
size_t size = 0;

+ if (fanotify_is_error_event(mask)) {
+ const struct fs_error_report *report = data;
+ size = sizeof(struct fanotify_error_event) + report->fs_data_size;
+ } else {
+ return ERR_PTR(-EINVAL);
+ }
+
pr_debug("%s: group=%p mask=%x size=%lu\n", __func__, group, mask, size);

return FANOTIFY_E(fsnotify_ring_alloc_event_slot(group, size));
@@ -629,6 +657,9 @@ static void fanotify_ring_write_event(struct fsnotify_group *group,
{
fanotify_init_event(group, event, 0, mask);

+ if (fanotify_is_error_event(mask))
+ fanotify_init_error_event(event, data, fsid);
+
event->pid = get_pid(task_tgid(current));
}

@@ -695,8 +726,9 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
BUILD_BUG_ON(FAN_OPEN_EXEC != FS_OPEN_EXEC);
BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
+ BUILD_BUG_ON(FAN_ERROR != FS_ERROR);

- BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19);
+ BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 20);

mask = fanotify_group_event_mask(group, iter_info, mask, data,
data_type, dir);
@@ -714,7 +746,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
return 0;
}

- if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS)) {
+ if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS) || mask == FAN_ERROR) {
fsid = fanotify_get_fsid(iter_info);
/* Racing with mark destruction or creation? */
if (!fsid.val[0] && !fsid.val[1])
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 0d1b4cb8b005..097667be9079 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -135,6 +135,7 @@ enum fanotify_event_type {
FANOTIFY_EVENT_TYPE_PATH,
FANOTIFY_EVENT_TYPE_PATH_PERM,
FANOTIFY_EVENT_TYPE_OVERFLOW, /* struct fanotify_event */
+ FANOTIFY_EVENT_TYPE_ERROR,
};

struct fanotify_event {
@@ -207,6 +208,8 @@ static inline __kernel_fsid_t *fanotify_event_fsid(struct fanotify_event *event)
return &FANOTIFY_FE(event)->fsid;
else if (event->type == FANOTIFY_EVENT_TYPE_FID_NAME)
return &FANOTIFY_NE(event)->fsid;
+ else if (event->type == FANOTIFY_EVENT_TYPE_ERROR)
+ return &FANOTIFY_EE(event)->fsid;
else
return NULL;
}
@@ -292,6 +295,11 @@ static inline struct fanotify_event *FANOTIFY_E(struct fsnotify_event *fse)
return container_of(fse, struct fanotify_event, fse);
}

+static inline bool fanotify_is_error_event(u32 mask)
+{
+ return mask & FANOTIFY_ERROR_EVENTS;
+}
+
static inline bool fanotify_event_has_path(struct fanotify_event *event)
{
return event->type == FANOTIFY_EVENT_TYPE_PATH ||
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index e2f4599dfc25..6270083bee36 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -96,6 +96,24 @@ static size_t fanotify_event_len(struct fanotify_event *event,
int fh_len;
int dot_len = 0;

+ if (fanotify_is_error_event(event->mask)) {
+ struct fanotify_error_event *fee = FANOTIFY_EE(event);
+ /*
+ * Error events (FAN_ERROR) have a different format
+ * as follows:
+ * [ event_metadata ]
+ * [ fs-generic error header ]
+ * [ error location (optional) ]
+ * [ fs-specific blob ]
+ */
+ event_len = fanotify_error_info_len(fee);
+ if (fee->loc.function)
+ event_len += fanotify_location_info_len(&fee->loc);
+ if (fee->fs_data)
+ event_len += fanotify_error_fsdata_len(fee);
+ return event_len;
+ }
+
if (!fid_mode)
return event_len;

@@ -322,6 +340,38 @@ static ssize_t copy_error_fsdata_info_to_user(struct fanotify_error_event *fee,
return info.hdr.len;
}

+static int copy_error_event_to_user(struct fanotify_event *event,
+ char __user *buf, int count)
+{
+ struct fanotify_error_event *fee = FANOTIFY_EE(event);
+ ssize_t len;
+ int original_count = count;
+
+ len = copy_error_info_to_user(fee, buf, count);
+ if (len < 0)
+ return -EFAULT;
+ buf += len;
+ count -= len;
+
+ if (fee->loc.function) {
+ len = copy_location_info_to_user(&fee->loc, buf, count);
+ if (len < 0)
+ return len;
+ buf += len;
+ count -= len;
+ }
+
+ if (fee->fs_data_size) {
+ len = copy_error_fsdata_info_to_user(fee, buf, count);
+ if (len < 0)
+ return len;
+ buf += len;
+ count -= len;
+ }
+
+ return original_count - count;
+}
+
static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
int info_type, const char *name, size_t name_len,
char __user *buf, size_t count)
@@ -528,6 +578,9 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
count -= ret;
}

+ if (fanotify_is_error_event(event->mask))
+ ret = copy_error_event_to_user(event, buf, count);
+
return metadata.event_len;

out_close_fd:
@@ -1328,6 +1381,10 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
(mask & ~FANOTIFY_SUBMISSION_BUFFER_EVENTS))
goto fput_and_out;

+ if (fanotify_is_error_event(mask) &&
+ !(group->flags & FSN_SUBMISSION_RING_BUFFER))
+ goto fput_and_out;
+
ret = fanotify_find_path(dfd, pathname, &path, flags,
(mask & ALL_FSNOTIFY_EVENTS), obj_type);
if (ret)
@@ -1350,6 +1407,12 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,

fsid = &__fsid;
}
+ if (mask & FAN_ERROR) {
+ ret = fanotify_check_path_fsid(&path, &__fsid);
+ if (ret)
+ goto path_put_and_out;
+ fsid = &__fsid;
+ }

/* inode held in place by reference to path; group by fget on fd */
if (mark_type == FAN_MARK_INODE)
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 5a4cefb4b1c3..e08be5fae14a 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -56,9 +56,13 @@
#define FANOTIFY_INODE_EVENTS (FANOTIFY_DIRENT_EVENTS | \
FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)

+#define FANOTIFY_ERROR_EVENTS (FAN_ERROR)
+
/* Events that user can request to be notified on */
#define FANOTIFY_EVENTS (FANOTIFY_PATH_EVENTS | \
- FANOTIFY_INODE_EVENTS)
+ FANOTIFY_INODE_EVENTS | \
+ FANOTIFY_ERROR_EVENTS)
+

/* Events that require a permission response from user */
#define FANOTIFY_PERM_EVENTS (FAN_OPEN_PERM | FAN_ACCESS_PERM | \
@@ -70,9 +74,10 @@
/* Events that may be reported to user */
#define FANOTIFY_OUTGOING_EVENTS (FANOTIFY_EVENTS | \
FANOTIFY_PERM_EVENTS | \
+ FANOTIFY_ERROR_EVENTS | \
FAN_Q_OVERFLOW | FAN_ONDIR)

-#define FANOTIFY_SUBMISSION_BUFFER_EVENTS 0
+#define FANOTIFY_SUBMISSION_BUFFER_EVENTS FANOTIFY_ERROR_EVENTS

#define ALL_FANOTIFY_EVENT_BITS (FANOTIFY_OUTGOING_EVENTS | \
FANOTIFY_EVENT_FLAGS)
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 49808c857ee1..ee0ae8b1e20b 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -25,6 +25,8 @@
#define FAN_ACCESS_PERM 0x00020000 /* File accessed in perm check */
#define FAN_OPEN_EXEC_PERM 0x00040000 /* File open/exec in perm check */

+#define FAN_ERROR 0x00100000 /* Filesystem error */
+
#define FAN_EVENT_ON_CHILD 0x08000000 /* Interested in child events */

#define FAN_ONDIR 0x40000000 /* Event occurred against dir */
--
2.31.0

2021-04-26 18:45:44

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 13/15] ext4: Send notifications on error

Send a FS_ERROR message via fsnotify to a userspace monitoring tool
whenever a ext4 error condition is triggered. This follows the existing
error conditions in ext4, so it is hooked to the ext4_error* functions.

It also follows the current dmesg reporting in the format. The
filesystem message is composed mostly by the string that would be
otherwise printed in dmesg.

A new ext4 specific record format is exposed in the uapi, such that a
monitoring tool knows what to expect when listening errors of an ext4
filesystem.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/super.c | 60 ++++++++++++++++++++++++--------
include/uapi/linux/ext4-notify.h | 17 +++++++++
2 files changed, 62 insertions(+), 15 deletions(-)
create mode 100644 include/uapi/linux/ext4-notify.h

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9693680463a..032e29e7ff6a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -46,6 +46,8 @@
#include <linux/part_stat.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
+#include <linux/fsnotify.h>
+#include <uapi/linux/ext4-notify.h>

#include "ext4.h"
#include "ext4_extents.h" /* Needed for trace points definition */
@@ -727,6 +729,22 @@ static void flush_stashed_error_work(struct work_struct *work)
ext4_commit_super(sbi->s_sb);
}

+static void ext4_fsnotify_error(int error, struct inode *inode, __u64 block,
+ const char *func, int line,
+ const char *desc, struct va_format *vaf)
+{
+ struct ext4_error_inode_report report;
+
+ if (inode->i_sb->s_fsnotify_marks) {
+ report.inode = inode ? inode->i_ino : -1L;
+ report.block = block ? block : -1L;
+
+ snprintf(report.desc, EXT4_FSN_DESC_LEN, "%s%pV\n", desc?:"", vaf);
+
+ fsnotify_error_event(error, inode, func, line, &report, sizeof(report));
+ }
+}
+
#define ext4_error_ratelimit(sb) \
___ratelimit(&(EXT4_SB(sb)->s_err_ratelimit_state), \
"EXT4-fs error")
@@ -742,15 +760,18 @@ void __ext4_error(struct super_block *sb, const char *function,
return;

trace_ext4_error(sb, function, line);
+
+ va_start(args, fmt);
+ vaf.fmt = fmt;
+ vaf.va = &args;
if (ext4_error_ratelimit(sb)) {
- va_start(args, fmt);
- vaf.fmt = fmt;
- vaf.va = &args;
printk(KERN_CRIT
"EXT4-fs error (device %s): %s:%d: comm %s: %pV\n",
sb->s_id, function, line, current->comm, &vaf);
- va_end(args);
+
}
+ ext4_fsnotify_error(error, sb->s_root->d_inode, block, function, line, NULL, &vaf);
+ va_end(args);
ext4_handle_error(sb, force_ro, error, 0, block, function, line);
}

@@ -765,10 +786,10 @@ void __ext4_error_inode(struct inode *inode, const char *function,
return;

trace_ext4_error(inode->i_sb, function, line);
+ va_start(args, fmt);
+ vaf.fmt = fmt;
+ vaf.va = &args;
if (ext4_error_ratelimit(inode->i_sb)) {
- va_start(args, fmt);
- vaf.fmt = fmt;
- vaf.va = &args;
if (block)
printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: "
"inode #%lu: block %llu: comm %s: %pV\n",
@@ -779,8 +800,11 @@ void __ext4_error_inode(struct inode *inode, const char *function,
"inode #%lu: comm %s: %pV\n",
inode->i_sb->s_id, function, line, inode->i_ino,
current->comm, &vaf);
- va_end(args);
}
+
+ ext4_fsnotify_error(error, inode, block, function, line, NULL, &vaf);
+ va_end(args);
+
ext4_handle_error(inode->i_sb, false, error, inode->i_ino, block,
function, line);
}
@@ -798,13 +822,16 @@ void __ext4_error_file(struct file *file, const char *function,
return;

trace_ext4_error(inode->i_sb, function, line);
+
+ path = file_path(file, pathname, sizeof(pathname));
+ if (IS_ERR(path))
+ path = "(unknown)";
+
+ va_start(args, fmt);
+ vaf.fmt = fmt;
+ vaf.va = &args;
+
if (ext4_error_ratelimit(inode->i_sb)) {
- path = file_path(file, pathname, sizeof(pathname));
- if (IS_ERR(path))
- path = "(unknown)";
- va_start(args, fmt);
- vaf.fmt = fmt;
- vaf.va = &args;
if (block)
printk(KERN_CRIT
"EXT4-fs error (device %s): %s:%d: inode #%lu: "
@@ -817,8 +844,10 @@ void __ext4_error_file(struct file *file, const char *function,
"comm %s: path %s: %pV\n",
inode->i_sb->s_id, function, line, inode->i_ino,
current->comm, path, &vaf);
- va_end(args);
}
+ ext4_fsnotify_error(EFSCORRUPTED, inode, block, function, line, NULL, &vaf);
+ va_end(args);
+
ext4_handle_error(inode->i_sb, false, EFSCORRUPTED, inode->i_ino, block,
function, line);
}
@@ -886,6 +915,7 @@ void __ext4_std_error(struct super_block *sb, const char *function,
printk(KERN_CRIT "EXT4-fs error (device %s) in %s:%d: %s\n",
sb->s_id, function, line, errstr);
}
+ ext4_fsnotify_error(errno, NULL, -1L, function, line, errstr, NULL);

ext4_handle_error(sb, false, -errno, 0, 0, function, line);
}
diff --git a/include/uapi/linux/ext4-notify.h b/include/uapi/linux/ext4-notify.h
new file mode 100644
index 000000000000..31a3bbcafd13
--- /dev/null
+++ b/include/uapi/linux/ext4-notify.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+/*
+ * Copyright 2021, Collabora Ltd.
+ */
+
+#ifndef EXT4_NOTIFY_H
+#define EXT4_NOTIFY_H
+
+#define EXT4_FSN_DESC_LEN 256
+
+struct ext4_error_inode_report {
+ u64 inode;
+ u64 block;
+ char desc[EXT4_FSN_DESC_LEN];
+};
+
+#endif
--
2.31.0

2021-04-26 18:47:32

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 15/15] Documentation: Document the FAN_ERROR framework

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
.../admin-guide/filesystem-monitoring.rst | 103 ++++++++++++++++++
Documentation/admin-guide/index.rst | 1 +
2 files changed, 104 insertions(+)
create mode 100644 Documentation/admin-guide/filesystem-monitoring.rst

diff --git a/Documentation/admin-guide/filesystem-monitoring.rst b/Documentation/admin-guide/filesystem-monitoring.rst
new file mode 100644
index 000000000000..e19bf792dd7a
--- /dev/null
+++ b/Documentation/admin-guide/filesystem-monitoring.rst
@@ -0,0 +1,103 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+File system Monitoring with fanotify
+====================================
+
+fanotify supports the FAN_ERROR mark for file system-wide error
+reporting. It is meant to be used by file system health monitoring
+daemons who listen on that interface and take actions (notify sysadmin,
+start recovery) when a file system problem is detected by the kernel.
+
+By design, A FAN_ERROR notification exposes sufficient information for a
+monitoring tool to map a problem to specific region of the file system
+or its code and trigger recovery procedures. It doesn't necessarily
+provide a user space application with semantics to verify an IO
+operation was successfully executed. That is outside of scope of this
+feature. Instead, it is only meant as a framework for early file system
+problem detection and reporting recovery tools.
+
+At the time of this writing, the only file system that emits this
+FAN_ERROR notifications is ext4.
+
+An example code for ext4 is provided at ``samples/fanotify/fs-monitor.c``.
+
+Usage
+=====
+
+In order to guarantee notification delivery on different error
+conditions, FAN_ERROR requires the fanotify group to be created with
+FAN_PREALLOC_QUEUE. This means a group that emits FAN_ERROR
+notifications currently cannot be reused for any other kind of
+notification.
+
+To setup a group for error notification::
+
+ fanotify_init(FAN_CLASS_NOTIF | FAN_PREALLOC_QUEUE, O_RDONLY);
+
+Then, enable the FAN_ERROR mark on a specific path::
+
+ fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM, FAN_ERROR, AT_FDCWD, "/mnt");
+
+Notification structure
+======================
+
+A FAN_ERROR Notification has the following format::
+
+ [ Notification Metadata (Mandatory) ]
+ [ Generic Error Record (Mandatory) ]
+ [ Error Location Record (Optional) ]
+ [ FS-Specific Record (Optional) ]
+
+With the exception of the notification metadata and the generic
+information, all information records are optional. Each record type is
+identified by its unique ``struct fanotify_event_info_header.info_type``.
+
+Generic error Location
+----------------------
+
+The Generic error record provides enough information for a file system
+agnostic tool to learn about a problem in the file system, without
+requiring any details about the problem.::
+
+ struct fanotify_event_info_error {
+ struct fanotify_event_info_header hdr; /* info_type = FAN_EVENT_INFO_TYPE_ERROR */
+ int version;
+ int error;
+ __kernel_fsid_t fsid;
+ };
+
+Error Location Record
+---------------------
+
+Error location is required by some use cases to easily associate an
+error with a specific line of code. Not every user case requires it and
+they might not be emitted for different file systems.
+
+Notice this field is variable length, but its size is found in ```hdr.len```.::
+
+ struct fanotify_event_info_location {
+ struct fanotify_event_info_header hdr; /* info_type = FAN_EVENT_INFO_TYPE_LOCATION */
+ int line;
+ char function[0];
+ };
+
+File system specific Record
+---------------------------
+
+The file system specific record attempts to provide file system specific
+tools with enough information to uniquely identify the problem and
+hopefully recover from it.
+
+Since each file system defines its own specific data, this record is
+composed by a header, followed by a data blob, that is defined by each
+file system. Review the file system documentation for more information.
+
+While the hdr.info_type identifies the presence of this field,
+``hdr.len`` field identifies the length of the file system specific
+structure following the header.::
+
+ struct fanotify_event_info_fsdata {
+ struct fanotify_event_info_header hdr; /* info_type = FAN_EVENT_INFO_TYPE_FSDATA */
+ struct data[0];
+ };
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 423116c4e787..a0d1bf76629f 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -83,6 +83,7 @@ configure specific aspects of kernel behavior to your liking.
edid
efi-stub
ext4
+ filesystem-monitoring
nfs/index
gpio/index
highuid
--
2.31.0

2021-04-27 04:13:25

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] File system wide monitoring

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Hi,
>
> In an attempt to consolidate some of the feedback from the previous
> proposals, I wrote a new attempt to solve the file system error reporting
> problem. Before I spend more time polishing it, I'd like to hear your
> feedback if I'm going in the wrong direction, in particular with the
> modifications to fsnotify.
>

IMO you are going in the right direction, but you have gone a bit too far ;-)

My understanding of the requirements and my interpretation of the feedback
from filesystem maintainers is that the missing piece in the ecosystem is a
user notification that "something went wrong". The "what went wrong" part
is something that users and admins have long been able to gather from the
kernel log and from filesystem tools (e.g. last error recorded).

I do not see the need to duplicate existing functionality in fsmonitor.
Don't get me wrong, I understand why it would have been nice for fsmonitor
to be able to get all the errors nicely without looking anywhere else, but I
don't think it justifies the extra complexity.

> This RFC follows up on my previous proposals which attempted to leverage
> watch_queue[1] and fsnotify[2] to provide a mechanism for file systems
> to push error notifications to user space. This proposal starts by, as
> suggested by Darrick, limiting the scope of what I'm trying to do to an
> interface for administrators to monitor the health of a file system,
> instead of a generic inteface for file errors. Therefore, this doesn't
> solve the problem of writeback errors or the need to watch a specific
> subsystem.
>
> * Format
>
> The feature is implemented on top of fanotify, as a new type of fanotify
> mark, FAN_ERROR, which a file system monitoring tool can register to

You have a terminology mistake throughout your series.
FAN_ERROR is not a type of a mark, it is a type of an event.
A mark describes the watched object (i.e. a filesystem, mount, inode).

> receive notifications. A notification is split in three parts, and only
> the first is guaranteed to exist for any given error event:
>
> - FS generic data: A file system agnostic structure that has a generic
> error code and identifies the filesystem. Basically, it let's
> userspace know something happen on a monitored filesystem.

I think an error seq counter per fs would be a nice addition to generic data.
It does not need to be persistent (it could be if filesystem supports it).

>
> - FS location data: Identifies where in the code the problem
> happened. (This is important for the use case of analysing frequent
> error points that we discussed earlier).
>
> - FS specific data: A detailed error report in a filesystem specific
> format that details what the error is. Ideally, a capable monitoring
> tool can use the information here for error recovery. For instance,
> xfs can put the xfs_scrub structures here, ext4 can send its error
> reports, etc. An example of usage is done in the ext4 patch of this
> series.
>
> More details on the information in each record can be found on the
> documentation introduced in patch 15.
>
> * Using fanotify
>
> Using fanotify for this kind of thing is slightly tricky because we want
> to guarantee delivery in some complicated conditions, for instance, the
> file system might want to send an error while holding several locks.
>
> Instead of working around file system constraints at the file system
> level, this proposal tries to make the FAN_ERROR submission safe in
> those contexts. This is done with a new mode in fsnotify that
> preallocates the memory at group creation to be used for the
> notification submission.
>
> This new mode in fsnotify introduces a ring buffer to queue
> notifications, which eliminates the allocation path in fsnotify. From
> what I saw, the allocation is the only problem in fsnotify for
> filesystems to submit errors in constrained situations.
>

The ring buffer functionality for fsnotify is interesting and it may be
useful on its own, but IMO, its too big of a hammer for the problem
at hand.

The question that you should be asking yourself is what is the
expected behavior in case of a flood of filesystem corruption errors.
I think it has already been expressed by filesystem maintainers on
one your previous postings, that a flood of filesystem corruption
errors is often noise and the only interesting information is the first error.

For this reason, I think that FS_ERROR could be implemented
by attaching an fsnotify_error_info object to an fsnotify_sb_mark:

struct fsnotify_sb_mark {
struct fsnotify_mark fsn_mark;
struct fsnotify_error_info info;
}

Similar to fd sampled errseq, there can be only one error report
per sb-group pair (i.e. fsnotify_sb_mark) and the memory needed to store
the error report can be allocated at the time of setting the filesystem mark.

With this, you will not need the added complexity of the ring buffer
and you will not need to limit FAN_ERROR reporting to a group that
is only listening for FAN_ERROR, which is an unneeded limitation IMO.

Anyway, in case, others do like the ring buffer approach, I do have
some technical comments on the implementation.
I will comment on individual patches.

Thanks,
Amir.

> * Visibility
>
> Since the usecase is limited to a tool for whole file system monitoring,
> errors are associated with the superblock and visible filesystem-wide.
> It is assumed and required that userspace has CAP_SYS_ADMIN.
>
> * Testing
>
> This was tested with corrupted ext4 images in a few scenarios, which
> caused errors to be triggered and monitored with the sample tool
> provided in the next to final patch.
>
> * patches
>
> Patches 1-4 massage fanotify attempt to refactor fanotify a bit for
> the patches to come. Patch 5 introduce the ring buffer interface to
> fsnotify, while patch 6 enable this support in fanotify. Patch 7, 8 wire
> the FS_ERROR event type, which will be used by filesystems. In
> sequennce, patches 9-12 implement the FAN_ERROR record types and create
> the new event. Patch 13 is an ext4 example implementation supporting
> this feature. Finally, patches 14 and 15 document and provide examples
> of a userspace tool that uses this feature.
>
> I also pushed the full series to:
>
> https://gitlab.collabora.com/krisman/linux -b fanotify-notifications
>
> [1] https://lwn.net/Articles/839310/
> [2] https://www.spinics.net/lists/linux-fsdevel/msg187075.html
>
> Gabriel Krisman Bertazi (15):
> fanotify: Fold event size calculation to its own function
> fanotify: Split fsid check from other fid mode checks
> fsnotify: Wire flags field on group allocation
> fsnotify: Wire up group information on event initialization
> fsnotify: Support event submission through ring buffer
> fanotify: Support submission through ring buffer
> fsnotify: Support FS_ERROR event type
> fsnotify: Introduce helpers to send error_events
> fanotify: Introduce generic error record
> fanotify: Introduce code location record
> fanotify: Introduce filesystem specific data record
> fanotify: Introduce the FAN_ERROR mark
> ext4: Send notifications on error
> samples: Add fs error monitoring example
> Documentation: Document the FAN_ERROR framework
>
> .../admin-guide/filesystem-monitoring.rst | 103 ++++++
> Documentation/admin-guide/index.rst | 1 +
> fs/ext4/super.c | 60 +++-
> fs/notify/Makefile | 2 +-
> fs/notify/dnotify/dnotify.c | 2 +-
> fs/notify/fanotify/fanotify.c | 127 +++++--
> fs/notify/fanotify/fanotify.h | 35 +-
> fs/notify/fanotify/fanotify_user.c | 319 ++++++++++++++----
> fs/notify/fsnotify.c | 2 +-
> fs/notify/group.c | 25 +-
> fs/notify/inotify/inotify_fsnotify.c | 2 +-
> fs/notify/inotify/inotify_user.c | 4 +-
> fs/notify/notification.c | 10 +
> fs/notify/ring.c | 199 +++++++++++
> include/linux/fanotify.h | 12 +-
> include/linux/fsnotify.h | 15 +
> include/linux/fsnotify_backend.h | 63 +++-
> include/uapi/linux/ext4-notify.h | 17 +
> include/uapi/linux/fanotify.h | 26 ++
> kernel/audit_fsnotify.c | 2 +-
> kernel/audit_tree.c | 2 +-
> kernel/audit_watch.c | 2 +-
> samples/Kconfig | 7 +
> samples/Makefile | 1 +
> samples/fanotify/Makefile | 3 +
> samples/fanotify/fs-monitor.c | 135 ++++++++
> 26 files changed, 1034 insertions(+), 142 deletions(-)
> create mode 100644 Documentation/admin-guide/filesystem-monitoring.rst
> create mode 100644 fs/notify/ring.c
> create mode 100644 include/uapi/linux/ext4-notify.h
> create mode 100644 samples/fanotify/Makefile
> create mode 100644 samples/fanotify/fs-monitor.c
>
> --
> 2.31.0
>

2021-04-27 04:33:40

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 13/15] ext4: Send notifications on error

On Mon, Apr 26, 2021 at 9:43 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Send a FS_ERROR message via fsnotify to a userspace monitoring tool
> whenever a ext4 error condition is triggered. This follows the existing
> error conditions in ext4, so it is hooked to the ext4_error* functions.
>
> It also follows the current dmesg reporting in the format. The
> filesystem message is composed mostly by the string that would be
> otherwise printed in dmesg.
>
> A new ext4 specific record format is exposed in the uapi, such that a
> monitoring tool knows what to expect when listening errors of an ext4
> filesystem.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/ext4/super.c | 60 ++++++++++++++++++++++++--------
> include/uapi/linux/ext4-notify.h | 17 +++++++++
> 2 files changed, 62 insertions(+), 15 deletions(-)
> create mode 100644 include/uapi/linux/ext4-notify.h
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b9693680463a..032e29e7ff6a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -46,6 +46,8 @@
> #include <linux/part_stat.h>
> #include <linux/kthread.h>
> #include <linux/freezer.h>
> +#include <linux/fsnotify.h>
> +#include <uapi/linux/ext4-notify.h>
>
> #include "ext4.h"
> #include "ext4_extents.h" /* Needed for trace points definition */
> @@ -727,6 +729,22 @@ static void flush_stashed_error_work(struct work_struct *work)
> ext4_commit_super(sbi->s_sb);
> }
>
> +static void ext4_fsnotify_error(int error, struct inode *inode, __u64 block,
> + const char *func, int line,
> + const char *desc, struct va_format *vaf)
> +{
> + struct ext4_error_inode_report report;
> +
> + if (inode->i_sb->s_fsnotify_marks) {
> + report.inode = inode ? inode->i_ino : -1L;
> + report.block = block ? block : -1L;
> +
> + snprintf(report.desc, EXT4_FSN_DESC_LEN, "%s%pV\n", desc?:"", vaf);
> +
> + fsnotify_error_event(error, inode, func, line, &report, sizeof(report));
> + }
> +}
> +
> #define ext4_error_ratelimit(sb) \
> ___ratelimit(&(EXT4_SB(sb)->s_err_ratelimit_state), \
> "EXT4-fs error")
> @@ -742,15 +760,18 @@ void __ext4_error(struct super_block *sb, const char *function,
> return;
>
> trace_ext4_error(sb, function, line);
> +
> + va_start(args, fmt);
> + vaf.fmt = fmt;
> + vaf.va = &args;
> if (ext4_error_ratelimit(sb)) {
> - va_start(args, fmt);
> - vaf.fmt = fmt;
> - vaf.va = &args;
> printk(KERN_CRIT
> "EXT4-fs error (device %s): %s:%d: comm %s: %pV\n",
> sb->s_id, function, line, current->comm, &vaf);
> - va_end(args);
> +
> }
> + ext4_fsnotify_error(error, sb->s_root->d_inode, block, function, line, NULL, &vaf);
> + va_end(args);
> ext4_handle_error(sb, force_ro, error, 0, block, function, line);
> }
>

So error reporting to kernel log is ratelimited and error reporting to
fsnotify is limited by a fixed size ring buffer which may be filled by
report floods from another filesystem, so user can miss the first
important error report from this filesystem.

Not optimal.

With my proposal of keeping a single fsnotify_error_info in every
fsnotify_sb_mark, users will be guaranteed to get the first error
report from every filesystem and once they read that report they
will be guaranteed to also get the next report.

Thanks,
Amir.

2021-04-27 04:43:06

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 01/15] fanotify: Fold event size calculation to its own function

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Every time this function is invoked, it is immediately added to
> FAN_EVENT_METADATA_LEN, since there is no need to just calculate the
> length of info records. This minor clean up folds the rest of the
> calculation into the function, which now operates in terms of events,
> returning the size of the entire event, including metadata.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

Nice

Reviewed-by: Amir Goldstein <[email protected]>

> ---
> fs/notify/fanotify/fanotify_user.c | 40 +++++++++++++++++-------------
> 1 file changed, 23 insertions(+), 17 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 9e0c1afac8bd..0332c4afeec3 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -64,17 +64,24 @@ static int fanotify_fid_info_len(int fh_len, int name_len)
> return roundup(FANOTIFY_INFO_HDR_LEN + info_len, FANOTIFY_EVENT_ALIGN);
> }
>
> -static int fanotify_event_info_len(unsigned int fid_mode,
> - struct fanotify_event *event)
> +static size_t fanotify_event_len(struct fanotify_event *event,
> + unsigned int fid_mode)
> {
> - struct fanotify_info *info = fanotify_event_info(event);
> - int dir_fh_len = fanotify_event_dir_fh_len(event);
> - int fh_len = fanotify_event_object_fh_len(event);
> - int info_len = 0;
> + size_t event_len = FAN_EVENT_METADATA_LEN;
> + struct fanotify_info *info;
> + int dir_fh_len;
> + int fh_len;
> int dot_len = 0;
>
> + if (!fid_mode)
> + return event_len;
> +
> + info = fanotify_event_info(event);
> + dir_fh_len = fanotify_event_dir_fh_len(event);
> + fh_len = fanotify_event_object_fh_len(event);
> +
> if (dir_fh_len) {
> - info_len += fanotify_fid_info_len(dir_fh_len, info->name_len);
> + event_len += fanotify_fid_info_len(dir_fh_len, info->name_len);
> } else if ((fid_mode & FAN_REPORT_NAME) && (event->mask & FAN_ONDIR)) {
> /*
> * With group flag FAN_REPORT_NAME, if name was not recorded in
> @@ -84,9 +91,9 @@ static int fanotify_event_info_len(unsigned int fid_mode,
> }
>
> if (fh_len)
> - info_len += fanotify_fid_info_len(fh_len, dot_len);
> + event_len += fanotify_fid_info_len(fh_len, dot_len);
>
> - return info_len;
> + return event_len;
> }
>
> /*
> @@ -98,7 +105,8 @@ static int fanotify_event_info_len(unsigned int fid_mode,
> static struct fanotify_event *get_one_event(struct fsnotify_group *group,
> size_t count)
> {
> - size_t event_size = FAN_EVENT_METADATA_LEN;
> + size_t event_size;
> + struct fsnotify_event *fse;
> struct fanotify_event *event = NULL;
> unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
>
> @@ -108,16 +116,15 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
> if (fsnotify_notify_queue_is_empty(group))
> goto out;
>
> - if (fid_mode) {
> - event_size += fanotify_event_info_len(fid_mode,
> - FANOTIFY_E(fsnotify_peek_first_event(group)));
> - }
> + fse = fsnotify_peek_first_event(group);
> + event = FANOTIFY_E(fse);
> + event_size = fanotify_event_len(event, fid_mode);
>
> if (event_size > count) {
> event = ERR_PTR(-EINVAL);
> goto out;
> }
> - event = FANOTIFY_E(fsnotify_remove_first_event(group));
> + fsnotify_remove_queued_event(group, fse);
> if (fanotify_is_perm_event(event->mask))
> FANOTIFY_PERM(event)->state = FAN_EVENT_REPORTED;
> out:
> @@ -334,8 +341,7 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
>
> pr_debug("%s: group=%p event=%p\n", __func__, group, event);
>
> - metadata.event_len = FAN_EVENT_METADATA_LEN +
> - fanotify_event_info_len(fid_mode, event);
> + metadata.event_len = fanotify_event_len(event, fid_mode);
> metadata.metadata_len = FAN_EVENT_METADATA_LEN;
> metadata.vers = FANOTIFY_METADATA_VERSION;
> metadata.reserved = 0;
> --
> 2.31.0
>

2021-04-27 04:54:17

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 02/15] fanotify: Split fsid check from other fid mode checks

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> FAN_ERROR will require fsid, but not necessarily require the filesystem
> to expose a file handle. Split those checks into different functions, so
> they can be used separately when creating a mark.

Ok for the split, but...

>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/notify/fanotify/fanotify_user.c | 35 +++++++++++++++++++-----------
> 1 file changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 0332c4afeec3..e0d113e3b65c 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1055,7 +1055,23 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> }
>
> /* Check if filesystem can encode a unique fid */
> -static int fanotify_test_fid(struct path *path, __kernel_fsid_t *fsid)
> +static int fanotify_test_fid(struct path *path)

This helper can take a dentry.

> +{
> + /*
> + * We need to make sure that the file system supports at least
> + * encoding a file handle so user can use name_to_handle_at() to
> + * compare fid returned with event to the file handle of watched
> + * objects. However, name_to_handle_at() requires that the
> + * filesystem also supports decoding file handles.
> + */
> + if (!path->dentry->d_sb->s_export_op ||
> + !path->dentry->d_sb->s_export_op->fh_to_dentry)
> + return -EOPNOTSUPP;
> +
> + return 0;
> +}
> +
> +static int fanotify_check_path_fsid(struct path *path, __kernel_fsid_t *fsid)

And so does this helper.
I certainly don't see the need for the _path_ in the helper name.

> {
> __kernel_fsid_t root_fsid;
> int err;
> @@ -1082,17 +1098,6 @@ static int fanotify_test_fid(struct path *path, __kernel_fsid_t *fsid)
> root_fsid.val[1] != fsid->val[1])
> return -EXDEV;
>
> - /*
> - * We need to make sure that the file system supports at least
> - * encoding a file handle so user can use name_to_handle_at() to
> - * compare fid returned with event to the file handle of watched
> - * objects. However, name_to_handle_at() requires that the
> - * filesystem also supports decoding file handles.
> - */
> - if (!path->dentry->d_sb->s_export_op ||
> - !path->dentry->d_sb->s_export_op->fh_to_dentry)
> - return -EOPNOTSUPP;
> -
> return 0;
> }
>
> @@ -1230,7 +1235,11 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> }
>
> if (fid_mode) {
> - ret = fanotify_test_fid(&path, &__fsid);
> + ret = fanotify_check_path_fsid(&path, &__fsid);
> + if (ret)
> + goto path_put_and_out;
> +
> + ret = fanotify_test_fid(&path);

Whether _test_ or _check_ please stick to one.

Thanks,
Amir.

2021-04-27 05:05:39

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 03/15] fsnotify: Wire flags field on group allocation

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Introduce a flags field in fsnotify_group to track the mode of
> submission this group has.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/notify/dnotify/dnotify.c | 2 +-
> fs/notify/fanotify/fanotify_user.c | 4 ++--
> fs/notify/group.c | 13 ++++++++-----
> fs/notify/inotify/inotify_user.c | 2 +-
> include/linux/fsnotify_backend.h | 7 +++++--
> kernel/audit_fsnotify.c | 2 +-
> kernel/audit_tree.c | 2 +-
> kernel/audit_watch.c | 2 +-
> 8 files changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
> index e85e13c50d6d..37960c8750e4 100644
> --- a/fs/notify/dnotify/dnotify.c
> +++ b/fs/notify/dnotify/dnotify.c
> @@ -383,7 +383,7 @@ static int __init dnotify_init(void)
> SLAB_PANIC|SLAB_ACCOUNT);
> dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);
>
> - dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops);
> + dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops, 0);
> if (IS_ERR(dnotify_group))
> panic("unable to allocate fsnotify group for dnotify\n");
> return 0;
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index e0d113e3b65c..f50c4ab721e3 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -929,7 +929,7 @@ static struct fsnotify_event *fanotify_alloc_overflow_event(void)
> SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> {
> struct fsnotify_group *group;
> - int f_flags, fd;
> + int f_flags, fd, fsn_flags = 0;
> struct user_struct *user;
> unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
> unsigned int class = flags & FANOTIFY_CLASS_BITS;
> @@ -982,7 +982,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> f_flags |= O_NONBLOCK;
>
> /* fsnotify_alloc_group takes a ref. Dropped in fanotify_release */
> - group = fsnotify_alloc_user_group(&fanotify_fsnotify_ops);
> + group = fsnotify_alloc_user_group(&fanotify_fsnotify_ops, fsn_flags);
> if (IS_ERR(group)) {
> free_uid(user);
> return PTR_ERR(group);
> diff --git a/fs/notify/group.c b/fs/notify/group.c
> index ffd723ffe46d..08acb1afc0c2 100644
> --- a/fs/notify/group.c
> +++ b/fs/notify/group.c
> @@ -112,7 +112,7 @@ void fsnotify_put_group(struct fsnotify_group *group)
> EXPORT_SYMBOL_GPL(fsnotify_put_group);
>
> static struct fsnotify_group *__fsnotify_alloc_group(
> - const struct fsnotify_ops *ops, gfp_t gfp)
> + const struct fsnotify_ops *ops, unsigned int flags, gfp_t gfp)
> {
> struct fsnotify_group *group;
>
> @@ -134,6 +134,7 @@ static struct fsnotify_group *__fsnotify_alloc_group(
> INIT_LIST_HEAD(&group->marks_list);
>
> group->ops = ops;
> + group->flags = flags;
>
> return group;
> }
> @@ -141,18 +142,20 @@ static struct fsnotify_group *__fsnotify_alloc_group(
> /*
> * Create a new fsnotify_group and hold a reference for the group returned.
> */
> -struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops)
> +struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops,
> + unsigned int flags)
> {
> - return __fsnotify_alloc_group(ops, GFP_KERNEL);
> + return __fsnotify_alloc_group(ops, flags, GFP_KERNEL);
> }
> EXPORT_SYMBOL_GPL(fsnotify_alloc_group);
>
> /*
> * Create a new fsnotify_group and hold a reference for the group returned.
> */
> -struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops)
> +struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops,
> + unsigned int flags)
> {
> - return __fsnotify_alloc_group(ops, GFP_KERNEL_ACCOUNT);
> + return __fsnotify_alloc_group(ops, flags, GFP_KERNEL_ACCOUNT);
> }
> EXPORT_SYMBOL_GPL(fsnotify_alloc_user_group);
>

*IF* we go this way, note that fsnotify_alloc_group() doesn't need
flags argument.
None of the callers of fsnotify_alloc_group() ever use the
notification list, so it
would be better to pass flag FSN_NOTIFICATION_LIST from the backends that
do use it (fanotify and inotify) for the sake of symmetry with FSN_RING_BUFFER
and no need to change other callers.

Thanks,
Amir.

2021-04-27 05:43:27

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 05/15] fsnotify: Support event submission through ring buffer

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> In order to support file system health/error reporting over fanotify,
> fsnotify needs to expose a submission path that doesn't allow sleeping.
> The only problem I identified with the current submission path is the
> need to dynamically allocate memory for the event queue.
>
> This patch avoids the problem by introducing a new mode in fsnotify,
> where a ring buffer is used to submit events for a group. Each group
> has its own ring buffer, and error notifications are submitted
> exclusively through it.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/notify/Makefile | 2 +-
> fs/notify/group.c | 12 +-
> fs/notify/notification.c | 10 ++
> fs/notify/ring.c | 199 +++++++++++++++++++++++++++++++
> include/linux/fsnotify_backend.h | 37 +++++-
> 5 files changed, 255 insertions(+), 5 deletions(-)
> create mode 100644 fs/notify/ring.c
>
> diff --git a/fs/notify/Makefile b/fs/notify/Makefile
> index 63a4b8828df4..61dae1e90f2d 100644
> --- a/fs/notify/Makefile
> +++ b/fs/notify/Makefile
> @@ -1,6 +1,6 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o mark.o \
> - fdinfo.o
> + fdinfo.o ring.o
>
> obj-y += dnotify/
> obj-y += inotify/
> diff --git a/fs/notify/group.c b/fs/notify/group.c
> index 08acb1afc0c2..b99b3de36696 100644
> --- a/fs/notify/group.c
> +++ b/fs/notify/group.c
> @@ -81,7 +81,10 @@ void fsnotify_destroy_group(struct fsnotify_group *group)
> * notification against this group. So clearing the notification queue
> * of all events is reliable now.
> */
> - fsnotify_flush_notify(group);
> + if (group->flags & FSN_SUBMISSION_RING_BUFFER)
> + fsnotify_free_ring_buffer(group);
> + else
> + fsnotify_flush_notify(group);
>
> /*
> * Destroy overflow event (we cannot use fsnotify_destroy_event() as
> @@ -136,6 +139,13 @@ static struct fsnotify_group *__fsnotify_alloc_group(
> group->ops = ops;
> group->flags = flags;
>
> + if (group->flags & FSN_SUBMISSION_RING_BUFFER) {
> + if (fsnotify_create_ring_buffer(group)) {
> + kfree(group);
> + return ERR_PTR(-ENOMEM);
> + }
> + }
> +
> return group;
> }
>
> diff --git a/fs/notify/notification.c b/fs/notify/notification.c
> index 75d79d6d3ef0..32f97e7b7a80 100644
> --- a/fs/notify/notification.c
> +++ b/fs/notify/notification.c
> @@ -51,6 +51,10 @@ EXPORT_SYMBOL_GPL(fsnotify_get_cookie);
> bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group)
> {
> assert_spin_locked(&group->notification_lock);
> +
> + if (group->flags & FSN_SUBMISSION_RING_BUFFER)
> + return fsnotify_ring_notify_queue_is_empty(group);
> +
> return list_empty(&group->notification_list) ? true : false;
> }
>
> @@ -132,6 +136,9 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group,
> struct fsnotify_event *event)
> {
> assert_spin_locked(&group->notification_lock);
> +
> + if (group->flags & FSN_SUBMISSION_RING_BUFFER)
> + return;
> /*
> * We need to init list head for the case of overflow event so that
> * check in fsnotify_add_event() works
> @@ -166,6 +173,9 @@ struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group)
> {
> assert_spin_locked(&group->notification_lock);
>
> + if (group->flags & FSN_SUBMISSION_RING_BUFFER)
> + return fsnotify_ring_peek_first_event(group);
> +
> return list_first_entry(&group->notification_list,
> struct fsnotify_event, list);
> }
> diff --git a/fs/notify/ring.c b/fs/notify/ring.c
> new file mode 100644
> index 000000000000..75e8af1f8d80
> --- /dev/null
> +++ b/fs/notify/ring.c
> @@ -0,0 +1,199 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/fsnotify.h>
> +#include <linux/memcontrol.h>
> +
> +#define INVALID_RING_SLOT -1
> +
> +#define FSNOTIFY_RING_PAGES 16
> +
> +#define NEXT_SLOT(cur, len, ring_size) ((cur + len) & (ring_size-1))
> +#define NEXT_PAGE(cur, ring_size) (round_up(cur, PAGE_SIZE) & (ring_size-1))
> +
> +bool fsnotify_ring_notify_queue_is_empty(struct fsnotify_group *group)
> +{
> + assert_spin_locked(&group->notification_lock);
> +
> + if (group->ring_buffer.tail == group->ring_buffer.head)
> + return true;
> + return false;
> +}
> +
> +struct fsnotify_event *fsnotify_ring_peek_first_event(struct fsnotify_group *group)
> +{
> + u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
> + struct fsnotify_event *fsn;
> + char *kaddr;
> + u64 tail;
> +
> + assert_spin_locked(&group->notification_lock);
> +
> +again:
> + tail = group->ring_buffer.tail;
> +
> + if ((PAGE_SIZE - (tail & (PAGE_SIZE-1))) < sizeof(struct fsnotify_event)) {
> + group->ring_buffer.tail = NEXT_PAGE(tail, ring_size);
> + goto again;
> + }
> +
> + kaddr = kmap_atomic(group->ring_buffer.pages[tail / PAGE_SIZE]);
> + if (!kaddr)
> + return NULL;
> + fsn = (struct fsnotify_event *) (kaddr + (tail & (PAGE_SIZE-1)));
> +
> + if (fsn->slot_len == INVALID_RING_SLOT) {
> + group->ring_buffer.tail = NEXT_PAGE(tail, ring_size);
> + kunmap_atomic(kaddr);
> + goto again;
> + }
> +
> + /* will be unmapped when entry is consumed. */
> + return fsn;
> +}
> +
> +void fsnotify_ring_buffer_consume_event(struct fsnotify_group *group,
> + struct fsnotify_event *event)
> +{
> + u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
> + u64 new_tail = NEXT_SLOT(group->ring_buffer.tail, event->slot_len, ring_size);
> +
> + kunmap_atomic(event);
> +
> + pr_debug("%s: group=%p tail=%llx->%llx ring_size=%llu\n", __func__,
> + group, group->ring_buffer.tail, new_tail, ring_size);
> +
> + WRITE_ONCE(group->ring_buffer.tail, new_tail);
> +}
> +
> +struct fsnotify_event *fsnotify_ring_alloc_event_slot(struct fsnotify_group *group,
> + size_t size)
> + __acquires(&group->notification_lock)
> +{
> + struct fsnotify_event *fsn;
> + u64 head, tail;
> + u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
> + u64 new_head;
> + void *kaddr;
> +
> + if (WARN_ON(!(group->flags & FSN_SUBMISSION_RING_BUFFER) || size > PAGE_SIZE))
> + return ERR_PTR(-EINVAL);
> +
> + pr_debug("%s: start group=%p ring_size=%llu, requested=%lu\n", __func__, group,
> + ring_size, size);
> +
> + spin_lock(&group->notification_lock);
> +again:
> + head = group->ring_buffer.head;
> + tail = group->ring_buffer.tail;
> + new_head = NEXT_SLOT(head, size, ring_size);
> +
> + /* head would catch up to tail, corrupting an entry. */
> + if ((head < tail && new_head > tail) || (head > new_head && new_head > tail)) {
> + fsn = ERR_PTR(-ENOMEM);
> + goto err;
> + }
> +
> + /*
> + * Not event a skip message fits in the page. We can detect the
> + * lack of space. Move on to the next page.
> + */
> + if ((PAGE_SIZE - (head & (PAGE_SIZE-1))) < sizeof(struct fsnotify_event)) {
> + /* Start again on next page */
> + group->ring_buffer.head = NEXT_PAGE(head, ring_size);
> + goto again;
> + }
> +
> + kaddr = kmap_atomic(group->ring_buffer.pages[head / PAGE_SIZE]);
> + if (!kaddr) {
> + fsn = ERR_PTR(-EFAULT);
> + goto err;
> + }
> +
> + fsn = (struct fsnotify_event *) (kaddr + (head & (PAGE_SIZE-1)));
> +
> + if ((head >> PAGE_SHIFT) != (new_head >> PAGE_SHIFT)) {
> + /*
> + * No room in the current page. Add a fake entry
> + * consuming the end the page to avoid splitting event
> + * structure.
> + */
> + fsn->slot_len = INVALID_RING_SLOT;
> + kunmap_atomic(kaddr);
> + /* Start again on the next page */
> + group->ring_buffer.head = NEXT_PAGE(head, ring_size);
> +
> + goto again;
> + }
> + fsn->slot_len = size;
> +
> + return fsn;
> +
> +err:
> + spin_unlock(&group->notification_lock);
> + return fsn;
> +}
> +
> +void fsnotify_ring_commit_slot(struct fsnotify_group *group, struct fsnotify_event *fsn)
> + __releases(&group->notification_lock)
> +{
> + u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
> + u64 head = group->ring_buffer.head;
> + u64 new_head = NEXT_SLOT(head, fsn->slot_len, ring_size);
> +
> + pr_debug("%s: group=%p head=%llx->%llx ring_size=%llu\n", __func__,
> + group, head, new_head, ring_size);
> +
> + kunmap_atomic(fsn);
> + group->ring_buffer.head = new_head;
> +
> + spin_unlock(&group->notification_lock);
> +
> + wake_up(&group->notification_waitq);
> + kill_fasync(&group->fsn_fa, SIGIO, POLL_IN);
> +
> +}
> +
> +void fsnotify_free_ring_buffer(struct fsnotify_group *group)
> +{
> + int i;
> +
> + for (i = 0; i < group->ring_buffer.nr_pages; i++)
> + __free_page(group->ring_buffer.pages[i]);
> + kfree(group->ring_buffer.pages);
> + group->ring_buffer.nr_pages = 0;
> +}
> +
> +int fsnotify_create_ring_buffer(struct fsnotify_group *group)
> +{
> + int nr_pages = FSNOTIFY_RING_PAGES;
> + int i;
> +
> + pr_debug("%s: group=%p pages=%d\n", __func__, group, nr_pages);
> +
> + group->ring_buffer.pages = kmalloc_array(nr_pages, sizeof(struct pages *),
> + GFP_KERNEL);
> + if (!group->ring_buffer.pages)
> + return -ENOMEM;
> +
> + group->ring_buffer.head = 0;
> + group->ring_buffer.tail = 0;
> +
> + for (i = 0; i < nr_pages; i++) {
> + group->ring_buffer.pages[i] = alloc_pages(GFP_KERNEL, 1);
> + if (!group->ring_buffer.pages)
> + goto err_dealloc;
> + }
> +
> + group->ring_buffer.nr_pages = nr_pages;
> +
> + return 0;
> +
> +err_dealloc:
> + for (--i; i >= 0; i--)
> + __free_page(group->ring_buffer.pages[i]);
> + kfree(group->ring_buffer.pages);
> + group->ring_buffer.nr_pages = 0;
> + return -ENOMEM;
> +}
> +
> +

Nothing in this file is fsnotify specific.
Is there no kernel lib implementation for this already?
If there isn't (I'd be very surprised) please put this in lib/ and post it
for wider review including self tests.

> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 190c6a402e98..a1a4dd69e5ed 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -74,6 +74,8 @@
> #define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
> FS_OPEN_EXEC_PERM)
>
> +#define FSN_SUBMISSION_RING_BUFFER 0x00000080

FSNOTIFY_GROUP_FLAG_RING_BUFFER please (or FSN_GROUP_ if you must)
and please define this above struct fsnotify_group, even right above the flags
field like FSNOTIFY_CONN_FLAG_HAS_FSID

*IF* we go this way :)

Thanks,
Amir.

2021-04-27 06:05:27

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 06/15] fanotify: Support submission through ring buffer

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> This adds support for the ring buffer mode in fanotify. It is enabled
> by a new flag FAN_PREALLOC_QUEUE passed to fanotify_init. If this flag
> is enabled, the group only allows marks that support the ring buffer

I don't like this limitation.
I think FAN_PREALLOC_QUEUE can work with other events, why not?

In any case if we keep ring buffer, please use a different set of
fanotify_ring_buffer_ops struct instead of spraying if/else all over the
event queue implementation.

> submission. In a following patch, FAN_ERROR will make use of this
> mechanism.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/notify/fanotify/fanotify.c | 77 +++++++++++++++++++---------
> fs/notify/fanotify/fanotify_user.c | 81 ++++++++++++++++++------------
> include/linux/fanotify.h | 5 +-
> include/uapi/linux/fanotify.h | 1 +
> 4 files changed, 105 insertions(+), 59 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index e3669d8a4a64..98591a8155a7 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -612,6 +612,26 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
> return event;
> }
>
> +static struct fanotify_event *fanotify_ring_get_slot(struct fsnotify_group *group,
> + u32 mask, const void *data,
> + int data_type)
> +{
> + size_t size = 0;
> +
> + pr_debug("%s: group=%p mask=%x size=%lu\n", __func__, group, mask, size);
> +
> + return FANOTIFY_E(fsnotify_ring_alloc_event_slot(group, size));
> +}
> +
> +static void fanotify_ring_write_event(struct fsnotify_group *group,
> + struct fanotify_event *event, u32 mask,
> + const void *data, __kernel_fsid_t *fsid)
> +{
> + fanotify_init_event(group, event, 0, mask);
> +
> + event->pid = get_pid(task_tgid(current));
> +}
> +
> /*
> * Get cached fsid of the filesystem containing the object from any connector.
> * All connectors are supposed to have the same fsid, but we do not verify that
> @@ -701,31 +721,38 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> return 0;
> }
>
> - event = fanotify_alloc_event(group, mask, data, data_type, dir,
> - file_name, &fsid);
> - ret = -ENOMEM;
> - if (unlikely(!event)) {
> - /*
> - * We don't queue overflow events for permission events as
> - * there the access is denied and so no event is in fact lost.
> - */
> - if (!fanotify_is_perm_event(mask))
> - fsnotify_queue_overflow(group);
> - goto finish;
> - }
> -
> - fsn_event = &event->fse;
> - ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
> - if (ret) {
> - /* Permission events shouldn't be merged */
> - BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
> - /* Our event wasn't used in the end. Free it. */
> - fsnotify_destroy_event(group, fsn_event);
> -
> - ret = 0;
> - } else if (fanotify_is_perm_event(mask)) {
> - ret = fanotify_get_response(group, FANOTIFY_PERM(event),
> - iter_info);
> + if (group->flags & FSN_SUBMISSION_RING_BUFFER) {
> + event = fanotify_ring_get_slot(group, mask, data, data_type);
> + if (IS_ERR(event))
> + return PTR_ERR(event);

So no FAN_OVERFLOW with the ring buffer implementation?
This will be unexpected for fanotify users and frankly, less useful IMO.
I also don't see the technical reason to omit the overflow event.

> + fanotify_ring_write_event(group, event, mask, data, &fsid);
> + fsnotify_ring_commit_slot(group, &event->fse);
> + } else {
> + event = fanotify_alloc_event(group, mask, data, data_type, dir,
> + file_name, &fsid);
> + ret = -ENOMEM;
> + if (unlikely(!event)) {
> + /*
> + * We don't queue overflow events for permission events as
> + * there the access is denied and so no event is in fact lost.
> + */
> + if (!fanotify_is_perm_event(mask))
> + fsnotify_queue_overflow(group);
> + goto finish;
> + }
> + fsn_event = &event->fse;
> + ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
> + if (ret) {
> + /* Permission events shouldn't be merged */
> + BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
> + /* Our event wasn't used in the end. Free it. */
> + fsnotify_destroy_event(group, fsn_event);
> +
> + ret = 0;
> + } else if (fanotify_is_perm_event(mask)) {
> + ret = fanotify_get_response(group, FANOTIFY_PERM(event),
> + iter_info);
> + }
> }
> finish:
> if (fanotify_is_perm_event(mask))
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index fe605359af88..5031198bf7db 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -521,7 +521,9 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
> * Permission events get queued to wait for response. Other
> * events can be destroyed now.
> */
> - if (!fanotify_is_perm_event(event->mask)) {
> + if (group->fanotify_data.flags & FAN_PREALLOC_QUEUE) {
> + fsnotify_ring_buffer_consume_event(group, &event->fse);
> + } else if (!fanotify_is_perm_event(event->mask)) {
> fsnotify_destroy_event(group, &event->fse);
> } else {
> if (ret <= 0) {
> @@ -587,40 +589,39 @@ static int fanotify_release(struct inode *ignored, struct file *file)
> */
> fsnotify_group_stop_queueing(group);
>
> - /*
> - * Process all permission events on access_list and notification queue
> - * and simulate reply from userspace.
> - */
> - spin_lock(&group->notification_lock);
> - while (!list_empty(&group->fanotify_data.access_list)) {
> - struct fanotify_perm_event *event;
> -
> - event = list_first_entry(&group->fanotify_data.access_list,
> - struct fanotify_perm_event, fae.fse.list);
> - list_del_init(&event->fae.fse.list);
> - finish_permission_event(group, event, FAN_ALLOW);
> + if (!(group->flags & FSN_SUBMISSION_RING_BUFFER)) {
> + /*
> + * Process all permission events on access_list and notification queue
> + * and simulate reply from userspace.
> + */
> spin_lock(&group->notification_lock);
> - }
> -
> - /*
> - * Destroy all non-permission events. For permission events just
> - * dequeue them and set the response. They will be freed once the
> - * response is consumed and fanotify_get_response() returns.
> - */
> - while (!fsnotify_notify_queue_is_empty(group)) {
> - struct fanotify_event *event;
> -
> - event = FANOTIFY_E(fsnotify_remove_first_event(group));
> - if (!(event->mask & FANOTIFY_PERM_EVENTS)) {
> - spin_unlock(&group->notification_lock);
> - fsnotify_destroy_event(group, &event->fse);
> - } else {
> - finish_permission_event(group, FANOTIFY_PERM(event),
> - FAN_ALLOW);
> + while (!list_empty(&group->fanotify_data.access_list)) {
> + struct fanotify_perm_event *event;
> + event = list_first_entry(&group->fanotify_data.access_list,
> + struct fanotify_perm_event, fae.fse.list);
> + list_del_init(&event->fae.fse.list);
> + finish_permission_event(group, event, FAN_ALLOW);
> + spin_lock(&group->notification_lock);
> }
> - spin_lock(&group->notification_lock);
> + /*
> + * Destroy all non-permission events. For permission events just
> + * dequeue them and set the response. They will be freed once the
> + * response is consumed and fanotify_get_response() returns.
> + */
> + while (!fsnotify_notify_queue_is_empty(group)) {
> + struct fanotify_event *event;
> + event = FANOTIFY_E(fsnotify_remove_first_event(group));
> + if (!(event->mask & FANOTIFY_PERM_EVENTS)) {
> + spin_unlock(&group->notification_lock);
> + fsnotify_destroy_event(group, &event->fse);
> + } else {
> + finish_permission_event(group, FANOTIFY_PERM(event),
> + FAN_ALLOW);
> + }
> + spin_lock(&group->notification_lock);
> + }
> + spin_unlock(&group->notification_lock);
> }
> - spin_unlock(&group->notification_lock);
>
> /* Response for all permission events it set, wakeup waiters */
> wake_up(&group->fanotify_data.access_waitq);
> @@ -981,6 +982,16 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> if (flags & FAN_NONBLOCK)
> f_flags |= O_NONBLOCK;
>
> + if (flags & FAN_PREALLOC_QUEUE) {
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + if (flags & FAN_UNLIMITED_QUEUE)
> + return -EINVAL;
> +
> + fsn_flags = FSN_SUBMISSION_RING_BUFFER;
> + }
> +
> /* fsnotify_alloc_group takes a ref. Dropped in fanotify_release */
> group = fsnotify_alloc_user_group(&fanotify_fsnotify_ops, fsn_flags);
> if (IS_ERR(group)) {
> @@ -1223,6 +1234,10 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
> goto fput_and_out;
> }
>
> + if ((group->flags & FSN_SUBMISSION_RING_BUFFER) &&
> + (mask & ~FANOTIFY_SUBMISSION_BUFFER_EVENTS))
> + goto fput_and_out;
> +
> ret = fanotify_find_path(dfd, pathname, &path, flags,
> (mask & ALL_FSNOTIFY_EVENTS), obj_type);
> if (ret)
> @@ -1327,7 +1342,7 @@ SYSCALL32_DEFINE6(fanotify_mark,
> */
> static int __init fanotify_user_setup(void)
> {
> - BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 10);
> + BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 11);
> BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 9);
>
> fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index 3e9c56ee651f..5a4cefb4b1c3 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -23,7 +23,8 @@
> #define FANOTIFY_INIT_FLAGS (FANOTIFY_CLASS_BITS | FANOTIFY_FID_BITS | \
> FAN_REPORT_TID | \
> FAN_CLOEXEC | FAN_NONBLOCK | \
> - FAN_UNLIMITED_QUEUE | FAN_UNLIMITED_MARKS)
> + FAN_UNLIMITED_QUEUE | FAN_UNLIMITED_MARKS | \
> + FAN_PREALLOC_QUEUE)
>
> #define FANOTIFY_MARK_TYPE_BITS (FAN_MARK_INODE | FAN_MARK_MOUNT | \
> FAN_MARK_FILESYSTEM)
> @@ -71,6 +72,8 @@
> FANOTIFY_PERM_EVENTS | \
> FAN_Q_OVERFLOW | FAN_ONDIR)
>
> +#define FANOTIFY_SUBMISSION_BUFFER_EVENTS 0

FANOTIFY_RING_BUFFER_EVENTS? FANOTIFY_PREALLOC_EVENTS?

Please leave a comment above to state what this group means.
I *think* there is no reason to limit the set of events, only the sort of
information that is possible with FAN_PREALLOC_QUEUE.

Perhaps FAN_REPORT_FID cannot be allowed and as a result
FANOTIFY_INODE_EVENTS will not be allowed, but I am not even
sure if that limitation is needed.

Thanks,
Amir.

2021-04-27 06:50:06

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 08/15] fsnotify: Introduce helpers to send error_events

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> include/linux/fsnotify.h | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index f8acddcf54fb..b3ac1a9d0d4d 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -317,4 +317,19 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
> fsnotify_dentry(dentry, mask);
> }
>
> +static inline void fsnotify_error_event(int error, struct inode *dir,
> + const char *function, int line,
> + void *fs_data, int fs_data_size)
> +{
> + struct fs_error_report report = {
> + .error = error,
> + .line = line,
> + .function = function,
> + .fs_data_size = fs_data_size,
> + .fs_data = fs_data,
> + };
> +
> + fsnotify(FS_ERROR, &report, FSNOTIFY_EVENT_ERROR, dir, NULL, NULL, 0);

The way you use this helper from ext4_fsnotify_error() it would make more sense
to name the inode argument 'inode' and call:

fsnotify(FS_ERROR, &report, FSNOTIFY_EVENT_ERROR, NULL, NULL, inode, 0);

Also, if we stick with returning ENOMEM instead of overflow event (I
don't think we should),
then this helper should return the error as well.

Thanks,
Amir.

2021-04-27 07:02:31

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 09/15] fanotify: Introduce generic error record

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> This record describes a fs error in a fs agnostic way. It will be send
> back to userspace in response to a FSNOTIFY_EVENT_ERROR for groups with
> the FAN_ERROR mark.

It's not a mark, it's an event, so:
"...for groups with the FAN_ERROR event in their mark mask"

>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/notify/fanotify/fanotify.h | 16 ++++++++++++++++
> fs/notify/fanotify/fanotify_user.c | 28 ++++++++++++++++++++++++++++
> include/uapi/linux/fanotify.h | 10 ++++++++++
> 3 files changed, 54 insertions(+)
>
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index 47299e3d6efd..4cb9dd31f084 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -179,6 +179,22 @@ FANOTIFY_NE(struct fanotify_event *event)
> return container_of(event, struct fanotify_name_event, fae);
> }
>
> +struct fanotify_error_event {
> + struct fanotify_event fae;
> + int error;
> + __kernel_fsid_t fsid;
> +
> + int fs_data_size;
> + /* Must be the last item in the structure */
> + char fs_data[0];
> +};
> +
> +static inline struct fanotify_error_event *
> +FANOTIFY_EE(struct fanotify_event *event)
> +{
> + return container_of(event, struct fanotify_error_event, fae);
> +}
> +
> static inline __kernel_fsid_t *fanotify_event_fsid(struct fanotify_event *event)
> {
> if (event->type == FANOTIFY_EVENT_TYPE_FID)
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 5031198bf7db..21162d347bd1 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -64,6 +64,11 @@ static int fanotify_fid_info_len(int fh_len, int name_len)
> return roundup(FANOTIFY_INFO_HDR_LEN + info_len, FANOTIFY_EVENT_ALIGN);
> }
>
> +static size_t fanotify_error_info_len(struct fanotify_error_event *fee)
> +{
> + return sizeof(struct fanotify_event_info_error);
> +}
> +
> static size_t fanotify_event_len(struct fanotify_event *event,
> unsigned int fid_mode)
> {
> @@ -232,6 +237,29 @@ static int process_access_response(struct fsnotify_group *group,
> return -ENOENT;
> }
>
> +static size_t copy_error_info_to_user(struct fanotify_error_event *fee,
> + char __user *buf, int count)
> +{
> + struct fanotify_event_info_error info;
> +
> + info.hdr.info_type = FAN_EVENT_INFO_TYPE_ERROR;
> + info.hdr.pad = 0;
> + info.hdr.len = fanotify_error_info_len(fee);
> +
> + if (WARN_ON(count < info.hdr.len))
> + return -EFAULT;
> +
> + info.version = FANOTIFY_EVENT_INFO_ERROR_VERS_1;
> + info.error = fee->error;
> + info.fsid = fee->fsid;
> +
> + if (copy_to_user(buf, &info, sizeof(info)))
> + return -EFAULT;
> +
> + return info.hdr.len;
> +
> +}
> +
> static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> int info_type, const char *name, size_t name_len,
> char __user *buf, size_t count)
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index b283531549f1..cc9a1fa80e30 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -124,6 +124,7 @@ struct fanotify_event_metadata {
> #define FAN_EVENT_INFO_TYPE_FID 1
> #define FAN_EVENT_INFO_TYPE_DFID_NAME 2
> #define FAN_EVENT_INFO_TYPE_DFID 3
> +#define FAN_EVENT_INFO_TYPE_ERROR 4
>
> /* Variable length info record following event metadata */
> struct fanotify_event_info_header {
> @@ -149,6 +150,15 @@ struct fanotify_event_info_fid {
> unsigned char handle[0];
> };
>
> +#define FANOTIFY_EVENT_INFO_ERROR_VERS_1 1

Honestly, this struct is too simple to have a 'version'.
The format of this simple struct is already defined by
FAN_EVENT_INFO_TYPE_ERROR and if we want to change
the reported info in the future, we can use
FAN_EVENT_INFO_TYPE_ERROR_V2.
In fact, I suggest to name the type
FAN_EVENT_INFO_TYPE_FS_ERROR
to differentiate from a future
FAN_EVENT_INFO_TYPE_WB_ERROR

> +
> +struct fanotify_event_info_error {
> + struct fanotify_event_info_header hdr;
> + int version;
> + int error;
> + __kernel_fsid_t fsid;
> +};

I suggest to put an error seq counter in this struct.
The per-sb seq counter can be provided by the filesystem
or by fsnotify.

Thanks,
Amir.

2021-04-27 07:13:05

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 10/15] fanotify: Introduce code location record

On Mon, Apr 26, 2021 at 9:43 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> This patch introduces an optional info record that describes the
> source (as in the region of the source-code where an event was
> initiated). This record is not produced for other type of existing
> notification, but it is optionally enabled for FAN_ERROR notifications.
>

I find this functionality controversial, because think that the fs provided
s_last_error*, s_first_error* is more reliable and more powerful than this
functionality.

Let's leave it for a future extending proposal, should fanotify event reporting
proposal pass muster, shall we?
Or do you think that without this optional extension fanotify event reporting
will not be valuable enough?

Thanks,
Amir.

2021-04-27 07:14:04

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 11/15] fanotify: Introduce filesystem specific data record

On Mon, Apr 26, 2021 at 9:43 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Allow a FS_ERROR_TYPE notification to send a filesystem provided blob
> back to userspace. This is useful for filesystems who want to provide
> debug information for recovery tools.
>

Same comment as for FAN_EVENT_INFO_TYPE_LOCATION.
Can we leave this patch out of the discussion for now?

Thanks,
Amir.

2021-04-27 07:26:00

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 12/15] fanotify: Introduce the FAN_ERROR mark

On Mon, Apr 26, 2021 at 9:43 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> The FAN_ERROR mark is used by filesystem wide monitoring tools to
> receive notifications of type FS_ERROR_EVENT, emited by filesystems when
> a problem is detected. The error notification includes a generic error
> descriptor, an optional location record and a filesystem specific blob.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/notify/fanotify/fanotify.c | 48 +++++++++++++++++++----
> fs/notify/fanotify/fanotify.h | 8 ++++
> fs/notify/fanotify/fanotify_user.c | 63 ++++++++++++++++++++++++++++++
> include/linux/fanotify.h | 9 ++++-
> include/uapi/linux/fanotify.h | 2 +
> 5 files changed, 120 insertions(+), 10 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 98591a8155a7..6bae23d42e5e 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -240,12 +240,14 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group,
> __func__, iter_info->report_mask, event_mask, data, data_type);
>
> if (!fid_mode) {
> - /* Do we have path to open a file descriptor? */
> - if (!path)
> - return 0;
> - /* Path type events are only relevant for files and dirs */
> - if (!d_is_reg(path->dentry) && !d_can_lookup(path->dentry))
> - return 0;
> + if (!fanotify_is_error_event(event_mask)) {

This open coded nested condition is not nice.
If we get as far as this, I will explain what needs to be done.
Need helpers fanotify_is_reporting_fd(), fanotify_is_reporting_fid() and
fanotify_is_reporting_dir_fid().

Thanks,
Amir.

2021-04-27 08:42:31

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH RFC 07/15] fsnotify: Support FS_ERROR event type

On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Expose a new type of fsnotify event for filesystems to report errors for
> userspace monitoring tools. fanotify will send this type of
> notification for FAN_ERROR marks.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/notify/fsnotify.c | 2 +-
> include/linux/fsnotify_backend.h | 16 ++++++++++++++++
> 2 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 30d422b8c0fc..9fff35e67b37 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -558,7 +558,7 @@ static __init int fsnotify_init(void)
> {
> int ret;
>
> - BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 25);
> + BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 26);
>
> ret = init_srcu_struct(&fsnotify_mark_srcu);
> if (ret)
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index a1a4dd69e5ed..f850bfbe30d4 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -48,6 +48,8 @@
> #define FS_ACCESS_PERM 0x00020000 /* access event in a permissions hook */
> #define FS_OPEN_EXEC_PERM 0x00040000 /* open/exec event in a permission hook */
>
> +#define FS_ERROR 0x00100000 /* Used for filesystem error reporting */
> +

Why skip 0x00080000?

Anyway, event bits are starting to run out so I would prefer that you overload
an existing bit used only by inotify/dnotify.

FS_IN_IGNORED is completely internal to inotify and there is no need to
set it in i_fsnotify_mask at all, so if we remove the bit from the output of
inotify_arg_to_mask() no functionality will change and we will be able to
overload the event bit for FS_ERROR (see patch below).

I also kind of like that FS_ERROR is adjacent to FS_UMOUNT and
FS_Q_OVERFLOW :-)

Other FS_IN/FS_DN bits may also be reclaimed but it takes a bit more work
I have patches for those.

Thanks,
Amir.

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 98f61b31745a..351517bae716 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -89,10 +89,10 @@ static inline __u32 inotify_arg_to_mask(struct
inode *inode, u32 arg)
__u32 mask;

/*
- * Everything should accept their own ignored and should receive events
- * when the inode is unmounted. All directories care about children.
+ * Everything should receive events when the inode is unmounted.
+ * All directories care about children.
*/
- mask = (FS_IN_IGNORED | FS_UNMOUNT);
+ mask = FS_UNMOUNT;
if (S_ISDIR(inode->i_mode))
mask |= FS_EVENT_ON_CHILD;

diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 1ce66748a2d2..ecbafb3f36d7 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -42,6 +42,11 @@

#define FS_UNMOUNT 0x00002000 /* inode on umount fs */
#define FS_Q_OVERFLOW 0x00004000 /* Event queued overflowed */
+#define FS_ERROR 0x00008000 /* Filesystem error report */
+/*
+ * FS_IN_IGNORED overloads FS_ERROR. It is only used internally by inotify
+ * which does not support FS_ERROR.
+ */
#define FS_IN_IGNORED 0x00008000 /* last inotify event here */

2021-04-27 15:45:20

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] File system wide monitoring

Amir Goldstein <[email protected]> writes:

> On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
> <[email protected]> wrote:
>>
>> Hi,
>>
>> In an attempt to consolidate some of the feedback from the previous
>> proposals, I wrote a new attempt to solve the file system error reporting
>> problem. Before I spend more time polishing it, I'd like to hear your
>> feedback if I'm going in the wrong direction, in particular with the
>> modifications to fsnotify.
>>
>
> IMO you are going in the right direction, but you have gone a bit too far ;-)
>
> My understanding of the requirements and my interpretation of the feedback
> from filesystem maintainers is that the missing piece in the ecosystem is a
> user notification that "something went wrong". The "what went wrong" part
> is something that users and admins have long been able to gather from the
> kernel log and from filesystem tools (e.g. last error recorded).
>
> I do not see the need to duplicate existing functionality in fsmonitor.
> Don't get me wrong, I understand why it would have been nice for fsmonitor
> to be able to get all the errors nicely without looking anywhere else, but I
> don't think it justifies the extra complexity.

Hi Amir,

Thanks for the detailed review.

The reasons for the location record and the ring buffer is the use case
from Google to do analysis on a series of errors. I understand this is
important to them, which is why I expanded a bit on the 'what went
wrong' and multiple errors. In addition, The file system specific blob
attempts to assist online recovery tools with more information, but it
might make sense to do it in the future, when it is needed.

>> This RFC follows up on my previous proposals which attempted to leverage
>> watch_queue[1] and fsnotify[2] to provide a mechanism for file systems
>> to push error notifications to user space. This proposal starts by, as
>> suggested by Darrick, limiting the scope of what I'm trying to do to an
>> interface for administrators to monitor the health of a file system,
>> instead of a generic inteface for file errors. Therefore, this doesn't
>> solve the problem of writeback errors or the need to watch a specific
>> subsystem.
>>
>> * Format
>>
>> The feature is implemented on top of fanotify, as a new type of fanotify
>> mark, FAN_ERROR, which a file system monitoring tool can register to
>
> You have a terminology mistake throughout your series.
> FAN_ERROR is not a type of a mark, it is a type of an event.
> A mark describes the watched object (i.e. a filesystem, mount, inode).

Right. I understand the mistake and will fix it around the series.
>
>> receive notifications. A notification is split in three parts, and only
>> the first is guaranteed to exist for any given error event:
>>
>> - FS generic data: A file system agnostic structure that has a generic
>> error code and identifies the filesystem. Basically, it let's
>> userspace know something happen on a monitored filesystem.
>
> I think an error seq counter per fs would be a nice addition to generic data.
> It does not need to be persistent (it could be if filesystem supports it).

Makes sense to me.

>>
>> - FS location data: Identifies where in the code the problem
>> happened. (This is important for the use case of analysing frequent
>> error points that we discussed earlier).
>>
>> - FS specific data: A detailed error report in a filesystem specific
>> format that details what the error is. Ideally, a capable monitoring
>> tool can use the information here for error recovery. For instance,
>> xfs can put the xfs_scrub structures here, ext4 can send its error
>> reports, etc. An example of usage is done in the ext4 patch of this
>> series.
>>
>> More details on the information in each record can be found on the
>> documentation introduced in patch 15.
>>
>> * Using fanotify
>>
>> Using fanotify for this kind of thing is slightly tricky because we want
>> to guarantee delivery in some complicated conditions, for instance, the
>> file system might want to send an error while holding several locks.
>>
>> Instead of working around file system constraints at the file system
>> level, this proposal tries to make the FAN_ERROR submission safe in
>> those contexts. This is done with a new mode in fsnotify that
>> preallocates the memory at group creation to be used for the
>> notification submission.
>>
>> This new mode in fsnotify introduces a ring buffer to queue
>> notifications, which eliminates the allocation path in fsnotify. From
>> what I saw, the allocation is the only problem in fsnotify for
>> filesystems to submit errors in constrained situations.
>>
>
> The ring buffer functionality for fsnotify is interesting and it may be
> useful on its own, but IMO, its too big of a hammer for the problem
> at hand.
>
> The question that you should be asking yourself is what is the
> expected behavior in case of a flood of filesystem corruption errors.
> I think it has already been expressed by filesystem maintainers on
> one your previous postings, that a flood of filesystem corruption
> errors is often noise and the only interesting information is the
> first error.

My idea was be to provide an ioctl for the user to resize the ring
buffer when needed, to make the flood manageable. But I understand your
main point about the ring buffer. i'm not sure saving only the first
notification solves Google's use case of error monitoring and analysis,
though. Khazhy, Ted, can you weight in?

> For this reason, I think that FS_ERROR could be implemented
> by attaching an fsnotify_error_info object to an fsnotify_sb_mark:
>
> struct fsnotify_sb_mark {
> struct fsnotify_mark fsn_mark;
> struct fsnotify_error_info info;
> }
>
> Similar to fd sampled errseq, there can be only one error report
> per sb-group pair (i.e. fsnotify_sb_mark) and the memory needed to store
> the error report can be allocated at the time of setting the filesystem mark.
>
> With this, you will not need the added complexity of the ring buffer
> and you will not need to limit FAN_ERROR reporting to a group that
> is only listening for FAN_ERROR, which is an unneeded limitation IMO.

The limitation exists because I was concerned about not breaking the
semantics of FAN_ACCESS and others, with regards to merged
notifications. I believe there should be no other reason why
notifications of FAN_CLASS_NOTIF can't be sent to the ring buffer too.
That limitation could be lifted for everything but permission events, I
think.

--
Gabriel Krisman Bertazi

2021-04-29 00:58:21

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH RFC 13/15] ext4: Send notifications on error

On Mon, Apr 26, 2021 at 02:41:59PM -0400, Gabriel Krisman Bertazi wrote:
> Send a FS_ERROR message via fsnotify to a userspace monitoring tool
> whenever a ext4 error condition is triggered. This follows the existing
> error conditions in ext4, so it is hooked to the ext4_error* functions.
>
> It also follows the current dmesg reporting in the format. The
> filesystem message is composed mostly by the string that would be
> otherwise printed in dmesg.
>
> A new ext4 specific record format is exposed in the uapi, such that a
> monitoring tool knows what to expect when listening errors of an ext4
> filesystem.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/ext4/super.c | 60 ++++++++++++++++++++++++--------
> include/uapi/linux/ext4-notify.h | 17 +++++++++
> 2 files changed, 62 insertions(+), 15 deletions(-)
> create mode 100644 include/uapi/linux/ext4-notify.h
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b9693680463a..032e29e7ff6a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -46,6 +46,8 @@
> #include <linux/part_stat.h>
> #include <linux/kthread.h>
> #include <linux/freezer.h>
> +#include <linux/fsnotify.h>
> +#include <uapi/linux/ext4-notify.h>
>
> #include "ext4.h"
> #include "ext4_extents.h" /* Needed for trace points definition */
> @@ -727,6 +729,22 @@ static void flush_stashed_error_work(struct work_struct *work)
> ext4_commit_super(sbi->s_sb);
> }
>
> +static void ext4_fsnotify_error(int error, struct inode *inode, __u64 block,
> + const char *func, int line,
> + const char *desc, struct va_format *vaf)
> +{
> + struct ext4_error_inode_report report;
> +
> + if (inode->i_sb->s_fsnotify_marks) {
> + report.inode = inode ? inode->i_ino : -1L;
> + report.block = block ? block : -1L;
> +
> + snprintf(report.desc, EXT4_FSN_DESC_LEN, "%s%pV\n", desc?:"", vaf);
> +
> + fsnotify_error_event(error, inode, func, line, &report, sizeof(report));
> + }
> +}
> +
> #define ext4_error_ratelimit(sb) \
> ___ratelimit(&(EXT4_SB(sb)->s_err_ratelimit_state), \
> "EXT4-fs error")
> @@ -742,15 +760,18 @@ void __ext4_error(struct super_block *sb, const char *function,
> return;
>
> trace_ext4_error(sb, function, line);
> +
> + va_start(args, fmt);
> + vaf.fmt = fmt;
> + vaf.va = &args;
> if (ext4_error_ratelimit(sb)) {
> - va_start(args, fmt);
> - vaf.fmt = fmt;
> - vaf.va = &args;
> printk(KERN_CRIT
> "EXT4-fs error (device %s): %s:%d: comm %s: %pV\n",
> sb->s_id, function, line, current->comm, &vaf);
> - va_end(args);
> +
> }
> + ext4_fsnotify_error(error, sb->s_root->d_inode, block, function, line, NULL, &vaf);
> + va_end(args);
> ext4_handle_error(sb, force_ro, error, 0, block, function, line);
> }
>
> @@ -765,10 +786,10 @@ void __ext4_error_inode(struct inode *inode, const char *function,
> return;
>
> trace_ext4_error(inode->i_sb, function, line);
> + va_start(args, fmt);
> + vaf.fmt = fmt;
> + vaf.va = &args;
> if (ext4_error_ratelimit(inode->i_sb)) {
> - va_start(args, fmt);
> - vaf.fmt = fmt;
> - vaf.va = &args;
> if (block)
> printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: "
> "inode #%lu: block %llu: comm %s: %pV\n",
> @@ -779,8 +800,11 @@ void __ext4_error_inode(struct inode *inode, const char *function,
> "inode #%lu: comm %s: %pV\n",
> inode->i_sb->s_id, function, line, inode->i_ino,
> current->comm, &vaf);
> - va_end(args);
> }
> +
> + ext4_fsnotify_error(error, inode, block, function, line, NULL, &vaf);
> + va_end(args);
> +
> ext4_handle_error(inode->i_sb, false, error, inode->i_ino, block,
> function, line);
> }
> @@ -798,13 +822,16 @@ void __ext4_error_file(struct file *file, const char *function,
> return;
>
> trace_ext4_error(inode->i_sb, function, line);
> +
> + path = file_path(file, pathname, sizeof(pathname));
> + if (IS_ERR(path))
> + path = "(unknown)";
> +
> + va_start(args, fmt);
> + vaf.fmt = fmt;
> + vaf.va = &args;
> +
> if (ext4_error_ratelimit(inode->i_sb)) {
> - path = file_path(file, pathname, sizeof(pathname));
> - if (IS_ERR(path))
> - path = "(unknown)";
> - va_start(args, fmt);
> - vaf.fmt = fmt;
> - vaf.va = &args;
> if (block)
> printk(KERN_CRIT
> "EXT4-fs error (device %s): %s:%d: inode #%lu: "
> @@ -817,8 +844,10 @@ void __ext4_error_file(struct file *file, const char *function,
> "comm %s: path %s: %pV\n",
> inode->i_sb->s_id, function, line, inode->i_ino,
> current->comm, path, &vaf);
> - va_end(args);
> }
> + ext4_fsnotify_error(EFSCORRUPTED, inode, block, function, line, NULL, &vaf);
> + va_end(args);
> +
> ext4_handle_error(inode->i_sb, false, EFSCORRUPTED, inode->i_ino, block,
> function, line);
> }
> @@ -886,6 +915,7 @@ void __ext4_std_error(struct super_block *sb, const char *function,
> printk(KERN_CRIT "EXT4-fs error (device %s) in %s:%d: %s\n",
> sb->s_id, function, line, errstr);
> }
> + ext4_fsnotify_error(errno, NULL, -1L, function, line, errstr, NULL);
>
> ext4_handle_error(sb, false, -errno, 0, 0, function, line);
> }
> diff --git a/include/uapi/linux/ext4-notify.h b/include/uapi/linux/ext4-notify.h
> new file mode 100644
> index 000000000000..31a3bbcafd13
> --- /dev/null
> +++ b/include/uapi/linux/ext4-notify.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +/*
> + * Copyright 2021, Collabora Ltd.
> + */
> +
> +#ifndef EXT4_NOTIFY_H
> +#define EXT4_NOTIFY_H
> +
> +#define EXT4_FSN_DESC_LEN 256
> +
> +struct ext4_error_inode_report {
> + u64 inode;

I don't have much to contribute this time, other than suggesting that
you might want to encode the inode generation here so that forensics
tools won't waste their time if the inode has been deleted and recreated
in between when the error happens and when the fs gets pulled offline
for analysis.

(...and maybe add a u32 flags field that can remain zero for now)

> + u64 block;

...and maybe call this "lblk" (assuming this is the logical block offset
within the file?) since that's already in wide use around e2fsprogs and
fs/ext4/.

--D

> + char desc[EXT4_FSN_DESC_LEN];
> +};
> +
> +#endif
> --
> 2.31.0
>

2021-04-29 18:34:07

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC 05/15] fsnotify: Support event submission through ring buffer

Amir Goldstein <[email protected]> writes:

> On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
> <[email protected]> wrote:
>>
>> In order to support file system health/error reporting over fanotify,
>> fsnotify needs to expose a submission path that doesn't allow sleeping.
>> The only problem I identified with the current submission path is the
>> need to dynamically allocate memory for the event queue.
>>
>> This patch avoids the problem by introducing a new mode in fsnotify,
>> where a ring buffer is used to submit events for a group. Each group
>> has its own ring buffer, and error notifications are submitted
>> exclusively through it.
>>
>> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>> ---
>> fs/notify/Makefile | 2 +-
>> fs/notify/group.c | 12 +-
>> fs/notify/notification.c | 10 ++
>> fs/notify/ring.c | 199 +++++++++++++++++++++++++++++++
>> include/linux/fsnotify_backend.h | 37 +++++-
>> 5 files changed, 255 insertions(+), 5 deletions(-)
>> create mode 100644 fs/notify/ring.c
>>
>> diff --git a/fs/notify/Makefile b/fs/notify/Makefile
>> index 63a4b8828df4..61dae1e90f2d 100644
>> --- a/fs/notify/Makefile
>> +++ b/fs/notify/Makefile
>> @@ -1,6 +1,6 @@
>> # SPDX-License-Identifier: GPL-2.0
>> obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o mark.o \
>> - fdinfo.o
>> + fdinfo.o ring.o
>>
>> obj-y += dnotify/
>> obj-y += inotify/
>> diff --git a/fs/notify/group.c b/fs/notify/group.c
>> index 08acb1afc0c2..b99b3de36696 100644
>> --- a/fs/notify/group.c
>> +++ b/fs/notify/group.c
>> @@ -81,7 +81,10 @@ void fsnotify_destroy_group(struct fsnotify_group *group)
>> * notification against this group. So clearing the notification queue
>> * of all events is reliable now.
>> */
>> - fsnotify_flush_notify(group);
>> + if (group->flags & FSN_SUBMISSION_RING_BUFFER)
>> + fsnotify_free_ring_buffer(group);
>> + else
>> + fsnotify_flush_notify(group);
>>
>> /*
>> * Destroy overflow event (we cannot use fsnotify_destroy_event() as
>> @@ -136,6 +139,13 @@ static struct fsnotify_group *__fsnotify_alloc_group(
>> group->ops = ops;
>> group->flags = flags;
>>
>> + if (group->flags & FSN_SUBMISSION_RING_BUFFER) {
>> + if (fsnotify_create_ring_buffer(group)) {
>> + kfree(group);
>> + return ERR_PTR(-ENOMEM);
>> + }
>> + }
>> +
>> return group;
>> }
>>
>> diff --git a/fs/notify/notification.c b/fs/notify/notification.c
>> index 75d79d6d3ef0..32f97e7b7a80 100644
>> --- a/fs/notify/notification.c
>> +++ b/fs/notify/notification.c
>> @@ -51,6 +51,10 @@ EXPORT_SYMBOL_GPL(fsnotify_get_cookie);
>> bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group)
>> {
>> assert_spin_locked(&group->notification_lock);
>> +
>> + if (group->flags & FSN_SUBMISSION_RING_BUFFER)
>> + return fsnotify_ring_notify_queue_is_empty(group);
>> +
>> return list_empty(&group->notification_list) ? true : false;
>> }
>>
>> @@ -132,6 +136,9 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group,
>> struct fsnotify_event *event)
>> {
>> assert_spin_locked(&group->notification_lock);
>> +
>> + if (group->flags & FSN_SUBMISSION_RING_BUFFER)
>> + return;
>> /*
>> * We need to init list head for the case of overflow event so that
>> * check in fsnotify_add_event() works
>> @@ -166,6 +173,9 @@ struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group)
>> {
>> assert_spin_locked(&group->notification_lock);
>>
>> + if (group->flags & FSN_SUBMISSION_RING_BUFFER)
>> + return fsnotify_ring_peek_first_event(group);
>> +
>> return list_first_entry(&group->notification_list,
>> struct fsnotify_event, list);
>> }
>> diff --git a/fs/notify/ring.c b/fs/notify/ring.c
>> new file mode 100644
>> index 000000000000..75e8af1f8d80
>> --- /dev/null
>> +++ b/fs/notify/ring.c
>> @@ -0,0 +1,199 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/types.h>
>> +#include <linux/fsnotify.h>
>> +#include <linux/memcontrol.h>
>> +
>> +#define INVALID_RING_SLOT -1
>> +
>> +#define FSNOTIFY_RING_PAGES 16
>> +
>> +#define NEXT_SLOT(cur, len, ring_size) ((cur + len) & (ring_size-1))
>> +#define NEXT_PAGE(cur, ring_size) (round_up(cur, PAGE_SIZE) & (ring_size-1))
>> +
>> +bool fsnotify_ring_notify_queue_is_empty(struct fsnotify_group *group)
>> +{
>> + assert_spin_locked(&group->notification_lock);
>> +
>> + if (group->ring_buffer.tail == group->ring_buffer.head)
>> + return true;
>> + return false;
>> +}
>> +
>> +struct fsnotify_event *fsnotify_ring_peek_first_event(struct fsnotify_group *group)
>> +{
>> + u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
>> + struct fsnotify_event *fsn;
>> + char *kaddr;
>> + u64 tail;
>> +
>> + assert_spin_locked(&group->notification_lock);
>> +
>> +again:
>> + tail = group->ring_buffer.tail;
>> +
>> + if ((PAGE_SIZE - (tail & (PAGE_SIZE-1))) < sizeof(struct fsnotify_event)) {
>> + group->ring_buffer.tail = NEXT_PAGE(tail, ring_size);
>> + goto again;
>> + }
>> +
>> + kaddr = kmap_atomic(group->ring_buffer.pages[tail / PAGE_SIZE]);
>> + if (!kaddr)
>> + return NULL;
>> + fsn = (struct fsnotify_event *) (kaddr + (tail & (PAGE_SIZE-1)));
>> +
>> + if (fsn->slot_len == INVALID_RING_SLOT) {
>> + group->ring_buffer.tail = NEXT_PAGE(tail, ring_size);
>> + kunmap_atomic(kaddr);
>> + goto again;
>> + }
>> +
>> + /* will be unmapped when entry is consumed. */
>> + return fsn;
>> +}
>> +
>> +void fsnotify_ring_buffer_consume_event(struct fsnotify_group *group,
>> + struct fsnotify_event *event)
>> +{
>> + u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
>> + u64 new_tail = NEXT_SLOT(group->ring_buffer.tail, event->slot_len, ring_size);
>> +
>> + kunmap_atomic(event);
>> +
>> + pr_debug("%s: group=%p tail=%llx->%llx ring_size=%llu\n", __func__,
>> + group, group->ring_buffer.tail, new_tail, ring_size);
>> +
>> + WRITE_ONCE(group->ring_buffer.tail, new_tail);
>> +}
>> +
>> +struct fsnotify_event *fsnotify_ring_alloc_event_slot(struct fsnotify_group *group,
>> + size_t size)
>> + __acquires(&group->notification_lock)
>> +{
>> + struct fsnotify_event *fsn;
>> + u64 head, tail;
>> + u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
>> + u64 new_head;
>> + void *kaddr;
>> +
>> + if (WARN_ON(!(group->flags & FSN_SUBMISSION_RING_BUFFER) || size > PAGE_SIZE))
>> + return ERR_PTR(-EINVAL);
>> +
>> + pr_debug("%s: start group=%p ring_size=%llu, requested=%lu\n", __func__, group,
>> + ring_size, size);
>> +
>> + spin_lock(&group->notification_lock);
>> +again:
>> + head = group->ring_buffer.head;
>> + tail = group->ring_buffer.tail;
>> + new_head = NEXT_SLOT(head, size, ring_size);
>> +
>> + /* head would catch up to tail, corrupting an entry. */
>> + if ((head < tail && new_head > tail) || (head > new_head && new_head > tail)) {
>> + fsn = ERR_PTR(-ENOMEM);
>> + goto err;
>> + }
>> +
>> + /*
>> + * Not event a skip message fits in the page. We can detect the
>> + * lack of space. Move on to the next page.
>> + */
>> + if ((PAGE_SIZE - (head & (PAGE_SIZE-1))) < sizeof(struct fsnotify_event)) {
>> + /* Start again on next page */
>> + group->ring_buffer.head = NEXT_PAGE(head, ring_size);
>> + goto again;
>> + }
>> +
>> + kaddr = kmap_atomic(group->ring_buffer.pages[head / PAGE_SIZE]);
>> + if (!kaddr) {
>> + fsn = ERR_PTR(-EFAULT);
>> + goto err;
>> + }
>> +
>> + fsn = (struct fsnotify_event *) (kaddr + (head & (PAGE_SIZE-1)));
>> +
>> + if ((head >> PAGE_SHIFT) != (new_head >> PAGE_SHIFT)) {
>> + /*
>> + * No room in the current page. Add a fake entry
>> + * consuming the end the page to avoid splitting event
>> + * structure.
>> + */
>> + fsn->slot_len = INVALID_RING_SLOT;
>> + kunmap_atomic(kaddr);
>> + /* Start again on the next page */
>> + group->ring_buffer.head = NEXT_PAGE(head, ring_size);
>> +
>> + goto again;
>> + }
>> + fsn->slot_len = size;
>> +
>> + return fsn;
>> +
>> +err:
>> + spin_unlock(&group->notification_lock);
>> + return fsn;
>> +}
>> +
>> +void fsnotify_ring_commit_slot(struct fsnotify_group *group, struct fsnotify_event *fsn)
>> + __releases(&group->notification_lock)
>> +{
>> + u64 ring_size = group->ring_buffer.nr_pages << PAGE_SHIFT;
>> + u64 head = group->ring_buffer.head;
>> + u64 new_head = NEXT_SLOT(head, fsn->slot_len, ring_size);
>> +
>> + pr_debug("%s: group=%p head=%llx->%llx ring_size=%llu\n", __func__,
>> + group, head, new_head, ring_size);
>> +
>> + kunmap_atomic(fsn);
>> + group->ring_buffer.head = new_head;
>> +
>> + spin_unlock(&group->notification_lock);
>> +
>> + wake_up(&group->notification_waitq);
>> + kill_fasync(&group->fsn_fa, SIGIO, POLL_IN);
>> +
>> +}
>> +
>> +void fsnotify_free_ring_buffer(struct fsnotify_group *group)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < group->ring_buffer.nr_pages; i++)
>> + __free_page(group->ring_buffer.pages[i]);
>> + kfree(group->ring_buffer.pages);
>> + group->ring_buffer.nr_pages = 0;
>> +}
>> +
>> +int fsnotify_create_ring_buffer(struct fsnotify_group *group)
>> +{
>> + int nr_pages = FSNOTIFY_RING_PAGES;
>> + int i;
>> +
>> + pr_debug("%s: group=%p pages=%d\n", __func__, group, nr_pages);
>> +
>> + group->ring_buffer.pages = kmalloc_array(nr_pages, sizeof(struct pages *),
>> + GFP_KERNEL);
>> + if (!group->ring_buffer.pages)
>> + return -ENOMEM;
>> +
>> + group->ring_buffer.head = 0;
>> + group->ring_buffer.tail = 0;
>> +
>> + for (i = 0; i < nr_pages; i++) {
>> + group->ring_buffer.pages[i] = alloc_pages(GFP_KERNEL, 1);
>> + if (!group->ring_buffer.pages)
>> + goto err_dealloc;
>> + }
>> +
>> + group->ring_buffer.nr_pages = nr_pages;
>> +
>> + return 0;
>> +
>> +err_dealloc:
>> + for (--i; i >= 0; i--)
>> + __free_page(group->ring_buffer.pages[i]);
>> + kfree(group->ring_buffer.pages);
>> + group->ring_buffer.nr_pages = 0;
>> + return -ENOMEM;
>> +}
>> +
>> +
>
> Nothing in this file is fsnotify specific.
> Is there no kernel lib implementation for this already?
> If there isn't (I'd be very surprised) please put this in lib/ and post it
> for wider review including self tests.

About the implementation, the only generic code I could find is
include/linux/circ_buf.h, but it doesn't really do much or fit well
here. For instance, it doesn't deal well with non-contiguous pages.

There are other smarter implementations around the kernel like
Documentation/trace/ring-buffer-design.txt, that would be a better
candidate to be a generic ring buffer in lib/, but I admit to haven't
checked them well enough to see if they would solve the problem for
fsnotify, which has a very simple ring buffer anyway.

>> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
>> index 190c6a402e98..a1a4dd69e5ed 100644
>> --- a/include/linux/fsnotify_backend.h
>> +++ b/include/linux/fsnotify_backend.h
>> @@ -74,6 +74,8 @@
>> #define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
>> FS_OPEN_EXEC_PERM)
>>
>> +#define FSN_SUBMISSION_RING_BUFFER 0x00000080
>
> FSNOTIFY_GROUP_FLAG_RING_BUFFER please (or FSN_GROUP_ if you must)
> and please define this above struct fsnotify_group, even right above the flags
> field like FSNOTIFY_CONN_FLAG_HAS_FSID
>
> *IF* we go this way :)
>
> Thanks,
> Amir.

--
Gabriel Krisman Bertazi

2021-04-29 18:36:52

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC 06/15] fanotify: Support submission through ring buffer

Amir Goldstein <[email protected]> writes:

> On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
> <[email protected]> wrote:
>>
>> This adds support for the ring buffer mode in fanotify. It is enabled
>> by a new flag FAN_PREALLOC_QUEUE passed to fanotify_init. If this flag
>> is enabled, the group only allows marks that support the ring buffer
>
> I don't like this limitation.
> I think FAN_PREALLOC_QUEUE can work with other events, why not?

The only complications I see are permission events and mergeable events,
which would no longer be merged. The merging problem is not big,
except it changes the existing expectations. Other than that, it should
be trivial to have every FAN_CLASS_NOTIF events in the ring buffer.
>
> In any case if we keep ring buffer, please use a different set of
> fanotify_ring_buffer_ops struct instead of spraying if/else all over the
> event queue implementation.
>
>> submission. In a following patch, FAN_ERROR will make use of this
>> mechanism.
>>
>> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>> ---
>> fs/notify/fanotify/fanotify.c | 77 +++++++++++++++++++---------
>> fs/notify/fanotify/fanotify_user.c | 81 ++++++++++++++++++------------
>> include/linux/fanotify.h | 5 +-
>> include/uapi/linux/fanotify.h | 1 +
>> 4 files changed, 105 insertions(+), 59 deletions(-)
>>
>> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
>> index e3669d8a4a64..98591a8155a7 100644
>> --- a/fs/notify/fanotify/fanotify.c
>> +++ b/fs/notify/fanotify/fanotify.c
>> @@ -612,6 +612,26 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
>> return event;
>> }
>>
>> +static struct fanotify_event *fanotify_ring_get_slot(struct fsnotify_group *group,
>> + u32 mask, const void *data,
>> + int data_type)
>> +{
>> + size_t size = 0;
>> +
>> + pr_debug("%s: group=%p mask=%x size=%lu\n", __func__, group, mask, size);
>> +
>> + return FANOTIFY_E(fsnotify_ring_alloc_event_slot(group, size));
>> +}
>> +
>> +static void fanotify_ring_write_event(struct fsnotify_group *group,
>> + struct fanotify_event *event, u32 mask,
>> + const void *data, __kernel_fsid_t *fsid)
>> +{
>> + fanotify_init_event(group, event, 0, mask);
>> +
>> + event->pid = get_pid(task_tgid(current));
>> +}
>> +
>> /*
>> * Get cached fsid of the filesystem containing the object from any connector.
>> * All connectors are supposed to have the same fsid, but we do not verify that
>> @@ -701,31 +721,38 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>> return 0;
>> }
>>
>> - event = fanotify_alloc_event(group, mask, data, data_type, dir,
>> - file_name, &fsid);
>> - ret = -ENOMEM;
>> - if (unlikely(!event)) {
>> - /*
>> - * We don't queue overflow events for permission events as
>> - * there the access is denied and so no event is in fact lost.
>> - */
>> - if (!fanotify_is_perm_event(mask))
>> - fsnotify_queue_overflow(group);
>> - goto finish;
>> - }
>> -
>> - fsn_event = &event->fse;
>> - ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
>> - if (ret) {
>> - /* Permission events shouldn't be merged */
>> - BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
>> - /* Our event wasn't used in the end. Free it. */
>> - fsnotify_destroy_event(group, fsn_event);
>> -
>> - ret = 0;
>> - } else if (fanotify_is_perm_event(mask)) {
>> - ret = fanotify_get_response(group, FANOTIFY_PERM(event),
>> - iter_info);
>> + if (group->flags & FSN_SUBMISSION_RING_BUFFER) {
>> + event = fanotify_ring_get_slot(group, mask, data, data_type);
>> + if (IS_ERR(event))
>> + return PTR_ERR(event);
>
> So no FAN_OVERFLOW with the ring buffer implementation?
> This will be unexpected for fanotify users and frankly, less useful IMO.
> I also don't see the technical reason to omit the overflow event.
>
>> + fanotify_ring_write_event(group, event, mask, data, &fsid);
>> + fsnotify_ring_commit_slot(group, &event->fse);
>> + } else {
>> + event = fanotify_alloc_event(group, mask, data, data_type, dir,
>> + file_name, &fsid);
>> + ret = -ENOMEM;
>> + if (unlikely(!event)) {
>> + /*
>> + * We don't queue overflow events for permission events as
>> + * there the access is denied and so no event is in fact lost.
>> + */
>> + if (!fanotify_is_perm_event(mask))
>> + fsnotify_queue_overflow(group);
>> + goto finish;
>> + }
>> + fsn_event = &event->fse;
>> + ret = fsnotify_add_event(group, fsn_event, fanotify_merge);
>> + if (ret) {
>> + /* Permission events shouldn't be merged */
>> + BUG_ON(ret == 1 && mask & FANOTIFY_PERM_EVENTS);
>> + /* Our event wasn't used in the end. Free it. */
>> + fsnotify_destroy_event(group, fsn_event);
>> +
>> + ret = 0;
>> + } else if (fanotify_is_perm_event(mask)) {
>> + ret = fanotify_get_response(group, FANOTIFY_PERM(event),
>> + iter_info);
>> + }
>> }
>> finish:
>> if (fanotify_is_perm_event(mask))
>> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
>> index fe605359af88..5031198bf7db 100644
>> --- a/fs/notify/fanotify/fanotify_user.c
>> +++ b/fs/notify/fanotify/fanotify_user.c
>> @@ -521,7 +521,9 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
>> * Permission events get queued to wait for response. Other
>> * events can be destroyed now.
>> */
>> - if (!fanotify_is_perm_event(event->mask)) {
>> + if (group->fanotify_data.flags & FAN_PREALLOC_QUEUE) {
>> + fsnotify_ring_buffer_consume_event(group, &event->fse);
>> + } else if (!fanotify_is_perm_event(event->mask)) {
>> fsnotify_destroy_event(group, &event->fse);
>> } else {
>> if (ret <= 0) {
>> @@ -587,40 +589,39 @@ static int fanotify_release(struct inode *ignored, struct file *file)
>> */
>> fsnotify_group_stop_queueing(group);
>>
>> - /*
>> - * Process all permission events on access_list and notification queue
>> - * and simulate reply from userspace.
>> - */
>> - spin_lock(&group->notification_lock);
>> - while (!list_empty(&group->fanotify_data.access_list)) {
>> - struct fanotify_perm_event *event;
>> -
>> - event = list_first_entry(&group->fanotify_data.access_list,
>> - struct fanotify_perm_event, fae.fse.list);
>> - list_del_init(&event->fae.fse.list);
>> - finish_permission_event(group, event, FAN_ALLOW);
>> + if (!(group->flags & FSN_SUBMISSION_RING_BUFFER)) {
>> + /*
>> + * Process all permission events on access_list and notification queue
>> + * and simulate reply from userspace.
>> + */
>> spin_lock(&group->notification_lock);
>> - }
>> -
>> - /*
>> - * Destroy all non-permission events. For permission events just
>> - * dequeue them and set the response. They will be freed once the
>> - * response is consumed and fanotify_get_response() returns.
>> - */
>> - while (!fsnotify_notify_queue_is_empty(group)) {
>> - struct fanotify_event *event;
>> -
>> - event = FANOTIFY_E(fsnotify_remove_first_event(group));
>> - if (!(event->mask & FANOTIFY_PERM_EVENTS)) {
>> - spin_unlock(&group->notification_lock);
>> - fsnotify_destroy_event(group, &event->fse);
>> - } else {
>> - finish_permission_event(group, FANOTIFY_PERM(event),
>> - FAN_ALLOW);
>> + while (!list_empty(&group->fanotify_data.access_list)) {
>> + struct fanotify_perm_event *event;
>> + event = list_first_entry(&group->fanotify_data.access_list,
>> + struct fanotify_perm_event, fae.fse.list);
>> + list_del_init(&event->fae.fse.list);
>> + finish_permission_event(group, event, FAN_ALLOW);
>> + spin_lock(&group->notification_lock);
>> }
>> - spin_lock(&group->notification_lock);
>> + /*
>> + * Destroy all non-permission events. For permission events just
>> + * dequeue them and set the response. They will be freed once the
>> + * response is consumed and fanotify_get_response() returns.
>> + */
>> + while (!fsnotify_notify_queue_is_empty(group)) {
>> + struct fanotify_event *event;
>> + event = FANOTIFY_E(fsnotify_remove_first_event(group));
>> + if (!(event->mask & FANOTIFY_PERM_EVENTS)) {
>> + spin_unlock(&group->notification_lock);
>> + fsnotify_destroy_event(group, &event->fse);
>> + } else {
>> + finish_permission_event(group, FANOTIFY_PERM(event),
>> + FAN_ALLOW);
>> + }
>> + spin_lock(&group->notification_lock);
>> + }
>> + spin_unlock(&group->notification_lock);
>> }
>> - spin_unlock(&group->notification_lock);
>>
>> /* Response for all permission events it set, wakeup waiters */
>> wake_up(&group->fanotify_data.access_waitq);
>> @@ -981,6 +982,16 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>> if (flags & FAN_NONBLOCK)
>> f_flags |= O_NONBLOCK;
>>
>> + if (flags & FAN_PREALLOC_QUEUE) {
>> + if (!capable(CAP_SYS_ADMIN))
>> + return -EPERM;
>> +
>> + if (flags & FAN_UNLIMITED_QUEUE)
>> + return -EINVAL;
>> +
>> + fsn_flags = FSN_SUBMISSION_RING_BUFFER;
>> + }
>> +
>> /* fsnotify_alloc_group takes a ref. Dropped in fanotify_release */
>> group = fsnotify_alloc_user_group(&fanotify_fsnotify_ops, fsn_flags);
>> if (IS_ERR(group)) {
>> @@ -1223,6 +1234,10 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
>> goto fput_and_out;
>> }
>>
>> + if ((group->flags & FSN_SUBMISSION_RING_BUFFER) &&
>> + (mask & ~FANOTIFY_SUBMISSION_BUFFER_EVENTS))
>> + goto fput_and_out;
>> +
>> ret = fanotify_find_path(dfd, pathname, &path, flags,
>> (mask & ALL_FSNOTIFY_EVENTS), obj_type);
>> if (ret)
>> @@ -1327,7 +1342,7 @@ SYSCALL32_DEFINE6(fanotify_mark,
>> */
>> static int __init fanotify_user_setup(void)
>> {
>> - BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 10);
>> + BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 11);
>> BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 9);
>>
>> fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
>> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
>> index 3e9c56ee651f..5a4cefb4b1c3 100644
>> --- a/include/linux/fanotify.h
>> +++ b/include/linux/fanotify.h
>> @@ -23,7 +23,8 @@
>> #define FANOTIFY_INIT_FLAGS (FANOTIFY_CLASS_BITS | FANOTIFY_FID_BITS | \
>> FAN_REPORT_TID | \
>> FAN_CLOEXEC | FAN_NONBLOCK | \
>> - FAN_UNLIMITED_QUEUE | FAN_UNLIMITED_MARKS)
>> + FAN_UNLIMITED_QUEUE | FAN_UNLIMITED_MARKS | \
>> + FAN_PREALLOC_QUEUE)
>>
>> #define FANOTIFY_MARK_TYPE_BITS (FAN_MARK_INODE | FAN_MARK_MOUNT | \
>> FAN_MARK_FILESYSTEM)
>> @@ -71,6 +72,8 @@
>> FANOTIFY_PERM_EVENTS | \
>> FAN_Q_OVERFLOW | FAN_ONDIR)
>>
>> +#define FANOTIFY_SUBMISSION_BUFFER_EVENTS 0
>
> FANOTIFY_RING_BUFFER_EVENTS? FANOTIFY_PREALLOC_EVENTS?
>
> Please leave a comment above to state what this group means.
> I *think* there is no reason to limit the set of events, only the sort of
> information that is possible with FAN_PREALLOC_QUEUE.
>
> Perhaps FAN_REPORT_FID cannot be allowed and as a result
> FANOTIFY_INODE_EVENTS will not be allowed, but I am not even
> sure if that limitation is needed.
>
> Thanks,
> Amir.

--
Gabriel Krisman Bertazi

2021-04-29 18:41:18

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC 10/15] fanotify: Introduce code location record

Amir Goldstein <[email protected]> writes:

> On Mon, Apr 26, 2021 at 9:43 PM Gabriel Krisman Bertazi
> <[email protected]> wrote:
>>
>> This patch introduces an optional info record that describes the
>> source (as in the region of the source-code where an event was
>> initiated). This record is not produced for other type of existing
>> notification, but it is optionally enabled for FAN_ERROR notifications.
>>
>
> I find this functionality controversial, because think that the fs provided
> s_last_error*, s_first_error* is more reliable and more powerful than this
> functionality.
>
> Let's leave it for a future extending proposal, should fanotify event reporting
> proposal pass muster, shall we?
> Or do you think that without this optional extension fanotify event reporting
> will not be valuable enough?

I think it is valuable enough without this bit, at least on a first
moment. I understand it would be useful for ext4 to analyse information
through this interface, but the main priority is to have a way to push
out the information that an error occured, as you mentioned.

Also, this might be more powerful if we stick to the ring buffer instead
of single stlot, as it would allow more data to be collected than just
first/last.
>
> Thanks,
> Amir.

--
Gabriel Krisman Bertazi

2021-05-11 04:48:34

by Khazhismel Kumykov

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] File system wide monitoring

On Tue, Apr 27, 2021 at 8:44 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Amir Goldstein <[email protected]> writes:
>
> > On Mon, Apr 26, 2021 at 9:42 PM Gabriel Krisman Bertazi
> > <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> In an attempt to consolidate some of the feedback from the previous
> >> proposals, I wrote a new attempt to solve the file system error reporting
> >> problem. Before I spend more time polishing it, I'd like to hear your
> >> feedback if I'm going in the wrong direction, in particular with the
> >> modifications to fsnotify.
> >>
> >
> > IMO you are going in the right direction, but you have gone a bit too far ;-)
> >
> > My understanding of the requirements and my interpretation of the feedback
> > from filesystem maintainers is that the missing piece in the ecosystem is a
> > user notification that "something went wrong". The "what went wrong" part
> > is something that users and admins have long been able to gather from the
> > kernel log and from filesystem tools (e.g. last error recorded).
> >
> > I do not see the need to duplicate existing functionality in fsmonitor.
> > Don't get me wrong, I understand why it would have been nice for fsmonitor
> > to be able to get all the errors nicely without looking anywhere else, but I
> > don't think it justifies the extra complexity.
>
> Hi Amir,
>
> Thanks for the detailed review.
>
> The reasons for the location record and the ring buffer is the use case
> from Google to do analysis on a series of errors. I understand this is
> important to them, which is why I expanded a bit on the 'what went
> wrong' and multiple errors. In addition, The file system specific blob
> attempts to assist online recovery tools with more information, but it
> might make sense to do it in the future, when it is needed.
>
> >> This RFC follows up on my previous proposals which attempted to leverage
> >> watch_queue[1] and fsnotify[2] to provide a mechanism for file systems
> >> to push error notifications to user space. This proposal starts by, as
> >> suggested by Darrick, limiting the scope of what I'm trying to do to an
> >> interface for administrators to monitor the health of a file system,
> >> instead of a generic inteface for file errors. Therefore, this doesn't
> >> solve the problem of writeback errors or the need to watch a specific
> >> subsystem.
> >>
> >> * Format
> >>
> >> The feature is implemented on top of fanotify, as a new type of fanotify
> >> mark, FAN_ERROR, which a file system monitoring tool can register to
> >
> > You have a terminology mistake throughout your series.
> > FAN_ERROR is not a type of a mark, it is a type of an event.
> > A mark describes the watched object (i.e. a filesystem, mount, inode).
>
> Right. I understand the mistake and will fix it around the series.
> >
> >> receive notifications. A notification is split in three parts, and only
> >> the first is guaranteed to exist for any given error event:
> >>
> >> - FS generic data: A file system agnostic structure that has a generic
> >> error code and identifies the filesystem. Basically, it let's
> >> userspace know something happen on a monitored filesystem.
> >
> > I think an error seq counter per fs would be a nice addition to generic data.
> > It does not need to be persistent (it could be if filesystem supports it).
>
> Makes sense to me.
>
> >>
> >> - FS location data: Identifies where in the code the problem
> >> happened. (This is important for the use case of analysing frequent
> >> error points that we discussed earlier).
> >>
> >> - FS specific data: A detailed error report in a filesystem specific
> >> format that details what the error is. Ideally, a capable monitoring
> >> tool can use the information here for error recovery. For instance,
> >> xfs can put the xfs_scrub structures here, ext4 can send its error
> >> reports, etc. An example of usage is done in the ext4 patch of this
> >> series.
> >>
> >> More details on the information in each record can be found on the
> >> documentation introduced in patch 15.
> >>
> >> * Using fanotify
> >>
> >> Using fanotify for this kind of thing is slightly tricky because we want
> >> to guarantee delivery in some complicated conditions, for instance, the
> >> file system might want to send an error while holding several locks.
> >>
> >> Instead of working around file system constraints at the file system
> >> level, this proposal tries to make the FAN_ERROR submission safe in
> >> those contexts. This is done with a new mode in fsnotify that
> >> preallocates the memory at group creation to be used for the
> >> notification submission.
> >>
> >> This new mode in fsnotify introduces a ring buffer to queue
> >> notifications, which eliminates the allocation path in fsnotify. From
> >> what I saw, the allocation is the only problem in fsnotify for
> >> filesystems to submit errors in constrained situations.
> >>
> >
> > The ring buffer functionality for fsnotify is interesting and it may be
> > useful on its own, but IMO, its too big of a hammer for the problem
> > at hand.
> >
> > The question that you should be asking yourself is what is the
> > expected behavior in case of a flood of filesystem corruption errors.
> > I think it has already been expressed by filesystem maintainers on
> > one your previous postings, that a flood of filesystem corruption
> > errors is often noise and the only interesting information is the
> > first error.
>
> My idea was be to provide an ioctl for the user to resize the ring
> buffer when needed, to make the flood manageable. But I understand your
> main point about the ring buffer. i'm not sure saving only the first
> notification solves Google's use case of error monitoring and analysis,
> though. Khazhy, Ted, can you weight in?

I think this is a good point to bring up - a flood of errors shouldn't
drown out other filesystems, and from the perspective of error
reporting, it's better to drop all but one notification from one FS
than to drop the only notification from another. In the cases I can
think of, the first error is probably enough and does simplify things
quite a bit. There is the option of setting up a ring buffer per fs,
which does seem excessive in light of the previous statement.

>
> > For this reason, I think that FS_ERROR could be implemented
> > by attaching an fsnotify_error_info object to an fsnotify_sb_mark:
> >
> > struct fsnotify_sb_mark {
> > struct fsnotify_mark fsn_mark;
> > struct fsnotify_error_info info;
> > }
> >
> > Similar to fd sampled errseq, there can be only one error report
> > per sb-group pair (i.e. fsnotify_sb_mark) and the memory needed to store
> > the error report can be allocated at the time of setting the filesystem mark.
> >
> > With this, you will not need the added complexity of the ring buffer
> > and you will not need to limit FAN_ERROR reporting to a group that
> > is only listening for FAN_ERROR, which is an unneeded limitation IMO.
>
> The limitation exists because I was concerned about not breaking the
> semantics of FAN_ACCESS and others, with regards to merged
> notifications. I believe there should be no other reason why
> notifications of FAN_CLASS_NOTIF can't be sent to the ring buffer too.
> That limitation could be lifted for everything but permission events, I
> think.
>
> --
> Gabriel Krisman Bertazi


Attachments:
smime.p7s (3.90 kB)
S/MIME Cryptographic Signature

2021-05-11 05:36:23

by Khazhismel Kumykov

[permalink] [raw]
Subject: Re: [PATCH RFC 10/15] fanotify: Introduce code location record

On Thu, Apr 29, 2021 at 11:40 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Amir Goldstein <[email protected]> writes:
>
> > On Mon, Apr 26, 2021 at 9:43 PM Gabriel Krisman Bertazi
> > <[email protected]> wrote:
> >>
> >> This patch introduces an optional info record that describes the
> >> source (as in the region of the source-code where an event was
> >> initiated). This record is not produced for other type of existing
> >> notification, but it is optionally enabled for FAN_ERROR notifications.
> >>
> >
> > I find this functionality controversial, because think that the fs provided
> > s_last_error*, s_first_error* is more reliable and more powerful than this
> > functionality.
> >
> > Let's leave it for a future extending proposal, should fanotify event reporting
> > proposal pass muster, shall we?
> > Or do you think that without this optional extension fanotify event reporting
> > will not be valuable enough?
>
> I think it is valuable enough without this bit, at least on a first
> moment. I understand it would be useful for ext4 to analyse information
> through this interface, but the main priority is to have a way to push
> out the information that an error occured, as you mentioned.

Ack, if it's deemed cleaner we could look at sysfs on notification,
but having the information in the same event provides some convenience
factor, and avoids racing in the event that we're looking at an error
after the first one.

>
> Also, this might be more powerful if we stick to the ring buffer instead
> of single stlot, as it would allow more data to be collected than just
> first/last.
> >
> > Thanks,
> > Amir.
>
> --
> Gabriel Krisman Bertazi


Attachments:
smime.p7s (3.90 kB)
S/MIME Cryptographic Signature

2021-05-11 10:44:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 00/15] File system wide monitoring

On Tue 27-04-21 07:11:49, Amir Goldstein wrote:
> The ring buffer functionality for fsnotify is interesting and it may be
> useful on its own, but IMO, its too big of a hammer for the problem
> at hand.
>
> The question that you should be asking yourself is what is the
> expected behavior in case of a flood of filesystem corruption errors.
> I think it has already been expressed by filesystem maintainers on
> one your previous postings, that a flood of filesystem corruption
> errors is often noise and the only interesting information is the first error.
>
> For this reason, I think that FS_ERROR could be implemented
> by attaching an fsnotify_error_info object to an fsnotify_sb_mark:
>
> struct fsnotify_sb_mark {
> struct fsnotify_mark fsn_mark;
> struct fsnotify_error_info info;
> }
>
> Similar to fd sampled errseq, there can be only one error report
> per sb-group pair (i.e. fsnotify_sb_mark) and the memory needed to store
> the error report can be allocated at the time of setting the filesystem mark.
>
> With this, you will not need the added complexity of the ring buffer
> and you will not need to limit FAN_ERROR reporting to a group that
> is only listening for FAN_ERROR, which is an unneeded limitation IMO.

Seeing that this 'single error per mark' idea is gathering some support I'd
like to add my 2c: Probably we don't want fsnotify_error_info attached to
every fsnotify_mark, I guess we can have:

struct fanotify_mark {
struct fsnotify_mark fsn_mark;
struct fanotify_error_event *event;
};

and 'event' will be normally NULL and if we add FAN_ERROR to mark's mask,
we will allocate event (also containing error info) to use when generating
error event. And then the handling will be somewhat similar to how we
handle overflow events.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR