2020-12-08 00:54:48

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH 0/8] Superblock Notifications

Hi,

After the two previous RFCs this is an attempt to get the ball spinning
again.

The problem I'm trying to solve is providing an interface to monitor
filesystem errors. This patchset includes an example implementation of
ext4 error notification. This goes obviously on top of the watch_queue
mechanism.

Regarding the buffer overrun issue that would require fsinfo or another
method to expose counters, I think they can be added at a later date
with no change to what this patch series attempts to do, therefore I'm
proposing we don't wait for fsinfo before getting this merged.

I mostly tested this with the samples program I have published in the
last patch. In addition I will be sharing a patchset shortly with
proper documentation and some selftests for the feature.

David, can you please reply to this patchset? What do you think about
the watch_queue modifications I'm proposing? I really don't want to
waste more time on this code if it doesn't fit the watch_queue API. Can
I have some guidance in having this upstreamed?

In addition, I'm carrying "watch_queue: Make watch_sizeof() check record
size" on this patchset for now, but is it in anyone's tree going to
Linus any time soon? I haven't found it.

I also shared this patchset in a branch at:

Repo: https://gitlab.collabora.com/krisman/linux.git -b notifications

Previous RFC submissions can be found at:

RFC: https://www.spinics.net/lists/linux-ext4/msg74596.html
RFC v2:
https://lore.kernel.org/linux-fsdevel/[email protected]/

Original cover letter:
======================

Google has been using an out-of-tree mechanism for error notification in
Ext4 and we decided it is time to push for an upstream solution. This
would surely fit on top of David's notification work.

This patchset is an attempt to restart that discussion. It forward
ports
some code from David on top of Linus tree, adds features to
watch_queue and implements ext4 support.

The new notifications are designed after ext4 messages, so it exposes
notifications types to fit that filesystem, but it doesn't change much
to other filesystems, so it should be easily extensible.

I'm aware of the discussion around fsinfo, but I'd like to ask if there
are other missing pieces and what we could do to help that work go
upstream. From a previous mailing list discussion, Linus complained
about lack of users as a main reason for it to not be merged, so hey! :)

In addition, I'd like to ask for feedback on the current implementation,
specifically regarding the passing of extra unformatted information at
the end of the notification and the ext4 support.

The work, as shared on this patchset can be found at:

https://gitlab.collabora.com/krisman/linux.git -b
ext4-error-notifications

And there is an example code at:

https://gitlab.collabora.com/krisman/ext4-watcher

I'm Cc'ing Khazhismel Kumykov, from Google, who can provide more
information about their use case, if requested.

David Howells (3):
watch_queue: Make watch_sizeof() check record size
security: Add hooks to rule on setting a watch for superblock
vfs: Add superblock notifications

Gabriel Krisman Bertazi (5):
watch_queue: Support a text field at the end of the notification
vfs: Include origin of the SB error notification
fs: Add more superblock error subtypes
ext4: Implement SB error notification through watch_sb
samples: watch_queue: Add sample of SB notifications

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/Kconfig | 12 ++
fs/ext4/super.c | 31 +++--
fs/super.c | 127 +++++++++++++++++++++
include/linux/fs.h | 150 +++++++++++++++++++++++++
include/linux/lsm_hook_defs.h | 1 +
include/linux/lsm_hooks.h | 4 +
include/linux/security.h | 13 +++
include/linux/syscalls.h | 2 +
include/linux/watch_queue.h | 21 +++-
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/watch_queue.h | 54 ++++++++-
kernel/sys_ni.c | 3 +
kernel/watch_queue.c | 29 ++++-
samples/watch_queue/Makefile | 2 +-
samples/watch_queue/watch_sb.c | 114 +++++++++++++++++++
security/security.c | 6 +
18 files changed, 556 insertions(+), 19 deletions(-)
create mode 100644 samples/watch_queue/watch_sb.c

--
2.29.2


2020-12-08 00:56:37

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH 1/8] watch_queue: Make watch_sizeof() check record size

From: David Howells <[email protected]>

Make watch_sizeof() give a build error if the size of the struct won't fit
into the size field in the header.

Reported-by: Miklos Szeredi <[email protected]>
Signed-off-by: David Howells <[email protected]>
[Rebase to 5.10]
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/watch_queue.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/watch_queue.h b/include/linux/watch_queue.h
index c994d1b2cdba..f1086d12cd03 100644
--- a/include/linux/watch_queue.h
+++ b/include/linux/watch_queue.h
@@ -120,7 +120,12 @@ static inline void remove_watch_list(struct watch_list *wlist, u64 id)
* watch_sizeof - Calculate the information part of the size of a watch record,
* given the structure size.
*/
-#define watch_sizeof(STRUCT) (sizeof(STRUCT) << WATCH_INFO_LENGTH__SHIFT)
+#define watch_sizeof(STRUCT) \
+ ({ \
+ size_t max = WATCH_INFO_LENGTH >> WATCH_INFO_LENGTH__SHIFT; \
+ BUILD_BUG_ON(sizeof(STRUCT) > max); \
+ sizeof(STRUCT) << WATCH_INFO_LENGTH__SHIFT; \
+ })

#else
static inline int watch_queue_init(struct pipe_inode_info *pipe)
--
2.29.2

2020-12-08 00:56:37

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH 5/8] vfs: Include origin of the SB error notification

When reporting a filesystem error, we really need to know where the
error came from, therefore, include "function:line" information in the
notification sent to userspace. There is no current users of notify_sb
in the kernel, so there are no callers to update.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/fs.h | 11 +++++++++--
include/uapi/linux/watch_queue.h | 3 +++
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index df588edc0a34..864d86fcc68c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3514,14 +3514,17 @@ static inline void notify_sb(struct super_block *s,
/**
* notify_sb_error: Post superblock error notification.
* @s: The superblock the notification is about.
+ * @function: function name reported as source of the warning.
+ * @line: source code line reported as source of the warning.
* @error: The error number to be recorded.
* @inode: The inode the error refers to (if available, 0 otherwise)
* @block: The block the error refers to (if available, 0 otherwise)
* @fmt: Formating string for extra information appended to the notification
* @args: arguments for extra information string appended to the notification
*/
-static inline int notify_sb_error(struct super_block *s, int error, u64 inode,
- u64 block, const char *fmt, va_list *args)
+static inline int notify_sb_error(struct super_block *s, const char *function, int line,
+ int error, u64 inode, u64 block,
+ const char *fmt, va_list *args)
{
#ifdef CONFIG_SB_NOTIFICATIONS
if (unlikely(s->s_watchers)) {
@@ -3534,8 +3537,12 @@ static inline int notify_sb_error(struct super_block *s, int error, u64 inode,
.error_cookie = 0,
.inode = inode,
.block = block,
+ .line = line,
};

+ memcpy(&n.function, function, SB_NOTIFICATION_FNAME_LEN);
+ n.function[SB_NOTIFICATION_FNAME_LEN-1] = '\0';
+
post_sb_notification(s, &n.s, fmt, args);
}
#endif
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index 937363d9f7b3..5fa5286c5cc7 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -114,6 +114,7 @@ enum superblock_notification_type {

#define NOTIFY_SUPERBLOCK_IS_NOW_RO WATCH_INFO_FLAG_0 /* Superblock changed to R/O */

+#define SB_NOTIFICATION_FNAME_LEN 30
/*
* Superblock notification record.
* - watch.type = WATCH_TYPE_MOUNT_NOTIFY
@@ -130,6 +131,8 @@ struct superblock_error_notification {
__u32 error_cookie;
__u64 inode;
__u64 block;
+ char function[SB_NOTIFICATION_FNAME_LEN];
+ __u16 line;
char desc[0];
};

--
2.29.2

2020-12-08 00:56:37

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH 3/8] watch_queue: Support a text field at the end of the notification

This allow notifications to send text information to userspace without
having to copy it to a temporary buffer to then copy to the ring. One
use case to pass text information in notifications is for error
reporting, where more debug information might be needed, but we don't
want to explode the number of subtypes of notifications. For instance,
ext4 can have a single inode error notification subtype, and pass more
information on the cause of the error in this field.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/watch_queue.h | 14 ++++++++++++--
kernel/watch_queue.c | 29 ++++++++++++++++++++++++-----
2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/watch_queue.h b/include/linux/watch_queue.h
index f1086d12cd03..2f5a7446bca6 100644
--- a/include/linux/watch_queue.h
+++ b/include/linux/watch_queue.h
@@ -79,7 +79,7 @@ struct watch_list {
extern void __post_watch_notification(struct watch_list *,
struct watch_notification *,
const struct cred *,
- u64);
+ u64, const char*, va_list*);
extern struct watch_queue *get_watch_queue(int);
extern void put_watch_queue(struct watch_queue *);
extern void init_watch(struct watch *, struct watch_queue *);
@@ -105,7 +105,17 @@ static inline void post_watch_notification(struct watch_list *wlist,
u64 id)
{
if (unlikely(wlist))
- __post_watch_notification(wlist, n, cred, id);
+ __post_watch_notification(wlist, n, cred, id, NULL, NULL);
+}
+
+static inline void post_watch_notification_string(struct watch_list *wlist,
+ struct watch_notification *n,
+ const struct cred *cred,
+ u64 id, const char *fmt,
+ va_list *args)
+{
+ if (unlikely(wlist))
+ __post_watch_notification(wlist, n, cred, id, fmt, args);
}

static inline void remove_watch_list(struct watch_list *wlist, u64 id)
diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c
index 0ef8f65bd2d7..89fcf0420ce7 100644
--- a/kernel/watch_queue.c
+++ b/kernel/watch_queue.c
@@ -70,13 +70,15 @@ static const struct pipe_buf_operations watch_queue_pipe_buf_ops = {
* Post a notification to a watch queue.
*/
static bool post_one_notification(struct watch_queue *wqueue,
- struct watch_notification *n)
+ struct watch_notification *n,
+ const char *fmt, va_list *args)
{
void *p;
struct pipe_inode_info *pipe = wqueue->pipe;
struct pipe_buffer *buf;
struct page *page;
unsigned int head, tail, mask, note, offset, len;
+ int wlen = 0;
bool done = false;

if (!pipe)
@@ -102,6 +104,23 @@ static bool post_one_notification(struct watch_queue *wqueue,
get_page(page);
len = n->info & WATCH_INFO_LENGTH;
p = kmap_atomic(page);
+ /*
+ * Write the tail description before the actual header, because
+ * the string needs to be generated to calculate the final
+ * notification size, that is passed in the header.
+ */
+ if (fmt) {
+ wlen = vscnprintf(p + offset + len, WATCH_INFO_LENGTH - len,
+ fmt, (args ? *args : NULL));
+ wlen += 1; /* vscnprintf doesn't include '\0' */
+ if (wlen > 0) {
+ n->info = n->info & ~WATCH_INFO_LENGTH;
+ n->info |= (len + wlen) & WATCH_INFO_LENGTH;
+ } else {
+ /* Drop errors when writing the extra string. */
+ wlen = 0;
+ }
+ }
memcpy(p + offset, n, len);
kunmap_atomic(p);

@@ -110,7 +129,7 @@ static bool post_one_notification(struct watch_queue *wqueue,
buf->private = (unsigned long)wqueue;
buf->ops = &watch_queue_pipe_buf_ops;
buf->offset = offset;
- buf->len = len;
+ buf->len = (len + wlen);
buf->flags = PIPE_BUF_FLAG_WHOLE;
pipe->head = head + 1;

@@ -175,7 +194,7 @@ static bool filter_watch_notification(const struct watch_filter *wf,
void __post_watch_notification(struct watch_list *wlist,
struct watch_notification *n,
const struct cred *cred,
- u64 id)
+ u64 id, const char *fmt, va_list *args)
{
const struct watch_filter *wf;
struct watch_queue *wqueue;
@@ -202,7 +221,7 @@ void __post_watch_notification(struct watch_list *wlist,
if (security_post_notification(watch->cred, cred, n) < 0)
continue;

- post_one_notification(wqueue, n);
+ post_one_notification(wqueue, n, fmt, args);
}

rcu_read_unlock();
@@ -522,7 +541,7 @@ int remove_watch_from_object(struct watch_list *wlist, struct watch_queue *wq,
* protecting *wqueue from deallocation.
*/
if (wqueue) {
- post_one_notification(wqueue, &n.watch);
+ post_one_notification(wqueue, &n.watch, NULL, NULL);

spin_lock_bh(&wqueue->lock);

--
2.29.2

2020-12-08 00:56:37

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH 7/8] ext4: Implement SB error notification through watch_sb

This follows the same implementation of ext4 error reporting via dmesg,
but expose that information via the new watch_queue notifications API.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/super.c | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 94472044f4c1..f239624003fc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -713,15 +713,17 @@ void __ext4_error(struct super_block *sb, const char *function,
return;

trace_ext4_error(sb, function, line);
+ va_start(args, fmt);
if (ext4_error_ratelimit(sb)) {
- va_start(args, fmt);
vaf.fmt = fmt;
vaf.va = &args;
printk(KERN_CRIT
"EXT4-fs error (device %s): %s:%d: comm %s: %pV\n",
sb->s_id, function, line, current->comm, &vaf);
- va_end(args);
}
+ notify_sb_error(sb, function, line, error, 0, 0, fmt, &args);
+ va_end(args);
+
save_error_info(sb, error, 0, block, function, line);
ext4_handle_error(sb);
}
@@ -737,8 +739,8 @@ void __ext4_error_inode(struct inode *inode, const char *function,
return;

trace_ext4_error(inode->i_sb, function, line);
+ va_start(args, fmt);
if (ext4_error_ratelimit(inode->i_sb)) {
- va_start(args, fmt);
vaf.fmt = fmt;
vaf.va = &args;
if (block)
@@ -751,8 +753,11 @@ void __ext4_error_inode(struct inode *inode, const char *function,
"inode #%lu: comm %s: %pV\n",
inode->i_sb->s_id, function, line, inode->i_ino,
current->comm, &vaf);
- va_end(args);
}
+ notify_sb_error(inode->i_sb, function, line, error, inode->i_ino, block,
+ fmt, &args);
+ va_end(args);
+
save_error_info(inode->i_sb, error, inode->i_ino, block,
function, line);
ext4_handle_error(inode->i_sb);
@@ -771,11 +776,11 @@ void __ext4_error_file(struct file *file, const char *function,
return;

trace_ext4_error(inode->i_sb, function, line);
+ va_start(args, fmt);
if (ext4_error_ratelimit(inode->i_sb)) {
path = file_path(file, pathname, sizeof(pathname));
if (IS_ERR(path))
path = "(unknown)";
- va_start(args, fmt);
vaf.fmt = fmt;
vaf.va = &args;
if (block)
@@ -790,8 +795,11 @@ void __ext4_error_file(struct file *file, const char *function,
"comm %s: path %s: %pV\n",
inode->i_sb->s_id, function, line, inode->i_ino,
current->comm, path, &vaf);
- va_end(args);
}
+ notify_sb_error(inode->i_sb, function, line, EFSCORRUPTED,
+ inode->i_ino, block, fmt, &args);
+ va_end(args);
+
save_error_info(inode->i_sb, EFSCORRUPTED, inode->i_ino, block,
function, line);
ext4_handle_error(inode->i_sb);
@@ -861,6 +869,8 @@ void __ext4_std_error(struct super_block *sb, const char *function,
sb->s_id, function, line, errstr);
}

+ notify_sb_error(sb, function, line, errno, 0, 0, errstr, NULL);
+
save_error_info(sb, -errno, 0, 0, function, line);
ext4_handle_error(sb);
}
@@ -890,6 +900,7 @@ void __ext4_abort(struct super_block *sb, const char *function,
vaf.va = &args;
printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: %pV\n",
sb->s_id, function, line, &vaf);
+ notify_sb_error(sb, function, line, error, 0, 0, fmt, &args);
va_end(args);

if (sb_rdonly(sb) == 0) {
@@ -923,6 +934,7 @@ void __ext4_msg(struct super_block *sb,
vaf.fmt = fmt;
vaf.va = &args;
printk("%sEXT4-fs (%s): %pV\n", prefix, sb->s_id, &vaf);
+ notify_sb_msg(sb, fmt, &args);
va_end(args);
}

@@ -947,6 +959,7 @@ void __ext4_warning(struct super_block *sb, const char *function,
vaf.va = &args;
printk(KERN_WARNING "EXT4-fs warning (device %s): %s:%d: %pV\n",
sb->s_id, function, line, &vaf);
+ notify_sb_warning(sb, function, line, 0, 0, fmt, &args);
va_end(args);
}

@@ -965,6 +978,7 @@ void __ext4_warning_inode(const struct inode *inode, const char *function,
printk(KERN_WARNING "EXT4-fs warning (device %s): %s:%d: "
"inode #%lu: comm %s: %pV\n", inode->i_sb->s_id,
function, line, inode->i_ino, current->comm, &vaf);
+ notify_sb_warning(inode->i_sb, function, line, inode->i_ino, 0, fmt, &args);
va_end(args);
}

@@ -984,8 +998,8 @@ __acquires(bitlock)
trace_ext4_error(sb, function, line);
__save_error_info(sb, EFSCORRUPTED, ino, block, function, line);

+ va_start(args, fmt);
if (ext4_error_ratelimit(sb)) {
- va_start(args, fmt);
vaf.fmt = fmt;
vaf.va = &args;
printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: group %u, ",
@@ -996,8 +1010,9 @@ __acquires(bitlock)
printk(KERN_CONT "block %llu:",
(unsigned long long) block);
printk(KERN_CONT "%pV\n", &vaf);
- va_end(args);
}
+ notify_sb_error(sb, function, line, EFSCORRUPTED, ino, block, fmt, &args);
+ va_end(args);

if (test_opt(sb, WARN_ON_ERROR))
WARN_ON_ONCE(1);
--
2.29.2

2020-12-08 00:57:39

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

On Mon, Dec 07, 2020 at 09:31:14PM -0300, Gabriel Krisman Bertazi wrote:
> When reporting a filesystem error, we really need to know where the
> error came from, therefore, include "function:line" information in the
> notification sent to userspace. There is no current users of notify_sb
> in the kernel, so there are no callers to update.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> include/linux/fs.h | 11 +++++++++--
> include/uapi/linux/watch_queue.h | 3 +++
> 2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index df588edc0a34..864d86fcc68c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3514,14 +3514,17 @@ static inline void notify_sb(struct super_block *s,
> /**
> * notify_sb_error: Post superblock error notification.
> * @s: The superblock the notification is about.
> + * @function: function name reported as source of the warning.
> + * @line: source code line reported as source of the warning.
> * @error: The error number to be recorded.
> * @inode: The inode the error refers to (if available, 0 otherwise)
> * @block: The block the error refers to (if available, 0 otherwise)
> * @fmt: Formating string for extra information appended to the notification
> * @args: arguments for extra information string appended to the notification
> */
> -static inline int notify_sb_error(struct super_block *s, int error, u64 inode,
> - u64 block, const char *fmt, va_list *args)
> +static inline int notify_sb_error(struct super_block *s, const char *function, int line,
> + int error, u64 inode, u64 block,
> + const char *fmt, va_list *args)
> {
> #ifdef CONFIG_SB_NOTIFICATIONS
> if (unlikely(s->s_watchers)) {
> @@ -3534,8 +3537,12 @@ static inline int notify_sb_error(struct super_block *s, int error, u64 inode,
> .error_cookie = 0,
> .inode = inode,
> .block = block,
> + .line = line,
> };
>
> + memcpy(&n.function, function, SB_NOTIFICATION_FNAME_LEN);
> + n.function[SB_NOTIFICATION_FNAME_LEN-1] = '\0';
> +
> post_sb_notification(s, &n.s, fmt, args);
> }
> #endif
> diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
> index 937363d9f7b3..5fa5286c5cc7 100644
> --- a/include/uapi/linux/watch_queue.h
> +++ b/include/uapi/linux/watch_queue.h
> @@ -114,6 +114,7 @@ enum superblock_notification_type {
>
> #define NOTIFY_SUPERBLOCK_IS_NOW_RO WATCH_INFO_FLAG_0 /* Superblock changed to R/O */
>
> +#define SB_NOTIFICATION_FNAME_LEN 30
> /*
> * Superblock notification record.
> * - watch.type = WATCH_TYPE_MOUNT_NOTIFY
> @@ -130,6 +131,8 @@ struct superblock_error_notification {
> __u32 error_cookie;
> __u64 inode;
> __u64 block;
> + char function[SB_NOTIFICATION_FNAME_LEN];
> + __u16 line;

Er... this is enlarging a structure in the userspace ABI, right? Which
will break userspace that latched on to the structure definition in the
previous patch, and therefore isn't expecting a function name here.

If you're gonna put a character string(?) at the end then I guess you
have to pre-pad the notification structure so that we can add things
later, or.... bump the type code every time you add a field?

(Maybe I misread that? But this is include/uapi/...)

--D

> char desc[0];
> };
>
> --
> 2.29.2
>

2020-12-08 00:58:20

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

"Darrick J. Wong" <[email protected]> writes:

> On Mon, Dec 07, 2020 at 09:31:14PM -0300, Gabriel Krisman Bertazi wrote:
>> When reporting a filesystem error, we really need to know where the
>> error came from, therefore, include "function:line" information in the
>> notification sent to userspace. There is no current users of notify_sb
>> in the kernel, so there are no callers to update.
>>
>> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
>> ---
>> include/linux/fs.h | 11 +++++++++--
>> include/uapi/linux/watch_queue.h | 3 +++
>> 2 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index df588edc0a34..864d86fcc68c 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -3514,14 +3514,17 @@ static inline void notify_sb(struct super_block *s,
>> /**
>> * notify_sb_error: Post superblock error notification.
>> * @s: The superblock the notification is about.
>> + * @function: function name reported as source of the warning.
>> + * @line: source code line reported as source of the warning.
>> * @error: The error number to be recorded.
>> * @inode: The inode the error refers to (if available, 0 otherwise)
>> * @block: The block the error refers to (if available, 0 otherwise)
>> * @fmt: Formating string for extra information appended to the notification
>> * @args: arguments for extra information string appended to the notification
>> */
>> -static inline int notify_sb_error(struct super_block *s, int error, u64 inode,
>> - u64 block, const char *fmt, va_list *args)
>> +static inline int notify_sb_error(struct super_block *s, const char *function, int line,
>> + int error, u64 inode, u64 block,
>> + const char *fmt, va_list *args)
>> {
>> #ifdef CONFIG_SB_NOTIFICATIONS
>> if (unlikely(s->s_watchers)) {
>> @@ -3534,8 +3537,12 @@ static inline int notify_sb_error(struct super_block *s, int error, u64 inode,
>> .error_cookie = 0,
>> .inode = inode,
>> .block = block,
>> + .line = line,
>> };
>>
>> + memcpy(&n.function, function, SB_NOTIFICATION_FNAME_LEN);
>> + n.function[SB_NOTIFICATION_FNAME_LEN-1] = '\0';
>> +
>> post_sb_notification(s, &n.s, fmt, args);
>> }
>> #endif
>> diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
>> index 937363d9f7b3..5fa5286c5cc7 100644
>> --- a/include/uapi/linux/watch_queue.h
>> +++ b/include/uapi/linux/watch_queue.h
>> @@ -114,6 +114,7 @@ enum superblock_notification_type {
>>
>> #define NOTIFY_SUPERBLOCK_IS_NOW_RO WATCH_INFO_FLAG_0 /* Superblock changed to R/O */
>>
>> +#define SB_NOTIFICATION_FNAME_LEN 30
>> /*
>> * Superblock notification record.
>> * - watch.type = WATCH_TYPE_MOUNT_NOTIFY
>> @@ -130,6 +131,8 @@ struct superblock_error_notification {
>> __u32 error_cookie;
>> __u64 inode;
>> __u64 block;
>> + char function[SB_NOTIFICATION_FNAME_LEN];
>> + __u16 line;
>
> Er... this is enlarging a structure in the userspace ABI, right? Which
> will break userspace that latched on to the structure definition in the
> previous patch, and therefore isn't expecting a function name here.

Hi Darrick,

Since the structure is defined in the patch immediately before, I
thought it would be ok to split the patch to preserve authorship of the
different parts. I will fold this into patch 4 in the next iteration.

>
> If you're gonna put a character string(?) at the end then I guess you
> have to pre-pad the notification structure so that we can add things
> later, or.... bump the type code every time you add a field?
>
> (Maybe I misread that? But this is include/uapi/...)
>
> --D
>
>> char desc[0];
>> };
>>
>> --
>> 2.29.2
>>

--
Gabriel Krisman Bertazi

2020-12-08 12:58:05

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

Gabriel Krisman Bertazi <[email protected]> wrote:

> Since the structure is defined in the patch immediately before, I
> thought it would be ok to split the patch to preserve authorship of the
> different parts.

Think git bisect.

David

2020-12-08 13:01:40

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

Gabriel Krisman Bertazi <[email protected]> wrote:

> @@ -130,6 +131,8 @@ struct superblock_error_notification {
> __u32 error_cookie;
> __u64 inode;
> __u64 block;
> + char function[SB_NOTIFICATION_FNAME_LEN];
> + __u16 line;
> char desc[0];
> };

As Darrick said, this is a UAPI breaker, so you shouldn't do this (you can,
however, merge this ahead a patch). Also, I would put the __u16 before the
char[].

That said, I'm not sure whether it's useful to include the function name and
line. Both fields are liable to change over kernel commits, so it's not
something userspace can actually interpret. I think you're better off dumping
those into dmesg.

Further, this reduces the capacity of desc[] significantly - I don't know if
that's a problem.

And yet further, there's no room for addition of new fields with the desc[]
buffer on the end. Now maybe you're planning on making use of desc[] for
text-encoding?

David

2020-12-08 13:03:39

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

David Howells <[email protected]> writes:

> Gabriel Krisman Bertazi <[email protected]> wrote:
>
>> @@ -130,6 +131,8 @@ struct superblock_error_notification {
>> __u32 error_cookie;
>> __u64 inode;
>> __u64 block;
>> + char function[SB_NOTIFICATION_FNAME_LEN];
>> + __u16 line;
>> char desc[0];
>> };
>
> As Darrick said, this is a UAPI breaker, so you shouldn't do this (you can,
> however, merge this ahead a patch). Also, I would put the __u16 before the
> char[].
>
> That said, I'm not sure whether it's useful to include the function name and
> line. Both fields are liable to change over kernel commits, so it's not
> something userspace can actually interpret. I think you're better off dumping
> those into dmesg.
>
> Further, this reduces the capacity of desc[] significantly - I don't know if
> that's a problem.

Yes, that is a big problem as desc is already quite limited. I don't
think it is a problem for them to change between kernel versions, as the
monitoring userspace can easily associate it with the running kernel.
The alternative would be generating something like unique IDs for each
error notification in the filesystem, no?

> And yet further, there's no room for addition of new fields with the desc[]
> buffer on the end. Now maybe you're planning on making use of desc[] for
> text-encoding?

Yes. I would like to be able to provide more details on the error,
without having a unique id. For instance, desc would have the formatted
string below, describing the warning:

ext4_warning(inode->i_sb, "couldn't mark inode dirty (err %d)", err);


>
> David
>

--
Gabriel Krisman Bertazi

2020-12-08 13:06:01

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/8] Superblock Notifications

Gabriel Krisman Bertazi <[email protected]> wrote:

> After the two previous RFCs this is an attempt to get the ball spinning
> again.
>
> The problem I'm trying to solve is providing an interface to monitor
> filesystem errors. This patchset includes an example implementation of
> ext4 error notification. This goes obviously on top of the watch_queue
> mechanism.

Thanks for picking this up and advancing it.

> Regarding the buffer overrun issue that would require fsinfo or another
> method to expose counters, I think they can be added at a later date
> with no change to what this patch series attempts to do, therefore I'm
> proposing we don't wait for fsinfo before getting this merged.

That's fine, but anyone wanting to use this will need to be aware that
overruns are a problem.

David

2020-12-08 13:06:17

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 3/8] watch_queue: Support a text field at the end of the notification

Gabriel Krisman Bertazi <[email protected]> wrote:

> This allow notifications to send text information to userspace without
> having to copy it to a temporary buffer to then copy to the ring. One
> use case to pass text information in notifications is for error
> reporting, where more debug information might be needed, but we don't
> want to explode the number of subtypes of notifications. For instance,
> ext4 can have a single inode error notification subtype, and pass more
> information on the cause of the error in this field.

I'm generally okay with this, but you really need to note that if you make use
of this for a subtype, you don't get to add more fields for that subtype
unless there's an offset to it.

Speaking of that, you need to update:

Documentation/watch_queue.rst

since you're changing the API.

The "Watch Sources" section will also need altering to indicate that
superblock events are now a thing.

David

2020-12-08 20:52:34

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

"Darrick J. Wong" <[email protected]> writes:

> On Tue, Dec 08, 2020 at 09:58:25AM -0300, Gabriel Krisman Bertazi wrote:
>> David Howells <[email protected]> writes:
>>
>> > Gabriel Krisman Bertazi <[email protected]> wrote:
>> >
>> >> @@ -130,6 +131,8 @@ struct superblock_error_notification {
>
> FWIW I wonder if this really should be inode_error_notification?
>
> If (for example) ext4 discovered an error in the blockgroup descriptor
> and wanted to report it, the inode and block numbers would be
> irrelevant, but the blockgroup number would be nice to have.

A previous RFC had superblock_error_notification and
superblock_inode_error_notification split, I think we can recover that.

>
>> >> __u32 error_cookie;
>> >> __u64 inode;
>> >> __u64 block;
>> >> + char function[SB_NOTIFICATION_FNAME_LEN];
>> >> + __u16 line;
>> >> char desc[0];
>> >> };
>> >
>> > As Darrick said, this is a UAPI breaker, so you shouldn't do this (you can,
>> > however, merge this ahead a patch). Also, I would put the __u16 before the
>> > char[].
>> >
>> > That said, I'm not sure whether it's useful to include the function name and
>> > line. Both fields are liable to change over kernel commits, so it's not
>> > something userspace can actually interpret. I think you're better off dumping
>> > those into dmesg.
>> >
>> > Further, this reduces the capacity of desc[] significantly - I don't know if
>> > that's a problem.
>>
>> Yes, that is a big problem as desc is already quite limited. I don't
>
> How limited?

The largest notification is 128 bytes, the one with the biggest header
is superblock_error_notification which leaves 56 bytes for description.

>
>> think it is a problem for them to change between kernel versions, as the
>> monitoring userspace can easily associate it with the running kernel.
>
> How do you make that association? $majordistro's 4.18 kernel is not the
> same as the upstream 4.18. Wouldn't you rather the notification message
> be entirely self-describing rather than depending on some external
> information about the sender?

True. I was thinking on my use case where the customer controls their
infrastructure and would specialize their userspace tools, but that is
poor design on my part. A self describing mechanism would be better.

>
>> The alternative would be generating something like unique IDs for each
>> error notification in the filesystem, no?
>>
>> > And yet further, there's no room for addition of new fields with the desc[]
>> > buffer on the end. Now maybe you're planning on making use of desc[] for
>> > text-encoding?
>>
>> Yes. I would like to be able to provide more details on the error,
>> without having a unique id. For instance, desc would have the formatted
>> string below, describing the warning:
>>
>> ext4_warning(inode->i_sb, "couldn't mark inode dirty (err %d)", err);
>
> Depending on the upper limit on the length of messages, I wonder if you
> could split the superblock notification and the description string into
> separate messages (with maybe the error cookie to tie them together) so
> that the struct isn't limited by having a VLA on the end, and the
> description can be more or less an arbitrary string?
>
> (That said I'm not familiar with the watch queue system so I have no
> idea if chained messages even make sense here, or are already
> implemented in some other way, or...)

I don't see any support for chaining messages in the current watch_queue
implementation, I'd need to extend the interface to support it. I
considered this idea before, given the small description size, but I
thought it would be over-complicated, even though much more future
proof. I will look into that.

What about the kernel exporting a per-filesystem table, as a build
target or in /sys/fs/<fs>/errors, that has descriptions strings for each
error? Then the notification can have only the FS type, index to the
table and params. This won't exactly be self-describing as you wanted
but, differently from function:line, it removes the need for the source
code, and allows localization. The per-filesystem table would be
stable ABI, of course.

--
Gabriel Krisman Bertazi

2020-12-08 21:14:45

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

On Tue, Dec 08, 2020 at 09:58:25AM -0300, Gabriel Krisman Bertazi wrote:
> David Howells <[email protected]> writes:
>
> > Gabriel Krisman Bertazi <[email protected]> wrote:
> >
> >> @@ -130,6 +131,8 @@ struct superblock_error_notification {

FWIW I wonder if this really should be inode_error_notification?

If (for example) ext4 discovered an error in the blockgroup descriptor
and wanted to report it, the inode and block numbers would be
irrelevant, but the blockgroup number would be nice to have.

> >> __u32 error_cookie;
> >> __u64 inode;
> >> __u64 block;
> >> + char function[SB_NOTIFICATION_FNAME_LEN];
> >> + __u16 line;
> >> char desc[0];
> >> };
> >
> > As Darrick said, this is a UAPI breaker, so you shouldn't do this (you can,
> > however, merge this ahead a patch). Also, I would put the __u16 before the
> > char[].
> >
> > That said, I'm not sure whether it's useful to include the function name and
> > line. Both fields are liable to change over kernel commits, so it's not
> > something userspace can actually interpret. I think you're better off dumping
> > those into dmesg.
> >
> > Further, this reduces the capacity of desc[] significantly - I don't know if
> > that's a problem.
>
> Yes, that is a big problem as desc is already quite limited. I don't

How limited?

> think it is a problem for them to change between kernel versions, as the
> monitoring userspace can easily associate it with the running kernel.

How do you make that association? $majordistro's 4.18 kernel is not the
same as the upstream 4.18. Wouldn't you rather the notification message
be entirely self-describing rather than depending on some external
information about the sender?

> The alternative would be generating something like unique IDs for each
> error notification in the filesystem, no?
>
> > And yet further, there's no room for addition of new fields with the desc[]
> > buffer on the end. Now maybe you're planning on making use of desc[] for
> > text-encoding?
>
> Yes. I would like to be able to provide more details on the error,
> without having a unique id. For instance, desc would have the formatted
> string below, describing the warning:
>
> ext4_warning(inode->i_sb, "couldn't mark inode dirty (err %d)", err);

Depending on the upper limit on the length of messages, I wonder if you
could split the superblock notification and the description string into
separate messages (with maybe the error cookie to tie them together) so
that the struct isn't limited by having a VLA on the end, and the
description can be more or less an arbitrary string?

(That said I'm not familiar with the watch queue system so I have no
idea if chained messages even make sense here, or are already
implemented in some other way, or...)

Even better if you could find a way to send the string and the arguments
separately so that whatever's on the receiving end could run it through
a localization catalogue. Though I remember my roommates trying to
localize 2.0.35 into Latin 25 years ago and never getting very far. :)

--D

>
>
> >
> > David
> >
>
> --
> Gabriel Krisman Bertazi

2020-12-09 03:26:47

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

On Tue, Dec 08, 2020 at 04:29:32PM -0300, Gabriel Krisman Bertazi wrote:
> "Darrick J. Wong" <[email protected]> writes:
>
> > On Tue, Dec 08, 2020 at 09:58:25AM -0300, Gabriel Krisman Bertazi wrote:
> >> David Howells <[email protected]> writes:
> >>
> >> > Gabriel Krisman Bertazi <[email protected]> wrote:
> >> >
> >> >> @@ -130,6 +131,8 @@ struct superblock_error_notification {
> >
> > FWIW I wonder if this really should be inode_error_notification?
> >
> > If (for example) ext4 discovered an error in the blockgroup descriptor
> > and wanted to report it, the inode and block numbers would be
> > irrelevant, but the blockgroup number would be nice to have.
>
> A previous RFC had superblock_error_notification and
> superblock_inode_error_notification split, I think we can recover that.
>
> >
> >> >> __u32 error_cookie;
> >> >> __u64 inode;
> >> >> __u64 block;
> >> >> + char function[SB_NOTIFICATION_FNAME_LEN];
> >> >> + __u16 line;
> >> >> char desc[0];
> >> >> };
> >> >
> >> > As Darrick said, this is a UAPI breaker, so you shouldn't do this (you can,
> >> > however, merge this ahead a patch). Also, I would put the __u16 before the
> >> > char[].
> >> >
> >> > That said, I'm not sure whether it's useful to include the function name and
> >> > line. Both fields are liable to change over kernel commits, so it's not
> >> > something userspace can actually interpret. I think you're better off dumping
> >> > those into dmesg.
> >> >
> >> > Further, this reduces the capacity of desc[] significantly - I don't know if
> >> > that's a problem.
> >>
> >> Yes, that is a big problem as desc is already quite limited. I don't
> >
> > How limited?
>
> The largest notification is 128 bytes, the one with the biggest header
> is superblock_error_notification which leaves 56 bytes for description.
>
> >
> >> think it is a problem for them to change between kernel versions, as the
> >> monitoring userspace can easily associate it with the running kernel.
> >
> > How do you make that association? $majordistro's 4.18 kernel is not the
> > same as the upstream 4.18. Wouldn't you rather the notification message
> > be entirely self-describing rather than depending on some external
> > information about the sender?
>
> True. I was thinking on my use case where the customer controls their
> infrastructure and would specialize their userspace tools, but that is
> poor design on my part. A self describing mechanism would be better.
>
> >
> >> The alternative would be generating something like unique IDs for each
> >> error notification in the filesystem, no?
> >>
> >> > And yet further, there's no room for addition of new fields with the desc[]
> >> > buffer on the end. Now maybe you're planning on making use of desc[] for
> >> > text-encoding?
> >>
> >> Yes. I would like to be able to provide more details on the error,
> >> without having a unique id. For instance, desc would have the formatted
> >> string below, describing the warning:
> >>
> >> ext4_warning(inode->i_sb, "couldn't mark inode dirty (err %d)", err);
> >
> > Depending on the upper limit on the length of messages, I wonder if you
> > could split the superblock notification and the description string into
> > separate messages (with maybe the error cookie to tie them together) so
> > that the struct isn't limited by having a VLA on the end, and the
> > description can be more or less an arbitrary string?
> >
> > (That said I'm not familiar with the watch queue system so I have no
> > idea if chained messages even make sense here, or are already
> > implemented in some other way, or...)
>
> I don't see any support for chaining messages in the current watch_queue
> implementation, I'd need to extend the interface to support it. I
> considered this idea before, given the small description size, but I
> thought it would be over-complicated, even though much more future
> proof. I will look into that.
>
> What about the kernel exporting a per-filesystem table, as a build
> target or in /sys/fs/<fs>/errors, that has descriptions strings for each
> error? Then the notification can have only the FS type, index to the
> table and params. This won't exactly be self-describing as you wanted
> but, differently from function:line, it removes the need for the source
> code, and allows localization. The per-filesystem table would be
> stable ABI, of course.

Yikes. I don't think people are going to be ok with a message table
where we can never remove the strings. I bet GregKH won't like that
either (one value per sysfs file).

(Maybe I misread that and all you meant by stable ABI is the fact that
the table exists at a given path and the notification message gives you
a index into ... wherever we put it.)

--D

>
> --
> Gabriel Krisman Bertazi

2020-12-09 13:10:09

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

"Darrick J. Wong" <[email protected]> writes:

> On Tue, Dec 08, 2020 at 04:29:32PM -0300, Gabriel Krisman Bertazi wrote:
>> "Darrick J. Wong" <[email protected]> writes:
>>
>> > On Tue, Dec 08, 2020 at 09:58:25AM -0300, Gabriel Krisman Bertazi wrote:
>> >> David Howells <[email protected]> writes:
>> >>
>> >> > Gabriel Krisman Bertazi <[email protected]> wrote:
>> >> >
>> >> >> @@ -130,6 +131,8 @@ struct superblock_error_notification {
>> >
>> > FWIW I wonder if this really should be inode_error_notification?
>> >
>> > If (for example) ext4 discovered an error in the blockgroup descriptor
>> > and wanted to report it, the inode and block numbers would be
>> > irrelevant, but the blockgroup number would be nice to have.
>>
>> A previous RFC had superblock_error_notification and
>> superblock_inode_error_notification split, I think we can recover that.
>>
>> >
>> >> >> __u32 error_cookie;
>> >> >> __u64 inode;
>> >> >> __u64 block;
>> >> >> + char function[SB_NOTIFICATION_FNAME_LEN];
>> >> >> + __u16 line;
>> >> >> char desc[0];
>> >> >> };
>> >> >
>> >> > As Darrick said, this is a UAPI breaker, so you shouldn't do this (you can,
>> >> > however, merge this ahead a patch). Also, I would put the __u16 before the
>> >> > char[].
>> >> >
>> >> > That said, I'm not sure whether it's useful to include the function name and
>> >> > line. Both fields are liable to change over kernel commits, so it's not
>> >> > something userspace can actually interpret. I think you're better off dumping
>> >> > those into dmesg.
>> >> >
>> >> > Further, this reduces the capacity of desc[] significantly - I don't know if
>> >> > that's a problem.
>> >>
>> >> Yes, that is a big problem as desc is already quite limited. I don't
>> >
>> > How limited?
>>
>> The largest notification is 128 bytes, the one with the biggest header
>> is superblock_error_notification which leaves 56 bytes for description.
>>
>> >
>> >> think it is a problem for them to change between kernel versions, as the
>> >> monitoring userspace can easily associate it with the running kernel.
>> >
>> > How do you make that association? $majordistro's 4.18 kernel is not the
>> > same as the upstream 4.18. Wouldn't you rather the notification message
>> > be entirely self-describing rather than depending on some external
>> > information about the sender?
>>
>> True. I was thinking on my use case where the customer controls their
>> infrastructure and would specialize their userspace tools, but that is
>> poor design on my part. A self describing mechanism would be better.
>>
>> >
>> >> The alternative would be generating something like unique IDs for each
>> >> error notification in the filesystem, no?
>> >>
>> >> > And yet further, there's no room for addition of new fields with the desc[]
>> >> > buffer on the end. Now maybe you're planning on making use of desc[] for
>> >> > text-encoding?
>> >>
>> >> Yes. I would like to be able to provide more details on the error,
>> >> without having a unique id. For instance, desc would have the formatted
>> >> string below, describing the warning:
>> >>
>> >> ext4_warning(inode->i_sb, "couldn't mark inode dirty (err %d)", err);
>> >
>> > Depending on the upper limit on the length of messages, I wonder if you
>> > could split the superblock notification and the description string into
>> > separate messages (with maybe the error cookie to tie them together) so
>> > that the struct isn't limited by having a VLA on the end, and the
>> > description can be more or less an arbitrary string?
>> >
>> > (That said I'm not familiar with the watch queue system so I have no
>> > idea if chained messages even make sense here, or are already
>> > implemented in some other way, or...)
>>
>> I don't see any support for chaining messages in the current watch_queue
>> implementation, I'd need to extend the interface to support it. I
>> considered this idea before, given the small description size, but I
>> thought it would be over-complicated, even though much more future
>> proof. I will look into that.
>>
>> What about the kernel exporting a per-filesystem table, as a build
>> target or in /sys/fs/<fs>/errors, that has descriptions strings for each
>> error? Then the notification can have only the FS type, index to the
>> table and params. This won't exactly be self-describing as you wanted
>> but, differently from function:line, it removes the need for the source
>> code, and allows localization. The per-filesystem table would be
>> stable ABI, of course.
>
> Yikes. I don't think people are going to be ok with a message table
> where we can never remove the strings. I bet GregKH won't like that
> either (one value per sysfs file).

Indeed, sysfs seems out of question. In fact the string format doesn't
even need to be in the kernel, and we don't need the strings to be sent
as part of the notifications. What if we can have a bunch of
notification types, specific for each error message, and a library in
userspace that parses the notifications and understands the parameters
passed? The library then displays the data as they wish.

> (Maybe I misread that and all you meant by stable ABI is the fact that
> the table exists at a given path and the notification message gives you
> a index into ... wherever we put it.)

The kernel could even export the table as a build-time target, that
get's installed into X. But even that is not necessary if a library can
make sense of a notification that uniquely identifies each error and
only includes the useful debug parameters without any string formatting?

--
Gabriel Krisman Bertazi

2020-12-13 05:21:02

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 5/8] vfs: Include origin of the SB error notification

On Wed, Dec 09, 2020 at 10:06:07AM -0300, Gabriel Krisman Bertazi wrote:
> "Darrick J. Wong" <[email protected]> writes:
>
> > On Tue, Dec 08, 2020 at 04:29:32PM -0300, Gabriel Krisman Bertazi wrote:
> >> "Darrick J. Wong" <[email protected]> writes:
> >>
> >> > On Tue, Dec 08, 2020 at 09:58:25AM -0300, Gabriel Krisman Bertazi wrote:
> >> >> David Howells <[email protected]> writes:
> >> >>
> >> >> > Gabriel Krisman Bertazi <[email protected]> wrote:
> >> >> >
> >> >> >> @@ -130,6 +131,8 @@ struct superblock_error_notification {
> >> >
> >> > FWIW I wonder if this really should be inode_error_notification?
> >> >
> >> > If (for example) ext4 discovered an error in the blockgroup descriptor
> >> > and wanted to report it, the inode and block numbers would be
> >> > irrelevant, but the blockgroup number would be nice to have.
> >>
> >> A previous RFC had superblock_error_notification and
> >> superblock_inode_error_notification split, I think we can recover that.
> >>
> >> >
> >> >> >> __u32 error_cookie;
> >> >> >> __u64 inode;
> >> >> >> __u64 block;
> >> >> >> + char function[SB_NOTIFICATION_FNAME_LEN];
> >> >> >> + __u16 line;
> >> >> >> char desc[0];
> >> >> >> };
> >> >> >
> >> >> > As Darrick said, this is a UAPI breaker, so you shouldn't do this (you can,
> >> >> > however, merge this ahead a patch). Also, I would put the __u16 before the
> >> >> > char[].
> >> >> >
> >> >> > That said, I'm not sure whether it's useful to include the function name and
> >> >> > line. Both fields are liable to change over kernel commits, so it's not
> >> >> > something userspace can actually interpret. I think you're better off dumping
> >> >> > those into dmesg.
> >> >> >
> >> >> > Further, this reduces the capacity of desc[] significantly - I don't know if
> >> >> > that's a problem.
> >> >>
> >> >> Yes, that is a big problem as desc is already quite limited. I don't
> >> >
> >> > How limited?
> >>
> >> The largest notification is 128 bytes, the one with the biggest header
> >> is superblock_error_notification which leaves 56 bytes for description.
> >>
> >> >
> >> >> think it is a problem for them to change between kernel versions, as the
> >> >> monitoring userspace can easily associate it with the running kernel.
> >> >
> >> > How do you make that association? $majordistro's 4.18 kernel is not the
> >> > same as the upstream 4.18. Wouldn't you rather the notification message
> >> > be entirely self-describing rather than depending on some external
> >> > information about the sender?
> >>
> >> True. I was thinking on my use case where the customer controls their
> >> infrastructure and would specialize their userspace tools, but that is
> >> poor design on my part. A self describing mechanism would be better.
> >>
> >> >
> >> >> The alternative would be generating something like unique IDs for each
> >> >> error notification in the filesystem, no?
> >> >>
> >> >> > And yet further, there's no room for addition of new fields with the desc[]
> >> >> > buffer on the end. Now maybe you're planning on making use of desc[] for
> >> >> > text-encoding?
> >> >>
> >> >> Yes. I would like to be able to provide more details on the error,
> >> >> without having a unique id. For instance, desc would have the formatted
> >> >> string below, describing the warning:
> >> >>
> >> >> ext4_warning(inode->i_sb, "couldn't mark inode dirty (err %d)", err);
> >> >
> >> > Depending on the upper limit on the length of messages, I wonder if you
> >> > could split the superblock notification and the description string into
> >> > separate messages (with maybe the error cookie to tie them together) so
> >> > that the struct isn't limited by having a VLA on the end, and the
> >> > description can be more or less an arbitrary string?
> >> >
> >> > (That said I'm not familiar with the watch queue system so I have no
> >> > idea if chained messages even make sense here, or are already
> >> > implemented in some other way, or...)
> >>
> >> I don't see any support for chaining messages in the current watch_queue
> >> implementation, I'd need to extend the interface to support it. I
> >> considered this idea before, given the small description size, but I
> >> thought it would be over-complicated, even though much more future
> >> proof. I will look into that.
> >>
> >> What about the kernel exporting a per-filesystem table, as a build
> >> target or in /sys/fs/<fs>/errors, that has descriptions strings for each
> >> error? Then the notification can have only the FS type, index to the
> >> table and params. This won't exactly be self-describing as you wanted
> >> but, differently from function:line, it removes the need for the source
> >> code, and allows localization. The per-filesystem table would be
> >> stable ABI, of course.
> >
> > Yikes. I don't think people are going to be ok with a message table
> > where we can never remove the strings. I bet GregKH won't like that
> > either (one value per sysfs file).
>
> Indeed, sysfs seems out of question. In fact the string format doesn't
> even need to be in the kernel, and we don't need the strings to be sent
> as part of the notifications. What if we can have a bunch of
> notification types, specific for each error message, and a library in
> userspace that parses the notifications and understands the parameters
> passed? The library then displays the data as they wish.

Er... I don't think we (XFS) really are going to maintain a userspace
library to decode kernel messages.

> > (Maybe I misread that and all you meant by stable ABI is the fact that
> > the table exists at a given path and the notification message gives you
> > a index into ... wherever we put it.)
>
> The kernel could even export the table as a build-time target, that
> get's installed into X. But even that is not necessary if a library can
> make sense of a notification that uniquely identifies each error and
> only includes the useful debug parameters without any string formatting?

/me shrugs and thinks that chaining fs notifications via cookie might be
a better way to do this, since then you could push out a stream of
notices about a filesystem event:

(0) generic vfs error telling you something happened at a inode/offset

(1) fs-specific notice giving more details about what that fs thinks is
wrong

(2) formatted message string so that you can send it to the sysadmin or
feed it to google translate or whatever :)

--D

> --
> Gabriel Krisman Bertazi