2020-08-24 15:38:15

by David Howells

Subject: [PATCH 1/2] Add a manpage for watch_queue(7)

Add a manual page for the notifications/watch_queue facility.

Signed-off-by: David Howells <[email protected]>
---

man7/watch_queue.7 | 304 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 304 insertions(+)
create mode 100644 man7/watch_queue.7

diff --git a/man7/watch_queue.7 b/man7/watch_queue.7
new file mode 100644
index 000000000..14c202cef
--- /dev/null
+++ b/man7/watch_queue.7
@@ -0,0 +1,304 @@
+.\"
+.\" Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+.\" Written by David Howells ([email protected])
+.\"
+.\" This program is free software; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public Licence
+.\" as published by the Free Software Foundation; either version
+.\" 2 of the Licence, or (at your option) any later version.
+.\"
+.TH WATCH_QUEUE 7 "2020-08-07" Linux "General Kernel Notifications"
+.SH NAME
+General kernel notification queue
+.SH SYNOPSIS
+#include <linux/watch_queue.h>
+.EX
+
+pipe2(fds, O_NOTIFICATION_PIPE);
+ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, max_message_count);
+ioctl(fds[0], IOC_WATCH_QUEUE_SET_FILTER, &filter);
+keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[0], message_tag);
+for (;;) {
+ buf_len = read(fds[0], buffer, sizeof(buffer));
+ ...
+}
+.EE
+.SH OVERVIEW
+.PP
+The general kernel notification queue is a general purpose transport for kernel
+notification messages to userspace. Notification messages are marked with type
+information so that events from multiple sources can be distinguished.
+Messages are also of variable length to accommodate different information for
+each type.
+.PP
+Queues are implemented on top of standard pipes and multiple independent queues
+can be created. After a pipe has been created, its size and filtering can be
+configured and event sources attached. The pipe can then be read or polled to
+wait for messages.
+.PP
+Multiple messages may be read out of the queue at a time if the buffer is large
+enough, but messages will not get split amongst multiple reads. If the buffer
+isn't large enough for a message,
+.B ENOBUFS
+will be returned.
+.PP
+In the case of message loss,
+.BR read (2)
+will fabricate a loss message and pass that to userspace immediately after the
+point at which the loss occurred. A single loss message is generated, even if
+multiple messages get lost at the same point.
+.PP
+A notification pipe allocates a certain amount of locked kernel memory (so that
+the kernel can write a notification into it from contexts where allocation is
+restricted), and so is subject to pipe resource limit restrictions - see
+.BR pipe (7),
+in the section on
+.BR "/proc files" .
+.PP
+Sources must be attached to a queue manually; there's no single global event
+source, but rather a variety of sources, each of which can be attached to by
+multiple queues. Attachments can be set up by:
+.TP
+.BR keyctl_watch_key (3)
+Monitor a key or keyring for changes.
+.PP
+Because a source can produce a lot of different events, not all of which may
+be of interest to the watcher, a single set of filters can be set on a queue
+to determine whether a particular event will get inserted in a queue at the
+point of posting inside the kernel.
+.SH MESSAGE STRUCTURE
+.PP
+The output from reading the pipe is divided into variable length messages.
+.BR read (2)
+will never split a message across two separate read calls. Each message
+begins with a header of the form:
+.PP
+.in +4n
+.EX
+struct watch_notification {
+ __u32 type:24;
+ __u32 subtype:8;
+ __u32 info;
+};
+.EE
+.in
+.PP
+Where
+.I type
+indicates the general class of notification,
+.I subtype
+indicates the specific type of notification within that class and
+.I info
+includes the message length (in bytes), the watcher's ID and some type-specific
+information.
+.PP
+A special message type,
+.BR WATCH_TYPE_META ,
+exists to convey information about the notification facility itself. It has
+the following subtypes:
+.TP
+.B WATCH_META_LOSS_NOTIFICATION
+This indicates one or more messages were lost, probably due to a buffer
+overrun.
+.TP
+.B WATCH_META_REMOVAL_NOTIFICATION
+This indicates that a notification source went away whilst it was being watched.
+This comes in two lengths: a short variant that carries just the header and a
+long variant that includes a 64-bit identifier as well that identifies the
+source more precisely (which variant is used and how the identifier should be
+interpreted is source dependent).
+.PP
+.I info
+includes the following fields:
+.TP
+.B WATCH_INFO_LENGTH
+Bits 0-6 indicate the size of the message in bytes, and can be between 8 and
+127.
+.TP
+.B WATCH_INFO_ID
+Bits 8-15 indicate the tag given to the source binding call. This is a number
+between 0 and 255 and is purely a source index for userspace's use and isn't
+interpreted by the kernel.
+.TP
+.B WATCH_INFO_TYPE_INFO
+Bits 16-31 indicate subtype-dependent information.
+.SH IOCTL COMMANDS
+Pipes opened with
+.B O_NOTIFICATION_PIPE
+have the following
+.BR ioctl (2)
+commands available:
+.TP
+.B IOC_WATCH_QUEUE_SET_SIZE
+The ioctl argument indicates the maximum number of messages that can be
+inserted into the pipe. This must be a power of two. This command also
+pre-allocates memory to hold messages.
+.IP
+This may only be done once and the queue cannot be used until this command has
+been done.
+.TP
+.B IOC_WATCH_QUEUE_SET_FILTER
+This is used to set filters on the notifications that get written into the
+buffer. See the section on filtering for details.
+.SH FILTERING
+.PP
+The
+.B IOC_WATCH_QUEUE_SET_FILTER
+ioctl argument points to a structure of the following form:
+.PP
+.in +4n
+.EX
+struct watch_notification_filter {
+ __u32 nr_filters;
+ __u32 __reserved;
+ struct watch_notification_type_filter filters[];
+};
+.EE
+.in
+.PP
+Where
+.I nr_filters
+indicates the number of elements in the
+.IR filters []
+array, and
+.I __reserved
+should be 0. Each element in the filters array specifies a filter and is of
+the following form:
+.PP
+.in +4n
+.EX
+struct watch_notification_type_filter {
+ __u32 type;
+ __u32 info_filter;
+ __u32 info_mask;
+ __u32 subtype_filter[8];
+};
+.EE
+.in
+.PP
+Where
+.I type
+refers to the type field in a notification record header;
+.IR info_filter " and " info_mask
+refer to the info field; and
+.I subtype_filter
+is a bit-mask of permitted subtypes.
+.PP
+A notification matches a filter if all of the following are true:
+.in +4n
+.PP
+(*) The type on the notification matches that on the filter.
+.PP
+(*) The bit in subtype_filter that matches the notification subtype is set.
+Each element in subtype_filter[] covers 32 subtypes, with, for example,
+element 0 matching subtypes 0-31. This can be summarised as:
+.PP
+.in +4n
+.EX
+F->subtype_filter[N->subtype / 32] & (1U << (N->subtype % 32))
+.EE
+.in
+.PP
+(*) The notification info, masked off, matches the filter info, e.g.:
+.PP
+.in +4n
+.EX
+(N->info & F->info_mask) == F->info_filter
+.EE
+.in
+.PP
+If no filters are set, all notifications are allowed by default and if one or
+more filters are set, notifications are disallowed by default.
+WATCH_TYPE_META cannot, however, be filtered.
+.SH VERSIONS
+The notification queue driver first appeared in v5.8 of the Linux kernel.
+.SH EXAMPLE
+To use the notification mechanism, first of all the pipe has to be opened and
+the size must be set:
+.PP
+.in +4n
+.EX
+int fds[2];
+pipe2(fds, O_NOTIFICATION_PIPE);
+int wfd = fds[0];
+
+ioctl(wfd, IOC_WATCH_QUEUE_SET_SIZE, 16);
+.EE
+.in
+.PP
+From this point, the queue is open for business. Filters can be set to
+restrict the notifications that get inserted into the queue from the sources
+that are being watched. For example:
+.PP
+.in +4n
+.EX
+static struct watch_notification_filter filter = {
+ .nr_filters = 1,
+ .__reserved = 0,
+ .filters = {
+ [0] = {
+ .type = WATCH_TYPE_KEY_NOTIFY,
+ .subtype_filter[0] = 1 << NOTIFY_KEY_LINKED,
+ .info_filter = 1 << WATCH_INFO_FLAG_2,
+ .info_mask = 1 << WATCH_INFO_FLAG_2,
+ },
+ },
+};
+
+ioctl(wfd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
+.EE
+.in
+.PP
+will only allow key-change notifications that indicate a key is linked into a
+keyring and then only if type-specific flag WATCH_INFO_FLAG_2 is set on the
+notification.
+.PP
+Sources can then be watched, for example:
+.PP
+.in +4n
+.EX
+keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, wfd, 0x33);
+.EE
+.in
+.PP
+This places a watch on the process's session keyring, directing the
+notifications to the buffer we just created and specifying that they should be
+tagged with 0x33 in the info ID field.
+.PP
+When it is determined that there is something in the buffer, messages can be
+read out of the ring with something like the following:
+.PP
+.in +4n
+.EX
+for (;;) {
+ unsigned char buf[WATCH_INFO_LENGTH];
+ read(wfd, buf, sizeof(buf));
+ struct watch_notification *n = (struct watch_notification *)buf;
+ switch (n->type) {
+ case WATCH_TYPE_META:
+ switch (n->subtype) {
+ case WATCH_META_REMOVAL_NOTIFICATION:
+ saw_removal_notification(n);
+ break;
+ case WATCH_META_LOSS_NOTIFICATION:
+ printf("-- LOSS --\n");
+ break;
+ }
+ break;
+ case WATCH_TYPE_KEY_NOTIFY:
+ saw_key_change(n);
+ break;
+ }
+}
+.EE
+.in
+.PP
+
+.SH SEE ALSO
+.ad l
+.nh
+.BR keyctl (1),
+.BR ioctl (2),
+.BR pipe2 (2),
+.BR read (2),
+.BR keyctl_watch_key (3)
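
The bit layout that the MESSAGE STRUCTURE section gives for the info field
(length in bits 0-6, watch tag in bits 8-15, subtype-dependent data in bits
16-31) can be sketched in C as below. This is only an illustration: the EX_*
masks and the ex_* names are hypothetical stand-ins written out from the bit
ranges in the page, not the definitions from <linux/watch_queue.h>.

```c
#include <stdint.h>

/* Hypothetical masks written out from the bit ranges in the page. */
#define EX_INFO_LENGTH    0x0000007fu  /* bits 0-6: message size in bytes */
#define EX_INFO_ID        0x0000ff00u  /* bits 8-15: tag given when watching */
#define EX_INFO_TYPE_INFO 0xffff0000u  /* bits 16-31: subtype-dependent */

/* Decoded view of one info word (illustrative type, not a kernel one). */
struct ex_info {
    unsigned int len;        /* message length in bytes */
    unsigned int id;         /* userspace tag, 0-255 */
    unsigned int type_info;  /* subtype-dependent bits */
};

/* Split an info word into its three fields. */
struct ex_info ex_decode_info(uint32_t info)
{
    struct ex_info d;

    d.len = info & EX_INFO_LENGTH;
    d.id = (info & EX_INFO_ID) >> 8;
    d.type_info = (info & EX_INFO_TYPE_INFO) >> 16;
    return d;
}
```

With the tag 0x33 used in the EXAMPLE section, an info word of 0x00023310
would decode to a 16-byte message, tag 0x33 and type info 0x0002.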

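The three matching conditions listed in the FILTERING section can be read as a
single predicate. The sketch below mirrors them in userspace C purely to make
the rule concrete. The structures are simplified, hypothetical stand-ins (the
real header packs type and subtype into bitfields), and the special case that
WATCH_TYPE_META cannot be filtered is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a notification header (hypothetical). */
struct ex_note {
    uint32_t type;
    uint32_t subtype;  /* 0-255 */
    uint32_t info;
};

/* Same fields as the per-type filter shown in the page. */
struct ex_type_filter {
    uint32_t type;
    uint32_t info_filter;
    uint32_t info_mask;
    uint32_t subtype_filter[8];  /* 8 * 32 = 256 subtype bits */
};

/* True if notification n satisfies all three conditions of filter f. */
bool ex_filter_matches(const struct ex_type_filter *f, const struct ex_note *n)
{
    if (n->subtype > 255)
        return false;  /* outside the 8-bit subtype range */
    if (f->type != n->type)
        return false;
    if (!(f->subtype_filter[n->subtype / 32] & (1u << (n->subtype % 32))))
        return false;
    return (n->info & f->info_mask) == f->info_filter;
}

/* No filters: allow everything. Otherwise: allow only if some filter matches. */
bool ex_queue_accepts(const struct ex_type_filter *filters, size_t nr,
                      const struct ex_note *n)
{
    size_t i;

    if (nr == 0)
        return true;
    for (i = 0; i < nr; i++)
        if (ex_filter_matches(&filters[i], n))
            return true;
    return false;
}
```

The second function reflects the stated default: with no filters everything is
allowed, and with one or more filters a notification is dropped unless a
filter matches it.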

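The OVERVIEW notes that one read(2) may return several messages, while the loop
in the EXAMPLE section only looks at the first one. A minimal sketch of walking
a filled buffer follows. It assumes that messages are packed back to back, that
the info word sits at byte offset 4 of each header in host byte order, and that
the length is in its low 7 bits. The function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_LEN_MASK 0x7fu  /* bits 0-6 of info: message length */
#define EX_HDR_SIZE 8u     /* two 32-bit words: type/subtype, then info */

/*
 * Walk the bytes returned by one read() and count the messages in them.
 * Returns the message count, or -1 if a header is truncated or a length
 * is smaller than the header or runs past the end of the data.
 */
int ex_count_messages(const unsigned char *buf, size_t filled)
{
    size_t off = 0;
    int count = 0;

    while (off < filled) {
        uint32_t info;
        size_t len;

        if (filled - off < EX_HDR_SIZE)
            return -1;
        /* memcpy avoids unaligned access to the info word. */
        memcpy(&info, buf + off + 4, sizeof(info));
        len = info & EX_LEN_MASK;
        if (len < EX_HDR_SIZE || len > filled - off)
            return -1;
        /* A real program would dispatch on buf + off here. */
        count++;
        off += len;
    }
    return count;
}
```

A real reader would pass the return value of read(2) as the filled argument and
dispatch on each message's type and subtype where the comment indicates.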

Subject: Re: [PATCH 1/2] Add a manpage for watch_queue(7)

Hi David,

It would be helpful if you labelled the subject "[PATCH v2 1/2]"
so that one could quickly see which of the patch series in my
inbox is the newest. (I nearly replied to the wrong draft.)

On 8/24/20 5:30 PM, David Howells wrote:
> Add a manual page for the notifications/watch_queue facility.
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> man7/watch_queue.7 | 304 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 304 insertions(+)
> create mode 100644 man7/watch_queue.7
>
> diff --git a/man7/watch_queue.7 b/man7/watch_queue.7
> new file mode 100644
> index 000000000..14c202cef
> --- /dev/null
> +++ b/man7/watch_queue.7
> @@ -0,0 +1,304 @@
> +.\"
> +.\" Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
> +.\" Written by David Howells ([email protected])
> +.\"
> +.\" This program is free software; you can redistribute it and/or
> +.\" modify it under the terms of the GNU General Public Licence
> +.\" as published by the Free Software Foundation; either version
> +.\" 2 of the Licence, or (at your option) any later version.

FWIW, and of course the decision is yours, the GPL is a license
for code, not for documentation. Please consider using the "VERBATIM"
license instead (as you used in the mount manual pages).

> +.\"
> +.TH WATCH_QUEUE 7 "2020-08-07" Linux "General Kernel Notifications"
> +.SH NAME
> +General kernel notification queue
> +.SH SYNOPSIS
> +#include <linux/watch_queue.h>
> +.EX
> +
> +pipe2(fds, O_NOTIFICATION_PIPE);
> +ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, max_message_count);
> +ioctl(fds[0], IOC_WATCH_QUEUE_SET_FILTER, &filter);
> +keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[0], message_tag);
> +for (;;) {
> + buf_len = read(fds[0], buffer, sizeof(buffer));
> + ...
> +}

Please no tabs in man-pages source (they can render strangely), and
the standard in man-pages is 4-space indents. (This needs fixing
at multiple places below.)


> +.EE
> +.SH OVERVIEW
> +.PP

Preceding line unneeded, and mandoc lint will complain about such things.

Okay -- BOOM -- you jump right in here, in the following lines. Can we start
off with a paragraph or two that gently introduces this feature to
the reader who currently knows next to nothing about the topic (i.e., me)?
Begin by telling the reader what problem is being solved here. And then
tell them that this page is describing the solution.

> +The general kernel notification queue is a general purpose transport for kernel

Why "general" twice in the preceding line?

Also, here you talk about "The... queue", as though there is only one of
them, globally. But below, you talk about "queues", plural. I think
some rewording is needed.

> +notification messages to userspace. Notification messages are marked with type
> +information so that events from multiple sources can be distinguished.

I think it might be helpful to mention a few examples of what things can
be "sources".

> +Messages are also of variable length to accommodate different information for
> +each type.
> +.PP
> +Queues are implemented on top of standard pipes and multiple independent queues
> +can be created. After a pipe has been created, its size and filtering can be
> +configured and event sources attached. The pipe can then be read or polled to
> +wait for messages.
> +.PP
> +Multiple messages may be read out of the queue at a time if the buffer is large
> +enough, but messages will not get split amongst multiple reads. If the buffer
> +isn't large enough for a message,

s/enough/enough for at least the next message/ ?

> +.B ENOBUFS
> +will be returned.
> +.PP
> +In the case of message loss,

Please, at or before this point, add an explanation of why loss could occur.
(I see that you say something about this later, but at this point in the page,
"loss" pops out of nowhere.)

> +.BR read (2)
> +will fabricate a loss message and pass that to userspace immediately after the
> +point at which the loss occurred. A single loss message is generated, even if
> +multiple messages get lost at the same point.
> +.PP
> +A notification pipe allocates a certain amount of locked kernel memory (so that
> +the kernel can write a notification into it from contexts where allocation is
> +restricted), and so is subject to pipe resource limit restrictions - see
> +.BR pipe (7),
> +in the section on
> +.BR "/proc files" .
> +.PP
> +Sources must be attached to a queue manually; there's no single global event
> +source, but rather a variety of sources, each of which can be attached to by
> +multiple queues. Attachments can be set up by:
> +.TP
> +.BR keyctl_watch_key (3)
> +Monitor a key or keyring for changes.
> +.PP
> +Because a source can produce a lot of different events, not all of which may
> +be of interest to the watcher, a single set of filters can be set on a queue
> +to determine whether a particular event will get inserted in a queue at the
> +point of posting inside the kernel.
> +.SH MESSAGE STRUCTURE
> +.PP
> +The output from reading the pipe is divided into variable length messages.
> +.BR read (2)
> +will never split a message across two separate read calls. Each message
> +begins with a header of the form:
> +.PP
> +.in +4n
> +.EX
> +struct watch_notification {
> + __u32 type:24;
> + __u32 subtype:8;
> + __u32 info;
> +};
> +.EE
> +.in
> +.PP
> +Where
> +.I type
> +indicates the general class of notification,
> +.I subtype
> +indicates the specific type of notification within that class and
> +.I info
> +includes the message length (in bytes), the watcher's ID and some type-specific
> +information.
> +.PP
> +A special message type,
> +.BR WATCH_TYPE_META ,
> +exists to convey information about the notification facility itself. It has
> +the following subtypes:
> +.TP
> +.B WATCH_META_LOSS_NOTIFICATION
> +This indicates one or more messages were lost, probably due to a buffer
> +overrun.
> +.TP
> +.B WATCH_META_REMOVAL_NOTIFICATION
> +This indicates that a notification source went away whilst it was being watched.
> +This comes in two lengths: a short variant that carries just the header and a
> +long variant that includes a 64-bit identifier as well that identifies the
> +source more precisely (which variant is used and how the identifier should be
> +interpreted is source dependent).
> +.PP
> +.I info
> +includes the following fields:
> +.TP
> +.B WATCH_INFO_LENGTH
> +Bits 0-6 indicate the size of the message in bytes, and can be between 8 and
> +127.
> +.TP
> +.B WATCH_INFO_ID
> +Bits 8-15 indicate the tag given to the source binding call. This is a number
> +between 0 and 255 and is purely a source index for userspace's use and isn't
> +interpreted by the kernel.
> +.TP
> +.B WATCH_INFO_TYPE_INFO
> +Bits 16-31 indicate subtype-dependent information.
> +.SH IOCTL COMMANDS
> +Pipes opened with
> +.B O_NOTIFICATION_PIPE
> +have the following
> +.BR ioctl (2)
> +commands available:
> +.TP
> +.B IOC_WATCH_QUEUE_SET_SIZE
> +The ioctl argument indicates the maximum number of messages that can be
> +inserted into the pipe. This must be a power of two. This command also
> +pre-allocates memory to hold messages.
> +.IP
> +This may only be done once and the queue cannot be used until this command has
> +been done.
> +.TP
> +.B IOC_WATCH_QUEUE_SET_FILTER
> +This is used to set filters on the notifications that get written into the
> +buffer. See the section on filtering for details.
> +.SH FILTERING
> +.PP
> +The
> +.B IOC_WATCH_QUEUE_SET_FILTER
> +ioctl argument points to a structure of the following form:
> +.PP
> +.in +4n
> +.EX
> +struct watch_notification_filter {
> + __u32 nr_filters;
> + __u32 __reserved;
> + struct watch_notification_type_filter filters[];
> +};
> +.EE
> +.in
> +.PP
> +Where
> +.I nr_filters
> +indicates the number of elements in the
> +.IR filters []
> +array, and
> +.I __reserved
> +should be 0. Each element in the filters array specifies a filter and is of
> +the following form:
> +.PP
> +.in +4n
> +.EX
> +struct watch_notification_type_filter {
> + __u32 type;
> + __u32 info_filter;
> + __u32 info_mask;
> + __u32 subtype_filter[8];
> +};
> +.EE
> +.in
> +.PP
> +Where
> +.I type
> +refers to the type field in a notification record header;
> +.IR info_filter " and " info_mask
> +refer to the info field; and
> +.I subtype_filter
> +is a bit-mask of permitted subtypes.
> +.PP
> +A notification matches a filter if all of the following are true:
> +.in +4n
> +.PP
> +(*) The type on the notification matches that on the filter.
> +.PP
> +(*) The bit in subtype_filter that matches the notification subtype is set.
> +Each element in subtype_filter[] covers 32 subtypes, with, for example,
> +element 0 matching subtypes 0-31. This can be summarised as:
> +.PP
> +.in +4n
> +.EX
> +F->subtype_filter[N->subtype / 32] & (1U << (N->subtype % 32))
> +.EE
> +.in
> +.PP
> +(*) The notification info, masked off, matches the filter info, e.g.:
> +.PP
> +.in +4n
> +.EX
> +(N->info & F->info_mask) == F->info_filter
> +.EE
> +.in
> +.PP
> +If no filters are set, all notifications are allowed by default and if one or
> +more filters are set, notifications are disallowed by default.
> +WATCH_TYPE_META cannot, however, be filtered.
> +.SH VERSIONS
> +The notification queue driver first appeared in v5.8 of the Linux kernel.
> +.SH EXAMPLE
> +To use the notification mechanism, first of all the pipe has to be opened and
> +the size must be set:
> +.PP
> +.in +4n
> +.EX
> +int fds[2];
> +pipe2(fds, O_NOTIFICATION_PIPE);
> +int wfd = fds[0];
> +
> +ioctl(wfd, IOC_WATCH_QUEUE_SET_SIZE, 16);
> +.EE
> +.in
> +.PP
> +From this point, the queue is open for business. Filters can be set to
> +restrict the notifications that get inserted into the queue from the sources
> +that are being watched. For example:
> +.PP
> +.in +4n
> +.EX
> +static struct watch_notification_filter filter = {
> + .nr_filters = 1,
> + .__reserved = 0,
> + .filters = {
> + [0] = {
> + .type = WATCH_TYPE_KEY_NOTIFY,
> + .subtype_filter[0] = 1 << NOTIFY_KEY_LINKED,
> + .info_filter = 1 << WATCH_INFO_FLAG_2,
> + .info_mask = 1 << WATCH_INFO_FLAG_2,
> + },
> + },
> +};

> +ioctl(wfd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> +.EE
> +.in
> +.PP
> +will only allow key-change notifications that indicate a key is linked into a
> +keyring and then only if type-specific flag WATCH_INFO_FLAG_2 is set on the
> +notification.
> +.PP
> +Sources can then be watched, for example:
> +.PP
> +.in +4n
> +.EX
> +keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, wfd, 0x33);
> +.EE
> +.in
> +.PP
> +This places a watch on the process's session keyring, directing the
> +notifications to the buffer we just created and specifying that they should be
> +tagged with 0x33 in the info ID field.
> +.PP
> +When it is determined that there is something in the buffer, messages can be
> +read out of the ring with something like the following:
> +.PP
> +.in +4n
> +.EX
> +for (;;) {
> + unsigned char buf[WATCH_INFO_LENGTH];
> + read(wfd, buf, sizeof(buf));
> + struct watch_notification *n = (struct watch_notification *)buf;
> + switch (n->type) {
> + case WATCH_TYPE_META:
> + switch (n->subtype) {
> + case WATCH_META_REMOVAL_NOTIFICATION:
> + saw_removal_notification(n);
> + break;
> + case WATCH_META_LOSS_NOTIFICATION:
> + printf("-- LOSS --\n");
> + break;
> + }
> + break;
> + case WATCH_TYPE_KEY_NOTIFY:
> + saw_key_change(n);
> + break;
> + }
> +}

> +.EE
> +.in
> +.PP
> +

Remove preceding two lines.

> +.SH SEE ALSO
> +.ad l
> +.nh
> +.BR keyctl (1),
> +.BR ioctl (2),
> +.BR pipe2 (2),
> +.BR read (2),
> +.BR keyctl_watch_key (3)

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/