Date: Tue, 02 Jun 2020 16:51:44 +0100
Hi Linus,
Can you pull this, please? It adds a general notification queue concept
and adds an event source for keys/keyrings, such as linking and unlinking
keys and changing their attributes.
Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:
https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47
Without this, g-o-a has to constantly poll a keyring-based kerberos cache
to find out if kinit has changed anything.
[[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and
Ian Kent have been working on making libmount use them, preparatory to
working on systemd:
https://github.com/karelzak/util-linux/commits/topic/fsinfo
https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
Development has stalled briefly due to other commitments, so I'm not
sure I can ask you to pull those parts of the series for now. Christian
Brauner would like to use them in lxc, but hasn't started.
]]
LSM hooks are included:
(1) A set of hooks are provided that allow an LSM to rule on whether or
not a watch may be set. Each of these hooks takes a different
"watched object" parameter, so they're not really shareable. The LSM
should use current's credentials. [Wanted by SELinux & Smack]
(2) A hook is provided to allow an LSM to rule on whether or not a
particular message may be posted to a particular queue. This is given
the credentials from the event generator (which may be the system) and
the watch setter. [Wanted by Smack]
I've provided SELinux and Smack with implementations of some of these hooks.
WHY
===
Key/keyring notifications are desirable because if you have your kerberos
tickets in a file/directory, your Gnome desktop will monitor that using
something like fanotify and tell you if your credentials cache changes.
However, we also have the ability to cache your kerberos tickets in the
session, user or persistent keyring so that it isn't left around on disk
across a reboot or logout. Keyrings, however, cannot currently be
monitored asynchronously, so the desktop has to poll for it - not so good
on a laptop. This facility will allow the desktop to avoid the need to
poll.
DESIGN DECISIONS
================
(1) The notification queue is built on top of a standard pipe. Messages
are effectively spliced in. The pipe is opened with a special flag:
pipe2(fds, O_NOTIFICATION_PIPE);
The special flag has the same value as O_EXCL (which doesn't seem like
it will ever be applicable in this context)[?]. It is given up front
to make it a lot easier to prohibit splice and co. from accessing the
pipe.
[?] Should this be done some other way? I'd rather not use up a new
O_* flag if I can avoid it - should I add a pipe3() system call
instead?
The pipe is then configured::
ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);
Messages are then read out of the pipe using read().
(2) It should be possible to allow write() to insert data into the
notification pipes too, but this is currently disabled as the kernel
has to be able to insert messages into the pipe *without* holding
pipe->mutex and the code to make this work needs careful auditing.
(3) sendfile(), splice() and vmsplice() are disabled on notification pipes
because of the pipe->mutex issue and also because they sometimes want
to revert what they just did - but one or more notification messages
might've been interleaved in the ring.
(4) The kernel inserts messages with the wait queue spinlock held. This
means that pipe_read() and pipe_write() have to take the spinlock to
update the queue pointers.
(5) Records in the buffer are binary, typed and have a length so that they
can be of varying size.
This allows multiple heterogeneous sources to share a common buffer;
there are 16 million types available, of which I've used just a few,
so there is scope for others to be used. Tags may be specified when a
watchpoint is created to help distinguish the sources.
(6) Records are filterable as types have up to 256 subtypes that can be
individually filtered. Other filtration is also available.
(7) Notification pipes don't interfere with each other; each may be bound
to a different set of watches. Any particular notification will be
copied to all the queues that are currently watching for it - and only
those that are watching for it.
(8) When recording a notification, the kernel will not sleep, but will
rather mark a queue as having lost a message if there's insufficient
space. read() will fabricate a loss notification message at an
appropriate point later.
(9) The notification pipe is created and then watchpoints are attached to
it, using one of:
keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);
where in both cases, fd indicates the queue and the number after is a
tag between 0 and 255.
(10) Watches are removed if either the notification pipe is destroyed or
the watched object is destroyed. In the latter case, a message will
be generated indicating the enforced watch removal.
Things I want to avoid:
(1) Introducing features that make the core VFS dependent on the network
stack or networking namespaces (ie. usage of netlink).
(2) Dumping all this stuff into dmesg and having a daemon that sits there
parsing the output and distributing it as this then puts the
responsibility for security into userspace and makes handling
namespaces tricky. Further, dmesg might not exist or might be
inaccessible inside a container.
(3) Letting users see events they shouldn't be able to see.
TESTING AND MANPAGES
====================
(*) The keyutils tree has a pipe-watch branch that has keyctl commands for
making use of notifications. Proposed manual pages can also be found
on this branch, though a couple of them really need to go to the main
manpages repository instead.
If the kernel supports the watching of keys, then running "make test"
on that branch will cause the testing infrastructure to spawn a
monitoring process on the side that monitors a notifications pipe for
all the key/keyring changes induced by the tests and they'll all be
checked off to make sure they happened.
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch
(*) A test program is provided (samples/watch_queue/watch_test) that can
be used to monitor for keyrings, mount and superblock events.
Information on the notifications is simply logged to stdout.
Thanks,
David
---
The following changes since commit b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce:
Linux 5.7-rc6 (2020-05-17 16:48:37 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601
for you to fetch changes up to a8478a602913dc89a7cd2060e613edecd07e1dbd:
smack: Implement the watch_key and post_notification hooks (2020-05-19 15:47:38 +0100)
----------------------------------------------------------------
Notifications over pipes + Keyring notifications
----------------------------------------------------------------
David Howells (12):
uapi: General notification queue definitions
security: Add a hook for the point of notification insertion
pipe: Add O_NOTIFICATION_PIPE
pipe: Add general notification queue support
security: Add hooks to rule on setting a watch
watch_queue: Add a key/keyring notification facility
Add sample notification program
pipe: Allow buffers to be marked read-whole-or-error for notifications
pipe: Add notification lossage handling
keys: Make the KEY_NEED_* perms an enum rather than a mask
selinux: Implement the watch_key security hook
smack: Implement the watch_key and post_notification hooks
Documentation/security/keys/core.rst | 57 ++
Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
Documentation/watch_queue.rst | 339 +++++++++++
fs/pipe.c | 242 +++++---
fs/splice.c | 12 +-
include/linux/key.h | 33 +-
include/linux/lsm_audit.h | 1 +
include/linux/lsm_hook_defs.h | 9 +
include/linux/lsm_hooks.h | 14 +
include/linux/pipe_fs_i.h | 27 +-
include/linux/security.h | 30 +-
include/linux/watch_queue.h | 127 ++++
include/uapi/linux/keyctl.h | 2 +
include/uapi/linux/watch_queue.h | 104 ++++
init/Kconfig | 12 +
kernel/Makefile | 1 +
kernel/watch_queue.c | 659 +++++++++++++++++++++
samples/Kconfig | 6 +
samples/Makefile | 1 +
samples/watch_queue/Makefile | 7 +
samples/watch_queue/watch_test.c | 186 ++++++
security/keys/Kconfig | 9 +
security/keys/compat.c | 3 +
security/keys/gc.c | 5 +
security/keys/internal.h | 38 +-
security/keys/key.c | 38 +-
security/keys/keyctl.c | 115 +++-
security/keys/keyring.c | 20 +-
security/keys/permission.c | 31 +-
security/keys/process_keys.c | 46 +-
security/keys/request_key.c | 4 +-
security/security.c | 22 +-
security/selinux/hooks.c | 51 +-
security/smack/smack_lsm.c | 112 +++-
34 files changed, 2185 insertions(+), 179 deletions(-)
create mode 100644 Documentation/watch_queue.rst
create mode 100644 include/linux/watch_queue.h
create mode 100644 include/uapi/linux/watch_queue.h
create mode 100644 kernel/watch_queue.c
create mode 100644 samples/watch_queue/Makefile
create mode 100644 samples/watch_queue/watch_test.c
On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
>
> [[ With regard to the mount/sb notifications and fsinfo(), Karel Zak
> and
> Ian Kent have been working on making libmount use them,
> preparatory to
> working on systemd:
>
> https://github.com/karelzak/util-linux/commits/topic/fsinfo
>
> https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
>
> Development has stalled briefly due to other commitments, so I'm
> not
> sure I can ask you to pull those parts of the series for
> now. Christian
> Brauner would like to use them in lxc, but hasn't started.
> ]]
Linus,
Just so your aware of what has been done and where we are at here's
a summary.
Karel has done quite a bit of work on libmount (at this stage it's
getting hold of the mount information, aka. fsinfo()) and most of
what I have done is included in that too which you can see in Karel's
repo above). You can see a couple of bug fixes and a little bit of
new code present in my repo which hasn't been sent over to Karel
yet.
This infrastructure is essential before notifications work is started
which is where we will see the most improvement.
It turns out that while systemd uses libmount it has it's own
notifications handling sub-system as it deals with several event
types, not just mount information, in the same area. So, unfortunately,
changes will need to be made there as well as in libmount, more so
than the trivial changes to use fsinfo() via libmount.
That's where we are at the moment and I will get back to it once
I've dealt with a few things I postponed to work on libmount.
If you would like a more detailed account of what we have found I
can provide that too.
Is there anything else you would like from me or Karel?
Ian
On Wed, 2020-06-03 at 10:15 +0800, Ian Kent wrote:
> On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> > [[ With regard to the mount/sb notifications and fsinfo(), Karel
> > Zak
> > and
> > Ian Kent have been working on making libmount use them,
> > preparatory to
> > working on systemd:
> >
> > https://github.com/karelzak/util-linux/commits/topic/fsinfo
> >
> > https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
> >
> > Development has stalled briefly due to other commitments, so I'm
> > not
> > sure I can ask you to pull those parts of the series for
> > now. Christian
> > Brauner would like to use them in lxc, but hasn't started.
> > ]]
>
> Linus,
>
> Just so your aware of what has been done and where we are at here's
> a summary.
>
> Karel has done quite a bit of work on libmount (at this stage it's
> getting hold of the mount information, aka. fsinfo()) and most of
> what I have done is included in that too which you can see in Karel's
> repo above). You can see a couple of bug fixes and a little bit of
> new code present in my repo which hasn't been sent over to Karel
> yet.
>
> This infrastructure is essential before notifications work is started
> which is where we will see the most improvement.
>
> It turns out that while systemd uses libmount it has it's own
> notifications handling sub-system as it deals with several event
> types, not just mount information, in the same area. So,
> unfortunately,
> changes will need to be made there as well as in libmount, more so
> than the trivial changes to use fsinfo() via libmount.
>
> That's where we are at the moment and I will get back to it once
> I've dealt with a few things I postponed to work on libmount.
>
> If you would like a more detailed account of what we have found I
> can provide that too.
>
> Is there anything else you would like from me or Karel?
I think there's a bit more I should say about this.
One reason work hasn't progressed further on this is I spent
quite a bit of time looking at the affects of using fsinfo().
My testing was done by using a large autofs direct mount map of
20000 entries which means that at autofs startup 20000 autofs
mounts must be done and at autofs shutdown those 20000 mounts
must be umounted. Not very scientific but something to use to
get a feel for the affect of our changes.
Initially just using fsinfo() to load all the mount entries was
done to see how that would perform. This was done in a way that
required no modifications to library user code but didn't get
much improvement.
Next loading all the mount ids (alone) for mount entry traversal
was done and the various fields retrieved on-demand (implemented
by Karel).
Loading the entire mount table and then traversing the entries
means the mount table is always possibly out of date. And loading
the ids and getting the fields on-demand might have made that
problem worse. But loading only the mount ids and using an
on-demand method to get needed fields worked surprisingly well.
The main issue is a mount going away while getting the fields.
Testing showed that simply checking the field is valid and
ignoring the entry if it isn't is enough to handle that case.
Also the mount going away after the needed fields have been
retrieved must be handled by callers of libmount as mounts
can just as easily go away after reading the proc based tables.
The case of the underlying mount information changing needs to
be considered too. We will need to do better on that in the
future but it too is a problem with the proc table handing and
hasn't seen problems logged against libmount for it AFAIK.
So, all in all, this approach worked pretty well as libmount
users do use the getter access methods to retrieve the mount
entry fields (which is required for the on-demand method to
work). Certainly systemd always uses them (and it looks like
udisks2 does too).
Unfortunately using the libmount on-demand implementation
requires library user code be modified (only a little in
the systemd case) to use the implementation.
Testing showed that we get between 10-15% reduction in
overhead and CPU usage remained high.
I think processing large numbers of mounts is simply a lot
of work and there are particular cases that will remain that
require the use of the load and traverse method. For example
matching all mounts with a given prefix string (one of the
systemd use cases).
It's hard to get information about this but I can say that
running pref during the autofs start and stop shows the bulk
of the counter hits on the fsinfo() table construction code
so that ahs to be where the overhead is.
The unavoidable conclusion is that the load and traverse method
that's been imposed on us for so long (even before libmount)
for mount handling is what we need to get away from. After all,
this is essentially where the problem comes from in the first
place. And fsinfo() is designed to not need to use this method
for getting mount information for that reason.
There's also the notifications side of things which is the next
area to work on. Looking at systemd I see that monitoring the
proc mount table leads to a load, traverse, and process of the
entire table for every single notification. It's clear that's
because of the (what I'll call) anonymous notifications that we
have now.
The notifications in David's series carry event specific
information, for example the mount id for mount notifications
and the libmount fsinfo() implementation is written to use the
mount id (lowest overhead lookup option), so there has to be
significant improvement for this case.
But systemd has it's own notifications handling code so there
will need to be non-trivial changes there as well as changes
in libmount.
Bottom line is we have a bit of a challenge with this because we
are trying to change coding practices developed over many years
that, necessarily, use a load/traverse method and it's going to
take quite a while to change these coding practices.
My question is, is there something specific, besides what we are
doing, that you'd like to see done now in order to get the series
merged?
Ian
On Tue, Jun 02, 2020 at 04:55:04PM +0100, David Howells wrote:
> Date: Tue, 02 Jun 2020 16:51:44 +0100
>
> Hi Linus,
>
> Can you pull this, please? It adds a general notification queue concept
> and adds an event source for keys/keyrings, such as linking and unlinking
> keys and changing their attributes.
>
> Thanks to Debarshi Ray, we do have a pull request to use this to fix a
> problem with gnome-online-accounts - as mentioned last time:
>
> https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47
>
> Without this, g-o-a has to constantly poll a keyring-based kerberos cache
> to find out if kinit has changed anything.
>
> [[ With regard to the mount/sb notifications and fsinfo(), Karel Zak and
The mount/sb notification and fsinfo() stuff is something we'd like to
use. (And then later extend to allow for supervised mounts where a
container manager can supervise the mounts of an unprivileged
container.)
I'm not sure if the mount notifications are already part of this pr.
Christian
> Ian Kent have been working on making libmount use them, preparatory to
> working on systemd:
>
> https://github.com/karelzak/util-linux/commits/topic/fsinfo
> https://github.com/raven-au/util-linux/commits/topic/fsinfo.public
>
> Development has stalled briefly due to other commitments, so I'm not
> sure I can ask you to pull those parts of the series for now. Christian
> Brauner would like to use them in lxc, but hasn't started.
> ]]
>
>
> LSM hooks are included:
>
> (1) A set of hooks are provided that allow an LSM to rule on whether or
> not a watch may be set. Each of these hooks takes a different
> "watched object" parameter, so they're not really shareable. The LSM
> should use current's credentials. [Wanted by SELinux & Smack]
>
> (2) A hook is provided to allow an LSM to rule on whether or not a
> particular message may be posted to a particular queue. This is given
> the credentials from the event generator (which may be the system) and
> the watch setter. [Wanted by Smack]
>
> I've provided SELinux and Smack with implementations of some of these hooks.
>
>
> WHY
> ===
>
> Key/keyring notifications are desirable because if you have your kerberos
> tickets in a file/directory, your Gnome desktop will monitor that using
> something like fanotify and tell you if your credentials cache changes.
>
> However, we also have the ability to cache your kerberos tickets in the
> session, user or persistent keyring so that it isn't left around on disk
> across a reboot or logout. Keyrings, however, cannot currently be
> monitored asynchronously, so the desktop has to poll for it - not so good
> on a laptop. This facility will allow the desktop to avoid the need to
> poll.
>
>
> DESIGN DECISIONS
> ================
>
> (1) The notification queue is built on top of a standard pipe. Messages
> are effectively spliced in. The pipe is opened with a special flag:
>
> pipe2(fds, O_NOTIFICATION_PIPE);
>
> The special flag has the same value as O_EXCL (which doesn't seem like
> it will ever be applicable in this context)[?]. It is given up front
> to make it a lot easier to prohibit splice and co. from accessing the
> pipe.
>
> [?] Should this be done some other way? I'd rather not use up a new
> O_* flag if I can avoid it - should I add a pipe3() system call
> instead?
>
> The pipe is then configured::
>
> ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
> ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);
>
> Messages are then read out of the pipe using read().
>
> (2) It should be possible to allow write() to insert data into the
> notification pipes too, but this is currently disabled as the kernel
> has to be able to insert messages into the pipe *without* holding
> pipe->mutex and the code to make this work needs careful auditing.
>
> (3) sendfile(), splice() and vmsplice() are disabled on notification pipes
> because of the pipe->mutex issue and also because they sometimes want
> to revert what they just did - but one or more notification messages
> might've been interleaved in the ring.
>
> (4) The kernel inserts messages with the wait queue spinlock held. This
> means that pipe_read() and pipe_write() have to take the spinlock to
> update the queue pointers.
>
> (5) Records in the buffer are binary, typed and have a length so that they
> can be of varying size.
>
> This allows multiple heterogeneous sources to share a common buffer;
> there are 16 million types available, of which I've used just a few,
> so there is scope for others to be used. Tags may be specified when a
> watchpoint is created to help distinguish the sources.
>
> (6) Records are filterable as types have up to 256 subtypes that can be
> individually filtered. Other filtration is also available.
>
> (7) Notification pipes don't interfere with each other; each may be bound
> to a different set of watches. Any particular notification will be
> copied to all the queues that are currently watching for it - and only
> those that are watching for it.
>
> (8) When recording a notification, the kernel will not sleep, but will
> rather mark a queue as having lost a message if there's insufficient
> space. read() will fabricate a loss notification message at an
> appropriate point later.
>
> (9) The notification pipe is created and then watchpoints are attached to
> it, using one of:
>
> keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
> watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
> watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);
>
> where in both cases, fd indicates the queue and the number after is a
> tag between 0 and 255.
>
> (10) Watches are removed if either the notification pipe is destroyed or
> the watched object is destroyed. In the latter case, a message will
> be generated indicating the enforced watch removal.
>
>
> Things I want to avoid:
>
> (1) Introducing features that make the core VFS dependent on the network
> stack or networking namespaces (ie. usage of netlink).
>
> (2) Dumping all this stuff into dmesg and having a daemon that sits there
> parsing the output and distributing it as this then puts the
> responsibility for security into userspace and makes handling
> namespaces tricky. Further, dmesg might not exist or might be
> inaccessible inside a container.
>
> (3) Letting users see events they shouldn't be able to see.
>
>
> TESTING AND MANPAGES
> ====================
>
> (*) The keyutils tree has a pipe-watch branch that has keyctl commands for
> making use of notifications. Proposed manual pages can also be found
> on this branch, though a couple of them really need to go to the main
> manpages repository instead.
>
> If the kernel supports the watching of keys, then running "make test"
> on that branch will cause the testing infrastructure to spawn a
> monitoring process on the side that monitors a notifications pipe for
> all the key/keyring changes induced by the tests and they'll all be
> checked off to make sure they happened.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch
>
> (*) A test program is provided (samples/watch_queue/watch_test) that can
> be used to monitor for keyrings, mount and superblock events.
> Information on the notifications is simply logged to stdout.
>
> Thanks,
> David
> ---
> The following changes since commit b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce:
>
> Linux 5.7-rc6 (2020-05-17 16:48:37 -0700)
>
> are available in the Git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601
>
> for you to fetch changes up to a8478a602913dc89a7cd2060e613edecd07e1dbd:
>
> smack: Implement the watch_key and post_notification hooks (2020-05-19 15:47:38 +0100)
>
> ----------------------------------------------------------------
> Notifications over pipes + Keyring notifications
>
> ----------------------------------------------------------------
> David Howells (12):
> uapi: General notification queue definitions
> security: Add a hook for the point of notification insertion
> pipe: Add O_NOTIFICATION_PIPE
> pipe: Add general notification queue support
> security: Add hooks to rule on setting a watch
> watch_queue: Add a key/keyring notification facility
> Add sample notification program
> pipe: Allow buffers to be marked read-whole-or-error for notifications
> pipe: Add notification lossage handling
> keys: Make the KEY_NEED_* perms an enum rather than a mask
> selinux: Implement the watch_key security hook
> smack: Implement the watch_key and post_notification hooks
>
> Documentation/security/keys/core.rst | 57 ++
> Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
> Documentation/watch_queue.rst | 339 +++++++++++
> fs/pipe.c | 242 +++++---
> fs/splice.c | 12 +-
> include/linux/key.h | 33 +-
> include/linux/lsm_audit.h | 1 +
> include/linux/lsm_hook_defs.h | 9 +
> include/linux/lsm_hooks.h | 14 +
> include/linux/pipe_fs_i.h | 27 +-
> include/linux/security.h | 30 +-
> include/linux/watch_queue.h | 127 ++++
> include/uapi/linux/keyctl.h | 2 +
> include/uapi/linux/watch_queue.h | 104 ++++
> init/Kconfig | 12 +
> kernel/Makefile | 1 +
> kernel/watch_queue.c | 659 +++++++++++++++++++++
> samples/Kconfig | 6 +
> samples/Makefile | 1 +
> samples/watch_queue/Makefile | 7 +
> samples/watch_queue/watch_test.c | 186 ++++++
> security/keys/Kconfig | 9 +
> security/keys/compat.c | 3 +
> security/keys/gc.c | 5 +
> security/keys/internal.h | 38 +-
> security/keys/key.c | 38 +-
> security/keys/keyctl.c | 115 +++-
> security/keys/keyring.c | 20 +-
> security/keys/permission.c | 31 +-
> security/keys/process_keys.c | 46 +-
> security/keys/request_key.c | 4 +-
> security/security.c | 22 +-
> security/selinux/hooks.c | 51 +-
> security/smack/smack_lsm.c | 112 +++-
> 34 files changed, 2185 insertions(+), 179 deletions(-)
> create mode 100644 Documentation/watch_queue.rst
> create mode 100644 include/linux/watch_queue.h
> create mode 100644 include/uapi/linux/watch_queue.h
> create mode 100644 kernel/watch_queue.c
> create mode 100644 samples/watch_queue/Makefile
> create mode 100644 samples/watch_queue/watch_test.c
>
Hi Linus,
On Tue, Jun 02, 2020 at 04:55:04PM +0100, David Howells wrote:
> Can you pull this, please? It adds a general notification queue concept
I'm trying to use David's notification stuff in userspace, and I guess
feedback is welcome :-)
The notification stuff looks pretty promising, but I do not understand
why we need to use pipe for this purpose, see typical userspace use-case:
int pipefd[2], fd;
if (pipe2(pipefd, O_NOTIFICATION_PIPE) == -1)
err(EXIT_FAILURE, "pipe2 failed");
fd = pipefd[0];
All the next operations are done with "fd". It's nowhere used as a
pipe, and nothing uses pipefd[1]. The first impression from this code
is "oh, this is strange; why?".
Is it because we need to create a new file descriptor from nothing?
Why O_NOTIFICATION_PIPE is better than introduce a new syscall
notifyfd()?
(We have signalfd(), no O_SIGNAL_PIPE, etc.)
Karel
--
Karel Zak <[email protected]>
http://karelzak.blogspot.com
[ Finally getting around to this since my normal pull queue is now empty ]
On Wed, Jun 10, 2020 at 4:13 AM Karel Zak <[email protected]> wrote:
>
> The notification stuff looks pretty promising, but I do not understand
> why we need to use pipe for this purpose
The original intent was never to use the "pipe()" system call itself,
only use pipes as the actual transport mechanism (because I do not for
a second believe in the crazy "use sockets" model that a lot of other
people seem to blindly believe in).
But using "pipe()" also allows for non-kernel notification queues (ie
where the events come from a user space process). Then you'd not use
O_NOTIFICATION_PIPE, but O_DIRECT (for a packetized pipe).
> Is it because we need to create a new file descriptor from nothing?
> Why O_NOTIFICATION_PIPE is better than introduce a new syscall
> notifyfd()?
We could eventually introduce a new system call.
But I most definitely did *NOT* want to see anything like that for any
first gen stuff. Especially since it wasn't clear who was going to
use it, and whether early trials would literally be done with that
user-space emulation model of using a perfectly regular pipe (just
with packetization).
I'm not even convinced O_NOTIFICATION_PIPE is necessary, but at worst
it will be a useful marker. I think the only real reason for it was to
avoid any clashes with splice(), which has more complex use of the
pipe buffers.
I'm so far just reading this thread and the arguments for users, and I
haven't yet looked at all the actual details in the pull request - but
last time I had objections to things it wasn't the code, it was the
lack of any use.
Linus
[ Actually going through the code now ]
On Wed, Jun 10, 2020 at 4:13 AM Karel Zak <[email protected]> wrote:
>
> All the next operations are done with "fd". It's nowhere used as a
> pipe, and nothing uses pipefd[1].
As an aside, that isn't necessarily true.
In some of the examples, pipefd[1] is used for configuration (sizing
and adding filters), although I think right now that's not really
enforced, and other examples seem to have pipefd[0] do that too.
DavidH: should that perhaps be a hard rule, so that you can pass a
pipefd[0] to readers, while knowing that they can't then change the
kinds of notifications they see.
In the "pipe: Add general notification queue support" commit message,
the code example uses pipefd[0] for IOC_WATCH_QUEUE_SET_SIZE, but then
in the commit message for "watch_queue: Add a key/keyring notification
facility" it uses pipefd[1].
And that latter example does make sense: using the write-side
pipefd[1] for configuration, while the read-side pipefd[0] is the side
that sees the results. That is also how it would work if you have a
user-mode pipe with the notification source controlling the writing
side - the reading side can obviously not add filters or change the
semantics of the watches.
So that allows a trusted side to add and create filters, while some
untrusted entity can then see the results.
This isn't going to hold up me merging the code, but it would be good
to clarify and make that something that gets enforced if we decide
it's worth it.
It does seem conceptually like a good idea, and potentially actually
useful to clearly separate the domain of "you can add watches and
filters" from "you can see the notifications".
Linus
Linus Torvalds <[email protected]> wrote:
> I'm not even convinced O_NOTIFICATION_PIPE is necessary, but at worst
> it will be a useful marker. I think the only real reason for it was to
> avoid any clashes with splice(), which has more complex use of the
> pipe buffers.
The main reason is to prevent splice because the iov_iter rewind for splice
gets quite tricky if the kernel can randomly insert packets into the pipe
buffer in between what splice is inserting.
> I'm so far just reading this thread and the arguments for users, and I
> haven't yet looked at all the actual details in the pull request - but
> last time I had objections to things it wasn't the code, it was the
> lack of any use.
Would you be willing at this point to consider pulling the mount notifications
and fsinfo() which helps support that? I could whip up pull reqs for those
two pieces - or do you want to see more concrete patches that use it?
David
Linus Torvalds <[email protected]> wrote:
> > All the next operations are done with "fd". It's nowhere used as a
> > pipe, and nothing uses pipefd[1].
>
> As an aside, that isn't necessarily true.
>
> In some of the examples, pipefd[1] is used for configuration (sizing
> and adding filters), although I think right now that's not really
> enforced, and other examples seem to have pipefd[0] do that too.
The configuration can happen on either end of the pipe. I just need to be
able to find the pipe object.
> DavidH: should that perhaps be a hard rule, so that you can pass a
> pipefd[0] to readers, while knowing that they can't then change the
> kinds of notifications they see.
You can argue that the other way: that it should be a hard rule that you can
pass pipefd[1] to writers, whilst knowing that they can't then change the kind
of notifications that the kernel can insert into the pipe. My feeling is that
it's more likely that you would keep the read end yourself and give the write
end away - if at all. Most likely, IMO, would be that you attach notification
sources and never use the write end directly.
There is some argument for making it so that the notification sources belong
to the read end only and that they keep the write side alive internally -
meaning that you can just close the write end. All the notification sources
just then disappear when the read end is closed - but dup() might make this
kind of tricky as there is only one pipe object and its shared between both
ends. The existence of O_RDWR FIFOs might also make this tricky.
> In the "pipe: Add general notification queue support" commit message,
> the code example uses pipefd[0] for IOC_WATCH_QUEUE_SET_SIZE, but then
> in the commit message for "watch_queue: Add a key/keyring notification
> facility" it uses pipefd[1].
>
> And that latter example does make sense: using the write-side
> pipefd[1] for configuration, while the read-side pipefd[0] is the side
> that sees the results. That is also how it would work if you have a
> user-mode pipe with the notification source controlling the writing
> side - the reading side can obviously not add filters or change the
> semantics of the watches.
>
> So that allows a trusted side to add and create filters, while some
> untrusted entity can then see the results.
As stated above, I think you should be looking at this the other way round -
you're more likely to keep the read end for yourself. If you attach multiple
sources to a pipe, everything they produce comes out mixed together from the
read end of the pipe.
You might even pass the write end to multiple userspace-side event generators,
but I'm not sure it would make sense to pass the read end around unless you
have sufficient flow that you need multiple consumers to keep up with it.
David
On Sat, Jun 13, 2020 at 6:05 AM David Howells <[email protected]> wrote:
>
> Would you be willing at this point to consider pulling the mount notifications
> and fsinfo() which helps support that? I could whip up pull reqs for those
> two pieces - or do you want to see more concrete patches that use it?
I'd want to see more concrete use cases, but I'd also like to see that
this keyring thing gets used and doesn't find any show-stoppers when
it does.
If we have multiple uses, and one of them notices some problem that
requires any ABI changes, but the other one has already started using
it, we'll have more problems.
Linus
On Sat, Jun 13, 2020 at 9:47 AM Linus Torvalds
<[email protected]> wrote:
>
> If we have multiple uses, and one of them notices some problem that
> requires any ABI changes, but the other one has already started using
> it, we'll have more problems.
Ok, it's merged in my tree, although I was somewhat unhappy about the
incomprehensible calling conventions of "get_pipe_info()". The random
second argument just makes no sense when you read the code, it would
have probably been better as a helper function or #define to clarify
the whole "for_splice" thing.
But let's see how it works and what actually happens.
Linus
The pull request you sent on Tue, 02 Jun 2020 16:55:04 +0100:
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/notifications-20200601
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/6c3297841472b4e53e22e53826eea9e483d993e5
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker
On Sat, Jun 13, 2020 at 3:05 PM David Howells <[email protected]> wrote:
> > I'm so far just reading this thread and the arguments for users, and I
> > haven't yet looked at all the actual details in the pull request - but
> > last time I had objections to things it wasn't the code, it was the
> > lack of any use.
>
> Would you be willing at this point to consider pulling the mount notifications
> and fsinfo() which helps support that? I could whip up pull reqs for those
> two pieces - or do you want to see more concrete patches that use it?
Well, I had some questions and comments for the mount notifications
last time around[1] and didn't yet get a reply.
And the fsinfo stuff is simply immature, please lets not merge it just
yet. When we have some uses (most notably systemd) running on top of
the current fsinfo interface, we can sit down and discuss how the API
can be cleaned up.
BTW I had a similar experience with the fsconfig() merge, which was
pushed with some unpolished bits and where my comments were also
largely ignored. So, before asking to pull, please at least *answer*
reviews. You don't have to agree, but at least consider and think
about the comments.
Thanks,
Miklos
[1] https://lore.kernel.org/linux-fsdevel/CAJfpegspWA6oUtdcYvYF=3fij=Bnq03b8VMbU9RNMKc+zzjbag@mail.gmail.com/
Hi David,
On Tue, 2020-06-02 at 16:55 +-0100, David Howells wrote:
+AD4- Date: Tue, 02 Jun 2020 16:51:44 +-0100
+AD4-
+AD4- Hi Linus,
+AD4-
+AD4- Can you pull this, please? It adds a general notification queue
+AD4- concept
+AD4- and adds an event source for keys/keyrings, such as linking and
+AD4- unlinking
+AD4- keys and changing their attributes.
+AFs-..+AF0-
This commit:
+AD4- keys: Make the KEY+AF8-NEED+AF8AKg- perms an enum rather than a mask
...upstream as:
8c0637e950d6 keys: Make the KEY+AF8-NEED+AF8AKg- perms an enum rather than a mask
...triggers a regression in the libnvdimm unit test that exercises the
encrypted keys used to store nvdimm passphrases. It results in the
below warning.
---
WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 key+AF8-task+AF8-permission+-0xd3/0x140
Modules linked in: nd+AF8-blk(OE) nfit+AF8-test(OE) device+AF8-dax(OE) ebtable+AF8-filter(E) ebtables(E) ip6table+AF8-filter(E) ip6+AF8-tables(E) kvm+AF8-intel(E) kvm(E) irqbypass(E) nd+AF8-pmem(OE) dax+AF8-pmem(OE) nd+AF8-btt(OE) dax+AF8-p
ct10dif+AF8-pclmul(E) nd+AF8-e820(OE) nfit(OE) crc32+AF8-pclmul(E) libnvdimm(OE) crc32c+AF8-intel(E) ghash+AF8-clmulni+AF8-intel(E) serio+AF8-raw(E) encrypted+AF8-keys(E) trusted(E) nfit+AF8-test+AF8-iomap(OE) tpm(E) drm(E)
CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G OE 5.7.0-rc6+- +ACM-155
Hardware name: QEMU Standard PC (i440FX +- PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:key+AF8-task+AF8-permission+-0xd3/0x140
Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff +ADw-0f+AD4- 0b 48 83 c4 08 b8 f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06
RSP: 0018:ffffaddc42db7c90 EFLAGS: 00010297
RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffaddc42db7c7c
RDX: 0000000000000000 RSI: ffff985e1c46e840 RDI: ffff985e3a03de01
RBP: ffff985e3a03de01 R08: 0000000000000000 R09: 5461e7bc000002a0
R10: 0000000000000004 R11: 0000000066666666 R12: ffff985e1c46e840
R13: 0000000000000000 R14: ffffaddc42db7cd8 R15: ffff985e248c6540
FS: 00007f863c18a780(0000) GS:ffff985e3bbc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000006d3708 CR3: 0000000125a1e006 CR4: 0000000000160ee0
Call Trace:
lookup+AF8-user+AF8-key+-0xeb/0x6b0
? vsscanf+-0x3df/0x840
? key+AF8-validate+-0x50/0x50
? key+AF8-default+AF8-cmp+-0x20/0x20
nvdimm+AF8-get+AF8-user+AF8-key+AF8-payload.part.0+-0x21/0x110 +AFs-libnvdimm+AF0-
nvdimm+AF8-security+AF8-store+-0x67d/0xb20 +AFs-libnvdimm+AF0-
security+AF8-store+-0x67/0x1a0 +AFs-libnvdimm+AF0-
kernfs+AF8-fop+AF8-write+-0xcf/0x1c0
vfs+AF8-write+-0xde/0x1d0
ksys+AF8-write+-0x68/0xe0
do+AF8-syscall+AF8-64+-0x5c/0xa0
entry+AF8-SYSCALL+AF8-64+AF8-after+AF8-hwframe+-0x49/0xb3
RIP: 0033:0x7f863c624547
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 +ADw-48+AD4- 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffd61d8f5e8 EFLAGS: 00000246 ORIG+AF8-RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007ffd61d8f640 RCX: 00007f863c624547
RDX: 0000000000000014 RSI: 00007ffd61d8f640 RDI: 0000000000000005
RBP: 0000000000000005 R08: 0000000000000014 R09: 00007ffd61d8f4a0
R10: fffffffffffff455 R11: 0000000000000246 R12: 00000000006dbbf0
R13: 00000000006cd710 R14: 00007f863c18a6a8 R15: 00007ffd61d8fae0
irq event stamp: 36976
hardirqs last enabled at (36975): +AFsAPA-ffffffff9131fa40+AD4AXQ- +AF8AXw-slab+AF8-alloc+-0x70/0x90
hardirqs last disabled at (36976): +AFsAPA-ffffffff910049c7+AD4AXQ- trace+AF8-hardirqs+AF8-off+AF8-thunk+-0x1a/0x1c
softirqs last enabled at (35474): +AFsAPA-ffffffff91e00357+AD4AXQ- +AF8AXw-do+AF8-softirq+-0x357/0x466
softirqs last disabled at (35467): +AFsAPA-ffffffff910eae96+AD4AXQ- irq+AF8-exit+-0xe6/0xf0
On Tue, Jun 16, 2020 at 6:15 PM Williams, Dan J
<[email protected]> wrote:
>
> Hi David,
>
> On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> > Date: Tue, 02 Jun 2020 16:51:44 +0100
> >
> > Hi Linus,
> >
> > Can you pull this, please? It adds a general notification queue
> > concept
> > and adds an event source for keys/keyrings, such as linking and
> > unlinking
> > keys and changing their attributes.
> [..]
>
> This commit:
>
> > keys: Make the KEY_NEED_* perms an enum rather than a mask
>
> ...upstream as:
>
> 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask
>
> ...triggers a regression in the libnvdimm unit test that exercises the
> encrypted keys used to store nvdimm passphrases. It results in the
> below warning.
This regression is still present in tip of tree. David, have you had a
chance to take a look?
>
> ---
>
> WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 key_task_permission+0xd3/0x140
> Modules linked in: nd_blk(OE) nfit_test(OE) device_dax(OE) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) kvm_intel(E) kvm(E) irqbypass(E) nd_pmem(OE) dax_pmem(OE) nd_btt(OE) dax_p
> ct10dif_pclmul(E) nd_e820(OE) nfit(OE) crc32_pclmul(E) libnvdimm(OE) crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) encrypted_keys(E) trusted(E) nfit_test_iomap(OE) tpm(E) drm(E)
> CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G OE 5.7.0-rc6+ #155
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
> RIP: 0010:key_task_permission+0xd3/0x140
> Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff <0f> 0b 48 83 c4 08 b8 f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06
>
> RSP: 0018:ffffaddc42db7c90 EFLAGS: 00010297
> RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffffaddc42db7c7c
> RDX: 0000000000000000 RSI: ffff985e1c46e840 RDI: ffff985e3a03de01
> RBP: ffff985e3a03de01 R08: 0000000000000000 R09: 5461e7bc000002a0
> R10: 0000000000000004 R11: 0000000066666666 R12: ffff985e1c46e840
> R13: 0000000000000000 R14: ffffaddc42db7cd8 R15: ffff985e248c6540
> FS: 00007f863c18a780(0000) GS:ffff985e3bbc0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000006d3708 CR3: 0000000125a1e006 CR4: 0000000000160ee0
> Call Trace:
> lookup_user_key+0xeb/0x6b0
> ? vsscanf+0x3df/0x840
> ? key_validate+0x50/0x50
> ? key_default_cmp+0x20/0x20
> nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm]
> nvdimm_security_store+0x67d/0xb20 [libnvdimm]
> security_store+0x67/0x1a0 [libnvdimm]
> kernfs_fop_write+0xcf/0x1c0
> vfs_write+0xde/0x1d0
> ksys_write+0x68/0xe0
> do_syscall_64+0x5c/0xa0
> entry_SYSCALL_64_after_hwframe+0x49/0xb3
> RIP: 0033:0x7f863c624547
> Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> RSP: 002b:00007ffd61d8f5e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 00007ffd61d8f640 RCX: 00007f863c624547
> RDX: 0000000000000014 RSI: 00007ffd61d8f640 RDI: 0000000000000005
> RBP: 0000000000000005 R08: 0000000000000014 R09: 00007ffd61d8f4a0
> R10: fffffffffffff455 R11: 0000000000000246 R12: 00000000006dbbf0
> R13: 00000000006cd710 R14: 00007f863c18a6a8 R15: 00007ffd61d8fae0
> irq event stamp: 36976
> hardirqs last enabled at (36975): [<ffffffff9131fa40>] __slab_alloc+0x70/0x90
> hardirqs last disabled at (36976): [<ffffffff910049c7>] trace_hardirqs_off_thunk+0x1a/0x1c
> softirqs last enabled at (35474): [<ffffffff91e00357>] __do_softirq+0x357/0x466
> softirqs last disabled at (35467): [<ffffffff910eae96>] irq_exit+0xe6/0xf0
>
Dan Williams <[email protected]> wrote:
> > This commit:
> >
> > > keys: Make the KEY_NEED_* perms an enum rather than a mask
> >
> > ...upstream as:
> >
> > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask
> >
> > ...triggers a regression in the libnvdimm unit test that exercises the
> > encrypted keys used to store nvdimm passphrases. It results in the
> > below warning.
>
> This regression is still present in tip of tree. David, have you had a
> chance to take a look?
nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants
the key for so that the appropriate security checks can take place in SELinux
and Smack. Note that I have a patch in the works that changes this still
further.
Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work
for you?
David
On Tue, Jun 23, 2020 at 5:55 PM David Howells <[email protected]> wrote:
>
> Dan Williams <[email protected]> wrote:
>
> > > This commit:
> > >
> > > > keys: Make the KEY_NEED_* perms an enum rather than a mask
> > >
> > > ...upstream as:
> > >
> > > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask
> > >
> > > ...triggers a regression in the libnvdimm unit test that exercises the
> > > encrypted keys used to store nvdimm passphrases. It results in the
> > > below warning.
> >
> > This regression is still present in tip of tree. David, have you had a
> > chance to take a look?
>
> nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants
> the key for so that the appropriate security checks can take place in SELinux
> and Smack. Note that I have a patch in the works that changes this still
> further.
>
> Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work
> for you?
It does, thanks.
Shall I wait for your further reworks to fix this for v5.8, or is that
v5.9 material?
Dan Williams <[email protected]> wrote:
> Shall I wait for your further reworks to fix this for v5.8, or is that
> v5.9 material?
It could do with stewing in linux-next for a while, so 5.9 probably.
David