Subject: Re: [PATCH 2/2] seccomp.2: document userspace notification

[widening CC to the same scope as the code patches]

Hello Tycho,

Sorry for the late response on this. Could you amend/rebase your patches
in light of the comments below, please?

Comments from anyone on my "big picture" text at the foot of this
mail would also be very welcome.

On 12/13/18 1:11 AM, Tycho Andersen wrote:
> Here's some documentation on how to use the userspace notification. I tried
> to mention the biggest TOCTOU gotcha, but the pointer to the sample is
> really the best thing,

Cough! No, the "best thing" really is a good sample program *and* really
good documentation :-).

> as it has actual code and many comments about what's
> going on.
>
> Signed-off-by: Tycho Andersen <[email protected]>
> CC: Kees Cook <[email protected]>
> ---
> man2/seccomp.2 | 85 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> index d69187783..fef8914bf 100644
> --- a/man2/seccomp.2
> +++ b/man2/seccomp.2
> @@ -228,6 +228,13 @@ An administrator may override this filter flag by preventing specific
> actions from being logged via the
> .IR /proc/sys/kernel/seccomp/actions_logged
> file.
> +.TP
> +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER " (since Linux 4.21)"
> +With this flag, a new userspace notification fd is returned on success. When
> +this filter returns
> +.BR SECCOMP_RET_USER_NOTIF
> +a notification will be sent to this fd. See "Userspace Notification" below for

s/fd/file descriptor/ throughout please.

> +more details.

I think the description here could be better worded as something like:

SECCOMP_FILTER_FLAG_NEW_LISTENER
Register a new filter, as usual, but on success return a
new file descriptor that provides user-space notifications.
When the filter returns SECCOMP_RET_USER_NOTIF, a notification
will be provided via this file descriptor. The close-on-exec
flag is automatically set on the new file descriptor. ...
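For concreteness, installing such a filter might look roughly like the
sketch below. It assumes kernel headers that already define
SECCOMP_FILTER_FLAG_NEW_LISTENER and SECCOMP_RET_USER_NOTIF. The helper
names build_notify_filter() and install_notify_filter() are made up for
illustration, and the architecture check that a real filter should do on
seccomp_data.arch is left out for brevity.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NOTIFY_FILTER_LEN 4

/* Hypothetical helper: fill 'insns' with a filter that returns
 * SECCOMP_RET_USER_NOTIF for system call number 'nr' and
 * SECCOMP_RET_ALLOW for every other system call.
 * Assumption: no seccomp_data.arch check, which real filters need. */
void build_notify_filter(struct sock_filter insns[NOTIFY_FILTER_LEN], int nr)
{
    const struct sock_filter prog[NOTIFY_FILTER_LEN] = {
        /* Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it matches, fall through; otherwise skip one instruction. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    for (int i = 0; i < NOTIFY_FILTER_LEN; i++)
        insns[i] = prog[i];
}

/* Hypothetical helper: install the filter above in the calling thread.
 * Returns the new listening file descriptor, or -1 on error. */
int install_notify_filter(int nr)
{
    struct sock_filter insns[NOTIFY_FILTER_LEN];
    struct sock_fprog prog = { .len = NOTIFY_FILTER_LEN, .filter = insns };

    build_notify_filter(insns, nr);

    /* Required unless the caller has CAP_SYS_ADMIN. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    /* There is no glibc wrapper for seccomp(), hence syscall(). */
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

On success the return value of install_notify_filter() is the new file
descriptor rather than 0, which is the change to RETURN VALUE that the
patch describes.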

> .RE
> .TP
> .BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)"
> @@ -606,6 +613,17 @@ file.
> .TP
> .BR SECCOMP_RET_ALLOW
> This value results in the system call being executed.
> +.TP
> +.BR SECCOMP_RET_USER_NOTIF " (since Linux 4.21)"

Please see the start of this hanging list in the manual page.
Can you confirm that SECCOMP_RET_USER_NOTIF really is the lowest
in the precedence order of all of the filter return values?

> +Forwards the syscall to an attached listener in userspace to allow userspace to

s/syscall/system call/ throughout please.

> +decide what to do with the syscall. If there is no attached listener (either
> +because the filter was not installed with the
> +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> +or because the fd was closed), the filter returns
> +.BR ENOSYS
> +similar to what happens when a filter returns
> +.BR SECCOMP_RET_TRACE
> +and there is no tracer. See "Userspace Notification" below for more details.
> .PP
> If an action value other than one of the above is specified,
> then the filter action is treated as either
> @@ -693,10 +711,75 @@ Otherwise, if kernel auditing is enabled and the process is being audited
> the action is logged.
> .IP *
> Otherwise, the action is not logged.
> +.SS Userspace Notification
> +Interactin userspace notification functionality in seccomp is primarily done
> +via file descriptor.

That sentence is somewhat garbled. Even if I correct the typo,
I still don't really understand it. Could you try again?
(Or, take a look at my "big picture" text at the foot of this mail.)

> This file descriptor can be obtained by passing
> +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> +as a filter flag when installing a new filter.
> +.PP
> +Once an fd is obtained, userspace can wait for events using
> +.BR poll ()

and presumably select() and epoll?

What kind of notification event do poll(2)/epoll(7)/select(2) provide?
It looks to be POLLIN/EPOLLIN/readable. Is that correct?
These details should be noted here. More generally, my assumption
is that you can use poll(2)/epoll(7)/select(2) to find out about the
availability of an event, and then use SECCOMP_IOCTL_NOTIF_RECV
to read that event. Correct? The text needs to be more explicit on
this.
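To illustrate the usage I am assuming here, a minimal sketch of the
"wait until readable" half could look like this. The function name
wait_for_notification() is invented for the example, and the POLLIN
behavior is my reading of the code, still to be confirmed. Once it
returns 1, the caller would go on to do SECCOMP_IOCTL_NOTIF_RECV.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: wait until 'fd' tests as readable (POLLIN) or
 * until 'timeout_ms' milliseconds have passed (-1 waits forever).
 * Returns 1 if readable, 0 on timeout, and -1 on a poll() error or if
 * poll() reported some condition other than POLLIN. */
int wait_for_notification(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    /* Restart the wait if a signal handler interrupted it. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR);

    if (rc <= 0)
        return rc;              /* 0: timeout, -1: error */

    return (pfd.revents & POLLIN) ? 1 : -1;
}
```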

> +or
> +.BR ioctl ().
> +The supported
> +.BR ioctl ()
> +operations on a notification fd are:
> +.TP
> +.BR SECCOMP_IOCTL_NOTIF_RECV
> +The argument to this command should be a pointer to a struct seccomp_notif:
> +.IP
> +.in +4n
> +.EX
> +struct seccomp_notif {
> + __u64 id;
> + __u32 pid;
> + __u32 flags;
> + struct seccomp_data data;
> +};
> +.EE
> +.in
> +.IP
> +The id field is a filter-unique id for this syscall, and should be supplied in
> +the response. It can additionally be used in
> +.BR SECCOMP_IOCTL_ID_VALID
> +to test whether or not the request is still alive. The pid here is the pid of
> +the task as visible from the listener's pid namespace. If the pid is not
> +visible, it is 0. Flags is unused right now.

So, is 'flags' explicitly zeroed by the kernel? The manual page should note
this.

> struct seccomp_data is the same
> +data that would be passed to a filter running in the kernel.

What are the semantics if multiple monitoring processes are employing
SECCOMP_IOCTL_NOTIF_RECV? Does only one of them get awoken? (Which one?)
Or do they all get woken up and get a 'struct seccomp_notif'? The semantics
should be detailed here.

> +.TP
> +.BR SECCOMP_IOCTL_NOTIF_SEND
> +The argument to this command should be a pointer to a struct seccomp_notif_resp:
> +.IP
> +.in +4n
> +.EX
> +struct seccomp_notif_resp {
> + __u64 id;
> + __s64 val;
> + __s32 error;
> + __u32 flags;
> +};
> +.EE
> +.in
> +.IP
> +The id should be the id from struct seccomp_notif; if error is non-zero, it is
> +used as the return value, otherwise val is. Flags must be 0 right now.

There really isn't enough detail here to help the reader understand. What
is the purpose of 'error' vs 'val'? In particular, I'm assuming that the
monitoring process has (at least) two choices with respect to the system
call being made by the target process:

(1) Cause the system call to fail with an error.
(2) Allow the system call to proceed.

How are 'error' and 'val' used in these two cases?
Are there more than these two cases? How else is 'val' used?

Also, what happens if the monitoring process does a SECCOMP_IOCTL_NOTIF_RECV,
but does not subsequently do a SECCOMP_IOCTL_NOTIF_SEND? This should be
detailed in the manual page.

What are the semantics if multiple monitoring processes are employing
SECCOMP_IOCTL_NOTIF_SEND? An error for all but the first? The semantics
should be detailed here.

> +.TP
> +.BR SECCOMP_IOCTL_NOTIF_ID_VALID
> +The argument to this command is just the id to be tested. There is a pid reuse
> +issue where a task may make a syscall, die, its pid get re-used, and the
> +listener may operate on the wrong pid. So the process for handling modifying a
> +pid's state in some way would be to open the pid's resources (/proc/pid/mem or
> +similar), do this call to verify that the id is still valid, and then use the
> +resource. See the sample below for more detail.
> +.PP
> +A complete example is available in the kernel tree at
> +.IR samples/seccomp/user-trap.c .
> .SH RETURN VALUE
> On success,
> .BR seccomp ()
> -returns 0.
> +returns 0, unless
> +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> +was specified, in which case it returns the fd number for the new listener fd.
> On error, if
> .BR SECCOMP_FILTER_FLAG_TSYNC
> was used,

More generally though, I'm struggling with this patch because "the
big picture" is lacking. Here's my estimate of what I think that
picture is:

[[
Instead of having the seccomp() filter decision being made
within the filter itself, it is possible to pass off the
decision making to a (different) user-space process that
makes the decision. This is done using the following steps.

1. The "target process" (i.e., process that will
establish a seccomp filter) and the process that will make
decisions about system calls ("the monitoring process")
establish a connection using a UNIX domain socket. This
socket will subsequently be used to exchange a file
descriptor.

2. The "target process" establishes a seccomp BPF filter in the
usual manner, but with two notable differences:

* The seccomp() 'flags' argument includes the flag
SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return
value of a successful seccomp() call is a new "listening"
file descriptor that can be used for monitoring.
* In cases where it is needed, the BPF filter returns the
special action SECCOMP_RET_USER_NOTIF. This return value
will trigger a notification event.

3. The "target process" passes the "listening file descriptor"
to the "monitoring process" via the UNIX domain socket.

4. The target process then performs its workload, which includes
system calls that will be controlled by the BPF filter.
Whenever one of these system calls causes the BPF filter to
return SECCOMP_RET_USER_NOTIF, a notification event is
generated on the listening file descriptor.

5. The monitoring process will receive notification events
on the listening file descriptor. These events are returned
as structures of type 'seccomp_notif'. Because this structure
and its size may evolve across kernel versions, the monitoring
process must first determine the size of this structure using
the seccomp() SECCOMP_GET_NOTIF_SIZES operation, which
returns a structure of type 'seccomp_notif_sizes'. The
monitoring process allocates a buffer of size
'seccomp_notif_sizes.seccomp_notif' bytes to receive
monitoring events. In addition, the monitoring process
allocates another buffer of size
'seccomp_notif_sizes.seccomp_notif_resp' bytes for the
response (a 'struct seccomp_notif_resp') that it will
provide to the kernel in order to advise how the system
call being made by the target process shall be treated.

6. The monitoring process can now repeatedly monitor the
listening file descriptor.

When a SECCOMP_RET_USER_NOTIF-triggered event occurs,
the file descriptor will test as readable for
poll(2)/epoll(7)/select(2). The call

ioctl(listenfd, SECCOMP_IOCTL_NOTIF_RECV, reqptr)

can be used to read info on the event; this operation
blocks until an event is available. It populates
the 'struct seccomp_notif' pointed to by the third
argument with information about the system call
that is being attempted by the target process.

7. The monitoring process can use the information in the
'struct seccomp_notif' to make a determination about the
system call being made by the target process. This
structure includes a 'data' field that is the same
'struct seccomp_data' that is passed to a BPF filter.

In addition, the monitoring process may make use of other
information that is available from user space. For example,
it may inspect the memory of the target process (whose PID
is provided in the 'struct seccomp_notif') using
/proc/PID/mem, which includes inspecting the values
pointed to by system call arguments (whose location is
available in 'seccomp_notif.data.args'). However, when using
the target process PID in this way, one must guard against
PID re-use race conditions using the ioctl()
SECCOMP_IOCTL_NOTIF_ID_VALID operation.

8. Having arrived at a decision about the target process's
system call, the monitoring process can inform the kernel
of its decision using the operation

ioctl(listenfd, SECCOMP_IOCTL_NOTIF_SEND, respptr)

where the third argument is a pointer to a
'struct seccomp_notif_resp'. [Some more details
needed here, but I still don't yet understand fully
the semantics of the 'error' and 'val' fields.]
]]

I'd like to include text along the above lines at the start of
the "Userspace Notification" section. Could you take a look at
the above and let me know any errors, improved terminology, or
other improvements generally. I'll then work that text into
the page.
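As a rough sketch of how steps 5 to 8 might fit together in C, assuming
headers from a kernel that ships the feature: the function names
get_notif_sizes() and handle_one_notification() are invented here, and
the "deny with EPERM" policy in step 8 is only a placeholder, given
that the 'error'/'val' semantics are still an open question.

```c
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Step 5: ask the kernel how large the notification structures are.
 * Returns 0 on success, -1 on error. */
int get_notif_sizes(struct seccomp_notif_sizes *sizes)
{
    return (int)syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}

/* Steps 6 to 8 for a single event. 'req' and 'resp' must point to
 * buffers of at least sizes->seccomp_notif and
 * sizes->seccomp_notif_resp bytes, respectively.
 * Returns 0 on success, -1 on error (errno is set by ioctl()). */
int handle_one_notification(int listenfd,
                            const struct seccomp_notif_sizes *sizes,
                            struct seccomp_notif *req,
                            struct seccomp_notif_resp *resp)
{
    memset(req, 0, sizes->seccomp_notif);
    memset(resp, 0, sizes->seccomp_notif_resp);

    /* Step 6: block until an event is available, then read it. */
    if (ioctl(listenfd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1)
        return -1;

    /* Step 7: a real monitor would first open /proc/PID/mem (or
     * similar) for req->pid, then make this check, and only then use
     * the opened resource. Here only the check itself is shown. */
    if (ioctl(listenfd, SECCOMP_IOCTL_NOTIF_ID_VALID, &req->id) == -1)
        return -1;

    /* Step 8: placeholder policy, make the call look like it failed
     * with EPERM. */
    resp->id = req->id;
    resp->error = -EPERM;
    resp->val = 0;
    resp->flags = 0;

    return ioctl(listenfd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```

A caller would allocate the two buffers once, using the sizes from
get_notif_sizes(), and then call handle_one_notification() in a loop.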

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Subject: Re: [PATCH 2/2] seccomp.2: document userspace notification

> 7. The monitoring process can use the information in the
> 'struct seccomp_notif' to make a determination about the
> system call being made by the target process. This
> structure includes a 'data' field that is the same
> 'struct seccomp_data' that is passed to a BPF filter.
>
> In addition, the monitoring process may make use of other
> information that is available from user space. For example,
> it may inspect the memory of the target process (whose PID
> is provided in the 'struct seccomp_notif') using
> /proc/PID/mem, which includes inspecting the values
> pointed to by system call arguments (whose location is
> available in 'seccomp_notif.data.args'). However, when using
> the target process PID in this way, one must guard against
> PID re-use race conditions using the ioctl()
> SECCOMP_IOCTL_NOTIF_ID_VALID operation.
>
> 8. Having arrived at a decision about the target process's
> system call, the monitoring process can inform the kernel
> of its decision using the operation
>
> ioctl(listenfd, SECCOMP_IOCTL_NOTIF_SEND, respptr)
>
> where the third argument is a pointer to a
> 'struct seccomp_notif_resp'. [Some more details
> needed here, but I still don't yet understand fully
> the semantics of the 'error' and 'val' fields.]

So clearly, I misunderstood these last two steps.

(7) is something like: discover information in userspace
as required; perform userspace actions if appropriate
(perhaps doing the system call operation "on behalf of" the
target process).


(8) is something like:
set 'error' and 'val' to return info to the target process:
* error != 0 ==> make it look like the syscall failed,
with 'errno' set to that value
* error == 0 ==> make it look like the syscall succeeded
and returned 'val'

Right?

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2019-03-01 14:57:07

by Tycho Andersen

Subject: Re: [PATCH 2/2] seccomp.2: document userspace notification

On Thu, Feb 28, 2019 at 01:52:19PM +0100, Michael Kerrisk (man-pages) wrote:
> > +a notification will be sent to this fd. See "Userspace Notification" below for
>
> s/fd/file descriptor/ throughout please.

Will do.

> > +more details.
>
> I think the description here could be better worded as something like:
>
> SECCOMP_FILTER_FLAG_NEW_LISTENER
> Register a new filter, as usual, but on success return a
> new file descriptor that provides user-space notifications.
> When the filter returns SECCOMP_RET_USER_NOTIF, a notification
> will be provided via this file descriptor. The close-on-exec
> flag is automatically set on the new file descriptor. ...
>
> > .RE
> > .TP
> > .BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)"
> > @@ -606,6 +613,17 @@ file.
> > .TP
> > .BR SECCOMP_RET_ALLOW
> > This value results in the system call being executed.
> > +.TP
> > +.BR SECCOMP_RET_USER_NOTIF " (since Linux 4.21)"
>
> Please see the start of this hanging list in the manual page.
> Can you confirm that SECCOMP_RET_USER_NOTIF really is the lowest
> in the precedence order of all of the filter return values?

Oh, no, I didn't realize it was in a particular order. I'll switch it.

> > +Forwards the syscall to an attached listener in userspace to allow userspace to
>
> s/syscall/system call/ throughout please.

Will do.

> > +decide what to do with the syscall. If there is no attached listener (either
> > +because the filter was not installed with the
> > +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> > +or because the fd was closed), the filter returns
> > +.BR ENOSYS
> > +similar to what happens when a filter returns
> > +.BR SECCOMP_RET_TRACE
> > +and there is no tracer. See "Userspace Notification" below for more details.
> > .PP
> > If an action value other than one of the above is specified,
> > then the filter action is treated as either
> > @@ -693,10 +711,75 @@ Otherwise, if kernel auditing is enabled and the process is being audited
> > the action is logged.
> > .IP *
> > Otherwise, the action is not logged.
> > +.SS Userspace Notification
> > +Interactin userspace notification functionality in seccomp is primarily done
> > +via file descriptor.
>
> That sentence is somewhat garbled. Even if I correct the typo,
> I still don't really understand it. Could you try again?

Maybe "Userspace interacts with the notification functionality via
a file descriptor"? Perhaps we can just delete it.

> > This file descriptor can be obtained by passing
> > +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> > +as a filter flag when installing a new filter.
> > +.PP
> > +Once an fd is obtained, userspace can wait for events using
> > +.BR poll ()
>
> and presumably select() and epoll?
>
> What kind of notification event do poll(2)/epoll(7)/select(2) provide?
> It looks to be POLLIN/EPOLLIN/readable. Is that correct?
> These details should be noted here. More generally, my assumption
> is that you can use poll(2)/epoll(7)/select(2) to find out about the
> availability of an event, and then use SECCOMP_IOCTL_NOTIF_RECV
> to read that event. Correct? The text needs to be more explicit on
> this.

Yes.

> > +or
> > +.BR ioctl ().
> > +The supported
> > +.BR ioctl ()
> > +operations on a notification fd are:
> > +.TP
> > +.BR SECCOMP_IOCTL_NOTIF_RECV
> > +The argument to this command should be a pointer to a struct seccomp_notif:
> > +.IP
> > +.in +4n
> > +.EX
> > +struct seccomp_notif {
> > + __u64 id;
> > + __u32 pid;
> > + __u32 flags;
> > + struct seccomp_data data;
> > +};
> > +.EE
> > +.in
> > +.IP
> > +The id field is a filter-unique id for this syscall, and should be supplied in
> > +the response. It can additionally be used in
> > +.BR SECCOMP_IOCTL_ID_VALID
> > +to test whether or not the request is still alive. The pid here is the pid of
> > +the task as visible from the listener's pid namespace. If the pid is not
> > +visible, it is 0. Flags is unused right now.
>
> So, is 'flags' explicitly zeroed by the kernel? The manual page should note
> this.

Yes.

> > struct seccomp_data is the same
> > +data that would be passed to a filter running in the kernel.
>
> What are the semantics if multiple monitoring processes are employing
> SECCOMP_IOCTL_NOTIF_RECV? Does only one of them get awoken? (Which one?)
> Or do they all get woken up and get a 'struct seccomp_notif'? The semantics
> should be detailed here.

Ok. (Only one notification is sent.)

> > +.TP
> > +.BR SECCOMP_IOCTL_NOTIF_SEND
> > +The argument to this command should be a pointer to a struct seccomp_notif_resp:
> > +.IP
> > +.in +4n
> > +.EX
> > +struct seccomp_notif_resp {
> > + __u64 id;
> > + __s64 val;
> > + __s32 error;
> > + __u32 flags;
> > +};
> > +.EE
> > +.in
> > +.IP
> > +The id should be the id from struct seccomp_notif; if error is non-zero, it is
> > +used as the return value, otherwise val is. Flags must be 0 right now.
>
> There really isn't enough detail here to help the reader understand. What
> is the purpose of 'error' vs 'val'? In particular, I'm assuming that the
> monitoring process has (at least) two choices with respect to the system
> call being made by the target process:
>
> (1) Cause the system call to fail with an error.
> (2) Allow the system call to proceed.
>
> How are 'error' and 'val' used in these two cases?
> Are there more than these two cases? How else is 'val' used?

This is basically just an artifact that's exposed because some arches
use different registers to return errors vs. successful values. I'll
make a note of that and how the precedence works.

> Also, what happens if the monitoring process does a SECCOMP_IOCTL_NOTIF_RECV,
> but does not subsequently do a SECCOMP_IOCTL_NOTIF_SEND? This should be
> detailed in the manual page.
>
> What are the semantics if multiple monitoring processes are employing
> SECCOMP_IOCTL_NOTIF_SEND? An error for all but the first? The semantics
> should be detailed here.

Ok.

> > +.TP
> > +.BR SECCOMP_IOCTL_NOTIF_ID_VALID
> > +The argument to this command is just the id to be tested. There is a pid reuse
> > +issue where a task may make a syscall, die, its pid get re-used, and the
> > +listener may operate on the wrong pid. So the process for handling modifying a
> > +pid's state in some way would be to open the pid's resources (/proc/pid/mem or
> > +similar), do this call to verify that the id is still valid, and then use the
> > +resource. See the sample below for more detail.
> > +.PP
> > +A complete example is available in the kernel tree at
> > +.IR samples/seccomp/user-trap.c .
> > .SH RETURN VALUE
> > On success,
> > .BR seccomp ()
> > -returns 0.
> > +returns 0, unless
> > +.BR SECCOMP_FILTER_FLAG_NEW_LISTENER
> > +was specified, in which case it returns the fd number for the new listener fd.
> > On error, if
> > .BR SECCOMP_FILTER_FLAG_TSYNC
> > was used,
>
> More generally though, I'm struggling with this patch because "the
> big picture" is lacking. Here's my estimate of what I think that
> picture is:
>
> [[
> Instead of having the seccomp() filter decision being made
> within the filter itself, it is possible to pass off the
> decision making to a (different) user-space process that
> makes the decision. This is done using the following steps.
>
> 1. The "target process" (i.e., process that will
> establish a seccomp filter) and the process that will make
> decisions about system calls ("the monitoring process")
> establish a connection using a UNIX domain socket. This
> socket will subsequently be used to exchange a file
> descriptor.
>
> 2. The "target process" establishes a seccomp BPF filter in the
> usual manner, but with two notable differences:
>
> * The seccomp() 'flags' argument includes the flag
> SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return
> value of a successful seccomp() call is a new "listening"
> file descriptor that can be used for monitoring.
> * In cases where it is needed, the BPF filter returns the
> special action SECCOMP_RET_USER_NOTIF. This return value
> will trigger a notification event.
>
> 3. The "target process" passes the "listening file descriptor"
> to the "monitoring process" via the UNIX domain socket.

or some other means, it doesn't have to be with SCM_RIGHTS.

> 4. The target process then performs its workload, which includes
> system calls that will be controlled by the BPF filter.
> Whenever one of these system calls causes the BPF filter to
> return SECCOMP_RET_USER_NOTIF, a notification event is
> generated on the listening file descriptor.
>
> 5. The monitoring process will receive notification events
> on the listening file descriptor. These events are returned
> as structures of type 'seccomp_notif'. Because this structure
> and its size may evolve across kernel versions, the monitoring
> process must first determine the size of this structure using
> the seccomp() SECCOMP_GET_NOTIF_SIZES operation, which
> returns a structure of type 'seccomp_notif_sizes'. The
> monitoring process allocates a buffer of size
> 'seccomp_notif_sizes.seccomp_notif' bytes to receive
> monitoring events. In addition, the monitoring process
> allocates another buffer of size
> 'seccomp_notif_sizes.seccomp_notif_resp' bytes for the
> response (a 'struct seccomp_notif_resp') that it will
> provide to the kernel in order to advise how the system
> call being made by the target process shall be treated.
>
> 6. The monitoring process can now repeatedly monitor the
> listening file descriptor.
>
> When a SECCOMP_RET_USER_NOTIF-triggered event occurs,
> the file descriptor will test as readable for
> poll(2)/epoll(7)/select(2). The call
>
> ioctl(listenfd, SECCOMP_IOCTL_NOTIF_RECV, reqptr)
>
> can be used to read info on the event; this operation
> blocks until an event is available. It populates
> the 'struct seccomp_notif' pointed to by the third
> argument with information about the system call
> that is being attempted by the target process.
>
> 7. The monitoring process can use the information in the
> 'struct seccomp_notif' to make a determination about the
> system call being made by the target process. This
> structure includes a 'data' field that is the same
> 'struct seccomp_data' that is passed to a BPF filter.
>
> In addition, the monitoring process may make use of other
> information that is available from user space. For example,
> it may inspect the memory of the target process (whose PID
> is provided in the 'struct seccomp_notif') using
> /proc/PID/mem, which includes inspecting the values

Again, could be ptrace() or /proc/self/map_files/* or whatever,
/proc/PID/mem is just the most convenient.

> pointed to by system call arguments (whose location is
> available in 'seccomp_notif.data.args'). However, when using
> the target process PID in this way, one must guard against
> PID re-use race conditions using the ioctl()
> SECCOMP_IOCTL_NOTIF_ID_VALID operation.
>
> 8. Having arrived at a decision about the target process's
> system call, the monitoring process can inform the kernel
> of its decision using the operation
>
> ioctl(listenfd, SECCOMP_IOCTL_NOTIF_SEND, respptr)
>
> where the third argument is a pointer to a
> 'struct seccomp_notif_resp'. [Some more details
> needed here, but I still don't yet understand fully
> the semantics of the 'error' and 'val' fields.]
>
> I'd like to include text along the above lines at the start of
> the "Userspace Notification" section. Could you take a look at
> the above and let me know any errors, improved terminology, or
> other improvements generally. I'll then work that text into
> the page.

Yep, looks good to me, thanks.
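
For the SCM_RIGHTS variant of step 3, the usual ancillary-data dance
would be something like the sketch below. Nothing in it is specific to
seccomp, and the names send_fd() and recv_fd() are just for
illustration. Any other way of getting the file descriptor into the
monitoring process works equally well.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send file descriptor 'fd' over the connected UNIX domain socket
 * 'sock'. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'x';           /* one byte of ordinary data to carry it */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;   /* forces suitable alignment of buf */
    } ctrl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor sent with send_fd(). Returns the new
 * descriptor, or -1 on error or if no descriptor was attached. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```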

Tycho

2019-03-01 14:58:03

by Tycho Andersen

Subject: Re: [PATCH 2/2] seccomp.2: document userspace notification

On Thu, Feb 28, 2019 at 02:25:55PM +0100, Michael Kerrisk (man-pages) wrote:
> > 7. The monitoring process can use the information in the
> > 'struct seccomp_notif' to make a determination about the
> > system call being made by the target process. This
> > structure includes a 'data' field that is the same
> > 'struct seccomp_data' that is passed to a BPF filter.
> >
> > In addition, the monitoring process may make use of other
> > information that is available from user space. For example,
> > it may inspect the memory of the target process (whose PID
> > is provided in the 'struct seccomp_notif') using
> > /proc/PID/mem, which includes inspecting the values
> > pointed to by system call arguments (whose location is
> > available in 'seccomp_notif.data.args'). However, when using
> > the target process PID in this way, one must guard against
> > PID re-use race conditions using the ioctl()
> > SECCOMP_IOCTL_NOTIF_ID_VALID operation.
> >
> > 8. Having arrived at a decision about the target process's
> > system call, the monitoring process can inform the kernel
> > of its decision using the operation
> >
> > ioctl(listenfd, SECCOMP_IOCTL_NOTIF_SEND, respptr)
> >
> > where the third argument is a pointer to a
> > 'struct seccomp_notif_resp'. [Some more details
> > needed here, but I still don't yet understand fully
> > the semantics of the 'error' and 'val' fields.]
>
> So clearly, I misunderstood these last two steps.
>
> (7) is something like: discover information in userspace
> as required; perform userspace actions if appropriate
> (perhaps doing the system call operation "on behalf of" the
> target process).
>
>
> (8) is something like:
> set 'error' and 'val' to return info to the target process:
> * error != 0 ==> make it look like the syscall failed,
> with 'errno' set to that value
> * error == 0 ==> make it look like the syscall succeeded
> and returned 'val'
>
> Right?

Yep, exactly.

Tycho

Subject: Re: [PATCH 2/2] seccomp.2: document userspace notification

On 3/1/19 4:19 PM, Tycho Andersen wrote:
> On Fri, Mar 01, 2019 at 04:16:27PM +0100, Michael Kerrisk (man-pages) wrote:
>> Hello Tycho,
>>
>> On 3/1/19 3:53 PM, Tycho Andersen wrote:
>>> On Thu, Feb 28, 2019 at 01:52:19PM +0100, Michael Kerrisk (man-pages) wrote:
>>>>> +a notification will be sent to this fd. See "Userspace Notification" below for
>>>>
>>>> s/fd/file descriptor/ throughout please.
>>>
>>> Will do.
>>>
>>>>> +more details.
>>>>
>>>> I think the description here could be better worded as something like:
>>>>
>>>> SECCOMP_FILTER_FLAG_NEW_LISTENER
>>>> Register a new filter, as usual, but on success return a
>>>> new file descriptor that provides user-space notifications.
>>>> When the filter returns SECCOMP_RET_USER_NOTIF, a notification
>>>> will be provided via this file descriptor. The close-on-exec
>>>> flag is automatically set on the new file descriptor. ...
>>>>
>>>>> .RE
>>>>> .TP
>>>>> .BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)"
>>>>> @@ -606,6 +613,17 @@ file.
>>>>> .TP
>>>>> .BR SECCOMP_RET_ALLOW
>>>>> This value results in the system call being executed.
>>>>> +.TP
>>>>> +.BR SECCOMP_RET_USER_NOTIF " (since Linux 4.21)"
>>>>
>>>> Please see the start of this hanging list in the manual page.
>>>> Can you confirm that SECCOMP_RET_USER_NOTIF really is the lowest
>>>> in the precedence order of all of the filter return values?
>>>
>>> Oh, no, I didn't realize it was in a particular order. I'll switch it.
>>
>> Just for my immediate education (I'm experimenting right now),
>> where/how does it fit in the precedence order?
>
> In between RET_ERRNO and RET_TRACE; see include/uapi/linux/seccomp.h
> for details.

Confirmed by experiment :-).
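
For reference, the ordering can also be read off the action values
themselves. As far as I can tell, when several filters are installed
the kernel compares the action parts as signed 32-bit integers and the
lowest value wins. The sketch below copies the values from
include/uapi/linux/seccomp.h under made-up SKETCH_ names (so that it
builds against older headers too); takes_precedence() is likewise just
an illustration, not a kernel function.

```c
#include <stdint.h>

/* Action values copied from include/uapi/linux/seccomp.h. */
#define SKETCH_RET_KILL_PROCESS 0x80000000U
#define SKETCH_RET_KILL_THREAD  0x00000000U
#define SKETCH_RET_TRAP         0x00030000U
#define SKETCH_RET_ERRNO        0x00050000U
#define SKETCH_RET_USER_NOTIF   0x7fc00000U
#define SKETCH_RET_TRACE        0x7ff00000U
#define SKETCH_RET_LOG          0x7ffc0000U
#define SKETCH_RET_ALLOW        0x7fff0000U

/* Mask selecting the action part of a filter return value. */
#define SKETCH_RET_ACTION_FULL  0xffff0000U

/* Returns 1 if action 'a' takes precedence over action 'b', else 0.
 * The cast to int32_t relies on the usual two's-complement conversion,
 * which makes KILL_PROCESS (0x80000000) the most negative value. */
int takes_precedence(uint32_t a, uint32_t b)
{
    return (int32_t)(a & SKETCH_RET_ACTION_FULL) <
           (int32_t)(b & SKETCH_RET_ACTION_FULL);
}
```

With these values, SKETCH_RET_ERRNO takes precedence over
SKETCH_RET_USER_NOTIF, which in turn takes precedence over
SKETCH_RET_TRACE.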

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: [PATCH 2/2] seccomp.2: document userspace notification

On 3/1/19 3:53 PM, Tycho Andersen wrote:
> On Thu, Feb 28, 2019 at 02:25:55PM +0100, Michael Kerrisk (man-pages) wrote:
>>> 7. The monitoring process can use the information in the
>>> 'struct seccomp_notif' to make a determination about the
>>> system call being made by the target process. This
>>> structure includes a 'data' field that is the same
>>> 'struct seccomp_data' that is passed to a BPF filter.
>>>
>>> In addition, the monitoring process may make use of other
>>> information that is available from user space. For example,
>>> it may inspect the memory of the target process (whose PID
>>> is provided in the 'struct seccomp_notif') using
>>> /proc/PID/mem, which includes inspecting the values
>>> pointed to by system call arguments (whose location is
>>> available in 'seccomp_notif.data.args'). However, when using
>>> the target process PID in this way, one must guard against
>>> PID re-use race conditions using the ioctl()
>>> SECCOMP_IOCTL_NOTIF_ID_VALID operation.
>>>
>>> 8. Having arrived at a decision about the target process's
>>> system call, the monitoring process can inform the kernel
>>> of its decision using the operation
>>>
>>> ioctl(listenfd, SECCOMP_IOCTL_NOTIF_SEND, respptr)
>>>
>>> where the third argument is a pointer to a
>>> 'struct seccomp_notif_resp'. [Some more details
>>> needed here, but I still don't yet understand fully
>>> the semantics of the 'error' and 'val' fields.]
>>
>> So clearly, I misunderstood these last two steps.
>>
>> (7) is something like: discover information in userspace
>> as required; perform userspace actions if appropriate
>> (perhaps doing the system call operation "on behalf of" the
>> target process).
>>
>>
>> (8) is something like:
>> set 'error' and 'val' to return info to the target process:
>> * error != 0 ==> make it look like the syscall failed,
>> with 'errno' set to that value

That piece should be amended:
error < 0 ==> make it look like the syscall failed.
error > 0 ==> make it look like the syscall succeeded
and returned 'error'

Is that really supposed to happen?

>> * error == 0 ==> make it look like the syscall succeeded
>> and returned 'val'
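
To make the two cases concrete, a monitor could fill in the response
along the lines of the sketch below. The struct is a local mirror of
the 'struct seccomp_notif_resp' layout quoted in the patch, declared
here only so that the example does not depend on kernel headers, and
resp_fail()/resp_succeed() are invented names. Whether positive 'error'
values are meant to be accepted is exactly the open question above, so
the sketch only ever stores zero or a negative value.

```c
#include <errno.h>      /* EPERM and friends, for callers */
#include <stdint.h>
#include <string.h>

/* Local mirror of the 'struct seccomp_notif_resp' layout from the
 * patch (illustration only, not the kernel header). */
struct notif_resp_sketch {
    uint64_t id;
    int64_t  val;
    int32_t  error;
    uint32_t flags;
};

/* Make the target's system call look like it failed with errno 'err',
 * where 'err' is a positive errno value such as EPERM. */
void resp_fail(struct notif_resp_sketch *resp, uint64_t id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;     /* negative: looks like a failed call */
}

/* Make the target's system call look like it succeeded and returned
 * 'val'. */
void resp_succeed(struct notif_resp_sketch *resp, uint64_t id, int64_t val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;        /* 'error' stays 0, so 'val' is returned */
}
```

In both helpers 'flags' is left at 0, as the patch text requires.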

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: [PATCH 2/2] seccomp.2: document userspace notification

Hello Tycho,

On 3/1/19 3:53 PM, Tycho Andersen wrote:
> On Thu, Feb 28, 2019 at 01:52:19PM +0100, Michael Kerrisk (man-pages) wrote:
>>> +a notification will be sent to this fd. See "Userspace Notification" below for
>>
>> s/fd/file descriptor/ throughout please.
>
> Will do.
>
>>> +more details.
>>
>> I think the description here could be better worded as something like:
>>
>> SECCOMP_FILTER_FLAG_NEW_LISTENER
>> Register a new filter, as usual, but on success return a
>> new file descriptor that provides user-space notifications.
>> When the filter returns SECCOMP_RET_USER_NOTIF, a notification
>> will be provided via this file descriptor. The close-on-exec
>> flag is automatically set on the new file descriptor. ...
>>
>>> .RE
>>> .TP
>>> .BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)"
>>> @@ -606,6 +613,17 @@ file.
>>> .TP
>>> .BR SECCOMP_RET_ALLOW
>>> This value results in the system call being executed.
>>> +.TP
>>> +.BR SECCOMP_RET_USER_NOTIF " (since Linux 4.21)"
>>
>> Please see the start of this hanging list in the manual page.
>> Can you confirm that SECCOMP_RET_USER_NOTIF really is the lowest
>> in the precedence order of all of the filter return values?
>
> Oh, no, I didn't realize it was in a particular order. I'll switch it.

Just for my immediate education (I'm experimenting right now),
where/how does it fit in the precedence order?


Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2019-03-01 17:31:52

by Tycho Andersen

Subject: Re: [PATCH 2/2] seccomp.2: document userspace notification

On Fri, Mar 01, 2019 at 04:16:27PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Tycho,
>
> On 3/1/19 3:53 PM, Tycho Andersen wrote:
> > On Thu, Feb 28, 2019 at 01:52:19PM +0100, Michael Kerrisk (man-pages) wrote:
> >>> +a notification will be sent to this fd. See "Userspace Notification" below for
> >>
> >> s/fd/file descriptor/ throughout please.
> >
> > Will do.
> >
> >>> +more details.
> >>
> >> I think the description here could be better worded as something like:
> >>
> >> SECCOMP_FILTER_FLAG_NEW_LISTENER
> >> Register a new filter, as usual, but on success return a
> >> new file descriptor that provides user-space notifications.
> >> When the filter returns SECCOMP_RET_USER_NOTIF, a notification
> >> will be provided via this file descriptor. The close-on-exec
> >> flag is automatically set on the new file descriptor. ...
> >>
> >>> .RE
> >>> .TP
> >>> .BR SECCOMP_GET_ACTION_AVAIL " (since Linux 4.14)"
> >>> @@ -606,6 +613,17 @@ file.
> >>> .TP
> >>> .BR SECCOMP_RET_ALLOW
> >>> This value results in the system call being executed.
> >>> +.TP
> >>> +.BR SECCOMP_RET_USER_NOTIF " (since Linux 4.21)"
> >>
> >> Please see the start of this hanging list in the manual page.
> >> Can you confirm that SECCOMP_RET_USER_NOTIF really is the lowest
> >> in the precedence order of all of the filter return values?
> >
> > Oh, no, I didn't realize it was in a particular order. I'll switch it.
>
> Just for my immediate education (I'm experimenting right now),
> where/how does it fit in the precedence order?

In between RET_ERRNO and RET_TRACE; see include/uapi/linux/seccomp.h
for details.

Tycho