Ignore this one (it's an older version of the openat2.2 patch) -- I sent
it by accident.
On 2019-10-04, Aleksa Sarai <[email protected]> wrote:
> Signed-off-by: Aleksa Sarai <[email protected]>
> ---
> man2/open.2 | 5 +
> man2/openat2.2 | 381 +++++++++++++++++++++++++++++++++++++++++
> man7/path_resolution.7 | 57 ++++--
> 3 files changed, 426 insertions(+), 17 deletions(-)
> create mode 100644 man2/openat2.2
>
> diff --git a/man2/open.2 b/man2/open.2
> index 7217fe056e5e..a0b43394bbee 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
> .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
> .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
> ", mode_t " mode );
> +.PP
> +/* Docuented separately, in \fBopenat2\fP(2). */
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
> .fi
> .PP
> .in -4n
> @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
> .B O_DIRECTORY
> is ignored).
> .SH SEE ALSO
> +.BR openat2 (2),
> .BR chmod (2),
> .BR chown (2),
> .BR close (2),
> diff --git a/man2/openat2.2 b/man2/openat2.2
> new file mode 100644
> index 000000000000..c43c76046243
> --- /dev/null
> +++ b/man2/openat2.2
> @@ -0,0 +1,381 @@
> +.\" Copyright (C) 2019 Aleksa Sarai <[email protected]>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +openat2 \- open and possibly create a file (extended)
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/types.h>
> +.B #include <sys/stat.h>
> +.B #include <fcntl.h>
> +.PP
> +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> +const struct open_how *" how ", size_t " size ");
> +.fi
> +.PP
> +.IR Note :
> +There is no glibc wrapper for this system call; see NOTES.
> +.SH DESCRIPTION
> +The
> +.BR openat2 ()
> +system call is an extension of
> +.BR openat (2)
> +and provides a superset of its functionality. Rather than taking a single
> +.I flag
> +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> +seamless future extensions.
> +.PP
> +.I size
> +must be set to
> +.IR "sizeof(struct open_how)" ,
> +to facilitate future extensions (see the "Extensibility" section of the
> +\fBNOTES\fP for more detail on how extensions are handled.)
> +
> +.SS The open_how structure
> +The following structure indicates how
> +.I pathname
> +should be opened, and acts as a superset of the
> +.IR flag " and " mode
> +arguments to
> +.BR openat (2).
> +.PP
> +.in +4n
> +.EX
> +struct open_how {
> + uint32_t flags; /* open(2)-style O_* flags. */
> + union {
> + uint16_t mode; /* File mode bits for new file creation. */
> + uint16_t upgrade_mask; /* Restrict how O_PATHs may be re-opened. */
> + };
> + uint32_t resolve; /* RESOLVE_* path-resolution flags. */
> +};
> +.EE
> +.in
> +.PP
> +Any future extensions to
> +.BR openat2 ()
> +will be implemented as new fields appended to the above structure, with the
> +zero value of the new fields acting as though the extension were not present.
> +.PP
> +The meaning of each field is as follows:
> +.RS
> +
> +.I flags
> +.RS
> +The file creation and status flags to use for this operation. All of the
> +.B O_*
> +flags defined for
> +.BR openat (2)
> +are valid
> +.BR openat2 ()
> +flag values.
> +.RE
> +
> +.I upgrade_mask
> +.RS
> +Restrict with which
> +.I access modes
> +the returned
> +.B O_PATH
> +descriptor may be re-opened (either through
> +.B O_EMPTYPATH
> +or
> +.IR /proc/self/fd/ .)
> +This field may only be set to a non-zero value if
> +.I flags
> +contains
> +.BR O_PATH .
> +By default, an
> +.B O_PATH
> +file descriptor of an ordinary file may be re-opened with with any access mode (but an
> +.B O_PATH
> +file descriptor of a magic-link may only be re-opened with access modes that
> +the original magic-link possessed). The full list of
> +.I upgrade_mask
> +flags is given below.
> +.TP
> +.B UPGRADE_NOREAD
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for reading (i.e.
> +.BR O_RDONLY " or " O_RDWR .)
> +.TP
> +.B UPGRADE_NOWRITE
> +Do not permit the
> +.B O_PATH
> +file descriptor to be re-opened for writing (i.e.
> +.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
> +.RE
> +
> +.I resolve
> +.RS
> +Change how the components of
> +.I pathname
> +will be resolved (see
> +.BR path_resolution (7)
> +for background information.) The primary use-case for these flags is to allow
> +trusted programs to restrict how un-trusted paths (or paths inside un-trusted
> +directories) are resolved. The full list of
> +.I resolve
> +flags is given below.
> +.TP
> +.B RESOLVE_NO_XDEV
> +Disallow all mount-point crossings during path resolution (including
> +all bind-mounts).
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as bind-mounts are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_SYMLINKS
> +Disallow all symlink resolution during path resolution. If the trailing
> +component is a symlink, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the symlink will be returned. This option implies
> +.BR RESOLVE_NO_MAGICLINKS .
> +
> +Users of this flag are encouraged to make its use configurable (unless it is
> +used for a specific security purpose), as symlinks are very widely used by
> +end-users and thus enabling this flag globally may result in spurious errors on
> +some systems.
> +.TP
> +.B RESOLVE_NO_MAGICLINKS
> +Disallow all magic-link resolution during path resolution. If the trailing
> +component is a magic-link, and
> +.I flags
> +contains both
> +.BR O_PATH " and " O_NOFOLLOW ","
> +then an
> +.B O_PATH
> +file descriptor referencing the magic-link will be returned.
> +
> +Magic-links are symlink-like objects that are most notably found in
> +.BR proc (5)
> +(examples include
> +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> +Due to the potential danger of unknowingly opening these magic-links, it may be
> +preferable for users to disable their resolution entirely (see
> +.BR symlink (7)
> +for more details.)
> +.TP
> +.B RESOLVE_BENEATH
> +Do not permit the path resolution to succeed if any component of the resolution
> +is not a descendant of the directory indicated by
> +.IR dirfd .
> +This results in absolute symlinks (and absolute values of
> +.IR pathname )
> +to be rejected. Magic-link resolution is also not permitted.
> +
> +.TP
> +.B RESOLVE_IN_ROOT
> +Temporarily treat
> +.I dirfd
> +as the root of the filesystem (as though the user called
> +.BR chroot (2)
> +with
> +.IR dirfd
> +as the argument.) Absolute symlinks and ".." path components will be scoped to
> +.IR dirfd . Magic-link resolution is also not permitted.
> +
> +However, unlike
> +.BR chroot (2)
> +(which changes the filesystem root persistently for an entire thread-group),
> +.B RESOLVE_IN_ROOT
> +allows a program to efficiently restrict path resolution for only certain
> +operations. It also has several hardening features (such as not permitting
> +magic-link resolution) which
> +.BR chroot (2)
> +does not.
> +.RE
> +
> +.RE
> +
> +.PP
> +Unlike
> +.BR openat (2),
> +any unknown flags set in fields of
> +.I how
> +will result in an error, rather than being ignored. In addition, an error will
> +be returned if the value of the
> +.IR mode " and " upgrade_mask
> +union is non-zero unless:
> +.RS
> +.IP * 3
> +.I flags
> +indicates that a new file will be created (it contains
> +.BR O_CREAT " or " O_TMPFILE ),
> +in which case
> +.I mode
> +may be any valid file mode.
> +.IP *
> +.I flags
> +contains
> +.BR O_PATH ,
> +in which case
> +.I upgrade_mask
> +must only contain valid
> +.B UPGRADE_*
> +flags.
> +.RE
> +
> +.SH RETURN VALUE
> +On success, a new file descriptor is returned. On error, -1 is returned, and
> +.I errno
> +is set appropriately.
> +
> +.SH ERRORS
> +The set of errors returned by
> +.BR openat2 ()
> +includes all of the errors returned by
> +.BR openat (2),
> +as well as the following additional errors:
> +.TP
> +.B EINVAL
> +An unknown flag or invalid value was specified in
> +.IR how .
> +.TP
> +.B EINVAL
> +.I size
> +was smaller than any known version of
> +.IR "struct open_how" .
> +.TP
> +.B E2BIG
> +An extension was specified in
> +.IR how ,
> +which the current kernel does not support (see the "Extensibility" section of
> +the \fBNOTES\fP for more detail on how extensions are handled.)
> +.TP
> +.B EAGAIN
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and the kernel could not ensure that a ".." component didn't escape (due to a
> +race condition or potential attack). Callers may choose to retry the
> +.BR openat2 ()
> +call.
> +.TP
> +.B EXDEV
> +.I resolve
> +contains either
> +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> +and a path component attempted to escape the root of the resolution.
> +
> +.TP
> +.B EXDEV
> +.I resolve
> +contains
> +.BR RESOLVE_NO_XDEV ,
> +and a path component attempted to cross a mount-point.
> +
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_SYMLINKS ,
> +and one of the path components was a symlink.
> +.TP
> +.B ELOOP
> +.I resolve
> +contains
> +.BR RESOLVE_NO_MAGICLINKS ,
> +and one of the path components was a magic-link.
> +
> +.SH VERSIONS
> +.BR openat2 ()
> +was added to Linux in kernel 5.FOO.
> +
> +.SH CONFORMING TO
> +This system call is Linux-specific.
> +
> +The semantics of
> +.B RESOLVE_BENEATH
> +were modelled after FreeBSD's
> +.BR O_BENEATH .
> +
> +.SH NOTES
> +Glibc does not provide a wrapper for this system call; call it using
> +.BR syscall (2).
> +
> +.SS Extensibility
> +In order to allow for
> +.I struct open_how
> +to be extended in future kernel revisions,
> +.BR openat2 ()
> +requires userspace to specify what sized
> +.I struct open_how
> +structure they are passing. By providing this information, it is possible for
> +.BR openat2 ()
> +to provide both forwards- and backwards-compatibility \(em with
> +.I size
> +acting as an implicit version number (because new extension fields will always
> +be appended, the size will always increase.) This extensibility design is very
> +similar to other system calls such as
> +.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).
> +
> +If we let
> +.I usize
> +be the size of the structure according to userspace and
> +.I ksize
> +be the size of the structure which the kernel supports, then there are only
> +three cases to consider:
> +
> +.RS
> +.IP * 3
> +If
> +.IR ksize " equals " usize ,
> +then there is no version mismatch and
> +.I how
> +can be used verbatim.
> +.IP *
> +If
> +.IR ksize " is larger than " usize ,
> +then there are some extensions the kernel supports which the userspace program
> +is unaware of. Because all extensions must have their zero values be a no-op,
> +the kernel treats all of the extension fields not set by userspace to have zero
> +values. This provides backwards-compatibility.
> +.IP *
> +If
> +.IR ksize " is smaller than " usize ,
> +then there are some extensions which the userspace program is aware of but the
> +kernel does not support. Because all extensions must have their zero values be
> +a no-op, the kernel can safely ignore the unsupported extension fields if they
> +are all-zero. If any unsupported extension fields are non-zero, then an error
> +is returned. This provides forwards-compatibility.
> +.RE
> +
> +Therefore, most userspace programs will not need to have any special handling
> +of extensions. However, if a userspace program wishes to determine what
> +extensions the running kernel supports, they may conduct a binary search on
> +.IR size
> +(to find the largest value which doesn't produce an error.)
> +
> +.SH SEE ALSO
> +.BR openat (2),
> +.BR path_resolution (7),
> +.BR symlink (7)
> diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> index 85dd354e9a93..3da3e5b614c8 100644
> --- a/man7/path_resolution.7
> +++ b/man7/path_resolution.7
> @@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
> Some UNIX/Linux system calls have as parameter one or more filenames.
> A filename (or pathname) is resolved as follows.
> .SS Step 1: start of the resolution process
> -If the pathname starts with the \(aq/\(aq character,
> -the starting lookup directory
> -is the root directory of the calling process.
> -(A process inherits its
> -root directory from its parent.
> -Usually this will be the root directory
> -of the file hierarchy.
> -A process may get a different root directory
> -by use of the
> +If the pathname starts with the \(aq/\(aq character, the starting lookup
> +directory is the root directory of the calling process. (A process inherits its
> +root directory from its parent. Usually this will be the root directory of the
> +file hierarchy. A process may get a different root directory by use of the
> .BR chroot (2)
> -system call.
> +system call, or may temporarily use a different root directory by using
> +.BR openat2 (2)
> +with the
> +.B RESOLVE_IN_ROOT
> +flag set.
> +.PP
> A process may get an entirely private mount namespace in case
> it\(emor one of its ancestors\(emwas started by an invocation of the
> .BR clone (2)
> @@ -48,16 +48,24 @@ system call that had the
> flag set.)
> This handles the \(aq/\(aq part of the pathname.
> .PP
> -If the pathname does not start with the \(aq/\(aq character, the
> -starting lookup directory of the resolution process is the current working
> -directory of the process.
> -(This is also inherited from the parent.
> -It can be changed by use of the
> +If the pathname does not start with the \(aq/\(aq character, the starting
> +lookup directory of the resolution process is the current working directory of
> +the process \(em or in the case of
> +.BR openat (2)-style
> +syscalls, the
> +.I dfd
> +argument (or the current working directory if
> +.B AT_FDCWD
> +is passed as the
> +.I dfd
> +argumnet). The current working directory is inherited from the parent, and can
> +be changed by use of the
> .BR chdir (2)
> -system call.)
> +syscall.
> .PP
> Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
> Pathnames not starting with a \(aq/\(aq are called relative pathnames.
> +
> .SS Step 2: walk along the path
> Set the current lookup directory to the starting lookup directory.
> Now, for each nonfinal component of the pathname, where a component
> @@ -124,6 +132,13 @@ the kernel's pathname-resolution code
> was reworked to eliminate the use of recursion,
> so that the only limit that remains is the maximum of 40
> resolutions for the entire pathname.
> +.PP
> +The resolution of syscalls during this stage can be blocked by using
> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_SYMLINKS
> +flag set.
> +
> .SS Step 3: find the final entry
> The lookup of the final component of the pathname goes just like
> that of all other components, as described in the previous step,
> @@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
> their conventional meanings, regardless of whether they are
> actually present in the physical filesystem.
> .PP
> -One cannot walk down past the root: "/.." is the same as "/".
> +One cannot walk up past the root: "/.." is the same as "/".
> +
> .SS Mount points
> After a "mount dev path" command, the pathname "path" refers to
> the root of the filesystem hierarchy on the device "dev", and no
> @@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
> One can walk out of a mounted filesystem: "path/.." refers to
> the parent directory of "path",
> outside of the filesystem hierarchy on "dev".
> +.PP
> +Mount-point crossings can be blocked by using
> +.BR openat2 (2),
> +with the
> +.B RESOLVE_NO_XDEV
> +flag set (though note that this also restricts bind-mount crossings).
> +
> .SS Trailing slashes
> If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
> component as in Step 2: it has to exist and resolve to a directory.
> --
> 2.23.0
>
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>