This documents close_range(2) based on information in
278a5fbaed89dacd04e9d052f4594ffd0e0585de and
60997c3d45d9a67daf01c56d805ae4fec37e0bd8.
Signed-off-by: Stephen Kitt <[email protected]>
---
man2/close_range.2 | 112 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 112 insertions(+)
create mode 100644 man2/close_range.2
diff --git a/man2/close_range.2 b/man2/close_range.2
new file mode 100644
index 000000000..62167d9b0
--- /dev/null
+++ b/man2/close_range.2
@@ -0,0 +1,112 @@
+.\" Copyright (c) 2020 Stephen Kitt <[email protected]>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
+.SH NAME
+close_range \- close all file descriptors in a given range
+.SH SYNOPSIS
+.nf
+.B #include <linux/close_range.h>
+.PP
+.BI "int close_range(int " first ", int " last ", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR close_range ()
+system call closes all open file descriptors from
+.I first
+to
+.IR last
+(included).
+.PP
+Errors closing a given file descriptor are currently ignored.
+.PP
+.I flags
+can be set to
+.B CLOSE_RANGE_UNSHARE
+to unshare the range of file descriptors from any other processes,
+.I instead
+of closing them.
+.SH RETURN VALUE
+On success,
+.BR close_range ()
+return 0.
+On error, \-1 is returned and
+.I errno
+is set to indicate the cause of the error.
+.SH ERRORS
+.TP
+.B EINVAL
+.I flags
+is not valid, or
+.I first
+is greater than
+.IR last .
+.TP
+.B EMFILE
+The per-process limit on the number of open file descriptors has been reached
+(see the description of
+.BR RLIMIT_NOFILE
+in
+.BR getrlimit (2)).
+.TP
+.B ENOMEM
+Insufficient kernel memory was available.
+.SH VERSIONS
+.BR close_range ()
+first appeared in Linux 5.9.
+.SH CONFORMING TO
+.BR close_range ()
+is available on Linux and FreeBSD.
+.SH NOTES
+Currently, there is no glibc wrapper for this system call; call it using
+.BR syscall (2).
+.SH USE CASES
+.\" 278a5fbaed89dacd04e9d052f4594ffd0e0585de
+.\" 60997c3d45d9a67daf01c56d805ae4fec37e0bd8
+.SS Closing file descriptors before exec
+File descriptors can be closed safely using
+.PP
+.in +4n
+.EX
+/* we don't want anything past stderr here */
+close_range(3, ~0U, CLOSE_RANGE_UNSHARE);
+execve(....);
+.EE
+.in
+.PP
+.SS Closing all open file descriptors
+This is commonly implemented (on Linux) by listing open file
+descriptors in
+.B /proc/self/fd/
+and calling
+.BR close (2)
+on each one.
+.BR close_range ()
+can take care of this without requiring
+.B /proc
+and with a single system call, which provides significant performance
+benefits.
+.SH SEE ALSO
+.BR close (2)
base-commit: b5dae3959625f5ff378e9edf9139057d1c06bb55
--
2.20.1
Hello Stephen
Thank you for writing this page! Some comments/questions below.
On Tue, 8 Dec 2020 at 22:51, Stephen Kitt <[email protected]> wrote:
>
> This documents close_range(2) based on information in
> 278a5fbaed89dacd04e9d052f4594ffd0e0585de and
> 60997c3d45d9a67daf01c56d805ae4fec37e0bd8.
(Thanks for noting these commit IDs.)
> Signed-off-by: Stephen Kitt <[email protected]>
> ---
> man2/close_range.2 | 112 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 112 insertions(+)
> create mode 100644 man2/close_range.2
>
> diff --git a/man2/close_range.2 b/man2/close_range.2
> new file mode 100644
> index 000000000..62167d9b0
> --- /dev/null
> +++ b/man2/close_range.2
> @@ -0,0 +1,112 @@
> +.\" Copyright (c) 2020 Stephen Kitt <[email protected]>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +close_range \- close all file descriptors in a given range
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/close_range.h>
> +.PP
> +.BI "int close_range(int " first ", int " last ", unsigned int " flags );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR close_range ()
> +system call closes all open file descriptors from
> +.I first
> +to
> +.IR last
> +(included).
> +.PP
> +Errors closing a given file descriptor are currently ignored.
> +.PP
> +.I flags
> +can be set to
> +.B CLOSE_RANGE_UNSHARE
> +to unshare the range of file descriptors from any other processes,
> +.I instead
> +of closing them.
Really "instead of closing them"? I had supposed that rather that this
should be "before closing them". That's also how the kernel code reads
to me, from a quick glance.
> +.SH RETURN VALUE
> +On success,
> +.BR close_range ()
> +return 0.
s/return/returns/
> +On error, \-1 is returned and
> +.I errno
> +is set to indicate the cause of the error.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +.I flags
> +is not valid, or
> +.I first
> +is greater than
> +.IR last .
> +.TP
> +.B EMFILE
> +The per-process limit on the number of open file descriptors has been reached
> +(see the description of
> +.BR RLIMIT_NOFILE
> +in
> +.BR getrlimit (2)).
Given that we are simply closing FDs, how can EMFILE occur?
> +.TP
> +.B ENOMEM
> +Insufficient kernel memory was available.
> +.SH VERSIONS
> +.BR close_range ()
> +first appeared in Linux 5.9.
> +.SH CONFORMING TO
> +.BR close_range ()
> +is available on Linux and FreeBSD.
Here, I think it would be better to write:
close_range()
is a nonstandard function that is also present on FreeBSD.
> +.SH NOTES
> +Currently, there is no glibc wrapper for this system call; call it using
> +.BR syscall (2).
> +.SH USE CASES
> +.\" 278a5fbaed89dacd04e9d052f4594ffd0e0585de
> +.\" 60997c3d45d9a67daf01c56d805ae4fec37e0bd8
> +.SS Closing file descriptors before exec
> +File descriptors can be closed safely using
> +.PP
> +.in +4n
> +.EX
> +/* we don't want anything past stderr here */
> +close_range(3, ~0U, CLOSE_RANGE_UNSHARE);
> +execve(....);
> +.EE
> +.in
> +.PP
.PP is not necessary before a new subsection (.SS).
> +.SS Closing all open file descriptors
> +This is commonly implemented (on Linux) by listing open file
Is it really true that this is common? I suspect not. It's slow, and
relies on /proc being present. I would have thought that more common
is something like:
int maxfd = sysconf(_SC_OPEN_MAX);
if (maxfd == -1) /* Limit is indeterminate... */
maxfd = 16384; /* so take a guess */
for (fd = 0; fd < maxfd; fd++)
close(fd);
I think it's fine to mention the use of a /proc as an (inferior and)
alternative way of doing this. I'm just not sure that "commonly" is
correct.
> +descriptors in
> +.B /proc/self/fd/
> +and calling
> +.BR close (2)
> +on each one.
> +.BR close_range ()
> +can take care of this without requiring
> +.B /proc
> +and with a single system call, which provides significant performance
> +benefits.
> +.SH SEE ALSO
> +.BR close (2)
>
> base-commit: b5dae3959625f5ff378e9edf9139057d1c06bb55
> --
> 2.20.1
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
On Wed, Dec 09, 2020 at 09:50:38AM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Stephen
>
> Thank you for writing this page! Some comments/questions below.
>
> On Tue, 8 Dec 2020 at 22:51, Stephen Kitt <[email protected]> wrote:
> >
> > This documents close_range(2) based on information in
> > 278a5fbaed89dacd04e9d052f4594ffd0e0585de and
> > 60997c3d45d9a67daf01c56d805ae4fec37e0bd8.
>
> (Thanks for noting these commit IDs.)
>
> > Signed-off-by: Stephen Kitt <[email protected]>
> > ---
> > man2/close_range.2 | 112 +++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 112 insertions(+)
> > create mode 100644 man2/close_range.2
> >
> > diff --git a/man2/close_range.2 b/man2/close_range.2
> > new file mode 100644
> > index 000000000..62167d9b0
> > --- /dev/null
> > +++ b/man2/close_range.2
> > @@ -0,0 +1,112 @@
> > +.\" Copyright (c) 2020 Stephen Kitt <[email protected]>
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein. The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +close_range \- close all file descriptors in a given range
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <linux/close_range.h>
> > +.PP
> > +.BI "int close_range(int " first ", int " last ", unsigned int " flags );
> > +.fi
> > +.SH DESCRIPTION
> > +The
> > +.BR close_range ()
> > +system call closes all open file descriptors from
> > +.I first
> > +to
> > +.IR last
> > +(included).
> > +.PP
> > +Errors closing a given file descriptor are currently ignored.
> > +.PP
> > +.I flags
> > +can be set to
> > +.B CLOSE_RANGE_UNSHARE
> > +to unshare the range of file descriptors from any other processes,
> > +.I instead
> > +of closing them.
>
> Really "instead of closing them"? I had supposed that rather that this
> should be "before closing them". That's also how the kernel code reads
> to me, from a quick glance.
It's also mentioned in the commit message. Basically setting
CLOSE_RANGE_UNSHARE is equivalent to:
unshare(CLONE_FILES);
close_range(<start>, <end>);
Christian
On Tue, Dec 08, 2020 at 10:51:33PM +0100, Stephen Kitt wrote:
> This documents close_range(2) based on information in
> 278a5fbaed89dacd04e9d052f4594ffd0e0585de and
> 60997c3d45d9a67daf01c56d805ae4fec37e0bd8.
>
> Signed-off-by: Stephen Kitt <[email protected]>
> ---
Hey Stephen,
Thanks for working on this that's an early Christmas present as it gets
an item off my todo list!
> man2/close_range.2 | 112 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 112 insertions(+)
> create mode 100644 man2/close_range.2
>
> diff --git a/man2/close_range.2 b/man2/close_range.2
> new file mode 100644
> index 000000000..62167d9b0
> --- /dev/null
> +++ b/man2/close_range.2
> @@ -0,0 +1,112 @@
> +.\" Copyright (c) 2020 Stephen Kitt <[email protected]>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +close_range \- close all file descriptors in a given range
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/close_range.h>
> +.PP
> +.BI "int close_range(int " first ", int " last ", unsigned int " flags );
Note, the kernel prototype uses unsigned int as the type for file
descriptor arguments. As does the close() syscall itself. Only glibc
wrappers expose file descriptor types (at least in close variants) as
int.
Since this is a manpage about the syscall not the wrapper it might make
sense to note the correct types.
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR close_range ()
> +system call closes all open file descriptors from
> +.I first
> +to
> +.IR last
> +(included).
> +.PP
> +Errors closing a given file descriptor are currently ignored.
> +.PP
> +.I flags
> +can be set to
> +.B CLOSE_RANGE_UNSHARE
> +to unshare the range of file descriptors from any other processes,
> +.I instead
> +of closing them.
As Michael has noted, this needs to be reworded. A few things to note:
- CLOSE_RANGE_UNSHARE will ensure that the calling process will have a
private file descriptor table. This ensures that other threads opening
files cannot inject new file descriptors into the caller's file
descriptor table to e.g. make the caller inherit unwanted file
descriptors.
- CLOSE_RANGE_UNSHARE is conceptually equivalent to:
unshare(CLONE_FILES);
close_range(3, ~0U);
- Whenever the requested range @last is greater than the current maximum
number of file descriptors allocated in the caller's file descriptor
table the kernel will only unshare a new file descriptor table for the
caller up to @first, i.e. the new file descriptor table will be 0 up
to and including @first not 0 up to and including @last. Which means
that the kernel will not have to do any costly filp_close() calls at
all. In essence, the close_range() operation is finished after the
in-kernel unshare call in such cases.
Christian
Le 09/12/2020 10:40, Christian Brauner a écrit :
> On Wed, Dec 09, 2020 at 09:50:38AM +0100, Michael Kerrisk (man-pages)
> wrote:
>> > +.PP
>> > +.I flags
>> > +can be set to
>> > +.B CLOSE_RANGE_UNSHARE
>> > +to unshare the range of file descriptors from any other processes,
>> > +.I instead
>> > +of closing them.
>>
>> Really "instead of closing them"? I had supposed that rather that this
>> should be "before closing them". That's also how the kernel code reads
>> to me, from a quick glance.
>
> It's also mentioned in the commit message. Basically setting
> CLOSE_RANGE_UNSHARE is equivalent to:
>
> unshare(CLONE_FILES);
> close_range(<start>, <end>);
Yes, I got that mixed up, thanks for the clarification! I'll send a v2
addressing the review comments later today.
Regards,
Stephen
Hey Christian,
I have a question for you below.
Thanks,
Alex
On 12/9/20 10:58 AM, Christian Brauner wrote:
> On Tue, Dec 08, 2020 at 10:51:33PM +0100, Stephen Kitt wrote:
>> This documents close_range(2) based on information in
>> 278a5fbaed89dacd04e9d052f4594ffd0e0585de and
>> 60997c3d45d9a67daf01c56d805ae4fec37e0bd8.
>>
>> Signed-off-by: Stephen Kitt <[email protected]>
>> ---
>
> Hey Stephen,
>
> Thanks for working on this that's an early Christmas present as it gets
> an item off my todo list!
>
>> man2/close_range.2 | 112 +++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 112 insertions(+)
>> create mode 100644 man2/close_range.2
>>
>> diff --git a/man2/close_range.2 b/man2/close_range.2
>> new file mode 100644
>> index 000000000..62167d9b0
>> --- /dev/null
>> +++ b/man2/close_range.2
>> @@ -0,0 +1,112 @@
>> +.\" Copyright (c) 2020 Stephen Kitt <[email protected]>
>> +.\"
>> +.\" %%%LICENSE_START(VERBATIM)
>> +.\" Permission is granted to make and distribute verbatim copies of this
>> +.\" manual provided the copyright notice and this permission notice are
>> +.\" preserved on all copies.
>> +.\"
>> +.\" Permission is granted to copy and distribute modified versions of this
>> +.\" manual under the conditions for verbatim copying, provided that the
>> +.\" entire resulting derived work is distributed under the terms of a
>> +.\" permission notice identical to this one.
>> +.\"
>> +.\" Since the Linux kernel and libraries are constantly changing, this
>> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
>> +.\" responsibility for errors or omissions, or for damages resulting from
>> +.\" the use of the information contained herein. The author(s) may not
>> +.\" have taken the same level of care in the production of this manual,
>> +.\" which is licensed free of charge, as they might when working
>> +.\" professionally.
>> +.\"
>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>> +.\" the source, must acknowledge the copyright and authors of this work.
>> +.\" %%%LICENSE_END
>> +.\"
>> +.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +close_range \- close all file descriptors in a given range
>> +.SH SYNOPSIS
>> +.nf
>> +.B #include <linux/close_range.h>
>> +.PP
>> +.BI "int close_range(int " first ", int " last ", unsigned int " flags );
>
> Note, the kernel prototype uses unsigned int as the type for file
> descriptor arguments. As does the close() syscall itself. Only glibc
> wrappers expose file descriptor types (at least in close variants) as
> int.
> Since this is a manpage about the syscall not the wrapper it might make
> sense to note the correct types.
>
>> +.fi
>> +.SH DESCRIPTION
>> +The
>> +.BR close_range ()
>> +system call closes all open file descriptors from
>> +.I first
>> +to
>> +.IR last
>> +(included).
>> +.PP
>> +Errors closing a given file descriptor are currently ignored.
>> +.PP
>> +.I flags
>> +can be set to
>> +.B CLOSE_RANGE_UNSHARE
>> +to unshare the range of file descriptors from any other processes,
>> +.I instead
>> +of closing them.
>
> As Michael has noted, this needs to be reworded. A few things to note:
> - CLOSE_RANGE_UNSHARE will ensure that the calling process will have a
> private file descriptor table. This ensures that other threads opening
> files cannot inject new file descriptors into the caller's file
> descriptor table to e.g. make the caller inherit unwanted file
> descriptors.
> - CLOSE_RANGE_UNSHARE is conceptually equivalent to:
> unshare(CLONE_FILES);
> close_range(3, ~0U);
AFAICS after reading the code, if the unsharing fails,
it will not close any file descriptors (please correct me if I'm wrong).
Just wanted to be sure that it was the intended behavior with you,
and if so, it would be good to document it in the page.
> - Whenever the requested range @last is greater than the current maximum
> number of file descriptors allocated in the caller's file descriptor
> table the kernel will only unshare a new file descriptor table for the
> caller up to @first, i.e. the new file descriptor table will be 0 up
> to and including @first not 0 up to and including @last. Which means
> that the kernel will not have to do any costly filp_close() calls at
> all. In essence, the close_range() operation is finished after the
> in-kernel unshare call in such cases.
>
> Christian
>
--
Alejandro Colomar
Linux man-pages comaintainer; http://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es
On Wed, Dec 09, 2020 at 11:44:22AM +0100, Alejandro Colomar (man-pages) wrote:
> Hey Christian,
>
> I have a question for you below.
>
> Thanks,
Hey Alex,
Sure!
>
> Alex
>
> On 12/9/20 10:58 AM, Christian Brauner wrote:
> > On Tue, Dec 08, 2020 at 10:51:33PM +0100, Stephen Kitt wrote:
> >> This documents close_range(2) based on information in
> >> 278a5fbaed89dacd04e9d052f4594ffd0e0585de and
> >> 60997c3d45d9a67daf01c56d805ae4fec37e0bd8.
> >>
> >> Signed-off-by: Stephen Kitt <[email protected]>
> >> ---
> >
> > Hey Stephen,
> >
> > Thanks for working on this that's an early Christmas present as it gets
> > an item off my todo list!
> >
> >> man2/close_range.2 | 112 +++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 112 insertions(+)
> >> create mode 100644 man2/close_range.2
> >>
> >> diff --git a/man2/close_range.2 b/man2/close_range.2
> >> new file mode 100644
> >> index 000000000..62167d9b0
> >> --- /dev/null
> >> +++ b/man2/close_range.2
> >> @@ -0,0 +1,112 @@
> >> +.\" Copyright (c) 2020 Stephen Kitt <[email protected]>
> >> +.\"
> >> +.\" %%%LICENSE_START(VERBATIM)
> >> +.\" Permission is granted to make and distribute verbatim copies of this
> >> +.\" manual provided the copyright notice and this permission notice are
> >> +.\" preserved on all copies.
> >> +.\"
> >> +.\" Permission is granted to copy and distribute modified versions of this
> >> +.\" manual under the conditions for verbatim copying, provided that the
> >> +.\" entire resulting derived work is distributed under the terms of a
> >> +.\" permission notice identical to this one.
> >> +.\"
> >> +.\" Since the Linux kernel and libraries are constantly changing, this
> >> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> >> +.\" responsibility for errors or omissions, or for damages resulting from
> >> +.\" the use of the information contained herein. The author(s) may not
> >> +.\" have taken the same level of care in the production of this manual,
> >> +.\" which is licensed free of charge, as they might when working
> >> +.\" professionally.
> >> +.\"
> >> +.\" Formatted or processed versions of this manual, if unaccompanied by
> >> +.\" the source, must acknowledge the copyright and authors of this work.
> >> +.\" %%%LICENSE_END
> >> +.\"
> >> +.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
> >> +.SH NAME
> >> +close_range \- close all file descriptors in a given range
> >> +.SH SYNOPSIS
> >> +.nf
> >> +.B #include <linux/close_range.h>
> >> +.PP
> >> +.BI "int close_range(int " first ", int " last ", unsigned int " flags );
> >
> > Note, the kernel prototype uses unsigned int as the type for file
> > descriptor arguments. As does the close() syscall itself. Only glibc
> > wrappers expose file descriptor types (at least in close variants) as
> > int.
> > Since this is a manpage about the syscall not the wrapper it might make
> > sense to note the correct types.
> >
> >> +.fi
> >> +.SH DESCRIPTION
> >> +The
> >> +.BR close_range ()
> >> +system call closes all open file descriptors from
> >> +.I first
> >> +to
> >> +.IR last
> >> +(included).
> >> +.PP
> >> +Errors closing a given file descriptor are currently ignored.
> >> +.PP
> >> +.I flags
> >> +can be set to
> >> +.B CLOSE_RANGE_UNSHARE
> >> +to unshare the range of file descriptors from any other processes,
> >> +.I instead
> >> +of closing them.
> >
> > As Michael has noted, this needs to be reworded. A few things to note:
> > - CLOSE_RANGE_UNSHARE will ensure that the calling process will have a
> > private file descriptor table. This ensures that other threads opening
> > files cannot inject new file descriptors into the caller's file
> > descriptor table to e.g. make the caller inherit unwanted file
> > descriptors.
> > - CLOSE_RANGE_UNSHARE is conceptually equivalent to:
> > unshare(CLONE_FILES);
> > close_range(3, ~0U);
>
> AFAICS after reading the code, if the unsharing fails,
> it will not close any file descriptors (please correct me if I'm wrong).
>
> Just wanted to be sure that it was the intended behavior with you,
> and if so, it would be good to document it in the page.
Yes, this is intended because if the unshare fails we haven't yet
actually started closing anything so we're before the point of no
return where we ignore failures. So we can let userspace decide whether
they want to retry without CLOSE_RANGE_UNSHARE.
Christian
Hello Stephen and Michael,
Thanks for the page!
A few more comments below.
Thanks,
Alex
On 12/9/20 9:50 AM, Michael Kerrisk (man-pages) wrote:
> Hello Stephen
>
> Thank you for writing this page! Some comments/questions below.
>
> On Tue, 8 Dec 2020 at 22:51, Stephen Kitt <[email protected]> wrote:
>>
>> This documents close_range(2) based on information in
>> 278a5fbaed89dacd04e9d052f4594ffd0e0585de and
>> 60997c3d45d9a67daf01c56d805ae4fec37e0bd8.
>
> (Thanks for noting these commit IDs.)
>
>> Signed-off-by: Stephen Kitt <[email protected]>
>> ---
>> man2/close_range.2 | 112 +++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 112 insertions(+)
>> create mode 100644 man2/close_range.2
>>
>> diff --git a/man2/close_range.2 b/man2/close_range.2
>> new file mode 100644
>> index 000000000..62167d9b0
>> --- /dev/null
>> +++ b/man2/close_range.2
>> @@ -0,0 +1,112 @@
>> +.\" Copyright (c) 2020 Stephen Kitt <[email protected]>
>> +.\"
>> +.\" %%%LICENSE_START(VERBATIM)
>> +.\" Permission is granted to make and distribute verbatim copies of this
>> +.\" manual provided the copyright notice and this permission notice are
>> +.\" preserved on all copies.
>> +.\"
>> +.\" Permission is granted to copy and distribute modified versions of this
>> +.\" manual under the conditions for verbatim copying, provided that the
>> +.\" entire resulting derived work is distributed under the terms of a
>> +.\" permission notice identical to this one.
>> +.\"
>> +.\" Since the Linux kernel and libraries are constantly changing, this
>> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
>> +.\" responsibility for errors or omissions, or for damages resulting from
>> +.\" the use of the information contained herein. The author(s) may not
>> +.\" have taken the same level of care in the production of this manual,
>> +.\" which is licensed free of charge, as they might when working
>> +.\" professionally.
>> +.\"
>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>> +.\" the source, must acknowledge the copyright and authors of this work.
>> +.\" %%%LICENSE_END
>> +.\"
>> +.TH CLOSE_RANGE 2 2020-12-08 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +close_range \- close all file descriptors in a given range
>> +.SH SYNOPSIS
>> +.nf
>> +.B #include <linux/close_range.h>
>> +.PP
>> +.BI "int close_range(int " first ", int " last ", unsigned int " flags );
The actual interface of close_range() uses 'unsigned int'
for the file descriptors:
$ sed -n '/SYSCALL_DEFINE.(close_range,/,/^{/p' linux/fs/open.c
SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
unsigned int, flags)
{
>> +.fi
>> +.SH DESCRIPTION
>> +The
>> +.BR close_range ()
>> +system call closes all open file descriptors from
>> +.I first
>> +to
>> +.IR last
s/IR/I/
>> +(included).
>> +.PP
>> +Errors closing a given file descriptor are currently ignored.
>> +.PP
>> +.I flags
>> +can be set to
>> +.B CLOSE_RANGE_UNSHARE
>> +to unshare the range of file descriptors from any other processes,
>> +.I instead
>> +of closing them.
>
> Really "instead of closing them"? I had supposed that rather that this
> should be "before closing them". That's also how the kernel code reads
> to me, from a quick glance.
>
>> +.SH RETURN VALUE
>> +On success,
>> +.BR close_range ()
>> +return 0.
>
> s/return/returns/
>
>> +On error, \-1 is returned and
>> +.I errno
>> +is set to indicate the cause of the error.
>> +.SH ERRORS
>> +.TP
>> +.B EINVAL
>> +.I flags
>> +is not valid, or
>> +.I first
>> +is greater than
>> +.IR last .
>> +.TP
>> +.B EMFILE
>> +The per-process limit on the number of open file descriptors has been reached
>> +(see the description of
>> +.BR RLIMIT_NOFILE
>> +in
>> +.BR getrlimit (2)).
>
> Given that we are simply closing FDs, how can EMFILE occur?
I was also going to ask this, but after some investigation,
it's possible. Basically, CLOSE_RANGE_UNSHARE duplicates fds
before closing them. In case that duplication is not possible,
ENOMEM and EMFILE are the two possible reasons. It might be interesting
to note that those errors may only occur if CLOSE_RANGE_UNSHARE was set.
See below:
fs/open.c:
close_range() calls __close_range().
fs/file.c:
If CLOSE_RANGE_UNSHARE, __close_range() calls unshare_fd()
with CLONE_FILES flag.
kernel/fork.c:
unshare_fd() calls dup_fd().
fs/file.c:
dup_fd() may fail with -ENOMEM or -EMFILE.
>
>> +.TP
>> +.B ENOMEM
>> +Insufficient kernel memory was available.
>> +.SH VERSIONS
>> +.BR close_range ()
>> +first appeared in Linux 5.9.
>> +.SH CONFORMING TO
>> +.BR close_range ()
>> +is available on Linux and FreeBSD.
>
> Here, I think it would be better to write:
>
> close_range()
> is a nonstandard function that is also present on FreeBSD.
>
>> +.SH NOTES
>> +Currently, there is no glibc wrapper for this system call; call it using
>> +.BR syscall (2).
>> +.SH USE CASES
>> +.\" 278a5fbaed89dacd04e9d052f4594ffd0e0585de
>> +.\" 60997c3d45d9a67daf01c56d805ae4fec37e0bd8
>> +.SS Closing file descriptors before exec
>> +File descriptors can be closed safely using
>> +.PP
>> +.in +4n
>> +.EX
>> +/* we don't want anything past stderr here */
>> +close_range(3, ~0U, CLOSE_RANGE_UNSHARE);
>> +execve(....);
>> +.EE
>> +.in
>> +.PP
>
> .PP is not necessary before a new subsection (.SS).
>
>> +.SS Closing all open file descriptors
>> +This is commonly implemented (on Linux) by listing open file
>
> Is it really true that this is common? I suspect not. It's slow, and
> relies on /proc being present. I would have thought that more common
> is something like:
>
> int maxfd = sysconf(_SC_OPEN_MAX);
> if (maxfd == -1) /* Limit is indeterminate... */
> maxfd = 16384; /* so take a guess */
>
> for (fd = 0; fd < maxfd; fd++)
> close(fd);
>
> I think it's fine to mention the use of a /proc as an (inferior and)
> alternative way of doing this. I'm just not sure that "commonly" is
> correct.
>
>> +descriptors in
>> +.B /proc/self/fd/
By reading proc.5, I think this should s/.B/.I/, right mtk?
>> +and calling
>> +.BR close (2)
>> +on each one.
>> +.BR close_range ()
>> +can take care of this without requiring
>> +.B /proc
By reading proc.5, I think this should s/.B/.I/, right mtk?
>> +and with a single system call, which provides significant performance
>> +benefits.
>> +.SH SEE ALSO
>> +.BR close (2)
>>
>> base-commit: b5dae3959625f5ff378e9edf9139057d1c06bb55
>> --
>> 2.20.1
>
> Thanks,
>
> Michael
>
>
--
Alejandro Colomar
Linux man-pages comaintainer; http://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es
Hi Christian,
Thanks for confirming that behavior. Seems reasonable.
I was wondering...
If this call is equivalent to unshare(2)+{close(2) in a loop},
shouldn't it fail for the same reasons those syscalls can fail?
What about the following errors?:
From unshare(2):
EPERM The calling process did not have the required privi‐
leges for this operation.
From close(2):
EBADF fd isn't a valid open file descriptor.
OK, this one can't happen with the current code.
Let's say there are fds 1 to 10, and you call 'close_range(20,30,0)'.
It's a no-op (although it will still unshare if the flag is set).
But souldn't it fail with EBADF?
EINTR The close() call was interrupted by a signal; see sig‐
nal(7).
EIO An I/O error occurred.
ENOSPC, EDQUOT
On NFS, these errors are not normally reported against
the first write which exceeds the available storage
space, but instead against a subsequent write(2),
fsync(2), or close().
Thanks,
Alex
On 12/9/20 11:56 AM, Christian Brauner wrote:
> On Wed, Dec 09, 2020 at 11:44:22AM +0100, Alejandro Colomar (man-pages) wrote:
>> Hey Christian,
>>
>> I have a question for you below.
>>
>> Thanks,
>
> Hey Alex,
>
> Sure!
[...]
>>
>> AFAICS after reading the code, if the unsharing fails,
>> it will not close any file descriptors (please correct me if I'm wrong).
>>
>> Just wanted to be sure that it was the intended behavior with you,
>> and if so, it would be good to document it in the page.
>
> Yes, this is intended because if the unshare fails we haven't yet
> actually started closing anything so we're before the point of no
> return where we ignore failures. So we can let userspace decide whether
> they want to retry without CLOSE_RANGE_UNSHARE.
>
> Christian
>
--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es
On 12/9/20 10:47 AM, Alejandro Colomar (man-pages) wrote:
>>> +descriptors in
>>> +.B /proc/self/fd/
>
> By reading proc.5, I think this should s/.B/.I/, right mtk?
>
>>> +and calling
>>> +.BR close (2)
>>> +on each one.
>>> +.BR close_range ()
>>> +can take care of this without requiring
>>> +.B /proc
>
> By reading proc.5, I think this should s/.B/.I/, right mtk?
Yes to both. Pathnames are formatted with .I.
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
On Thu, Dec 10, 2020 at 03:36:42PM +0100, Alejandro Colomar (man-pages) wrote:
> Hi Christian,
Hi Alex,
>
> Thanks for confirming that behavior. Seems reasonable.
>
> I was wondering...
> If this call is equivalent to unshare(2)+{close(2) in a loop},
> shouldn't it fail for the same reasons those syscalls can fail?
>
> What about the following errors?:
>
> From unshare(2):
>
> EPERM The calling process did not have the required privi‐
> leges for this operation.
unshare(CLONE_FILES) doesn't require any privileges. Only flags relevant
to kernel/nsproxy.c:unshare_nsproxy_namespaces() require privileges,
i.e.
CLONE_NEWNS
CLONE_NEWUTS
CLONE_NEWIPC
CLONE_NEWNET
CLONE_NEWPID
CLONE_NEWCGROUP
CLONE_NEWTIME
so the permissions are the same.
>
> From close(2):
> EBADF fd isn't a valid open file descriptor.
>
> OK, this one can't happen with the current code.
> Let's say there are fds 1 to 10, and you call 'close_range(20,30,0)'.
> It's a no-op (although it will still unshare if the flag is set).
> But souldn't it fail with EBADF?
CLOSE_RANGE_UNSHARE should always give you a private file descriptor
table independent of whether or not any file descriptors need to be
closed. That's also how we documented the flag:
/* Unshare the file descriptor table before closing file descriptors. */
#define CLOSE_RANGE_UNSHARE (1U << 1)
A caller calling unshare(CLONE_FILES) and then an emulated close_range()
or the proper close_range() syscall wants to make sure that all unwanted
file descriptors are closed (if any) and that no new file descriptors
can be injected afterwards. If you skip the unshare(CLONE_FILES) because
there are no fds to be closed you open up a race window. It would also
be annoying for userspace if they _may_ have received a private file
descriptor table but only if any fds needed to be closed.
If people really were extremely keen about skipping the unshare when no
fd needs to be closed then this could become a new flag. But I really
don't think that's necessary and also doesn't make a lot of sense, imho.
>
> EINTR The close() call was interrupted by a signal; see sig‐
> nal(7).
>
> EIO An I/O error occurred.
>
> ENOSPC, EDQUOT
> On NFS, these errors are not normally reported against
> the first write which exceeds the available storage
> space, but instead against a subsequent write(2),
> fsync(2), or close().
None of these will be seen by userspace because close_range() currently
ignores all errors after it has begun closing files.
Christian
Hi Christian,
Makes sense to me.
Thanks,
Alex
On 12/12/20 1:14 PM, Christian Brauner wrote:
> On Thu, Dec 10, 2020 at 03:36:42PM +0100, Alejandro Colomar (man-pages) wrote:
>> Hi Christian,
>
> Hi Alex,
>
>>
>> Thanks for confirming that behavior. Seems reasonable.
>>
>> I was wondering...
>> If this call is equivalent to unshare(2)+{close(2) in a loop},
>> shouldn't it fail for the same reasons those syscalls can fail?
>>
>> What about the following errors?:
>>
>> From unshare(2):
>>
>> EPERM The calling process did not have the required privi‐
>> leges for this operation.
>
> unshare(CLONE_FILES) doesn't require any privileges. Only flags relevant
> to kernel/nsproxy.c:unshare_nsproxy_namespaces() require privileges,
> i.e.
> CLONE_NEWNS
> CLONE_NEWUTS
> CLONE_NEWIPC
> CLONE_NEWNET
> CLONE_NEWPID
> CLONE_NEWCGROUP
> CLONE_NEWTIME
> so the permissions are the same.
>
>>
>> From close(2):
>> EBADF fd isn't a valid open file descriptor.
>>
>> OK, this one can't happen with the current code.
>> Let's say there are fds 1 to 10, and you call 'close_range(20,30,0)'.
>> It's a no-op (although it will still unshare if the flag is set).
>> But souldn't it fail with EBADF?
>
> CLOSE_RANGE_UNSHARE should always give you a private file descriptor
> table independent of whether or not any file descriptors need to be
> closed. That's also how we documented the flag:
>
> /* Unshare the file descriptor table before closing file descriptors. */
> #define CLOSE_RANGE_UNSHARE (1U << 1)
>
> A caller calling unshare(CLONE_FILES) and then an emulated close_range()
> or the proper close_range() syscall wants to make sure that all unwanted
> file descriptors are closed (if any) and that no new file descriptors
> can be injected afterwards. If you skip the unshare(CLONE_FILES) because
> there are no fds to be closed you open up a race window. It would also
> be annoying for userspace if they _may_ have received a private file
> descriptor table but only if any fds needed to be closed.
>
> If people really were extremely keen about skipping the unshare when no
> fd needs to be closed then this could become a new flag. But I really
> don't think that's necessary and also doesn't make a lot of sense, imho.
>
>>
>> EINTR The close() call was interrupted by a signal; see sig‐
>> nal(7).
>>
>> EIO An I/O error occurred.
>>
>> ENOSPC, EDQUOT
>> On NFS, these errors are not normally reported against
>> the first write which exceeds the available storage
>> space, but instead against a subsequent write(2),
>> fsync(2), or close().
>
> None of these will be seen by userspace because close_range() currently
> ignores all errors after it has begun closing files.
>
> Christian
>
Hi Stephen,
Linux 5.10 has been recently released.
Do you have any updates for this patch?
Thanks,
Alex
On 12/12/20 6:58 PM, Alejandro Colomar (man-pages) wrote:
> Hi Christian,
>
> Makes sense to me.
>
> Thanks,
>
> Alex
>
> On 12/12/20 1:14 PM, Christian Brauner wrote:
>> On Thu, Dec 10, 2020 at 03:36:42PM +0100, Alejandro Colomar (man-pages) wrote:
>>> Hi Christian,
>>
>> Hi Alex,
>>
>>>
>>> Thanks for confirming that behavior. Seems reasonable.
>>>
>>> I was wondering...
>>> If this call is equivalent to unshare(2)+{close(2) in a loop},
>>> shouldn't it fail for the same reasons those syscalls can fail?
>>>
>>> What about the following errors?:
>>>
>>> From unshare(2):
>>>
>>> EPERM The calling process did not have the required privi‐
>>> leges for this operation.
>>
>> unshare(CLONE_FILES) doesn't require any privileges. Only flags relevant
>> to kernel/nsproxy.c:unshare_nsproxy_namespaces() require privileges,
>> i.e.
>> CLONE_NEWNS
>> CLONE_NEWUTS
>> CLONE_NEWIPC
>> CLONE_NEWNET
>> CLONE_NEWPID
>> CLONE_NEWCGROUP
>> CLONE_NEWTIME
>> so the permissions are the same.
>>
>>>
>>> From close(2):
>>> EBADF fd isn't a valid open file descriptor.
>>>
>>> OK, this one can't happen with the current code.
>>> Let's say there are fds 1 to 10, and you call 'close_range(20,30,0)'.
>>> It's a no-op (although it will still unshare if the flag is set).
>>> But souldn't it fail with EBADF?
>>
>> CLOSE_RANGE_UNSHARE should always give you a private file descriptor
>> table independent of whether or not any file descriptors need to be
>> closed. That's also how we documented the flag:
>>
>> /* Unshare the file descriptor table before closing file descriptors. */
>> #define CLOSE_RANGE_UNSHARE (1U << 1)
>>
>> A caller calling unshare(CLONE_FILES) and then an emulated close_range()
>> or the proper close_range() syscall wants to make sure that all unwanted
>> file descriptors are closed (if any) and that no new file descriptors
>> can be injected afterwards. If you skip the unshare(CLONE_FILES) because
>> there are no fds to be closed you open up a race window. It would also
>> be annoying for userspace if they _may_ have received a private file
>> descriptor table but only if any fds needed to be closed.
>>
>> If people really were extremely keen about skipping the unshare when no
>> fd needs to be closed then this could become a new flag. But I really
>> don't think that's necessary and also doesn't make a lot of sense, imho.
>>
>>>
>>> EINTR The close() call was interrupted by a signal; see sig‐
>>> nal(7).
>>>
>>> EIO An I/O error occurred.
>>>
>>> ENOSPC, EDQUOT
>>> On NFS, these errors are not normally reported against
>>> the first write which exceeds the available storage
>>> space, but instead against a subsequent write(2),
>>> fsync(2), or close().
>>
>> None of these will be seen by userspace because close_range() currently
>> ignores all errors after it has begun closing files.
>>
>> Christian
>>
--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/
Hi Alex,
Le 18/12/2020 11:12, Alejandro Colomar (man-pages) a écrit :
> Linux 5.10 has been recently released.
> Do you have any updates for this patch?
Yes, I have a v3 in preparation, with _CLOEXEC and a code example. I'll
wrap it up today.
Regards,
Stephen