2020-10-19 11:43:51

by Giuseppe Scrivano

Subject: [PATCH v2 0/2] fs, close_range: add flag CLOSE_RANGE_CLOEXEC

When the new flag is used, close_range will set the close-on-exec bit
for the file descriptors instead of close()-ing them.

It is useful for e.g. container runtimes that want to minimize the
number of syscalls used after a seccomp profile is installed but want
to keep some fds open until the container process is executed.

v1->v2:
* move close_range(..., CLOSE_RANGE_CLOEXEC) implementation to a separate function.
* use bitmap_set() to set the close-on-exec bits in the bitmap.
* add test with rlimit(RLIMIT_NOFILE) in place.
* use "cur_max" that is already used by close_range(..., 0).
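For readers wondering why a test with RLIMIT_NOFILE in place is worth having: lowering the soft limit only restricts new descriptor allocations, so descriptors numbered above the new limit stay open and still need their close-on-exec bit set. The sketch below is a userspace illustration of that fact only. The helper name fd_survives_lower_rlimit() is hypothetical and is not part of the patch or the selftest.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch). Returns 1 if a descriptor
 * numbered >= min_fd is still usable after the soft RLIMIT_NOFILE is
 * lowered to new_soft, 0 if it is not, and -1 if the setup failed.
 */
int fd_survives_lower_rlimit(int min_fd, rlim_t new_soft)
{
	struct rlimit old, lowered;
	int pipefd[2], high_fd, usable;

	assert(min_fd >= 0);
	if (getrlimit(RLIMIT_NOFILE, &old) != 0 || pipe(pipefd) != 0)
		return -1;

	/* Lowest free descriptor that is >= min_fd. */
	high_fd = fcntl(pipefd[0], F_DUPFD, min_fd);
	close(pipefd[0]);
	close(pipefd[1]);
	if (high_fd < 0)
		return -1;

	lowered = old;
	lowered.rlim_cur = new_soft;
	if (setrlimit(RLIMIT_NOFILE, &lowered) != 0) {
		close(high_fd);
		return -1;
	}

	/* The limit only gates new allocations; F_GETFD still works. */
	usable = fcntl(high_fd, F_GETFD) >= 0;

	setrlimit(RLIMIT_NOFILE, &old);	/* restore the original soft limit */
	close(high_fd);
	return usable;
}
```

Calling fd_survives_lower_rlimit(40, 25) mirrors what the selftest does on a larger scale: the descriptors are opened first, the limit is lowered afterwards, and the range operation still has to reach them.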

Giuseppe Scrivano (2):
fs, close_range: add flag CLOSE_RANGE_CLOEXEC
selftests: add tests for CLOSE_RANGE_CLOEXEC

fs/file.c | 44 ++++++++---
include/uapi/linux/close_range.h | 3 +
.../testing/selftests/core/close_range_test.c | 74 +++++++++++++++++++
3 files changed, 111 insertions(+), 10 deletions(-)

--
2.26.2


2020-10-19 11:44:08

by Giuseppe Scrivano

Subject: [PATCH v2 2/2] selftests: add tests for CLOSE_RANGE_CLOEXEC

Signed-off-by: Giuseppe Scrivano <[email protected]>
---
.../testing/selftests/core/close_range_test.c | 74 +++++++++++++++++++
1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
index c99b98b0d461..c9db282158bb 100644
--- a/tools/testing/selftests/core/close_range_test.c
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -11,6 +11,7 @@
#include <string.h>
#include <syscall.h>
#include <unistd.h>
+#include <sys/resource.h>

#include "../kselftest_harness.h"
#include "../clone3/clone3_selftests.h"
@@ -23,6 +24,10 @@
#define CLOSE_RANGE_UNSHARE (1U << 1)
#endif

+#ifndef CLOSE_RANGE_CLOEXEC
+#define CLOSE_RANGE_CLOEXEC (1U << 2)
+#endif
+
static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
unsigned int flags)
{
@@ -224,4 +229,73 @@ TEST(close_range_unshare_capped)
EXPECT_EQ(0, WEXITSTATUS(status));
}

+TEST(close_range_cloexec)
+{
+ int i, ret;
+ int open_fds[101];
+ struct rlimit rlimit;
+
+ for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
+ int fd;
+
+ fd = open("/dev/null", O_RDONLY);
+ ASSERT_GE(fd, 0) {
+ if (errno == ENOENT)
+ XFAIL(return, "Skipping test since /dev/null does not exist");
+ }
+
+ open_fds[i] = fd;
+ }
+
+ ret = sys_close_range(1000, 1000, CLOSE_RANGE_CLOEXEC);
+ if (ret < 0) {
+ if (errno == ENOSYS)
+ XFAIL(return, "close_range() syscall not supported");
+ if (errno == EINVAL)
+ XFAIL(return, "close_range() doesn't support CLOSE_RANGE_CLOEXEC");
+ }
+
+ /* Ensure the FD_CLOEXEC bit is set also with a resource limit in place. */
+ EXPECT_EQ(0, getrlimit(RLIMIT_NOFILE, &rlimit));
+ rlimit.rlim_cur = 25;
+ EXPECT_EQ(0, setrlimit(RLIMIT_NOFILE, &rlimit));
+
+ /* Set close-on-exec for two ranges: [0-50] and [75-100]. */
+ ret = sys_close_range(open_fds[0], open_fds[50], CLOSE_RANGE_CLOEXEC);
+ EXPECT_EQ(0, ret);
+ ret = sys_close_range(open_fds[75], open_fds[100], CLOSE_RANGE_CLOEXEC);
+ EXPECT_EQ(0, ret);
+
+ for (i = 0; i <= 50; i++) {
+ int flags = fcntl(open_fds[i], F_GETFD);
+
+ EXPECT_GT(flags, -1);
+ EXPECT_EQ(flags & FD_CLOEXEC, FD_CLOEXEC);
+ }
+
+ for (i = 51; i <= 74; i++) {
+ int flags = fcntl(open_fds[i], F_GETFD);
+
+ EXPECT_GT(flags, -1);
+ EXPECT_EQ(flags & FD_CLOEXEC, 0);
+ }
+
+ for (i = 75; i <= 100; i++) {
+ int flags = fcntl(open_fds[i], F_GETFD);
+
+ EXPECT_GT(flags, -1);
+ EXPECT_EQ(flags & FD_CLOEXEC, FD_CLOEXEC);
+ }
+
+ /* Test a common pattern. */
+ ret = sys_close_range(3, UINT_MAX, CLOSE_RANGE_CLOEXEC);
+ for (i = 0; i <= 100; i++) {
+ int flags = fcntl(open_fds[i], F_GETFD);
+
+ EXPECT_GT(flags, -1);
+ EXPECT_EQ(flags & FD_CLOEXEC, FD_CLOEXEC);
+ }
+}
+
+
TEST_HARNESS_MAIN
--
2.26.2

2020-10-19 23:30:36

by Giuseppe Scrivano

Subject: [PATCH v2 1/2] fs, close_range: add flag CLOSE_RANGE_CLOEXEC

When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
immediately close the files but it sets the close-on-exec bit.

It is useful for e.g. container runtimes that usually install a
seccomp profile "as late as possible" before execv'ing the container
process itself. The container runtime could either do:
          1                                  2
- install_seccomp_profile();        - close_range(MIN_FD, MAX_INT, 0);
- close_range(MIN_FD, MAX_INT, 0);  - install_seccomp_profile();
- execve(...);                      - execve(...);

Both alternatives have some disadvantages.

In the first variant the seccomp profile cannot block the close_range
syscall (nor opendir/read/close/..., which are used for the fallback on
older kernels).
In the second variant, close_range() can be used only on the fds
that are not going to be needed by the runtime anymore, and it
potentially must be called multiple times to account for the different
ranges that must be closed.

Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
The runtime is able to use the open fds and the seccomp profile could
block close_range() and the syscalls used for its fallback.
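As an illustration of the userspace side, here is a minimal sketch of how a runtime might request this behaviour and fall back on kernels that lack the flag. Everything in it is an assumption made for the example: the helper names mark_cloexec_from(), cloexec_from_procfs() and fd_is_cloexec() are hypothetical, the syscall number 436 is the generic one and may not match every architecture, and the /proc/self/fd walk is just one possible form of the opendir/read/close fallback mentioned above.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_close_range
#define __NR_close_range 436	/* assumption: generic syscall number */
#endif

#ifndef CLOSE_RANGE_CLOEXEC
#define CLOSE_RANGE_CLOEXEC (1U << 2)
#endif

/* Returns 1 if fd has FD_CLOEXEC set, 0 if not, -1 if fd is not open. */
int fd_is_cloexec(int fd)
{
	int flags;

	assert(fd >= 0);
	flags = fcntl(fd, F_GETFD);
	return flags < 0 ? -1 : (flags & FD_CLOEXEC) != 0;
}

/* Fallback sketch: walk /proc/self/fd and set the bit one fd at a time. */
int cloexec_from_procfs(unsigned int first)
{
	DIR *dir = opendir("/proc/self/fd");
	struct dirent *de;

	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		char *end;
		unsigned long fd = strtoul(de->d_name, &end, 10);
		int flags;

		/* Skips "." and ".." as well as fds below the range. */
		if (end == de->d_name || *end != '\0' ||
		    fd < first || fd > INT_MAX)
			continue;

		flags = fcntl((int)fd, F_GETFD);
		if (flags >= 0)
			fcntl((int)fd, F_SETFD, flags | FD_CLOEXEC);
	}

	closedir(dir);
	return 0;
}

/* Hypothetical helper: mark every fd >= first as close-on-exec. */
int mark_cloexec_from(unsigned int first)
{
	if (syscall(__NR_close_range, first, ~0U, CLOSE_RANGE_CLOEXEC) == 0)
		return 0;

	/*
	 * ENOSYS: no close_range(); EINVAL: flag unknown to this kernel.
	 * EPERM is treated the same way on the assumption that some
	 * sandboxes answer unknown syscalls with it.
	 */
	if (errno != ENOSYS && errno != EINVAL && errno != EPERM)
		return -1;

	return cloexec_from_procfs(first);
}
```

A runtime following the ordering described above would call mark_cloexec_from(3) first, keep using its descriptors, install the seccomp profile, and then execve(); the descriptors are dropped by the exec itself, so no further close-related syscalls are needed after the profile is in place.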

Signed-off-by: Giuseppe Scrivano <[email protected]>
---
fs/file.c | 44 ++++++++++++++++++++++++--------
include/uapi/linux/close_range.h | 3 +++
2 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 21c0893f2f1d..0295d4f7c5ef 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -672,6 +672,35 @@ int __close_fd(struct files_struct *files, unsigned fd)
}
EXPORT_SYMBOL(__close_fd); /* for ksys_close() */

+static inline void __range_cloexec(struct files_struct *cur_fds,
+ unsigned int fd, unsigned int max_fd)
+{
+ struct fdtable *fdt;
+
+ if (fd > max_fd)
+ return;
+
+ spin_lock(&cur_fds->file_lock);
+ fdt = files_fdtable(cur_fds);
+ bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1);
+ spin_unlock(&cur_fds->file_lock);
+}
+
+static inline void __range_close(struct files_struct *cur_fds, unsigned int fd,
+ unsigned int max_fd)
+{
+ while (fd <= max_fd) {
+ struct file *file;
+
+ file = pick_file(cur_fds, fd++);
+ if (!file)
+ continue;
+
+ filp_close(file, cur_fds);
+ cond_resched();
+ }
+}
+
/**
* __close_range() - Close all file descriptors in a given range.
*
@@ -687,7 +716,7 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
struct task_struct *me = current;
struct files_struct *cur_fds = me->files, *fds = NULL;

- if (flags & ~CLOSE_RANGE_UNSHARE)
+ if (flags & ~(CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC))
return -EINVAL;

if (fd > max_fd)
@@ -725,16 +754,11 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
}

max_fd = min(max_fd, cur_max);
- while (fd <= max_fd) {
- struct file *file;

- file = pick_file(cur_fds, fd++);
- if (!file)
- continue;
-
- filp_close(file, cur_fds);
- cond_resched();
- }
+ if (flags & CLOSE_RANGE_CLOEXEC)
+ __range_cloexec(cur_fds, fd, max_fd);
+ else
+ __range_close(cur_fds, fd, max_fd);

if (fds) {
/*
diff --git a/include/uapi/linux/close_range.h b/include/uapi/linux/close_range.h
index 6928a9fdee3c..2d804281554c 100644
--- a/include/uapi/linux/close_range.h
+++ b/include/uapi/linux/close_range.h
@@ -5,5 +5,8 @@
/* Unshare the file descriptor table before closing file descriptors. */
#define CLOSE_RANGE_UNSHARE (1U << 1)

+/* Set the FD_CLOEXEC bit instead of closing the file descriptor. */
+#define CLOSE_RANGE_CLOEXEC (1U << 2)
+
#endif /* _UAPI_LINUX_CLOSE_RANGE_H */

--
2.26.2

2020-10-21 05:28:03

by Christian Brauner

Subject: Re: [PATCH v2 1/2] fs, close_range: add flag CLOSE_RANGE_CLOEXEC

On Mon, Oct 19, 2020 at 12:26:53PM +0200, Giuseppe Scrivano wrote:
> When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
> immediately close the files but it sets the close-on-exec bit.
>
> It is useful for e.g. container runtimes that usually install a
> seccomp profile "as late as possible" before execv'ing the container
> process itself. The container runtime could either do:
>           1                                  2
> - install_seccomp_profile();        - close_range(MIN_FD, MAX_INT, 0);
> - close_range(MIN_FD, MAX_INT, 0);  - install_seccomp_profile();
> - execve(...);                      - execve(...);
>
> Both alternative have some disadvantages.
>
> In the first variant the seccomp_profile cannot block the close_range
> syscall, as well as opendir/read/close/... for the fallback on older
> kernels).
> In the second variant, close_range() can be used only on the fds
> that are not going to be needed by the runtime anymore, and it must be
> potentially called multiple times to account for the different ranges
> that must be closed.
>
> Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
> The runtime is able to use the open fds and the seccomp profile could
> block close_range() and the syscalls used for its fallback.

I see, so you want those fds to be closed after exec but still use them
before. Yeah, this is a good use-case. (I proposed this extension quite a
while ago when we started discussing this syscall. Thanks for working
on this!)

>
> Signed-off-by: Giuseppe Scrivano <[email protected]>
> ---
> fs/file.c | 44 ++++++++++++++++++++++++--------
> include/uapi/linux/close_range.h | 3 +++
> 2 files changed, 37 insertions(+), 10 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 21c0893f2f1d..0295d4f7c5ef 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -672,6 +672,35 @@ int __close_fd(struct files_struct *files, unsigned fd)
> }
> EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
>
> +static inline void __range_cloexec(struct files_struct *cur_fds,
> + unsigned int fd, unsigned int max_fd)
> +{
> + struct fdtable *fdt;
> +
> + if (fd > max_fd)
> + return;

Looks like formatting issues here.

> +
> + spin_lock(&cur_fds->file_lock);
> + fdt = files_fdtable(cur_fds);
> + bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1);

I think that this is ok and that there's no reason to make this any more
complex unless we somehow really see performance issues, which I doubt.

If Al is ok with doing it this way and doesn't see any obvious issues
I'll be taking this for some testing and would come back to ack this and
pick it up.
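For context on what the bitmap_set() call amounts to, here is a userspace sketch of setting an inclusive bit range in an array of unsigned longs. It is an illustration only and not the kernel implementation: range_set(), bit_is_set() and mark_range() are hypothetical names, there is no locking, and the bounds check stands in for the clamping to cur_max that __close_range() performs before calling the helper.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Userspace stand-in for bitmap_set(): set nbits bits starting at start. */
void range_set(unsigned long *map, unsigned int start, unsigned int nbits)
{
	for (unsigned int bit = start; nbits > 0; bit++, nbits--)
		map[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* Returns 1 if the given bit is set in map, 0 otherwise. */
int bit_is_set(const unsigned long *map, unsigned int bit)
{
	return (map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}

/*
 * Same shape as __range_cloexec(): an inclusive [fd, max_fd] range that is
 * a no-op when empty. map_bits is the capacity of map in bits.
 */
void mark_range(unsigned long *map, size_t map_bits,
		unsigned int fd, unsigned int max_fd)
{
	if (fd > max_fd)
		return;

	/* Caller must clamp max_fd first, as the kernel does with cur_max. */
	assert(max_fd < map_bits);
	range_set(map, fd, max_fd - fd + 1);
}
```

The clamp matters for the length computation as well: with fd == 0 and an unclamped max_fd of UINT_MAX, max_fd - fd + 1 would wrap around to 0.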

> + spin_unlock(&cur_fds->file_lock);
> +}
> +
> +static inline void __range_close(struct files_struct *cur_fds, unsigned int fd,
> + unsigned int max_fd)
> +{
> + while (fd <= max_fd) {
> + struct file *file;
> +
> + file = pick_file(cur_fds, fd++);
> + if (!file)
> + continue;
> +
> + filp_close(file, cur_fds);
> + cond_resched();
> + }
> +}
> +
> /**
> * __close_range() - Close all file descriptors in a given range.
> *
> @@ -687,7 +716,7 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
> struct task_struct *me = current;
> struct files_struct *cur_fds = me->files, *fds = NULL;
>
> - if (flags & ~CLOSE_RANGE_UNSHARE)
> + if (flags & ~(CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC))
> return -EINVAL;
>
> if (fd > max_fd)
> @@ -725,16 +754,11 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
> }
>
> max_fd = min(max_fd, cur_max);
> - while (fd <= max_fd) {
> - struct file *file;
>
> - file = pick_file(cur_fds, fd++);
> - if (!file)
> - continue;
> -
> - filp_close(file, cur_fds);
> - cond_resched();
> - }
> + if (flags & CLOSE_RANGE_CLOEXEC)
> + __range_cloexec(cur_fds, fd, max_fd);
> + else
> + __range_close(cur_fds, fd, max_fd);
>
> if (fds) {
> /*
> diff --git a/include/uapi/linux/close_range.h b/include/uapi/linux/close_range.h
> index 6928a9fdee3c..2d804281554c 100644
> --- a/include/uapi/linux/close_range.h
> +++ b/include/uapi/linux/close_range.h
> @@ -5,5 +5,8 @@
> /* Unshare the file descriptor table before closing file descriptors. */
> #define CLOSE_RANGE_UNSHARE (1U << 1)
>
> +/* Set the FD_CLOEXEC bit instead of closing the file descriptor. */
> +#define CLOSE_RANGE_CLOEXEC (1U << 2)
> +
> #endif /* _UAPI_LINUX_CLOSE_RANGE_H */
>
> --
> 2.26.2
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2020-10-21 05:31:55

by Christian Brauner

Subject: Re: [PATCH v2 2/2] selftests: add tests for CLOSE_RANGE_CLOEXEC

First, thank you for the selftests. That's great to see!

Could you please add a short explanation of what you're testing here to
the commit message?

On Mon, Oct 19, 2020 at 12:26:54PM +0200, Giuseppe Scrivano wrote:
> Signed-off-by: Giuseppe Scrivano <[email protected]>
> ---
> .../testing/selftests/core/close_range_test.c | 74 +++++++++++++++++++
> 1 file changed, 74 insertions(+)
>
> diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
> index c99b98b0d461..c9db282158bb 100644
> --- a/tools/testing/selftests/core/close_range_test.c
> +++ b/tools/testing/selftests/core/close_range_test.c
> @@ -11,6 +11,7 @@
> #include <string.h>
> #include <syscall.h>
> #include <unistd.h>
> +#include <sys/resource.h>
>
> #include "../kselftest_harness.h"
> #include "../clone3/clone3_selftests.h"
> @@ -23,6 +24,10 @@
> #define CLOSE_RANGE_UNSHARE (1U << 1)
> #endif
>
> +#ifndef CLOSE_RANGE_CLOEXEC
> +#define CLOSE_RANGE_CLOEXEC (1U << 2)
> +#endif
> +
> static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
> unsigned int flags)
> {
> @@ -224,4 +229,73 @@ TEST(close_range_unshare_capped)
> EXPECT_EQ(0, WEXITSTATUS(status));
> }
>
> +TEST(close_range_cloexec)
> +{
> + int i, ret;
> + int open_fds[101];
> + struct rlimit rlimit;
> +
> + for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
> + int fd;
> +
> + fd = open("/dev/null", O_RDONLY);
> + ASSERT_GE(fd, 0) {
> + if (errno == ENOENT)
> + XFAIL(return, "Skipping test since /dev/null does not exist");
> + }
> +
> + open_fds[i] = fd;
> + }
> +
> + ret = sys_close_range(1000, 1000, CLOSE_RANGE_CLOEXEC);
> + if (ret < 0) {
> + if (errno == ENOSYS)
> + XFAIL(return, "close_range() syscall not supported");
> + if (errno == EINVAL)
> + XFAIL(return, "close_range() doesn't support CLOSE_RANGE_CLOEXEC");
> + }
> +
> + /* Ensure the FD_CLOEXEC bit is set also with a resource limit in place. */
> + EXPECT_EQ(0, getrlimit(RLIMIT_NOFILE, &rlimit));
> + rlimit.rlim_cur = 25;
> + EXPECT_EQ(0, setrlimit(RLIMIT_NOFILE, &rlimit));

I usually prefer to call ASSERT_* to abort at the first true failure
before moving on. And I think all the EXPECT_*()s here should be
ASSERT_*()s because those are all hard failures imho.

Apart from that this looks good.

> +
> + /* Set close-on-exec for two ranges: [0-50] and [75-100]. */
> + ret = sys_close_range(open_fds[0], open_fds[50], CLOSE_RANGE_CLOEXEC);
> + EXPECT_EQ(0, ret);
> + ret = sys_close_range(open_fds[75], open_fds[100], CLOSE_RANGE_CLOEXEC);
> + EXPECT_EQ(0, ret);
> +
> + for (i = 0; i <= 50; i++) {
> + int flags = fcntl(open_fds[i], F_GETFD);
> +
> + EXPECT_GT(flags, -1);
> + EXPECT_EQ(flags & FD_CLOEXEC, FD_CLOEXEC);
> + }
> +
> + for (i = 51; i <= 74; i++) {
> + int flags = fcntl(open_fds[i], F_GETFD);
> +
> + EXPECT_GT(flags, -1);
> + EXPECT_EQ(flags & FD_CLOEXEC, 0);
> + }
> +
> + for (i = 75; i <= 100; i++) {
> + int flags = fcntl(open_fds[i], F_GETFD);
> +
> + EXPECT_GT(flags, -1);
> + EXPECT_EQ(flags & FD_CLOEXEC, FD_CLOEXEC);
> + }
> +
> + /* Test a common pattern. */
> + ret = sys_close_range(3, UINT_MAX, CLOSE_RANGE_CLOEXEC);
> + for (i = 0; i <= 100; i++) {
> + int flags = fcntl(open_fds[i], F_GETFD);
> +
> + EXPECT_GT(flags, -1);
> + EXPECT_EQ(flags & FD_CLOEXEC, FD_CLOEXEC);
> + }
> +}
> +
> +
> TEST_HARNESS_MAIN
> --
> 2.26.2
>

2020-10-29 15:40:55

by Christian Brauner

Subject: Re: [PATCH v2 0/2] fs, close_range: add flag CLOSE_RANGE_CLOEXEC

On Mon, Oct 19, 2020 at 12:26:52PM +0200, Giuseppe Scrivano wrote:
> When the new flag is used, close_range will set the close-on-exec bit
> for the file descriptors instead of close()-ing them.
>
> It is useful for e.g. container runtimes that want to minimize the
> number of syscalls used after a seccomp profile is installed but want
> to keep some fds open until the container process is executed.
>
> v1->v2:
> * move close_range(..., CLOSE_RANGE_CLOEXEC) implementation to a separate function.
> * use bitmap_set() to set the close-on-exec bits in the bitmap.
> * add test with rlimit(RLIMIT_NOFILE) in place.
> * use "cur_max" that is already used by close_range(..., 0).

I'm picking this up for some testing, thanks
Christian

2020-10-29 16:50:36

by Giuseppe Scrivano

Subject: Re: [PATCH v2 0/2] fs, close_range: add flag CLOSE_RANGE_CLOEXEC

Hi Christian,

Christian Brauner <[email protected]> writes:

> On Mon, Oct 19, 2020 at 12:26:52PM +0200, Giuseppe Scrivano wrote:
>> When the new flag is used, close_range will set the close-on-exec bit
>> for the file descriptors instead of close()-ing them.
>>
>> It is useful for e.g. container runtimes that want to minimize the
>> number of syscalls used after a seccomp profile is installed but want
>> to keep some fds open until the container process is executed.
>>
>> v1->v2:
>> * move close_range(..., CLOSE_RANGE_CLOEXEC) implementation to a separate function.
>> * use bitmap_set() to set the close-on-exec bits in the bitmap.
>> * add test with rlimit(RLIMIT_NOFILE) in place.
>> * use "cur_max" that is already used by close_range(..., 0).
>
> I'm picking this up for some testing, thanks
> Christian

Thanks! I've addressed the comments you had for v2 and pushed them
here[1], but I've not sent v3 yet as I was waiting for feedback from Al
on whether using bitmap_set() is fine.

Regards,
Giuseppe

[1] https://github.com/giuseppe/linux/tree/close-range-cloexec

2020-11-18 10:06:49

by Christian Brauner

Subject: Re: [PATCH v2 0/2] fs, close_range: add flag CLOSE_RANGE_CLOEXEC

On Thu, Oct 29, 2020 at 05:47:53PM +0100, Giuseppe Scrivano wrote:
> Hi Christian,
>
> Christian Brauner <[email protected]> writes:
>
> > On Mon, Oct 19, 2020 at 12:26:52PM +0200, Giuseppe Scrivano wrote:
> >> When the new flag is used, close_range will set the close-on-exec bit
> >> for the file descriptors instead of close()-ing them.
> >>
> >> It is useful for e.g. container runtimes that want to minimize the
> >> number of syscalls used after a seccomp profile is installed but want
> >> to keep some fds open until the container process is executed.
> >>
> >> v1->v2:
> >> * move close_range(..., CLOSE_RANGE_CLOEXEC) implementation to a separate function.
> >> * use bitmap_set() to set the close-on-exec bits in the bitmap.
> >> * add test with rlimit(RLIMIT_NOFILE) in place.
> >> * use "cur_max" that is already used by close_range(..., 0).
> >
> > I'm picking this up for some testing, thanks
> > Christian
>
> thanks! I've addressed the comments you had for v2 and pushed them
> here[1] but I've not sent yet v3 as I was waiting for a feedback from Al
> whether using bitmap_set() is fine.

Send it please.
Christian