Hi,
This is the second version of the work. V1 was here:
https://lkml.org/lkml/2019/3/11/207
I removed CC to kvm list since not necessary any more, but added
linux-api to the list as suggested by Kirill.
This one greatly simplifies the previous version, dropped the kvm
special entry and mimic the sysctl_unprivileged_bpf_disabled knob for
userfaultfd as suggested by many. The major differences comparing to
the BPF flag are: (1) use PTRACE instead of ADMIN capability, and (2)
allow to switch the flag back and forth (BPF does not allow to switch
back to "enabled" if "disabled" once).
So the main idea of this simpler version is that we still keep the old
way as is by default but we only provide a way for admins when they
really want to turn userfaultfd off for unprivileged users.
About procfs vs sysfs: I still used the procfs way because admins can
still leverage sysctl.conf with that and also since no one yet
explicitly asked for sysfs for a better reason yet (And I just noticed
BPF just added another bpf_stats_enabled into sysctl a few weeks ago).
Please have a look, thanks.
Peter Xu (1):
userfaultfd/sysctl: add vm.unprivileged_userfaultfd
Documentation/sysctl/vm.txt | 12 ++++++++++++
fs/userfaultfd.c | 5 +++++
include/linux/userfaultfd_k.h | 2 ++
kernel/sysctl.c | 12 ++++++++++++
4 files changed, 31 insertions(+)
--
2.17.1
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
Suggested-by: Andrea Arcangeli <[email protected]>
Suggested-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
Documentation/sysctl/vm.txt | 12 ++++++++++++
fs/userfaultfd.c | 5 +++++
include/linux/userfaultfd_k.h | 2 ++
kernel/sysctl.c | 12 ++++++++++++
4 files changed, 31 insertions(+)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 187ce4f599a2..f146712f67bb 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
- stat_refresh
- numa_stat
- swappiness
+- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
@@ -818,6 +819,17 @@ The default value is 60.
==============================================================
+unprivileged_userfaultfd
+
+This flag controls whether unprivileged users can use the userfaultfd
+syscalls. Set this to 1 to allow unprivileged users to use the
+userfaultfd syscalls, or set this to 0 to restrict userfaultfd to only
+privileged users (with SYS_CAP_PTRACE capability).
+
+The default value is 1.
+
+==============================================================
+
- user_reserve_kbytes
When overcommit_memory is set to 2, "never overcommit" mode, reserve
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 89800fc7dc9d..7e856a25cc2f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -30,6 +30,8 @@
#include <linux/security.h>
#include <linux/hugetlb.h>
+int sysctl_unprivileged_userfaultfd __read_mostly = 1;
+
static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
enum userfaultfd_state {
@@ -1921,6 +1923,9 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
struct userfaultfd_ctx *ctx;
int fd;
+ if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
+ return -EPERM;
+
BUG_ON(!current->mm);
/* Check the UFFD_* constants for consistency. */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 37c9eba75c98..ac9d71e24b81 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -28,6 +28,8 @@
#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
+extern int sysctl_unprivileged_userfaultfd;
+
extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7578e21a711b..9b8ff1881df9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -66,6 +66,7 @@
#include <linux/kexec.h>
#include <linux/bpf.h>
#include <linux/mount.h>
+#include <linux/userfaultfd_k.h>
#include <linux/uaccess.h>
#include <asm/processor.h>
@@ -1704,6 +1705,17 @@ static struct ctl_table vm_table[] = {
.extra1 = (void *)&mmap_rnd_compat_bits_min,
.extra2 = (void *)&mmap_rnd_compat_bits_max,
},
+#endif
+#ifdef CONFIG_USERFAULTFD
+ {
+ .procname = "unprivileged_userfaultfd",
+ .data = &sysctl_unprivileged_userfaultfd,
+ .maxlen = sizeof(sysctl_unprivileged_userfaultfd),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
#endif
{ }
};
--
2.17.1
Hi Peter,
On Tue, Mar 19, 2019 at 11:07:22AM +0800, Peter Xu wrote:
> Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> whether userfaultfd is allowed by unprivileged users. When this is
> set to zero, only privileged users (root user, or users with the
> CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> syscalls.
>
> Suggested-by: Andrea Arcangeli <[email protected]>
> Suggested-by: Mike Rapoport <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Just one minor note below
> ---
> Documentation/sysctl/vm.txt | 12 ++++++++++++
> fs/userfaultfd.c | 5 +++++
> include/linux/userfaultfd_k.h | 2 ++
> kernel/sysctl.c | 12 ++++++++++++
> 4 files changed, 31 insertions(+)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 187ce4f599a2..f146712f67bb 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
> - stat_refresh
> - numa_stat
> - swappiness
> +- unprivileged_userfaultfd
> - user_reserve_kbytes
> - vfs_cache_pressure
> - watermark_boost_factor
> @@ -818,6 +819,17 @@ The default value is 60.
>
> ==============================================================
>
> +unprivileged_userfaultfd
> +
> +This flag controls whether unprivileged users can use the userfaultfd
> +syscalls. Set this to 1 to allow unprivileged users to use the
> +userfaultfd syscalls, or set this to 0 to restrict userfaultfd to only
> +privileged users (with SYS_CAP_PTRACE capability).
Can you please fully spell "system call"?
> +
> +The default value is 1.
> +
> +==============================================================
> +
> - user_reserve_kbytes
>
> When overcommit_memory is set to 2, "never overcommit" mode, reserve
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 89800fc7dc9d..7e856a25cc2f 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -30,6 +30,8 @@
> #include <linux/security.h>
> #include <linux/hugetlb.h>
>
> +int sysctl_unprivileged_userfaultfd __read_mostly = 1;
> +
> static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
>
> enum userfaultfd_state {
> @@ -1921,6 +1923,9 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
> struct userfaultfd_ctx *ctx;
> int fd;
>
> + if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
> + return -EPERM;
> +
> BUG_ON(!current->mm);
>
> /* Check the UFFD_* constants for consistency. */
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 37c9eba75c98..ac9d71e24b81 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -28,6 +28,8 @@
> #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
>
> +extern int sysctl_unprivileged_userfaultfd;
> +
> extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
>
> extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 7578e21a711b..9b8ff1881df9 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -66,6 +66,7 @@
> #include <linux/kexec.h>
> #include <linux/bpf.h>
> #include <linux/mount.h>
> +#include <linux/userfaultfd_k.h>
>
> #include <linux/uaccess.h>
> #include <asm/processor.h>
> @@ -1704,6 +1705,17 @@ static struct ctl_table vm_table[] = {
> .extra1 = (void *)&mmap_rnd_compat_bits_min,
> .extra2 = (void *)&mmap_rnd_compat_bits_max,
> },
> +#endif
> +#ifdef CONFIG_USERFAULTFD
> + {
> + .procname = "unprivileged_userfaultfd",
> + .data = &sysctl_unprivileged_userfaultfd,
> + .maxlen = sizeof(sysctl_unprivileged_userfaultfd),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &zero,
> + .extra2 = &one,
> + },
> #endif
> { }
> };
> --
> 2.17.1
>
--
Sincerely yours,
Mike.
On Tue, 19 Mar 2019 11:07:22 +0800 Peter Xu <[email protected]> wrote:
> Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> whether userfaultfd is allowed by unprivileged users. When this is
> set to zero, only privileged users (root user, or users with the
> CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> syscalls.
Please send along a full description of why you believe Linux needs
this feature, for me to add to the changelog. What is the benefit to
our users? How will it be used?
etcetera. As it was presented I'm seeing no justification for adding
the patch!
* Andrew Morton ([email protected]) wrote:
> On Tue, 19 Mar 2019 11:07:22 +0800 Peter Xu <[email protected]> wrote:
>
> > Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> > whether userfaultfd is allowed by unprivileged users. When this is
> > set to zero, only privileged users (root user, or users with the
> > CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> > syscalls.
>
> Please send along a full description of why you believe Linux needs
> this feature, for me to add to the changelog. What is the benefit to
> our users? How will it be used?
>
> etcetera. As it was presented I'm seeing no justification for adding
> the patch!
How about:
---
Userfaultfd can be misued to make it easier to exploit existing use-after-free
(and similar) bugs that might otherwise only make a short window
or race condition available. By using userfaultfd to stall a kernel
thread, a malicious program can keep some state, that it wrote, stable
for an extended period, which it can then access using an existing
exploit. While it doesn't cause the exploit itself, and while it's not
the only thing that can stall a kernel thread when accessing a memory location,
it's one of the few that never needs priviledge.
Add a flag, allowing userfaultfd to be restricted, so that in general
it won't be useable by arbitrary user programs, but in environments that
require userfaultfd it can be turned back on.
---
Dave
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Tue, Mar 19, 2019 at 06:28:23PM +0000, Dr. David Alan Gilbert wrote:
> * Andrew Morton ([email protected]) wrote:
> > On Tue, 19 Mar 2019 11:07:22 +0800 Peter Xu <[email protected]> wrote:
> >
> > > Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> > > whether userfaultfd is allowed by unprivileged users. When this is
> > > set to zero, only privileged users (root user, or users with the
> > > CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> > > syscalls.
> >
> > Please send along a full description of why you believe Linux needs
> > this feature, for me to add to the changelog. What is the benefit to
> > our users? How will it be used?
> >
> > etcetera. As it was presented I'm seeing no justification for adding
> > the patch!
>
> How about:
>
> ---
> Userfaultfd can be misued to make it easier to exploit existing use-after-free
> (and similar) bugs that might otherwise only make a short window
> or race condition available. By using userfaultfd to stall a kernel
> thread, a malicious program can keep some state, that it wrote, stable
> for an extended period, which it can then access using an existing
> exploit. While it doesn't cause the exploit itself, and while it's not
> the only thing that can stall a kernel thread when accessing a memory location,
> it's one of the few that never needs priviledge.
>
> Add a flag, allowing userfaultfd to be restricted, so that in general
> it won't be useable by arbitrary user programs, but in environments that
> require userfaultfd it can be turned back on.
Thanks for the quick write up, Dave! I definitely should have some
justification in the cover letter and carry it until the last version.
Sorry to be unclear at the first glance.
--
Peter Xu
Hello,
On Tue, Mar 19, 2019 at 09:11:04AM +0200, Mike Rapoport wrote:
> Hi Peter,
>
> On Tue, Mar 19, 2019 at 11:07:22AM +0800, Peter Xu wrote:
> > Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> > whether userfaultfd is allowed by unprivileged users. When this is
> > set to zero, only privileged users (root user, or users with the
> > CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> > syscalls.
> >
> > Suggested-by: Andrea Arcangeli <[email protected]>
> > Suggested-by: Mike Rapoport <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
>
> Reviewed-by: Mike Rapoport <[email protected]>
>
> Just one minor note below
This looks fine with me too.
> > + if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
> > + return -EPERM;
The only difference between the bpf sysctl and the userfaultfd sysctl
this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
already if it's doing other kind of tracking on processes runtime, in
addition of userfaultfd. In other words both syscalls works only for
root, when the two sysctl are opt-in set to 1.
Reviewed-by: Andrea Arcangeli <[email protected]>
Hello,
On Tue, Mar 19, 2019 at 06:28:23PM +0000, Dr. David Alan Gilbert wrote:
> ---
> Userfaultfd can be misued to make it easier to exploit existing use-after-free
> (and similar) bugs that might otherwise only make a short window
> or race condition available. By using userfaultfd to stall a kernel
> thread, a malicious program can keep some state, that it wrote, stable
> for an extended period, which it can then access using an existing
> exploit. While it doesn't cause the exploit itself, and while it's not
> the only thing that can stall a kernel thread when accessing a memory location,
> it's one of the few that never needs priviledge.
>
> Add a flag, allowing userfaultfd to be restricted, so that in general
> it won't be useable by arbitrary user programs, but in environments that
> require userfaultfd it can be turned back on.
The default in the patch leaves userfaultfd enabled to all users, so
it may be clearer to reverse the last sentence to "in hardened
environments it allows to restrict userfaultfd to privileged processes.".
We can also make example that 'While this is not a kernel issue, in
practice unless you also "chmod u-s /usr/bin/fusermount" there's no
tangible benefit in removing privileges for userfaultfd, other than
probabilistic ones by decreasig the attack surface of the kernel, but
that would be better be achieved through SECCOMP and not globally.'.
On Wed, Mar 20, 2019 at 03:01:12PM -0400, Andrea Arcangeli wrote:
> but
> that would be better be achieved through SECCOMP and not globally.'.
That begs the question why not use seccomp for this? What if everyone
decided to add a knob for all syscalls to do the same? For the commit
log, why is it OK then to justify a knob for this syscall?
Luis
Hello,
On Thu, Mar 21, 2019 at 01:43:35PM +0000, Luis Chamberlain wrote:
> On Wed, Mar 20, 2019 at 03:01:12PM -0400, Andrea Arcangeli wrote:
> > but
> > that would be better be achieved through SECCOMP and not globally.'.
>
> That begs the question why not use seccomp for this? What if everyone
> decided to add a knob for all syscalls to do the same? For the commit
> log, why is it OK then to justify a knob for this syscall?
That's a good point and it's obviously more secure because you can
block a lot more than just bpf and userfaultfd: however not all
syscalls have CONFIG_USERFAULTFD=n or CONFIG_BPF_SYSCALL=n that you
can set to =n at build time, then they'll return -ENOSYS (implemented
as sys_ni_syscall in the =n case).
The point of the bpf (already included upstream) and userfaultfd
(proposed) sysctl is to avoid users having to rebuild the kernel if
they want to harden their setup without being forced to run all
containers under seccomp, just like they could by setting those two
config options "=n" at build time.
So you can see it like allowing a runtime selection of
CONFIG_USERFAULTFD and CONFIG_BPF_SYSCALL without the kernel build
time config forcing the decision on behalf of the end user.
Thanks,
Andrea
On Tue, Mar 19, 2019 at 11:02 AM Andrew Morton
<[email protected]> wrote:
>
> On Tue, 19 Mar 2019 11:07:22 +0800 Peter Xu <[email protected]> wrote:
>
> > Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> > whether userfaultfd is allowed by unprivileged users. When this is
> > set to zero, only privileged users (root user, or users with the
> > CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> > syscalls.
>
> Please send along a full description of why you believe Linux needs
> this feature, for me to add to the changelog. What is the benefit to
> our users? How will it be used?
>
> etcetera. As it was presented I'm seeing no justification for adding
> the patch!
Was there a v3 of this patch? I'd still really like to have this knob added. :)
Thanks!
--
Kees Cook
On Tue, Mar 19, 2019 at 11:07:22AM +0800, Peter Xu wrote:
> Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> whether userfaultfd is allowed by unprivileged users. When this is
> set to zero, only privileged users (root user, or users with the
> CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> syscalls.
Hello
I am a bit confused about this patch, can you help to answer it.
Why the sysctl interface of fs/userfaultfd.c belongs to vm_table instead
of fs_table ?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cefdca0a86be517bc390fc4541e3674b8e7803b0
Thanks
Xiaoming Ni
On Wed, May 27, 2020 at 02:54:13PM +0800, Xiaoming Ni wrote:
>
> On Tue, Mar 19, 2019 at 11:07:22AM +0800, Peter Xu wrote:
> > Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
> > whether userfaultfd is allowed by unprivileged users. When this is
> > set to zero, only privileged users (root user, or users with the
> > CAP_SYS_PTRACE capability) will be able to use the userfaultfd
> > syscalls.
>
> Hello
Hi, Xiaoming,
> I am a bit confused about this patch, can you help to answer it.
>
> Why the sysctl interface of fs/userfaultfd.c belongs to vm_table instead of
> fs_table ?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cefdca0a86be517bc390fc4541e3674b8e7803b0
Because I think it makes more sense to put the new key into where it suites
better, irrelevant to which directory the variable is declared. To me,
unprivileged_userfaultfd is definitely more suitable for vm rather than fs,
because userfaultfd is really about memory management rather than file system.
Thanks,
--
Peter Xu
On 2020/5/27 22:21, Peter Xu wrote:
> On Wed, May 27, 2020 at 02:54:13PM +0800, Xiaoming Ni wrote:
>>
>> On Tue, Mar 19, 2019 at 11:07:22AM +0800, Peter Xu wrote:
>>> Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
>>> whether userfaultfd is allowed by unprivileged users. When this is
>>> set to zero, only privileged users (root user, or users with the
>>> CAP_SYS_PTRACE capability) will be able to use the userfaultfd
>>> syscalls.
>>
>> Hello
>
> Hi, Xiaoming,
>
>> I am a bit confused about this patch, can you help to answer it.
>>
>> Why the sysctl interface of fs/userfaultfd.c belongs to vm_table instead of
>> fs_table ?
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cefdca0a86be517bc390fc4541e3674b8e7803b0
>
> Because I think it makes more sense to put the new key into where it suites
> better, irrelevant to which directory the variable is declared. To me,
> unprivileged_userfaultfd is definitely more suitable for vm rather than fs,
> because userfaultfd is really about memory management rather than file system.
>
> Thanks,
>
Thank you for your answer
Since userfaultfd and vm are more closely related, will there be
consideration to move fs/userfaultfd.c to the mm directory in the future?
Thanks
Xiaoming Ni
On Thu, May 28, 2020 at 04:50:49PM +0800, Xiaoming Ni wrote:
> Since userfaultfd and vm are more closely related, will there be
> consideration to move fs/userfaultfd.c to the mm directory in the future?
Xiaoming,
I don't think so - userfaultfd is still interfacing the userspace as a file
object, so I think it's proper to have fs/userfaultfd.c handles the fs part of
userfaultfd. There's also an mm/userfaultfd.c which handles some internal
logic of userfaultfd syscalls, please feel free to have a look.
Thanks,
--
Peter Xu