2014-02-12 15:44:47

by Andrei Vagin

Subject: [PATCH] kernel: reduce required permission for prctl_set_mm

Currently prctl_set_mm requires the global CAP_SYS_RESOURCE,
this patch reduces the requirement to CAP_SYS_RESOURCE in the current
namespace.

When we restore a task we need to set up text, data and data heap sizes
from userspace to the values a task had at checkpoint time.

Currently we can not restore these parameters, if a task lives in
a non-root user name space, because it has no capabilities in the
parent namespace.

prctl_set_mm() changes parameters of the current task and doesn't affect
other tasks.

This patch affects the RLIMIT_DATA limit, because consumption is
calculated relative to mm->end_data, mm->start_data, mm->start_brk.

	rlim = rlimit(RLIMIT_DATA);
	if (rlim < RLIM_INFINITY && (brk - mm->start_brk) +
			(mm->end_data - mm->start_data) > rlim)
		goto out;

This limit affects calls to brk() and sbrk(), but it doesn't affect
mmap. So I think requirement of CAP_SYS_RESOURCE in the current
namespace is enough for this limit.

Cc: Andrew Morton <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Robin Holt <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Aditya Kali <[email protected]>
Cc: [email protected]
Signed-off-by: Andrey Vagin <[email protected]>
---
kernel/sys.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index c0a58be..6f36fb3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1701,7 +1701,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
 	if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_RESOURCE))
+	if (!ns_capable(current_user_ns(), CAP_SYS_RESOURCE))
 		return -EPERM;
 
 	if (opt == PR_SET_MM_EXE_FILE)
--
1.8.5.3


2014-02-12 21:32:32

by Andrew Morton

Subject: Re: [PATCH] kernel: reduce required permission for prctl_set_mm

On Wed, 12 Feb 2014 19:40:11 +0400 Andrey Vagin <[email protected]> wrote:

> Currently prctl_set_mm requires the global CAP_SYS_RESOURCE,
> this patch reduces the requirement to CAP_SYS_RESOURCE in the current
> namespace.
>
> When we restore a task we need to set up text, data and data heap sizes
> from userspace to the values a task had at checkpoint time.
>
> Currently we can not restore these parameters, if a task lives in
> a non-root user name space, because it has no capabilities in the
> parent namespace.
>
> prctl_set_mm() changes parameters of the current task and doesn't affect
> other tasks.
>
> This patch affects the RLIMIT_DATA limit, because consumption is
> calculated relative to mm->end_data, mm->start_data, mm->start_brk.

I can't for the life of me work out what you were trying to say here.
Please fix and resend this paragraph?

> rlim = rlimit(RLIMIT_DATA);
> if (rlim < RLIM_INFINITY && (brk - mm->start_brk) +
> (mm->end_data - mm->start_data) > rlim)
> goto out;
>
> This limit affects calls to brk() and sbrk(), but it doesn't affect
> mmap. So I think requirement of CAP_SYS_RESOURCE in the current
> namespace is enough for this limit.
>
> ...
>
> Cc: [email protected]

That list is for reporting kernel security bugs.

>
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1701,7 +1701,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
> if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
> return -EINVAL;
>
> - if (!capable(CAP_SYS_RESOURCE))
> + if (!ns_capable(current_user_ns(), CAP_SYS_RESOURCE))
> return -EPERM;
>
> if (opt == PR_SET_MM_EXE_FILE)

This looks harmless.

My relatively-up-to-date manpages don't mention prctl(PR_SET_MM). I
see from http://marc.info/?l=linux-man&m=133132612704130&w=2 that
manpage additions were prepared nearly three years ago. Michael, did
this fall through a crack?

2014-02-12 21:50:41

by Kees Cook

Subject: Re: [PATCH] kernel: reduce required permission for prctl_set_mm

On Wed, Feb 12, 2014 at 1:32 PM, Andrew Morton
<[email protected]> wrote:
> On Wed, 12 Feb 2014 19:40:11 +0400 Andrey Vagin <[email protected]> wrote:
>
>> Currently prctl_set_mm requires the global CAP_SYS_RESOURCE,
>> this patch reduces the requirement to CAP_SYS_RESOURCE in the current
>> namespace.
>>
>> When we restore a task we need to set up text, data and data heap sizes
>> from userspace to the values a task had at checkpoint time.
>>
>> Currently we can not restore these parameters, if a task lives in
>> a non-root user name space, because it has no capabilities in the
>> parent namespace.
>>
>> prctl_set_mm() changes parameters of the current task and doesn't affect
>> other tasks.
>>
>> This patch affects the RLIMIT_DATA limit, because consumption is
>> calculated relative to mm->end_data, mm->start_data, mm->start_brk.
>
> I can't for the life of me work out what you were trying to say here.
> Please fix and resend this paragraph?
>
>> rlim = rlimit(RLIMIT_DATA);
>> if (rlim < RLIM_INFINITY && (brk - mm->start_brk) +
>> (mm->end_data - mm->start_data) > rlim)
>> goto out;
>>
>> This limit affects calls to brk() and sbrk(), but it doesn't affect
>> mmap. So I think requirement of CAP_SYS_RESOURCE in the current
>> namespace is enough for this limit.
>>
>> ...
>>
>> Cc: [email protected]
>
> That list is for reporting kernel security bugs.
>
>>
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1701,7 +1701,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
>> if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
>> return -EINVAL;
>>
>> - if (!capable(CAP_SYS_RESOURCE))
>> + if (!ns_capable(current_user_ns(), CAP_SYS_RESOURCE))
>> return -EPERM;
>>
>> if (opt == PR_SET_MM_EXE_FILE)
>
> This looks harmless.

I want to be convinced of this, but weakening this cap check seems
like an easy way for a process to hide itself trivially from the real
root user. It can change its exe file link, and dodge RLIMIT_DATA by
changing the brk addresses. The whole reason this cap check was there
was to stop that kind of thing. Limiting it to a namespace isn't great
since USER_NS means unprivileged processes can enter a new NS as the
NS root user.

-Kees

--
Kees Cook
Chrome OS Security

2014-02-12 21:55:49

by Cyrill Gorcunov

Subject: Re: [CRIU] [PATCH] kernel: reduce required permission for prctl_set_mm

On Wed, Feb 12, 2014 at 01:32:28PM -0800, Andrew Morton wrote:
> On Wed, 12 Feb 2014 19:40:11 +0400 Andrey Vagin <[email protected]> wrote:
>
> > Currently prctl_set_mm requires the global CAP_SYS_RESOURCE,
> > this patch reduces the requirement to CAP_SYS_RESOURCE in the current
> > namespace.
> >
> > When we restore a task we need to set up text, data and data heap sizes
> > from userspace to the values a task had at checkpoint time.
> >
> > Currently we can not restore these parameters, if a task lives in
> > a non-root user name space, because it has no capabilities in the
> > parent namespace.
> >
> > prctl_set_mm() changes parameters of the current task and doesn't affect
> > other tasks.
> >
> > This patch affects the RLIMIT_DATA limit, because consumption is
> > calculated relative to mm->end_data, mm->start_data, mm->start_brk.
>
> I can't for the life of me work out what you were trying to say here.
> Please fix and resend this paragraph?

I guess Andrey wanted to say that with this prctl call we rely on
the user to provide sane data for the mm members. We do some basic
checks here, but it is still possible to write complete crap into
these fields if you have enough privileges. That is not that scary,
because in the worst scenario the only thing one may achieve is
"weird" output in task statistics (this won't harm the kernel
itself in any way).

Still, the fields start_brk, end_data and start_data are
involved in the address computation inside the sys_brk syscall. So
if we assume someone has set completely random/crap values into
the mm members mentioned above, he might screw up his own sys_brk
call. But again, it won't affect the kernel itself; only the "current"
task is involved. Thus harmless.

>
> > rlim = rlimit(RLIMIT_DATA);
> > if (rlim < RLIM_INFINITY && (brk - mm->start_brk) +
> > (mm->end_data - mm->start_data) > rlim)
> > goto out;
> >
> > This limit affects calls to brk() and sbrk(), but it doesn't affect
> > mmap. So I think requirement of CAP_SYS_RESOURCE in the current
> > namespace is enough for this limit.
>
> This looks harmless.
>
> My relatively-up-to-date manpages don't mention prctl(PR_SET_MM). I
> see from http://marc.info/?l=linux-man&m=133132612704130&w=2 that
> manpage additions were prepared nearly three years ago. Michael, did
> this fall through a crack?

For sure your manpages are too old ;) On my Fedora 19, PR_SET_MM
is present.

[cyrill@moon ~] yum info man-pages
Loaded plugins: auto-update-debuginfo, langpacks, refresh-packagekit
Installed Packages
Name : man-pages
Arch : noarch
Version : 3.51

As for me, the patch looks good.

2014-02-12 22:11:54

by Andrew Vagin

Subject: Re: [PATCH] kernel: reduce required permission for prctl_set_mm

On Wed, Feb 12, 2014 at 01:32:28PM -0800, Andrew Morton wrote:
> On Wed, 12 Feb 2014 19:40:11 +0400 Andrey Vagin <[email protected]> wrote:
>
> > Currently prctl_set_mm requires the global CAP_SYS_RESOURCE,
> > this patch reduces the requirement to CAP_SYS_RESOURCE in the current
> > namespace.
> >
> > When we restore a task we need to set up text, data and data heap sizes
> > from userspace to the values a task had at checkpoint time.
> >
> > Currently we can not restore these parameters, if a task lives in
> > a non-root user name space, because it has no capabilities in the
> > parent namespace.
> >
> > prctl_set_mm() changes parameters of the current task and doesn't affect
> > other tasks.
> >
> > This patch affects the RLIMIT_DATA limit, because consumption is
> > calculated relative to mm->end_data, mm->start_data, mm->start_brk.
>
> I can't for the life of me work out what you were trying to say here.
> Please fix and resend this paragraph?

A task can exceed the RLIMIT_DATA limit by changing mm->start_brk,
so this patch also lowers the permission required to bypass RLIMIT_DATA.
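
To make that concrete, here is a minimal userspace sketch of the brk() check
quoted from the patch description. It is not kernel code: struct mm_model,
MODEL_RLIM_INFINITY and brk_exceeds_rlimit() are made-up names for
illustration, and plain uint64_t values stand in for the real mm_struct
fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for RLIM_INFINITY in this model (assumption, not the kernel value). */
#define MODEL_RLIM_INFINITY UINT64_MAX

/* Hypothetical userspace copy of the mm fields the check reads. */
struct mm_model {
	uint64_t start_data;
	uint64_t end_data;
	uint64_t start_brk;
};

/*
 * Mirrors the quoted condition: returns true when a request to move the
 * program break to 'brk' would be refused because of RLIMIT_DATA.
 */
bool brk_exceeds_rlimit(const struct mm_model *mm, uint64_t brk, uint64_t rlim)
{
	/* The model only handles well-ordered ranges. */
	assert(brk >= mm->start_brk && mm->end_data >= mm->start_data);

	return rlim < MODEL_RLIM_INFINITY &&
	       (brk - mm->start_brk) + (mm->end_data - mm->start_data) > rlim;
}
```

With start_data = 0x1000, end_data = 0x2000 and a limit of 0x8000, a request
for brk = 0x20000 is refused while start_brk is 0x10000, but accepted once
start_brk is moved up to 0x1f000, because only (brk - start_brk) shrinks.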

>
> > rlim = rlimit(RLIMIT_DATA);
> > if (rlim < RLIM_INFINITY && (brk - mm->start_brk) +
> > (mm->end_data - mm->start_data) > rlim)
> > goto out;
> >
> > This limit affects calls to brk() and sbrk(), but it doesn't affect
> > mmap. So I think requirement of CAP_SYS_RESOURCE in the current
> > namespace is enough for this limit.
> >
> > ...
> >
> > Cc: [email protected]
>
> That list is for reporting kernel security bugs.
>
> >
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1701,7 +1701,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
> > if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
> > return -EINVAL;
> >
> > - if (!capable(CAP_SYS_RESOURCE))
> > + if (!ns_capable(current_user_ns(), CAP_SYS_RESOURCE))
> > return -EPERM;
> >
> > if (opt == PR_SET_MM_EXE_FILE)
>
> This looks harmless.
>
> My relatively-up-to-date manpages don't mention prctl(PR_SET_MM). I
> see from http://marc.info/?l=linux-man&m=133132612704130&w=2 that
> manpage additions were prepared nearly three years ago. Michael, did
> this fall through a crack?
>

2014-02-12 23:09:35

by Andrew Vagin

Subject: Re: [PATCH] kernel: reduce required permission for prctl_set_mm

On Wed, Feb 12, 2014 at 01:50:35PM -0800, Kees Cook wrote:
> On Wed, Feb 12, 2014 at 1:32 PM, Andrew Morton
> <[email protected]> wrote:
> > On Wed, 12 Feb 2014 19:40:11 +0400 Andrey Vagin <[email protected]> wrote:
> >
> >> Currently prctl_set_mm requires the global CAP_SYS_RESOURCE,
> >> this patch reduces the requirement to CAP_SYS_RESOURCE in the current
> >> namespace.
> >>
> >> When we restore a task we need to set up text, data and data heap sizes
> >> from userspace to the values a task had at checkpoint time.
> >>
> >> Currently we can not restore these parameters, if a task lives in
> >> a non-root user name space, because it has no capabilities in the
> >> parent namespace.
> >>
> >> prctl_set_mm() changes parameters of the current task and doesn't affect
> >> other tasks.
> >>
> >> This patch affects the RLIMIT_DATA limit, because consumption is
> >> calculated relative to mm->end_data, mm->start_data, mm->start_brk.
> >
> > I can't for the life of me work out what you were trying to say here.
> > Please fix and resend this paragraph?
> >
> >> rlim = rlimit(RLIMIT_DATA);
> >> if (rlim < RLIM_INFINITY && (brk - mm->start_brk) +
> >> (mm->end_data - mm->start_data) > rlim)
> >> goto out;
> >>
> >> This limit affects calls to brk() and sbrk(), but it doesn't affect
> >> mmap. So I think requirement of CAP_SYS_RESOURCE in the current
> >> namespace is enough for this limit.
> >>
> >> ...
> >>
> >> Cc: [email protected]
> >
> > That list is for reporting kernel security bugs.
> >
> >>
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -1701,7 +1701,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
> >> if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
> >> return -EINVAL;
> >>
> >> - if (!capable(CAP_SYS_RESOURCE))
> >> + if (!ns_capable(current_user_ns(), CAP_SYS_RESOURCE))
> >> return -EPERM;
> >>
> >> if (opt == PR_SET_MM_EXE_FILE)
> >
> > This looks harmless.
>
> I want to be convinced of this, but weakening this cap check seems
> like an easy way for a process to hide itself trivially from the real
> root user. It can change its exe file link, and dodge RLIMIT_DATA by
> changing the brk addresses. The whole reason this cap check was there
> was to stop that kind of thing. Limiting it to a namespace isn't great
> since USER_NS means unprivileged processes can enter a new NS as the
> NS root user.

Everything you describe here is what we do when restoring tasks. We
need a way to restore these parameters. One of our goals is to be
able to dump and restore Linux Containers. All processes of a container
live in a separate set of namespaces.

I was thinking of restoring these parameters before entering the userns,
but this idea failed, because a process can't enter a pidns, yet the pidns
must be created in the userns...
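
For reference, the restore step under discussion looks roughly like this
from userspace. This is only a sketch: set_mm_field() and
restore_start_brk() are hypothetical helper names, and the saved address is
assumed to come from the checkpoint image. The
prctl(PR_SET_MM, PR_SET_MM_START_BRK, ...) call itself is the existing
interface, and with the current check it fails with EPERM unless the caller
has the global CAP_SYS_RESOURCE.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Set one mm field of the calling task via prctl(PR_SET_MM).
 * Returns 0 on success or a negative errno value on failure.
 */
int set_mm_field(int opt, unsigned long addr)
{
	/* arg4 and arg5 must be zero for the address-style options. */
	if (prctl(PR_SET_MM, opt, addr, 0, 0) == -1)
		return -errno;
	return 0;
}

/*
 * Hypothetical restore helper: put the start of the heap back to the
 * value recorded at checkpoint time.
 */
int restore_start_brk(unsigned long saved_start_brk)
{
	return set_mm_field(PR_SET_MM_START_BRK, saved_start_brk);
}
```

The same wrapper can be used with PR_SET_MM_START_DATA, PR_SET_MM_END_DATA
and the other PR_SET_MM_* options.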


>> It can change its exe file link
We can change memory content with the help of ptrace. So if we want to hide
a process, we can execute another process and inject our code into it.

That can be equivalent to changing the exe file link. Yes, it's a bit
harder, but we can do that even without this patch.

>> dodge RLIMIT_DATA

This limit affects calls to brk(2) and sbrk(2). But a task can use mmap() to
allocate memory. How is this limit used?

Sorry if I am missing something.

>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security

2014-02-12 23:14:21

by Eric W. Biederman

Subject: Re: [PATCH] kernel: reduce required permission for prctl_set_mm

Andrey Vagin <[email protected]> writes:

> Currently prctl_set_mm requires the global CAP_SYS_RESOURCE,
> this patch reduces the requirement to CAP_SYS_RESOURCE in the current
> namespace.
>
> When we restore a task we need to set up text, data and data heap sizes
> from userspace to the values a task had at checkpoint time.
>
> Currently we can not restore these parameters, if a task lives in
> a non-root user name space, because it has no capabilities in the
> parent namespace.
>
> prctl_set_mm() changes parameters of the current task and doesn't affect
> other tasks.
>
> This patch affects the RLIMIT_DATA limit, because consumption is
> calculated relative to mm->end_data, mm->start_data, mm->start_brk.
>
> rlim = rlimit(RLIMIT_DATA);
> if (rlim < RLIM_INFINITY && (brk - mm->start_brk) +
> (mm->end_data - mm->start_data) > rlim)
> goto out;
>
> This limit affects calls to brk() and sbrk(), but it doesn't affect
> mmap. So I think requirement of CAP_SYS_RESOURCE in the current
> namespace is enough for this limit.

Ick. No.

You do not have an argument for reducing the capable call here to
ns_capable. ns_capable(current_user_ns(), CAP_SYS_RESOURCE) does not
currently allow anything. If ns_capable(current_user_ns(),
CAP_SYS_RESOURCE) were to allow things there would still need to be a
check for a root-settable maximum, which is not present in this patch.

Either you have an argument for completely removing the capability check
or your reasoning is broken.

Reading through the code and reading through brk, I am fairly confident
that your reasoning is broken.

The rlimit test needs to be performed whenever any of start_brk, end_data,
or start_data is changed, and that test is most definitely not performed.
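
One possible shape for such a test, as a userspace sketch only: before
accepting new values, recompute the data usage the way brk() does and
compare it against the limit. struct data_layout and
layout_within_rlimit() are hypothetical names, and UINT64_MAX stands in for
RLIM_INFINITY; nothing like this exists in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the values a PR_SET_MM caller wants to install. */
struct data_layout {
	uint64_t start_data;
	uint64_t end_data;
	uint64_t start_brk;
	uint64_t brk;
};

/*
 * Returns true if the proposed layout keeps data + heap usage within
 * 'rlim'. Inverted ranges are rejected regardless of the limit.
 */
bool layout_within_rlimit(const struct data_layout *l, uint64_t rlim)
{
	if (l->end_data < l->start_data || l->brk < l->start_brk)
		return false;

	if (rlim == UINT64_MAX)		/* stand-in for RLIM_INFINITY */
		return true;

	uint64_t data = l->end_data - l->start_data;
	uint64_t heap = l->brk - l->start_brk;

	/* Guard the addition against wrap-around. */
	if (heap > UINT64_MAX - data)
		return false;

	return heap + data <= rlim;
}
```

A caller would build the layout from the current values plus the one field
being changed and refuse the update when the function returns false.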

Checks for enforcing the stack_size are completely missing.

It does look like with care we can remove or make much more precise the
capable checks in prctl_set_mm, but this patch definitely does not
take that needed care.

Nacked-by: "Eric W. Biederman" <[email protected]>

> Cc: Andrew Morton <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Robin Holt <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Chen Gang <[email protected]>
> Cc: Stephen Rothwell <[email protected]>
> Cc: Pavel Emelyanov <[email protected]>
> Cc: Aditya Kali <[email protected]>
> Cc: [email protected]
> Signed-off-by: Andrey Vagin <[email protected]>
> ---
> kernel/sys.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c0a58be..6f36fb3 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1701,7 +1701,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
> if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
> return -EINVAL;
>
> - if (!capable(CAP_SYS_RESOURCE))
> + if (!ns_capable(current_user_ns(), CAP_SYS_RESOURCE))
> return -EPERM;
>
> if (opt == PR_SET_MM_EXE_FILE)