LinuxLists.cc - [PATCH] mm_release: Do a set_fs(USER_DS) before handling clear_child

2010-11-30 02:24:15

Subject: [PATCH] mm_release: Do a set_fs(USER_DS) before handling clear_child_tid.

If a user manages to trigger a kernel BUG() or page fault with fs set to
KERNEL_DS, fs is not otherwise reset before do_exit(), allowing the user to
write a 0 to an arbitrary address in kernel memory.

Signed-off-by: Nelson Elhage <[email protected]>
---
AFAICT this is presently only triggerable in the presence of another bug, but
this potentially turns a lot of DoS bugs into privilege escalation, so it's
worth fixing. Among other things, sock_no_sendpage and the kernel_{read,write}v
calls in splice.c make it easy to call an awful lot of the kernel under
KERNEL_DS.

This isn't the only way we could fix this -- we could put the set_fs() at the
start of do_exit, or in all the callers that might call potentially do_exit with
KERNEL_DS set, or else we could do an access_ok inside fork(). I'm happy to put
together one of those patches if someone thinks another approach makes more
sense.

kernel/fork.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..a68445e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -636,7 +636,12 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
/*
* We don't check the error code - if userspace has
* not set up a proper pointer then tough luck.
+ *
+ * We do set_fs() explicitly in case this task
+ * exited while inside set_fs(KERNEL_DS) for
+ * some reason (e.g. on a BUG()).
*/
+ set_fs(USER_DS);
put_user(0, tsk->clear_child_tid);
sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
1, NULL, NULL, 0);
--
1.7.1.31.g6297e

2010-12-01 00:09:55

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH] mm_release: Do a set_fs(USER_DS) before handling clear_child_tid.

On Mon, 29 Nov 2010 21:19:16 -0500
Nelson Elhage <[email protected]> wrote:

> If a user manages to trigger a kernel BUG() or page fault with fs set to
> KERNEL_DS, fs is not otherwise reset before do_exit(), allowing the user to
> write a 0 to an arbitrary address in kernel memory.
>
> Signed-off-by: Nelson Elhage <[email protected]>
> ---
> AFAICT this is presently only triggerable in the presence of another bug, but
> this potentially turns a lot of DoS bugs into privilege escalation, so it's
> worth fixing. Among other things, sock_no_sendpage and the kernel_{read,write}v
> calls in splice.c make it easy to call an awful lot of the kernel under
> KERNEL_DS.
>
> This isn't the only way we could fix this -- we could put the set_fs() at the
> start of do_exit, or in all the callers that might call potentially do_exit with
> KERNEL_DS set, or else we could do an access_ok inside fork(). I'm happy to put
> together one of those patches if someone thinks another approach makes more
> sense.
>
> kernel/fork.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3b159c5..a68445e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -636,7 +636,12 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
> /*
> * We don't check the error code - if userspace has
> * not set up a proper pointer then tough luck.
> + *
> + * We do set_fs() explicitly in case this task
> + * exited while inside set_fs(KERNEL_DS) for
> + * some reason (e.g. on a BUG()).
> */
> + set_fs(USER_DS);
> put_user(0, tsk->clear_child_tid);
> sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
> 1, NULL, NULL, 0);

Confused. The user can only exploit the wrong addr_limit if control
returns to userspace for the user's code to execute. But that won't be
happening, because this thread will unconditionally exit.

If/when you unconfuse me, I'd suggest this change only be done if the
thread is *known* to have oopsed - doing it for non-oopsed threads
seems unpleasant to my mind. And I think it should be done nice and
clearly, right up inside do_exit() by some means. Or perhaps in the
oops code, just before it calls do_exit(). Not hidden down in
mm_release().

2010-12-01 00:59:14

by Nelson Elhage

[permalink] [raw]

Subject: Re: [PATCH] mm_release: Do a set_fs(USER_DS) before handling clear_child_tid.

On Tue, Nov 30, 2010 at 04:09:50PM -0800, Andrew Morton wrote:
> On Mon, 29 Nov 2010 21:19:16 -0500
> Nelson Elhage <[email protected]> wrote:
>
> > If a user manages to trigger a kernel BUG() or page fault with fs set to
> > KERNEL_DS, fs is not otherwise reset before do_exit(), allowing the user to
> > write a 0 to an arbitrary address in kernel memory.
> >
> > Signed-off-by: Nelson Elhage <[email protected]>
> > ---
> > AFAICT this is presently only triggerable in the presence of another bug, but
> > this potentially turns a lot of DoS bugs into privilege escalation, so it's
> > worth fixing. Among other things, sock_no_sendpage and the kernel_{read,write}v
> > calls in splice.c make it easy to call an awful lot of the kernel under
> > KERNEL_DS.
> >
> > This isn't the only way we could fix this -- we could put the set_fs() at the
> > start of do_exit, or in all the callers that might call potentially do_exit with
> > KERNEL_DS set, or else we could do an access_ok inside fork(). I'm happy to put
> > together one of those patches if someone thinks another approach makes more
> > sense.
> >
> > kernel/fork.c | 5 +++++
> > 1 files changed, 5 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3b159c5..a68445e 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -636,7 +636,12 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
> > /*
> > * We don't check the error code - if userspace has
> > * not set up a proper pointer then tough luck.
> > + *
> > + * We do set_fs() explicitly in case this task
> > + * exited while inside set_fs(KERNEL_DS) for
> > + * some reason (e.g. on a BUG()).
> > */
> > + set_fs(USER_DS);
> > put_user(0, tsk->clear_child_tid);
> > sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
> > 1, NULL, NULL, 0);
>
> Confused. The user can only exploit the wrong addr_limit if control
> returns to userspace for the user's code to execute. But that won't be
> happening, because this thread will unconditionally exit.

The user can exploit the wrong addr_limit on the very next line, with the
put_user() there. clear_child_tid is not checked in any way before this
point. Writing a single zero might not seem like much, but it's enough for
privilege escalation (e.g. overwrite the top half of a function pointer to point
to userspace).

I have a PoC code that uses this bug, along with CVE-2010-3849, to write a zero
to an arbitrary kernel address, so I've tested that this is not theoretical.

That's also why I put the set_fs() hidden inside mm_release, since that's the
only place where (to my knowledge) it matters.

On re-reading, I didn't mention clear_child_tid anywhere in the commit message,
which was an error on my part, and explains the confusion. Sorry about that, and
I hope this clears that up.

Let me know if this makes more sense, and I'll send a revised patch.

- Nelson

>
>
> If/when you unconfuse me, I'd suggest this change only be done if the
> thread is *known* to have oopsed - doing it for non-oopsed threads
> seems unpleasant to my mind. And I think it should be done nice and
> clearly, right up inside do_exit() by some means. Or perhaps in the
> oops code, just before it calls do_exit(). Not hidden down in
> mm_release().

2010-12-01 01:50:32

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH] mm_release: Do a set_fs(USER_DS) before handling clear_child_tid.

On Tue, 30 Nov 2010 19:59:09 -0500 Nelson Elhage <[email protected]> wrote:

> On Tue, Nov 30, 2010 at 04:09:50PM -0800, Andrew Morton wrote:
> > On Mon, 29 Nov 2010 21:19:16 -0500
> > Nelson Elhage <[email protected]> wrote:
> >
> > > + * exited while inside set_fs(KERNEL_DS) for
> > > + * some reason (e.g. on a BUG()).
> > > */
> > > + set_fs(USER_DS);
> > > put_user(0, tsk->clear_child_tid);
> > > sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
> > > 1, NULL, NULL, 0);
> >
> > Confused. The user can only exploit the wrong addr_limit if control
> > returns to userspace for the user's code to execute. But that won't be
> > happening, because this thread will unconditionally exit.
>
> The user can exploit the wrong addr_limit on the very next line, with the
> put_user() there. clear_child_tid is not checked in any way before this
> point. Writing a single zero might not seem like much, but it's enough for
> privilege escalation (e.g. overwrite the top half of a function pointer to point
> to userspace).

Ah, OK. Doh.

> I have a PoC code that uses this bug, along with CVE-2010-3849, to write a zero
> to an arbitrary kernel address, so I've tested that this is not theoretical.
>
> That's also why I put the set_fs() hidden inside mm_release, since that's the
> only place where (to my knowledge) it matters.
>
> On re-reading, I didn't mention clear_child_tid anywhere in the commit message,
> which was an error on my part, and explains the confusion. Sorry about that, and
> I hope this clears that up.
>
> Let me know if this makes more sense, and I'll send a revised patch.

I do think it would be better to fix up the addr_limit somewhere within
the oops code rather than in the regular code. Presumably just before
calling do_exit(). Isn't that the logical place? Plus it fixes up any
other such problems, whether they be there now or in the future.

Although that involves altering every arch, each in multiple places.
ick. Maybe at the start of do_exit()?

2010-12-01 02:27:56

by Nelson Elhage

[permalink] [raw]

Subject: [PATCH v2] do_exit(): Make sure we run with get_fs() == USER_DS.

If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c)
do a put_user to a user-controlled address, potentially allowing a user to
leverage an oops into a controlled write into kernel memory.

A more logical place to put this might be when we know an oops has occurred,
before we call do_exit(), but that would involve changing every architecture, in
multiple places. Let's just stick it in do_exit instead.

Signed-off-by: Nelson Elhage <[email protected]>
---
kernel/exit.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 21aa7b3..68899b3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -914,6 +914,14 @@ NORET_TYPE void do_exit(long code)
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");

+ /*
+ * If do_exit is called because this processes oopsed, it's possible
+ * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
+ * continuing. This is relevant at least for clearing clear_child_tid in
+ * mm_release.
+ */
+ set_fs(USER_DS);
+
tracehook_report_exit(&code);

validate_creds_for_do_exit(tsk);
--
1.7.1.31.g6297e

2010-12-01 02:50:37

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2] do_exit(): Make sure we run with get_fs() == USER_DS.

> If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
> otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c)
> do a put_user to a user-controlled address, potentially allowing a user to
> leverage an oops into a controlled write into kernel memory.
>
> A more logical place to put this might be when we know an oops has occurred,
> before we call do_exit(), but that would involve changing every architecture, in
> multiple places. Let's just stick it in do_exit instead.
>
> Signed-off-by: Nelson Elhage <[email protected]>
> ---
> kernel/exit.c | 8 ++++++++
> 1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 21aa7b3..68899b3 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -914,6 +914,14 @@ NORET_TYPE void do_exit(long code)
> if (unlikely(!tsk->pid))
> panic("Attempted to kill the idle task!");
>
> + /*
> + * If do_exit is called because this processes oopsed, it's possible
> + * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> + * continuing. This is relevant at least for clearing clear_child_tid in
> + * mm_release.
> + */
> + set_fs(USER_DS);

"This is relevant" is no good explanation ;)
Please recognize this is tricky code and Please consider to write more
careful and looooong comments.

Thanks.

2010-12-02 00:30:10

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH v2] do_exit(): Make sure we run with get_fs() == USER_DS.

On Wed, 1 Dec 2010 11:50:32 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
> > otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c)
> > do a put_user to a user-controlled address, potentially allowing a user to
> > leverage an oops into a controlled write into kernel memory.
> >
> > A more logical place to put this might be when we know an oops has occurred,
> > before we call do_exit(), but that would involve changing every architecture, in
> > multiple places. Let's just stick it in do_exit instead.
> >
> > Signed-off-by: Nelson Elhage <[email protected]>
> > ---
> > kernel/exit.c | 8 ++++++++
> > 1 files changed, 8 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 21aa7b3..68899b3 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -914,6 +914,14 @@ NORET_TYPE void do_exit(long code)
> > if (unlikely(!tsk->pid))
> > panic("Attempted to kill the idle task!");
> >
> > + /*
> > + * If do_exit is called because this processes oopsed, it's possible
> > + * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> > + * continuing. This is relevant at least for clearing clear_child_tid in
> > + * mm_release.
> > + */
> > + set_fs(USER_DS);
>
> "This is relevant" is no good explanation ;)
> Please recognize this is tricky code and Please consider to write more
> careful and looooong comments.

I've seen worse comments. And occasionally none at all :)

Is this better?

--- a/kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds-fix
+++ a/kernel/exit.c
@@ -917,8 +917,9 @@ NORET_TYPE void do_exit(long code)
/*
* If do_exit is called because this processes oopsed, it's possible
* that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
- * continuing. This is relevant at least for clearing clear_child_tid in
- * mm_release.
+ * continuing. Amongst other possible reasons, this is to prevent
+ * mm_release()->clear_child_tid() from writing to a user-controlled
+ * kernel address.
*/
set_fs(USER_DS);

_

2010-12-02 00:48:06

by KOSAKI Motohiro

[permalink] [raw]

Subject: Re: [PATCH v2] do_exit(): Make sure we run with get_fs() == USER_DS.

> > > diff --git a/kernel/exit.c b/kernel/exit.c
> > > index 21aa7b3..68899b3 100644
> > > --- a/kernel/exit.c
> > > +++ b/kernel/exit.c
> > > @@ -914,6 +914,14 @@ NORET_TYPE void do_exit(long code)
> > > if (unlikely(!tsk->pid))
> > > panic("Attempted to kill the idle task!");
> > >
> > > + /*
> > > + * If do_exit is called because this processes oopsed, it's possible
> > > + * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> > > + * continuing. This is relevant at least for clearing clear_child_tid in
> > > + * mm_release.
> > > + */
> > > + set_fs(USER_DS);
> >
> > "This is relevant" is no good explanation ;)
> > Please recognize this is tricky code and Please consider to write more
> > careful and looooong comments.
>
> I've seen worse comments. And occasionally none at all :)
>
> Is this better?
>
> --- a/kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds-fix
> +++ a/kernel/exit.c
> @@ -917,8 +917,9 @@ NORET_TYPE void do_exit(long code)
> /*
> * If do_exit is called because this processes oopsed, it's possible
> * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> - * continuing. This is relevant at least for clearing clear_child_tid in
> - * mm_release.
> + * continuing. Amongst other possible reasons, this is to prevent
> + * mm_release()->clear_child_tid() from writing to a user-controlled
> + * kernel address.
> */

Great!

2010-12-02 01:12:40

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH v2] do_exit(): Make sure we run with get_fs() == USER_DS.

On Tue, 30 Nov 2010 21:27:36 -0500
Nelson Elhage <[email protected]> wrote:

> If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
> otherwise reset before do_exit(). do_exit may later (via mm_release in fork.c)
> do a put_user to a user-controlled address, potentially allowing a user to
> leverage an oops into a controlled write into kernel memory.
>
> A more logical place to put this might be when we know an oops has occurred,
> before we call do_exit(), but that would involve changing every architecture, in
> multiple places. Let's just stick it in do_exit instead.
>
> Signed-off-by: Nelson Elhage <[email protected]>
> ---
> kernel/exit.c | 8 ++++++++
> 1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 21aa7b3..68899b3 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -914,6 +914,14 @@ NORET_TYPE void do_exit(long code)
> if (unlikely(!tsk->pid))
> panic("Attempted to kill the idle task!");
>
> + /*
> + * If do_exit is called because this processes oopsed, it's possible
> + * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> + * continuing. This is relevant at least for clearing clear_child_tid in
> + * mm_release.
> + */
> + set_fs(USER_DS);
> +
> tracehook_report_exit(&code);
>
> validate_creds_for_do_exit(tsk);

I think that the potential of escalating an oops or a BUG into a local
root hole is pretty serious so I'll send this fix along for 2.6.37 and
I tagged it for -stable backporting, along with a sterner-sounding
changelog.

From: Nelson Elhage <[email protected]>

If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
otherwise reset before do_exit(). do_exit may later (via mm_release in
fork.c) do a put_user to a user-controlled address, potentially allowing a
user to leverage an oops into a controlled write into kernel memory.

This is only triggerable in the presence of another bug, but this
potentially turns a lot of DoS bugs into privilege escalations, so it's
worth fixing. I have proof-of-concept code which uses this bug along with
CVE-2010-3849 to write a zero to an arbitrary kernel address, so I've
tested that this is not theoretical.

A more logical place to put this fix might be when we know an oops has
occurred, before we call do_exit(), but that would involve changing every
architecture, in multiple places. Let's just stick it in do_exit instead.

[[email protected]: update code comment]
Signed-off-by: Nelson Elhage <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

kernel/exit.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff -puN kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds kernel/exit.c
--- a/kernel/exit.c~do_exit-make-sure-we-run-with-get_fs-==-user_ds
+++ a/kernel/exit.c
@@ -914,6 +914,15 @@ NORET_TYPE void do_exit(long code)
if (unlikely(!tsk->pid))
panic("Attempted to kill the idle task!");

+ /*
+ * If do_exit is called because this processes oopsed, it's possible
+ * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
+ * continuing. Amongst other possible reasons, this is to prevent
+ * mm_release()->clear_child_tid() from writing to a user-controlled
+ * kernel address.
+ */
+ set_fs(USER_DS);
+
tracehook_report_exit(&code);

validate_creds_for_do_exit(tsk);
_