2017-12-14 20:18:31

by Cong Wang

[permalink] [raw]
Subject: [PATCH] exit: move exit_task_namespaces() after exit_task_work()

syzbot reported we have a use-after-free when mqueue_evict_inode()
is called on __cleanup_mnt() path, where the ipc ns is already
freed by the previous exit_task_namespaces(). We can just move
it after after exit_task_work() to avoid this use-after-free.

Reported-by: syzbot <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Signed-off-by: Cong Wang <[email protected]>
---
kernel/exit.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 6b4298a41167..909e43c45158 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -861,8 +861,8 @@ void __noreturn do_exit(long code)
exit_fs(tsk);
if (group_dead)
disassociate_ctty(1);
- exit_task_namespaces(tsk);
exit_task_work(tsk);
+ exit_task_namespaces(tsk);
exit_thread(tsk);

/*
--
2.13.0


2017-12-14 21:08:22

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work()

On Thu, Dec 14, 2017 at 12:17:57PM -0800, Cong Wang wrote:
> syzbot reported we have a use-after-free when mqueue_evict_inode()
> is called on __cleanup_mnt() path, where the ipc ns is already
> freed by the previous exit_task_namespaces(). We can just move
> it after after exit_task_work() to avoid this use-after-free.

What's to prevent somebody else holding a reference to the same
inode past the exit(2)? IOW, I don't believe that this is fixing
anything - in the best case, your patch papers over a specific
reproducer.

2017-12-15 06:57:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work()

Cong Wang <[email protected]> writes:

> syzbot reported we have a use-after-free when mqueue_evict_inode()
> is called on __cleanup_mnt() path, where the ipc ns is already
> freed by the previous exit_task_namespaces(). We can just move
> it after after exit_task_work() to avoid this use-after-free.

How does that possibly work. (I haven't seen this syzbot report).

Looking at the code we have get_ns_from_inode. Which takes the mq_lock,
sees if the pointer is NULL and takes a reference if it is non-NULL.

Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held
when the count drops to zero.

Where is the race in that?

The rest of mqueue_evict_inode uses the returned pointer and
tests that the pointer is non-NULL before user it.

So either szbot is giving you a bad report or there is a subtle race
there I am not seeing. The change below is not at all the proper way to
fix a subtle race.

Eric


>
> Reported-by: syzbot <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: [email protected]
> Signed-off-by: Cong Wang <[email protected]>
> ---
> kernel/exit.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 6b4298a41167..909e43c45158 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -861,8 +861,8 @@ void __noreturn do_exit(long code)
> exit_fs(tsk);
> if (group_dead)
> disassociate_ctty(1);
> - exit_task_namespaces(tsk);
> exit_task_work(tsk);
> + exit_task_namespaces(tsk);
> exit_thread(tsk);
>
> /*

2017-12-15 07:43:40

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work()

On Fri, Dec 15, 2017 at 7:56 AM, Eric W. Biederman
<[email protected]> wrote:
> Cong Wang <[email protected]> writes:
>
>> syzbot reported we have a use-after-free when mqueue_evict_inode()
>> is called on __cleanup_mnt() path, where the ipc ns is already
>> freed by the previous exit_task_namespaces(). We can just move
>> it after after exit_task_work() to avoid this use-after-free.
>
> How does that possibly work. (I haven't seen this syzbot report).
>
> Looking at the code we have get_ns_from_inode. Which takes the mq_lock,
> sees if the pointer is NULL and takes a reference if it is non-NULL.
>
> Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held
> when the count drops to zero.
>
> Where is the race in that?
>
> The rest of mqueue_evict_inode uses the returned pointer and
> tests that the pointer is non-NULL before user it.
>
> So either szbot is giving you a bad report or there is a subtle race
> there I am not seeing. The change below is not at all the proper way to
> fix a subtle race.
>
> Eric

Cong, what was that report? Searching by
"exit_task_work|exit_task_namespaces" there are too many of them:
https://groups.google.com/forum/#!searchin/syzkaller-bugs/%22exit_task_work$7Cexit_task_namespaces%22%7Csort:date

I can only say that syzbot does not make up reports. That's something
that actually happened and was provoked by userspace.



>> Reported-by: syzbot <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Al Viro <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Cong Wang <[email protected]>
>> ---
>> kernel/exit.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 6b4298a41167..909e43c45158 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -861,8 +861,8 @@ void __noreturn do_exit(long code)
>> exit_fs(tsk);
>> if (group_dead)
>> disassociate_ctty(1);
>> - exit_task_namespaces(tsk);
>> exit_task_work(tsk);
>> + exit_task_namespaces(tsk);
>> exit_thread(tsk);
>>
>> /*

2017-12-15 08:01:03

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work()

On Fri, Dec 15, 2017 at 8:35 AM, Dmitry Vyukov <[email protected]> wrote:
> On Fri, Dec 15, 2017 at 7:56 AM, Eric W. Biederman
> <[email protected]> wrote:
>> Cong Wang <[email protected]> writes:
>>
>>> syzbot reported we have a use-after-free when mqueue_evict_inode()
>>> is called on __cleanup_mnt() path, where the ipc ns is already
>>> freed by the previous exit_task_namespaces(). We can just move
>>> it after after exit_task_work() to avoid this use-after-free.
>>
>> How does that possibly work. (I haven't seen this syzbot report).
>>
>> Looking at the code we have get_ns_from_inode. Which takes the mq_lock,
>> sees if the pointer is NULL and takes a reference if it is non-NULL.
>>
>> Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held
>> when the count drops to zero.
>>
>> Where is the race in that?
>>
>> The rest of mqueue_evict_inode uses the returned pointer and
>> tests that the pointer is non-NULL before user it.
>>
>> So either szbot is giving you a bad report or there is a subtle race
>> there I am not seeing. The change below is not at all the proper way to
>> fix a subtle race.
>>
>> Eric
>
> Cong, what was that report? Searching by
> "exit_task_work|exit_task_namespaces" there are too many of them:
> https://groups.google.com/forum/#!searchin/syzkaller-bugs/%22exit_task_work$7Cexit_task_namespaces%22%7Csort:date
>
> I can only say that syzbot does not make up reports. That's something
> that actually happened and was provoked by userspace.


Ah, found that bug:
https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ




>>> Reported-by: syzbot <[email protected]>
>>> Cc: Ingo Molnar <[email protected]>
>>> Cc: Al Viro <[email protected]>
>>> Cc: Andrew Morton <[email protected]>
>>> Cc: Linus Torvalds <[email protected]>
>>> Cc: [email protected]
>>> Signed-off-by: Cong Wang <[email protected]>
>>> ---
>>> kernel/exit.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/exit.c b/kernel/exit.c
>>> index 6b4298a41167..909e43c45158 100644
>>> --- a/kernel/exit.c
>>> +++ b/kernel/exit.c
>>> @@ -861,8 +861,8 @@ void __noreturn do_exit(long code)
>>> exit_fs(tsk);
>>> if (group_dead)
>>> disassociate_ctty(1);
>>> - exit_task_namespaces(tsk);
>>> exit_task_work(tsk);
>>> + exit_task_namespaces(tsk);
>>> exit_thread(tsk);
>>>
>>> /*

2017-12-16 00:00:10

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work()

On Thu, Dec 14, 2017 at 1:08 PM, Al Viro <[email protected]> wrote:
> On Thu, Dec 14, 2017 at 12:17:57PM -0800, Cong Wang wrote:
>> syzbot reported we have a use-after-free when mqueue_evict_inode()
>> is called on __cleanup_mnt() path, where the ipc ns is already
>> freed by the previous exit_task_namespaces(). We can just move
>> it after after exit_task_work() to avoid this use-after-free.
>
> What's to prevent somebody else holding a reference to the same
> inode past the exit(2)? IOW, I don't believe that this is fixing
> anything - in the best case, your patch papers over a specific
> reproducer.

You are right, I missed mq_clear_sbinfo().

And the offending commit is:

commit 9c583773d036336176e9e50441890659bc4eeae8
Author: Giuseppe Scrivano <[email protected]>
Date: Fri Dec 15 01:06:28 2017 +0000

ipc, mqueue: lazy call kern_mount_data in new namespaces

kern_mount_data is a relatively expensive operation when creating a new
IPC namespace, so delay the mount until its first usage when not creating
the the global namespace.

This is a net saving for new IPC namespaces that don't use mq_open(). In
this case there won't be any kern_mount_data() cost at all.

On my machine, the time for creating 1000 new IPC namespaces dropped from
~8s to ~2s.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Giuseppe Scrivano <[email protected]>
Cc: Manfred Spraul <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2017-12-16 00:01:21

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work()

On Fri, Dec 15, 2017 at 12:00 AM, Dmitry Vyukov <[email protected]> wrote:
> On Fri, Dec 15, 2017 at 8:35 AM, Dmitry Vyukov <[email protected]> wrote:
>> On Fri, Dec 15, 2017 at 7:56 AM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Cong Wang <[email protected]> writes:
>>>
>>>> syzbot reported we have a use-after-free when mqueue_evict_inode()
>>>> is called on __cleanup_mnt() path, where the ipc ns is already
>>>> freed by the previous exit_task_namespaces(). We can just move
>>>> it after after exit_task_work() to avoid this use-after-free.
>>>
>>> How does that possibly work. (I haven't seen this syzbot report).
>>>
>>> Looking at the code we have get_ns_from_inode. Which takes the mq_lock,
>>> sees if the pointer is NULL and takes a reference if it is non-NULL.
>>>
>>> Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held
>>> when the count drops to zero.
>>>
>>> Where is the race in that?
>>>
>>> The rest of mqueue_evict_inode uses the returned pointer and
>>> tests that the pointer is non-NULL before user it.
>>>
>>> So either szbot is giving you a bad report or there is a subtle race
>>> there I am not seeing. The change below is not at all the proper way to
>>> fix a subtle race.
>>>
>>> Eric
>>
>> Cong, what was that report? Searching by
>> "exit_task_work|exit_task_namespaces" there are too many of them:
>> https://groups.google.com/forum/#!searchin/syzkaller-bugs/%22exit_task_work$7Cexit_task_namespaces%22%7Csort:date
>>
>> I can only say that syzbot does not make up reports. That's something
>> that actually happened and was provoked by userspace.
>
>
> Ah, found that bug:
> https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ

Yeah, and it is introduced by:

http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/commit/?id=9c583773d036336176e9e50441890659bc4eeae8

2017-12-16 07:40:48

by Dmitry Vyukov

[permalink] [raw]
Subject: Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work()

On Sat, Dec 16, 2017 at 1:00 AM, Cong Wang <[email protected]> wrote:
>>>>> syzbot reported we have a use-after-free when mqueue_evict_inode()
>>>>> is called on __cleanup_mnt() path, where the ipc ns is already
>>>>> freed by the previous exit_task_namespaces(). We can just move
>>>>> it after after exit_task_work() to avoid this use-after-free.
>>>>
>>>> How does that possibly work. (I haven't seen this syzbot report).
>>>>
>>>> Looking at the code we have get_ns_from_inode. Which takes the mq_lock,
>>>> sees if the pointer is NULL and takes a reference if it is non-NULL.
>>>>
>>>> Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held
>>>> when the count drops to zero.
>>>>
>>>> Where is the race in that?
>>>>
>>>> The rest of mqueue_evict_inode uses the returned pointer and
>>>> tests that the pointer is non-NULL before user it.
>>>>
>>>> So either szbot is giving you a bad report or there is a subtle race
>>>> there I am not seeing. The change below is not at all the proper way to
>>>> fix a subtle race.
>>>>
>>>> Eric
>>>
>>> Cong, what was that report? Searching by
>>> "exit_task_work|exit_task_namespaces" there are too many of them:
>>> https://groups.google.com/forum/#!searchin/syzkaller-bugs/%22exit_task_work$7Cexit_task_namespaces%22%7Csort:date
>>>
>>> I can only say that syzbot does not make up reports. That's something
>>> that actually happened and was provoked by userspace.
>>
>>
>> Ah, found that bug:
>> https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ
>
> Yeah, and it is introduced by:
>
> http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/commit/?id=9c583773d036336176e9e50441890659bc4eeae8


If it's going to be discussed here, note that it was reported by
syzbot and requires "#syz fix" tag after fixing here:
https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/jGDCMoz_AQAJ

2017-12-16 22:04:47

by Giuseppe Scrivano

[permalink] [raw]
Subject: Re: [PATCH] exit: move exit_task_namespaces() after exit_task_work()

Cong Wang <[email protected]> writes:

> On Thu, Dec 14, 2017 at 1:08 PM, Al Viro <[email protected]> wrote:
>> On Thu, Dec 14, 2017 at 12:17:57PM -0800, Cong Wang wrote:
>>> syzbot reported we have a use-after-free when mqueue_evict_inode()
>>> is called on __cleanup_mnt() path, where the ipc ns is already
>>> freed by the previous exit_task_namespaces(). We can just move
>>> it after after exit_task_work() to avoid this use-after-free.
>>
>> What's to prevent somebody else holding a reference to the same
>> inode past the exit(2)? IOW, I don't believe that this is fixing
>> anything - in the best case, your patch papers over a specific
>> reproducer.
>
> You are right, I missed mq_clear_sbinfo().

yes, the patch 9c583773d036336176e9e50441890659bc4eeae8 introduced an
issue since it doesn't reset s_fs_info when ns->mq_mnt == NULL.

The following patch solves the issue for me. I store the superblock in
"struct ipc_namespace" since we cannot assume "mq_mnt" exists.

Does the patch look fine? I'll clean it up and submit it separately if
this approach is correct.

Regards,
Giuseppe


diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 554e31494f69..66d4a5740a60 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -53,6 +53,7 @@ struct ipc_namespace {

/* The kern_mount of the mqueuefs sb. We take a ref on it */
struct vfsmount *mq_mnt;
+ struct super_block *mq_sb;

/* # queues in this ns, protected by mq_lock */
unsigned int mq_queues_count;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index f78d6687a61b..16347cf1de2d 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -341,6 +341,7 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
sb->s_root = d_make_root(inode);
if (!sb->s_root)
return -ENOMEM;
+ ns->mq_sb = sb;
return 0;
}

@@ -1554,6 +1555,7 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount)
ns->mq_msg_max = DFLT_MSGMAX;
ns->mq_msgsize_max = DFLT_MSGSIZEMAX;
ns->mq_msg_default = DFLT_MSG;
+ ns->mq_sb = NULL;
ns->mq_msgsize_default = DFLT_MSGSIZE;

if (!mount)
@@ -1573,8 +1575,8 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount)

void mq_clear_sbinfo(struct ipc_namespace *ns)
{
- if (ns->mq_mnt)
- ns->mq_mnt->mnt_sb->s_fs_info = NULL;
+ if (ns->mq_sb)
+ ns->mq_sb->s_fs_info = NULL;
}

void mq_put_mnt(struct ipc_namespace *ns)