From:   ebiederm@xmission.com (Eric W. Biederman)
To:     Michal =?utf-8?Q?Koutn=C3=BD?= <mkoutny@suse.com>
Cc:     linux-kernel@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
        Kees Cook <keescook@chromium.org>,
        Oleg Nesterov <oleg@redhat.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Jim Newsome <jnewsome@torproject.org>,
        Alexey Gladkov <legion@kernel.org>, Tejun Heo <tj@kernel.org>,
        <security@kernel.org>, Andy Lutomirski <luto@amacapital.net>,
        Jann Horn <jannh@google.com>
References: <20211208180501.11969-1-mkoutny@suse.com>
Date:   Wed, 08 Dec 2021 12:45:54 -0600
In-Reply-To: <20211208180501.11969-1-mkoutny@suse.com> ("Michal
 \=\?utf-8\?Q\?Koutn\=C3\=BD\=22's\?\=
        message of "Wed, 8 Dec 2021 19:05:01 +0100")
Message-ID: <87sfv3540t.fsf@email.froward.int.ebiederm.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8BIT
Subject: Re: [PATCH] exit: Retain nsproxy for exit_task_work() work entries
Precedence: bulk


Adding the security list and a couple of other people because I feel
like I just opened pandora's box, and I don't know how bad this issue
is.

TL;DR the cgroup file system is checking permissions at write time.

Michal Koutný <mkoutny@suse.com> writes:

> The reported issue is an attempted write in a cgroup file, by a zombie
> who has/had acct(2) enabled. Such a write needs cgroup_ns for access
> checking. Ordinary acct_process() would not be affected by this as it is
> called well before exit_task_namespaces(). However, the reported NULL
> dereference is a different acct data writer:
>
> 	Call Trace:
> 	 <TASK>
> 	 kernfs_fop_write_iter+0x3b6/0x510 fs/kernfs/file.c:296
> 	 __kernel_write+0x5d1/0xaf0 fs/read_write.c:535
> 	 do_acct_process+0x112a/0x17b0 kernel/acct.c:518
> 	 acct_pin_kill+0x27/0x130 kernel/acct.c:173
> 	 pin_kill+0x2a6/0x940 fs/fs_pin.c:44
> 	 mnt_pin_kill+0xc1/0x170 fs/fs_pin.c:81
> 	 cleanup_mnt+0x4bc/0x510 fs/namespace.c:1130
> 	 task_work_run+0x146/0x1c0 kernel/task_work.c:164
> 	 exit_task_work include/linux/task_work.h:32 [inline]
> 	 do_exit+0x705/0x24f0 kernel/exit.c:832
> 	 do_group_exit+0x168/0x2d0 kernel/exit.c:929
> 	 get_signal+0x16b0/0x2090 kernel/signal.c:2820
> 	 arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
> 	 handle_signal_work kernel/entry/common.c:148 [inline]
> 	 exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
> 	 exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
> 	 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
> 	 syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
> 	 do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
> 	 entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> i.e. called as one of task_work_run() entries.
>
> The historical commit 8aac62706ada ("move exit_task_namespaces() outside
> of exit_notify()") argues that exit_task_namespaces() must come before
> exit_task_work() because ipc_ns cleanup calls fput/task_work_add.
>
> There is accompanying commit e7b2c4069252 ("fput: task_work_add() can
> fail if the caller has passed exit_task_work()") in the original series
> that makes fput() robust in situations when task_work_add() cannot be
> used anymore.
>
> So in order to ensure that task_work_run() entries of the exiting task
> have the nsproxy still available, swap the order of
> exit_task_namespaces() and exit_task_work().
>
> This change may appear like a partial revert of 8aac62706ada but this
> particular ordering change shouldn't matter with the fix from
> e7b2c4069252 and the other reason for 8aac62706ada (keeping exit_notify
> simpler) still holds.

I think I follow your reasoning and I think it will even fix the issue
but no.  That is completely very much not the correct fix.

That permission check in cgroup_file_write is just plain wrong.

Permission checks on files need to happen at open time, not at
write time.  It is all too easy to confuse something that writes
to stdout to write to a file of your choosing to make permission
checking at write time a good idea.

It looks like the this issue was introduced in commit 5136f6365ce3
("cgroup: implement "nsdelegate" mount option").

I wish I knew where I was when that patch was posted for review so
I could suggest a better implementation. 

That said I can't quite tell if the test should be moved into
cgroup_file_open or if there is a permission entry that would work.

Oh bleep!

The likely named cgroup_procs_write_permission isn't a permission method
at all.  It is called from cgroup_procs_write which I believe is
called from cgroup_file_write.

I may be wrong but at first glance this looks like the cgroup code is
going to need significant surgery to get the permission checks happening
at open time where they belong.

Please don't apply this patch.

exit_task_work running after exit_task_namespaces is the messenger
that just told us about something ugly.

Eric

> Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com
> Cc: Oleg Nesterov <oleg@redhat.com>
> Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>  kernel/exit.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> I wasn't able to reproduce the syzbot's crash manually so the effectiveness of
> the fix is only based on the reasoning.
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f702a6a63686..0c2abdebb87c 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -828,8 +828,8 @@ void __noreturn do_exit(long code)
>  	exit_fs(tsk);
>  	if (group_dead)
>  		disassociate_ctty(1);
> -	exit_task_namespaces(tsk);
>  	exit_task_work(tsk);
> +	exit_task_namespaces(tsk);
>  	exit_thread(tsk);
>  
>  	/*