Received: by 2002:a25:c205:0:0:0:0:0 with SMTP id s5csp3174866ybf; Tue, 3 Mar 2020 00:44:58 -0800 (PST) X-Google-Smtp-Source: ADFU+vub8zzNYN1IJ48T6qIwpVyeu7a8drDGh89o++3OQ1XjP1D4Uk+ha1HYOoC3F//brgWdMNSD X-Received: by 2002:aca:af97:: with SMTP id y145mr1678754oie.24.1583225097950; Tue, 03 Mar 2020 00:44:57 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1583225097; cv=none; d=google.com; s=arc-20160816; b=G7ahR8FfnVYg9EmYinJk5g1iZHbtHnQIC1iaL42tGlv4W6oZM+Kg/hCETTR97dMFls 5zVnZ1eiqGtuow0XY8r7gRQui+PIl14gfd/kBPktwWDu67J2J8DeuEEesoCFHGc6/rKN n4dk3p8LChghZIuoVbCGcAGqyOOYsZaZVZlcpE3U3nJ3o3yUwUU0MtWe43w0hNBjjTYp aXlW547wOEDgeF8DzoAf94S/LbCmxfHaZAw0OqYib3ig+iUufIwpEny4ujLlE7FeQ9Gt eT+Je/G631BoJPEpaWX+k+SWmt2U7ogAhUCJ25ZdndIuJIO0s3+3wABIujCzxEc0g6gJ CCqg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date; bh=VR3snYY8Nee1eON7cn2zfEIO33wjAIPM0HlJfFlScmo=; b=QV/Tds/ptAT/3xw3ABEmeLIJPCrvjTHHr3oZIJLI4PMkSqT+nhS2k3J1usEuUrF2tK G5n+i6f80dAIo3w9UJfEETlsoM9S56FeUmlhly30y7U3bVdFiGJddsEzSwICkIdKCOSS jWT3xM4xGVEN8Um5hq3KmYubQCFcPM2DXBIvswpm4z0DNH1EmNg9YnJVjQp6VN0ylo34 /rYMIMGp0S7ZzRfJq+ylXDPkmQvDJHZ39wqI8lxNmtG+xiscuLTxe8M+Nb6BEcd/AwXh j6eNCDDVpD9CRkc0VKF85vV5beMsJY0aX5rYguVJBLsfJ4wn4PQ8kt5HecMYEjlf/f2w KB/A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id t12si1188828otr.199.2020.03.03.00.44.45; Tue, 03 Mar 2020 00:44:57 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727659AbgCCIoY (ORCPT + 99 others); Tue, 3 Mar 2020 03:44:24 -0500 Received: from youngberry.canonical.com ([91.189.89.112]:34477 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726917AbgCCIoY (ORCPT ); Tue, 3 Mar 2020 03:44:24 -0500 Received: from ip5f5bf7ec.dynamic.kabel-deutschland.de ([95.91.247.236] helo=wittgenstein) by youngberry.canonical.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1j939X-0004Ox-Q8; Tue, 03 Mar 2020 08:43:47 +0000 Date: Tue, 3 Mar 2020 09:43:46 +0100 From: Christian Brauner To: Bernd Edlinger Cc: Kees Cook , "Eric W. Biederman" , Jann Horn , Jonathan Corbet , Alexander Viro , Andrew Morton , Alexey Dobriyan , Thomas Gleixner , Oleg Nesterov , Frederic Weisbecker , Andrei Vagin , Ingo Molnar , "Peter Zijlstra (Intel)" , Yuyang Du , David Hildenbrand , Sebastian Andrzej Siewior , Anshuman Khandual , David Howells , James Morris , Greg Kroah-Hartman , Shakeel Butt , Jason Gunthorpe , Christian Kellner , Andrea Arcangeli , Aleksa Sarai , "Dmitry V. Levin" , "linux-doc@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-fsdevel@vger.kernel.org" , "linux-mm@kvack.org" , "stable@vger.kernel.org" Subject: Re: [PATCHv4] exec: Fix a deadlock in ptrace Message-ID: <20200303084346.jympckv5wjki3orb@wittgenstein> References: <875zfmloir.fsf@x220.int.ebiederm.org> <87v9nmjulm.fsf@x220.int.ebiederm.org> <202003021531.C77EF10@keescook> <202003022103.196C313623@keescook> <20200303083425.jbf43axuymttijfv@wittgenstein> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20200303083425.jbf43axuymttijfv@wittgenstein> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 03, 2020 at 09:34:26AM +0100, Christian Brauner wrote: > On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote: > > On 3/3/20 6:29 AM, Kees Cook wrote: > > > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote: > > >> On 3/3/20 3:26 AM, Kees Cook wrote: > > >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote: > > >>>> [...] > > >>> > > >>> If I'm reading this patch correctly, this changes the lifetime of the > > >>> cred_guard_mutex lock to be: > > >>> - during prepare_bprm_creds() > > >>> - from flush_old_exec() through install_exec_creds() > > >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through > > >>> install_exec_creds(). > > > > > > BTW, I think the effect of this change (i.e. my paragraph above) should > > > be distinctly called out in the commit log if this solution moves > > > forward. > > > > > > > Okay, will do. > > > > >>> That means, for example, that check_unsafe_exec()'s documented invariant > > >>> is violated: > > >>> /* > > >>> * determine how safe it is to execute the proposed program > > >>> * - the caller must hold ->cred_guard_mutex to protect against > > >>> * PTRACE_ATTACH or seccomp thread-sync > > >>> */ > > >> > > >> Oh, right, I haven't understood that hint... > > > > > > I know no_new_privs is checked there, but I haven't studied the > > > PTRACE_ATTACH part of that comment. If that is handled with the new > > > check, this comment should be updated. > > > > > > > Okay, I change that comment to: > > > > /* > > * determine how safe it is to execute the proposed program > > * - the caller must have set ->cred_locked_in_execve to protect against > > * PTRACE_ATTACH or seccomp thread-sync > > */ > > > > >>> I think it also means that the potentially multiple invocations > > >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and > > >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without > > >>> a lock (another place where current's no_new_privs is evaluated). > > >> > > >> So no_new_privs can change from 0->1, but should not > > >> when execve is running. > > >> > > >> As long as the calling thread is in execve it won't do this, > > >> and the only other place, where it may set for other threads > > >> is in seccomp_sync_threads, but that can easily be avoided see below. > > > > > > Yeah, everything was fine until I had to go complicate things with > > > TSYNC. ;) The real goal is making sure an exec cannot gain privs while > > > later gaining a seccomp filter from an unpriv process. The no_new_privs > > > flag was used to control this, but it required that the filter not get > > > applied during exec. > > > > > >>> Related, it also means that cred_guard_mutex is unheld for every > > >>> invocation of search_binary_handler() (which can loop via the previously > > >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden > > >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid() > > >>> currently.) > > >>> > > >>> For seccomp, the expectations about existing thread states risks races > > >>> too. There are two locks held for TSYNC: > > >>> - current->sighand->siglock is held to keep new threads from > > >>> appearing/disappearing, which would destroy filter refcounting and > > >>> lead to memory corruption. > > >> > > >> I don't understand what you mean here. > > >> How can this lead to memory corruption? > > > > > > Mainly this is a matter of how seccomp manages its filter hierarchy > > > (since the filters are shared through process ancestry), so if a thread > > > appears in the middle of TSYNC it may be racing another TSYNC and break > > > ancestry, leading to bad reference counting on process death, etc. > > > (Though, yes, with refcount_t now, things should never corrupt, just > > > waste memory.) > > > > > > > I assume for now, that the current->sighand->siglock held while iterating all > > threads is sufficient here. > > > > >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to > > >>> avoid no_new_privs and filter confusion during exec, which could > > >>> lead to exploitable setuid conditions (see below). > > >>> > > >>> Just racing a malicious thread during TSYNC is not a very strong > > >>> example (a malicious thread could do lots of fun things to "current" > > >>> before it ever got near calling TSYNC), but I think there is the risk > > >>> of mismatched/confused states that we don't want to allow. One is a > > >>> particularly bad state that could lead to privilege escalations (in the > > >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process > > >>> has a filter attached that silently fails a priv-dropping setuid call > > >>> and continues execution with elevated privs, it can be tricked into > > >>> doing bad things on behalf of the unprivileged parent, which was the > > >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]): > > >>> > > >>> thread A clones thread B > > >>> thread B starts setuid exec > > >>> thread A sets no_new_privs > > >>> thread A calls seccomp with TSYNC > > >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B > > >>> thread B passes check_unsafe_exec() with no_new_privs unset > > >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs > > >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B > > >>> thread B finishes exec, now running with elevated privs, a filter chosen > > >>> by thread A, _and_ nnp set (which doesn't matter) > > >>> > > >>> With the original locking, thread B will fail check_unsafe_exec() > > >>> because filter and nnp state are changed together, with "atomicity" > > >>> protected by the cred_guard_mutex. > > >>> > > >> > > >> Ah, good point, thanks! > > >> > > >> This can be fixed by checking current->signal->cred_locked_for_ptrace > > >> while the cred_guard_mutex is locked, like this for instance: > > >> > > >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c > > >> index b6ea3dc..377abf0 100644 > > >> --- a/kernel/seccomp.c > > >> +++ b/kernel/seccomp.c > > >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void) > > >> BUG_ON(!mutex_is_locked(¤t->signal->cred_guard_mutex)); > > >> assert_spin_locked(¤t->sighand->siglock); > > >> > > >> + if (current->signal->cred_locked_for_ptrace) > > >> + return -EAGAIN; > > >> + > > > > > > Hmm. I guess something like that could work. TSYNC expects to be able to > > > report _which_ thread wrecked the call, though... I wonder if in_execve > > > could be used to figure out the offending thread. Hm, nope, that would > > > be outside of lock too (and all users are "current" right now, so the > > > lock wasn't needed before). > > > > > > > I could move that in_execve = 1 to prepare_bprm_creds, if it really matters, > > but the caller will die quickly and cannot do anything with that information > > when another thread executes execve, right? > > > > >> /* Validate all threads being eligible for synchronization. */ > > >> caller = current; > > >> for_each_thread(caller, thread) { > > >> > > >> > > >>> And this is just the bad state I _can_ see. I'm worried there are more... > > >>> > > >>> All this said, I do see a small similarity here to the work I did to > > >>> stabilize stack rlimits (there was an ongoing problem with making multiple > > >>> decisions for the bprm based on current's state -- but current's state > > >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored > > >>> current's copy until exec ended and then stored bprm's copy into current. > > >>> If the only problem anyone can see here is the handling of no_new_privs, > > >>> we might be able to solve that similarly, at least disentangling tsync/nnp > > >>> from cred_guard_mutex. > > >>> > > >> > > >> I still think that is solvable with using cred_locked_for_ptrace and > > >> simply make the tsync fail if it would otherwise be blocked. > > > > > > I wonder if we can find a better name than "cred_locked_for_ptrace"? > > > Maybe "cred_unfinished" or "cred_locked_in_exec" or something? > > > > > > > Yeah, I'd go with "cred_locked_in_execve". > > > > > And the comment on bool cred_locked_for_ptrace should mention that > > > access is only allowed under cred_guard_mutex lock. > > > > > > > okay. > > > > >>>> + sig->cred_locked_for_ptrace = false; > > > > > > This is redundant to the zalloc -- I think you can drop it (unless > > > someone wants to keep it for clarify?) > > > > > > > I'll remove that here and in init/init_task.c > > > > > Also, I think cred_locked_for_ptrace needs checking deeper, in > > > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make > > > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to > > > be sufficient to see a stable version of the thread... > > > > > > > No, these need to be addressed individually, but most users just want > > to know if the current credentials are sufficient at this moment, but will > > not change the credentials, as ptrace and TSYNC do. > > > > BTW: Not all users have cred_guard_mutex, see mm/migrate.c, > > mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc. > > So adding an access to cred_locked_for_execve in ptrace_may_access is > > probably not an option. > > > > However, one nice added value by this change is this: > > > > void *thread(void *arg) > > { > > ptrace(PTRACE_TRACEME, 0,0,0); > > return NULL; > > } > > > > int main(void) > > { > > int pid = fork(); > > > > if (!pid) { > > pthread_t pt; > > pthread_create(&pt, NULL, thread, NULL); > > pthread_join(pt, NULL); > > execlp("echo", "echo", "passed", NULL); > > } > > > > sleep(1000); > > ptrace(PTRACE_ATTACH, pid, 0,0); > > kill(pid, SIGCONT); > > return 0; > > } > > > > cat /proc/3812/stack > > [<0>] flush_old_exec+0xbf/0x760 > > [<0>] load_elf_binary+0x35a/0x16c0 > > [<0>] search_binary_handler+0x97/0x1d0 > > [<0>] __do_execve_file.isra.40+0x624/0x920 > > [<0>] __x64_sys_execve+0x49/0x60 > > [<0>] do_syscall_64+0x64/0x220 > > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > > > > (I remain very nervous about weakening cred_guard_mutex without > > > addressing the many many users...) > > > > > > > They need to be looked at closely, that's pretty clear. > > Most fall in the class, that just the current credentials need > > to stay stable for a certain time. > > I remain rather set on wanting some very basic tests with this change. > Imho, looking through tools/testing/selftests again we don't have nearly > enough for these codepaths; not to say none. Basically, if someone wants > to make a change affecting the current problem we should really have at > least a single simple test/reproducer that can be run without digging > through lore. And hopefully over time we'll have more tests. Which you added in v4. Which is great! (I should've mentioned this in my first mail.) Christian