Received: by 2002:a25:c205:0:0:0:0:0 with SMTP id s5csp3169018ybf; Tue, 3 Mar 2020 00:36:06 -0800 (PST) X-Google-Smtp-Source: ADFU+vsflVvwkM4gc1+f5HTFM9GnlE9Wemh8AmgqhSptKFQZ6GDFvpWKZlV7i6ILyDSrkcHojTx6 X-Received: by 2002:a9d:a68:: with SMTP id 95mr2450157otg.87.1583224566748; Tue, 03 Mar 2020 00:36:06 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1583224566; cv=none; d=google.com; s=arc-20160816; b=y1C+uAB+nqm5Fr6Z4FIK1hqtgAQiiClXKEz1SRBVDtbVPWAEPNqgXfuOaXAsmBhoNH 9QZqPZh3uOjKqmGCqgRPxAMTpyS4wINoHgXBg29Gmcs+IsjPnsVfSXaMRpTL0DpiTCQX xqdq74iZleo70/v0v0HBjmtxM2t9Ye6rZ5662LAdvwyELTvnoAWCYYmR9EJqjXup7hNH q+h3RN6E7RJZPbHkX++ndaTW+O7a2BujZSO3n0s2Wn5WyZkknIubFuucxZaGTGA1N9np YsVWJ5QlT77F2Uz19+BH8iaj70J1+FHff03IwIdvmAqPK3pWIWzjaelK8UMyrhKyxmYJ Kp0Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date; bh=x9AvxkBz3jNvwzhFqPve6CECn9KaqFe4an6wUcPcZ3E=; b=mxQSwNn+Us7acrhK91TNC6103Mu9IzSWO/xpSkCPvbLWpvnYtaqk9X+YzHhB9uAE9a oZFIpiRBab6WaypXBX31+WVXpBO2WsFMk69PRpOR0jyLwtB4ne2j8tvWusgrRrn+4n3B ZU4ujVCJ44PD1lU4VsJxVEzNbxw3VcXqUBUgw+sbHsuwalo1zxpaODfo5Kfbmbhj00CM VT6daK3oPCLTpt/Z0RNJtsQpuvp5wVpRBK4CzShGetJO/3v0zUWam4XXVWhInnhxwmIp 2QcrNz61vpUaMjdJ6/omVw9I0Gp4fzSW2+ZD7QHqVzgzPtnrapeenHrTwrilpBTzYosZ 2ghA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id q3si3019506otc.141.2020.03.03.00.35.55; Tue, 03 Mar 2020 00:36:06 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727441AbgCCIfL (ORCPT + 99 others); Tue, 3 Mar 2020 03:35:11 -0500 Received: from youngberry.canonical.com ([91.189.89.112]:34351 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725879AbgCCIfK (ORCPT ); Tue, 3 Mar 2020 03:35:10 -0500 Received: from ip5f5bf7ec.dynamic.kabel-deutschland.de ([95.91.247.236] helo=wittgenstein) by youngberry.canonical.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1j930U-0003nu-D0; Tue, 03 Mar 2020 08:34:26 +0000 Date: Tue, 3 Mar 2020 09:34:25 +0100 From: Christian Brauner To: Bernd Edlinger Cc: Kees Cook , "Eric W. Biederman" , Jann Horn , Jonathan Corbet , Alexander Viro , Andrew Morton , Alexey Dobriyan , Thomas Gleixner , Oleg Nesterov , Frederic Weisbecker , Andrei Vagin , Ingo Molnar , "Peter Zijlstra (Intel)" , Yuyang Du , David Hildenbrand , Sebastian Andrzej Siewior , Anshuman Khandual , David Howells , James Morris , Greg Kroah-Hartman , Shakeel Butt , Jason Gunthorpe , Christian Kellner , Andrea Arcangeli , Aleksa Sarai , "Dmitry V. Levin" , "linux-doc@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "linux-fsdevel@vger.kernel.org" , "linux-mm@kvack.org" , "stable@vger.kernel.org" Subject: Re: [PATCHv4] exec: Fix a deadlock in ptrace Message-ID: <20200303083425.jbf43axuymttijfv@wittgenstein> References: <87k142lpfz.fsf@x220.int.ebiederm.org> <875zfmloir.fsf@x220.int.ebiederm.org> <87v9nmjulm.fsf@x220.int.ebiederm.org> <202003021531.C77EF10@keescook> <202003022103.196C313623@keescook> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote: > On 3/3/20 6:29 AM, Kees Cook wrote: > > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote: > >> On 3/3/20 3:26 AM, Kees Cook wrote: > >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote: > >>>> [...] > >>> > >>> If I'm reading this patch correctly, this changes the lifetime of the > >>> cred_guard_mutex lock to be: > >>> - during prepare_bprm_creds() > >>> - from flush_old_exec() through install_exec_creds() > >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through > >>> install_exec_creds(). > > > > BTW, I think the effect of this change (i.e. my paragraph above) should > > be distinctly called out in the commit log if this solution moves > > forward. > > > > Okay, will do. > > >>> That means, for example, that check_unsafe_exec()'s documented invariant > >>> is violated: > >>> /* > >>> * determine how safe it is to execute the proposed program > >>> * - the caller must hold ->cred_guard_mutex to protect against > >>> * PTRACE_ATTACH or seccomp thread-sync > >>> */ > >> > >> Oh, right, I haven't understood that hint... > > > > I know no_new_privs is checked there, but I haven't studied the > > PTRACE_ATTACH part of that comment. If that is handled with the new > > check, this comment should be updated. > > > > Okay, I change that comment to: > > /* > * determine how safe it is to execute the proposed program > * - the caller must have set ->cred_locked_in_execve to protect against > * PTRACE_ATTACH or seccomp thread-sync > */ > > >>> I think it also means that the potentially multiple invocations > >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and > >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without > >>> a lock (another place where current's no_new_privs is evaluated). > >> > >> So no_new_privs can change from 0->1, but should not > >> when execve is running. > >> > >> As long as the calling thread is in execve it won't do this, > >> and the only other place, where it may set for other threads > >> is in seccomp_sync_threads, but that can easily be avoided see below. > > > > Yeah, everything was fine until I had to go complicate things with > > TSYNC. ;) The real goal is making sure an exec cannot gain privs while > > later gaining a seccomp filter from an unpriv process. The no_new_privs > > flag was used to control this, but it required that the filter not get > > applied during exec. > > > >>> Related, it also means that cred_guard_mutex is unheld for every > >>> invocation of search_binary_handler() (which can loop via the previously > >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden > >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid() > >>> currently.) > >>> > >>> For seccomp, the expectations about existing thread states risks races > >>> too. There are two locks held for TSYNC: > >>> - current->sighand->siglock is held to keep new threads from > >>> appearing/disappearing, which would destroy filter refcounting and > >>> lead to memory corruption. > >> > >> I don't understand what you mean here. > >> How can this lead to memory corruption? > > > > Mainly this is a matter of how seccomp manages its filter hierarchy > > (since the filters are shared through process ancestry), so if a thread > > appears in the middle of TSYNC it may be racing another TSYNC and break > > ancestry, leading to bad reference counting on process death, etc. > > (Though, yes, with refcount_t now, things should never corrupt, just > > waste memory.) > > > > I assume for now, that the current->sighand->siglock held while iterating all > threads is sufficient here. > > >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to > >>> avoid no_new_privs and filter confusion during exec, which could > >>> lead to exploitable setuid conditions (see below). > >>> > >>> Just racing a malicious thread during TSYNC is not a very strong > >>> example (a malicious thread could do lots of fun things to "current" > >>> before it ever got near calling TSYNC), but I think there is the risk > >>> of mismatched/confused states that we don't want to allow. One is a > >>> particularly bad state that could lead to privilege escalations (in the > >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process > >>> has a filter attached that silently fails a priv-dropping setuid call > >>> and continues execution with elevated privs, it can be tricked into > >>> doing bad things on behalf of the unprivileged parent, which was the > >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]): > >>> > >>> thread A clones thread B > >>> thread B starts setuid exec > >>> thread A sets no_new_privs > >>> thread A calls seccomp with TSYNC > >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B > >>> thread B passes check_unsafe_exec() with no_new_privs unset > >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs > >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B > >>> thread B finishes exec, now running with elevated privs, a filter chosen > >>> by thread A, _and_ nnp set (which doesn't matter) > >>> > >>> With the original locking, thread B will fail check_unsafe_exec() > >>> because filter and nnp state are changed together, with "atomicity" > >>> protected by the cred_guard_mutex. > >>> > >> > >> Ah, good point, thanks! > >> > >> This can be fixed by checking current->signal->cred_locked_for_ptrace > >> while the cred_guard_mutex is locked, like this for instance: > >> > >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c > >> index b6ea3dc..377abf0 100644 > >> --- a/kernel/seccomp.c > >> +++ b/kernel/seccomp.c > >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void) > >> BUG_ON(!mutex_is_locked(¤t->signal->cred_guard_mutex)); > >> assert_spin_locked(¤t->sighand->siglock); > >> > >> + if (current->signal->cred_locked_for_ptrace) > >> + return -EAGAIN; > >> + > > > > Hmm. I guess something like that could work. TSYNC expects to be able to > > report _which_ thread wrecked the call, though... I wonder if in_execve > > could be used to figure out the offending thread. Hm, nope, that would > > be outside of lock too (and all users are "current" right now, so the > > lock wasn't needed before). > > > > I could move that in_execve = 1 to prepare_bprm_creds, if it really matters, > but the caller will die quickly and cannot do anything with that information > when another thread executes execve, right? > > >> /* Validate all threads being eligible for synchronization. */ > >> caller = current; > >> for_each_thread(caller, thread) { > >> > >> > >>> And this is just the bad state I _can_ see. I'm worried there are more... > >>> > >>> All this said, I do see a small similarity here to the work I did to > >>> stabilize stack rlimits (there was an ongoing problem with making multiple > >>> decisions for the bprm based on current's state -- but current's state > >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored > >>> current's copy until exec ended and then stored bprm's copy into current. > >>> If the only problem anyone can see here is the handling of no_new_privs, > >>> we might be able to solve that similarly, at least disentangling tsync/nnp > >>> from cred_guard_mutex. > >>> > >> > >> I still think that is solvable with using cred_locked_for_ptrace and > >> simply make the tsync fail if it would otherwise be blocked. > > > > I wonder if we can find a better name than "cred_locked_for_ptrace"? > > Maybe "cred_unfinished" or "cred_locked_in_exec" or something? > > > > Yeah, I'd go with "cred_locked_in_execve". > > > And the comment on bool cred_locked_for_ptrace should mention that > > access is only allowed under cred_guard_mutex lock. > > > > okay. > > >>>> + sig->cred_locked_for_ptrace = false; > > > > This is redundant to the zalloc -- I think you can drop it (unless > > someone wants to keep it for clarify?) > > > > I'll remove that here and in init/init_task.c > > > Also, I think cred_locked_for_ptrace needs checking deeper, in > > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make > > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to > > be sufficient to see a stable version of the thread... > > > > No, these need to be addressed individually, but most users just want > to know if the current credentials are sufficient at this moment, but will > not change the credentials, as ptrace and TSYNC do. > > BTW: Not all users have cred_guard_mutex, see mm/migrate.c, > mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc. > So adding an access to cred_locked_for_execve in ptrace_may_access is > probably not an option. > > However, one nice added value by this change is this: > > void *thread(void *arg) > { > ptrace(PTRACE_TRACEME, 0,0,0); > return NULL; > } > > int main(void) > { > int pid = fork(); > > if (!pid) { > pthread_t pt; > pthread_create(&pt, NULL, thread, NULL); > pthread_join(pt, NULL); > execlp("echo", "echo", "passed", NULL); > } > > sleep(1000); > ptrace(PTRACE_ATTACH, pid, 0,0); > kill(pid, SIGCONT); > return 0; > } > > cat /proc/3812/stack > [<0>] flush_old_exec+0xbf/0x760 > [<0>] load_elf_binary+0x35a/0x16c0 > [<0>] search_binary_handler+0x97/0x1d0 > [<0>] __do_execve_file.isra.40+0x624/0x920 > [<0>] __x64_sys_execve+0x49/0x60 > [<0>] do_syscall_64+0x64/0x220 > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > (I remain very nervous about weakening cred_guard_mutex without > > addressing the many many users...) > > > > They need to be looked at closely, that's pretty clear. > Most fall in the class, that just the current credentials need > to stay stable for a certain time. I remain rather set on wanting some very basic tests with this change. Imho, looking through tools/testing/selftests again we don't have nearly enough for these codepaths; not to say none. Basically, if someone wants to make a change affecting the current problem we should really have at least a single simple test/reproducer that can be run without digging through lore. And hopefully over time we'll have more tests. Christian