This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated. They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.
The deadlock always happens when strace needs to access the
tracee's process mmap while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to have a second mutex that is
used in mm_access, so it is allowed to continue while the
dying threads are not yet terminated.
I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.
Signed-off-by: Bernd Edlinger <[email protected]>
---
Documentation/security/credentials.rst | 18 ++++++++++--------
fs/exec.c | 9 +++++++++
include/linux/binfmts.h | 6 +++++-
include/linux/sched/signal.h | 1 +
init/init_task.c | 1 +
kernel/cred.c | 2 +-
kernel/fork.c | 5 +++--
mm/process_vm_access.c | 2 +-
8 files changed, 31 insertions(+), 13 deletions(-)
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..c98e0a8 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,13 @@ new set of credentials by calling::
struct cred *prepare_creds(void);
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and the mutex
+current->signal->cred_change_mutex is acquired later, while the credentials
+and the process mmap are actually changed.
The mutex prevents ``ptrace()`` from altering the ptrace state of a process
while security checks on credentials construction and changing is taking place
@@ -466,9 +470,8 @@ by calling::
This will alter various aspects of the credentials and the process, giving the
LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
This function is guaranteed to return 0, so that it can be tail-called at the
end of such functions as ``sys_setresuid()``.
@@ -486,8 +489,7 @@ invoked::
void abort_creds(struct cred *new);
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..a6884e4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval)
goto out;
+ retval = mutex_lock_killable(&current->signal->cred_change_mutex);
+ if (retval)
+ goto out;
+
+ bprm->called_flush_old_exec = 1;
+
/*
* Must be called _before_ exec_mmap() as bprm->mm is
* not visibile until then. This also enables the update
@@ -1420,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
+ if (bprm->called_flush_old_exec)
+ mutex_unlock(&current->signal->cred_change_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
@@ -1469,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
+ mutex_unlock(&current->signal->cred_change_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2e1318b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
* exec has happened. Used to sanitize execution environment
* and to set AT_SECURE auxv for glibc.
*/
- secureexec:1;
+ secureexec:1,
+ /*
+ * Set by flush_old_exec, when the cred_change_mutex is taken.
+ */
+ called_flush_old_exec:1;
#ifdef __alpha__
unsigned int taso:1;
#endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..37eeabe 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace) */
+ struct mutex cred_change_mutex; /* guard against credentials change */
} __randomize_layout;
/*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..6cd9a0f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
.multiprocess = HLIST_HEAD_INIT,
.rlim = INIT_RLIMITS,
.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+ .cred_change_mutex = __MUTEX_INITIALIZER(init_signals.cred_change_mutex),
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
.cputimer = {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
*
* Returns the new credentials or NULL if out of memory.
*
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
*/
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..0395154 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
struct mm_struct *mm;
int err;
- err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ err = mutex_lock_killable(&task->signal->cred_change_mutex);
if (err)
return ERR_PTR(err);
@@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
mmput(mm);
mm = ERR_PTR(-EACCES);
}
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->cred_change_mutex);
return mm;
}
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
mutex_init(&sig->cred_guard_mutex);
+ mutex_init(&sig->cred_change_mutex);
return 0;
}
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
if (!mm || IS_ERR(mm)) {
rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
/*
- * Explicitly map EACCES to EPERM as EPERM is a more a
+ * Explicitly map EACCES to EPERM as EPERM is a more
* appropriate error code for process_vw_readv/writev
*/
if (rc == -EACCES)
--
1.9.1
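As a rough illustration of the locking scheme in the patch above, here is a small userspace model. It is only a sketch under stated assumptions: struct sig_model and the functions exec_begin, exec_point_of_no_return, exec_commit, mm_access_old and mm_access_new are hypothetical names made up for this example, the integer cred stands in for the credentials and mm, and pthread_mutex_trylock is used so that a contended lock is reported as -EBUSY instead of hanging the way the kernel's mutex_lock_killable would.

```c
#include <errno.h>
#include <pthread.h>

/* Userspace model only; the field names mirror the patch, nothing else does. */
struct sig_model {
	pthread_mutex_t cred_guard_mutex;  /* held for the whole exec */
	pthread_mutex_t cred_change_mutex; /* held only while creds/mm change */
	int cred;                          /* stand-in for credentials + mm */
};

void sig_model_init(struct sig_model *s, int cred)
{
	pthread_mutex_init(&s->cred_guard_mutex, NULL);
	pthread_mutex_init(&s->cred_change_mutex, NULL);
	s->cred = cred;
}

/* Start of exec: the outer lock is taken before waiting for other threads. */
void exec_begin(struct sig_model *s)
{
	pthread_mutex_lock(&s->cred_guard_mutex);
}

/* After the other threads are gone: take the inner lock before changing state. */
void exec_point_of_no_return(struct sig_model *s)
{
	pthread_mutex_lock(&s->cred_change_mutex);
}

/* Install the new state, then drop both locks (inner first). */
void exec_commit(struct sig_model *s, int new_cred)
{
	s->cred = new_cred;
	pthread_mutex_unlock(&s->cred_change_mutex);
	pthread_mutex_unlock(&s->cred_guard_mutex);
}

/* Read *src under the given lock; report -EBUSY if the lock cannot be taken. */
static int read_cred_under(pthread_mutex_t *lock, const int *src, int *out)
{
	if (pthread_mutex_trylock(lock) != 0)
		return -EBUSY;
	*out = *src;
	pthread_mutex_unlock(lock);
	return 0;
}

/* Model of mm_access before the patch: needs the outer lock. */
int mm_access_old(struct sig_model *s, int *out)
{
	return read_cred_under(&s->cred_guard_mutex, &s->cred, out);
}

/* Model of mm_access after the patch: needs only the inner lock. */
int mm_access_new(struct sig_model *s, int *out)
{
	return read_cred_under(&s->cred_change_mutex, &s->cred, out);
}
```

In this model, between exec_begin() and exec_point_of_no_return() the old reader fails while the new reader still sees the old value. That window corresponds to the time exec spends waiting for the other threads to go away.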
On 2020-03-01, Bernd Edlinger <[email protected]> wrote:
> The proposed solution is to have a second mutex that is
> used in mm_access, so it is allowed to continue while the
> dying threads are not yet terminated.
>
> I also took the opportunity to improve the documentation
> of prepare_creds, which is obviously out of sync.
>
> Signed-off-by: Bernd Edlinger <[email protected]>
I can't comment on the validity of the patch, but I also found and
reported this issue in 2016[1] and the discussion quickly veered into
the problem being more complicated (and uglier) than it seems at first
glance.
You should probably also Cc stable, given this has been a long-standing
issue and your patch doesn't look (too) invasive.
[1]: https://lore.kernel.org/lkml/[email protected]/
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
On Mon, Mar 02, 2020 at 02:13:33AM +1100, Aleksa Sarai wrote:
>
> I can't comment on the validity of the patch, but I also found and
> reported this issue in 2016[1] and the discussion quickly veered into
> the problem being more complicated (and uglier) than it seems at first
> glance.
>
> You should probably also Cc stable, given this has been a long-standing
> issue and your patch doesn't look (too) invasive.
>
> [1]: https://lore.kernel.org/lkml/[email protected]/
Yeah, I remember you mentioning this a while back.
Bernd, we really want a reproducer for this, sent along with this
patch and added to:
tools/testing/selftests/ptrace/
Having a test for this bug, irrespective of whether or not we go with
this as the fix, seems really worth it.
Oleg seems to have suggested that a potential alternative fix is to wait
in de_thread() until all other threads in the thread-group have passed
exit_notify(). Right now we only kill them but don't wait. Currently
de_thread() only waits for the thread-group leader to pass exit_notify()
whenever a non-thread-group leader thread execs (because the exec'ing
thread becomes the new thread-group leader with the same pid as the
former thread-group leader).
Christian
Hi Aleksa,
On 3/1/20 4:13 PM, Aleksa Sarai wrote:
>
> I can't comment on the validity of the patch, but I also found and
> reported this issue in 2016[1] and the discussion quickly veered into
> the problem being more complicated (and uglier) than it seems at first
> glance.
>
> You should probably also Cc stable, given this has been a long-standing
> issue and your patch doesn't look (too) invasive.
>
I am fully aware that this patch won't fix the case when PTRACE_ACCESS is racing
with de_thread. But I don't see a problem with allowing vm access based on the
current credentials, as they are still the same until de_thread is done with its
job. In practice this fixes 99% of the real problem here, as it only
happens when strace is currently tracing something and needs access to the parameters
in the tracee's vm space.
Of course you could fork the strace process to do any PTRACE_ACCESS when necessary,
and, well, maybe that would fix the remaining problem here...
However, before I considered changing the kernel for this, I tried to fix this
within strace. First I tried to wait in the signal handler (see attached
strace-patch-1.diff), but that did not work. I think it is possible, though, that
the patch you proposed previously would actually make it work.
I then tried another approach, using a worker thread to wait for the children,
but it only worked when I removed PTRACE_O_TRACEEXIT from the ptrace options,
because ptrace(PTRACE_SYSCALL, pid, 0L, 0L) does not work in the worker thread
(rv = -1, errno = 3, i.e. ESRCH, there), and unfortunately the main thread is blocked
and unable to do the ptrace call that makes the thread continue.
So I consider that second patch really ugly, and wouldn't propose something like
that seriously.
@@ -69,7 +71,7 @@
cflag_t cflag = CFLAG_NONE;
unsigned int followfork;
unsigned int ptrace_setoptions = PTRACE_O_TRACESYSGOOD | PTRACE_O_TRACEEXEC
- | PTRACE_O_TRACEEXIT;
+ ;//| PTRACE_O_TRACEEXIT;
unsigned int xflag;
bool debug_flag;
bool Tflag;
So it only works because of this line; without it, it is not able to make the
thread continue after the PTRACE_EVENT_EXIT.
Thanks
Bernd.
> [1]: https://lore.kernel.org/lkml/[email protected]/
>
On 3/1/20 4:58 PM, Christian Brauner wrote:
>
> Yeah, I remember you mentioning this a while back.
>
> Bernd, we really want a reproducer for this sent alongside with this
> patch added to:
> tools/testing/selftests/ptrace/
> Having a test for this bug irrespective of whether or not we go with
> this as fix seems really worth it.
>
I ran into this issue because I wanted to fix an issue in the gcc testsuite,
namely why it forgets to remove some temp files,
so I did the following:
strace -ftt -o trace.txt make check-gcc-c -k -j4
I reproduced this with v4.20 and v5.5 kernels. I don't know why, but it does
not happen on all systems I tested. Maybe it is something that the expect program
does, because whenever I reproduce this, the deadlock is in "expect".
I use expect version 5.45 on the computer where the above test freezes after
a couple of minutes.
I think the issue with strace is that it goes through mm_access (via
process_vm_readv) to get the parameters of a syscall that is in progress in one
thread, and that races with another thread that calls execve and holds the
cred_guard_mutex.
Oleg's test case here will certainly not be fixed:
https://lore.kernel.org/lkml/[email protected]/
But he mentions the access to "anything else which needs ->cred_guard_mutex,
say open(/proc/$pid/mem)". I don't know for sure how that can be done, but if
that is possible, it would probably work as a test case.
What do you think?
Bernd.
> Oleg seems to have suggested that a potential alternative fix is to wait
> in de_thread() until all other threads in the thread-group have passed
> exit_notiy(). Right now we only kill them but don't wait. Currently
> de_thread() only waits for the thread-group leader to pass exit_notify()
> whenever a non-thread-group leader thread execs (because the exec'ing
> thread becomes the new thread-group leader with the same pid as the
> former thread-group leader).
>
> Christian
>
On Sun, Mar 01, 2020 at 05:46:08PM +0000, Bernd Edlinger wrote:
> I ran into this issue, because I wanted to fix an issue in the gcc testsuite,
> namely why it forgets to remove some temp files,
> so I did the following:
>
> strace -ftt -o trace.txt make check-gcc-c -k -j4
>
> I reproduced with v4.20 and v5.5 kernel, and I don't know why but it is
> not happening on all systems I tested, maybe it is something that the expect program
> does, because, always when I try to reproduce this, the deadlock was always in "expect".
>
> I use expect version 5.45 on the computer where the above test freezes after
> a couple of minutes.
>
> I think the issue with strace is that it is using vm_access to get the parameters
> of a syscall that is going on in one thread, and that races with another thread
> that calls execve, and blocks the cred_guard_mutex.
>
> While Olg's test case here, will certainly not be fixed:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> he mentions the access to "anything else which needs ->cred_guard_mutex,
> say open(/proc/$pid/mem)", I don't know for sure how that can be done, but if
> that is possible, it would probably work as a test case.
>
> What do you think?
Yeah, anything that calls ptrace_may_access() is fine and
open(/proc/$pid/mem) will work so long as $pid is not in the same
thread-group as the caller. A polished version of the reproducer you
linked to would probably be good.
On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
<[email protected]> wrote:
> The proposed solution is to have a second mutex that is
> used in mm_access, so it is allowed to continue while the
> dying threads are not yet terminated.
Just for context: When I proposed something similar back in 2016,
https://lore.kernel.org/linux-fsdevel/[email protected]/
was the resulting discussion thread. At least back then, I looked
through the various existing users of cred_guard_mutex, and the only
places that couldn't be converted to the new second mutex were
PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
The ideal solution would IMO be something like this: Decide what the
new task's credentials should be *before* reaching de_thread(),
install them into a second cred* on the task (together with the new
dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
check against both. After that, some further restructuring might even
allow the cred_guard_mutex to not be held across all of the VFS
operations that happen early on in execve, which may block
indefinitely. But that would be pretty complicated, so I think your
proposed solution makes sense for now, given that nobody has managed
to implement anything better in the last few years.
On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> <[email protected]> wrote:
> > The proposed solution is to have a second mutex that is
> > used in mm_access, so it is allowed to continue while the
> > dying threads are not yet terminated.
>
> Just for context: When I proposed something similar back in 2016,
> https://lore.kernel.org/linux-fsdevel/[email protected]/
> was the resulting discussion thread. At least back then, I looked
> through the various existing users of cred_guard_mutex, and the only
> places that couldn't be converted to the new second mutex were
> PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
>
>
> The ideal solution would IMO be something like this: Decide what the
> new task's credentials should be *before* reaching de_thread(),
> install them into a second cred* on the task (together with the new
> dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> check against both. After that, some further restructuring might even
Hm, so essentially a private ptrace_access_cred member in task_struct?
That would presumably also involve altering various LSM hooks to look at
ptrace_access_cred.
(Minor side-note, de_thread() takes a struct task_struct argument but
only ever is passed current.)
> allow the cred_guard_mutex to not be held across all of the VFS
> operations that happen early on in execve, which may block
> indefinitely. But that would be pretty complicated, so I think your
> proposed solution makes sense for now, given that nobody has managed
> to implement anything better in the last few years.
Reading through the old threads and how often this issue came up, I tend
to agree.
On 3/1/20 7:52 PM, Christian Brauner wrote:
> On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
>> On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
>> <[email protected]> wrote:
>>> The proposed solution is to have a second mutex that is
>>> used in mm_access, so it is allowed to continue while the
>>> dying threads are not yet terminated.
>>
>> Just for context: When I proposed something similar back in 2016,
>> https://lore.kernel.org/linux-fsdevel/[email protected]/
>> was the resulting discussion thread. At least back then, I looked
>> through the various existing users of cred_guard_mutex, and the only
>> places that couldn't be converted to the new second mutex were
>> PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
>>
>>
>> The ideal solution would IMO be something like this: Decide what the
>> new task's credentials should be *before* reaching de_thread(),
>> install them into a second cred* on the task (together with the new
>> dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
>> check against both. After that, some further restructuring might even
>
> Hm, so essentially a private ptrace_access_cred member in task_struct?
> That would presumably also involve altering various LSM hooks to look at
> ptrace_access_cred.
>
> (Minor side-note, de_thread() takes a struct task_struct argument but
> only ever is passed current.)
>
>> allow the cred_guard_mutex to not be held across all of the VFS
>> operations that happen early on in execve, which may block
>> indefinitely. But that would be pretty complicated, so I think your
>> proposed solution makes sense for now, given that nobody has managed
>> to implement anything better in the last few years.
>
> Reading through the old threads and how often this issue came up, I tend
> to agree.
>
Okay, fine.
I managed to change Oleg's test case into one that shows exactly what
is changed by this patch:
$ cat t.c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>
#include <signal.h>
#include <sys/ptrace.h>

void *thread(void *arg)
{
	/* Become a tracee of the parent, then exit without being reaped. */
	ptrace(PTRACE_TRACEME, 0,0,0);
	return NULL;
}

int main(void)
{
	int f, pid = fork();
	char mm[64];

	if (!pid) {
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		pthread_join(pt, NULL);
		execlp("echo", "echo", "passed", NULL);
	}

	/* Give the child time to reach execlp(). */
	sleep(1);
	sprintf(mm, "/proc/%d/mem", pid);
	printf("open(%s)\n", mm);
	/* This open() is the call that blocked before the patch. */
	f = open(mm, O_RDONLY);
	printf("f = %d\n", f);
	// this is not fixed! ptrace(PTRACE_ATTACH, pid, 0,0);
	kill(pid, SIGCONT);
	if (f >= 0)
		close(f);
	return 0;
}
$ gcc -pthread -Wall t.c
$ ./a.out
open(/proc/2802/mem)
f = 3
$ passed
Previously this did block. How can I make a test case for this?
I am not so experienced in this matter.
Thanks
Bernd.
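One possible way to turn "open() blocks" into a test result rather than a hung test is to do the open() in a child process and give up after a deadline. The following is only a sketch under that assumption; probe_open() and the probe_result values are hypothetical names, not part of the patch or of the kselftest harness. The child is killed with SIGKILL on timeout, which matches the observation in this thread that a blocked opener can still be killed with kill -9.

```c
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum probe_result { PROBE_OPENED, PROBE_FAILED, PROBE_TIMED_OUT, PROBE_ERROR };

/*
 * Try to open @path read-only in a child process. If the child has not
 * finished after roughly @timeout_ms milliseconds, assume open() is
 * blocked, kill the child and report PROBE_TIMED_OUT.
 */
enum probe_result probe_open(const char *path, int timeout_ms)
{
	const struct timespec step = { 0, 10 * 1000 * 1000 };	/* 10 ms */
	int status, waited_ms = 0;
	pid_t child = fork();

	if (child < 0)
		return PROBE_ERROR;
	if (child == 0) {
		int fd = open(path, O_RDONLY);

		_exit(fd >= 0 ? 0 : 1);
	}
	for (;;) {
		pid_t r = waitpid(child, &status, WNOHANG);

		if (r == child)
			break;			/* child finished on its own */
		if (r < 0)
			return PROBE_ERROR;
		if (waited_ms >= timeout_ms) {
			kill(child, SIGKILL);	/* blocked opener: reap it */
			waitpid(child, &status, 0);
			return PROBE_TIMED_OUT;
		}
		nanosleep(&step, NULL);
		waited_ms += 10;
	}
	if (WIFEXITED(status))
		return WEXITSTATUS(status) == 0 ? PROBE_OPENED : PROBE_FAILED;
	return PROBE_ERROR;
}
```

In the test case above, the parent could call probe_open(mm, 1000) instead of open(mm, O_RDONLY) and treat PROBE_TIMED_OUT as the failure that the patch is meant to remove.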
On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner
<[email protected]> wrote:
> On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> > <[email protected]> wrote:
> > > The proposed solution is to have a second mutex that is
> > > used in mm_access, so it is allowed to continue while the
> > > dying threads are not yet terminated.
> >
> > Just for context: When I proposed something similar back in 2016,
> > https://lore.kernel.org/linux-fsdevel/[email protected]/
> > was the resulting discussion thread. At least back then, I looked
> > through the various existing users of cred_guard_mutex, and the only
> > places that couldn't be converted to the new second mutex were
> > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> >
> >
> > The ideal solution would IMO be something like this: Decide what the
> > new task's credentials should be *before* reaching de_thread(),
> > install them into a second cred* on the task (together with the new
> > dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> > check against both. After that, some further restructuring might even
>
> Hm, so essentially a private ptrace_access_cred member in task_struct?
And a second dumpability field, because that changes together with the
creds during execve. (Btw, currently the dumpability is in the
mm_struct, but that's kinda wrong. The mm_struct is removed from a
task on exit while access checks can still be performed against it, and
currently ptrace_may_access() just lets the access go through in that
case, which weakens the protection offered by PR_SET_DUMPABLE when
used for security purposes. I think it ought to be moved over into the
task_struct.)
> That would presumably also involve altering various LSM hooks to look at
> ptrace_access_cred.
When I tried to implement this in the past, I changed the LSM hook to
take the target task's cred* as an argument, and then called the LSM
hook twice from ptrace_may_access(). IIRC having the target task's
creds as an argument works for almost all the LSMs, with the exception
of Yama, which doesn't really care about the target task's creds, so
you have to pass in both the task_struct* and the cred*.
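The "check against both" idea described above can be modelled in user space roughly as follows. This is a deliberately simplified sketch: struct model_cred, struct model_task and both functions are hypothetical stand-ins, not the kernel's struct cred, task_struct, LSM hook or ptrace_may_access(), and the rule inside model_cred_allows() is far coarser than the real checks. It only shows the shape of the approach: run the same check once per credential set and allow access only if both agree.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a credential set. */
struct model_cred {
	unsigned int uid;
	bool dumpable;	/* models the dumpability that changes with the creds */
};

/* Hypothetical task: current creds plus the creds a pending execve installs. */
struct model_task {
	const struct model_cred *cred;		/* in effect now */
	const struct model_cred *exec_cred;	/* pending, or NULL if no execve */
};

/* One access check against a single credential set. */
bool model_cred_allows(unsigned int tracer_uid, const struct model_cred *c)
{
	if (tracer_uid == 0)	/* models a fully privileged tracer */
		return true;
	return c->dumpable && c->uid == tracer_uid;
}

/* Access is granted only if the old and the pending creds both allow it. */
bool model_may_access(unsigned int tracer_uid, const struct model_task *t)
{
	if (!model_cred_allows(tracer_uid, t->cred))
		return false;
	if (t->exec_cred && !model_cred_allows(tracer_uid, t->exec_cred))
		return false;
	return true;
}
```

With this shape, an unprivileged tracer that passes the check against the old credentials is still refused while the target is in the middle of an execve of a more privileged binary, without the tracer having to wait on a mutex.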
This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because the other threads
have to be terminated. They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.
The deadlock always happens when strace needs to access the
tracee's process mmap while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to have a second mutex that is
used in mm_access, so it is allowed to continue while the
dying threads are not yet terminated.
I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.
Signed-off-by: Bernd Edlinger <[email protected]>
---
Documentation/security/credentials.rst | 18 ++++++------
fs/exec.c | 9 ++++++
include/linux/binfmts.h | 6 +++-
include/linux/sched/signal.h | 1 +
init/init_task.c | 1 +
kernel/cred.c | 2 +-
kernel/fork.c | 5 ++--
mm/process_vm_access.c | 2 +-
tools/testing/selftests/ptrace/Makefile | 4 +--
tools/testing/selftests/ptrace/vmaccess.c | 46 +++++++++++++++++++++++++++++++
10 files changed, 79 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
v2: adds a test case which passes when this patch is applied.
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..c98e0a8 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,13 @@ new set of credentials by calling::
struct cred *prepare_creds(void);
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and the mutex
+current->signal->cred_change_mutex is acquired later, while the credentials
+and the process mmap are actually changed.
The mutex prevents ``ptrace()`` from altering the ptrace state of a process
while security checks on credentials construction and changing is taking place
@@ -466,9 +470,8 @@ by calling::
This will alter various aspects of the credentials and the process, giving the
LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
This function is guaranteed to return 0, so that it can be tail-called at the
end of such functions as ``sys_setresuid()``.
@@ -486,8 +489,7 @@ invoked::
void abort_creds(struct cred *new);
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..a6884e4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval)
goto out;
+ retval = mutex_lock_killable(&current->signal->cred_change_mutex);
+ if (retval)
+ goto out;
+
+ bprm->called_flush_old_exec = 1;
+
/*
* Must be called _before_ exec_mmap() as bprm->mm is
* not visibile until then. This also enables the update
@@ -1420,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
+ if (bprm->called_flush_old_exec)
+ mutex_unlock(&current->signal->cred_change_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
@@ -1469,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
+ mutex_unlock(&current->signal->cred_change_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2e1318b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
* exec has happened. Used to sanitize execution environment
* and to set AT_SECURE auxv for glibc.
*/
- secureexec:1;
+ secureexec:1,
+ /*
+ * Set by flush_old_exec, when the cred_change_mutex is taken.
+ */
+ called_flush_old_exec:1;
#ifdef __alpha__
unsigned int taso:1;
#endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..37eeabe 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace) */
+ struct mutex cred_change_mutex; /* guard against credentials change */
} __randomize_layout;
/*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..6cd9a0f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
.multiprocess = HLIST_HEAD_INIT,
.rlim = INIT_RLIMITS,
.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+ .cred_change_mutex = __MUTEX_INITIALIZER(init_signals.cred_change_mutex),
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
.cputimer = {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
*
* Returns the new credentials or NULL if out of memory.
*
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
*/
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..0395154 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
struct mm_struct *mm;
int err;
- err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ err = mutex_lock_killable(&task->signal->cred_change_mutex);
if (err)
return ERR_PTR(err);
@@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
mmput(mm);
mm = ERR_PTR(-EACCES);
}
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->cred_change_mutex);
return mm;
}
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
mutex_init(&sig->cred_guard_mutex);
+ mutex_init(&sig->cred_change_mutex);
return 0;
}
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
if (!mm || IS_ERR(mm)) {
rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
/*
- * Explicitly map EACCES to EPERM as EPERM is a more a
+ * Explicitly map EACCES to EPERM as EPERM is a more
* appropriate error code for process_vw_readv/writev
*/
if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..ef08c9f
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <[email protected]>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+ ptrace(PTRACE_TRACEME, 0, 0, 0);
+ return NULL;
+}
+
+TEST(vmaccess)
+{
+ int f, pid = fork();
+ char mm[64];
+
+ if (!pid) {
+ pthread_t pt;
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("true", "true", NULL);
+ }
+
+ sleep(1);
+ sprintf(mm, "/proc/%d/mem", pid);
+ f = open(mm, O_RDONLY);
+ ASSERT_LE(0, f);
+ close(f);
+ /* this is not fixed! ptrace(PTRACE_ATTACH, pid, 0,0); */
+ f = kill(pid, SIGCONT);
+ ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
--
1.9.1
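The locking split proposed in this patch can be illustrated with a small user-space model. This is only a sketch: struct model_signal and the two helpers are hypothetical, pthread mutexes stand in for the kernel mutexes, and pthread_mutex_trylock() is used so that the model reports "would block" instead of actually hanging. It shows why a reader that takes only cred_change_mutex is not stuck while an execve holds cred_guard_mutex and waits for the other threads to die.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for the two mutexes in signal_struct. */
struct model_signal {
	pthread_mutex_t cred_guard_mutex;	/* held across the whole execve */
	pthread_mutex_t cred_change_mutex;	/* held only while creds and mm change */
};

void model_signal_init(struct model_signal *sig)
{
	pthread_mutex_init(&sig->cred_guard_mutex, NULL);
	pthread_mutex_init(&sig->cred_change_mutex, NULL);
}

/*
 * Returns 0 if a reader could take @lock right now, or EBUSY if it would
 * have to sleep. The lock is released again immediately on success.
 */
int model_would_block(pthread_mutex_t *lock)
{
	int rc = pthread_mutex_trylock(lock);

	if (rc == 0)
		pthread_mutex_unlock(lock);
	return rc;
}
```

In the model, locking cred_guard_mutex represents an execve that is waiting in de_thread(): model_would_block() then reports EBUSY for cred_guard_mutex (the old mm_access behaviour) but 0 for cred_change_mutex (the behaviour with this patch), until the execve side also takes cred_change_mutex in flush_old_exec.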
Bernd Edlinger <[email protected]> writes:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
>
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated. They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
>
> The deadlock is always happening when strace needs to access the
> tracees process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
I think your patch works, but I don't think another mutex is necessary
to solve your case. Possibly it is justified, but I hesitate to
introduce yet another concept in the code.
Having read elsewhere in the thread that this does not solve the problem
Oleg has mentioned, I am really hesitant to add more complexity to the
situation.
For your case there is a straightforward and local workaround.
When the current task is ptracing the target task, don't bother with
cred_guard_mutex and ptrace_may_access in mm_access, as those tests
have already passed. Instead just confirm the ptrace status, i.e. perform
the permission check from ptrace_access_vm.
I think something like this is all we need.
diff --git a/kernel/fork.c b/kernel/fork.c
index cee89229606a..b0ab98c84589 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,6 +1224,16 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
struct mm_struct *mm;
int err;
+ if (task->ptrace && (current == task->parent)) {
+ mm = get_task_mm(task);
+ if ((get_dumpable(mm) != SUID_DUMP_USER) &&
+ !ptracer_capable(task, mm->user_ns)) {
+ mmput(mm);
+ mm = ERR_PTR(-EACCESS);
+ }
+ return mm;
+ }
+
err = mutex_lock_killable(&task->signal->cred_guard_mutex);
if (err)
return ERR_PTR(err);
Does this solve your test case?
The patch above lacks the appropriate locking for the ptrace-attached
check (tasklist_lock, I think). But it is enough to illustrate the idea,
and it is probably a check we want in any event, so that if the tracer
starts dropping privileges, process_vm_readv and process_vm_writev will
still be usable by the tracer.
Eric
> strace D 0 30614 30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> expect D 0 31933 30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> The proposed solution is to have a second mutex that is
> used in mm_access, so it is allowed to continue while the
> dying threads are not yet terminated.
>
> I also took the opportunity to improve the documentation
> of prepare_creds, which is obviously out of sync.
>
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> Documentation/security/credentials.rst | 18 ++++++------
> fs/exec.c | 9 ++++++
> include/linux/binfmts.h | 6 +++-
> include/linux/sched/signal.h | 1 +
> init/init_task.c | 1 +
> kernel/cred.c | 2 +-
> kernel/fork.c | 5 ++--
> mm/process_vm_access.c | 2 +-
> tools/testing/selftests/ptrace/Makefile | 4 +--
> tools/testing/selftests/ptrace/vmaccess.c | 46 +++++++++++++++++++++++++++++++
> 10 files changed, 79 insertions(+), 15 deletions(-)
> create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
>
> v2: adds a test case which passes when this patch is applied.
>
>
> diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
> index 282e79f..c98e0a8 100644
> --- a/Documentation/security/credentials.rst
> +++ b/Documentation/security/credentials.rst
> @@ -437,9 +437,13 @@ new set of credentials by calling::
>
> struct cred *prepare_creds(void);
>
> -this locks current->cred_replace_mutex and then allocates and constructs a
> -duplicate of the current process's credentials, returning with the mutex still
> -held if successful. It returns NULL if not successful (out of memory).
> +this allocates and constructs a duplicate of the current process's credentials.
> +It returns NULL if not successful (out of memory).
> +
> +If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
> +is acquired before this function gets called, and the mutex
> +current->signal->cred_change_mutex is acquired later, while the credentials
> +and the process mmap are actually changed.
>
> The mutex prevents ``ptrace()`` from altering the ptrace state of a process
> while security checks on credentials construction and changing is taking place
> @@ -466,9 +470,8 @@ by calling::
>
> This will alter various aspects of the credentials and the process, giving the
> LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
> -actually commit the new credentials to ``current->cred``, it will release
> -``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
> -will notify the scheduler and others of the changes.
> +actually commit the new credentials to ``current->cred``, and it will notify
> +the scheduler and others of the changes.
>
> This function is guaranteed to return 0, so that it can be tail-called at the
> end of such functions as ``sys_setresuid()``.
> @@ -486,8 +489,7 @@ invoked::
>
> void abort_creds(struct cred *new);
>
> -This releases the lock on ``current->cred_replace_mutex`` that
> -``prepare_creds()`` got and then releases the new credentials.
> +This releases the new credentials.
>
>
> A typical credentials alteration function would look something like this::
> diff --git a/fs/exec.c b/fs/exec.c
> index 74d88da..a6884e4 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
> if (retval)
> goto out;
>
> + retval = mutex_lock_killable(&current->signal->cred_change_mutex);
> + if (retval)
> + goto out;
> +
> + bprm->called_flush_old_exec = 1;
> +
> /*
> * Must be called _before_ exec_mmap() as bprm->mm is
> * not visibile until then. This also enables the update
> @@ -1420,6 +1426,8 @@ static void free_bprm(struct linux_binprm *bprm)
> {
> free_arg_pages(bprm);
> if (bprm->cred) {
> + if (bprm->called_flush_old_exec)
> + mutex_unlock(&current->signal->cred_change_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> abort_creds(bprm->cred);
> }
> @@ -1469,6 +1477,7 @@ void install_exec_creds(struct linux_binprm *bprm)
> * credentials; any time after this it may be unlocked.
> */
> security_bprm_committed_creds(bprm);
> + mutex_unlock(&current->signal->cred_change_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> }
> EXPORT_SYMBOL(install_exec_creds);
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index b40fc63..2e1318b 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -44,7 +44,11 @@ struct linux_binprm {
> * exec has happened. Used to sanitize execution environment
> * and to set AT_SECURE auxv for glibc.
> */
> - secureexec:1;
> + secureexec:1,
> + /*
> + * Set by flush_old_exec, when the cred_change_mutex is taken.
> + */
> + called_flush_old_exec:1;
> #ifdef __alpha__
> unsigned int taso:1;
> #endif
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 8805025..37eeabe 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -225,6 +225,7 @@ struct signal_struct {
> struct mutex cred_guard_mutex; /* guard against foreign influences on
> * credential calculations
> * (notably. ptrace) */
> + struct mutex cred_change_mutex; /* guard against credentials change */
> } __randomize_layout;
>
> /*
> diff --git a/init/init_task.c b/init/init_task.c
> index 9e5cbe5..6cd9a0f 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -26,6 +26,7 @@
> .multiprocess = HLIST_HEAD_INIT,
> .rlim = INIT_RLIMITS,
> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> + .cred_change_mutex = __MUTEX_INITIALIZER(init_signals.cred_change_mutex),
> #ifdef CONFIG_POSIX_TIMERS
> .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
> .cputimer = {
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 809a985..e4c78de 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -676,7 +676,7 @@ void __init cred_init(void)
> *
> * Returns the new credentials or NULL if out of memory.
> *
> - * Does not take, and does not return holding current->cred_replace_mutex.
> + * Does not take, and does not return holding ->cred_guard_mutex.
> */
> struct cred *prepare_kernel_cred(struct task_struct *daemon)
> {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0808095..0395154 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> struct mm_struct *mm;
> int err;
>
> - err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> + err = mutex_lock_killable(&task->signal->cred_change_mutex);
> if (err)
> return ERR_PTR(err);
>
> @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> mmput(mm);
> mm = ERR_PTR(-EACCES);
> }
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + mutex_unlock(&task->signal->cred_change_mutex);
>
> return mm;
> }
> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
> sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>
> mutex_init(&sig->cred_guard_mutex);
> + mutex_init(&sig->cred_change_mutex);
>
> return 0;
> }
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 357aa7b..b3e6eb5 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
> if (!mm || IS_ERR(mm)) {
> rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> /*
> - * Explicitly map EACCES to EPERM as EPERM is a more a
> + * Explicitly map EACCES to EPERM as EPERM is a more
> * appropriate error code for process_vw_readv/writev
> */
> if (rc == -EACCES)
> diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
> index c0b7f89..2f1f532 100644
> --- a/tools/testing/selftests/ptrace/Makefile
> +++ b/tools/testing/selftests/ptrace/Makefile
> @@ -1,6 +1,6 @@
> # SPDX-License-Identifier: GPL-2.0-only
> -CFLAGS += -iquote../../../../include/uapi -Wall
> +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
>
> -TEST_GEN_PROGS := get_syscall_info peeksiginfo
> +TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
>
> include ../lib.mk
> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> new file mode 100644
> index 0000000..ef08c9f
> --- /dev/null
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -0,0 +1,46 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (c) 2020 Bernd Edlinger <[email protected]>
> + * All rights reserved.
> + *
> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
> + * when de_thread is blocked with ->cred_guard_mutex held.
> + */
> +
> +#include "../kselftest_harness.h"
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <unistd.h>
> +#include <sys/ptrace.h>
> +
> +static void *thread(void *arg)
> +{
> + ptrace(PTRACE_TRACEME, 0, 0, 0);
> + return NULL;
> +}
> +
> +TEST(vmaccess)
> +{
> + int f, pid = fork();
> + char mm[64];
> +
> + if (!pid) {
> + pthread_t pt;
> + pthread_create(&pt, NULL, thread, NULL);
> + pthread_join(pt, NULL);
> + execlp("true", "true", NULL);
> + }
> +
> + sleep(1);
> + sprintf(mm, "/proc/%d/mem", pid);
> + f = open(mm, O_RDONLY);
> + ASSERT_LE(0, f);
> + close(f);
> + /* this is not fixed! ptrace(PTRACE_ATTACH, pid, 0,0); */
> + f = kill(pid, SIGCONT);
> + ASSERT_EQ(0, f);
> +}
> +
> +TEST_HARNESS_MAIN
On Sun, Mar 01, 2020 at 09:00:22PM +0100, Jann Horn wrote:
> On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner
> <[email protected]> wrote:
> > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> > > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> > > <[email protected]> wrote:
> > > > The proposed solution is to have a second mutex that is
> > > > used in mm_access, so it is allowed to continue while the
> > > > dying threads are not yet terminated.
> > >
> > > Just for context: When I proposed something similar back in 2016,
> > > https://lore.kernel.org/linux-fsdevel/[email protected]/
> > > was the resulting discussion thread. At least back then, I looked
> > > through the various existing users of cred_guard_mutex, and the only
> > > places that couldn't be converted to the new second mutex were
> > > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> > >
> > >
> > > The ideal solution would IMO be something like this: Decide what the
> > > new task's credentials should be *before* reaching de_thread(),
> > > install them into a second cred* on the task (together with the new
> > > dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> > > check against both. After that, some further restructuring might even
> >
> > Hm, so essentially a private ptrace_access_cred member in task_struct?
>
> And a second dumpability field, because that changes together with the
> creds during execve. (Btw, currently the dumpability is in the
> mm_struct, but that's kinda wrong. The mm_struct is removed from a
> task on exit while access checks can still be performed against it, and
> currently ptrace_may_access() just lets the access go through in that
> case, which weakens the protection offered by PR_SET_DUMPABLE when
> used for security purposes. I think it ought to be moved over into the
> task_struct.)
>
> > That would presumably also involve altering various LSM hooks to look at
> > ptrace_access_cred.
>
> When I tried to implement this in the past, I changed the LSM hook to
> take the target task's cred* as an argument, and then called the LSM
> hook twice from ptrace_may_access(). IIRC having the target task's
> creds as an argument works for almost all the LSMs, with the exception
> of Yama, which doesn't really care about the target task's creds, so
> you have to pass in both the task_struct* and the cred*.
It seems we should try PoCing this.
Christian
On Mon, Mar 02, 2020 at 08:47:53AM +0100, Christian Brauner wrote:
> On Sun, Mar 01, 2020 at 09:00:22PM +0100, Jann Horn wrote:
> > On Sun, Mar 1, 2020 at 7:52 PM Christian Brauner
> > <[email protected]> wrote:
> > > On Sun, Mar 01, 2020 at 07:21:03PM +0100, Jann Horn wrote:
> > > > On Sun, Mar 1, 2020 at 12:27 PM Bernd Edlinger
> > > > <[email protected]> wrote:
> > > > > The proposed solution is to have a second mutex that is
> > > > > used in mm_access, so it is allowed to continue while the
> > > > > dying threads are not yet terminated.
> > > >
> > > > Just for context: When I proposed something similar back in 2016,
> > > > https://lore.kernel.org/linux-fsdevel/[email protected]/
> > > > was the resulting discussion thread. At least back then, I looked
> > > > through the various existing users of cred_guard_mutex, and the only
> > > > places that couldn't be converted to the new second mutex were
> > > > PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
> > > >
> > > >
> > > > The ideal solution would IMO be something like this: Decide what the
> > > > new task's credentials should be *before* reaching de_thread(),
> > > > install them into a second cred* on the task (together with the new
> > > > dumpability), drop the cred_guard_mutex, and let ptrace_may_access()
> > > > check against both. After that, some further restructuring might even
> > >
> > > Hm, so essentially a private ptrace_access_cred member in task_struct?
> >
> > And a second dumpability field, because that changes together with the
> > creds during execve. (Btw, currently the dumpability is in the
> > mm_struct, but that's kinda wrong. The mm_struct is removed from a
> > task on exit while access checks can still be performed against it, and
> > currently ptrace_may_access() just lets the access go through in that
> > case, which weakens the protection offered by PR_SET_DUMPABLE when
> > used for security purposes. I think it ought to be moved over into the
> > task_struct.)
> >
> > > That would presumably also involve altering various LSM hooks to look at
> > > ptrace_access_cred.
> >
> > When I tried to implement this in the past, I changed the LSM hook to
> > take the target task's cred* as an argument, and then called the LSM
> > hook twice from ptrace_may_access(). IIRC having the target task's
> > creds as an argument works for almost all the LSMs, with the exception
> > of Yama, which doesn't really care about the target task's creds, so
> > you have to pass in both the task_struct* and the cred*.
>
> It seems we should try PoCing this.
Independent of the fix for Bernd's issue, that is.
On 03/01, Bernd Edlinger wrote:
>
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
Heh. Yes, known problem. See my attempt to fix it:
https://lore.kernel.org/lkml/[email protected]/
> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> struct mm_struct *mm;
> int err;
>
> - err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> + err = mutex_lock_killable(&task->signal->cred_change_mutex);
So if I understand correctly your patch doesn't fix other problems
with debugger waiting for cred_guard_mutex.
I too do not think this can justify the new mutex in signal_struct...
Oleg.
On 3/2/20 7:38 AM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated. They send ptrace events, but the strace is no
>> longer able to respond, since it is blocked in vm_access.
>>
>> The deadlock is always happening when strace needs to access the
>> tracees process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>
> I think your patch works, but I don't think another mutex is necessary
> to solve your case. Possibly it is justified, but I hesitate to
> introduce yet another concept in the code.
>
> Having read elsewhere in the thread that this does not solve the problem
> Oleg has mentioned, I am really hesitant to add more complexity to the
> situation.
>
>
> For your case there is a straightforward and local workaround.
>
> When the current task is ptracing the target task, don't bother with
> cred_guard_mutex and ptrace_may_access in mm_access, as those tests
> have already passed. Instead just confirm the ptrace status, i.e. perform
> the permission check from ptrace_access_vm.
>
> I think something like this is all we need.
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cee89229606a..b0ab98c84589 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1224,6 +1224,16 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> struct mm_struct *mm;
> int err;
>
> + if (task->ptrace && (current == task->parent)) {
> + mm = get_task_mm(task);
> + if ((get_dumpable(mm) != SUID_DUMP_USER) &&
> + !ptracer_capable(task, mm->user_ns)) {
> + mmput(mm);
> + mm = ERR_PTR(-EACCESS);
> + }
> + return mm;
> + }
> +
> err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> if (err)
> return ERR_PTR(err);
>
> Does this solve your test case?
>
I tried this with s/EACCESS/EACCES/.
The test case in this patch is not fixed, but strace does not freeze,
at least with my setup where it did freeze repeatably. That is
obviously because it bypasses the cred_guard_mutex. But all other
processes that access this file still freeze, and cannot be
interrupted except with kill -9.
However, that smells like a denial of service: this
simple test case, which can be executed by a guest, creates a /proc/$pid/mem
that freezes any process, even root, when it looks at it.
I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
Bernd.
On 3/2/20 1:28 PM, Oleg Nesterov wrote:
> On 03/01, Bernd Edlinger wrote:
>>
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>
> Heh. Yes, known problem. See my attempt to fix it:
> https://lore.kernel.org/lkml/[email protected]/
>
>> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>> struct mm_struct *mm;
>> int err;
>>
>> - err = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> + err = mutex_lock_killable(&task->signal->cred_change_mutex);
>
> So if I understand correctly your patch doesn't fix other problems
> with debugger waiting for cred_guard_mutex.
>
No, but I see this just as a first step.
> I too do not think this can justify the new mutex in signal_struct...
>
I think for mm_access the semantics of this mutex are clear: it
prevents the credentials from changing while it is held by mm_access,
and probably other places can take advantage of this mutex as well.
While on the other hand, the cred_guard_mutex is needed to avoid two
threads calling execve at the same time. So that is needed as well.
What remains is probably making PTRACE_ATTACH detect that the process
is currently in execve, and make that call fail in that situation.
I have not thought in depth about that problem, but it will probably
just need the right mutex to access current->in_execve.
That's at least how I see it.
Thanks
Bernd.
Bernd Edlinger <[email protected]> writes:
>
> I tried this with s/EACCESS/EACCES/.
>
> The test case in this patch is not fixed, but strace does not freeze,
> at least with my setup where it did freeze repeatable.
Thanks, that is what I was aiming at.
So we have one method we can pursue to fix this in practice.
> That is
> obviously because it bypasses the cred_guard_mutex. But all other
> process that access this file still freeze, and cannot be
> interrupted except with kill -9.
>
> However that smells like a denial of service, that this
> simple test case which can be executed by guest, creates a /proc/$pid/mem
> that freezes any process, even root, when it looks at it.
> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
Yes. The test case in your patch is a variant of the original
problem.
I have been staring at this trying to understand the fundamentals of the
original deeper problem.
The current scope of cred_guard_mutex in exec is because being ptraced
causes suid exec to act differently. So we need to know early if we are
ptraced.
If that case did not exist we could reduce the scope of the
cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
I am starting to think reworking how we deal with ptrace and exec is the
way to solve this problem.
Eric
On 3/2/20 4:57 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>>
>> I tried this with s/EACCESS/EACCES/.
>>
>> The test case in this patch is not fixed, but strace does not freeze,
>> at least with my setup where it did freeze repeatable.
>
> Thanks, That is what I was aiming at.
>
> So we have one method we can pursue to fix this in practice.
>
>> That is
>> obviously because it bypasses the cred_guard_mutex. But all other
>> process that access this file still freeze, and cannot be
>> interrupted except with kill -9.
>>
>> However that smells like a denial of service, that this
>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>> that freezes any process, even root, when it looks at it.
>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>
> Yes. Your the test case in your patch a variant of the original
> problem.
>
>
> I have been staring at this trying to understand the fundamentals of the
> original deeper problem.
>
> The current scope of cred_guard_mutex in exec is because being ptraced
> causes suid exec to act differently. So we need to know early if we are
> ptraced.
>
It has a second use, that it prevents two threads entering execve,
which would probably result in disaster.
> If that case did not exist we could reduce the scope of the
> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
>
> I am starting to think reworking how we deal with ptrace and exec is the
> way to solve this problem.
>
> Eric
>
Bernd Edlinger <[email protected]> writes:
> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <[email protected]> writes:
>>
>>>
>>> I tried this with s/EACCESS/EACCES/.
>>>
>>> The test case in this patch is not fixed, but strace does not freeze,
>>> at least with my setup where it did freeze repeatable.
>>
>> Thanks, That is what I was aiming at.
>>
>> So we have one method we can pursue to fix this in practice.
>>
>>> That is
>>> obviously because it bypasses the cred_guard_mutex. But all other
>>> process that access this file still freeze, and cannot be
>>> interrupted except with kill -9.
>>>
>>> However that smells like a denial of service, that this
>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>> that freezes any process, even root, when it looks at it.
>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>>
>> Yes. Your the test case in your patch a variant of the original
>> problem.
>>
>>
>> I have been staring at this trying to understand the fundamentals of the
>> original deeper problem.
>>
>> The current scope of cred_guard_mutex in exec is because being ptraced
>> causes suid exec to act differently. So we need to know early if we are
>> ptraced.
>>
>
> It has a second use, that it prevents two threads entering execve,
> which would probably result in disaster.
Exec can fail with an error code up until de_thread. de_thread causes
exec to fail with the error code -EAGAIN for the second thread to get
into de_thread.
So no. The cred_guard_mutex is not needed for that case at all.
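To make that concrete, here is a minimal user-space sketch of the rule, not the kernel code itself. The struct and function names are made up for illustration, and the locking that de_thread does under siglock is left out on purpose. It only shows that the first caller wins and every later caller gets -EAGAIN:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-process signal state. */
struct model_signal {
	bool group_exit;	/* set once a thread started killing its siblings */
	int exec_owner;		/* id of the thread that won, 0 if none */
};

/*
 * Model of the check at the top of de_thread(): the first caller
 * marks the group as exiting and proceeds, any later caller sees
 * the mark and fails with -EAGAIN.  Locking is omitted here.
 */
static int model_de_thread(struct model_signal *sig, int tid)
{
	if (sig->group_exit)
		return -EAGAIN;
	sig->group_exit = true;
	sig->exec_owner = tid;
	return 0;
}
```

So the exclusion between two racing execve callers does not depend on which mutex is held around it.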
>> If that case did not exist we could reduce the scope of the
>> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
>>
>> I am starting to think reworking how we deal with ptrace and exec is the
>> way to solve this problem.
I am 99% convinced that the fix is to move cred_guard_mutex down.
Then right after we take cred_guard_mutex do:
if (ptraced) {
use_original_creds();
}
And call it a day.
The details suck but I am 99% certain that would solve everyone's
problems, and not be too bad to audit either.
Eric
On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman <[email protected]> wrote:
>
> Bernd Edlinger <[email protected]> writes:
>
> > On 3/2/20 4:57 PM, Eric W. Biederman wrote:
> >> Bernd Edlinger <[email protected]> writes:
> >>
> >>>
> >>> I tried this with s/EACCESS/EACCES/.
> >>>
> >>> The test case in this patch is not fixed, but strace does not freeze,
> >>> at least with my setup where it did freeze repeatable.
> >>
> >> Thanks, That is what I was aiming at.
> >>
> >> So we have one method we can pursue to fix this in practice.
> >>
> >>> That is
> >>> obviously because it bypasses the cred_guard_mutex. But all other
> >>> process that access this file still freeze, and cannot be
> >>> interrupted except with kill -9.
> >>>
> >>> However that smells like a denial of service, that this
> >>> simple test case which can be executed by guest, creates a /proc/$pid/mem
> >>> that freezes any process, even root, when it looks at it.
> >>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
> >>
> >> Yes. Your the test case in your patch a variant of the original
> >> problem.
> >>
> >>
> >> I have been staring at this trying to understand the fundamentals of the
> >> original deeper problem.
> >>
> >> The current scope of cred_guard_mutex in exec is because being ptraced
> >> causes suid exec to act differently. So we need to know early if we are
> >> ptraced.
> >>
> >
> > It has a second use, that it prevents two threads entering execve,
> > which would probably result in disaster.
>
> Exec can fail with an error code up until de_thread. de_thread causes
> exec to fail with the error code -EAGAIN for the second thread to get
> into de_thread.
>
> So no. The cred_guard_mutex is not needed for that case at all.
>
> >> If that case did not exist we could reduce the scope of the
> >> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
> >>
> >> I am starting to think reworking how we deal with ptrace and exec is the
> >> way to solve this problem.
>
>
> I am 99% convinced that the fix is to move cred_guard_mutex down.
"move cred_guard_mutex down" as in "take it once we've already set up
the new process, past the point of no return"?
> Then right after we take cred_guard_mutex do:
> if (ptraced) {
> use_original_creds();
> }
>
> And call it a day.
>
> The details suck but I am 99% certain that would solve everyones
> problems, and not be too bad to audit either.
Ah, hmm, that sounds like it'll work fine at least when no LSMs are involved.
SELinux normally doesn't do the execution-degrading thing, it just
blocks the execution completely - see their selinux_bprm_set_creds()
hook. So I think they'd still need to set some state on the task that
says "we're currently in the middle of an execution where the target
task will run in context X", and then check against that in the
ptrace_may_access hook. Or I suppose they could just kill the task
near the end of execve, although that'd be kinda ugly.
On 3/2/20 5:43 PM, Jann Horn wrote:
> On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman <[email protected]> wrote:
>>
>> Bernd Edlinger <[email protected]> writes:
>>
>>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>>>> Bernd Edlinger <[email protected]> writes:
>>>>
>>>>>
>>>>> I tried this with s/EACCESS/EACCES/.
>>>>>
>>>>> The test case in this patch is not fixed, but strace does not freeze,
>>>>> at least with my setup where it did freeze repeatable.
>>>>
>>>> Thanks, That is what I was aiming at.
>>>>
>>>> So we have one method we can pursue to fix this in practice.
>>>>
>>>>> That is
>>>>> obviously because it bypasses the cred_guard_mutex. But all other
>>>>> process that access this file still freeze, and cannot be
>>>>> interrupted except with kill -9.
>>>>>
>>>>> However that smells like a denial of service, that this
>>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>>>> that freezes any process, even root, when it looks at it.
>>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>>>>
>>>> Yes. Your the test case in your patch a variant of the original
>>>> problem.
>>>>
>>>>
>>>> I have been staring at this trying to understand the fundamentals of the
>>>> original deeper problem.
>>>>
>>>> The current scope of cred_guard_mutex in exec is because being ptraced
>>>> causes suid exec to act differently. So we need to know early if we are
>>>> ptraced.
>>>>
>>>
>>> It has a second use, that it prevents two threads entering execve,
>>> which would probably result in disaster.
>>
>> Exec can fail with an error code up until de_thread. de_thread causes
>> exec to fail with the error code -EAGAIN for the second thread to get
>> into de_thread.
>>
>> So no. The cred_guard_mutex is not needed for that case at all.
>>
>>>> If that case did not exist we could reduce the scope of the
>>>> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
>>>>
>>>> I am starting to think reworking how we deal with ptrace and exec is the
>>>> way to solve this problem.
>>
>>
>> I am 99% convinced that the fix is to move cred_guard_mutex down.
>
> "move cred_guard_mutex down" as in "take it once we've already set up
> the new process, past the point of no return"?
>
>> Then right after we take cred_guard_mutex do:
>> if (ptraced) {
>> use_original_creds();
>> }
>>
>> And call it a day.
>>
>> The details suck but I am 99% certain that would solve everyones
>> problems, and not be too bad to audit either.
>
> Ah, hmm, that sounds like it'll work fine at least when no LSMs are involved.
>
> SELinux normally doesn't do the execution-degrading thing, it just
> blocks the execution completely - see their selinux_bprm_set_creds()
> hook. So I think they'd still need to set some state on the task that
> says "we're currently in the middle of an execution where the target
> task will run in context X", and then check against that in the
> ptrace_may_access hook. Or I suppose they could just kill the task
> near the end of execve, although that'd be kinda ugly.
>
We have current->in_execve for that, right?
I think when the cred_guard_mutex is taken only in the critical section,
then PTRACE_ATTACH could take the cred_guard_mutex, look at current->in_execve,
and just return -EAGAIN in that case, right? Everybody happy :)
Bernd.
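A rough user-space model of that idea, using pthreads instead of kernel primitives, might look like the following. All names here (model_task, model_exec_begin and so on) are invented for the sketch, and it assumes a flag that stays set for the whole execve while the mutex is only held briefly at both ends:

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the state shared by execve and PTRACE_ATTACH. */
struct model_task {
	pthread_mutex_t guard;	/* plays the role of cred_guard_mutex */
	bool in_execve;		/* set for the whole duration of execve */
};

/* Short critical section at the start of execve. */
static int model_exec_begin(struct model_task *t)
{
	int ret = 0;

	pthread_mutex_lock(&t->guard);
	if (t->in_execve)
		ret = -EAGAIN;	/* another execve is already in flight */
	else
		t->in_execve = true;
	pthread_mutex_unlock(&t->guard);
	return ret;
}

/* Short critical section at the end of execve. */
static void model_exec_end(struct model_task *t)
{
	pthread_mutex_lock(&t->guard);
	t->in_execve = false;
	pthread_mutex_unlock(&t->guard);
}

/* Attach never sleeps waiting for execve to finish, it just backs off. */
static int model_ptrace_attach(struct model_task *t)
{
	int ret;

	pthread_mutex_lock(&t->guard);
	ret = t->in_execve ? -EAGAIN : 0;
	pthread_mutex_unlock(&t->guard);
	return ret;
}
```

The point of the design is that nobody holds the mutex while waiting for the sibling threads to die, so a tracer that only wants to read memory is never stuck behind it.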
On 3/2/20 5:17 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <[email protected]> writes:
>>>
>>>>
>>>> I tried this with s/EACCESS/EACCES/.
>>>>
>>>> The test case in this patch is not fixed, but strace does not freeze,
>>>> at least with my setup where it did freeze repeatable.
>>>
>>> Thanks, That is what I was aiming at.
>>>
>>> So we have one method we can pursue to fix this in practice.
>>>
>>>> That is
>>>> obviously because it bypasses the cred_guard_mutex. But all other
>>>> process that access this file still freeze, and cannot be
>>>> interrupted except with kill -9.
>>>>
>>>> However that smells like a denial of service, that this
>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>>> that freezes any process, even root, when it looks at it.
>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>>>
>>> Yes. Your the test case in your patch a variant of the original
>>> problem.
>>>
>>>
>>> I have been staring at this trying to understand the fundamentals of the
>>> original deeper problem.
>>>
>>> The current scope of cred_guard_mutex in exec is because being ptraced
>>> causes suid exec to act differently. So we need to know early if we are
>>> ptraced.
>>>
>>
>> It has a second use, that it prevents two threads entering execve,
>> which would probably result in disaster.
>
> Exec can fail with an error code up until de_thread. de_thread causes
> exec to fail with the error code -EAGAIN for the second thread to get
> into de_thread.
>
> So no. The cred_guard_mutex is not needed for that case at all.
>
Okay, but that will reset current->in_execve, right?
>>> If that case did not exist we could reduce the scope of the
>>> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
>>>
>>> I am starting to think reworking how we deal with ptrace and exec is the
>>> way to solve this problem.
>
>
> I am 99% convinced that the fix is to move cred_guard_mutex down.
>
> Then right after we take cred_guard_mutex do:
> if (ptraced) {
> use_original_creds();
> }
>
> And call it a day.
>
> The details suck but I am 99% certain that would solve everyones
> problems, and not be too bad to audit either.
>
> Eric
>
On Mon, Mar 2, 2020 at 6:01 PM Bernd Edlinger <[email protected]> wrote:
> On 3/2/20 5:43 PM, Jann Horn wrote:
> > On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman <[email protected]> wrote:
> >>
> >> Bernd Edlinger <[email protected]> writes:
> >>
> >>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
> >>>> Bernd Edlinger <[email protected]> writes:
> >>>>
> >>>>>
> >>>>> I tried this with s/EACCESS/EACCES/.
> >>>>>
> >>>>> The test case in this patch is not fixed, but strace does not freeze,
> >>>>> at least with my setup where it did freeze repeatable.
> >>>>
> >>>> Thanks, That is what I was aiming at.
> >>>>
> >>>> So we have one method we can pursue to fix this in practice.
> >>>>
> >>>>> That is
> >>>>> obviously because it bypasses the cred_guard_mutex. But all other
> >>>>> process that access this file still freeze, and cannot be
> >>>>> interrupted except with kill -9.
> >>>>>
> >>>>> However that smells like a denial of service, that this
> >>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
> >>>>> that freezes any process, even root, when it looks at it.
> >>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
> >>>>
> >>>> Yes. Your the test case in your patch a variant of the original
> >>>> problem.
> >>>>
> >>>>
> >>>> I have been staring at this trying to understand the fundamentals of the
> >>>> original deeper problem.
> >>>>
> >>>> The current scope of cred_guard_mutex in exec is because being ptraced
> >>>> causes suid exec to act differently. So we need to know early if we are
> >>>> ptraced.
> >>>>
> >>>
> >>> It has a second use, that it prevents two threads entering execve,
> >>> which would probably result in disaster.
> >>
> >> Exec can fail with an error code up until de_thread. de_thread causes
> >> exec to fail with the error code -EAGAIN for the second thread to get
> >> into de_thread.
> >>
> >> So no. The cred_guard_mutex is not needed for that case at all.
> >>
> >>>> If that case did not exist we could reduce the scope of the
> >>>> cred_guard_mutex in exec to where your patch puts the cred_change_mutex.
> >>>>
> >>>> I am starting to think reworking how we deal with ptrace and exec is the
> >>>> way to solve this problem.
> >>
> >>
> >> I am 99% convinced that the fix is to move cred_guard_mutex down.
> >
> > "move cred_guard_mutex down" as in "take it once we've already set up
> > the new process, past the point of no return"?
> >
> >> Then right after we take cred_guard_mutex do:
> >> if (ptraced) {
> >> use_original_creds();
> >> }
> >>
> >> And call it a day.
> >>
> >> The details suck but I am 99% certain that would solve everyones
> >> problems, and not be too bad to audit either.
> >
> > Ah, hmm, that sounds like it'll work fine at least when no LSMs are involved.
> >
> > SELinux normally doesn't do the execution-degrading thing, it just
> > blocks the execution completely - see their selinux_bprm_set_creds()
> > hook. So I think they'd still need to set some state on the task that
> > says "we're currently in the middle of an execution where the target
> > task will run in context X", and then check against that in the
> > ptrace_may_access hook. Or I suppose they could just kill the task
> > near the end of execve, although that'd be kinda ugly.
> >
>
> We have current->in_execve for that, right?
> I think when the cred_guard_mutex is taken only in the critical section,
> then PTRACE_ATTACH could take the guard_mutex, and look at current->in_execve,
> and just return -EAGAIN in that case, right, everybody happy :)
It's probably going to mean that things like strace will just randomly
fail to attach to processes if they happen to be in the middle of
execve... but I guess that works?
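If it came to that, a tracer could presumably paper over it with a small retry loop. The following is only a sketch under that assumption: retry_on_eagain and the flaky_attach stand-in are hypothetical names, nothing strace actually has, and the 1 ms delay is an arbitrary choice:

```c
#include <errno.h>
#include <time.h>

/*
 * Call op(arg) until it stops failing with EAGAIN or max_tries is
 * used up.  op follows the syscall convention: -1 with errno set on
 * failure.  Helper and parameters are hypothetical.
 */
static long retry_on_eagain(long (*op)(void *), void *arg, int max_tries)
{
	struct timespec delay = { 0, 1000000 };	/* 1 ms between tries */
	long ret = -1;

	for (int i = 0; i < max_tries; i++) {
		ret = op(arg);
		if (ret != -1 || errno != EAGAIN)
			break;
		nanosleep(&delay, NULL);
	}
	return ret;
}

/* Stand-in for an attach that hits an execve in progress a few times. */
struct flaky {
	int busy_left;
};

static long flaky_attach(void *arg)
{
	struct flaky *f = arg;

	if (f->busy_left > 0) {
		f->busy_left--;
		errno = EAGAIN;
		return -1;
	}
	return 0;
}
```

In a real tracer the op would be a thin wrapper around ptrace(PTRACE_ATTACH, pid, 0, 0) rather than the stand-in above.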
From: Christian Brauner <[email protected]>
On March 2, 2020 6:37:27 PM GMT+01:00, Jann Horn <[email protected]> wrote:
>On Mon, Mar 2, 2020 at 6:01 PM Bernd Edlinger
><[email protected]> wrote:
>> On 3/2/20 5:43 PM, Jann Horn wrote:
>> > On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman
><[email protected]> wrote:
>> >>
>> >> Bernd Edlinger <[email protected]> writes:
>> >>
>> >>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>> >>>> Bernd Edlinger <[email protected]> writes:
>> >>>>
>> >>>>>
>> >>>>> I tried this with s/EACCESS/EACCES/.
>> >>>>>
>> >>>>> The test case in this patch is not fixed, but strace does not
>freeze,
>> >>>>> at least with my setup where it did freeze repeatable.
>> >>>>
>> >>>> Thanks, That is what I was aiming at.
>> >>>>
>> >>>> So we have one method we can pursue to fix this in practice.
>> >>>>
>> >>>>> That is
>> >>>>> obviously because it bypasses the cred_guard_mutex. But all
>other
>> >>>>> process that access this file still freeze, and cannot be
>> >>>>> interrupted except with kill -9.
>> >>>>>
>> >>>>> However that smells like a denial of service, that this
>> >>>>> simple test case which can be executed by guest, creates a
>/proc/$pid/mem
>> >>>>> that freezes any process, even root, when it looks at it.
>> >>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>> >>>>
>> >>>> Yes. Your the test case in your patch a variant of the original
>> >>>> problem.
>> >>>>
>> >>>>
>> >>>> I have been staring at this trying to understand the
>fundamentals of the
>> >>>> original deeper problem.
>> >>>>
>> >>>> The current scope of cred_guard_mutex in exec is because being
>ptraced
>> >>>> causes suid exec to act differently. So we need to know early
>if we are
>> >>>> ptraced.
>> >>>>
>> >>>
>> >>> It has a second use, that it prevents two threads entering
>execve,
>> >>> which would probably result in disaster.
>> >>
>> >> Exec can fail with an error code up until de_thread. de_thread
>causes
>> >> exec to fail with the error code -EAGAIN for the second thread to
>get
>> >> into de_thread.
>> >>
>> >> So no. The cred_guard_mutex is not needed for that case at all.
>> >>
>> >>>> If that case did not exist we could reduce the scope of the
>> >>>> cred_guard_mutex in exec to where your patch puts the
>cred_change_mutex.
>> >>>>
>> >>>> I am starting to think reworking how we deal with ptrace and
>exec is the
>> >>>> way to solve this problem.
>> >>
>> >>
>> >> I am 99% convinced that the fix is to move cred_guard_mutex down.
>> >
>> > "move cred_guard_mutex down" as in "take it once we've already set
>up
>> > the new process, past the point of no return"?
>> >
>> >> Then right after we take cred_guard_mutex do:
>> >> if (ptraced) {
>> >> use_original_creds();
>> >> }
>> >>
>> >> And call it a day.
>> >>
>> >> The details suck but I am 99% certain that would solve everyones
>> >> problems, and not be too bad to audit either.
>> >
>> > Ah, hmm, that sounds like it'll work fine at least when no LSMs are
>involved.
>> >
>> > SELinux normally doesn't do the execution-degrading thing, it just
>> > blocks the execution completely - see their
>selinux_bprm_set_creds()
>> > hook. So I think they'd still need to set some state on the task
>that
>> > says "we're currently in the middle of an execution where the
>target
>> > task will run in context X", and then check against that in the
>> > ptrace_may_access hook. Or I suppose they could just kill the task
>> > near the end of execve, although that'd be kinda ugly.
>> >
>>
>> We have current->in_execve for that, right?
>> I think when the cred_guard_mutex is taken only in the critical
>section,
>> then PTRACE_ATTACH could take the guard_mutex, and look at
>current->in_execve,
>> and just return -EAGAIN in that case, right, everybody happy :)
>
>It's probably going to mean that things like strace will just randomly
>fail to attach to processes if they happen to be in the middle of
>execve... but I guess that works?
That sounds like an acceptable outcome.
We can at least risk it and if we regress
revert or come up with the more complex
solution suggested in another mail here?
On Mon, Mar 2, 2020 at 6:43 PM <[email protected]> wrote:
> On March 2, 2020 6:37:27 PM GMT+01:00, Jann Horn <[email protected]> wrote:
> >On Mon, Mar 2, 2020 at 6:01 PM Bernd Edlinger
> ><[email protected]> wrote:
> >> On 3/2/20 5:43 PM, Jann Horn wrote:
> >> > On Mon, Mar 2, 2020 at 5:19 PM Eric W. Biederman
> ><[email protected]> wrote:
[...]
> >> >> I am 99% convinced that the fix is to move cred_guard_mutex down.
> >> >
> >> > "move cred_guard_mutex down" as in "take it once we've already set
> >up
> >> > the new process, past the point of no return"?
> >> >
> >> >> Then right after we take cred_guard_mutex do:
> >> >> if (ptraced) {
> >> >> use_original_creds();
> >> >> }
> >> >>
> >> >> And call it a day.
> >> >>
> >> >> The details suck but I am 99% certain that would solve everyones
> >> >> problems, and not be too bad to audit either.
> >> >
> >> > Ah, hmm, that sounds like it'll work fine at least when no LSMs are
> >involved.
> >> >
> >> > SELinux normally doesn't do the execution-degrading thing, it just
> >> > blocks the execution completely - see their
> >selinux_bprm_set_creds()
> >> > hook. So I think they'd still need to set some state on the task
> >that
> >> > says "we're currently in the middle of an execution where the
> >target
> >> > task will run in context X", and then check against that in the
> >> > ptrace_may_access hook. Or I suppose they could just kill the task
> >> > near the end of execve, although that'd be kinda ugly.
> >> >
> >>
> >> We have current->in_execve for that, right?
> >> I think when the cred_guard_mutex is taken only in the critical
> >section,
> >> then PTRACE_ATTACH could take the guard_mutex, and look at
> >current->in_execve,
> >> and just return -EAGAIN in that case, right, everybody happy :)
> >
> >It's probably going to mean that things like strace will just randomly
> >fail to attach to processes if they happen to be in the middle of
> >execve... but I guess that works?
>
> That sounds like an acceptable outcome.
> We can at least risk it and if we regress
> revert or come up with the more complex
> solution suggested in another mail here?
Yeah, sounds reasonable, I guess.
This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated. They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.
The deadlock always happens when strace needs to access the
tracee's process mmap while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to take the cred_guard_mutex only
in critical sections at the beginning and at the end of the
execve function, and to let PTRACE_ATTACH fail with EAGAIN while
execve is not complete, while other functions like mm_access are
allowed to complete normally.
I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.
Signed-off-by: Bernd Edlinger <[email protected]>
---
Documentation/security/credentials.rst | 19 +++++----
fs/exec.c | 28 ++++++++++++--
include/linux/binfmts.h | 6 ++-
include/linux/sched/signal.h | 1 +
init/init_task.c | 1 +
kernel/cred.c | 2 +-
kernel/fork.c | 1 +
kernel/ptrace.c | 4 ++
mm/process_vm_access.c | 2 +-
tools/testing/selftests/ptrace/Makefile | 4 +-
tools/testing/selftests/ptrace/vmaccess.c | 64 +++++++++++++++++++++++++++++++
11 files changed, 115 insertions(+), 17 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..61d6704 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,14 @@ new set of credentials by calling::
struct cred *prepare_creds(void);
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and released after setting
+current->signal->cred_locked_for_ptrace. The same mutex is acquired later,
+while the credentials and the process mmap are actually changed, and
+current->signal->cred_locked_for_ptrace is reset again.
The mutex prevents ``ptrace()`` from altering the ptrace state of a process
while security checks on credentials construction and changing is taking place
@@ -466,9 +471,8 @@ by calling::
This will alter various aspects of the credentials and the process, giving the
LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
This function is guaranteed to return 0, so that it can be tail-called at the
end of such functions as ``sys_setresuid()``.
@@ -486,8 +490,7 @@ invoked::
void abort_creds(struct cred *new);
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..e466301 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval)
goto out;
+ retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+ if (retval)
+ goto out;
+
+ bprm->called_flush_old_exec = 1;
+
/*
* Must be called _before_ exec_mmap() as bprm->mm is
* not visibile until then. This also enables the update
@@ -1398,28 +1404,41 @@ void finalize_exec(struct linux_binprm *bprm)
EXPORT_SYMBOL(finalize_exec);
/*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_for_ptrace.
* install_exec_creds() commits the new creds and drops the lock.
* Or, if exec fails before, free_bprm() should release ->cred and
* and unlock.
*/
static int prepare_bprm_creds(struct linux_binprm *bprm)
{
+ int ret;
+
if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
+ ret = -EAGAIN;
+ if (unlikely(current->signal->cred_locked_for_ptrace))
+ goto out;
+
+ ret = -ENOMEM;
bprm->cred = prepare_exec_creds();
- if (likely(bprm->cred))
- return 0;
+ if (likely(bprm->cred)) {
+ current->signal->cred_locked_for_ptrace = true;
+ ret = 0;
+ }
+out:
mutex_unlock(&current->signal->cred_guard_mutex);
- return -ENOMEM;
+ return ret;
}
static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
+ if (!bprm->called_flush_old_exec)
+ mutex_lock(&current->signal->cred_guard_mutex);
+ current->signal->cred_locked_for_ptrace = false;
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
@@ -1469,6 +1488,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
+ current->signal->cred_locked_for_ptrace = false;
mutex_unlock(&current->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2e1318b 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
* exec has happened. Used to sanitize execution environment
* and to set AT_SECURE auxv for glibc.
*/
- secureexec:1;
+ secureexec:1,
+ /*
+ * Set by flush_old_exec, when the cred_change_mutex is taken.
+ */
+ called_flush_old_exec:1;
#ifdef __alpha__
unsigned int taso:1;
#endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..073a2b7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace) */
+ bool cred_locked_for_ptrace; /* set while in execve */
} __randomize_layout;
/*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..ecefff28 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
.multiprocess = HLIST_HEAD_INIT,
.rlim = INIT_RLIMITS,
.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+ .cred_locked_for_ptrace = false,
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
.cputimer = {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
*
* Returns the new credentials or NULL if out of memory.
*
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
*/
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..a2b2ec8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
mutex_init(&sig->cred_guard_mutex);
+ sig->cred_locked_for_ptrace = false;
return 0;
}
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..abf09ba 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -395,6 +395,10 @@ static int ptrace_attach(struct task_struct *task, long request,
if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
goto out;
+ retval = -EAGAIN;
+ if (task->signal->cred_locked_for_ptrace)
+ goto unlock_creds;
+
task_lock(task);
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
task_unlock(task);
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
if (!mm || IS_ERR(mm)) {
rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
/*
- * Explicitly map EACCES to EPERM as EPERM is a more a
+ * Explicitly map EACCES to EPERM as EPERM is a more
* appropriate error code for process_vw_readv/writev
*/
if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..63ff531
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <[email protected]>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+ ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+ return NULL;
+}
+
+TEST(vmaccess)
+{
+ int f, pid = fork();
+ char mm[64];
+
+ if (!pid) {
+ pthread_t pt;
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("true", "true", NULL);
+ }
+
+ sleep(1);
+ sprintf(mm, "/proc/%d/mem", pid);
+ f = open(mm, O_RDONLY);
+ ASSERT_LE(0, f);
+ close(f);
+ f = kill(pid, SIGCONT);
+ ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+ int f, pid = fork();
+
+ if (!pid) {
+ pthread_t pt;
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("true", "true", NULL);
+ }
+
+ sleep(1);
+ f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ ASSERT_EQ(EAGAIN, errno);
+ ASSERT_EQ(f, -1);
+ f = kill(pid, SIGCONT);
+ ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
--
1.9.1
On 3/2/20 9:10 PM, Bernd Edlinger wrote:
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -44,7 +44,11 @@ struct linux_binprm {
> * exec has happened. Used to sanitize execution environment
> * and to set AT_SECURE auxv for glibc.
> */
> - secureexec:1;
> + secureexec:1,
> + /*
> + * Set by flush_old_exec, when the cred_change_mutex is taken.
Oops, I forgot to update this comment; it should read "when the cred_guard_mutex is taken".
I'll send a new patch later.
Bernd.
> + */
> + called_flush_old_exec:1;
> #ifdef __alpha__
> unsigned int taso:1;
> #endif
Bernd Edlinger <[email protected]> writes:
> On 3/2/20 5:17 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <[email protected]> writes:
>>
>>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>>>> Bernd Edlinger <[email protected]> writes:
>>>>
>>>>>
>>>>> I tried this with s/EACCESS/EACCES/.
>>>>>
>>>>> The test case in this patch is not fixed, but strace does not freeze,
>>>>> at least with my setup where it did freeze repeatable.
>>>>
>>>> Thanks, That is what I was aiming at.
>>>>
>>>> So we have one method we can pursue to fix this in practice.
>>>>
>>>>> That is
>>>>> obviously because it bypasses the cred_guard_mutex. But all other
>>>>> process that access this file still freeze, and cannot be
>>>>> interrupted except with kill -9.
>>>>>
>>>>> However that smells like a denial of service, that this
>>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>>>> that freezes any process, even root, when it looks at it.
>>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>>>>
>>>> Yes. The test case in your patch is a variant of the original
>>>> problem.
>>>>
>>>>
>>>> I have been staring at this trying to understand the fundamentals of the
>>>> original deeper problem.
>>>>
>>>> The current scope of cred_guard_mutex in exec is because being ptraced
>>>> causes suid exec to act differently. So we need to know early if we are
>>>> ptraced.
>>>>
>>>
>>> It has a second use, that it prevents two threads entering execve,
>>> which would probably result in disaster.
>>
>> Exec can fail with an error code up until de_thread. de_thread causes
>> exec to fail with the error code -EAGAIN for the second thread to get
>> into de_thread.
>>
>> So no. The cred_guard_mutex is not needed for that case at all.
>>
>
> Okay, but that will reset current->in_execve, right?
Absolutely.
The error handling kicks in and exec_binprm fails with a negative
return code. Then __do_execve_file cleans up and clears
current->in_execve.
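
The error path described above can be modelled in user space. The
following is only a minimal sketch, not kernel code: struct model_group,
struct model_task, model_de_thread() and model_execve() are hypothetical
names made up for illustration, and the "exec already in progress" check
is a simplification of what de_thread really tests.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; names are illustrative only. */
struct model_group {
	bool exec_in_progress;	/* another thread already got into de_thread */
};

struct model_task {
	struct model_group *group;
	bool in_execve;		/* per-thread flag, like current->in_execve */
};

/* Models de_thread(): only the first thread of a group may proceed. */
static int model_de_thread(struct model_task *t)
{
	if (t->group->exec_in_progress)
		return -EAGAIN;
	t->group->exec_in_progress = true;
	return 0;
}

/* Models the exec path: set the flag, try to de-thread, clean up on error. */
static int model_execve(struct model_task *t)
{
	int ret;

	t->in_execve = true;
	ret = model_de_thread(t);
	if (ret < 0) {
		/* Error handling clears the flag of the failing thread only. */
		t->in_execve = false;
		return ret;
	}
	/* Success: the flag stays set here to model "exec still running". */
	return 0;
}
```

In this model the second thread of a group gets -EAGAIN and leaves with
its own in_execve cleared, while the first thread's flag is untouched,
which is why a per-thread flag cannot tell other threads that an exec is
in flight.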
Eric
On 3/2/20 10:49 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> On 3/2/20 5:17 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <[email protected]> writes:
>>>
>>>> On 3/2/20 4:57 PM, Eric W. Biederman wrote:
>>>>> Bernd Edlinger <[email protected]> writes:
>>>>>
>>>>>>
>>>>>> I tried this with s/EACCESS/EACCES/.
>>>>>>
>>>>>> The test case in this patch is not fixed, but strace does not freeze,
>>>>>> at least with my setup where it did freeze repeatable.
>>>>>
>>>>> Thanks, That is what I was aiming at.
>>>>>
>>>>> So we have one method we can pursue to fix this in practice.
>>>>>
>>>>>> That is
>>>>>> obviously because it bypasses the cred_guard_mutex. But all other
>>>>>> process that access this file still freeze, and cannot be
>>>>>> interrupted except with kill -9.
>>>>>>
>>>>>> However that smells like a denial of service, that this
>>>>>> simple test case which can be executed by guest, creates a /proc/$pid/mem
>>>>>> that freezes any process, even root, when it looks at it.
>>>>>> I mean: "ln -s README /proc/$pid/mem" would be a nice bomb.
>>>>>
>>>>> Yes. The test case in your patch is a variant of the original
>>>>> problem.
>>>>>
>>>>>
>>>>> I have been staring at this trying to understand the fundamentals of the
>>>>> original deeper problem.
>>>>>
>>>>> The current scope of cred_guard_mutex in exec is because being ptraced
>>>>> causes suid exec to act differently. So we need to know early if we are
>>>>> ptraced.
>>>>>
>>>>
>>>> It has a second use, that it prevents two threads entering execve,
>>>> which would probably result in disaster.
>>>
>>> Exec can fail with an error code up until de_thread. de_thread causes
>>> exec to fail with the error code -EAGAIN for the second thread to get
>>> into de_thread.
>>>
>>> So no. The cred_guard_mutex is not needed for that case at all.
>>>
>>
>> Okay, but that will reset current->in_execve, right?
>
> Absolutely.
>
> The error handling kicks in and exec_binprm fails with a negative
> return code. Then __do_execve_file cleans up and clears
> current->in_execve.
>
Yes, of course. I was under the wrong impression that this value is
a kind of global, but it is thread-local.
So I think I need a new boolean; see v3 of the patch, and soon v4 (with
just one comment fixed).
I'm currently running the strace v5.5 test suite, and every test
has passed so far. I'll also look at the gdb test suite before I send the
next version.
Thanks
Bernd.
This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated. They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.
The deadlock always happens when strace needs to access the
tracee's process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to take the cred_guard_mutex only
in a critical section at the beginning, and at the end of the
execve function, and let PTRACE_ATTACH fail with EAGAIN while
execve is not complete, but other functions like mm_access are
allowed to complete normally.
I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.
Signed-off-by: Bernd Edlinger <[email protected]>
---
Documentation/security/credentials.rst | 19 +++++----
fs/exec.c | 28 +++++++++++--
include/linux/binfmts.h | 6 ++-
include/linux/sched/signal.h | 1 +
init/init_task.c | 1 +
kernel/cred.c | 2 +-
kernel/fork.c | 1 +
kernel/ptrace.c | 4 ++
mm/process_vm_access.c | 2 +-
tools/testing/selftests/ptrace/Makefile | 4 +-
tools/testing/selftests/ptrace/vmaccess.c | 66 +++++++++++++++++++++++++++++++
11 files changed, 117 insertions(+), 17 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case.
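
For readers who want the idea without the diff, the scheme from the
description above can be sketched in user space roughly as follows. This
is an illustration under stated assumptions, not the patch itself: a
pthread mutex stands in for cred_guard_mutex, all model_* names are
hypothetical, and the flush_old_exec/install_exec_creds pair is collapsed
into a single step.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* User-space stand-in for the two fields in struct signal_struct. */
struct model_signal {
	pthread_mutex_t cred_guard_mutex;
	bool cred_locked_for_ptrace;
};

/* Start of exec: take the mutex briefly and mark exec as in progress. */
static int model_prepare_bprm_creds(struct model_signal *sig)
{
	int ret = -EAGAIN;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (!sig->cred_locked_for_ptrace) {
		sig->cred_locked_for_ptrace = true;
		ret = 0;
	}
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}

/* Attach: never blocks for the whole exec, fails fast with -EAGAIN. */
static int model_ptrace_attach(struct model_signal *sig)
{
	int ret = 0;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (sig->cred_locked_for_ptrace)
		ret = -EAGAIN;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}

/* End of exec (simplified): clear the flag under the mutex. */
static void model_install_exec_creds(struct model_signal *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
	sig->cred_locked_for_ptrace = false;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}
```

The point of the design is that the mutex is only held for short
critical sections, so a reader of the tracee's memory is never stuck
behind a long-running exec, while an attach attempt during exec is
refused instead of queued.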
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..61d6704 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,14 @@ new set of credentials by calling::
struct cred *prepare_creds(void);
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and released after setting
+current->signal->cred_locked_for_ptrace. The same mutex is acquired later,
+while the credentials and the process mmap are actually changed, and
+current->signal->cred_locked_for_ptrace is reset again.
The mutex prevents ``ptrace()`` from altering the ptrace state of a process
while security checks on credentials construction and changing is taking place
@@ -466,9 +471,8 @@ by calling::
This will alter various aspects of the credentials and the process, giving the
LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
This function is guaranteed to return 0, so that it can be tail-called at the
end of such functions as ``sys_setresuid()``.
@@ -486,8 +490,7 @@ invoked::
void abort_creds(struct cred *new);
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..e466301 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval)
goto out;
+ retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+ if (retval)
+ goto out;
+
+ bprm->called_flush_old_exec = 1;
+
/*
* Must be called _before_ exec_mmap() as bprm->mm is
* not visibile until then. This also enables the update
@@ -1398,28 +1404,41 @@ void finalize_exec(struct linux_binprm *bprm)
EXPORT_SYMBOL(finalize_exec);
/*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_for_ptrace.
* install_exec_creds() commits the new creds and drops the lock.
* Or, if exec fails before, free_bprm() should release ->cred and
* and unlock.
*/
static int prepare_bprm_creds(struct linux_binprm *bprm)
{
+ int ret;
+
if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
+ ret = -EAGAIN;
+ if (unlikely(current->signal->cred_locked_for_ptrace))
+ goto out;
+
+ ret = -ENOMEM;
bprm->cred = prepare_exec_creds();
- if (likely(bprm->cred))
- return 0;
+ if (likely(bprm->cred)) {
+ current->signal->cred_locked_for_ptrace = true;
+ ret = 0;
+ }
+out:
mutex_unlock(&current->signal->cred_guard_mutex);
- return -ENOMEM;
+ return ret;
}
static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
+ if (!bprm->called_flush_old_exec)
+ mutex_lock(&current->signal->cred_guard_mutex);
+ current->signal->cred_locked_for_ptrace = false;
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
@@ -1469,6 +1488,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
+ current->signal->cred_locked_for_ptrace = false;
mutex_unlock(&current->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2930253 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
* exec has happened. Used to sanitize execution environment
* and to set AT_SECURE auxv for glibc.
*/
- secureexec:1;
+ secureexec:1,
+ /*
+ * Set by flush_old_exec, when the cred_guard_mutex is taken.
+ */
+ called_flush_old_exec:1;
#ifdef __alpha__
unsigned int taso:1;
#endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..073a2b7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,7 @@ struct signal_struct {
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace) */
+ bool cred_locked_for_ptrace; /* set while in execve */
} __randomize_layout;
/*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..ecefff28 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
.multiprocess = HLIST_HEAD_INIT,
.rlim = INIT_RLIMITS,
.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+ .cred_locked_for_ptrace = false,
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
.cputimer = {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
*
* Returns the new credentials or NULL if out of memory.
*
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
*/
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..a2b2ec8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
mutex_init(&sig->cred_guard_mutex);
+ sig->cred_locked_for_ptrace = false;
return 0;
}
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..abf09ba 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -395,6 +395,10 @@ static int ptrace_attach(struct task_struct *task, long request,
if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
goto out;
+ retval = -EAGAIN;
+ if (task->signal->cred_locked_for_ptrace)
+ goto unlock_creds;
+
task_lock(task);
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
task_unlock(task);
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
if (!mm || IS_ERR(mm)) {
rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
/*
- * Explicitly map EACCES to EPERM as EPERM is a more a
+ * Explicitly map EACCES to EPERM as EPERM is a more
* appropriate error code for process_vw_readv/writev
*/
if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..6d8a048
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <[email protected]>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+ ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+ return NULL;
+}
+
+TEST(vmaccess)
+{
+ int f, pid = fork();
+ char mm[64];
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("true", "true", NULL);
+ }
+
+ sleep(1);
+ sprintf(mm, "/proc/%d/mem", pid);
+ f = open(mm, O_RDONLY);
+ ASSERT_LE(0, f);
+ close(f);
+ f = kill(pid, SIGCONT);
+ ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+ int f, pid = fork();
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("true", "true", NULL);
+ }
+
+ sleep(1);
+ f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ ASSERT_EQ(EAGAIN, errno);
+ ASSERT_EQ(f, -1);
+ f = kill(pid, SIGCONT);
+ ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
--
1.9.1
On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
>
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated. They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
>
> The deadlock is always happening when strace needs to access the
> tracees process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>
> strace D 0 30614 30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> expect D 0 31933 30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> The proposed solution is to take the cred_guard_mutex only
> in a critical section at the beginning, and at the end of the
> execve function, and let PTRACE_ATTACH fail with EAGAIN while
> execve is not complete, but other functions like vm_access are
> allowed to complete normally.
Sorry to be a bummer, but I don't think this will work. A few more things
during the exec process depend on cred_guard_mutex being held.
If I'm reading this patch correctly, this changes the lifetime of the
cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through
install_exec_creds().
That means, for example, that check_unsafe_exec()'s documented invariant
is violated:
/*
* determine how safe it is to execute the proposed program
* - the caller must hold ->cred_guard_mutex to protect against
* PTRACE_ATTACH or seccomp thread-sync
*/
static void check_unsafe_exec(struct linux_binprm *bprm) ...
which is looking at no_new_privs as well as other details, and making
decisions about the bprm state from the current state.
I think it also means that the potentially multiple invocations
of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
a lock (another place where current's no_new_privs is evaluated).
Related, it also means that cred_guard_mutex is unheld for every
invocation of search_binary_handler() (which can loop via the previously
mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
currently.)
For seccomp, the expectations about existing thread states risk races
too. There are two locks held for TSYNC:
- current->sighand->siglock is held to keep new threads from
appearing/disappearing, which would destroy filter refcounting and
lead to memory corruption.
- cred_guard_mutex is held to keep no_new_privs in sync with filters to
avoid no_new_privs and filter confusion during exec, which could
lead to exploitable setuid conditions (see below).
Just racing a malicious thread during TSYNC is not a very strong
example (a malicious thread could do lots of fun things to "current"
before it ever got near calling TSYNC), but I think there is the risk
of mismatched/confused states that we don't want to allow. One is a
particularly bad state that could lead to privilege escalations (in the
form of the old "sendmail doesn't check setuid" flaw; if a setuid process
has a filter attached that silently fails a priv-dropping setuid call
and continues execution with elevated privs, it can be tricked into
doing bad things on behalf of the unprivileged parent, which was the
primary goal of the original use of cred_guard_mutex with TSYNC[1]):
thread A clones thread B
thread B starts setuid exec
thread A sets no_new_privs
thread A calls seccomp with TSYNC
thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
thread B passes check_unsafe_exec() with no_new_privs unset
thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
thread A still in seccomp_sync_threads() sets no_new_privs on thread B
thread B finishes exec, now running with elevated privs, a filter chosen
by thread A, _and_ nnp set (which doesn't matter)
With the original locking, thread B will fail check_unsafe_exec()
because filter and nnp state are changed together, with "atomicity"
protected by the cred_guard_mutex.
And this is just the bad state I _can_ see. I'm worried there are more...
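
The interleaving above can be replayed as a sequential toy model. This
is only a sketch of the ordering problem, with made-up names
(struct model_thread, model_setuid_exec() and the two model_*_order()
helpers); it does not use real threads and it reduces the setuid
decision to a single test of no_new_privs.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-thread state; field names are illustrative, not kernel fields. */
struct model_thread {
	bool no_new_privs;
	bool has_filter;
	bool gained_privs;
};

/* Models the setuid decision during exec: nnp blocks the privilege gain. */
static void model_setuid_exec(struct model_thread *t)
{
	t->gained_privs = !t->no_new_privs;
}

/* Racy order: the exec decision lands between the two TSYNC steps. */
static struct model_thread model_racy_order(void)
{
	struct model_thread b = { false, false, false };

	b.has_filter = true;	/* thread A: filter applied to thread B */
	model_setuid_exec(&b);	/* thread B: sees nnp unset, gains privs */
	b.no_new_privs = true;	/* thread A: nnp applied too late */
	return b;
}

/* Serialized order: both TSYNC steps complete before the exec decision. */
static struct model_thread model_serialized_order(void)
{
	struct model_thread b = { false, false, false };

	b.has_filter = true;
	b.no_new_privs = true;
	model_setuid_exec(&b);	/* sees nnp set, no privilege gain */
	return b;
}
```

In the racy order thread B ends up privileged and filtered at the same
time, which is the state the lock is meant to exclude; in the serialized
order the filter and nnp are visible together and no privileges are
gained.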
All this said, I do see a small similarity here to the work I did to
stabilize stack rlimits (there was an ongoing problem with making multiple
decisions for the bprm based on current's state -- but current's state
was mutable during exec). For this, I saved rlim_stack to bprm and ignored
current's copy until exec ended and then stored bprm's copy into current.
If the only problem anyone can see here is the handling of no_new_privs,
we might be able to solve that similarly, at least disentangling tsync/nnp
from cred_guard_mutex.
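
The snapshot idea mentioned above can be sketched like this. It is a
user-space illustration only, assuming a single unsigned long limit; the
model_* names are hypothetical and not taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for current and the bprm. */
struct model_task {
	unsigned long rlim_stack;
};

struct model_bprm {
	unsigned long rlim_stack;	/* private copy taken when exec starts */
};

/* Exec start: copy the mutable value out of the task exactly once. */
static void model_bprm_init(struct model_bprm *bprm,
			    const struct model_task *cur)
{
	bprm->rlim_stack = cur->rlim_stack;
}

/* Every decision during exec consults the snapshot, never the task. */
static bool model_stack_fits(const struct model_bprm *bprm,
			     unsigned long need)
{
	return need <= bprm->rlim_stack;
}

/* Exec end: store the snapshot back so task and decisions agree. */
static void model_exec_commit(struct model_task *cur,
			      const struct model_bprm *bprm)
{
	cur->rlim_stack = bprm->rlim_stack;
}
```

Because all checks read the copy, a concurrent change to the task's
value during exec cannot make two checks disagree; the same pattern is
what is being suggested for no_new_privs.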
-Kees
[1] https://lore.kernel.org/lkml/[email protected]/
--
Kees Cook
On 3/3/20 3:26 AM, Kees Cook wrote:
> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated. They send ptrace events, but the strace is no
>> longer able to respond, since it is blocked in vm_access.
>>
>> The deadlock is always happening when strace needs to access the
>> tracees process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>
>> strace D 0 30614 30584 0x00000000
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> schedule_preempt_disabled+0x15/0x20
>> __mutex_lock.isra.13+0x1ec/0x520
>> __mutex_lock_killable_slowpath+0x13/0x20
>> mutex_lock_killable+0x28/0x30
>> mm_access+0x27/0xa0
>> process_vm_rw_core.isra.3+0xff/0x550
>> process_vm_rw+0xdd/0xf0
>> __x64_sys_process_vm_readv+0x31/0x40
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> expect D 0 31933 30876 0x80004003
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> flush_old_exec+0xc4/0x770
>> load_elf_binary+0x35a/0x16c0
>> search_binary_handler+0x97/0x1d0
>> __do_execve_file.isra.40+0x5d4/0x8a0
>> __x64_sys_execve+0x49/0x60
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> The proposed solution is to take the cred_guard_mutex only
>> in a critical section at the beginning, and at the end of the
>> execve function, and let PTRACE_ATTACH fail with EAGAIN while
>> execve is not complete, but other functions like vm_access are
>> allowed to complete normally.
>
> Sorry to be a bummer, but I don't think this will work. A few more things
> during the exec process depend on cred_guard_mutex being held.
>
> If I'm reading this patch correctly, this changes the lifetime of the
> cred_guard_mutex lock to be:
> - during prepare_bprm_creds()
> - from flush_old_exec() through install_exec_creds()
> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> install_exec_creds().
>
> That means, for example, that check_unsafe_exec()'s documented invariant
> is violated:
> /*
> * determine how safe it is to execute the proposed program
> * - the caller must hold ->cred_guard_mutex to protect against
> * PTRACE_ATTACH or seccomp thread-sync
> */
Oh, right, I hadn't understood that hint...
> static void check_unsafe_exec(struct linux_binprm *bprm) ...
> which is looking at no_new_privs as well as other details, and making
> decisions about the bprm state from the current state.
>
> I think it also means that the potentially multiple invocations
> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> a lock (another place where current's no_new_privs is evaluated).
So no_new_privs can change from 0 to 1, but it should not change
while execve is running.
As long as the calling thread is in execve, it won't do this itself,
and the only other place where it may be set for other threads
is seccomp_sync_threads, but that can easily be avoided; see below.
>
> Related, it also means that cred_guard_mutex is unheld for every
> invocation of search_binary_handler() (which can loop via the previously
> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> currently.)
>
> For seccomp, the expectations about existing thread states risks races
> too. There are two locks held for TSYNC:
> - current->sighand->siglock is held to keep new threads from
> appearing/disappearing, which would destroy filter refcounting and
> lead to memory corruption.
I don't understand what you mean here.
How can this lead to memory corruption?
> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> avoid no_new_privs and filter confusion during exec, which could
> lead to exploitable setuid conditions (see below).
>
> Just racing a malicious thread during TSYNC is not a very strong
> example (a malicious thread could do lots of fun things to "current"
> before it ever got near calling TSYNC), but I think there is the risk
> of mismatched/confused states that we don't want to allow. One is a
> particularly bad state that could lead to privilege escalations (in the
> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> has a filter attached that silently fails a priv-dropping setuid call
> and continues execution with elevated privs, it can be tricked into
> doing bad things on behalf of the unprivileged parent, which was the
> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
>
> thread A clones thread B
> thread B starts setuid exec
> thread A sets no_new_privs
> thread A calls seccomp with TSYNC
> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> thread B passes check_unsafe_exec() with no_new_privs unset
> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> thread B finishes exec, now running with elevated privs, a filter chosen
> by thread A, _and_ nnp set (which doesn't matter)
>
> With the original locking, thread B will fail check_unsafe_exec()
> because filter and nnp state are changed together, with "atomicity"
> protected by the cred_guard_mutex.
>
Ah, good point, thanks!
This can be fixed by checking current->signal->cred_locked_for_ptrace
while the cred_guard_mutex is locked, like this for instance:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..377abf0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
assert_spin_locked(&current->sighand->siglock);
+ if (current->signal->cred_locked_for_ptrace)
+ return -EAGAIN;
+
/* Validate all threads being eligible for synchronization. */
caller = current;
for_each_thread(caller, thread) {
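
As a rough illustration of the idea (not kernel code), a minimal userspace
sketch might look like the following. It assumes a hypothetical
struct model_signal in which a pthread mutex stands in for cred_guard_mutex
and a bool stands in for the proposed flag; all model_* names are made up
for this example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant part of signal_struct. */
struct model_signal {
	pthread_mutex_t cred_guard_mutex;	/* models cred_guard_mutex */
	bool cred_locked_in_exec;		/* models the proposed flag; only
						 * accessed with the mutex held */
};

static void model_signal_init(struct model_signal *sig)
{
	pthread_mutex_init(&sig->cred_guard_mutex, NULL);
	sig->cred_locked_in_exec = false;
}

/* Models the start of exec: set the flag, then drop the mutex again. */
static void model_exec_begin(struct model_signal *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
	assert(!sig->cred_locked_in_exec);	/* only one exec at a time */
	sig->cred_locked_in_exec = true;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}

/* Models the end of exec: clear the flag under the mutex. */
static void model_exec_end(struct model_signal *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
	sig->cred_locked_in_exec = false;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}

/* Models TSYNC: fail with -EAGAIN instead of waiting for exec to finish. */
static int model_tsync(struct model_signal *sig)
{
	int ret = 0;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (sig->cred_locked_in_exec)
		ret = -EAGAIN;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}
```

In this model a caller that races with an exec in progress gets -EAGAIN
back instead of sleeping on the mutex for the whole duration of the exec.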
> And this is just the bad state I _can_ see. I'm worried there are more...
>
> All this said, I do see a small similarity here to the work I did to
> stabilize stack rlimits (there was an ongoing problem with making multiple
> decisions for the bprm based on current's state -- but current's state
> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> current's copy until exec ended and then stored bprm's copy into current.
> If the only problem anyone can see here is the handling of no_new_privs,
> we might be able to solve that similarly, at least disentangling tsync/nnp
> from cred_guard_mutex.
>
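
The rlim_stack approach described above can be sketched in userspace
roughly as follows. This is only a model under stated assumptions:
struct model_task, struct model_bprm and the model_* helpers are
hypothetical names, not the kernel's types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the mutable per-task state ("current"). */
struct model_task {
	unsigned long rlim_stack;
};

/* Hypothetical stand-in for the per-exec state (the bprm). */
struct model_bprm {
	unsigned long rlim_stack;	/* snapshot taken once at exec start */
};

/* At exec start: copy the task's value into the bprm exactly once. */
static void model_bprm_init(struct model_bprm *bprm,
			    const struct model_task *task)
{
	assert(bprm != NULL && task != NULL);
	bprm->rlim_stack = task->rlim_stack;
}

/* During exec: every decision reads the snapshot, never the task. */
static unsigned long model_stack_limit(const struct model_bprm *bprm)
{
	return bprm->rlim_stack;
}

/* At exec end: store the snapshot back into the task. */
static void model_exec_commit(struct model_task *task,
			      const struct model_bprm *bprm)
{
	task->rlim_stack = bprm->rlim_stack;
}
```

A change made to the task's copy in the middle of the exec is simply not
seen by the exec, and is overwritten by the snapshot at commit time.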
I still think that is solvable by using cred_locked_for_ptrace and
simply making the tsync fail if it would otherwise be blocked.
Thanks
Bernd.
On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> On 3/3/20 3:26 AM, Kees Cook wrote:
> > On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > > [...]
> >
> > If I'm reading this patch correctly, this changes the lifetime of the
> > cred_guard_mutex lock to be:
> > - during prepare_bprm_creds()
> > - from flush_old_exec() through install_exec_creds()
> > Before, cred_guard_mutex was held from prepare_bprm_creds() through
> > install_exec_creds().
BTW, I think the effect of this change (i.e. my paragraph above) should
be distinctly called out in the commit log if this solution moves
forward.
> > That means, for example, that check_unsafe_exec()'s documented invariant
> > is violated:
> > /*
> > * determine how safe it is to execute the proposed program
> > * - the caller must hold ->cred_guard_mutex to protect against
> > * PTRACE_ATTACH or seccomp thread-sync
> > */
>
> Oh, right, I haven't understood that hint...
I know no_new_privs is checked there, but I haven't studied the
PTRACE_ATTACH part of that comment. If that is handled with the new
check, this comment should be updated.
> > I think it also means that the potentially multiple invocations
> > of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> > binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> > a lock (another place where current's no_new_privs is evaluated).
>
> So no_new_privs can change from 0->1, but should not do so
> while execve is running.
>
> As long as the calling thread is in execve it won't do this,
> and the only other place where it may be set for other threads
> is in seccomp_sync_threads(), but that can easily be avoided, see below.
Yeah, everything was fine until I had to go complicate things with
TSYNC. ;) The real goal is making sure an exec cannot gain privs while
later gaining a seccomp filter from an unpriv process. The no_new_privs
flag was used to control this, but it required that the filter not get
applied during exec.
> > Related, it also means that cred_guard_mutex is unheld for every
> > invocation of search_binary_handler() (which can loop via the previously
> > mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> > dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> > currently.)
> >
> > For seccomp, the expectations about existing thread states risk races
> > too. There are two locks held for TSYNC:
> > - current->sighand->siglock is held to keep new threads from
> > appearing/disappearing, which would destroy filter refcounting and
> > lead to memory corruption.
>
> I don't understand what you mean here.
> How can this lead to memory corruption?
Mainly this is a matter of how seccomp manages its filter hierarchy
(since the filters are shared through process ancestry), so if a thread
appears in the middle of TSYNC it may be racing another TSYNC and break
ancestry, leading to bad reference counting on process death, etc.
(Though, yes, with refcount_t now, things should never corrupt, just
waste memory.)
> > - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> > avoid no_new_privs and filter confusion during exec, which could
> > lead to exploitable setuid conditions (see below).
> >
> > Just racing a malicious thread during TSYNC is not a very strong
> > example (a malicious thread could do lots of fun things to "current"
> > before it ever got near calling TSYNC), but I think there is the risk
> > of mismatched/confused states that we don't want to allow. One is a
> > particularly bad state that could lead to privilege escalations (in the
> > form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> > has a filter attached that silently fails a priv-dropping setuid call
> > and continues execution with elevated privs, it can be tricked into
> > doing bad things on behalf of the unprivileged parent, which was the
> > primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> >
> > thread A clones thread B
> > thread B starts setuid exec
> > thread A sets no_new_privs
> > thread A calls seccomp with TSYNC
> > thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> > thread B passes check_unsafe_exec() with no_new_privs unset
> > thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> > thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> > thread B finishes exec, now running with elevated privs, a filter chosen
> > by thread A, _and_ nnp set (which doesn't matter)
> >
> > With the original locking, thread B will fail check_unsafe_exec()
> > because filter and nnp state are changed together, with "atomicity"
> > protected by the cred_guard_mutex.
> >
>
> Ah, good point, thanks!
>
> This can be fixed by checking current->signal->cred_locked_for_ptrace
> while the cred_guard_mutex is locked, like this for instance:
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b6ea3dc..377abf0 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> assert_spin_locked(&current->sighand->siglock);
>
> + if (current->signal->cred_locked_for_ptrace)
> + return -EAGAIN;
> +
Hmm. I guess something like that could work. TSYNC expects to be able to
report _which_ thread wrecked the call, though... I wonder if in_execve
could be used to figure out the offending thread. Hm, nope, that would
be outside of lock too (and all users are "current" right now, so the
lock wasn't needed before).
> /* Validate all threads being eligible for synchronization. */
> caller = current;
> for_each_thread(caller, thread) {
>
>
> > And this is just the bad state I _can_ see. I'm worried there are more...
> >
> > All this said, I do see a small similarity here to the work I did to
> > stabilize stack rlimits (there was an ongoing problem with making multiple
> > decisions for the bprm based on current's state -- but current's state
> > was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> > current's copy until exec ended and then stored bprm's copy into current.
> > If the only problem anyone can see here is the handling of no_new_privs,
> > we might be able to solve that similarly, at least disentangling tsync/nnp
> > from cred_guard_mutex.
> >
>
> I still think that is solvable by using cred_locked_for_ptrace and
> simply making the tsync fail if it would otherwise be blocked.
I wonder if we can find a better name than "cred_locked_for_ptrace"?
Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
And the comment on bool cred_locked_for_ptrace should mention that
access is only allowed under cred_guard_mutex lock.
> > > + sig->cred_locked_for_ptrace = false;
This is redundant to the zalloc -- I think you can drop it (unless
someone wants to keep it for clarity?)
Also, I think cred_locked_for_ptrace needs checking deeper, in
__ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
be sufficient to see a stable version of the thread...
(I remain very nervous about weakening cred_guard_mutex without
addressing the many many users...)
--
Kees Cook
On 3/3/20 6:29 AM, Kees Cook wrote:
> On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
>> On 3/3/20 3:26 AM, Kees Cook wrote:
>>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
>>>> [...]
>>>
>>> If I'm reading this patch correctly, this changes the lifetime of the
>>> cred_guard_mutex lock to be:
>>> - during prepare_bprm_creds()
>>> - from flush_old_exec() through install_exec_creds()
>>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
>>> install_exec_creds().
>
> BTW, I think the effect of this change (i.e. my paragraph above) should
> be distinctly called out in the commit log if this solution moves
> forward.
>
Okay, will do.
>>> That means, for example, that check_unsafe_exec()'s documented invariant
>>> is violated:
>>> /*
>>> * determine how safe it is to execute the proposed program
>>> * - the caller must hold ->cred_guard_mutex to protect against
>>> * PTRACE_ATTACH or seccomp thread-sync
>>> */
>>
>> Oh, right, I haven't understood that hint...
>
> I know no_new_privs is checked there, but I haven't studied the
> PTRACE_ATTACH part of that comment. If that is handled with the new
> check, this comment should be updated.
>
Okay, I'll change that comment to:
/*
* determine how safe it is to execute the proposed program
* - the caller must have set ->cred_locked_in_execve to protect against
* PTRACE_ATTACH or seccomp thread-sync
*/
>>> I think it also means that the potentially multiple invocations
>>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
>>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
>>> a lock (another place where current's no_new_privs is evaluated).
>>
>> So no_new_privs can change from 0->1, but should not do so
>> while execve is running.
>>
>> As long as the calling thread is in execve it won't do this,
>> and the only other place where it may be set for other threads
>> is in seccomp_sync_threads(), but that can easily be avoided, see below.
>
> Yeah, everything was fine until I had to go complicate things with
> TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> later gaining a seccomp filter from an unpriv process. The no_new_privs
> flag was used to control this, but it required that the filter not get
> applied during exec.
>
>>> Related, it also means that cred_guard_mutex is unheld for every
>>> invocation of search_binary_handler() (which can loop via the previously
>>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
>>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
>>> currently.)
>>>
>>> For seccomp, the expectations about existing thread states risk races
>>> too. There are two locks held for TSYNC:
>>> - current->sighand->siglock is held to keep new threads from
>>> appearing/disappearing, which would destroy filter refcounting and
>>> lead to memory corruption.
>>
>> I don't understand what you mean here.
>> How can this lead to memory corruption?
>
> Mainly this is a matter of how seccomp manages its filter hierarchy
> (since the filters are shared through process ancestry), so if a thread
> appears in the middle of TSYNC it may be racing another TSYNC and break
> ancestry, leading to bad reference counting on process death, etc.
> (Though, yes, with refcount_t now, things should never corrupt, just
> waste memory.)
>
I assume for now that holding current->sighand->siglock while iterating over all
threads is sufficient here.
>>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
>>> avoid no_new_privs and filter confusion during exec, which could
>>> lead to exploitable setuid conditions (see below).
>>>
>>> Just racing a malicious thread during TSYNC is not a very strong
>>> example (a malicious thread could do lots of fun things to "current"
>>> before it ever got near calling TSYNC), but I think there is the risk
>>> of mismatched/confused states that we don't want to allow. One is a
>>> particularly bad state that could lead to privilege escalations (in the
>>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
>>> has a filter attached that silently fails a priv-dropping setuid call
>>> and continues execution with elevated privs, it can be tricked into
>>> doing bad things on behalf of the unprivileged parent, which was the
>>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
>>>
>>> thread A clones thread B
>>> thread B starts setuid exec
>>> thread A sets no_new_privs
>>> thread A calls seccomp with TSYNC
>>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
>>> thread B passes check_unsafe_exec() with no_new_privs unset
>>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
>>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
>>> thread B finishes exec, now running with elevated privs, a filter chosen
>>> by thread A, _and_ nnp set (which doesn't matter)
>>>
>>> With the original locking, thread B will fail check_unsafe_exec()
>>> because filter and nnp state are changed together, with "atomicity"
>>> protected by the cred_guard_mutex.
>>>
>>
>> Ah, good point, thanks!
>>
>> This can be fixed by checking current->signal->cred_locked_for_ptrace
>> while the cred_guard_mutex is locked, like this for instance:
>>
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index b6ea3dc..377abf0 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
>> BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
>> assert_spin_locked(&current->sighand->siglock);
>>
>> + if (current->signal->cred_locked_for_ptrace)
>> + return -EAGAIN;
>> +
>
> Hmm. I guess something like that could work. TSYNC expects to be able to
> report _which_ thread wrecked the call, though... I wonder if in_execve
> could be used to figure out the offending thread. Hm, nope, that would
> be outside of lock too (and all users are "current" right now, so the
> lock wasn't needed before).
>
I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
but the caller will die quickly and cannot do anything with that information
when another thread executes execve, right?
>> /* Validate all threads being eligible for synchronization. */
>> caller = current;
>> for_each_thread(caller, thread) {
>>
>>
>>> And this is just the bad state I _can_ see. I'm worried there are more...
>>>
>>> All this said, I do see a small similarity here to the work I did to
>>> stabilize stack rlimits (there was an ongoing problem with making multiple
>>> decisions for the bprm based on current's state -- but current's state
>>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
>>> current's copy until exec ended and then stored bprm's copy into current.
>>> If the only problem anyone can see here is the handling of no_new_privs,
>>> we might be able to solve that similarly, at least disentangling tsync/nnp
>>> from cred_guard_mutex.
>>>
>>
>> I still think that is solvable by using cred_locked_for_ptrace and
>> simply making the tsync fail if it would otherwise be blocked.
>
> I wonder if we can find a better name than "cred_locked_for_ptrace"?
> Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
>
Yeah, I'd go with "cred_locked_in_execve".
> And the comment on bool cred_locked_for_ptrace should mention that
> access is only allowed under cred_guard_mutex lock.
>
okay.
>>>> + sig->cred_locked_for_ptrace = false;
>
> This is redundant to the zalloc -- I think you can drop it (unless
> someone wants to keep it for clarity?)
>
I'll remove that here and in init/init_task.c
> Also, I think cred_locked_for_ptrace needs checking deeper, in
> __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> be sufficient to see a stable version of the thread...
>
No, these need to be addressed individually, but most users just want
to know if the current credentials are sufficient at this moment, but will
not change the credentials, as ptrace and TSYNC do.
BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
So adding an access to cred_locked_in_execve in ptrace_may_access is
probably not an option.
However, one nice added value by this change is this:
#include <pthread.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <unistd.h>

void *thread(void *arg)
{
	ptrace(PTRACE_TRACEME, 0,0,0);
	return NULL;
}
int main(void)
{
	int pid = fork();
	if (!pid) {
		pthread_t pt;
		pthread_create(&pt, NULL, thread, NULL);
		pthread_join(pt, NULL);
		execlp("echo", "echo", "passed", NULL);
	}
	sleep(1000);
	ptrace(PTRACE_ATTACH, pid, 0,0);
	kill(pid, SIGCONT);
	return 0;
}
cat /proc/3812/stack
[<0>] flush_old_exec+0xbf/0x760
[<0>] load_elf_binary+0x35a/0x16c0
[<0>] search_binary_handler+0x97/0x1d0
[<0>] __do_execve_file.isra.40+0x624/0x920
[<0>] __x64_sys_execve+0x49/0x60
[<0>] do_syscall_64+0x64/0x220
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> (I remain very nervous about weakening cred_guard_mutex without
> addressing the many many users...)
>
They need to be looked at closely, that's pretty clear.
Most fall into the class where just the current credentials need
to stay stable for a certain time.
Bernd.
On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> On 3/3/20 6:29 AM, Kees Cook wrote:
> > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> >> On 3/3/20 3:26 AM, Kees Cook wrote:
> >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> >>>> [...]
> >>>
> >>> If I'm reading this patch correctly, this changes the lifetime of the
> >>> cred_guard_mutex lock to be:
> >>> - during prepare_bprm_creds()
> >>> - from flush_old_exec() through install_exec_creds()
> >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> >>> install_exec_creds().
> >
> > BTW, I think the effect of this change (i.e. my paragraph above) should
> > be distinctly called out in the commit log if this solution moves
> > forward.
> >
>
> Okay, will do.
>
> >>> That means, for example, that check_unsafe_exec()'s documented invariant
> >>> is violated:
> >>> /*
> >>> * determine how safe it is to execute the proposed program
> >>> * - the caller must hold ->cred_guard_mutex to protect against
> >>> * PTRACE_ATTACH or seccomp thread-sync
> >>> */
> >>
> >> Oh, right, I haven't understood that hint...
> >
> > I know no_new_privs is checked there, but I haven't studied the
> > PTRACE_ATTACH part of that comment. If that is handled with the new
> > check, this comment should be updated.
> >
>
> Okay, I'll change that comment to:
>
> /*
> * determine how safe it is to execute the proposed program
> * - the caller must have set ->cred_locked_in_execve to protect against
> * PTRACE_ATTACH or seccomp thread-sync
> */
>
> >>> I think it also means that the potentially multiple invocations
> >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> >>> a lock (another place where current's no_new_privs is evaluated).
> >>
> >> So no_new_privs can change from 0->1, but should not do so
> >> while execve is running.
> >>
> >> As long as the calling thread is in execve it won't do this,
> >> and the only other place where it may be set for other threads
> >> is in seccomp_sync_threads(), but that can easily be avoided, see below.
> >
> > Yeah, everything was fine until I had to go complicate things with
> > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > flag was used to control this, but it required that the filter not get
> > applied during exec.
> >
> >>> Related, it also means that cred_guard_mutex is unheld for every
> >>> invocation of search_binary_handler() (which can loop via the previously
> >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> >>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> >>> currently.)
> >>>
> >>> For seccomp, the expectations about existing thread states risk races
> >>> too. There are two locks held for TSYNC:
> >>> - current->sighand->siglock is held to keep new threads from
> >>> appearing/disappearing, which would destroy filter refcounting and
> >>> lead to memory corruption.
> >>
> >> I don't understand what you mean here.
> >> How can this lead to memory corruption?
> >
> > Mainly this is a matter of how seccomp manages its filter hierarchy
> > (since the filters are shared through process ancestry), so if a thread
> > appears in the middle of TSYNC it may be racing another TSYNC and break
> > ancestry, leading to bad reference counting on process death, etc.
> > (Though, yes, with refcount_t now, things should never corrupt, just
> > waste memory.)
> >
>
> I assume for now that holding current->sighand->siglock while iterating over all
> threads is sufficient here.
>
> >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> >>> avoid no_new_privs and filter confusion during exec, which could
> >>> lead to exploitable setuid conditions (see below).
> >>>
> >>> Just racing a malicious thread during TSYNC is not a very strong
> >>> example (a malicious thread could do lots of fun things to "current"
> >>> before it ever got near calling TSYNC), but I think there is the risk
> >>> of mismatched/confused states that we don't want to allow. One is a
> >>> particularly bad state that could lead to privilege escalations (in the
> >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> >>> has a filter attached that silently fails a priv-dropping setuid call
> >>> and continues execution with elevated privs, it can be tricked into
> >>> doing bad things on behalf of the unprivileged parent, which was the
> >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> >>>
> >>> thread A clones thread B
> >>> thread B starts setuid exec
> >>> thread A sets no_new_privs
> >>> thread A calls seccomp with TSYNC
> >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> >>> thread B passes check_unsafe_exec() with no_new_privs unset
> >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> >>> thread B finishes exec, now running with elevated privs, a filter chosen
> >>> by thread A, _and_ nnp set (which doesn't matter)
> >>>
> >>> With the original locking, thread B will fail check_unsafe_exec()
> >>> because filter and nnp state are changed together, with "atomicity"
> >>> protected by the cred_guard_mutex.
> >>>
> >>
> >> Ah, good point, thanks!
> >>
> >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> >> while the cred_guard_mutex is locked, like this for instance:
> >>
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index b6ea3dc..377abf0 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> >> BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> >> assert_spin_locked(&current->sighand->siglock);
> >>
> >> + if (current->signal->cred_locked_for_ptrace)
> >> + return -EAGAIN;
> >> +
> >
> > Hmm. I guess something like that could work. TSYNC expects to be able to
> > report _which_ thread wrecked the call, though... I wonder if in_execve
> > could be used to figure out the offending thread. Hm, nope, that would
> > be outside of lock too (and all users are "current" right now, so the
> > lock wasn't needed before).
> >
>
> I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> but the caller will die quickly and cannot do anything with that information
> when another thread executes execve, right?
>
> >> /* Validate all threads being eligible for synchronization. */
> >> caller = current;
> >> for_each_thread(caller, thread) {
> >>
> >>
> >>> And this is just the bad state I _can_ see. I'm worried there are more...
> >>>
> >>> All this said, I do see a small similarity here to the work I did to
> >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> >>> decisions for the bprm based on current's state -- but current's state
> >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> >>> current's copy until exec ended and then stored bprm's copy into current.
> >>> If the only problem anyone can see here is the handling of no_new_privs,
> >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> >>> from cred_guard_mutex.
> >>>
> >>
> >> I still think that is solvable by using cred_locked_for_ptrace and
> >> simply making the tsync fail if it would otherwise be blocked.
> >
> > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> >
>
> Yeah, I'd go with "cred_locked_in_execve".
>
> > And the comment on bool cred_locked_for_ptrace should mention that
> > access is only allowed under cred_guard_mutex lock.
> >
>
> okay.
>
> >>>> + sig->cred_locked_for_ptrace = false;
> >
> > This is redundant to the zalloc -- I think you can drop it (unless
> > someone wants to keep it for clarity?)
> >
>
> I'll remove that here and in init/init_task.c
>
> > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > be sufficient to see a stable version of the thread...
> >
>
> No, these need to be addressed individually, but most users just want
> to know if the current credentials are sufficient at this moment, but will
> not change the credentials, as ptrace and TSYNC do.
>
> BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> So adding an access to cred_locked_in_execve in ptrace_may_access is
> probably not an option.
>
> However, one nice added value by this change is this:
>
> void *thread(void *arg)
> {
> ptrace(PTRACE_TRACEME, 0,0,0);
> return NULL;
> }
>
> int main(void)
> {
> int pid = fork();
>
> if (!pid) {
> pthread_t pt;
> pthread_create(&pt, NULL, thread, NULL);
> pthread_join(pt, NULL);
> execlp("echo", "echo", "passed", NULL);
> }
>
> sleep(1000);
> ptrace(PTRACE_ATTACH, pid, 0,0);
> kill(pid, SIGCONT);
> return 0;
> }
>
> cat /proc/3812/stack
> [<0>] flush_old_exec+0xbf/0x760
> [<0>] load_elf_binary+0x35a/0x16c0
> [<0>] search_binary_handler+0x97/0x1d0
> [<0>] __do_execve_file.isra.40+0x624/0x920
> [<0>] __x64_sys_execve+0x49/0x60
> [<0>] do_syscall_64+0x64/0x220
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
>
> > (I remain very nervous about weakening cred_guard_mutex without
> > addressing the many many users...)
> >
>
> They need to be looked at closely, that's pretty clear.
> Most fall into the class where just the current credentials need
> to stay stable for a certain time.
I remain rather set on wanting some very basic tests with this change.
Imho, looking through tools/testing/selftests again we don't have nearly
enough for these codepaths; not to say none. Basically, if someone wants
to make a change affecting the current problem we should really have at
least a single simple test/reproducer that can be run without digging
through lore. And hopefully over time we'll have more tests.
Christian
On Tue, Mar 03, 2020 at 09:34:26AM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> > On 3/3/20 6:29 AM, Kees Cook wrote:
> > > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> > >> On 3/3/20 3:26 AM, Kees Cook wrote:
> > >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > >>>> [...]
> > >>>
> > >>> If I'm reading this patch correctly, this changes the lifetime of the
> > >>> cred_guard_mutex lock to be:
> > >>> - during prepare_bprm_creds()
> > >>> - from flush_old_exec() through install_exec_creds()
> > >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> > >>> install_exec_creds().
> > >
> > > BTW, I think the effect of this change (i.e. my paragraph above) should
> > > be distinctly called out in the commit log if this solution moves
> > > forward.
> > >
> >
> > Okay, will do.
> >
> > >>> That means, for example, that check_unsafe_exec()'s documented invariant
> > >>> is violated:
> > >>> /*
> > >>> * determine how safe it is to execute the proposed program
> > >>> * - the caller must hold ->cred_guard_mutex to protect against
> > >>> * PTRACE_ATTACH or seccomp thread-sync
> > >>> */
> > >>
> > >> Oh, right, I haven't understood that hint...
> > >
> > > I know no_new_privs is checked there, but I haven't studied the
> > > PTRACE_ATTACH part of that comment. If that is handled with the new
> > > check, this comment should be updated.
> > >
> >
> > Okay, I'll change that comment to:
> >
> > /*
> > * determine how safe it is to execute the proposed program
> > * - the caller must have set ->cred_locked_in_execve to protect against
> > * PTRACE_ATTACH or seccomp thread-sync
> > */
> >
> > >>> I think it also means that the potentially multiple invocations
> > >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> > >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> > >>> a lock (another place where current's no_new_privs is evaluated).
> > >>
> > >> So no_new_privs can change from 0->1, but should not
> > >> when execve is running.
> > >>
> > >> As long as the calling thread is in execve it won't do this,
> > >> and the only other place, where it may set for other threads
> > >> is in seccomp_sync_threads, but that can easily be avoided see below.
> > >
> > > Yeah, everything was fine until I had to go complicate things with
> > > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > > flag was used to control this, but it required that the filter not get
> > > applied during exec.
> > >
> > >>> Related, it also means that cred_guard_mutex is unheld for every
> > >>> invocation of search_binary_handler() (which can loop via the previously
> > >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> > > >>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> > >>> currently.)
> > >>>
> > >>> For seccomp, the expectations about existing thread states risks races
> > >>> too. There are two locks held for TSYNC:
> > >>> - current->sighand->siglock is held to keep new threads from
> > >>> appearing/disappearing, which would destroy filter refcounting and
> > >>> lead to memory corruption.
> > >>
> > >> I don't understand what you mean here.
> > >> How can this lead to memory corruption?
> > >
> > > Mainly this is a matter of how seccomp manages its filter hierarchy
> > > (since the filters are shared through process ancestry), so if a thread
> > > appears in the middle of TSYNC it may be racing another TSYNC and break
> > > ancestry, leading to bad reference counting on process death, etc.
> > > (Though, yes, with refcount_t now, things should never corrupt, just
> > > waste memory.)
> > >
> >
> > I assume for now, that the current->sighand->siglock held while iterating all
> > threads is sufficient here.
> >
> > >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> > >>> avoid no_new_privs and filter confusion during exec, which could
> > >>> lead to exploitable setuid conditions (see below).
> > >>>
> > >>> Just racing a malicious thread during TSYNC is not a very strong
> > >>> example (a malicious thread could do lots of fun things to "current"
> > >>> before it ever got near calling TSYNC), but I think there is the risk
> > >>> of mismatched/confused states that we don't want to allow. One is a
> > >>> particularly bad state that could lead to privilege escalations (in the
> > >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> > >>> has a filter attached that silently fails a priv-dropping setuid call
> > >>> and continues execution with elevated privs, it can be tricked into
> > >>> doing bad things on behalf of the unprivileged parent, which was the
> > >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> > >>>
> > >>> thread A clones thread B
> > >>> thread B starts setuid exec
> > >>> thread A sets no_new_privs
> > >>> thread A calls seccomp with TSYNC
> > >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> > >>> thread B passes check_unsafe_exec() with no_new_privs unset
> > >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> > >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> > >>> thread B finishes exec, now running with elevated privs, a filter chosen
> > >>> by thread A, _and_ nnp set (which doesn't matter)
> > >>>
> > >>> With the original locking, thread B will fail check_unsafe_exec()
> > >>> because filter and nnp state are changed together, with "atomicity"
> > >>> protected by the cred_guard_mutex.
> > >>>
> > >>
> > >> Ah, good point, thanks!
> > >>
> > >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> > >> while the cred_guard_mutex is locked, like this for instance:
> > >>
> > >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > >> index b6ea3dc..377abf0 100644
> > >> --- a/kernel/seccomp.c
> > >> +++ b/kernel/seccomp.c
> > >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> > > >> BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> > > >> assert_spin_locked(&current->sighand->siglock);
> > >>
> > >> + if (current->signal->cred_locked_for_ptrace)
> > >> + return -EAGAIN;
> > >> +
> > >
> > > Hmm. I guess something like that could work. TSYNC expects to be able to
> > > report _which_ thread wrecked the call, though... I wonder if in_execve
> > > could be used to figure out the offending thread. Hm, nope, that would
> > > be outside of lock too (and all users are "current" right now, so the
> > > lock wasn't needed before).
> > >
> >
> > I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> > but the caller will die quickly and cannot do anything with that information
> > when another thread executes execve, right?
> >
> > >> /* Validate all threads being eligible for synchronization. */
> > >> caller = current;
> > >> for_each_thread(caller, thread) {
> > >>
> > >>
> > >>> And this is just the bad state I _can_ see. I'm worried there are more...
> > >>>
> > >>> All this said, I do see a small similarity here to the work I did to
> > >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> > >>> decisions for the bprm based on current's state -- but current's state
> > >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> > >>> current's copy until exec ended and then stored bprm's copy into current.
> > >>> If the only problem anyone can see here is the handling of no_new_privs,
> > >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> > >>> from cred_guard_mutex.
> > >>>
> > >>
> > >> I still think that is solvable with using cred_locked_for_ptrace and
> > >> simply make the tsync fail if it would otherwise be blocked.
> > >
> > > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> > >
> >
> > Yeah, I'd go with "cred_locked_in_execve".
> >
> > > And the comment on bool cred_locked_for_ptrace should mention that
> > > access is only allowed under cred_guard_mutex lock.
> > >
> >
> > okay.
> >
> > >>>> + sig->cred_locked_for_ptrace = false;
> > >
> > > This is redundant to the zalloc -- I think you can drop it (unless
> > > someone wants to keep it for clarity?)
> > >
> >
> > I'll remove that here and in init/init_task.c
> >
> > > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > > be sufficient to see a stable version of the thread...
> > >
> >
> > No, these need to be addressed individually, but most users just want
> > to know if the current credentials are sufficient at this moment, but will
> > not change the credentials, as ptrace and TSYNC do.
> >
> > BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> > mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> > So adding an access to cred_locked_in_execve in ptrace_may_access is
> > probably not an option.
> >
> > However, one nice added value by this change is this:
> >
> > void *thread(void *arg)
> > {
> > ptrace(PTRACE_TRACEME, 0,0,0);
> > return NULL;
> > }
> >
> > int main(void)
> > {
> > int pid = fork();
> >
> > if (!pid) {
> > pthread_t pt;
> > pthread_create(&pt, NULL, thread, NULL);
> > pthread_join(pt, NULL);
> > execlp("echo", "echo", "passed", NULL);
> > }
> >
> > sleep(1000);
> > ptrace(PTRACE_ATTACH, pid, 0,0);
> > kill(pid, SIGCONT);
> > return 0;
> > }
> >
> > cat /proc/3812/stack
> > [<0>] flush_old_exec+0xbf/0x760
> > [<0>] load_elf_binary+0x35a/0x16c0
> > [<0>] search_binary_handler+0x97/0x1d0
> > [<0>] __do_execve_file.isra.40+0x624/0x920
> > [<0>] __x64_sys_execve+0x49/0x60
> > [<0>] do_syscall_64+0x64/0x220
> > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> >
> > > (I remain very nervous about weakening cred_guard_mutex without
> > > addressing the many many users...)
> > >
> >
> > They need to be looked at closely, that's pretty clear.
> > Most fall in the class, that just the current credentials need
> > to stay stable for a certain time.
>
> I remain rather set on wanting some very basic tests with this change.
> Imho, looking through tools/testing/selftests again we don't have nearly
> enough for these codepaths; not to say none. Basically, if someone wants
> to make a change affecting the current problem we should really have at
> least a single simple test/reproducer that can be run without digging
> through lore. And hopefully over time we'll have more tests.
Which you added in v4. Which is great! (I should've mentioned this in my
first mail.)
Christian
On Mon, Mar 02, 2020 at 06:26:47PM -0800, Kees Cook wrote:
> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> > This fixes a deadlock in the tracer when tracing a multi-threaded
> > application that calls execve while more than one thread are running.
> >
> > I observed that when running strace on the gcc test suite, it always
> > blocks after a while, when expect calls execve, because other threads
> > have to be terminated. They send ptrace events, but the strace is no
> > longer able to respond, since it is blocked in vm_access.
> >
> > The deadlock is always happening when strace needs to access the
> > tracees process mmap, while another thread in the tracee starts to
> > execve a child process, but that cannot continue until the
> > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> >
> > strace D 0 30614 30584 0x00000000
> > Call Trace:
> > __schedule+0x3ce/0x6e0
> > schedule+0x5c/0xd0
> > schedule_preempt_disabled+0x15/0x20
> > __mutex_lock.isra.13+0x1ec/0x520
> > __mutex_lock_killable_slowpath+0x13/0x20
> > mutex_lock_killable+0x28/0x30
> > mm_access+0x27/0xa0
> > process_vm_rw_core.isra.3+0xff/0x550
> > process_vm_rw+0xdd/0xf0
> > __x64_sys_process_vm_readv+0x31/0x40
> > do_syscall_64+0x64/0x220
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > expect D 0 31933 30876 0x80004003
> > Call Trace:
> > __schedule+0x3ce/0x6e0
> > schedule+0x5c/0xd0
> > flush_old_exec+0xc4/0x770
> > load_elf_binary+0x35a/0x16c0
> > search_binary_handler+0x97/0x1d0
> > __do_execve_file.isra.40+0x5d4/0x8a0
> > __x64_sys_execve+0x49/0x60
> > do_syscall_64+0x64/0x220
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > The proposed solution is to take the cred_guard_mutex only
> > in a critical section at the beginning, and at the end of the
> > execve function, and let PTRACE_ATTACH fail with EAGAIN while
> > execve is not complete, but other functions like vm_access are
> > allowed to complete normally.
>
> Sorry to be a bummer, but I don't think this will work. A few more things
> during the exec process depend on cred_guard_mutex being held.
>
> If I'm reading this patch correctly, this changes the lifetime of the
> cred_guard_mutex lock to be:
> - during prepare_bprm_creds()
> - from flush_old_exec() through install_exec_creds()
> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> install_exec_creds().
>
> That means, for example, that check_unsafe_exec()'s documented invariant
> is violated:
> /*
> * determine how safe it is to execute the proposed program
> * - the caller must hold ->cred_guard_mutex to protect against
> * PTRACE_ATTACH or seccomp thread-sync
> */
> static void check_unsafe_exec(struct linux_binprm *bprm) ...
> which is looking at no_new_privs as well as other details, and making
> decisions about the bprm state from the current state.
>
> I think it also means that the potentially multiple invocations
> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> a lock (another place where current's no_new_privs is evaluated).
>
> Related, it also means that cred_guard_mutex is unheld for every
> invocation of search_binary_handler() (which can loop via the previously
> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
> currently.)
So one issue I see with having to reacquire the cred_guard_mutex might
be that this would allow tasks holding the cred_guard_mutex to block a
killed exec'ing task from exiting, right?
On 3/3/20 9:58 AM, Christian Brauner wrote:
> On Mon, Mar 02, 2020 at 06:26:47PM -0800, Kees Cook wrote:
>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
>>> This fixes a deadlock in the tracer when tracing a multi-threaded
>>> application that calls execve while more than one thread are running.
>>>
>>> I observed that when running strace on the gcc test suite, it always
>>> blocks after a while, when expect calls execve, because other threads
>>> have to be terminated. They send ptrace events, but the strace is no
>>> longer able to respond, since it is blocked in vm_access.
>>>
>>> The deadlock is always happening when strace needs to access the
>>> tracees process mmap, while another thread in the tracee starts to
>>> execve a child process, but that cannot continue until the
>>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>>>
>>> strace D 0 30614 30584 0x00000000
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> schedule_preempt_disabled+0x15/0x20
>>> __mutex_lock.isra.13+0x1ec/0x520
>>> __mutex_lock_killable_slowpath+0x13/0x20
>>> mutex_lock_killable+0x28/0x30
>>> mm_access+0x27/0xa0
>>> process_vm_rw_core.isra.3+0xff/0x550
>>> process_vm_rw+0xdd/0xf0
>>> __x64_sys_process_vm_readv+0x31/0x40
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> expect D 0 31933 30876 0x80004003
>>> Call Trace:
>>> __schedule+0x3ce/0x6e0
>>> schedule+0x5c/0xd0
>>> flush_old_exec+0xc4/0x770
>>> load_elf_binary+0x35a/0x16c0
>>> search_binary_handler+0x97/0x1d0
>>> __do_execve_file.isra.40+0x5d4/0x8a0
>>> __x64_sys_execve+0x49/0x60
>>> do_syscall_64+0x64/0x220
>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> The proposed solution is to take the cred_guard_mutex only
>>> in a critical section at the beginning, and at the end of the
>>> execve function, and let PTRACE_ATTACH fail with EAGAIN while
>>> execve is not complete, but other functions like vm_access are
>>> allowed to complete normally.
>>
>> Sorry to be a bummer, but I don't think this will work. A few more things
>> during the exec process depend on cred_guard_mutex being held.
>>
>> If I'm reading this patch correctly, this changes the lifetime of the
>> cred_guard_mutex lock to be:
>> - during prepare_bprm_creds()
>> - from flush_old_exec() through install_exec_creds()
>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
>> install_exec_creds().
>>
>> That means, for example, that check_unsafe_exec()'s documented invariant
>> is violated:
>> /*
>> * determine how safe it is to execute the proposed program
>> * - the caller must hold ->cred_guard_mutex to protect against
>> * PTRACE_ATTACH or seccomp thread-sync
>> */
>> static void check_unsafe_exec(struct linux_binprm *bprm) ...
>> which is looking at no_new_privs as well as other details, and making
>> decisions about the bprm state from the current state.
>>
>> I think it also means that the potentially multiple invocations
>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
>> a lock (another place where current's no_new_privs is evaluated).
>>
>> Related, it also means that cred_guard_mutex is unheld for every
>> invocation of search_binary_handler() (which can loop via the previously
>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
>> dependencies on cred_guard_mutex. (Though I only see bprm_fill_uid()
>> currently.)
>
> So one issue I see with having to reacquire the cred_guard_mutex might
> be that this would allow tasks holding the cred_guard_mutex to block a
> killed exec'ing task from exiting, right?
>
Yes maybe, but I think it will not be worse than it is now.
Since the second time the mutex is acquired it is done with
mutex_lock_killable, so at least kill -9 should get it terminated.
Bernd.
On 3/3/20 11:34 AM, Bernd Edlinger wrote:
> On 3/3/20 9:58 AM, Christian Brauner wrote:
>> So one issue I see with having to reacquire the cred_guard_mutex might
>> be that this would allow tasks holding the cred_guard_mutex to block a
>> killed exec'ing task from exiting, right?
>>
>
> Yes maybe, but I think it will not be worse than it is now.
> Since the second time the mutex is acquired it is done with
> mutex_lock_killable, so at least kill -9 should get it terminated.
>
> static void free_bprm(struct linux_binprm *bprm)
> {
> free_arg_pages(bprm);
> if (bprm->cred) {
> + if (!bprm->called_flush_old_exec)
> + mutex_lock(&current->signal->cred_guard_mutex);
> + current->signal->cred_locked_for_ptrace = false;
> mutex_unlock(&current->signal->cred_guard_mutex);
Hmm, cough...
Actually, when mutex_lock_killable fails in flush_old_exec due to kill -9,
free_bprm locks the same mutex, this time unkillably. I had better use
mutex_lock_killable here as well, and if that fails, I can leave
cred_locked_for_ptrace set. That shouldn't matter, since a fatal signal is
pending anyway, right?
Bernd.
This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated. They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.
The deadlock always happens when strace needs to access the
tracee's process mmap while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to hold the cred_guard_mutex only in two
short critical sections, at the beginning and at the end of the
execve function, and to let PTRACE_ATTACH fail with EAGAIN while
execve is not complete. Other functions like mm_access are
allowed to complete normally.
This changes the lifetime of the cred_guard_mutex lock to be:
- during prepare_bprm_creds()
- from flush_old_exec() through install_exec_creds()
Before, cred_guard_mutex was held from prepare_bprm_creds() through
install_exec_creds().
I also took the opportunity to improve the documentation
of prepare_creds, which is obviously out of sync.
Signed-off-by: Bernd Edlinger <[email protected]>
---
Documentation/security/credentials.rst | 19 +++++----
fs/exec.c | 41 ++++++++++++++++---
include/linux/binfmts.h | 6 ++-
include/linux/sched/signal.h | 2 +
kernel/cred.c | 2 +-
kernel/ptrace.c | 4 ++
kernel/seccomp.c | 3 ++
mm/process_vm_access.c | 2 +-
tools/testing/selftests/ptrace/Makefile | 4 +-
tools/testing/selftests/ptrace/vmaccess.c | 66 +++++++++++++++++++++++++++++++
10 files changed, 130 insertions(+), 19 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case.
v5: addresses review comments.
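
For readers who want to reason about the locking change without reading
the diff first, the protocol can be modelled in userspace roughly as
follows. This is only a sketch under stated assumptions: struct sig_model
and the model_* functions are hypothetical stand-ins invented for
illustration, a pthread mutex stands in for cred_guard_mutex, and the
re-acquisition in flush_old_exec() plus the release in
install_exec_creds() are collapsed into a single model_exec_end().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant part of signal_struct. */
struct sig_model {
	pthread_mutex_t cred_guard_mutex;
	bool cred_locked_in_execve;	/* only valid under cred_guard_mutex */
};

void sig_model_init(struct sig_model *sig)
{
	assert(sig != NULL);
	pthread_mutex_init(&sig->cred_guard_mutex, NULL);
	sig->cred_locked_in_execve = false;
}

/* Models prepare_bprm_creds(): set the flag, then drop the mutex again. */
int model_exec_begin(struct sig_model *sig)
{
	int ret = -EAGAIN;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (!sig->cred_locked_in_execve) {
		sig->cred_locked_in_execve = true;
		ret = 0;
	}
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}

/* Models the end of execve: clear the flag under the mutex. */
void model_exec_end(struct sig_model *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
	sig->cred_locked_in_execve = false;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
}

/* Models ptrace_attach(): refuse while an exec is in flight. */
int model_ptrace_attach(struct sig_model *sig)
{
	int ret = 0;

	pthread_mutex_lock(&sig->cred_guard_mutex);
	if (sig->cred_locked_in_execve)
		ret = -EAGAIN;
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return ret;
}

/* Models mm_access(): takes the mutex briefly and ignores the flag. */
int model_mm_access(struct sig_model *sig)
{
	pthread_mutex_lock(&sig->cred_guard_mutex);
	pthread_mutex_unlock(&sig->cred_guard_mutex);
	return 0;
}
```

The point of the model is that the mutex is never held across the long
wait in the middle of execve, so model_mm_access() cannot block on it,
while model_ptrace_attach() sees the flag and backs off with -EAGAIN.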
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..0988798 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,9 +437,14 @@ new set of credentials by calling::
struct cred *prepare_creds(void);
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->cred_guard_mutex
+is acquired before this function gets called, and released after setting
+current->signal->cred_locked_in_execve. The same mutex is acquired later,
+while the credentials and the process mmap are actually changed, and
+current->signal->cred_locked_in_execve is reset again.
The mutex prevents ``ptrace()`` from altering the ptrace state of a process
while security checks on credentials construction and changing is taking place
@@ -466,9 +471,8 @@ by calling::
This will alter various aspects of the credentials and the process, giving the
LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
This function is guaranteed to return 0, so that it can be tail-called at the
end of such functions as ``sys_setresuid()``.
@@ -486,8 +490,7 @@ invoked::
void abort_creds(struct cred *new);
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..5fc744e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1266,6 +1266,12 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval)
goto out;
+ retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+ if (retval)
+ goto out;
+
+ bprm->called_flush_old_exec = 1;
+
/*
* Must be called _before_ exec_mmap() as bprm->mm is
* not visibile until then. This also enables the update
@@ -1398,29 +1404,51 @@ void finalize_exec(struct linux_binprm *bprm)
EXPORT_SYMBOL(finalize_exec);
/*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_in_execve.
* install_exec_creds() commits the new creds and drops the lock.
* Or, if exec fails before, free_bprm() should release ->cred and
* and unlock.
*/
static int prepare_bprm_creds(struct linux_binprm *bprm)
{
+ int ret;
+
if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
+ ret = -EAGAIN;
+ if (unlikely(current->signal->cred_locked_in_execve))
+ goto out;
+
+ ret = -ENOMEM;
bprm->cred = prepare_exec_creds();
- if (likely(bprm->cred))
- return 0;
+ if (likely(bprm->cred)) {
+ current->signal->cred_locked_in_execve = true;
+ ret = 0;
+ }
+out:
mutex_unlock(&current->signal->cred_guard_mutex);
- return -ENOMEM;
+ return ret;
}
static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
- mutex_unlock(&current->signal->cred_guard_mutex);
+ /*
+ * If flush_old_exec did not acquire the cred_guard_mutex,
+ * try again here, but if that fails, just leave
+ * cred_locked_in_execve alone, since this means there
+ * must be a fatal signal pending.
+ * We don't want to prevent this task to be killed, just
+ * because it is stuck in the middle of execve.
+ */
+ if (bprm->called_flush_old_exec ||
+ !mutex_lock_killable(&current->signal->cred_guard_mutex)) {
+ current->signal->cred_locked_in_execve = false;
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ }
abort_creds(bprm->cred);
}
if (bprm->file) {
@@ -1469,13 +1497,14 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
+ current->signal->cred_locked_in_execve = false;
mutex_unlock(&current->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
/*
* determine how safe it is to execute the proposed program
- * - the caller must hold ->cred_guard_mutex to protect against
+ * - the caller must have set ->cred_locked_in_execve to protect against
* PTRACE_ATTACH or seccomp thread-sync
*/
static void check_unsafe_exec(struct linux_binprm *bprm)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..2930253 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,11 @@ struct linux_binprm {
* exec has happened. Used to sanitize execution environment
* and to set AT_SECURE auxv for glibc.
*/
- secureexec:1;
+ secureexec:1,
+ /*
+ * Set by flush_old_exec, when the cred_guard_mutex is taken.
+ */
+ called_flush_old_exec:1;
#ifdef __alpha__
unsigned int taso:1;
#endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..8f8e358 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -225,6 +225,8 @@ struct signal_struct {
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace) */
+ bool cred_locked_in_execve; /* set while in execve, only valid when
+ * cred_guard_mutex is held */
} __randomize_layout;
/*
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..e4c78de 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -676,7 +676,7 @@ void __init cred_init(void)
*
* Returns the new credentials or NULL if out of memory.
*
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
*/
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..0f82bab 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -395,6 +395,10 @@ static int ptrace_attach(struct task_struct *task, long request,
if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
goto out;
+ retval = -EAGAIN;
+ if (task->signal->cred_locked_in_execve)
+ goto unlock_creds;
+
task_lock(task);
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
task_unlock(task);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..3efa3e5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
assert_spin_locked(&current->sighand->siglock);
+ if (current->signal->cred_locked_in_execve)
+ return -EAGAIN;
+
/* Validate all threads being eligible for synchronization. */
caller = current;
for_each_thread(caller, thread) {
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
if (!mm || IS_ERR(mm)) {
rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
/*
- * Explicitly map EACCES to EPERM as EPERM is a more a
+ * Explicitly map EACCES to EPERM as EPERM is a more
* appropriate error code for process_vw_readv/writev
*/
if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..6d8a048
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <[email protected]>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+ ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+ return NULL;
+}
+
+TEST(vmaccess)
+{
+ int f, pid = fork();
+ char mm[64];
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("true", "true", NULL);
+ }
+
+ sleep(1);
+ sprintf(mm, "/proc/%d/mem", pid);
+ f = open(mm, O_RDONLY);
+ ASSERT_LE(0, f);
+ close(f);
+ f = kill(pid, SIGCONT);
+ ASSERT_EQ(0, f);
+}
+
+TEST(attach)
+{
+ int f, pid = fork();
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("true", "true", NULL);
+ }
+
+ sleep(1);
+ f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ ASSERT_EQ(EAGAIN, errno);
+ ASSERT_EQ(f, -1);
+ f = kill(pid, SIGCONT);
+ ASSERT_EQ(0, f);
+}
+
+TEST_HARNESS_MAIN
--
1.9.1
On Tue, Mar 03, 2020 at 11:23:31AM +0000, Bernd Edlinger wrote:
> On 3/3/20 11:34 AM, Bernd Edlinger wrote:
> > On 3/3/20 9:58 AM, Christian Brauner wrote:
> >> So one issue I see with having to reacquire the cred_guard_mutex might
> >> be that this would allow tasks holding the cred_guard_mutex to block a
> >> killed exec'ing task from exiting, right?
> >>
> >
> > Yes maybe, but I think it will not be worse than it is now.
> > Since the second time the mutex is acquired it is done with
> > mutex_lock_killable, so at least kill -9 should get it terminated.
> >
>
>
>
> > static void free_bprm(struct linux_binprm *bprm)
> > {
> > free_arg_pages(bprm);
> > if (bprm->cred) {
> > + if (!bprm->called_flush_old_exec)
> > + mutex_lock(&current->signal->cred_guard_mutex);
> > + current->signal->cred_locked_for_ptrace = false;
> > mutex_unlock(&current->signal->cred_guard_mutex);
>
>
> Hmm, cough...
> Actually, when mutex_lock_killable fails in flush_old_exec due to kill -9,
> free_bprm locks the same mutex, this time unkillably. I had better use
> mutex_lock_killable here as well, and if that fails, I can leave
> cred_locked_for_ptrace set. That shouldn't matter, since a fatal signal is
> pending anyway, right?
I think so, yes.
Bernd Edlinger <[email protected]> writes:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
>
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated. They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
>
> The deadlock is always happening when strace needs to access the
> tracees process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
A couple of things.
Why do we think it is safe to change the behavior exposed to userspace?
Not the deadlock but all of the times the current code would not
deadlock?
Especially given that this is a small window it might be hard for people
to track down and report so we need a strong argument that this won't
break existing userspace before we just change things.
Usually surveying all of the users of a system call that we can find
and checking to see if they might be affected by the change in behavior
is difficult enough that we usually opt for not being lazy and
preserving the behavior.
This patch is up to two changes in behavior now, that could potentially
affect a whole array of programs. Adding linux-api so that this change
in behavior can be documented if/when this change goes through.
If you can split the documentation and test fixes out into separate
patches that would help reviewing this code, or please make it explicit
that you are changing documentation about behavior that is changing
with this patch.
Eric
> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> new file mode 100644
> index 0000000..6d8a048
> --- /dev/null
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -0,0 +1,66 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (c) 2020 Bernd Edlinger <[email protected]>
> + * All rights reserved.
> + *
> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
> + * when de_thread is blocked with ->cred_guard_mutex held.
> + */
> +
> +#include "../kselftest_harness.h"
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <unistd.h>
> +#include <sys/ptrace.h>
> +
> +static void *thread(void *arg)
> +{
> + ptrace(PTRACE_TRACEME, 0, 0L, 0L);
> + return NULL;
> +}
> +
> +TEST(vmaccess)
> +{
> + int f, pid = fork();
> + char mm[64];
> +
> + if (!pid) {
> + pthread_t pt;
> +
> + pthread_create(&pt, NULL, thread, NULL);
> + pthread_join(pt, NULL);
> + execlp("true", "true", NULL);
> + }
> +
> + sleep(1);
> + sprintf(mm, "/proc/%d/mem", pid);
> + f = open(mm, O_RDONLY);
> + ASSERT_LE(0, f);
> + close(f);
> + f = kill(pid, SIGCONT);
> + ASSERT_EQ(0, f);
> +}
> +
> +TEST(attach)
> +{
> + int f, pid = fork();
> +
> + if (!pid) {
> + pthread_t pt;
> +
> + pthread_create(&pt, NULL, thread, NULL);
> + pthread_join(pt, NULL);
> + execlp("true", "true", NULL);
> + }
> +
> + sleep(1);
> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
To be meaningful this code needs to learn to loop when
ptrace returns -EAGAIN.
Because that is pretty much what any self respecting user space
process will do.
At which point I am not certain we can say that the behavior has
sufficiently improved not to be a deadlock.
> + ASSERT_EQ(EAGAIN, errno);
> + ASSERT_EQ(f, -1);
> + f = kill(pid, SIGCONT);
> + ASSERT_EQ(0, f);
> +}
> +
> +TEST_HARNESS_MAIN
Eric
On 3/3/20 4:18 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated. They send ptrace events, but the strace is no
>> longer able to respond, since it is blocked in vm_access.
>>
>> The deadlock is always happening when strace needs to access the
>> tracees process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>
> A couple of things.
>
> Why do we think it is safe to change the behavior exposed to userspace?
> Not the deadlock but all of the times the current code would not
> deadlock?
>
> Especially given that this is a small window it might be hard for people
> to track down and report so we need a strong argument that this won't
> break existing userspace before we just change things.
>
Hmm, I tend to agree.
> Usually surveying all of the users of a system call that we can find
> and checking to see if they might be affected by the change in behavior
> is difficult enough that we usually opt for not being lazy and
> preserving the behavior.
>
> This patch is up to two changes in behavior now, that could potentially
> affect a whole array of programs. Adding linux-api so that this change
> in behavior can be documented if/when this change goes through.
>
One is PTRACE_ATTACH possibly returning EAGAIN, yes.
We could try to restrict that behavior change to when any
thread is ptraced when execve starts, can't be too complicated.
But the other is only SYS_seccomp returning EAGAIN, when a different
thread of the current process is calling execve at the same time.
I would consider it completely impossible to have any user-visible effect,
since de_thread is just terminating all threads, including the thread
where the -EAGAIN was returned, so we will never know what happened.
> If you can split the documentation and test fixes out into separate
> patches that would help reviewing this code, or please make it explicit
> that the your are changing documentation about behavior that is changing
> with this patch.
>
I am not sure if I have touched the right user documentation.
I only saw a document referring to a non-existent "current->cred_replace_mutex".
I haven't dug through the git history, but that must be pre-historic IMHO.
It appears to me that this is some developer documentation, but it's nevertheless
worth keeping up to date when the code changes.
So where would I add the possibility for PTRACE_ATTACH to return -EAGAIN ?
Bernd.
> Eric
>
>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>> new file mode 100644
>> index 0000000..6d8a048
>> --- /dev/null
>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>> @@ -0,0 +1,66 @@
>> +// SPDX-License-Identifier: GPL-2.0+
>> +/*
>> + * Copyright (c) 2020 Bernd Edlinger <[email protected]>
>> + * All rights reserved.
>> + *
>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>> + * when de_thread is blocked with ->cred_guard_mutex held.
>> + */
>> +
>> +#include "../kselftest_harness.h"
>> +#include <stdio.h>
>> +#include <fcntl.h>
>> +#include <pthread.h>
>> +#include <signal.h>
>> +#include <unistd.h>
>> +#include <sys/ptrace.h>
>> +
>> +static void *thread(void *arg)
>> +{
>> + ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>> + return NULL;
>> +}
>> +
>> +TEST(vmaccess)
>> +{
>> + int f, pid = fork();
>> + char mm[64];
>> +
>> + if (!pid) {
>> + pthread_t pt;
>> +
>> + pthread_create(&pt, NULL, thread, NULL);
>> + pthread_join(pt, NULL);
>> + execlp("true", "true", NULL);
>> + }
>> +
>> + sleep(1);
>> + sprintf(mm, "/proc/%d/mem", pid);
>> + f = open(mm, O_RDONLY);
>> + ASSERT_LE(0, f);
>> + close(f);
>> + f = kill(pid, SIGCONT);
>> + ASSERT_EQ(0, f);
>> +}
>> +
>> +TEST(attach)
>> +{
>> + int f, pid = fork();
>> +
>> + if (!pid) {
>> + pthread_t pt;
>> +
>> + pthread_create(&pt, NULL, thread, NULL);
>> + pthread_join(pt, NULL);
>> + execlp("true", "true", NULL);
>> + }
>> +
>> + sleep(1);
>> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>
> To be meaningful this code needs to learn to loop when
> ptrace returns -EAGAIN.
>
> Because that is pretty much what any self respecting user space
> process will do.
>
> At which point I am not certain we can say that the behavior has
> sufficiently improved not to be a deadlock.
>
In this special dead-duck test it won't work, but it would
still be a lot more transparent what is going on, since previously
you had two zombie processes and no way to even output debug
messages, which all self-respecting user space processes
should also do.
So yes, I can at least give a good example and re-try it several
times together with wait4 which a tracer is expected to do.
Bernd.
>> + ASSERT_EQ(EAGAIN, errno);
>> + ASSERT_EQ(f, -1);
>> + f = kill(pid, SIGCONT);
>> + ASSERT_EQ(0, f);
>> +}
>> +
>> +TEST_HARNESS_MAIN
>
> Eric
>
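The retry loop discussed above (loop on -EAGAIN from PTRACE_ATTACH and
collect wait events in between) could look roughly like the sketch below.
This is only an illustration, not the selftest that was eventually posted:
attach_with_retry(), the retry count, and the 10 ms back-off are made-up
names and values, and it assumes the kernel reports the exec-in-progress
case as EAGAIN, as proposed in this thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to attach to pid, retrying while ptrace fails
 * with EAGAIN. Between attempts, reap pending wait events without
 * blocking, so that exiting tracees are not left waiting on the tracer.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int attach_with_retry(pid_t pid, int max_tries)
{
	int i;

	assert(max_tries > 0);

	for (i = 0; i < max_tries; i++) {
		if (ptrace(PTRACE_ATTACH, pid, 0L, 0L) == 0)
			return 0;
		if (errno != EAGAIN)
			return -1;	/* real error, errno from ptrace is kept */

		/* Drain whatever events are pending; never block here. */
		while (waitpid(-1, NULL, WNOHANG | __WALL) > 0)
			;

		usleep(10000);		/* arbitrary back-off */
	}

	errno = EAGAIN;			/* still busy after max_tries attempts */
	return -1;
}
```

A tracer would call attach_with_retry(pid, n) instead of a bare
ptrace(PTRACE_ATTACH, ...) and treat a final EAGAIN as "exec still in
progress", which at least gives it a chance to print a diagnostic.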
On Tue, Mar 03, 2020 at 09:18:44AM -0600, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
> > This fixes a deadlock in the tracer when tracing a multi-threaded
> > application that calls execve while more than one thread are running.
> >
> > I observed that when running strace on the gcc test suite, it always
> > blocks after a while, when expect calls execve, because other threads
> > have to be terminated. They send ptrace events, but the strace is no
> > longer able to respond, since it is blocked in vm_access.
> >
> > The deadlock is always happening when strace needs to access the
> > tracees process mmap, while another thread in the tracee starts to
> > execve a child process, but that cannot continue until the
> > PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>
> A couple of things.
>
> Why do we think it is safe to change the behavior exposed to userspace?
> Not the deadlock but all of the times the current code would not
> deadlock?
>
> Especially given that this is a small window it might be hard for people
> to track down and report so we need a strong argument that this won't
> break existing userspace before we just change things.
>
> Usually surveying all of the users of a system call that we can find
> and checking to see if they might be affected by the change in behavior
> is difficult enough that we usually opt for not being lazy and
> preserving the behavior.
>
> This patch is up to two changes in behavior now, that could potentially
> affect a whole array of programs. Adding linux-api so that this change
> in behavior can be documented if/when this change goes through.
>
> If you can split the documentation and test fixes out into separate
> patches that would help reviewing this code, or please make it explicit
> that the your are changing documentation about behavior that is changing
> with this patch.
Agreed. I think it'd be good to do it in three patches:
1. unrelated documentation update
2. fix + documentation changes specific to the fix
3. test(s)
Christian
On Tue, Mar 03, 2020 at 04:48:01PM +0000, Bernd Edlinger wrote:
> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
> > Bernd Edlinger <[email protected]> writes:
> >
> >> This fixes a deadlock in the tracer when tracing a multi-threaded
> >> application that calls execve while more than one thread are running.
> >>
> >> I observed that when running strace on the gcc test suite, it always
> >> blocks after a while, when expect calls execve, because other threads
> >> have to be terminated. They send ptrace events, but the strace is no
> >> longer able to respond, since it is blocked in vm_access.
> >>
> >> The deadlock is always happening when strace needs to access the
> >> tracees process mmap, while another thread in the tracee starts to
> >> execve a child process, but that cannot continue until the
> >> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> >
> > A couple of things.
> >
> > Why do we think it is safe to change the behavior exposed to userspace?
> > Not the deadlock but all of the times the current code would not
> > deadlock?
> >
> > Especially given that this is a small window it might be hard for people
> > to track down and report so we need a strong argument that this won't
> > break existing userspace before we just change things.
> >
>
> Hmm, I tend to agree.
>
> > Usually surveying all of the users of a system call that we can find
> > and checking to see if they might be affected by the change in behavior
> > is difficult enough that we usually opt for not being lazy and
> > preserving the behavior.
> >
> > This patch is up to two changes in behavior now, that could potentially
> > affect a whole array of programs. Adding linux-api so that this change
> > in behavior can be documented if/when this change goes through.
> >
>
> One is PTRACE_ACCESS possibly returning EAGAIN, yes.
>
> We could try to restrict that behavior change to when any
> thread is ptraced when execve starts, can't be too complicated.
>
>
> But the other is only SYS_seccomp returning EAGAIN, when a different
> thread of the current process is calling execve at the same time.
>
> I would consider it completely impossible to have any user-visual effect,
> since de_thread is just terminating all threads, including the thread
> where the -EAGAIN was returned, so we will never know what happened.
I think if we risk a user-space facing change we should try the simple
thing first before making the fix more convoluted? But it's a tough
call...
>
>
> > If you can split the documentation and test fixes out into separate
> > patches that would help reviewing this code, or please make it explicit
> > that the your are changing documentation about behavior that is changing
> > with this patch.
> >
>
> I am not sure if I have touched the right user documentation.
>
> I only saw a document referring to a non-existent "current->cred_replace_mutex"
> I haven't digged the git history, but that must be pre-historic IMHO.
> It appears to me that is some developer documentation, but it's nevertheless
> worth to keep up to date when the code changes.
>
> So where would I add the possibility for PTRACE_ATTACH to return -EAGAIN ?
Since that would be a potentially user-visible change it would make the
most sense to add it to man ptrace(2) if/when we land this change.
For developers, placing a comment in kernel/ptrace.c:ptrace_attach()
would make the most sense? We already have something about exec
protection in there.
Christian
On Tue, Mar 03, 2020 at 06:01:11PM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 04:48:01PM +0000, Bernd Edlinger wrote:
> > On 3/3/20 4:18 PM, Eric W. Biederman wrote:
> > > Bernd Edlinger <[email protected]> writes:
> > >
> > >> This fixes a deadlock in the tracer when tracing a multi-threaded
> > >> application that calls execve while more than one thread are running.
> > >>
> > >> I observed that when running strace on the gcc test suite, it always
> > >> blocks after a while, when expect calls execve, because other threads
> > >> have to be terminated. They send ptrace events, but the strace is no
> > >> longer able to respond, since it is blocked in vm_access.
> > >>
> > >> The deadlock is always happening when strace needs to access the
> > >> tracees process mmap, while another thread in the tracee starts to
> > >> execve a child process, but that cannot continue until the
> > >> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
> > >
> > > A couple of things.
> > >
> > > Why do we think it is safe to change the behavior exposed to userspace?
> > > Not the deadlock but all of the times the current code would not
> > > deadlock?
> > >
> > > Especially given that this is a small window it might be hard for people
> > > to track down and report so we need a strong argument that this won't
> > > break existing userspace before we just change things.
> > >
> >
> > Hmm, I tend to agree.
> >
> > > Usually surveying all of the users of a system call that we can find
> > > and checking to see if they might be affected by the change in behavior
> > > is difficult enough that we usually opt for not being lazy and
> > > preserving the behavior.
> > >
> > > This patch is up to two changes in behavior now, that could potentially
> > > affect a whole array of programs. Adding linux-api so that this change
> > > in behavior can be documented if/when this change goes through.
> > >
> >
> > One is PTRACE_ACCESS possibly returning EAGAIN, yes.
> >
> > We could try to restrict that behavior change to when any
> > thread is ptraced when execve starts, can't be too complicated.
> >
> >
> > But the other is only SYS_seccomp returning EAGAIN, when a different
> > thread of the current process is calling execve at the same time.
> >
> > I would consider it completely impossible to have any user-visual effect,
> > since de_thread is just terminating all threads, including the thread
> > where the -EAGAIN was returned, so we will never know what happened.
>
> I think if we risk a user-space facing change we should try the simple
> thing first before making the fix more convoluted? But it's a tough
> call...
Actually, to get a _rough_ estimate of the possible impact I would
recommend you run the criu test suite (and possibly the strace
test suite) on a kernel with and without your fix. That's what I tend to
do when I touch code I fear will have an impact on APIs that touch the
core kernel very deeply. Criu's test suite makes heavy use of ptrace and
usually runs into a bunch of interesting (exec) races too, and it has
tests for handling zombie processes, etc.
Should be relatively simple: create a VM, install the criu build dependencies, then:
git clone criu; cd criu; make; cd test; ./zdtm.py run -a --keep-going
If your system doesn't support SELinux properly, you need to disable it
when running the tests, and you also need to make sure that you're using
python3 or change the shebang in zdtm.py to python3.
Just a recommendation.
Christian
Bernd Edlinger <[email protected]> writes:
> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <[email protected]> writes:
>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>>> new file mode 100644
>>> index 0000000..6d8a048
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>>> @@ -0,0 +1,66 @@
>>> +// SPDX-License-Identifier: GPL-2.0+
>>> +/*
>>> + * Copyright (c) 2020 Bernd Edlinger <[email protected]>
>>> + * All rights reserved.
>>> + *
>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>>> + * when de_thread is blocked with ->cred_guard_mutex held.
>>> + */
>>> +
>>> +#include "../kselftest_harness.h"
>>> +#include <stdio.h>
>>> +#include <fcntl.h>
>>> +#include <pthread.h>
>>> +#include <signal.h>
>>> +#include <unistd.h>
>>> +#include <sys/ptrace.h>
>>> +
>>> +static void *thread(void *arg)
>>> +{
>>> + ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>>> + return NULL;
>>> +}
>>> +
>>> +TEST(vmaccess)
>>> +{
>>> + int f, pid = fork();
>>> + char mm[64];
>>> +
>>> + if (!pid) {
>>> + pthread_t pt;
>>> +
>>> + pthread_create(&pt, NULL, thread, NULL);
>>> + pthread_join(pt, NULL);
>>> + execlp("true", "true", NULL);
>>> + }
>>> +
>>> + sleep(1);
>>> + sprintf(mm, "/proc/%d/mem", pid);
>>> + f = open(mm, O_RDONLY);
>>> + ASSERT_LE(0, f);
>>> + close(f);
>>> + f = kill(pid, SIGCONT);
>>> + ASSERT_EQ(0, f);
>>> +}
>>> +
>>> +TEST(attach)
>>> +{
>>> + int f, pid = fork();
>>> +
>>> + if (!pid) {
>>> + pthread_t pt;
>>> +
>>> + pthread_create(&pt, NULL, thread, NULL);
>>> + pthread_join(pt, NULL);
>>> + execlp("true", "true", NULL);
>>> + }
>>> +
>>> + sleep(1);
>>> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>>
>> To be meaningful this code needs to learn to loop when
>> ptrace returns -EAGAIN.
>>
>> Because that is pretty much what any self respecting user space
>> process will do.
>>
>> At which point I am not certain we can say that the behavior has
>> sufficiently improved not to be a deadlock.
>>
>
> In this special dead-duck test it won't work, but it would
> still be lots more transparent what is going on, since previously
> you had two zombie process, and no way to even output debug
> messages, which also all self respecting user space processes
> should do.
Agreed it is more transparent. So if you are going to deadlock
it is better.
My previous proposal (which I admit is more work to implement) would
actually allow succeeding in this case and so it would not be subject to
a deadlock (even via -EAGAIN) at this point.
> So yes, I can at least give a good example and re-try it several
> times together with wait4 which a tracer is expected to do.
Thank you,
Eric
On 3/3/20 9:08 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <[email protected]> writes:
>>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>>>> new file mode 100644
>>>> index 0000000..6d8a048
>>>> --- /dev/null
>>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>>>> @@ -0,0 +1,66 @@
>>>> +// SPDX-License-Identifier: GPL-2.0+
>>>> +/*
>>>> + * Copyright (c) 2020 Bernd Edlinger <[email protected]>
>>>> + * All rights reserved.
>>>> + *
>>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>>>> + * when de_thread is blocked with ->cred_guard_mutex held.
>>>> + */
>>>> +
>>>> +#include "../kselftest_harness.h"
>>>> +#include <stdio.h>
>>>> +#include <fcntl.h>
>>>> +#include <pthread.h>
>>>> +#include <signal.h>
>>>> +#include <unistd.h>
>>>> +#include <sys/ptrace.h>
>>>> +
>>>> +static void *thread(void *arg)
>>>> +{
>>>> + ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>>>> + return NULL;
>>>> +}
>>>> +
>>>> +TEST(vmaccess)
>>>> +{
>>>> + int f, pid = fork();
>>>> + char mm[64];
>>>> +
>>>> + if (!pid) {
>>>> + pthread_t pt;
>>>> +
>>>> + pthread_create(&pt, NULL, thread, NULL);
>>>> + pthread_join(pt, NULL);
>>>> + execlp("true", "true", NULL);
>>>> + }
>>>> +
>>>> + sleep(1);
>>>> + sprintf(mm, "/proc/%d/mem", pid);
>>>> + f = open(mm, O_RDONLY);
>>>> + ASSERT_LE(0, f);
>>>> + close(f);
>>>> + f = kill(pid, SIGCONT);
>>>> + ASSERT_EQ(0, f);
>>>> +}
>>>> +
>>>> +TEST(attach)
>>>> +{
>>>> + int f, pid = fork();
>>>> +
>>>> + if (!pid) {
>>>> + pthread_t pt;
>>>> +
>>>> + pthread_create(&pt, NULL, thread, NULL);
>>>> + pthread_join(pt, NULL);
>>>> + execlp("true", "true", NULL);
>>>> + }
>>>> +
>>>> + sleep(1);
>>>> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>>>
>>> To be meaningful this code needs to learn to loop when
>>> ptrace returns -EAGAIN.
>>>
>>> Because that is pretty much what any self respecting user space
>>> process will do.
>>>
>>> At which point I am not certain we can say that the behavior has
>>> sufficiently improved not to be a deadlock.
>>>
>>
>> In this special dead-duck test it won't work, but it would
>> still be lots more transparent what is going on, since previously
>> you had two zombie process, and no way to even output debug
>> messages, which also all self respecting user space processes
>> should do.
>
> Agreed it is more transparent. So if you are going to deadlock
> it is better.
>
> My previous proposal (which I admit is more work to implement) would
> actually allow succeeding in this case and so it would not be subject to
> a dead lock (even via -EGAIN) at this point.
>
>> So yes, I can at least give a good example and re-try it several
>> times together with wait4 which a tracer is expected to do.
>
> Thank you,
>
> Eric
>
Okay, I think it can be done with minimal API changes,
but it needs two mutexes, one that guards the execve,
and one that guards only the credentials.
If no traced sibling thread exists, the mutexes are used this way:
    lock(exec_guard_mutex)
    cred_locked_in_execve = true;
    de_thread()
    lock(cred_guard_mutex)
    unlock(cred_guard_mutex)
    cred_locked_in_execve = false;
    unlock(exec_guard_mutex)
so effectively no API change at all.
If a traced sibling thread exists, the mutexes are used differently:
    lock(exec_guard_mutex)
    cred_locked_in_execve = true;
    unlock(exec_guard_mutex)
    de_thread()
    lock(cred_guard_mutex)
    unlock(cred_guard_mutex)
    lock(exec_guard_mutex)
    cred_locked_in_execve = false;
    unlock(exec_guard_mutex)
Only the case that would deadlock anyway changes.
Bernd.
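To make the two lock sequences above easier to compare, here is a small
single-threaded userspace model of them. It is a sketch under stated
assumptions, not kernel code: struct toy_mutex, struct sig_model,
exec_model() and attach_model() are invented for illustration, and the
-EAGAIN result for an attacher that finds cred_locked_in_execve set is
my reading of the earlier discussion, not something this mail spells out.
-EBUSY stands in for "the attacher would sleep on the mutex", which is
today's behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy lock: tracks only held/not held, enough for a walk-through. */
struct toy_mutex { bool held; };

static void toy_lock(struct toy_mutex *m)   { assert(!m->held); m->held = true; }
static void toy_unlock(struct toy_mutex *m) { assert(m->held); m->held = false; }

static bool toy_trylock(struct toy_mutex *m)
{
	if (m->held)
		return false;
	m->held = true;
	return true;
}

/* Stand-in for the signal_struct fields named in the mail above. */
struct sig_model {
	struct toy_mutex exec_guard_mutex;	/* guards the execve */
	struct toy_mutex cred_guard_mutex;	/* guards only the credentials */
	bool cred_locked_in_execve;		/* accessed under exec_guard_mutex */
};

/* What an attacher (assumed behavior) would observe right now. */
static int attach_model(struct sig_model *sig)
{
	int ret;

	if (!toy_trylock(&sig->exec_guard_mutex))
		return -EBUSY;	/* a real attacher would block here */
	ret = sig->cred_locked_in_execve ? -EAGAIN : 0;
	toy_unlock(&sig->exec_guard_mutex);
	return ret;
}

/*
 * Walk through one execve. Returns what an attacher would have seen
 * while de_thread() was in progress.
 */
static int exec_model(struct sig_model *sig, bool traced_sibling)
{
	int seen;

	toy_lock(&sig->exec_guard_mutex);
	assert(!sig->cred_locked_in_execve);
	sig->cred_locked_in_execve = true;
	if (traced_sibling)
		toy_unlock(&sig->exec_guard_mutex);

	/* de_thread() would wait for the sibling threads here. */
	seen = attach_model(sig);

	toy_lock(&sig->cred_guard_mutex);
	/* the credential update would happen here */
	toy_unlock(&sig->cred_guard_mutex);

	if (traced_sibling)
		toy_lock(&sig->exec_guard_mutex);
	sig->cred_locked_in_execve = false;
	toy_unlock(&sig->exec_guard_mutex);

	return seen;
}
```

In this model exec_model(&sig, false) reports -EBUSY (the attacher would
simply wait, as it does now), while exec_model(&sig, true) reports -EAGAIN,
which is the only case where the visible behavior differs.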
On Tue, Mar 03, 2020 at 08:08:26AM +0000, Bernd Edlinger wrote:
> On 3/3/20 6:29 AM, Kees Cook wrote:
> > On Tue, Mar 03, 2020 at 04:54:34AM +0000, Bernd Edlinger wrote:
> >> On 3/3/20 3:26 AM, Kees Cook wrote:
> >>> On Mon, Mar 02, 2020 at 10:18:07PM +0000, Bernd Edlinger wrote:
> >>>> [...]
> >>>
> >>> If I'm reading this patch correctly, this changes the lifetime of the
> >>> cred_guard_mutex lock to be:
> >>> - during prepare_bprm_creds()
> >>> - from flush_old_exec() through install_exec_creds()
> >>> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> >>> install_exec_creds().
> >
> > BTW, I think the effect of this change (i.e. my paragraph above) should
> > be distinctly called out in the commit log if this solution moves
> > forward.
> >
>
> Okay, will do.
>
> >>> That means, for example, that check_unsafe_exec()'s documented invariant
> >>> is violated:
> >>> /*
> >>> * determine how safe it is to execute the proposed program
> >>> * - the caller must hold ->cred_guard_mutex to protect against
> >>> * PTRACE_ATTACH or seccomp thread-sync
> >>> */
> >>
> >> Oh, right, I haven't understood that hint...
> >
> > I know no_new_privs is checked there, but I haven't studied the
> > PTRACE_ATTACH part of that comment. If that is handled with the new
> > check, this comment should be updated.
> >
>
> Okay, I change that comment to:
>
> /*
> * determine how safe it is to execute the proposed program
> * - the caller must have set ->cred_locked_in_execve to protect against
> * PTRACE_ATTACH or seccomp thread-sync
> */
>
> >>> I think it also means that the potentially multiple invocations
> >>> of bprm_fill_uid() (via prepare_binprm() via binfmt_script.c and
> >>> binfmt_misc.c) would be changing bprm->cred details (uid, gid) without
> >>> a lock (another place where current's no_new_privs is evaluated).
> >>
> >> So no_new_privs can change from 0->1, but should not
> >> when execve is running.
> >>
> >> As long as the calling thread is in execve it won't do this,
> >> and the only other place, where it may set for other threads
> >> is in seccomp_sync_threads, but that can easily be avoided see below.
> >
> > Yeah, everything was fine until I had to go complicate things with
> > TSYNC. ;) The real goal is making sure an exec cannot gain privs while
> > later gaining a seccomp filter from an unpriv process. The no_new_privs
> > flag was used to control this, but it required that the filter not get
> > applied during exec.
> >
> >>> Related, it also means that cred_guard_mutex is unheld for every
> >>> invocation of search_binary_handler() (which can loop via the previously
> >>> mentioned binfmt_script.c and binfmt_misc.c), if any of them have hidden
> >>> dependencies on cred_guard_mutex. (Thought I only see bprm_fill_uid()
> >>> currently.)
> >>>
> >>> For seccomp, the expectations about existing thread states risks races
> >>> too. There are two locks held for TSYNC:
> >>> - current->sighand->siglock is held to keep new threads from
> >>> appearing/disappearing, which would destroy filter refcounting and
> >>> lead to memory corruption.
> >>
> >> I don't understand what you mean here.
> >> How can this lead to memory corruption?
> >
> > Mainly this is a matter of how seccomp manages its filter hierarchy
> > (since the filters are shared through process ancestry), so if a thread
> > appears in the middle of TSYNC it may be racing another TSYNC and break
> > ancestry, leading to bad reference counting on process death, etc.
> > (Though, yes, with refcount_t now, things should never corrupt, just
> > waste memory.)
> >
>
> I assume for now, that the current->sighand->siglock held while iterating all
> threads is sufficient here.
>
> >>> - cred_guard_mutex is held to keep no_new_privs in sync with filters to
> >>> avoid no_new_privs and filter confusion during exec, which could
> >>> lead to exploitable setuid conditions (see below).
> >>>
> >>> Just racing a malicious thread during TSYNC is not a very strong
> >>> example (a malicious thread could do lots of fun things to "current"
> >>> before it ever got near calling TSYNC), but I think there is the risk
> >>> of mismatched/confused states that we don't want to allow. One is a
> >>> particularly bad state that could lead to privilege escalations (in the
> >>> form of the old "sendmail doesn't check setuid" flaw; if a setuid process
> >>> has a filter attached that silently fails a priv-dropping setuid call
> >>> and continues execution with elevated privs, it can be tricked into
> >>> doing bad things on behalf of the unprivileged parent, which was the
> >>> primary goal of the original use of cred_guard_mutex with TSYNC[1]):
> >>>
> >>> thread A clones thread B
> >>> thread B starts setuid exec
> >>> thread A sets no_new_privs
> >>> thread A calls seccomp with TSYNC
> >>> thread A in seccomp_sync_threads() sets seccomp filter on self and thread B
> >>> thread B passes check_unsafe_exec() with no_new_privs unset
> >>> thread B reaches bprm_fill_uid() with no_new_privs unset and gains privs
> >>> thread A still in seccomp_sync_threads() sets no_new_privs on thread B
> >>> thread B finishes exec, now running with elevated privs, a filter chosen
> >>> by thread A, _and_ nnp set (which doesn't matter)
> >>>
> >>> With the original locking, thread B will fail check_unsafe_exec()
> >>> because filter and nnp state are changed together, with "atomicity"
> >>> protected by the cred_guard_mutex.
> >>>
> >>
> >> Ah, good point, thanks!
> >>
> >> This can be fixed by checking current->signal->cred_locked_for_ptrace
> >> while the cred_guard_mutex is locked, like this for instance:
> >>
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index b6ea3dc..377abf0 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -342,6 +342,9 @@ static inline pid_t seccomp_can_sync_threads(void)
> >> BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
> >> assert_spin_locked(&current->sighand->siglock);
> >>
> >> + if (current->signal->cred_locked_for_ptrace)
> >> + return -EAGAIN;
> >> +
> >
> > Hmm. I guess something like that could work. TSYNC expects to be able to
> > report _which_ thread wrecked the call, though... I wonder if in_execve
> > could be used to figure out the offending thread. Hm, nope, that would
> > be outside of lock too (and all users are "current" right now, so the
> > lock wasn't needed before).
> >
>
> I could move that in_execve = 1 to prepare_bprm_creds, if it really matters,
> but the caller will die quickly and cannot do anything with that information
> when another thread executes execve, right?
>
> >> /* Validate all threads being eligible for synchronization. */
> >> caller = current;
> >> for_each_thread(caller, thread) {
> >>
> >>
> >>> And this is just the bad state I _can_ see. I'm worried there are more...
> >>>
> >>> All this said, I do see a small similarity here to the work I did to
> >>> stabilize stack rlimits (there was an ongoing problem with making multiple
> >>> decisions for the bprm based on current's state -- but current's state
> >>> was mutable during exec). For this, I saved rlim_stack to bprm and ignored
> >>> current's copy until exec ended and then stored bprm's copy into current.
> >>> If the only problem anyone can see here is the handling of no_new_privs,
> >>> we might be able to solve that similarly, at least disentangling tsync/nnp
> >>> from cred_guard_mutex.
> >>>
> >>
> >> I still think that is solvable with using cred_locked_for_ptrace and
> >> simply make the tsync fail if it would otherwise be blocked.
> >
> > I wonder if we can find a better name than "cred_locked_for_ptrace"?
> > Maybe "cred_unfinished" or "cred_locked_in_exec" or something?
> >
>
> Yeah, I'd go with "cred_locked_in_execve".
>
> > And the comment on bool cred_locked_for_ptrace should mention that
> > access is only allowed under cred_guard_mutex lock.
> >
>
> okay.
>
> >>>> + sig->cred_locked_for_ptrace = false;
> >
> > This is redundant to the zalloc -- I think you can drop it (unless
> > someone wants to keep it for clarify?)
> >
>
> I'll remove that here and in init/init_task.c
>
> > Also, I think cred_locked_for_ptrace needs checking deeper, in
> > __ptrace_may_access(), not in ptrace_attach(), since LOTS of things make
> > calls to ptrace_may_access() holding cred_guard_mutex, expecting that to
> > be sufficient to see a stable version of the thread...
> >
>
> No, these need to be addressed individually, but most users just want
> to know if the current credentials are sufficient at this moment, but will
> not change the credentials, as ptrace and TSYNC do.
>
> BTW: Not all users have cred_guard_mutex, see mm/migrate.c,
> mm/mempolicy.c, kernel/futex.c, fs/proc/namespaces.c etc.
> So adding an access to cred_locked_in_execve in ptrace_may_access is
> probably not an option.
That could be solved by e.g. adding ptrace_may_access_{no}exec() taking
cred_guard_mutex.
Bernd Edlinger <[email protected]> writes:
> On 3/3/20 9:08 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <[email protected]> writes:
>>
>>> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
>>>> Bernd Edlinger <[email protected]> writes:
>>>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>>>>> new file mode 100644
>>>>> index 0000000..6d8a048
>>>>> --- /dev/null
>>>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>>>>> @@ -0,0 +1,66 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0+
>>>>> +/*
>>>>> + * Copyright (c) 2020 Bernd Edlinger <[email protected]>
>>>>> + * All rights reserved.
>>>>> + *
>>>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>>>>> + * when de_thread is blocked with ->cred_guard_mutex held.
>>>>> + */
>>>>> +
>>>>> +#include "../kselftest_harness.h"
>>>>> +#include <stdio.h>
>>>>> +#include <fcntl.h>
>>>>> +#include <pthread.h>
>>>>> +#include <signal.h>
>>>>> +#include <unistd.h>
>>>>> +#include <sys/ptrace.h>
>>>>> +
>>>>> +static void *thread(void *arg)
>>>>> +{
>>>>> + ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>>>>> + return NULL;
>>>>> +}
>>>>> +
>>>>> +TEST(vmaccess)
>>>>> +{
>>>>> + int f, pid = fork();
>>>>> + char mm[64];
>>>>> +
>>>>> + if (!pid) {
>>>>> + pthread_t pt;
>>>>> +
>>>>> + pthread_create(&pt, NULL, thread, NULL);
>>>>> + pthread_join(pt, NULL);
>>>>> + execlp("true", "true", NULL);
>>>>> + }
>>>>> +
>>>>> + sleep(1);
>>>>> + sprintf(mm, "/proc/%d/mem", pid);
>>>>> + f = open(mm, O_RDONLY);
>>>>> + ASSERT_LE(0, f);
>>>>> + close(f);
>>>>> + f = kill(pid, SIGCONT);
>>>>> + ASSERT_EQ(0, f);
>>>>> +}
>>>>> +
>>>>> +TEST(attach)
>>>>> +{
>>>>> + int f, pid = fork();
>>>>> +
>>>>> + if (!pid) {
>>>>> + pthread_t pt;
>>>>> +
>>>>> + pthread_create(&pt, NULL, thread, NULL);
>>>>> + pthread_join(pt, NULL);
>>>>> + execlp("true", "true", NULL);
>>>>> + }
>>>>> +
>>>>> + sleep(1);
>>>>> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>>>>
>>>> To be meaningful this code needs to learn to loop when
>>>> ptrace returns -EAGAIN.
>>>>
>>>> Because that is pretty much what any self respecting user space
>>>> process will do.
>>>>
>>>> At which point I am not certain we can say that the behavior has
>>>> sufficiently improved not to be a deadlock.
>>>>
>>>
>>> In this special dead-duck test it won't work, but it would
>>> still be lots more transparent what is going on, since previously
> >>> you had two zombie processes, and no way to even output debug
>>> messages, which also all self respecting user space processes
>>> should do.
>>
>> Agreed it is more transparent. So if you are going to deadlock
>> it is better.
>>
>> My previous proposal (which I admit is more work to implement) would
>> actually allow succeeding in this case and so it would not be subject to
> >> a deadlock (even via -EAGAIN) at this point.
>>
>>> So yes, I can at least give a good example and re-try it several
>>> times together with wait4 which a tracer is expected to do.
>>
>> Thank you,
>>
>> Eric
>>
>
> Okay, I think it can be done with minimal API changes,
> but it needs two mutexes, one that guards the execve,
> and one that guards only the credentials.
>
> If no traced sibling thread exists, the mutexes are used this way:
> lock(exec_guard_mutex)
> cred_locked_in_execve = true;
> de_thread()
> lock(cred_guard_mutex)
> unlock(cred_guard_mutex)
> cred_locked_in_execve = false;
> unlock(exec_guard_mutex)
>
> so effectively no API change at all.
>
> If a traced sibling thread exists, the mutexes are used differently:
> lock(exec_guard_mutex)
> cred_locked_in_execve = true;
> unlock(exec_guard_mutex)
> de_thread()
> lock(cred_guard_mutex)
> unlock(cred_guard_mutex)
> lock(exec_guard_mutex)
> cred_locked_in_execve = false;
> unlock(exec_guard_mutex)
>
> Only the case changes that would deadlock anyway.
Let me propose a slight alternative that I think sets us up for long
term success.

Leave cred_guard_mutex as is, but declare it undesirable. The
cred_guard_mutex as designed really is something we should get rid of.
As it is, it can sleep over several different userspace accesses. The
copying of the exec arguments is technically as prone to deadlock as the
ptrace case.

Add a new mutex with a better name, perhaps "exec_change_mutex", that is
used to guard the changes that exec makes to a process.

Then we gradually shift all the cred_guard_mutex users over to the new
mutex. AKA one patch per user of cred_guard_mutex. At each patch that
shifts things over we will have the opportunity to review the code to
see that there are no funny dependencies that were missed.

I will sign up for working on the no_new_privs and ptrace_attach cases
as I think I can make those happen. Especially no_new_privs.

Getting the easier cases will resolve your issues and put things on a
better footing.

Eric

On 3/4/20 5:33 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> On 3/3/20 9:08 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <[email protected]> writes:
>>>
>>>> On 3/3/20 4:18 PM, Eric W. Biederman wrote:
>>>>> Bernd Edlinger <[email protected]> writes:
>>>>>> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
>>>>>> new file mode 100644
>>>>>> index 0000000..6d8a048
>>>>>> --- /dev/null
>>>>>> +++ b/tools/testing/selftests/ptrace/vmaccess.c
>>>>>> @@ -0,0 +1,66 @@
>>>>>> +// SPDX-License-Identifier: GPL-2.0+
>>>>>> +/*
>>>>>> + * Copyright (c) 2020 Bernd Edlinger <[email protected]>
>>>>>> + * All rights reserved.
>>>>>> + *
>>>>>> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
>>>>>> + * when de_thread is blocked with ->cred_guard_mutex held.
>>>>>> + */
>>>>>> +
>>>>>> +#include "../kselftest_harness.h"
>>>>>> +#include <stdio.h>
>>>>>> +#include <fcntl.h>
>>>>>> +#include <pthread.h>
>>>>>> +#include <signal.h>
>>>>>> +#include <unistd.h>
>>>>>> +#include <sys/ptrace.h>
>>>>>> +
>>>>>> +static void *thread(void *arg)
>>>>>> +{
>>>>>> + ptrace(PTRACE_TRACEME, 0, 0L, 0L);
>>>>>> + return NULL;
>>>>>> +}
>>>>>> +
>>>>>> +TEST(vmaccess)
>>>>>> +{
>>>>>> + int f, pid = fork();
>>>>>> + char mm[64];
>>>>>> +
>>>>>> + if (!pid) {
>>>>>> + pthread_t pt;
>>>>>> +
>>>>>> + pthread_create(&pt, NULL, thread, NULL);
>>>>>> + pthread_join(pt, NULL);
>>>>>> + execlp("true", "true", NULL);
>>>>>> + }
>>>>>> +
>>>>>> + sleep(1);
>>>>>> + sprintf(mm, "/proc/%d/mem", pid);
>>>>>> + f = open(mm, O_RDONLY);
>>>>>> + ASSERT_LE(0, f);
>>>>>> + close(f);
>>>>>> + f = kill(pid, SIGCONT);
>>>>>> + ASSERT_EQ(0, f);
>>>>>> +}
>>>>>> +
>>>>>> +TEST(attach)
>>>>>> +{
>>>>>> + int f, pid = fork();
>>>>>> +
>>>>>> + if (!pid) {
>>>>>> + pthread_t pt;
>>>>>> +
>>>>>> + pthread_create(&pt, NULL, thread, NULL);
>>>>>> + pthread_join(pt, NULL);
>>>>>> + execlp("true", "true", NULL);
>>>>>> + }
>>>>>> +
>>>>>> + sleep(1);
>>>>>> + f = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
>>>>>
>>>>> To be meaningful this code needs to learn to loop when
>>>>> ptrace returns -EAGAIN.
>>>>>
>>>>> Because that is pretty much what any self respecting user space
>>>>> process will do.
>>>>>
>>>>> At which point I am not certain we can say that the behavior has
>>>>> sufficiently improved not to be a deadlock.
>>>>>
>>>>
>>>> In this special dead-duck test it won't work, but it would
>>>> still be lots more transparent what is going on, since previously
> >>>> you had two zombie processes, and no way to even output debug
>>>> messages, which also all self respecting user space processes
>>>> should do.
>>>
>>> Agreed it is more transparent. So if you are going to deadlock
>>> it is better.
>>>
>>> My previous proposal (which I admit is more work to implement) would
>>> actually allow succeeding in this case and so it would not be subject to
> >>> a deadlock (even via -EAGAIN) at this point.
>>>
>>>> So yes, I can at least give a good example and re-try it several
>>>> times together with wait4 which a tracer is expected to do.
>>>
>>> Thank you,
>>>
>>> Eric
>>>
>>
>> Okay, I think it can be done with minimal API changes,
>> but it needs two mutexes, one that guards the execve,
>> and one that guards only the credentials.
>>
>> If no traced sibling thread exists, the mutexes are used this way:
>> lock(exec_guard_mutex)
>> cred_locked_in_execve = true;
>> de_thread()
>> lock(cred_guard_mutex)
>> unlock(cred_guard_mutex)
>> cred_locked_in_execve = false;
>> unlock(exec_guard_mutex)
>>
>> so effectively no API change at all.
>>
>> If a traced sibling thread exists, the mutexes are used differently:
>> lock(exec_guard_mutex)
>> cred_locked_in_execve = true;
>> unlock(exec_guard_mutex)
>> de_thread()
>> lock(cred_guard_mutex)
>> unlock(cred_guard_mutex)
>> lock(exec_guard_mutex)
>> cred_locked_in_execve = false;
>> unlock(exec_guard_mutex)
>>
>> Only the case changes that would deadlock anyway.
>
>
> Let me propose a slight alternative that I think sets us up for long
> term success.
>
> Leave cred_guard_mutex as is, but declare it undesirable. The
> cred_guard_mutex as designed really is something we should get rid of.
> As it is, it can sleep over several different userspace accesses. The
> copying of the exec arguments is technically as prone to deadlock as the
> ptrace case.
>
> Add a new mutex with a better name perhaps "exec_change_mutex" that is
> used to guard the changes that exec makes to a process.
>
> Then we gradually shift all the cred_guard_mutex users over to the new
> mutex. AKA one patch per user of cred_guard_mutex. At each patch that
> shifts things over we will have the opportunity to review the code to
> see that there are no funny dependencies that were missed.
>
> I will sign up for working on the no_new_privs and ptrace_attach cases
> as I think I can make those happen. Especially no_new_privs.
>
> Getting the easier cases will resolve your issues and put things on a
> better footing.
>
> Eric
>
Okay, however I think we will need two mutexes in the long term.

So currently I have reduced the cred_guard_mutex to protect just
the loading of the executable code in the process vm, since that
is what works for vm_access (one of the test cases).

And another mutex that protects the whole execve function, that
is needed for ptrace (and seccomp).
But I have only a test case for ptrace.
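
To make the intended lock usage concrete, here is a minimal userspace
sketch of the two-mutex scheme, using pthread mutexes in place of kernel
mutexes. It is only a model under assumptions: struct sig_model,
struct bprm_model and every function name below are made up for
illustration and are not kernel APIs; only the field names follow the
ones used in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant fields of struct signal_struct. */
struct sig_model {
	pthread_mutex_t exec_guard_mutex; /* guards the whole execve */
	pthread_mutex_t cred_guard_mutex; /* guards only the creds/mm switch */
	bool cred_locked_in_execve;       /* only valid under exec_guard_mutex */
	int mm_generation;                /* bumped when creds and mm are replaced */
};

/* Hypothetical stand-in for the per-exec state kept in struct linux_binprm. */
struct bprm_model {
	bool holding_exec_guard_mutex;
};

static void sig_model_init(struct sig_model *s)
{
	pthread_mutex_init(&s->exec_guard_mutex, NULL);
	pthread_mutex_init(&s->cred_guard_mutex, NULL);
	s->cred_locked_in_execve = false;
	s->mm_generation = 0;
}

/* Start of execve: mark the exec as in progress. */
static int exec_begin(struct sig_model *s, struct bprm_model *b,
		      bool traced_sibling)
{
	pthread_mutex_lock(&s->exec_guard_mutex);
	if (s->cred_locked_in_execve) {
		pthread_mutex_unlock(&s->exec_guard_mutex);
		return -EAGAIN;
	}
	s->cred_locked_in_execve = true;
	b->holding_exec_guard_mutex = true;
	if (traced_sibling) {
		/* de_thread() may wait for the tracer, so drop the mutex. */
		pthread_mutex_unlock(&s->exec_guard_mutex);
		b->holding_exec_guard_mutex = false;
	}
	return 0;
}

/* After de_thread(): replace creds and mm under cred_guard_mutex only. */
static void exec_commit(struct sig_model *s)
{
	pthread_mutex_lock(&s->cred_guard_mutex);
	s->mm_generation++;
	pthread_mutex_unlock(&s->cred_guard_mutex);
}

/* End of execve: clear the flag under exec_guard_mutex and release it. */
static void exec_end(struct sig_model *s, struct bprm_model *b)
{
	if (!b->holding_exec_guard_mutex)
		pthread_mutex_lock(&s->exec_guard_mutex);
	assert(s->cred_locked_in_execve);
	s->cred_locked_in_execve = false;
	b->holding_exec_guard_mutex = false;
	pthread_mutex_unlock(&s->exec_guard_mutex);
}

/* Models PTRACE_ATTACH: fails with -EAGAIN while an exec is in progress. */
static int model_ptrace_attach(struct sig_model *s)
{
	int ret = 0;

	pthread_mutex_lock(&s->exec_guard_mutex);
	if (s->cred_locked_in_execve)
		ret = -EAGAIN;
	pthread_mutex_unlock(&s->exec_guard_mutex);
	return ret;
}

/* Models mm_access: needs a consistent creds/mm pair, nothing more. */
static int model_mm_access(struct sig_model *s)
{
	int gen;

	pthread_mutex_lock(&s->cred_guard_mutex);
	gen = s->mm_generation;
	pthread_mutex_unlock(&s->cred_guard_mutex);
	return gen;
}
```

In this model, with a traced sibling, a model_ptrace_attach() issued
between exec_begin() and exec_end() returns -EAGAIN, while
model_mm_access() completes without blocking on the exec.
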

If I understand that right, I should not recycle cred_guard_mutex
but leave it as is, and create two additional mutexes which will
take over step by step.

Sounds reasonable, indeed.

I will send an update (v6) of what I have right now,
but just for information, so you can see how my minimal API-change
approach works.

Bernd.

This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated. They send ptrace events, but strace is no
longer able to respond, since it is blocked in vm_access.

The deadlock always happens when strace needs to access the
tracee's process mmap while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The proposed solution is to detect if a sibling thread
exists that is traced and in this case to make PTRACE_ATTACH
fail with -EAGAIN instead of deadlocking.
But other functions like vm_access are allowed to complete normally.

This changes the lifetime of the cred_guard_mutex lock to be
from flush_old_exec() through install_exec_creds().
Before, cred_guard_mutex was held from prepare_bprm_creds() through
install_exec_creds().

Additionally a new mutex exec_guard_mutex is introduced that is used
for PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
Signed-off-by: Bernd Edlinger <[email protected]>
---
Documentation/security/credentials.rst | 29 ++++++++---
fs/exec.c | 58 ++++++++++++++++++---
include/linux/binfmts.h | 15 +++++-
include/linux/sched/signal.h | 10 ++--
init/init_task.c | 1 +
kernel/cred.c | 4 +-
kernel/fork.c | 1 +
kernel/ptrace.c | 20 ++++++--
kernel/seccomp.c | 15 +++---
mm/process_vm_access.c | 2 +-
tools/testing/selftests/ptrace/Makefile | 4 +-
tools/testing/selftests/ptrace/vmaccess.c | 85 +++++++++++++++++++++++++++++++
12 files changed, 210 insertions(+), 34 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
v2: adds a test case which passes when this patch is applied.
v3: fixes the issue without introducing a new mutex.
v4: fixes one comment and a formatting issue found by checkpatch.pl in the test case.
v5: addresses review comments.
v6: minimal API changes, using a second mutex, improved test case.
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..b08899f 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,15 +437,30 @@ new set of credentials by calling::
struct cred *prepare_creds(void);
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->exec_guard_mutex
+is acquired before this function gets called, and usually released after
+the new process mmap and credentials are installed. However, if one of the
+sibling threads is being traced when the execve is invoked, there is no
+guarantee how long it takes to terminate all sibling threads, and therefore
+the variable current->signal->cred_locked_in_execve is set, and the
+exec_guard_mutex is released immediately. Functions that may have effect
+on the credentials of a different thread need to lock the exec_guard_mutex
+and additionally check the cred_locked_in_execve status, and fail with
+-EAGAIN if that variable is set.
The mutex prevents ``ptrace()`` from altering the ptrace state of a process
while security checks on credentials construction and changing is taking place
as the ptrace state may alter the outcome, particularly in the case of
``execve()``.
+The mutex current->signal->cred_guard_mutex is acquired when only a single thread
+is remaining, and the credentials and the process mmap are actually changed.
+Functions that only need access to a consistent state of the credentials
+and the process mmap only need to acquire this mutex.
+
The new credentials set should be altered appropriately, and any security
checks and hooks done. Both the current and the proposed sets of credentials
are available for this purpose as current_cred() will return the current set
@@ -466,9 +481,8 @@ by calling::
This will alter various aspects of the credentials and the process, giving the
LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
This function is guaranteed to return 0, so that it can be tail-called at the
end of such functions as ``sys_setresuid()``.
@@ -486,8 +500,7 @@ invoked::
void abort_creds(struct cred *new);
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
A typical credentials alteration function would look something like this::
diff --git a/fs/exec.c b/fs/exec.c
index 74d88da..8a23804 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1258,6 +1258,11 @@ int flush_old_exec(struct linux_binprm * bprm)
{
int retval;
+ if (bprm->detected_unsafe_exec) {
+ mutex_unlock(&current->signal->exec_guard_mutex);
+ bprm->holding_exec_guard_mutex = 0;
+ }
+
/*
* Make sure we have a private signal table and that
* we are unassociated from the previous thread group.
@@ -1266,6 +1271,12 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval)
goto out;
+ retval = mutex_lock_killable(&current->signal->cred_guard_mutex);
+ if (retval)
+ goto out;
+
+ bprm->holding_cred_guard_mutex = 1;
+
/*
* Must be called _before_ exec_mmap() as bprm->mm is
* not visibile until then. This also enables the update
@@ -1398,29 +1409,56 @@ void finalize_exec(struct linux_binprm *bprm)
EXPORT_SYMBOL(finalize_exec);
/*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and set ->cred_locked_in_execve.
* install_exec_creds() commits the new creds and drops the lock.
* Or, if exec fails before, free_bprm() should release ->cred and
* and unlock.
*/
static int prepare_bprm_creds(struct linux_binprm *bprm)
{
- if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+ int ret;
+ struct task_struct *t;
+
+ if (mutex_lock_interruptible(&current->signal->exec_guard_mutex))
return -ERESTARTNOINTR;
+ bprm->holding_exec_guard_mutex = 1;
+
+ ret = -EAGAIN;
+ if (unlikely(current->signal->cred_locked_in_execve))
+ goto out;
+
bprm->cred = prepare_exec_creds();
- if (likely(bprm->cred))
- return 0;
+ ret = -ENOMEM;
+ if (unlikely(bprm->cred == NULL))
+ goto out;
- mutex_unlock(&current->signal->cred_guard_mutex);
- return -ENOMEM;
+ current->signal->cred_locked_in_execve = true;
+
+ spin_lock_irq(&current->sighand->siglock);
+ t = current;
+ while_each_thread(current, t) {
+ if (t->ptrace)
+ bprm->detected_unsafe_exec = 1;
+ }
+ spin_unlock_irq(&current->sighand->siglock);
+ return 0;
+
+out:
+ mutex_unlock(&current->signal->exec_guard_mutex);
+ return ret;
}
static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
- mutex_unlock(&current->signal->cred_guard_mutex);
+ if (bprm->holding_cred_guard_mutex)
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ if (!bprm->holding_exec_guard_mutex)
+ mutex_lock(&current->signal->exec_guard_mutex);
+ current->signal->cred_locked_in_execve = false;
+ mutex_unlock(&current->signal->exec_guard_mutex);
abort_creds(bprm->cred);
}
if (bprm->file) {
@@ -1470,12 +1508,16 @@ void install_exec_creds(struct linux_binprm *bprm)
*/
security_bprm_committed_creds(bprm);
mutex_unlock(&current->signal->cred_guard_mutex);
+ if (bprm->detected_unsafe_exec)
+ mutex_lock(&current->signal->exec_guard_mutex);
+ current->signal->cred_locked_in_execve = false;
+ mutex_unlock(&current->signal->exec_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
/*
* determine how safe it is to execute the proposed program
- * - the caller must hold ->cred_guard_mutex to protect against
+ * - the caller must have set ->cred_locked_in_execve to protect against
* PTRACE_ATTACH or seccomp thread-sync
*/
static void check_unsafe_exec(struct linux_binprm *bprm)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..238e280 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,20 @@ struct linux_binprm {
* exec has happened. Used to sanitize execution environment
* and to set AT_SECURE auxv for glibc.
*/
- secureexec:1;
+ secureexec:1,
+ /*
+ * Set by prepare_bprm_creds, if a sibling thread is being
+ * traced and the exec_guard_mutex is therefore not taken.
+ */
+ detected_unsafe_exec:1,
+ /*
+ * Set when the cred_guard_mutex is taken.
+ */
+ holding_cred_guard_mutex:1,
+ /*
+ * Set when the exec_guard_mutex is taken.
+ */
+ holding_exec_guard_mutex:1;
#ifdef __alpha__
unsigned int taso:1;
#endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..4484aa3 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -222,9 +222,13 @@ struct signal_struct {
struct mm_struct *oom_mm; /* recorded mm when the thread group got
* killed by the oom killer */
- struct mutex cred_guard_mutex; /* guard against foreign influences on
- * credential calculations
- * (notably. ptrace) */
+ struct mutex cred_guard_mutex; /* guard against changing credentials */
+ struct mutex exec_guard_mutex; /* guard against foreign influences on
+ * execve (notably. ptrace)
+ */
+ bool cred_locked_in_execve; /* set while in execve, only valid when
+ * exec_guard_mutex is held
+ */
} __randomize_layout;
/*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..6cf602a 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
.multiprocess = HLIST_HEAD_INIT,
.rlim = INIT_RLIMITS,
.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+ .exec_guard_mutex = __MUTEX_INITIALIZER(init_signals.exec_guard_mutex),
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
.cputimer = {
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..620cd50 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -295,7 +295,7 @@ struct cred *prepare_creds(void)
/*
* Prepare credentials for current to perform an execve()
- * - The caller must hold ->cred_guard_mutex
+ * - The caller must hold ->exec_guard_mutex
*/
struct cred *prepare_exec_creds(void)
{
@@ -676,7 +676,7 @@ void __init cred_init(void)
*
* Returns the new credentials or NULL if out of memory.
*
- * Does not take, and does not return holding current->cred_replace_mutex.
+ * Does not take, and does not return holding ->cred_guard_mutex.
*/
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 0808095..0c21baa 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
mutex_init(&sig->cred_guard_mutex);
+ mutex_init(&sig->exec_guard_mutex);
return 0;
}
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..1af8ff4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -392,9 +392,13 @@ static int ptrace_attach(struct task_struct *task, long request,
* under ptrace.
*/
retval = -ERESTARTNOINTR;
- if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
+ if (mutex_lock_interruptible(&task->signal->exec_guard_mutex))
goto out;
+ retval = -EAGAIN;
+ if (task->signal->cred_locked_in_execve)
+ goto unlock_creds;
+
task_lock(task);
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
task_unlock(task);
@@ -447,7 +451,7 @@ static int ptrace_attach(struct task_struct *task, long request,
unlock_tasklist:
write_unlock_irq(&tasklist_lock);
unlock_creds:
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->exec_guard_mutex);
out:
if (!retval) {
/*
@@ -472,10 +476,18 @@ static int ptrace_attach(struct task_struct *task, long request,
*/
static int ptrace_traceme(void)
{
- int ret = -EPERM;
+ int ret;
+
+ if (mutex_lock_interruptible(&current->signal->exec_guard_mutex))
+ return -ERESTARTNOINTR;
+
+ ret = -EAGAIN;
+ if (current->signal->cred_locked_in_execve)
+ goto unlock_creds;
write_lock_irq(&tasklist_lock);
/* Are we already being traced? */
+ ret = -EPERM;
if (!current->ptrace) {
ret = security_ptrace_traceme(current->parent);
/*
@@ -490,6 +502,8 @@ static int ptrace_traceme(void)
}
write_unlock_irq(&tasklist_lock);
+unlock_creds:
+ mutex_unlock(&current->signal->exec_guard_mutex);
return ret;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..7ec66b1 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -329,7 +329,7 @@ static int is_ancestor(struct seccomp_filter *parent,
/**
* seccomp_can_sync_threads: checks if all threads can be synchronized
*
- * Expects sighand and cred_guard_mutex locks to be held.
+ * Expects sighand and exec_guard_mutex locks to be held.
*
* Returns 0 on success, -ve on error, or the pid of a thread which was
* either not in the correct seccomp mode or did not have an ancestral
@@ -339,9 +339,12 @@ static inline pid_t seccomp_can_sync_threads(void)
{
struct task_struct *thread, *caller;
- BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+ BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex));
assert_spin_locked(&current->sighand->siglock);
+ if (current->signal->cred_locked_in_execve)
+ return -EAGAIN;
+
/* Validate all threads being eligible for synchronization. */
caller = current;
for_each_thread(caller, thread) {
@@ -371,7 +374,7 @@ static inline pid_t seccomp_can_sync_threads(void)
/**
* seccomp_sync_threads: sets all threads to use current's filter
*
- * Expects sighand and cred_guard_mutex locks to be held, and for
+ * Expects sighand and exec_guard_mutex locks to be held, and for
* seccomp_can_sync_threads() to have returned success already
* without dropping the locks.
*
@@ -380,7 +383,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
{
struct task_struct *thread, *caller;
- BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+ BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex));
assert_spin_locked(&current->sighand->siglock);
/* Synchronize all threads. */
@@ -1319,7 +1322,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
* while another thread is in the middle of calling exec.
*/
if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
- mutex_lock_killable(&current->signal->cred_guard_mutex))
+ mutex_lock_killable(&current->signal->exec_guard_mutex))
goto out_put_fd;
spin_lock_irq(&current->sighand->siglock);
@@ -1337,7 +1340,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
out:
spin_unlock_irq(&current->sighand->siglock);
if (flags & SECCOMP_FILTER_FLAG_TSYNC)
- mutex_unlock(&current->signal->cred_guard_mutex);
+ mutex_unlock(&current->signal->exec_guard_mutex);
out_put_fd:
if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
if (ret) {
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
if (!mm || IS_ERR(mm)) {
rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
/*
- * Explicitly map EACCES to EPERM as EPERM is a more a
+ * Explicitly map EACCES to EPERM as EPERM is a more
* appropriate error code for process_vw_readv/writev
*/
if (rc == -EACCES)
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..fdca30b
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <[email protected]>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+ ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+ return NULL;
+}
+
+TEST(vmaccess)
+{
+ int f, pid = fork();
+ char mm[64];
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("true", "true", NULL);
+ }
+
+ sleep(1);
+ sprintf(mm, "/proc/%d/mem", pid);
+ f = open(mm, O_RDONLY);
+ ASSERT_GE(f, 0);
+ close(f);
+ f = kill(pid, SIGCONT);
+ ASSERT_EQ(f, 0);
+}
+
+TEST(attach)
+{
+ int s, k, pid = fork();
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("sleep", "sleep", "2", NULL);
+ }
+
+ sleep(1);
+ k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ ASSERT_EQ(errno, EAGAIN);
+ ASSERT_EQ(k, -1);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_NE(k, 0);
+ ASSERT_NE(k, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 0);
+ sleep(1);
+ k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ ASSERT_EQ(k, 0);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+ k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+ ASSERT_EQ(k, 0);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 0);
+ k = waitpid(-1, NULL, 0);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, ECHILD);
+}
+
+TEST_HARNESS_MAIN
--
1.9.1
On 3/4/20 10:56 PM, Bernd Edlinger wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread is running.
>
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated. They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
>
> The deadlock always happens when strace needs to access the
> tracee's process mmap while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>
> strace D 0 30614 30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> expect D 0 31933 30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> The proposed solution is to detect if a sibling thread
> exists that is traced and in this case to make PTRACE_ATTACH
> fail with -EAGAIN instead of deadlocking.
> But other functions like vm_access are allowed to complete normally.
>
> This changes the lifetime of the cred_guard_mutex lock to be
> from flush_old_exec() through install_exec_creds().
> Before, cred_guard_mutex was held from prepare_bprm_creds() through
> install_exec_creds().
>
> Additionally a new mutex exec_guard_mutex is introduced that is used
> for PTRACE_ATTACH and SECCOMP_FILTER_FLAG_TSYNC.
>
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> Documentation/security/credentials.rst | 29 ++++++++---
> fs/exec.c | 58 ++++++++++++++++++---
> include/linux/binfmts.h | 15 +++++-
> include/linux/sched/signal.h | 10 ++--
> init/init_task.c | 1 +
> kernel/cred.c | 4 +-
> kernel/fork.c | 1 +
> kernel/ptrace.c | 20 ++++++--
> kernel/seccomp.c | 15 +++---
> mm/process_vm_access.c | 2 +-
> tools/testing/selftests/ptrace/Makefile | 4 +-
> tools/testing/selftests/ptrace/vmaccess.c | 85 +++++++++++++++++++++++++++++++
> 12 files changed, 210 insertions(+), 34 deletions(-)
> create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
>
Okay, I think there is consensus about the next steps to be as follows:
- post the Documentation/security/credentials.rst changes as an independent patch.
- post an infrastructure patch which only introduces two new mutexes,
one exec_guard_mutex, and one "cred_change_mutex" (I am unhappy with that name,
because credentials can change without the cred_guard_mutex, this appears more
to guarantee that the credentials of the process and the process memory map are
consistent, so I think I need to think of a better name first...)
This keeps cred_guard_mutex as is, just deprecates it, and adds a note that it will
go away.
- post one patch that fixes the mm_access code path
- post one patch that fixes the PTRACE_ATTACH code path
- post one patch that introduces the new test cases
Thanks
Bernd.
Bernd, everyone
This is how I think the infrastructure change should look that makes way
for fixing this issue.
- Correct the point of no return.
- Add a new mutex to replace cred_guard_mutex
Then I think it is just going through the existing
users of cred_guard_mutex and fixing them to use the new one.
There really aren't that many users of cred_guard_mutex so we should be
able to get through the easy ones fairly quickly. And anything that
isn't easy we can wait until we have a good fix.
The users of cred_guard_mutex that I saw were:
fs/proc/base.c:
proc_pid_attr_write
do_io_accounting
proc_pid_stack
proc_pid_syscall
proc_pid_personality
perf_event_open
mm_access
kcmp
pidfd_fget
seccomp_set_mode_filter
Bernd does this make sense to you?
I think we can fix the seccomp/no_new_privs issue with some careful
refactoring. We can probably do the same for ptrace but that appears
to need a little lsm bug fixing.
My goal here is to allow us to fix the uncontroversial easy bits. While
still allowing the difficult tricky bits to be fixed.
Eric W. Biederman (2):
exec: Properly mark the point of no return
exec: Add a exec_update_mutex to replace cred_guard_mutex
fs/exec.c | 11 ++++++++---
include/linux/binfmts.h | 7 ++++++-
include/linux/sched/signal.h | 9 ++++++++-
kernel/fork.c | 1 +
4 files changed, 23 insertions(+), 5 deletions(-)
Eric
On 3/5/20 10:14 PM, Eric W. Biederman wrote:
>
> Bernd, everyone
>
> This is how I think the infrastructure change should look that makes way
> for fixing this issue.
>
> - Correct the point of no return.
> - Add a new mutex to replace cred_guard_mutex
>
> Then I think it is just going through the existing
> users of cred_guard_mutex and fixing them to use the new one.
>
> There really aren't that many users of cred_guard_mutex so we should be
> able to get through the easy ones fairly quickly. And anything that
> isn't easy we can wait until we have a good fix.
>
> The users of cred_guard_mutex that I saw were:
> fs/proc/base.c:
> proc_pid_attr_write
> do_io_accounting
> proc_pid_stack
> proc_pid_syscall
> proc_pid_personality
>
> perf_event_open
> mm_access
> kcmp
> pidfd_fget
> seccomp_set_mode_filter
>
> Bernd does this make sense to you?
>
> I think we can fix the seccomp/no_new_privs issue with some careful
> refactoring. We can probably do the same for ptrace but that appears
> to need a little lsm bug fixing.
>
Yes, for most functions the proposed "exec_update_mutex" is fine,
but ptrace_attach, seccomp_set_mode_filter and proc_pid_attr_write
need to be blocked for the whole exec duration, so
they need a second "mutex", with deadlock-detection as in my previous patch,
if I see that right.
Unfortunately only one of the two test cases can be fixed without the
second mutex; of course, the mm_access case is what causes the practical problem.
Currently for the unlimited user space delay, I have only the case of
a ptraced sibling thread on my radar, de_thread waits for the parent
to call wait in this case, that can literally take forever.
But I know that also PTRACE_CONT may be needed after a PTRACE_EVENT_EXIT.
Can you explain what else in the user space can go wrong to make an
unlimited delay in the execve?
Bernd.
Bernd Edlinger <[email protected]> writes:
> On 3/5/20 10:14 PM, Eric W. Biederman wrote:
>>
>> Bernd, everyone
>>
>> This is how I think the infrastructure change should look that makes way
>> for fixing this issue.
>>
>> - Correct the point of no return.
>> - Add a new mutex to replace cred_guard_mutex
>>
>> Then I think it is just going through the existing
>> users of cred_guard_mutex and fixing them to use the new one.
>>
>> There really aren't that many users of cred_guard_mutex so we should be
>> able to get through the easy ones fairly quickly. And anything that
>> isn't easy we can wait until we have a good fix.
>>
>> The users of cred_guard_mutex that I saw were:
>> fs/proc/base.c:
>> proc_pid_attr_write
>> do_io_accounting
>> proc_pid_stack
>> proc_pid_syscall
>> proc_pid_personality
>>
>> perf_event_open
>> mm_access
>> kcmp
>> pidfd_fget
>> seccomp_set_mode_filter
>>
>> Bernd does this make sense to you?
>>
>> I think we can fix the seccomp/no_new_privs issue with some careful
>> refactoring. We can probably do the same for ptrace but that appears
>> to need a little lsm bug fixing.
>>
>
> Yes, for most functions the proposed "exec_update_mutex" is fine,
> but we will need a longer-time block for ptrace_attach, seccomp_set_mode_filter
> and proc_pid_attr_write need to be blocked for the whole exec duration so
> they need a second "mutex", with deadlock-detection as in my previous patch,
> if I see that right.
So far I am leaving "cred_guard_mutex" as that second "mutex". My sense
is that when all we have left are the hard cases we can take those
cases out in detail, examine them and see what really can be done.
> Unfortunately only one of the two test cases can be fixed without the
> second mutex, of course the mm_access is what cause the practical problem.
Fixing the practical problems is foremost on my agenda.
That and clearing away enough of the noise that we can really focus on
the hard problems when we begin to address them.
That way I am hoping we can really solve some of these issues and make
them go away.
> Currently for the unlimited user space delay, I have only the case of
> a ptraced sibling thread on my radar, de_thread waits for the parent
> to call wait in this case, that can literally take forever.
> But I know that also PTRACE_CONT may be needed after a PTRACE_EVENT_EXIT.
>
> Can you explain what else in the user space can go wrong to make an
> unlimited delay in the execve?
Triggering a page fault. Depending on the backing store or possibly
with the use of userfaultfd that page fault can be delayed indefinitely
and pretty much be as bad as the ptrace case.
Eric
Bernd, everyone
This is how I think the infrastructure change should look that makes way
for fixing this issue.
- Clean up and reorder the code so code that can potentially wait
indefinitely for userspace comes at the beginning of flush_old_exec.
- Add a new mutex and take it after we have passed any potential
indefinite waits for userspace.
Then I think it is just going through the existing users of
cred_guard_mutex and fixing them to use the new one.
There really aren't that many users of cred_guard_mutex so we should be
able to get through the easy ones fairly quickly. And anything that
isn't easy we can wait until we have a good fix.
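The intended ordering can be sketched as a small table-driven model. This is a simplification under stated assumptions, not the real control flow: the phase names and helper functions below are hypothetical, and each phase is treated as either able or unable to wait on userspace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified exec phases, in the order the reordered code runs them. */
enum exec_phase {
	PHASE_PREPARE_CREDS,	/* cred_guard_mutex is taken here */
	PHASE_COPY_ARGS,	/* reads the arguments from userspace */
	PHASE_DE_THREAD,	/* may wait for a ptracer on PTRACE_EVENT_EXIT */
	PHASE_MM_RELEASE,	/* clear_child_tid / exit_robust_list may fault */
	PHASE_SWAP_MM,		/* exec_update_mutex is taken here */
	PHASE_INSTALL_CREDS,	/* both mutexes are dropped at the end of this */
	PHASE_COUNT
};

bool may_wait_for_userspace(enum exec_phase p)
{
	return p == PHASE_COPY_ARGS || p == PHASE_DE_THREAD ||
	       p == PHASE_MM_RELEASE;
}

/* Held for the whole of exec. */
bool cred_guard_held(enum exec_phase p)
{
	return p <= PHASE_INSTALL_CREDS;
}

/* Held only once all indefinite waits are behind us. */
bool exec_update_held(enum exec_phase p)
{
	return p >= PHASE_SWAP_MM && p <= PHASE_INSTALL_CREDS;
}

/* Count phases where a lock is held across a possible indefinite wait. */
int risky_phases(bool (*held)(enum exec_phase))
{
	int n = 0;

	assert(held != NULL);
	for (int p = 0; p < PHASE_COUNT; p++)
		if (held((enum exec_phase)p) && may_wait_for_userspace((enum exec_phase)p))
			n++;
	return n;
}
```

In this model risky_phases(cred_guard_held) counts three phases and risky_phases(exec_update_held) counts none, which is the property the new mutex is meant to have.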
The users of cred_guard_mutex that I saw were:
fs/proc/base.c:
proc_pid_attr_write
do_io_accounting
proc_pid_stack
proc_pid_syscall
proc_pid_personality
perf_event_open
mm_access
kcmp
pidfd_fget
seccomp_set_mode_filter
Bernd I think I have addressed the issues you pointed out in v1.
Please let me know if you see anything else.
Eric W. Biederman (5):
exec: Only compute current once in flush_old_exec
exec: Factor unshare_sighand out of de_thread and call it separately
exec: Move cleanup of posix timers on exec out of de_thread
exec: Move exec_mmap right after de_thread in flush_old_exec
exec: Add a exec_update_mutex to replace cred_guard_mutex
fs/exec.c | 65 ++++++++++++++++++++++++++++++--------------
include/linux/sched/signal.h | 9 +++++-
init/init_task.c | 1 +
kernel/fork.c | 1 +
4 files changed, 54 insertions(+), 22 deletions(-)
Make it clear that current only needs to be computed once in
flush_old_exec. This may have some efficiency improvements and it
makes the code easier to change.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..c3f34791f2f0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
*/
int flush_old_exec(struct linux_binprm * bprm)
{
+ struct task_struct *me = current;
int retval;
/*
* Make sure we have a private signal table and that
* we are unassociated from the previous thread group.
*/
- retval = de_thread(current);
+ retval = de_thread(me);
if (retval)
goto out;
@@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
bprm->mm = NULL;
set_fs(USER_DS);
- current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
+ me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
PF_NOFREEZE | PF_NO_SETAFFINITY);
flush_thread();
- current->personality &= ~bprm->per_clear;
+ me->personality &= ~bprm->per_clear;
/*
* We have to apply CLOEXEC before we change whether the process is
@@ -1305,7 +1306,7 @@ int flush_old_exec(struct linux_binprm * bprm)
* trying to access the should-be-closed file descriptors of a process
* undergoing exec(2).
*/
- do_close_on_exec(current->files);
+ do_close_on_exec(me->files);
return 0;
out:
--
2.25.0
This makes the code clearer and makes it easier to implement a mutex
that is not taken over any locations that may block indefinitely waiting
for userspace.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 39 ++++++++++++++++++++++++++-------------
1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index c3f34791f2f0..ff74b9a74d34 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
flush_itimer_signals();
#endif
+ BUG_ON(!thread_group_leader(tsk));
+ return 0;
+
+killed:
+ /* protects against exit_notify() and __exit_signal() */
+ read_lock(&tasklist_lock);
+ sig->group_exit_task = NULL;
+ sig->notify_count = 0;
+ read_unlock(&tasklist_lock);
+ return -EAGAIN;
+}
+
+
+static int unshare_sighand(struct task_struct *me)
+{
+ struct sighand_struct *oldsighand = me->sighand;
+
if (refcount_read(&oldsighand->count) != 1) {
struct sighand_struct *newsighand;
/*
@@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
write_lock_irq(&tasklist_lock);
spin_lock(&oldsighand->siglock);
- rcu_assign_pointer(tsk->sighand, newsighand);
+ rcu_assign_pointer(me->sighand, newsighand);
spin_unlock(&oldsighand->siglock);
write_unlock_irq(&tasklist_lock);
__cleanup_sighand(oldsighand);
}
-
- BUG_ON(!thread_group_leader(tsk));
return 0;
-
-killed:
- /* protects against exit_notify() and __exit_signal() */
- read_lock(&tasklist_lock);
- sig->group_exit_task = NULL;
- sig->notify_count = 0;
- read_unlock(&tasklist_lock);
- return -EAGAIN;
}
char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
@@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
int retval;
/*
- * Make sure we have a private signal table and that
- * we are unassociated from the previous thread group.
+ * Make this the only thread in the thread group.
*/
retval = de_thread(me);
if (retval)
goto out;
+ /*
+ * Make the signal table private.
+ */
+ retval = unshare_sighand(me);
+ if (retval)
+ goto out;
+
/*
* Must be called _before_ exec_mmap() as bprm->mm is
* not visibile until then. This also enables the update
--
2.25.0
These functions have very little to do with de_thread. Move them out
of de_thread and into flush_old_exec proper so it can be more clearly
seen what flush_old_exec is doing.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index ff74b9a74d34..215d86f77b63 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
/* we have changed execution domain */
tsk->exit_signal = SIGCHLD;
-#ifdef CONFIG_POSIX_TIMERS
- exit_itimers(sig);
- flush_itimer_signals();
-#endif
-
BUG_ON(!thread_group_leader(tsk));
return 0;
@@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval)
goto out;
+#ifdef CONFIG_POSIX_TIMERS
+ exit_itimers(me->signal);
+ flush_itimer_signals();
+#endif
+
/*
* Make the signal table private.
*/
--
2.25.0
I have read through the code in exec_mmap and I do not see anything
that depends on sighand or the sighand lock, or on signals in any way,
so this should be safe.
This rearrangement of code has two significant benefits. It makes
the determination of passing the point of no return by testing bprm->mm
accurate. All failures prior to that point in flush_old_exec are
either truly recoverable or they are fatal.
Further, this consolidates all of the possible indefinite waits for
userspace together at the top of flush_old_exec: the possible wait
for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
to be resolved in clear_child_tid, and the possible wait for a page
fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace
cred_guard_mutex that is not held over possible indefinite userspace
waits, which will allow removing deadlock scenarios from the kernel.
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 215d86f77b63..d820a7272a76 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
if (retval)
goto out;
-#ifdef CONFIG_POSIX_TIMERS
- exit_itimers(me->signal);
- flush_itimer_signals();
-#endif
-
- /*
- * Make the signal table private.
- */
- retval = unshare_sighand(me);
- if (retval)
- goto out;
-
/*
* Must be called _before_ exec_mmap() as bprm->mm is
* not visibile until then. This also enables the update
@@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
*/
bprm->mm = NULL;
+#ifdef CONFIG_POSIX_TIMERS
+ exit_itimers(me->signal);
+ flush_itimer_signals();
+#endif
+
+ /*
+ * Make the signal table private.
+ */
+ retval = unshare_sighand(me);
+ if (retval)
+ goto out;
+
set_fs(USER_DS);
me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
PF_NOFREEZE | PF_NO_SETAFFINITY);
--
2.25.0
The cred_guard_mutex is problematic. The cred_guard_mutex is held
over the userspace accesses as the arguments from userspace are read.
The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
threads are killed. The cred_guard_mutex is held over
"put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held
over possible indefinite waits for userspace.
Add exec_update_mutex that is only held over exec updating process
with the new contents of exec, so that code that needs not to be
confused by exec changing the mm and the cred in ways that can not
happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to
exec_udpate_mutex one by one. This lets us move forward while still
being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 9 +++++++++
include/linux/sched/signal.h | 9 ++++++++-
init/init_task.c | 1 +
kernel/fork.c | 1 +
4 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/fs/exec.c b/fs/exec.c
index d820a7272a76..ffeebb1f167b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
struct mm_struct *old_mm, *active_mm;
+ int ret;
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
@@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
return -EINTR;
}
}
+
+ ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+ if (ret)
+ return ret;
+
task_lock(tsk);
active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
@@ -1438,6 +1444,8 @@ static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
+ if (!bprm->mm)
+ mutex_unlock(&current->signal->exec_update_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
@@ -1487,6 +1495,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
+ mutex_unlock(&current->signal->exec_update_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 88050259c466..a29df79540ce 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
- * (notably. ptrace) */
+ * (notably. ptrace)
+ * Deprecated do not use in new code.
+ * Use exec_update_mutex instead.
+ */
+ struct mutex exec_update_mutex; /* Held while task_struct is being
+ * updated during exec, and may have
+ * inconsistent permissions.
+ */
} __randomize_layout;
/*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..bd403ed3e418 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@ static struct signal_struct init_signals = {
.multiprocess = HLIST_HEAD_INIT,
.rlim = INIT_RLIMITS,
.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+ .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
.cputimer = {
diff --git a/kernel/fork.c b/kernel/fork.c
index 60a1295f4384..12896a6ecee6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
mutex_init(&sig->cred_guard_mutex);
+ mutex_init(&sig->exec_update_mutex);
return 0;
}
--
2.25.0
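The unlock rule in free_bprm() above (drop exec_update_mutex only when bprm->mm has already been cleared) can be checked with a userspace toy model. Everything here is hypothetical scaffolding: the toy_* types stand in for the kernel structures, pointers are modelled as booleans, and the model assumes exec can fail either before exec_mmap() or some time after flush_old_exec() has cleared bprm->mm.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_mutex { bool held; };

static void toy_lock(struct toy_mutex *m)   { assert(!m->held); m->held = true; }
static void toy_unlock(struct toy_mutex *m) { assert(m->held);  m->held = false; }

struct toy_signal { struct toy_mutex cred_guard_mutex, exec_update_mutex; };
struct toy_bprm   { bool cred; bool mm; };	/* true models a non-NULL pointer */

enum fail_at { FAIL_NEVER, FAIL_BEFORE_EXEC_MMAP, FAIL_AFTER_EXEC_MMAP };

/* Mirrors the patched free_bprm(): bprm->mm == NULL means exec_mmap() ran. */
static void toy_free_bprm(struct toy_signal *sig, struct toy_bprm *bprm)
{
	if (bprm->cred) {
		if (!bprm->mm)
			toy_unlock(&sig->exec_update_mutex);
		toy_unlock(&sig->cred_guard_mutex);
		bprm->cred = false;
	}
}

bool locks_balanced_after_exec(enum fail_at where)
{
	struct toy_signal sig = { { false }, { false } };
	struct toy_bprm bprm = { false, false };

	/* prepare_bprm_creds() and bprm_mm_init() */
	toy_lock(&sig.cred_guard_mutex);
	bprm.cred = true;
	bprm.mm = true;

	if (where != FAIL_BEFORE_EXEC_MMAP) {
		/* exec_mmap() succeeds, then flush_old_exec() clears bprm->mm */
		toy_lock(&sig.exec_update_mutex);
		bprm.mm = false;

		if (where == FAIL_NEVER) {
			/* install_exec_creds() drops both and consumes the cred */
			toy_unlock(&sig.exec_update_mutex);
			toy_unlock(&sig.cred_guard_mutex);
			bprm.cred = false;
		}
	}

	toy_free_bprm(&sig, &bprm);
	return !sig.cred_guard_mutex.held && !sig.exec_update_mutex.held;
}
```

The asserts inside toy_lock() and toy_unlock() catch a double unlock, so the model would abort if free_bprm() dropped exec_update_mutex on a path where it was never taken.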
On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>
> The cred_guard_mutex is problematic. The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
or something?
I wonder if we also should mention that
it is held while waiting for the trace parent to
receive the exit code with "wait"?
> threads are killed. The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possible indefinite userspace waits for userspace.
>
> Add exec_update_mutex that is only held over exec updating process
Add ?
> with the new contents of exec, so that code that needs not to be
> confused by exec changing the mm and the cred in ways that can not
> happen during ordinary execution of a process.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_udpate_mutex one by one. This lets us move forward while still
s/udpate/update/
Bernd.
On 3/8/20 10:35 PM, Eric W. Biederman wrote:
>
> Make it clear that current only needs to be computed once in
> flush_old_exec. This may have some efficiency improvements and it
> makes the code easier to change.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/exec.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index db17be51b112..c3f34791f2f0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
> */
> int flush_old_exec(struct linux_binprm * bprm)
> {
> + struct task_struct *me = current;
> int retval;
>
> /*
> * Make sure we have a private signal table and that
> * we are unassociated from the previous thread group.
> */
> - retval = de_thread(current);
> + retval = de_thread(me);
> if (retval)
> goto out;
>
> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
> bprm->mm = NULL;
>
> set_fs(USER_DS);
> - current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> + me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> PF_NOFREEZE | PF_NO_SETAFFINITY);
I wonder if this line should be aligned with the previous?
Bernd.
On 3/8/20 10:34 PM, Eric W. Biederman wrote:
>
> Bernd, everyone
>
> This is how I think the infrastructure change should look that makes way
> for fixing this issue.
>
> - Cleanup and reorder the code so code that can potentially wait
> indefinitely for userspace comes at the beginning for flush_old_exec.
> - Add a new mutex and take it after we have passed any potential
> indefinite waits for userspace.
>
> Then I think it is just going through the existing users of
> cred_guard_mutex and fixing them to use the new one.
>
> There really aren't that many users of cred_guard_mutex so we should be
> able to get through the easy ones fairly quickly. And anything that
> isn't easy we can wait until we have a good fix.
>
> The users of cred_guard_mutex that I saw were:
> fs/proc/base.c:
> proc_pid_attr_write
> do_io_accounting
> proc_pid_stack
> proc_pid_syscall
> proc_pid_personality
>
> perf_event_open
> mm_access
> kcmp
> pidfd_fget
> seccomp_set_mode_filter
>
> Bernd I think I have addressed the issues you pointed out in v1.
> Please let me know if you see anything else.
>
Yes, looks good, except some nits.
Thanks
Bernd.
Bernd Edlinger <[email protected]> writes:
> On 3/8/20 10:35 PM, Eric W. Biederman wrote:
>>
>> Make it clear that current only needs to be computed once in
>> flush_old_exec. This may have some efficiency improvements and it
>> makes the code easier to change.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/exec.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index db17be51b112..c3f34791f2f0 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>> */
>> int flush_old_exec(struct linux_binprm * bprm)
>> {
>> + struct task_struct *me = current;
>> int retval;
>>
>> /*
>> * Make sure we have a private signal table and that
>> * we are unassociated from the previous thread group.
>> */
>> - retval = de_thread(current);
>> + retval = de_thread(me);
>> if (retval)
>> goto out;
>>
>> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
>> bprm->mm = NULL;
>>
>> set_fs(USER_DS);
>> - current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>> + me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>> PF_NOFREEZE | PF_NO_SETAFFINITY);
>
> I wonder if this line should be aligned with the previous?
In this case I don't think so. The style used for the second line is to
indent with tabs as far to the right as possible. I haven't changed that.
Further, mixing a change in indentation style with just a variable rename
will make the patch confusing to read because two things have to be
verified at the same time.
So while I see why you ask I think this bit needs to stay as is.
Eric
Bernd Edlinger <[email protected]> writes:
> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>
>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
^ over
>
> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
> or something?
Yes. Let me see if I can phrase that better.
> I wonder if we also should mention that
> it is held while waiting for the trace parent to
> receive the exit code with "wait"?
I don't think we have to spell out the details of how it all works,
unless that makes things clearer. Kernel developers can be expected
to figure out how the kernel works. The critical thing is that it is
an indefinite wait for userspace to take action.
But I will look.
>> threads are killed. The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over a possible indefinite userspace waits for userspace.
>>
>> Add exec_update_mutex that is only held over exec updating process
>
> Add ?
Yes. That is what the change does: add exec_update_mutex.
>> with the new contents of exec, so that code that needs not to be
>> confused by exec changing the mm and the cred in ways that can not
>> happen during ordinary execution of a process.
>>
>> The plan is to switch the users of cred_guard_mutex to
>> exec_udpate_mutex one by one. This lets us move forward while still
>
> s/udpate/update/
Yes. Very much so.
Eric
On 3/9/20 6:34 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> On 3/8/20 10:35 PM, Eric W. Biederman wrote:
>>>
>>> Make it clear that current only needs to be computed once in
>>> flush_old_exec. This may have some efficiency improvements and it
>>> makes the code easier to change.
>>>
>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>> ---
>>> fs/exec.c | 9 +++++----
>>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> index db17be51b112..c3f34791f2f0 100644
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>>> */
>>> int flush_old_exec(struct linux_binprm * bprm)
>>> {
>>> + struct task_struct *me = current;
>>> int retval;
>>>
>>> /*
>>> * Make sure we have a private signal table and that
>>> * we are unassociated from the previous thread group.
>>> */
>>> - retval = de_thread(current);
>>> + retval = de_thread(me);
>>> if (retval)
>>> goto out;
>>>
>>> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
>>> bprm->mm = NULL;
>>>
>>> set_fs(USER_DS);
>>> - current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>> + me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>> PF_NOFREEZE | PF_NO_SETAFFINITY);
>>
>> I wonder if this line should be aligned with the previous?
>
> In this case I don't think so. The style used for second line is indent
> with tabs as much as possible to the right. I haven't changed that.
>
> Further mixing a change in indentation style with just a variable rename
> will make the patch confusing to read because two things have to be
> verified at the same time.
>
> So while I see why you ask I think this bit needs to stay as is.
>
Ah, okay, I see.
Thanks for explaining this rule, I was not aware of it,
but I am still new here :)
Thanks
Bernd.
On 3/9/20 6:40 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>
>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>> over the userspace accesses as the arguments from userspace are read.
>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
> ^ over
>>
>> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
>> or something?
>
> Yes. Let me see if I can phrase that better.
>
>> I wonder if we also should mention that
>> it is held while waiting for the trace parent to
>> receive the exit code with "wait"?
>
> I don't think we have to spell out the details of how it all works,
> unless that makes things clearer. Kernel developers can be expected
> to figure out how the kernel works. The critical thing is that it is
> an indefinite wait for userspace to take action.
>
> But I will look.
>
>>> threads are killed. The cred_guard_mutex is held over
>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>
>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>> over a possible indefinite userspace waits for userspace.
>>>
>>> Add exec_update_mutex that is only held over exec updating process
>>
>> Add ?
>
> Yes. That is what the change does: add exec_update_mutex.
>
I just kind of missed the "subject" in this sentence,
like "This patch adds an exec_update_mutex that is ..."
but English is a foreign language for me, so it may be okay as is.
Bernd.
>>> with the new contents of exec, so that code that needs not to be
>>> confused by exec changing the mm and the cred in ways that can not
>>> happen during ordinary execution of a process.
>>>
>>> The plan is to switch the users of cred_guard_mutex to
>>> exec_udpate_mutex one by one. This lets us move forward while still
>>
>> s/udpate/update/
>
> Yes. Very much so.
>
> Eric
>
Bernd Edlinger <[email protected]> writes:
> On 3/9/20 6:40 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <[email protected]> writes:
>>
>>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>>
>>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>>> over the userspace accesses as the arguments from userspace are read.
>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>> ^ over
>>>
>>> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
>>> or something?
>>
>> Yes. Let me see if I can phrase that better.
>>
>>> I wonder if we also should mention that
>>> it is held while waiting for the trace parent to
>>> receive the exit code with "wait"?
>>
>> I don't think we have to spell out the details of how it all works,
>> unless that makes things clearer. Kernel developers can be expected
>> to figure out how the kernel works. The critical thing is that it is
>> an indefinite wait for userspace to take action.
>>
>> But I will look.
>>
>>>> threads are killed. The cred_guard_mutex is held over
>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>
>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>> over a possible indefinite userspace waits for userspace.
>>>>
>>>> Add exec_update_mutex that is only held over exec updating process
>>>
>>> Add ?
>>
>> Yes. That is what the change does: add exec_update_mutex.
>>
>
> I just kind of missed the "subject" in this sentence,
> like "This patch adds an exec_update_mutex that is ..."
> but english is a foreign language for me, so may be okay as is.
English has a lot of options. I think this is a stylistic difference.
Instead of being an observer and describing what the change does:
"This patch adds exec_update_mutex ..."
I was being there in the moment and saying/commanding what is happening:
"Add exec_update_mutex ..."
Using the more immediate form ends up with more concise and clearer
sentences.
Every one of my writing teachers in school emphasized that point
and I see how it works when I write things. But writing is hard and
I still tend toward long rambling sentences with many qualifiers that
confuse and detract from the point rather than make it clear what is
happening.
Eric
[email protected] (Eric W. Biederman) writes:
> Bernd Edlinger <[email protected]> writes:
>
>> On 3/9/20 6:40 PM, Eric W. Biederman wrote:
>>> Bernd Edlinger <[email protected]> writes:
>>>
>>>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>>>
>>>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>> ^ over
>>>>
>>>> ... is held while waiting for the trace parent to handle PTRACE_EVENT_EXIT
>>>> or something?
>>>
>>> Yes. Let me see if I can phrase that better.
>>>
>>>> I wonder if we also should mention that
>>>> it is held while waiting for the trace parent to
>>>> receive the exit code with "wait"?
>>>
>>> I don't think we have to spell out the details of how it all works,
>>> unless that makes things clearer. Kernel developers can be expected
>>> to figure out how the kernel works. The critical thing is that it is
>>> an indefinite wait for userspace to take action.
>>>
>>> But I will look.
>>>
>>>>> threads are killed. The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>> over a possible indefinite userspace waits for userspace.
>>>>>
>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>
>>>> Add ?
>>>
>>> Yes. That is what the change does: add exec_update_mutex.
>>>
>>
>> I just kind of missed the "subject" in this sentence,
>> like "This patch adds an exec_update_mutex that is ..."
>> but english is a foreign language for me, so may be okay as is.
>
> English has a lot of options. I think this is a stylistic difference.
>
> Instead of being an observer and describing what the change does:
> "This patch adds exec_update_mutex ..."
>
> I was being there in the moment and saying/commanding what is happening:
> "Add exec_update_mutex ..."
>
> Using the more immediate form ends up with more concise and clearer
> sentences.
>
> Every one of my writing teachers in school emphasized that point
> and I see how it works when I write things. But writing is hard and
> I still tend toward long rambling sentences with many qualifiers that
> confuse and detract from the point rather than make it clear what is
> happening.
And reading through it all now I can see your confusion. That
description of my changes was not well done. Reworking it now.
Eric
My rewritten change description reads as follows:
exec: Add a exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly
indefinite waits for userspace. The possilbe indefinite waits for
userspace that I have identified are: The cred_guard_mutex is held in
PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
cred_guard_mutex is held over "get_user(futex_offset, ...") in
exit_robust_list. The cred_guard_mutex held over copy_strings.
The functions get_user and put_user can trigger a page fault which can
potentially wait indefinitely in the case of userfaultfd or if
userspace implements part of the page fault path.
In any of those cases the userspace process that the kernel is waiting
for might userspace might make a different system call that winds up
taking the cred_guard_mutex and result in deadlock.
Holding a mutex over any of those possibly indefinite waits for
userspace does not appear necessary. Add exec_update_mutex that will
just cover updating the process during exec where the permissions and
the objects pointed to by the task struct may be out of sync.
The plan is to switch the users of cred_guard_mutex to
exec_udpate_mutex one by one. This lets us move forward while still
being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" <[email protected]>
Does that sound better?
Eric
On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>
> My rewritten change description reads as follows:
>
> exec: Add a exec_update_mutex to replace cred_guard_mutex
is this "an" exec_update_mutex?
>
> The cred_guard_mutex is problematic as it is held over possibly
> indefinite waits for userspace. The possilbe indefinite waits for
> userspace that I have identified are: The cred_guard_mutex is held in
> PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
> held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
> cred_guard_mutex is held over "get_user(futex_offset, ...") in
> exit_robust_list. The cred_guard_mutex held over copy_strings.
>
> The functions get_user and put_user can trigger a page fault which can
> potentially wait indefinitely in the case of userfaultfd or if
> userspace implements part of the page fault path.
>
> In any of those cases the userspace process that the kernel is waiting
> for might userspace might make a different system call that winds up
^-------------^
^- remove this
> taking the cred_guard_mutex and result in deadlock.
>
> Holding a mutex over any of those possibly indefinite waits for
> userspace does not appear necessary. Add exec_update_mutex that will
> just cover updating the process during exec where the permissions and
> the objects pointed to by the task struct may be out of sync.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_udpate_mutex one by one. This lets us move forward while still
^-- typo: update
> being careful and not introducing any regressions.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> Does that sound better?
>
almost done.
> Eric
>
Bernd Edlinger <[email protected]> writes:
> On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>>
>>
>> Does that sound better?
>>
>
> almost done.
I think this text is finally clean.
exec: Add exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly
indefinite waits for userspace. The possilbe indefinite waits for
userspace that I have identified are: The cred_guard_mutex is held in
PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
cred_guard_mutex is held over "get_user(futex_offset, ...") in
exit_robust_list. The cred_guard_mutex held over copy_strings.
The functions get_user and put_user can trigger a page fault which can
potentially wait indefinitely in the case of userfaultfd or if
userspace implements part of the page fault path.
In any of those cases the userspace process that the kernel is waiting
for might make a different system call that winds up taking the
cred_guard_mutex and result in deadlock.
Holding a mutex over any of those possibly indefinite waits for
userspace does not appear necessary. Add exec_update_mutex that will
just cover updating the process during exec where the permissions and
the objects pointed to by the task struct may be out of sync.
The plan is to switch the users of cred_guard_mutex to
exec_update_mutex one by one. This lets us move forward while still
being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" <[email protected]>
Bernd do you want to give me your Reviewed-by for this part of the
series?
After that do you think you can write the obvious patch for mm_access?
I will apply these changes to my tree and push them into linux-next.
Eric
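The locking argument in the change description above can be sketched in user space. This is only a toy model under stated assumptions: struct toy_lock, wide_lock, narrow_lock and every function name below are hypothetical stand-ins (wide_lock plays the role of cred_guard_mutex, narrow_lock the role of exec_update_mutex), and a try-lock replaces the tracer's blocking acquire so the example reports the would-be deadlock instead of hanging.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock: a hypothetical stand-in for a kernel mutex, not a real one. */
struct toy_lock {
	atomic_flag held;
};

static struct toy_lock wide_lock = { ATOMIC_FLAG_INIT };   /* held over the wait */
static struct toy_lock narrow_lock = { ATOMIC_FLAG_INIT }; /* held over the update only */

static bool toy_trylock(struct toy_lock *l)
{
	/* test_and_set returns the previous value: false means we got it */
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Models the tracer touching the tracee (think mm_access) before it
 * handles the exit event. Returns true if it could take the lock.
 */
static bool tracer_can_access(struct toy_lock *l)
{
	if (!toy_trylock(l))
		return false;
	toy_unlock(l);
	return true;
}

/* Old scheme: the lock is already held while exec waits for the tracer. */
bool exec_holding_lock_over_wait(void)
{
	bool tracer_ok;

	if (!toy_trylock(&wide_lock))
		return false;
	/* the indefinite wait for userspace happens here, lock held */
	tracer_ok = tracer_can_access(&wide_lock);
	toy_unlock(&wide_lock);
	return tracer_ok;
}

/* New scheme: wait first, then hold the lock only over the update. */
bool exec_with_narrow_lock(void)
{
	bool tracer_ok;

	/* the wait for userspace happens here, no lock held */
	tracer_ok = tracer_can_access(&narrow_lock);

	if (!toy_trylock(&narrow_lock))
		return false;
	/* the update of the process state would go here */
	toy_unlock(&narrow_lock);
	return tracer_ok;
}
```

In this model exec_holding_lock_over_wait() reports that the tracer could not make progress, while exec_with_narrow_lock() reports that it could. The real kernel paths are far more involved; the sketch only shows why the wait has to move outside the locked region.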
On 3/9/20 8:02 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>>>
>>>
>>> Does that sound better?
>>>
>>
>> almost done.
>
> I think this text is finally clean.
>
> exec: Add exec_update_mutex to replace cred_guard_mutex
>
> The cred_guard_mutex is problematic as it is held over possibly
> indefinite waits for userspace. The possilbe indefinite waits for
> userspace that I have identified are: The cred_guard_mutex is held in
> PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
> held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
> cred_guard_mutex is held over "get_user(futex_offset, ...") in
> exit_robust_list. The cred_guard_mutex held over copy_strings.
>
> The functions get_user and put_user can trigger a page fault which can
> potentially wait indefinitely in the case of userfaultfd or if
> userspace implements part of the page fault path.
>
> In any of those cases the userspace process that the kernel is waiting
> for might make a different system call that winds up taking the
> cred_guard_mutex and result in deadlock.
>
> Holding a mutex over any of those possibly indefinite waits for
> userspace does not appear necessary. Add exec_update_mutex that will
> just cover updating the process during exec where the permissions and
> the objects pointed to by the task struct may be out of sync.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_update_mutex one by one. This lets us move forward while still
> being careful and not introducing any regressions.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
I checked the URLs; they all work.
Just one last question: are these git references?
I can't find them in my linux git tree (cloned from Linus' git).
Sorry for being pedantic.
> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
>
> Bernd do you want to give me your Reviewed-by for this part of the
> series?
>
Sure, and for the other parts as well, of course.
Reviewed-by: Bernd Edlinger <[email protected]>
> After that do you think you can write the obvious patch for mm_access?
>
Yes, I can do that.
I also have some typo fixes for comments; I will make them extra patches as well.
I wonder if it is okay for the test case to include the ptrace_attach case, although
that is not yet passing?
Thanks
Bernd.
On 3/9/20 6:56 PM, Bernd Edlinger wrote:
> On 3/9/20 6:34 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <[email protected]> writes:
>>
>>> On 3/8/20 10:35 PM, Eric W. Biederman wrote:
>>>>
>>>> Make it clear that current only needs to be computed once in
>>>> flush_old_exec. This may have some efficiency improvements and it
>>>> makes the code easier to change.
>>>>
>>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>>> ---
>>>> fs/exec.c | 9 +++++----
>>>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>> index db17be51b112..c3f34791f2f0 100644
>>>> --- a/fs/exec.c
>>>> +++ b/fs/exec.c
>>>> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
>>>> */
>>>> int flush_old_exec(struct linux_binprm * bprm)
>>>> {
>>>> + struct task_struct *me = current;
>>>> int retval;
>>>>
>>>> /*
>>>> * Make sure we have a private signal table and that
>>>> * we are unassociated from the previous thread group.
>>>> */
>>>> - retval = de_thread(current);
>>>> + retval = de_thread(me);
>>>> if (retval)
>>>> goto out;
>>>>
>>>> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
>>>> bprm->mm = NULL;
>>>>
>>>> set_fs(USER_DS);
>>>> - current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>>> + me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>>>> PF_NOFREEZE | PF_NO_SETAFFINITY);
>>>
>>> I wonder if this line should be aligned with the previous?
>>
>> In this case I don't think so. The style used for second line is indent
>> with tabs as much as possible to the right. I haven't changed that.
>>
>> Further mixing a change in indentation style with just a variable rename
>> will make the patch confusing to read because two things have to be
>> verified at the same time.
>>
>> So while I see why you ask I think this bit needs to stay as is.
>>
>
> Ah, okay, I see.
> Thanks for explaining this rule, I was not aware of it,
> but I am still new here :)
>
Reviewed-by: Bernd Edlinger <[email protected]>
Bernd.
On 3/8/20 10:36 PM, Eric W. Biederman wrote:
>
> This makes the code clearer and makes it easier to implement a mutex
> that is not taken over any locations that may block indefinitely waiting
> for userspace.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Reviewed-by: Bernd Edlinger <[email protected]>
Bernd.
> ---
> fs/exec.c | 39 ++++++++++++++++++++++++++-------------
> 1 file changed, 26 insertions(+), 13 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index c3f34791f2f0..ff74b9a74d34 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
> flush_itimer_signals();
> #endif
>
> + BUG_ON(!thread_group_leader(tsk));
> + return 0;
> +
> +killed:
> + /* protects against exit_notify() and __exit_signal() */
> + read_lock(&tasklist_lock);
> + sig->group_exit_task = NULL;
> + sig->notify_count = 0;
> + read_unlock(&tasklist_lock);
> + return -EAGAIN;
> +}
> +
> +
> +static int unshare_sighand(struct task_struct *me)
> +{
> + struct sighand_struct *oldsighand = me->sighand;
> +
> if (refcount_read(&oldsighand->count) != 1) {
> struct sighand_struct *newsighand;
> /*
> @@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
>
> write_lock_irq(&tasklist_lock);
> spin_lock(&oldsighand->siglock);
> - rcu_assign_pointer(tsk->sighand, newsighand);
> + rcu_assign_pointer(me->sighand, newsighand);
> spin_unlock(&oldsighand->siglock);
> write_unlock_irq(&tasklist_lock);
>
> __cleanup_sighand(oldsighand);
> }
> -
> - BUG_ON(!thread_group_leader(tsk));
> return 0;
> -
> -killed:
> - /* protects against exit_notify() and __exit_signal() */
> - read_lock(&tasklist_lock);
> - sig->group_exit_task = NULL;
> - sig->notify_count = 0;
> - read_unlock(&tasklist_lock);
> - return -EAGAIN;
> }
>
> char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
> @@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
> int retval;
>
> /*
> - * Make sure we have a private signal table and that
> - * we are unassociated from the previous thread group.
> + * Make this the only thread in the thread group.
> */
> retval = de_thread(me);
> if (retval)
> goto out;
>
> + /*
> + * Make the signal table private.
> + */
> + retval = unshare_sighand(me);
> + if (retval)
> + goto out;
> +
> /*
> * Must be called _before_ exec_mmap() as bprm->mm is
> * not visibile until then. This also enables the update
>
On 3/8/20 10:36 PM, Eric W. Biederman wrote:
>
> These functions have very little to do with de_thread. Move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Reviewed-by: Bernd Edlinger <[email protected]>
Bernd.
> ---
> fs/exec.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index ff74b9a74d34..215d86f77b63 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
> /* we have changed execution domain */
> tsk->exit_signal = SIGCHLD;
>
> -#ifdef CONFIG_POSIX_TIMERS
> - exit_itimers(sig);
> - flush_itimer_signals();
> -#endif
> -
> BUG_ON(!thread_group_leader(tsk));
> return 0;
>
> @@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
> if (retval)
> goto out;
>
> +#ifdef CONFIG_POSIX_TIMERS
> + exit_itimers(me->signal);
> + flush_itimer_signals();
> +#endif
> +
> /*
> * Make the signal table private.
> */
>
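Taken together, the two patches quoted above leave flush_old_exec with a visible sequence: de_thread, then the POSIX timer cleanup, then unshare_sighand. The sketch below models only that ordering and the early return on error. The stubs, the order[] log and the fail_step switch are hypothetical test scaffolding, not kernel code, and the timer step is treated as unable to fail because the quoted diff does not check a return value there.

```c
#include <stddef.h>

enum step { DE_THREAD, EXIT_TIMERS, UNSHARE_SIGHAND, NSTEPS };

enum step order[NSTEPS];  /* log of the steps that ran, in order */
size_t nr_steps;          /* number of valid entries in order[] */
int fail_step = -1;       /* set to a step value to make that step fail */

/* Record the step and report an error if it was selected to fail. */
static int run_step(enum step s)
{
	order[nr_steps++] = s;
	return (int)s == fail_step ? -1 : 0;
}

/* Mirrors only the call order shown in the quoted diffs. */
int flush_old_exec_sketch(void)
{
	int retval;

	nr_steps = 0;

	/* Make this the only thread in the thread group. */
	retval = run_step(DE_THREAD);
	if (retval)
		return retval;

	/* Timer cleanup, moved out of de_thread by the second patch. */
	(void)run_step(EXIT_TIMERS);

	/* Make the signal table private. */
	retval = run_step(UNSHARE_SIGHAND);
	if (retval)
		return retval;

	return 0;
}
```

Setting fail_step to DE_THREAD before the call shows that neither later step runs when de_thread fails, which matches the "goto out" in the quoted code.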
On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
> > On 3/9/20 7:36 PM, Eric W. Biederman wrote:
> >>
> >>
> >> Does that sound better?
> >>
> >
> > almost done.
>
> I think this text is finally clean.
>
> exec: Add exec_update_mutex to replace cred_guard_mutex
>
> The cred_guard_mutex is problematic as it is held over possibly
> indefinite waits for userspace. The possilbe indefinite waits for
-------------------------------------------^^^^^^^^ possible?
--
ldv
On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>
> I have read through the code in exec_mmap and I do not see anything
> that depends on sighand or the sighand lock, or on signals in anyway
> so this should be safe.
>
> This rearrangement of code has two siginficant benefits. It makes
^ typo: significant
> the determination of passing the point of no return by testing bprm->mm
> accurate. All failures prior to that point in flush_old_exec are
> either truly recoverable or they are fatal.
>
> Futher this consolidates all of the possible indefinite waits for
  ^ typo: Further
> userspace together at the top of flush_old_exec. The possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.
>
> This consolidation allows the creation of a mutex to replace
> cred_guard_mutex that is not held of possible indefinite userspace
can you also reword this "held of" thing here as well?
Thanks
Bernd.
Bernd Edlinger <[email protected]> writes:
> On 3/9/20 8:02 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <[email protected]> writes:
>>
>>> On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>>>>
>>>>
>>>> Does that sound better?
>>>>
>>>
>>> almost done.
>>
>> I think this text is finally clean.
>>
>> exec: Add exec_update_mutex to replace cred_guard_mutex
>>
>> The cred_guard_mutex is problematic as it is held over possibly
>> indefinite waits for userspace. The possilbe indefinite waits for
>> userspace that I have identified are: The cred_guard_mutex is held in
>> PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
>> held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
>> cred_guard_mutex is held over "get_user(futex_offset, ...") in
>> exit_robust_list. The cred_guard_mutex held over copy_strings.
>>
>> The functions get_user and put_user can trigger a page fault which can
>> potentially wait indefinitely in the case of userfaultfd or if
>> userspace implements part of the page fault path.
>>
>> In any of those cases the userspace process that the kernel is waiting
>> for might make a different system call that winds up taking the
>> cred_guard_mutex and result in deadlock.
>>
>> Holding a mutex over any of those possibly indefinite waits for
>> userspace does not appear necessary. Add exec_update_mutex that will
>> just cover updating the process during exec where the permissions and
>> the objects pointed to by the task struct may be out of sync.
>>
>> The plan is to switch the users of cred_guard_mutex to
>> exec_update_mutex one by one. This lets us move forward while still
>> being careful and not introducing any regressions.
>>
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>
> I checked the URLs; they all work.
> Just one last question: are these git references?
> I can't find them in my linux git tree (cloned from Linus' git).
>
> Sorry for being pedantic.
You have to track down tglx's historical git tree from when everything
was in bitkeeper.
But yes, they are git references and yes, they work. It is just that that
part of the history is not in linux.git.
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>
>>
>> Bernd do you want to give me your Reviewed-by for this part of the
>> series?
>>
>
> Sure also the other parts of course.
>
> Reviewed-by: Bernd Edlinger <[email protected]>
>
>> After that do you think you can write the obvious patch for mm_access?
>>
>
> Yes, I can do that.
> I also have some typo fixes for comments; I will make them extra patches as well.
>
> I wonder if it is okay for the test case to include the ptrace_attach case, although
> that is not yet passing?
It is an existing kernel bug that it doesn't pass.
My sense is that if you include it as a separate patch, then if it is a
problem for someone we can identify it easily via bisect and do
whatever is appropriate.
Eric
Bernd Edlinger <[email protected]> writes:
> On 3/9/20 8:02 PM, Eric W. Biederman wrote:
>>
>>
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>
> I checked the urls they all work.
> Just one last question, are these git references?
> I can't find them in my linux git tree (cloned from linus' git)?
I will add this tag to help people figure out what is going on.
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Eric
"Dmitry V. Levin" <[email protected]> writes:
> On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
>> Bernd Edlinger <[email protected]> writes:
>>
>> > On 3/9/20 7:36 PM, Eric W. Biederman wrote:
>> >>
>> >>
>> >> Does that sound better?
>> >>
>> >
>> > almost done.
>>
>> I think this text is finally clean.
>>
>> exec: Add exec_update_mutex to replace cred_guard_mutex
>>
>> The cred_guard_mutex is problematic as it is held over possibly
>> indefinite waits for userspace. The possilbe indefinite waits for
>
> -------------------------------------------^^^^^^^^ possible?
Yes. Thank you. Fixed.
Eric
Bernd Edlinger <[email protected]> writes:
> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>
>> This consolidation allows the creation of a mutex to replace
>> cred_guard_mutex that is not held of possible indefinite userspace
>
> can you also reword this "held of" thing here as well?
Done:
exec: Move exec_mmap right after de_thread in flush_old_exec
I have read through the code in exec_mmap and I do not see anything
that depends on sighand or the sighand lock, or on signals in anyway
so this should be safe.
This rearrangement of code has two siginficant benefits. It makes
the determination of passing the point of no return by testing bprm->mm
accurate. All failures prior to that point in flush_old_exec are
either truly recoverable or they are fatal.
Futher this consolidates all of the possible indefinite waits for
userspace together at the top of flush_old_exec. The possible wait
for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
to be resolved in clear_child_tid, and the possible wait for a page
fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace
cred_guard_mutex that is not held over possible indefinite userspace
waits. Which will allow removing deadlock scenarios from the kernel.
Reviewed-by: Bernd Edlinger <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
Eric
On 3/9/20 8:45 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> On 3/8/20 10:38 PM, Eric W. Biederman wrote:
>>>
>>> This consolidation allows the creation of a mutex to replace
>>> cred_guard_mutex that is not held of possible indefinite userspace
>>
>> can you also reword this "held of" thing here as well?
>
> Done:
>
> exec: Move exec_mmap right after de_thread in flush_old_exec
>
> I have read through the code in exec_mmap and I do not see anything
> that depends on sighand or the sighand lock, or on signals in anyway
> so this should be safe.
>
> This rearrangement of code has two siginficant benefits. It makes
watch out: sig_i_nificant
> the determination of passing the point of no return by testing bprm->mm
> accurate. All failures prior to that point in flush_old_exec are
> either truly recoverable or they are fatal.
>
> Futher this consolidates all of the possible indefinite waits for
Add some r to "Futher", please?
> userspace together at the top of flush_old_exec. The possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.
>
> This consolidation allows the creation of a mutex to replace
> cred_guard_mutex that is not held over possible indefinite userspace
> waits. Which will allow removing deadlock scenarios from the kernel.
>
> Reviewed-by: Bernd Edlinger <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> Eric
>
Ok. I think this has it sorted:
exec: Move exec_mmap right after de_thread in flush_old_exec
I have read through the code in exec_mmap and I do not see anything
that depends on sighand or the sighand lock, or on signals in anyway
so this should be safe.
This rearrangement of code has two significant benefits. It makes
the determination of passing the point of no return by testing bprm->mm
accurate. All failures prior to that point in flush_old_exec are
either truly recoverable or they are fatal.
Further this consolidates all of the possible indefinite waits for
userspace together at the top of flush_old_exec. The possible wait
for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
to be resolved in clear_child_tid, and the possible wait for a page
fault in exit_robust_list.
This consolidation allows the creation of a mutex to replace
cred_guard_mutex that is not held over possible indefinite userspace
waits. Which will allow removing deadlock scenarios from the kernel.
Reviewed-by: Bernd Edlinger <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
I don't think I usually have this many typos. Sigh.
Eric
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
>
> These functions have very little to do with de_thread. Move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/exec.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index ff74b9a74d34..215d86f77b63 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
While you're cleaning up de_thread(), wouldn't it be good to also take
the opportunity to remove the task argument from de_thread()? It's only
ever used with current. This could be done in one of your patches or as a
separate patch.
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..ee108707e4b0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1061,8 +1061,9 @@ static int exec_mmap(struct mm_struct *mm)
* disturbing other processes. (Other processes might share the signal
* table via the CLONE_SIGHAND option to clone().)
*/
-static int de_thread(struct task_struct *tsk)
+static int de_thread(void)
{
+ struct task_struct *tsk = current;
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
@@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
* Make sure we have a private signal table and that
* we are unassociated from the previous thread group.
*/
- retval = de_thread(current);
+ retval = de_thread();
if (retval)
goto out;
On 3/9/20 8:58 PM, Eric W. Biederman wrote:
>
> Ok. I think this has it sorted:
>
> exec: Move exec_mmap right after de_thread in flush_old_exec
>
> I have read through the code in exec_mmap and I do not see anything
> that depends on sighand or the sighand lock, or on signals in anyway
> so this should be safe.
>
> This rearrangement of code has two significant benefits. It makes
> the determination of passing the point of no return by testing bprm->mm
> accurate. All failures prior to that point in flush_old_exec are
> either truly recoverable or they are fatal.
>
> Further this consolidates all of the possible indefinite waits for
> userspace together at the top of flush_old_exec. The possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.
>
> This consolidation allows the creation of a mutex to replace
> cred_guard_mutex that is not held over possible indefinite userspace
> waits. Which will allow removing deadlock scenarios from the kernel.
>
> Reviewed-by: Bernd Edlinger <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
>
> I don't think I usually have this many typos. Sigh.
>
OK.
never mind,
Bernd.
Christian Brauner <[email protected]> writes:
> On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
>>
>> These functions have very little to do with de_thread. Move them out
>> of de_thread and into flush_old_exec proper so it can be more clearly
>> seen what flush_old_exec is doing.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/exec.c | 10 +++++-----
>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index ff74b9a74d34..215d86f77b63 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
>
> While you're cleaning up de_thread(), wouldn't it be good to also take
> the opportunity to remove the task argument from de_thread()? It's only
> ever used with current. This could be done in one of your patches or as a
> separate patch.
How does that affect the code generation?
My sense is that computing current once in flush_old_exec is much
better than computing it in each function flush_old_exec calls.
I remember that computing current used to be not expensive but
noticeable.
For clarity I can see renaming tsk to me, so that it is clear we are
talking about the current process and not some arbitrary process.
And for clarity my goal here is not to clean up de_thread. Though
I don't mind that result. My goal is to get the extra work out of
de_thread so we can do process tear down cleanups that are safe
according to the ordinary process rules, before taking a mutex that
protects exec mucking with all of the state in exec.
Eric
> diff --git a/fs/exec.c b/fs/exec.c
> index db17be51b112..ee108707e4b0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1061,8 +1061,9 @@ static int exec_mmap(struct mm_struct *mm)
> * disturbing other processes. (Other processes might share the signal
> * table via the CLONE_SIGHAND option to clone().)
> */
> -static int de_thread(struct task_struct *tsk)
> +static int de_thread(void)
> {
> + struct task_struct *tsk = current;
> struct signal_struct *sig = tsk->signal;
> struct sighand_struct *oldsighand = tsk->sighand;
> spinlock_t *lock = &oldsighand->siglock;
> @@ -1266,7 +1267,7 @@ int flush_old_exec(struct linux_binprm * bprm)
> * Make sure we have a private signal table and that
> * we are unassociated from the previous thread group.
> */
> - retval = de_thread(current);
> + retval = de_thread();
> if (retval)
> goto out;
On Mon, Mar 09, 2020 at 03:06:46PM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
> >>
> >> These functions have very little to do with de_thread move them out
> >> of de_thread an into flush_old_exec proper so it can be more clearly
> >> seen what flush_old_exec is doing.
> >>
> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >> ---
> >> fs/exec.c | 10 +++++-----
> >> 1 file changed, 5 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/exec.c b/fs/exec.c
> >> index ff74b9a74d34..215d86f77b63 100644
> >> --- a/fs/exec.c
> >> +++ b/fs/exec.c
> >> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
> >
> > While you're cleaning up de_thread() wouldn't it be good to also take
> > the opportunity and remove the task argument from de_thread(). It's only
> > ever used with current. Could be done in one of your patches or as a
> > separate patch.
>
> How does that affect the code generation?
The same way renaming "tsk" to "me" does.
>
> My sense is that computing current once in flush_old_exec is much
> better than computing it in each function flush_old_exec calls.
> I remember that computing current used to be not expensive but
> noticable.
>
> For clarity I can see renaming tsk to me. So that it is clear we are
> talking about the current process, and not some arbitrary process.
For clarity: de_thread() uses "tsk", giving the impression that any
task can be dethreaded, while it's only ever used with current. It's just
a suggestion; since you're doing the rename tsk->me anyway, it would fit
with the series. You do whatever you want though.
(I just remember that the same request was made once to changes I did:
Don't pass current as arg when it's the only task passed.)
Bernd Edlinger <[email protected]> writes:
> On 3/9/20 8:58 PM, Eric W. Biederman wrote:
>>
>> Ok. I think this has it sorted:
>>
>> exec: Move exec_mmap right after de_thread in flush_old_exec
>>
>> I have read through the code in exec_mmap and I do not see anything
>> that depends on sighand or the sighand lock, or on signals in anyway
>> so this should be safe.
>>
>> This rearrangement of code has two significant benefits. It makes
>> the determination of passing the point of no return by testing bprm->mm
>> accurate. All failures prior to that point in flush_old_exec are
>> either truly recoverable or they are fatal.
>>
>> Further this consolidates all of the possible indefinite waits for
>> userspace together at the top of flush_old_exec. The possible wait
>> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>> to be resolved in clear_child_tid, and the possible wait for a page
>> fault in exit_robust_list.
>>
>> This consolidation allows the creation of a mutex to replace
>> cred_guard_mutex that is not held over possible indefinite userspace
>> waits. Which will allow removing deadlock scenarios from the kernel.
>>
>> Reviewed-by: Bernd Edlinger <[email protected]>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>
>>
>> I don't think I usually have this many typos. Sigh.
>>
>
> OK.
>
> never mind,
No no. I really appreciate all of the scrutiny. Frequently the issues
that will produce typos or poor patch descriptions are also the issues
that will produce sloppy patches as well. I was just frustrated with
myself.
Eric
Christian Brauner <[email protected]> writes:
> On Mon, Mar 09, 2020 at 03:06:46PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <[email protected]> writes:
>>
>> > On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
>> >>
>> >> These functions have very little to do with de_thread move them out
>> >> of de_thread an into flush_old_exec proper so it can be more clearly
>> >> seen what flush_old_exec is doing.
>> >>
>> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> >> ---
>> >> fs/exec.c | 10 +++++-----
>> >> 1 file changed, 5 insertions(+), 5 deletions(-)
>> >>
>> >> diff --git a/fs/exec.c b/fs/exec.c
>> >> index ff74b9a74d34..215d86f77b63 100644
>> >> --- a/fs/exec.c
>> >> +++ b/fs/exec.c
>> >> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
>> >
>> > While you're cleaning up de_thread() wouldn't it be good to also take
>> > the opportunity and remove the task argument from de_thread(). It's only
>> > ever used with current. Could be done in one of your patches or as a
>> > separate patch.
>>
>> How does that affect the code generation?
>
> The same way renaming "tsk" to "me" does.
>
>>
>> My sense is that computing current once in flush_old_exec is much
>> better than computing it in each function flush_old_exec calls.
>> I remember that computing current used to be not expensive but
>> noticable.
>>
>> For clarity I can see renaming tsk to me. So that it is clear we are
>> talking about the current process, and not some arbitrary process.
>
> For clarity since de_thread() uses "tsk" giving the impression that any
> task can be dethreaded while it's only ever used with current. It's just
> a suggestion since you're doing the rename tsk->me anyway it would fit
> with the series. You do whatever you want though.
> (I just remember that the same request was made once to changes I did:
> Don't pass current as arg when it's the only task passed.)
That's fair.
And I completely agree that we should at least rename tsk to me.
Just for clarity.
My apologies if I am a little short. My little son has been an extra
handful lately.
Eric
On Mon, Mar 09, 2020 at 03:48:55PM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Mon, Mar 09, 2020 at 03:06:46PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <[email protected]> writes:
> >>
> >> > On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
> >> >>
> >> >> These functions have very little to do with de_thread move them out
> >> >> of de_thread an into flush_old_exec proper so it can be more clearly
> >> >> seen what flush_old_exec is doing.
> >> >>
> >> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >> >> ---
> >> >> fs/exec.c | 10 +++++-----
> >> >> 1 file changed, 5 insertions(+), 5 deletions(-)
> >> >>
> >> >> diff --git a/fs/exec.c b/fs/exec.c
> >> >> index ff74b9a74d34..215d86f77b63 100644
> >> >> --- a/fs/exec.c
> >> >> +++ b/fs/exec.c
> >> >> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
> >> >
> >> > While you're cleaning up de_thread() wouldn't it be good to also take
> >> > the opportunity and remove the task argument from de_thread(). It's only
> >> > ever used with current. Could be done in one of your patches or as a
> >> > separate patch.
> >>
> >> How does that affect the code generation?
> >
> > The same way renaming "tsk" to "me" does.
> >
> >>
> >> My sense is that computing current once in flush_old_exec is much
> >> better than computing it in each function flush_old_exec calls.
> >> I remember that computing current used to be not expensive but
> >> noticable.
> >>
> >> For clarity I can see renaming tsk to me. So that it is clear we are
> >> talking about the current process, and not some arbitrary process.
> >
> > For clarity since de_thread() uses "tsk" giving the impression that any
> > task can be dethreaded while it's only ever used with current. It's just
> > a suggestion since you're doing the rename tsk->me anyway it would fit
> > with the series. You do whatever you want though.
> > (I just remember that the same request was made once to changes I did:
> > Don't pass current as arg when it's the only task passed.)
>
> That's fair.
>
> And I completely agree that we should at least rename tsk to me.
> Just for clarity.
>
> My apologies if I am a little short. My little son has been an extra
> handful lately.
No worries, stress is a thing most of us know too well.
Christian
This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread is running.
I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated. They send ptrace events, but strace is no
longer able to respond, since it is blocked in mm_access.
The deadlock always happens when strace needs to access the
tracee's process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
strace D 0 30614 30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
expect D 0 31933 30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This changes mm_access to use the new exec_update_mutex
instead of cred_guard_mutex.
This patch is based on the following patch by Eric W. Biederman:
"[PATCH 0/5] Infrastructure to allow fixing exec deadlocks"
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Bernd Edlinger <[email protected]>
---
kernel/fork.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index c12595a..5720ff3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
struct mm_struct *mm;
int err;
- err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ err = mutex_lock_killable(&task->signal->exec_update_mutex);
if (err)
return ERR_PTR(err);
@@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
mmput(mm);
mm = ERR_PTR(-EACCES);
}
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->exec_update_mutex);
return mm;
}
--
1.9.1
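The effect of the lock switch in the patch above can be pictured with a
small userspace analogy. This is only a sketch using pthread mutexes, not
kernel code: struct signal_sketch and the helper names are hypothetical,
and pthread_mutex_trylock stands in for "would this caller block?".
It assumes, as the series describes, that cred_guard_mutex stays held
while exec waits for the tracer and exec_update_mutex does not.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the two per-process mutexes. */
struct signal_sketch {
	pthread_mutex_t cred_guard;  /* held across exec, including long waits */
	pthread_mutex_t exec_update; /* held only while mm and creds are switched */
};

static void signal_sketch_init(struct signal_sketch *sig)
{
	pthread_mutex_init(&sig->cred_guard, NULL);
	pthread_mutex_init(&sig->exec_update, NULL);
}

/* Exec side: enter the phase that may wait indefinitely for the tracer. */
static void exec_begin_wait(struct signal_sketch *sig)
{
	int rc = pthread_mutex_lock(&sig->cred_guard);

	assert(rc == 0);
	(void)rc;
}

/* Exec side: the wait is over. */
static void exec_end_wait(struct signal_sketch *sig)
{
	pthread_mutex_unlock(&sig->cred_guard);
}

/* Reader side before the patch: returns 0, or EBUSY if it would block. */
static int mm_access_old(struct signal_sketch *sig)
{
	int err = pthread_mutex_trylock(&sig->cred_guard);

	if (!err)
		pthread_mutex_unlock(&sig->cred_guard);
	return err;
}

/* Reader side after the patch: only the narrow mutex is needed. */
static int mm_access_new(struct signal_sketch *sig)
{
	int err = pthread_mutex_trylock(&sig->exec_update);

	if (!err)
		pthread_mutex_unlock(&sig->exec_update);
	return err;
}
```

While exec_begin_wait() is in effect, mm_access_old() reports EBUSY and
mm_access_new() succeeds, which is the case strace was stuck in.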
This removes a duplicate "a" in the comment in process_vm_rw_core.
Signed-off-by: Bernd Edlinger <[email protected]>
---
mm/process_vm_access.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7b..b3e6eb5 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
if (!mm || IS_ERR(mm)) {
rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
/*
- * Explicitly map EACCES to EPERM as EPERM is a more a
+ * Explicitly map EACCES to EPERM as EPERM is a more
* appropriate error code for process_vw_readv/writev
*/
if (rc == -EACCES)
--
1.9.1
This adds test cases for ptrace deadlocks.
Additionally fixes a compile problem in get_syscall_info.c,
observed with gcc-4.8.4:
get_syscall_info.c: In function 'get_syscall_info':
get_syscall_info.c:93:3: error: 'for' loop initial declarations are only
allowed in C99 mode
for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) {
^
get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile
your code
Signed-off-by: Bernd Edlinger <[email protected]>
---
tools/testing/selftests/ptrace/Makefile | 4 +-
tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
2 files changed, 88 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index c0b7f89..2f1f532 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -iquote../../../../include/uapi -Wall
+CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
-TEST_GEN_PROGS := get_syscall_info peeksiginfo
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
new file mode 100644
index 0000000..4db327b
--- /dev/null
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (c) 2020 Bernd Edlinger <[email protected]>
+ * All rights reserved.
+ *
+ * Check whether /proc/$pid/mem can be accessed without causing deadlocks
+ * when de_thread is blocked with ->cred_guard_mutex held.
+ */
+
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <signal.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+
+static void *thread(void *arg)
+{
+ ptrace(PTRACE_TRACEME, 0, 0L, 0L);
+ return NULL;
+}
+
+TEST(vmaccess)
+{
+ int f, pid = fork();
+ char mm[64];
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("true", "true", NULL);
+ }
+
+ sleep(1);
+ sprintf(mm, "/proc/%d/mem", pid);
+ f = open(mm, O_RDONLY);
+ ASSERT_GE(f, 0);
+ close(f);
+ f = kill(pid, SIGCONT);
+ ASSERT_EQ(f, 0);
+}
+
+TEST(attach)
+{
+ int s, k, pid = fork();
+
+ if (!pid) {
+ pthread_t pt;
+
+ pthread_create(&pt, NULL, thread, NULL);
+ pthread_join(pt, NULL);
+ execlp("sleep", "sleep", "2", NULL);
+ }
+
+ sleep(1);
+ k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ ASSERT_EQ(errno, EAGAIN);
+ ASSERT_EQ(k, -1);
+ k = waitpid(-1, &s, WNOHANG);
+ ASSERT_NE(k, -1);
+ ASSERT_NE(k, 0);
+ ASSERT_NE(k, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 0);
+ sleep(1);
+ k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ ASSERT_EQ(k, 0);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
+ k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+ ASSERT_EQ(k, 0);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFEXITED(s), 1);
+ ASSERT_EQ(WEXITSTATUS(s), 0);
+ k = waitpid(-1, NULL, 0);
+ ASSERT_EQ(k, -1);
+ ASSERT_EQ(errno, ECHILD);
+}
+
+TEST_HARNESS_MAIN
--
1.9.1
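For readers unfamiliar with the interface the vmaccess test pokes at, here
is a minimal sketch of reading memory through /proc/<pid>/mem, using the
calling process itself so that no ptrace permission is needed. The helper
name read_own_mem is hypothetical, and the sketch assumes a Linux system
with /proc mounted and a 64-bit off_t.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read len bytes located at addr in this process by going through
 * /proc/self/mem. Returns the number of bytes read, or -1 on error.
 * Assumes off_t is wide enough to hold a user-space address.
 */
static ssize_t read_own_mem(const void *addr, void *out, size_t len)
{
	ssize_t n;
	int fd;

	assert(addr && out);

	fd = open("/proc/self/mem", O_RDONLY);
	if (fd < 0)
		return -1;

	/* The file offset is interpreted as a virtual address. */
	n = pread(fd, out, len, (off_t)(uintptr_t)addr);
	close(fd);
	return n;
}
```

The selftest above only checks that open() on the child's mem file returns
at all, since that is the call that used to block while the child was
stuck in de_thread.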
This removes an outdated comment in prepare_kernel_cred.
There is no "cred_replace_mutex" any more, so the comment must
go away.
Signed-off-by: Bernd Edlinger <[email protected]>
---
kernel/cred.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985..71a7926 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -675,8 +675,6 @@ void __init cred_init(void)
* The caller may change these controls afterwards if desired.
*
* Returns the new credentials or NULL if out of memory.
- *
- * Does not take, and does not return holding current->cred_replace_mutex.
*/
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
--
1.9.1
This is a follow up on Eric's patch series to
fix the deadlocks observed when ptracing multi-threaded
applications that call execve.
This fixes the simple and most important case where
the cred_guard_mutex causes strace to deadlock.
This also adds a test case (which is only partially
fixed so far, the rest of the fixes will follow
soon).
Two trivial comment fixes are also included.
Bernd Edlinger (4):
exec: Fix a deadlock in ptrace
selftests/ptrace: add test cases for dead-locks
mm: docs: Fix a comment in process_vm_rw_core
kernel: doc: remove outdated comment in prepare_kernel_cred
kernel/cred.c | 2 -
kernel/fork.c | 4 +-
mm/process_vm_access.c | 2 +-
tools/testing/selftests/ptrace/Makefile | 4 +-
tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
5 files changed, 91 insertions(+), 7 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
--
1.9.1
Bernd Edlinger <[email protected]> writes:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread are running.
>
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated. They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
>
> The deadlock is always happening when strace needs to access the
> tracees process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
Overall this looks good. Mind if I change the subject to:
"exec: Fix a deadlock in strace" ?
Eric
>
> strace D 0 30614 30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> expect D 0 31933 30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> This changes mm_access to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This patch is based on the following patch by Eric W. Biederman:
> "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks"
> Link: https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> kernel/fork.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c12595a..5720ff3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> struct mm_struct *mm;
> int err;
>
> - err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> + err = mutex_lock_killable(&task->signal->exec_update_mutex);
> if (err)
> return ERR_PTR(err);
>
> @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> mmput(mm);
> mm = ERR_PTR(-EACCES);
> }
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + mutex_unlock(&task->signal->exec_update_mutex);
>
> return mm;
> }
On 3/10/20 4:13 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> This fixes a deadlock in the tracer when tracing a multi-threaded
>> application that calls execve while more than one thread are running.
>>
>> I observed that when running strace on the gcc test suite, it always
>> blocks after a while, when expect calls execve, because other threads
>> have to be terminated. They send ptrace events, but the strace is no
>> longer able to respond, since it is blocked in vm_access.
>>
>> The deadlock is always happening when strace needs to access the
>> tracees process mmap, while another thread in the tracee starts to
>> execve a child process, but that cannot continue until the
>> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>
> Overall this looks good. Mind if I change the subject to:
> "exec: Fix a deadlock in strace" ?
>
Sure, go ahead.
Thanks
Bernd.
> Eric
>
>
>>
>> strace D 0 30614 30584 0x00000000
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> schedule_preempt_disabled+0x15/0x20
>> __mutex_lock.isra.13+0x1ec/0x520
>> __mutex_lock_killable_slowpath+0x13/0x20
>> mutex_lock_killable+0x28/0x30
>> mm_access+0x27/0xa0
>> process_vm_rw_core.isra.3+0xff/0x550
>> process_vm_rw+0xdd/0xf0
>> __x64_sys_process_vm_readv+0x31/0x40
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> expect D 0 31933 30876 0x80004003
>> Call Trace:
>> __schedule+0x3ce/0x6e0
>> schedule+0x5c/0xd0
>> flush_old_exec+0xc4/0x770
>> load_elf_binary+0x35a/0x16c0
>> search_binary_handler+0x97/0x1d0
>> __do_execve_file.isra.40+0x5d4/0x8a0
>> __x64_sys_execve+0x49/0x60
>> do_syscall_64+0x64/0x220
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> This changes mm_access to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This patch is based on the following patch by Eric W. Biederman:
>> "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks"
>> Link: https://lore.kernel.org/lkml/[email protected]/
>>
>> Signed-off-by: Bernd Edlinger <[email protected]>
>> ---
>> kernel/fork.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index c12595a..5720ff3 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>> struct mm_struct *mm;
>> int err;
>>
>> - err = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> + err = mutex_lock_killable(&task->signal->exec_update_mutex);
>> if (err)
>> return ERR_PTR(err);
>>
>> @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
>> mmput(mm);
>> mm = ERR_PTR(-EACCES);
>> }
>> - mutex_unlock(&task->signal->cred_guard_mutex);
>> + mutex_unlock(&task->signal->exec_update_mutex);
>>
>> return mm;
>> }
Bernd Edlinger <[email protected]> writes:
> This is a follow up on Eric's patch series to
> fix the deadlocks observed with ptracing when execve
> in multi-threaded applications.
>
> This fixes the simple and most important case where
> the cred_guard_mutex causes strace to deadlock.
>
> This also adds a test case (which is only partially
> fixed so far, the rest of the fixes will follow
> soon).
>
> Two trivial comment fixes are also included.
>
> Bernd Edlinger (4):
> exec: Fix a deadlock in ptrace
> selftests/ptrace: add test cases for dead-locks
> mm: docs: Fix a comment in process_vm_rw_core
> kernel: doc: remove outdated comment in prepare_kernel_cred
>
> kernel/cred.c | 2 -
> kernel/fork.c | 4 +-
> mm/process_vm_access.c | 2 +-
> tools/testing/selftests/ptrace/Makefile | 4 +-
> tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
> 5 files changed, 91 insertions(+), 7 deletions(-)
> create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
Applied.
Thank you,
Eric
This changes kcmp_epoll_target to use the new exec_update_mutex
instead of cred_guard_mutex.
This should be safe, as the credentials are only used for reading,
and furthermore ->mm and ->sighand are updated on execve,
but only under the new exec_update_mutex.
Signed-off-by: Bernd Edlinger <[email protected]>
---
kernel/kcmp.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index a0e3d7a..b3ff928 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -173,8 +173,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
/*
* One should have enough rights to inspect task details.
*/
- ret = kcmp_lock(&task1->signal->cred_guard_mutex,
- &task2->signal->cred_guard_mutex);
+ ret = kcmp_lock(&task1->signal->exec_update_mutex,
+ &task2->signal->exec_update_mutex);
if (ret)
goto err;
if (!ptrace_may_access(task1, PTRACE_MODE_READ_REALCREDS) ||
@@ -229,8 +229,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
}
err_unlock:
- kcmp_unlock(&task1->signal->cred_guard_mutex,
- &task2->signal->cred_guard_mutex);
+ kcmp_unlock(&task1->signal->exec_update_mutex,
+ &task2->signal->exec_update_mutex);
err:
put_task_struct(task1);
put_task_struct(task2);
--
1.9.1
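The patch above keeps taking two per-task mutexes at once. A common way to
do that without an ABBA deadlock is to always lock in address order. The
sketch below shows that technique in userspace with pthread mutexes; the
names lock_pair and unlock_pair are hypothetical, and that kcmp_lock()
works this way internally is an assumption, since its body is not part of
this patch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/*
 * Lock two mutexes in a fixed (address) order so that two callers
 * passing the same pair in opposite order cannot deadlock each other.
 * Handles m1 == m2 by locking only once. Returns 0 or an errno value.
 */
static int lock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
	int err;

	if ((uintptr_t)m2 > (uintptr_t)m1) {
		pthread_mutex_t *tmp = m1;

		m1 = m2;
		m2 = tmp;
	}

	err = pthread_mutex_lock(m1);
	if (!err && m1 != m2) {
		err = pthread_mutex_lock(m2);
		if (err)
			pthread_mutex_unlock(m1);
	}
	return err;
}

/* Release both mutexes; unlock order does not matter for correctness. */
static void unlock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
	pthread_mutex_unlock(m1);
	if (m1 != m2)
		pthread_mutex_unlock(m2);
}
```

Because the ordering depends only on the addresses, swapping one mutex
field for another in the same structure, as the patch does, leaves the
ordering argument unchanged.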
This changes lock_trace to use the new exec_update_mutex
instead of cred_guard_mutex.
This fixes possible deadlocks when the tracer is accessing
/proc/$pid/stack for instance.
This should be safe, as the credentials are only used for reading,
and task->mm is updated on execve under the new exec_update_mutex.
Signed-off-by: Bernd Edlinger <[email protected]>
---
fs/proc/base.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea950..4fdfe4f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -403,11 +403,11 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
static int lock_trace(struct task_struct *task)
{
- int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ int err = mutex_lock_killable(&task->signal->exec_update_mutex);
if (err)
return err;
if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->exec_update_mutex);
return -EPERM;
}
return 0;
@@ -415,7 +415,7 @@ static int lock_trace(struct task_struct *task)
static void unlock_trace(struct task_struct *task)
{
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->exec_update_mutex);
}
#ifdef CONFIG_STACKTRACE
--
1.9.1
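The lock_trace()/unlock_trace() pair changed above has an asymmetric
contract: on success the caller returns holding the mutex, on a failed
permission check the mutex is already released. A userspace sketch of that
contract, with hypothetical names (task_sketch, may_access) and a pthread
mutex in place of the kernel mutex, could look like this:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a task with one guarded permission bit. */
struct task_sketch {
	pthread_mutex_t exec_update;
	int may_access; /* stands in for the ptrace_may_access() result */
};

/*
 * Returns 0 with the mutex held, or a negative errno with the mutex
 * released, so callers only unlock after a successful call.
 */
static int lock_trace_sketch(struct task_sketch *t)
{
	int err = pthread_mutex_lock(&t->exec_update);

	if (err)
		return -err;
	if (!t->may_access) {
		pthread_mutex_unlock(&t->exec_update);
		return -EPERM;
	}
	return 0;
}

static void unlock_trace_sketch(struct task_sketch *t)
{
	int rc = pthread_mutex_unlock(&t->exec_update);

	assert(rc == 0);
	(void)rc;
}
```

Holding the mutex across both the check and the later read is what keeps
an exec from swapping credentials in between.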
This continues the execve anti-deadlock patch and addresses all
of the (mostly) simple cases, where the new exec_update_mutex
can be used instead of the cred_guard_mutex.
Note: these patches are independent of each other, so
in case one of them turns out to be controversial, that does
not affect the others.
Bernd Edlinger (4):
kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
proc: Use new infrastructure to fix deadlocks in execve
proc: io_accounting: Use new infrastructure to fix deadlocks in execve
perf: Use new infrastructure to fix deadlocks in execve
fs/proc/base.c | 10 +++++-----
kernel/events/core.c | 12 ++++++------
kernel/kcmp.c | 8 ++++----
3 files changed, 15 insertions(+), 15 deletions(-)
--
1.9.1
This changes do_io_accounting to use the new exec_update_mutex
instead of cred_guard_mutex.
This fixes possible deadlocks when the tracer is accessing
/proc/$pid/io for instance.
This should be safe, as the credentials are only used for reading.
Signed-off-by: Bernd Edlinger <[email protected]>
---
fs/proc/base.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4fdfe4f..529d0c6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
unsigned long flags;
int result;
- result = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ result = mutex_lock_killable(&task->signal->exec_update_mutex);
if (result)
return result;
@@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
result = 0;
out_unlock:
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->exec_update_mutex);
return result;
}
--
1.9.1
This changes perf_event_set_clock to use the new exec_update_mutex
instead of cred_guard_mutex.
This should be safe, as the credentials are only used for reading.
Signed-off-by: Bernd Edlinger <[email protected]>
---
kernel/events/core.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2173c23..c37f6eb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1248,7 +1248,7 @@ static void put_ctx(struct perf_event_context *ctx)
* function.
*
* Lock order:
- * cred_guard_mutex
+ * exec_update_mutex
* task_struct::perf_event_mutex
* perf_event_context::mutex
* perf_event::child_mutex;
@@ -11254,14 +11254,14 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
}
if (task) {
- err = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
+ err = mutex_lock_interruptible(&task->signal->exec_update_mutex);
if (err)
goto err_task;
/*
* Reuse ptrace permission checks for now.
*
- * We must hold cred_guard_mutex across this and any potential
+ * We must hold exec_update_mutex across this and any potential
* perf_install_in_context() call for this new event to
* serialize against exec() altering our credentials (and the
* perf_event_exit_task() that could imply).
@@ -11550,7 +11550,7 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
mutex_unlock(&ctx->mutex);
if (task) {
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->exec_update_mutex);
put_task_struct(task);
}
@@ -11586,7 +11586,7 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
free_event(event);
err_cred:
if (task)
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->exec_update_mutex);
err_task:
if (task)
put_task_struct(task);
@@ -11891,7 +11891,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
/*
* When a child task exits, feed back event values to parent events.
*
- * Can be called with cred_guard_mutex held when called from
+ * Can be called with exec_update_mutex held when called from
* install_exec_creds().
*/
void perf_event_exit_task(struct task_struct *child)
--
1.9.1
During exec some file descriptors are closed and the files struct is
unshared. But all of that can happen at other times and it has the
same protections during exec as at ordinary times. So stop taking the
cred_guard_mutex as it is useless.
Furthermore the cred_guard_mutex is a bad idea because it is deadlock
prone, as it is held in several places while waiting possibly indefinitely
for userspace to do something.
Cc: Sargun Dhillon <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/pid.c | 6 ------
1 file changed, 6 deletions(-)
Christian if you don't have any objections I will take this one through
my tree.
I tried to figure out why this code path takes the cred_guard_mutex and
the archive on lore.kernel.org was not helpful in finding that part of
the conversation.
diff --git a/kernel/pid.c b/kernel/pid.c
index 60820e72634c..53646d5616d2 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
struct file *file;
int ret;
- ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
- if (ret)
- return ERR_PTR(ret);
-
if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
file = fget_task(task, fd);
else
file = ERR_PTR(-EPERM);
- mutex_unlock(&task->signal->cred_guard_mutex);
-
return file ?: ERR_PTR(-EBADF);
}
--
2.20.1
Bernd Edlinger <[email protected]> writes:
> This changes kcmp_epoll_target to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This should be safe, as the credentials are only used for reading,
> and furthermore ->mm and ->sighand are updated on execve,
> but only under the new exec_update_mutex.
>
Can you add a comment that the exec_update_mutex is not needed for
KCMP_FILE? Both sets of credentials during exec are valid
for accessing the files, so exec_update_mutex does not matter.
I don't think exec_update_mutex is needed for KCMP_SYSVSEM
or KCMP_EPOLL_TFD either, as I don't think exec changes either
one of those.
Eric
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> kernel/kcmp.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/kcmp.c b/kernel/kcmp.c
> index a0e3d7a..b3ff928 100644
> --- a/kernel/kcmp.c
> +++ b/kernel/kcmp.c
> @@ -173,8 +173,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
> /*
> * One should have enough rights to inspect task details.
> */
> - ret = kcmp_lock(&task1->signal->cred_guard_mutex,
> - &task2->signal->cred_guard_mutex);
> + ret = kcmp_lock(&task1->signal->exec_update_mutex,
> + &task2->signal->exec_update_mutex);
> if (ret)
> goto err;
> if (!ptrace_may_access(task1, PTRACE_MODE_READ_REALCREDS) ||
> @@ -229,8 +229,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
> }
>
> err_unlock:
> - kcmp_unlock(&task1->signal->cred_guard_mutex,
> - &task2->signal->cred_guard_mutex);
> + kcmp_unlock(&task1->signal->exec_update_mutex,
> + &task2->signal->exec_update_mutex);
> err:
> put_task_struct(task1);
> put_task_struct(task2);
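
[Editorial sketch, not part of the patch.] The hunk above locks one mutex per task. Any helper that takes two mutexes needs a fixed acquisition order, otherwise two callers that pass the same pair in opposite order can deadlock. The following is a minimal userspace model of that technique using pthreads. The names lock_pair and unlock_pair are hypothetical, and this is not the kernel's kcmp_lock, whose body is not shown in this thread.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical userspace model: lock two mutexes in address order. */
static int lock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
    int err;

    /* Take the lower-addressed mutex first so that callers passing
     * (a, b) and (b, a) agree on the order and cannot deadlock. */
    if ((uintptr_t)m1 > (uintptr_t)m2) {
        pthread_mutex_t *tmp = m1;
        m1 = m2;
        m2 = tmp;
    }

    err = pthread_mutex_lock(m1);
    if (err)
        return err;

    /* Both arguments may name the same mutex; lock it only once. */
    if (m1 != m2) {
        err = pthread_mutex_lock(m2);
        if (err)
            pthread_mutex_unlock(m1);
    }
    return err;
}

/* Release order does not matter for correctness. */
static void unlock_pair(pthread_mutex_t *m1, pthread_mutex_t *m2)
{
    pthread_mutex_unlock(m1);
    if (m1 != m2)
        pthread_mutex_unlock(m2);
}
```

In this model, two tasks in the same thread group would share one mutex, which is why the m1 == m2 case is handled explicitly.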
Bernd Edlinger <[email protected]> writes:
> This changes do_io_accounting to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This fixes possible deadlocks when the tracer is accessing
> /proc/$pid/io for instance.
>
> This should be safe, as the credentials are only used for reading.
This is an improvement.
We probably want to do this as an incremental step in making things
better, but perhaps I am blind: I do not find the reason for
guarding this with the cred_guard_mutex at all persuasive.
I think moving the ptrace_may_access check down to after the
unlock_task_sighand would be just as effective at addressing the
concerns raised in the original commit. I think the task_lock provides
all of the barriers we need to make it safe to move the ptrace_may_access
checks.
The reason I say this is I don't see exec changing ->ioac. Just
performing some I/O which would update the io accounting statistics.
Can anyone see if I am wrong?
Eric
commit 293eb1e7772b25a93647c798c7b89bf26c2da2e0
Author: Vasiliy Kulikov <[email protected]>
Date: Tue Jul 26 16:08:38 2011 -0700
proc: fix a race in do_io_accounting()
If an inode's mode permits opening /proc/PID/io and the resulting file
descriptor is kept across execve() of a setuid or similar binary, the
ptrace_may_access() check tries to prevent using this fd against the
task with escalated privileges.
Unfortunately, there is a race in the check against execve(). If
execve() is processed after the ptrace check, but before the actual io
information gathering, io statistics will be gathered from the
privileged process. At least in theory this might lead to gathering
sensible information (like ssh/ftp password length) that wouldn't be
available otherwise.
Holding task->signal->cred_guard_mutex while gathering the io
information should protect against the race.
The order of locking is similar to the one inside of ptrace_attach():
first goes cred_guard_mutex, then lock_task_sighand().
Signed-off-by: Vasiliy Kulikov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> fs/proc/base.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4fdfe4f..529d0c6 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
> unsigned long flags;
> int result;
>
> - result = mutex_lock_killable(&task->signal->cred_guard_mutex);
> + result = mutex_lock_killable(&task->signal->exec_update_mutex);
> if (result)
> return result;
>
> @@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
> result = 0;
>
> out_unlock:
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + mutex_unlock(&task->signal->exec_update_mutex);
> return result;
> }
On Tue, Mar 10, 2020 at 01:52:05PM -0500, Eric W. Biederman wrote:
>
> During exec some file descriptors are closed and the files struct is
> unshared. But all of that can happen at other times and it has the
> same protections during exec as at ordinary times. So stop taking the
> cred_guard_mutex as it is useless.
>
> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> prone, as it is held in several places while waiting possibly indefinitely
> for userspace to do something.
>
> Cc: Sargun Dhillon <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/pid.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> Christian if you don't have any objections I will take this one through
> my tree.
Sure.
Acked-by: Christian Brauner <[email protected]>
>
> I tried to figure out why this code path takes the cred_guard_mutex and
> the archive on lore.kernel.org was not helpful in finding that part of
> the conversation.
Let me think a little harder and hopefully get back to you with a
sensible explanation.
On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
> During exec some file descriptors are closed and the files struct is
> unshared. But all of that can happen at other times and it has the
> same protections during exec as at ordinary times. So stop taking the
> cred_guard_mutex as it is useless.
>
> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> prone, as it is held in several places while waiting possibly indefinitely
> for userspace to do something.
Please don't. Just use the new exec_update_mutex like everywhere else.
> Cc: Sargun Dhillon <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/pid.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> Christian if you don't have any objections I will take this one through
> my tree.
>
> I tried to figure out why this code path takes the cred_guard_mutex and
> the archive on lore.kernel.org was not helpful in finding that part of
> the conversation.
That was my suggestion.
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 60820e72634c..53646d5616d2 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
> struct file *file;
> int ret;
>
> - ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
> - if (ret)
> - return ERR_PTR(ret);
> -
> if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
> file = fget_task(task, fd);
> else
> file = ERR_PTR(-EPERM);
>
> - mutex_unlock(&task->signal->cred_guard_mutex);
> -
> return file ?: ERR_PTR(-EBADF);
> }
If you make this change, then if this races with execution of a setuid
program that afterwards e.g. opens a unix domain socket, an attacker
will be able to steal that socket and inject messages into
communication with things like DBus. procfs currently has the same
race, and that still needs to be fixed, but at least procfs doesn't
let you open things like sockets because they don't have a working
->open handler, and it enforces the normal permission check for opening files.
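
[Editorial sketch, not part of the thread.] The race described above can be modelled in userspace. In the toy below, check and fetch are separate steps, so an exec that lands between them hands the caller a file it should never see. The locked variant does both steps under the same mutex that the modelled exec takes. All names (toy_task, toy_setuid_exec, toy_fget_locked and so on) are hypothetical stand-ins, and the boolean "accessible" only approximates what ptrace_may_access() decides.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy stand-in for a task; every name here is hypothetical. */
struct toy_task {
    pthread_mutex_t exec_lock; /* models the mutex held across the credential switch */
    bool accessible;           /* models ptrace_may_access() succeeding */
    int file;                  /* models one fd-table slot, -1 when empty */
};

/* Models exec of a setuid program that afterwards opens a private socket. */
static void toy_setuid_exec(struct toy_task *t, int new_file)
{
    pthread_mutex_lock(&t->exec_lock);
    t->accessible = false;
    t->file = new_file;
    pthread_mutex_unlock(&t->exec_lock);
}

/* Unlocked variant, split in two so the race window is explicit. */
static bool toy_check(const struct toy_task *t)
{
    return t->accessible;
}

static int toy_fetch(const struct toy_task *t)
{
    return t->file;
}

/* Locked variant: an exec cannot slip in between check and fetch. */
static int toy_fget_locked(struct toy_task *t)
{
    int file;

    pthread_mutex_lock(&t->exec_lock);
    file = t->accessible ? t->file : -1;
    pthread_mutex_unlock(&t->exec_lock);
    return file;
}
```

Calling toy_check(), then toy_setuid_exec(), then toy_fetch() reproduces the stolen-socket interleaving deterministically; toy_fget_locked() refuses in the same situation.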
Jann Horn <[email protected]> writes:
> On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
>> During exec some file descriptors are closed and the files struct is
>> unshared. But all of that can happen at other times and it has the
>> same protections during exec as at ordinary times. So stop taking the
>> cred_guard_mutex as it is useless.
>>
>> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
>> prone, as it is held in several places while waiting possibly indefinitely
>> for userspace to do something.
>
> Please don't. Just use the new exec_update_mutex like everywhere else.
>
>> Cc: Sargun Dhillon <[email protected]>
>> Cc: Christian Brauner <[email protected]>
>> Cc: Arnd Bergmann <[email protected]>
>> Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> kernel/pid.c | 6 ------
>> 1 file changed, 6 deletions(-)
>>
>> Christian if you don't have any objections I will take this one through
>> my tree.
>>
>> I tried to figure out why this code path takes the cred_guard_mutex and
>> the archive on lore.kernel.org was not helpful in finding that part of
>> the conversation.
>
> That was my suggestion.
>
>> diff --git a/kernel/pid.c b/kernel/pid.c
>> index 60820e72634c..53646d5616d2 100644
>> --- a/kernel/pid.c
>> +++ b/kernel/pid.c
>> @@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
>> struct file *file;
>> int ret;
>>
>> - ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> - if (ret)
>> - return ERR_PTR(ret);
>> -
>> if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
>> file = fget_task(task, fd);
>> else
>> file = ERR_PTR(-EPERM);
>>
>> - mutex_unlock(&task->signal->cred_guard_mutex);
>> -
>> return file ?: ERR_PTR(-EBADF);
>> }
>
> If you make this change, then if this races with execution of a setuid
> program that afterwards e.g. opens a unix domain socket, an attacker
> will be able to steal that socket and inject messages into
> communication with things like DBus. procfs currently has the same
> race, and that still needs to be fixed, but at least procfs doesn't
> let you open things like sockets because they don't have a working
> ->open handler, and it enforces the normal permission check for
> opening files.
It isn't only exec that can change credentials. Do we need a lock for
changing credentials?
Wouldn't it be sufficient to simply test ptrace_may_access after
we get a copy of the file?
If we need a lock around credential change let's design and build that.
Having a mismatch between what a lock is designed to do, and what
people use it for can only result in other bugs as people get confused.
Eric
On 3/10/20 8:01 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> This changes kcmp_epoll_target to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This should be safe, as the credentials are only used for reading,
>> and furthermore ->mm and ->sighand are updated on execve,
>> but only under the new exec_update_mutex.
>>
>
> Can you add a comment that the exec_update_mutex is not needed for
> KCMP_FILE? As both sets of credentials during exec are valid
> for accessing the files so exec_update_mutex does not matter.
>
Some files are closed by do_close_on_exec,
so in theory this allows you to examine files that
were open for the old user but closed for the new user,
using either credential.
It is not a race condition, but it may be a security
concern.
> I don't think exec_update_mutex is needed for KCMP_SYSVSEM
> or KCMP_EPOLL_TFD either. As I don't think exec changes either
> one of those.
>
KCMP_EPOLL_TFD also accesses file pointers,
so the same is possible there.
It might be that KCMP_SYSVSEM is a missed optimization, but
I may have overlooked something.
I'd rather err on the safe side.
> Eric
>
>
>> Signed-off-by: Bernd Edlinger <[email protected]>
>> ---
>> kernel/kcmp.c | 8 ++++----
>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/kcmp.c b/kernel/kcmp.c
>> index a0e3d7a..b3ff928 100644
>> --- a/kernel/kcmp.c
>> +++ b/kernel/kcmp.c
>> @@ -173,8 +173,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
>> /*
>> * One should have enough rights to inspect task details.
>> */
>> - ret = kcmp_lock(&task1->signal->cred_guard_mutex,
>> - &task2->signal->cred_guard_mutex);
>> + ret = kcmp_lock(&task1->signal->exec_update_mutex,
>> + &task2->signal->exec_update_mutex);
>> if (ret)
>> goto err;
>> if (!ptrace_may_access(task1, PTRACE_MODE_READ_REALCREDS) ||
>> @@ -229,8 +229,8 @@ static int kcmp_epoll_target(struct task_struct *task1,
>> }
>>
>> err_unlock:
>> - kcmp_unlock(&task1->signal->cred_guard_mutex,
>> - &task2->signal->cred_guard_mutex);
>> + kcmp_unlock(&task1->signal->exec_update_mutex,
>> + &task2->signal->exec_update_mutex);
>> err:
>> put_task_struct(task1);
>> put_task_struct(task2);
On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <[email protected]> wrote:
> Jann Horn <[email protected]> writes:
> > On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
> >> During exec some file descriptors are closed and the files struct is
> >> unshared. But all of that can happen at other times and it has the
> >> same protections during exec as at ordinary times. So stop taking the
> >> cred_guard_mutex as it is useless.
> >>
> >> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> >> prone, as it is held in several places while waiting possibly indefinitely
> >> for userspace to do something.
> >
> > Please don't. Just use the new exec_update_mutex like everywhere else.
> >
> >> Cc: Sargun Dhillon <[email protected]>
> >> Cc: Christian Brauner <[email protected]>
> >> Cc: Arnd Bergmann <[email protected]>
> >> Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >> ---
> >> kernel/pid.c | 6 ------
> >> 1 file changed, 6 deletions(-)
> >>
> >> Christian if you don't have any objections I will take this one through
> >> my tree.
> >>
> >> I tried to figure out why this code path takes the cred_guard_mutex and
> >> the archive on lore.kernel.org was not helpful in finding that part of
> >> the conversation.
> >
> > That was my suggestion.
> >
> >> diff --git a/kernel/pid.c b/kernel/pid.c
> >> index 60820e72634c..53646d5616d2 100644
> >> --- a/kernel/pid.c
> >> +++ b/kernel/pid.c
> >> @@ -577,17 +577,11 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
> >> struct file *file;
> >> int ret;
> >>
> >> - ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
> >> - if (ret)
> >> - return ERR_PTR(ret);
> >> -
> >> if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
> >> file = fget_task(task, fd);
> >> else
> >> file = ERR_PTR(-EPERM);
> >>
> >> - mutex_unlock(&task->signal->cred_guard_mutex);
> >> -
> >> return file ?: ERR_PTR(-EBADF);
> >> }
> >
> > If you make this change, then if this races with execution of a setuid
> > program that afterwards e.g. opens a unix domain socket, an attacker
> > will be able to steal that socket and inject messages into
> > communication with things like DBus. procfs currently has the same
> > race, and that still needs to be fixed, but at least procfs doesn't
> > let you open things like sockets because they don't have a working
> > ->open handler, and it enforces the normal permission check for
> > opening files.
>
> It isn't only exec that can change credentials. Do we need a lock for
> changing credentials?
Hmm, I guess so? Normally, a task that's changing credentials becomes
nondumpable at the same time (and there are explicit memory barriers
in commit_creds() and __ptrace_may_access() to enforce the ordering
for this); so you normally don't see tasks becoming ptrace-accessible
via anything other than execve(). But I guess if someone opens a
root-only file, closes it, drops privileges, and then explicitly does
prctl(PR_SET_DUMPABLE, 1), we should probably protect that, too.
> Wouldn't it be sufficient to simply test ptrace_may_access after
> we get a copy of the file?
There are also setuid helpers that can, after having done privileged
stuff, drop privileges and call execve(); after that,
ptrace_may_access() succeeds again. In particular, polkit has a helper
that does this.
> If we need a lock around credential change let's design and build that.
> Having a mismatch between what a lock is designed to do, and what
> people use it for can only result in other bugs as people get confused.
Hmm... what benefits do we get from making it a separate lock? I guess
it would allow us to make it a per-task lock instead of a
signal_struct-wide one? That might be helpful...
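
[Editorial sketch, not part of the thread.] The objection above to "fetch the file first, check afterwards" is that the target can be inaccessible at the moment of the fetch and accessible again by the time of the check, for example a setuid helper that later drops privileges and calls execve(). A toy userspace model of that interleaving follows. All names are hypothetical, and the boolean "accessible" only stands in for the outcome of ptrace_may_access().

```c
#include <stdbool.h>

/* Toy stand-in for a task; all names are hypothetical. */
struct toy_task {
    bool accessible; /* models ptrace_may_access() succeeding */
    int file;        /* models one fd-table slot, -1 when empty */
};

/* Step 1 of the lock-free ordering: grab the file reference. */
static int toy_fetch_first(const struct toy_task *t)
{
    return t->file;
}

/* Step 2: check access afterwards, dropping the result on failure. */
static int toy_check_after(const struct toy_task *t, int fetched)
{
    return t->accessible ? fetched : -1;
}
```

If the fetch happens while the task is privileged and the check happens after it has dropped privileges again, toy_check_after() returns the privileged file. A late check alone does not detect that the task was inaccessible in between.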
On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <[email protected]> wrote:
> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <[email protected]> wrote:
> > Jann Horn <[email protected]> writes:
> > > On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
> > >> During exec some file descriptors are closed and the files struct is
> > >> unshared. But all of that can happen at other times and it has the
> > >> same protections during exec as at ordinary times. So stop taking the
> > >> cred_guard_mutex as it is useless.
> > >>
> > >> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> > >> prone, as it is held in several places while waiting possibly indefinitely
> > >> for userspace to do something.
[...]
> > > If you make this change, then if this races with execution of a setuid
> > > program that afterwards e.g. opens a unix domain socket, an attacker
> > > will be able to steal that socket and inject messages into
> > > communication with things like DBus. procfs currently has the same
> > > race, and that still needs to be fixed, but at least procfs doesn't
> > > let you open things like sockets because they don't have a working
> > > ->open handler, and it enforces the normal permission check for
> > > opening files.
> >
> > It isn't only exec that can change credentials. Do we need a lock for
> > changing credentials?
[...]
> > If we need a lock around credential change let's design and build that.
> > Having a mismatch between what a lock is designed to do, and what
> > people use it for can only result in other bugs as people get confused.
>
> Hmm... what benefits do we get from making it a separate lock? I guess
> it would allow us to make it a per-task lock instead of a
> signal_struct-wide one? That might be helpful...
But actually, isn't the core purpose of the cred_guard_mutex to guard
against concurrent credential changes anyway? That's what almost
everyone uses it for, and it's in the name...
On Mon, Mar 09, 2020 at 03:48:55PM -0500, Eric W. Biederman wrote:
> And I completely agree that we should at least rename tsk to me.
> Just for clarity.
I think it wouldn't hurt to add comments to spell it out explicitly
in each of the tsk->me functions, something like:
/*
* The "me" task_struct argument here must only ever refer to "current",
* but it gets passed in to avoid re-calculating "current" in each helper.
*/
I've found that the exec code in its entirety would be better off with
more comments. :) Usually that's the bulk of what I find myself adding
when I make changes in this area. ;)
-Kees
--
Kees Cook
On Sun, Mar 08, 2020 at 04:35:26PM -0500, Eric W. Biederman wrote:
>
> Make it clear that current only needs to be computed once in
> flush_old_exec. This may have some efficiency improvements and it
> makes the code easier to change.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Modulo my suggestion of adding more comments (it could even be kernel-doc!)
that explicitly state that "me" should always be "current", yup, looks
good:
Reviewed-by: Kees Cook <[email protected]>
-Kees
> ---
> fs/exec.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index db17be51b112..c3f34791f2f0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1260,13 +1260,14 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
> */
> int flush_old_exec(struct linux_binprm * bprm)
> {
> + struct task_struct *me = current;
> int retval;
>
> /*
> * Make sure we have a private signal table and that
> * we are unassociated from the previous thread group.
> */
> - retval = de_thread(current);
> + retval = de_thread(me);
> if (retval)
> goto out;
>
> @@ -1294,10 +1295,10 @@ int flush_old_exec(struct linux_binprm * bprm)
> bprm->mm = NULL;
>
> set_fs(USER_DS);
> - current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> + me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> PF_NOFREEZE | PF_NO_SETAFFINITY);
> flush_thread();
> - current->personality &= ~bprm->per_clear;
> + me->personality &= ~bprm->per_clear;
>
> /*
> * We have to apply CLOEXEC before we change whether the process is
> @@ -1305,7 +1306,7 @@ int flush_old_exec(struct linux_binprm * bprm)
> * trying to access the should-be-closed file descriptors of a process
> * undergoing exec(2).
> */
> - do_close_on_exec(current->files);
> + do_close_on_exec(me->files);
> return 0;
>
> out:
> --
> 2.25.0
>
--
Kees Cook
On 3/10/20 8:06 PM, Eric W. Biederman wrote:
> Bernd Edlinger <[email protected]> writes:
>
>> This changes do_io_accounting to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This fixes possible deadlocks when the tracer is accessing
>> /proc/$pid/io for instance.
>>
>> This should be safe, as the credentials are only used for reading.
>
> This is an improvement.
>
> We probably want to do this as an incremental step in making things
> better, but perhaps I am blind: I do not find the reason for
> guarding this with the cred_guard_mutex at all persuasive.
>
> I think moving the ptrace_may_access check down to after the
> unlock_task_sighand would be just as effective at addressing the
> concerns raised in the original commit. I think the task_lock provides
> all of the barriers we need to make it safe to move the ptrace_may_access
> checks.
>
> The reason I say this is I don't see exec changing ->ioac. Just
> performing some I/O which would update the io accounting statistics.
>
Maybe the suid executable is doing io at startup, or maybe not,
and what the program does immediately at startup is a secret
that we want to keep, but evil eve wants to find out.
eve is using /proc/alice/io to do that.
It is a bit contrived, but seems like a security concern.
When we hold the exec_update_mutex while collecting the data, we
cannot see any io of the new process when the new credentials
don't allow that.
Bernd.
> Can anyone see if I am wrong?
>
> Eric
>
>
> commit 293eb1e7772b25a93647c798c7b89bf26c2da2e0
> Author: Vasiliy Kulikov <[email protected]>
> Date: Tue Jul 26 16:08:38 2011 -0700
>
> proc: fix a race in do_io_accounting()
>
> If an inode's mode permits opening /proc/PID/io and the resulting file
> descriptor is kept across execve() of a setuid or similar binary, the
> ptrace_may_access() check tries to prevent using this fd against the
> task with escalated privileges.
>
> Unfortunately, there is a race in the check against execve(). If
> execve() is processed after the ptrace check, but before the actual io
> information gathering, io statistics will be gathered from the
> privileged process. At least in theory this might lead to gathering
> sensible information (like ssh/ftp password length) that wouldn't be
> available otherwise.
>
> Holding task->signal->cred_guard_mutex while gathering the io
> information should protect against the race.
>
> The order of locking is similar to the one inside of ptrace_attach():
> first goes cred_guard_mutex, then lock_task_sighand().
>
> Signed-off-by: Vasiliy Kulikov <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
>
>
>> Signed-off-by: Bernd Edlinger <[email protected]>
>> ---
>> fs/proc/base.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 4fdfe4f..529d0c6 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>> unsigned long flags;
>> int result;
>>
>> - result = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> + result = mutex_lock_killable(&task->signal->exec_update_mutex);
>> if (result)
>> return result;
>>
>> @@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>> result = 0;
>>
>> out_unlock:
>> - mutex_unlock(&task->signal->cred_guard_mutex);
>> + mutex_unlock(&task->signal->exec_update_mutex);
>> return result;
>> }
On 3/10/20 9:10 PM, Jann Horn wrote:
> On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <[email protected]> wrote:
>> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <[email protected]> wrote:
>>> Jann Horn <[email protected]> writes:
>>>> On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
>>>>> During exec some file descriptors are closed and the files struct is
>>>>> unshared. But all of that can happen at other times and it has the
>>>>> same protections during exec as at ordinary times. So stop taking the
>>>>> cred_guard_mutex as it is useless.
>>>>>
>>>>> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
>>>>> prone, as it is held in several places while waiting possibly indefinitely
>>>>> for userspace to do something.
> [...]
>>>> If you make this change, then if this races with execution of a setuid
>>>> program that afterwards e.g. opens a unix domain socket, an attacker
>>>> will be able to steal that socket and inject messages into
>>>> communication with things like DBus. procfs currently has the same
>>>> race, and that still needs to be fixed, but at least procfs doesn't
>>>> let you open things like sockets because they don't have a working
>>>> ->open handler, and it enforces the normal permission check for
>>>> opening files.
>>>
>>> It isn't only exec that can change credentials. Do we need a lock for
>>> changing credentials?
> [...]
>>> If we need a lock around credential change let's design and build that.
>>> Having a mismatch between what a lock is designed to do, and what
>>> people use it for can only result in other bugs as people get confused.
>>
>> Hmm... what benefits do we get from making it a separate lock? I guess
>> it would allow us to make it a per-task lock instead of a
>> signal_struct-wide one? That might be helpful...
>
> But actually, isn't the core purpose of the cred_guard_mutex to guard
> against concurrent credential changes anyway? That's what almost
> everyone uses it for, and it's in the name...
>
The main raison d'etre of exec_update_mutex is to get a consistent
view of task->mm and task credentials.
The reason why you want the cred_guard_mutex is that some action
is changing the resulting credentials that the execve is about
to install, and that is the data flow in the opposite direction.
Bernd.
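
[Editorial sketch, not part of the thread.] The split between the two mutexes can be modelled in userspace as an outer lock held for the whole exec (including any wait on other threads or a tracer) and an inner lock held only while mm and credentials are swapped. A reader that needs a consistent view takes only the inner lock, so it is not blocked while the outer one is held. This is a toy model under those assumptions; the struct and function names are hypothetical and the integer mm_generation merely stands in for task->mm plus credentials.

```c
#include <pthread.h>

/* Toy stand-in for signal_struct; all names are hypothetical. */
struct toy_signal {
    pthread_mutex_t cred_guard;  /* held for the whole exec, including waits */
    pthread_mutex_t exec_update; /* held only while mm and creds are swapped */
    int mm_generation;           /* stands in for task->mm plus credentials */
};

/* Start of exec: from here on the exec may wait on userspace. */
static void toy_exec_begin(struct toy_signal *sig)
{
    pthread_mutex_lock(&sig->cred_guard);
}

/* Point of no return: swap mm and creds under the short-lived lock. */
static void toy_exec_commit(struct toy_signal *sig)
{
    pthread_mutex_lock(&sig->exec_update);
    sig->mm_generation++;
    pthread_mutex_unlock(&sig->exec_update);
}

/* End of exec: release the outer lock. */
static void toy_exec_end(struct toy_signal *sig)
{
    pthread_mutex_unlock(&sig->cred_guard);
}

/* Reader (think mm_access): needs only a consistent snapshot. */
static int toy_mm_access(struct toy_signal *sig)
{
    int gen;

    pthread_mutex_lock(&sig->exec_update);
    gen = sig->mm_generation;
    pthread_mutex_unlock(&sig->exec_update);
    return gen;
}
```

Because toy_mm_access() never touches cred_guard, a tracer calling it while an exec is stuck between toy_exec_begin() and toy_exec_commit() returns immediately with the old view instead of deadlocking.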
On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
>
> This makes the code clearer and makes it easier to implement a mutex
> that is not taken over any locations that may block indefinitely waiting
> for userspace.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/exec.c | 39 ++++++++++++++++++++++++++-------------
> 1 file changed, 26 insertions(+), 13 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index c3f34791f2f0..ff74b9a74d34 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
> flush_itimer_signals();
> #endif
Semi-related (existing behavior): in de_thread(), what keeps the thread
group from changing? i.e.:
if (thread_group_empty(tsk))
goto no_thread_group;
/*
* Kill all other threads in the thread group.
*/
spin_lock_irq(lock);
... kill other threads under lock ...
Why is the thread_group_empty() test not under lock?
>
> + BUG_ON(!thread_group_leader(tsk));
> + return 0;
> +
> +killed:
> + /* protects against exit_notify() and __exit_signal() */
I wonder if include/linux/sched/task.h's definition of tasklist_lock
should explicitly gain a note about group_exit_task and notify_count,
or, alternatively, signal.h's section on these fields should gain a
comment? tasklist_lock is unmentioned in signal.h... :(
> + read_lock(&tasklist_lock);
> + sig->group_exit_task = NULL;
> + sig->notify_count = 0;
> + read_unlock(&tasklist_lock);
> + return -EAGAIN;
> +}
> +
> +
> +static int unshare_sighand(struct task_struct *me)
> +{
> + struct sighand_struct *oldsighand = me->sighand;
> +
> if (refcount_read(&oldsighand->count) != 1) {
> struct sighand_struct *newsighand;
> /*
> @@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
>
> write_lock_irq(&tasklist_lock);
> spin_lock(&oldsighand->siglock);
> - rcu_assign_pointer(tsk->sighand, newsighand);
> + rcu_assign_pointer(me->sighand, newsighand);
> spin_unlock(&oldsighand->siglock);
> write_unlock_irq(&tasklist_lock);
>
> __cleanup_sighand(oldsighand);
> }
> -
> - BUG_ON(!thread_group_leader(tsk));
> return 0;
> -
> -killed:
> - /* protects against exit_notify() and __exit_signal() */
> - read_lock(&tasklist_lock);
> - sig->group_exit_task = NULL;
> - sig->notify_count = 0;
> - read_unlock(&tasklist_lock);
> - return -EAGAIN;
> }
>
> char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
> @@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
> int retval;
>
> /*
> - * Make sure we have a private signal table and that
> - * we are unassociated from the previous thread group.
> + * Make this the only thread in the thread group.
> */
> retval = de_thread(me);
> if (retval)
> goto out;
>
> + /*
> + * Make the signal table private.
> + */
> + retval = unshare_sighand(me);
> + if (retval)
> + goto out;
> +
> /*
> * Must be called _before_ exec_mmap() as bprm->mm is
> * not visibile until then. This also enables the update
> --
> 2.25.0
Otherwise, yes, sensible separation.
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
>
> These functions have very little to do with de_thread; move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/exec.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index ff74b9a74d34..215d86f77b63 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
> /* we have changed execution domain */
> tsk->exit_signal = SIGCHLD;
>
> -#ifdef CONFIG_POSIX_TIMERS
> - exit_itimers(sig);
> - flush_itimer_signals();
> -#endif
> -
> BUG_ON(!thread_group_leader(tsk));
> return 0;
>
> @@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
> if (retval)
> goto out;
>
> +#ifdef CONFIG_POSIX_TIMERS
> + exit_itimers(me->signal);
> + flush_itimer_signals();
> +#endif
> +
I twitch at seeing #ifdefs in .c instead of hidden in the .h declarations
of these two functions, but as this is a copy/paste, I'll live. ;)
Reviewed-by: Kees Cook <[email protected]>
-Kees
> /*
> * Make the signal table private.
> */
> --
> 2.25.0
>
--
Kees Cook
On 3/10/20 9:29 PM, Kees Cook wrote:
> On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
>>
>> This makes the code clearer and makes it easier to implement a mutex
>> that is not taken over any locations that may block indefinitely waiting
>> for userspace.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/exec.c | 39 ++++++++++++++++++++++++++-------------
>> 1 file changed, 26 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index c3f34791f2f0..ff74b9a74d34 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
>> flush_itimer_signals();
>> #endif
>
> Semi-related (existing behavior): in de_thread(), what keeps the thread
> group from changing? i.e.:
>
> if (thread_group_empty(tsk))
> goto no_thread_group;
>
> /*
> * Kill all other threads in the thread group.
> */
> spin_lock_irq(lock);
> ... kill other threads under lock ...
>
> Why is the thread_group_empty() test not under lock?
>
A new thread cannot be created when only one thread is executing,
right?
>>
>> + BUG_ON(!thread_group_leader(tsk));
>> + return 0;
>> +
>> +killed:
>> + /* protects against exit_notify() and __exit_signal() */
>
> I wonder if include/linux/sched/task.h's definition of tasklist_lock
> should explicitly gain a note about group_exit_task and notify_count,
> or, alternatively, signal.h's section on these fields should gain a
> comment? tasklist_lock is unmentioned in signal.h... :(
>
>> + read_lock(&tasklist_lock);
>> + sig->group_exit_task = NULL;
>> + sig->notify_count = 0;
>> + read_unlock(&tasklist_lock);
>> + return -EAGAIN;
>> +}
>> +
>> +
>> +static int unshare_sighand(struct task_struct *me)
>> +{
>> + struct sighand_struct *oldsighand = me->sighand;
>> +
>> if (refcount_read(&oldsighand->count) != 1) {
>> struct sighand_struct *newsighand;
>> /*
>> @@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
>>
>> write_lock_irq(&tasklist_lock);
>> spin_lock(&oldsighand->siglock);
>> - rcu_assign_pointer(tsk->sighand, newsighand);
>> + rcu_assign_pointer(me->sighand, newsighand);
>> spin_unlock(&oldsighand->siglock);
>> write_unlock_irq(&tasklist_lock);
>>
>> __cleanup_sighand(oldsighand);
>> }
>> -
>> - BUG_ON(!thread_group_leader(tsk));
>> return 0;
>> -
>> -killed:
>> - /* protects against exit_notify() and __exit_signal() */
>> - read_lock(&tasklist_lock);
>> - sig->group_exit_task = NULL;
>> - sig->notify_count = 0;
>> - read_unlock(&tasklist_lock);
>> - return -EAGAIN;
>> }
>>
>> char *__get_task_comm(char *buf, size_t buf_size, struct task_struct *tsk)
>> @@ -1264,13 +1271,19 @@ int flush_old_exec(struct linux_binprm * bprm)
>> int retval;
>>
>> /*
>> - * Make sure we have a private signal table and that
>> - * we are unassociated from the previous thread group.
>> + * Make this the only thread in the thread group.
>> */
>> retval = de_thread(me);
>> if (retval)
>> goto out;
>>
>> + /*
>> + * Make the signal table private.
>> + */
>> + retval = unshare_sighand(me);
>> + if (retval)
>> + goto out;
>> +
>> /*
>> * Must be called _before_ exec_mmap() as bprm->mm is
>> * not visibile until then. This also enables the update
>> --
>> 2.25.0
>
> Otherwise, yes, sensible separation.
>
> Reviewed-by: Kees Cook <[email protected]>
>
On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
>
> I have read through the code in exec_mmap and I do not see anything
> that depends on sighand or the sighand lock, or on signals in any way
> so this should be safe.
>
> This rearrangement of code has two significant benefits. It makes
> the determination of passing the point of no return by testing bprm->mm
> accurate. All failures prior to that point in flush_old_exec are
> either truly recoverable or they are fatal.
Agreed. Though I see a use of "current", which maybe you want to
parameterize to a "me" argument in acct_arg_size(). (Though looking at
the callers, perhaps there is no benefit?)
>
> Further this consolidates all of the possible indefinite waits for
> userspace together at the top of flush_old_exec. The possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.
>
> This consolidation allows the creation of a mutex to replace
> cred_guard_mutex that is not held over possible indefinite userspace
> waits. Which will allow removing deadlock scenarios from the kernel.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/exec.c | 24 ++++++++++++------------
> 1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 215d86f77b63..d820a7272a76 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
> if (retval)
> goto out;
>
> -#ifdef CONFIG_POSIX_TIMERS
> - exit_itimers(me->signal);
> - flush_itimer_signals();
> -#endif
I think this comment:
/*
* This is called by do_exit or de_thread, only when there are no more
* references to the shared signal_struct.
*/
void exit_itimers(struct signal_struct *sig)
Refers to there being other threads, yes? Not that the signal table is
private yet?
> -
> - /*
> - * Make the signal table private.
> - */
> - retval = unshare_sighand(me);
> - if (retval)
> - goto out;
> -
> /*
> * Must be called _before_ exec_mmap() as bprm->mm is
> * not visibile until then. This also enables the update
> @@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
> */
> bprm->mm = NULL;
>
> +#ifdef CONFIG_POSIX_TIMERS
> + exit_itimers(me->signal);
> + flush_itimer_signals();
> +#endif
I've mostly convinced myself that there are no "side-effects" from having
these timers expire as the mm is going away. I think some kind of comment
of that intent should be explicitly stated here above the timer work.
Beyond that:
Reviewed-by: Kees Cook <[email protected]>
-Kees
> +
> + /*
> + * Make the signal table private.
> + */
> + retval = unshare_sighand(me);
> + if (retval)
> + goto out;
> +
> set_fs(USER_DS);
> me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> PF_NOFREEZE | PF_NO_SETAFFINITY);
> --
> 2.25.0
>
--
Kees Cook
On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
> Further this consolidates all of the possible indefinite waits for
> userspace together at the top of flush_old_exec. The possible wait
> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
> to be resolved in clear_child_tid, and the possible wait for a page
> fault in exit_robust_list.
I forgot to mention, just as a point of clarity, there are lots of
other page faults possible, but they're _before_ flush_old_exec()
(i.e. all the copy_strings() calls). Is it worth clarifying this to
"before or at the top of flush_old_exec()" or do you mean something
else? (And as always: perhaps expand flush_old_exec()'s comment to
describe the newly intended state.)
--
Kees Cook
On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
> exec: Add exec_update_mutex to replace cred_guard_mutex
>
> The cred_guard_mutex is problematic as it is held over possibly
> indefinite waits for userspace. The possible indefinite waits for
> userspace that I have identified are: The cred_guard_mutex is held in
> PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
> held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
> cred_guard_mutex is held over "get_user(futex_offset, ...)" in
> exit_robust_list. The cred_guard_mutex is held over copy_strings.
I suspect you're not trying to make a comprehensive list here, but do
you want to mention seccomp too (since it's yet another weird case).
> [...]
> Holding a mutex over any of those possibly indefinite waits for
> userspace does not appear necessary. Add exec_update_mutex that will
> just cover updating the process during exec where the permissions and
> the objects pointed to by the task struct may be out of sync.
Should the specific resources be pointed out here? creds, mm, ... ?
But otherwise, yup, looks sane:
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
On Tue, Mar 10, 2020 at 09:34:03PM +0100, Bernd Edlinger wrote:
> On 3/10/20 9:29 PM, Kees Cook wrote:
> > On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
> >>
> >> This makes the code clearer and makes it easier to implement a mutex
> >> that is not taken over any locations that may block indefinitely waiting
> >> for userspace.
> >>
> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >> ---
> >> fs/exec.c | 39 ++++++++++++++++++++++++++-------------
> >> 1 file changed, 26 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/fs/exec.c b/fs/exec.c
> >> index c3f34791f2f0..ff74b9a74d34 100644
> >> --- a/fs/exec.c
> >> +++ b/fs/exec.c
> >> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
> >> flush_itimer_signals();
> >> #endif
> >
> > Semi-related (existing behavior): in de_thread(), what keeps the thread
> > group from changing? i.e.:
> >
> > if (thread_group_empty(tsk))
> > goto no_thread_group;
> >
> > /*
> > * Kill all other threads in the thread group.
> > */
> > spin_lock_irq(lock);
> > ... kill other threads under lock ...
> >
> > Why is the thread_group_empty() test not under lock?
> >
>
> A new thread cannot be created when only one thread is executing,
> right?
*face palm* Yes, of course. :) I'm thinking too hard.
--
Kees Cook
On Sun, Mar 8, 2020 at 10:39 PM Eric W. Biederman <[email protected]> wrote:
> These functions have very little to do with de_thread; move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/exec.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index ff74b9a74d34..215d86f77b63 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
> /* we have changed execution domain */
> tsk->exit_signal = SIGCHLD;
>
> -#ifdef CONFIG_POSIX_TIMERS
> - exit_itimers(sig);
> - flush_itimer_signals();
> -#endif
> -
> BUG_ON(!thread_group_leader(tsk));
> return 0;
>
> @@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
> if (retval)
> goto out;
>
> +#ifdef CONFIG_POSIX_TIMERS
> + exit_itimers(me->signal);
> + flush_itimer_signals();
> +#endif
nit: exit_itimers() has a comment referring to de_thread, that should
probably be updated
On Tue, Mar 10, 2020 at 02:43:41PM +0100, Bernd Edlinger wrote:
> This fixes a deadlock in the tracer when tracing a multi-threaded
> application that calls execve while more than one thread is running.
>
> I observed that when running strace on the gcc test suite, it always
> blocks after a while, when expect calls execve, because other threads
> have to be terminated. They send ptrace events, but the strace is no
> longer able to respond, since it is blocked in vm_access.
>
> The deadlock is always happening when strace needs to access the
> tracee's process mmap, while another thread in the tracee starts to
> execve a child process, but that cannot continue until the
> PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:
>
> strace D 0 30614 30584 0x00000000
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> schedule_preempt_disabled+0x15/0x20
> __mutex_lock.isra.13+0x1ec/0x520
> __mutex_lock_killable_slowpath+0x13/0x20
> mutex_lock_killable+0x28/0x30
> mm_access+0x27/0xa0
> process_vm_rw_core.isra.3+0xff/0x550
> process_vm_rw+0xdd/0xf0
> __x64_sys_process_vm_readv+0x31/0x40
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> expect D 0 31933 30876 0x80004003
> Call Trace:
> __schedule+0x3ce/0x6e0
> schedule+0x5c/0xd0
> flush_old_exec+0xc4/0x770
> load_elf_binary+0x35a/0x16c0
> search_binary_handler+0x97/0x1d0
> __do_execve_file.isra.40+0x5d4/0x8a0
> __x64_sys_execve+0x49/0x60
> do_syscall_64+0x64/0x220
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> This changes mm_access to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This patch is based on the following patch by Eric W. Biederman:
> "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks"
> Link: https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Bernd Edlinger <[email protected]>
Cool, yes, on top of the new infrastructure this looks correct to me --
the new mutex wraps mm changes and mm_access() is looking at *drum roll*
the mm! :)
Reviewed-by: Kees Cook <[email protected]>
-Kees
> ---
> kernel/fork.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c12595a..5720ff3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> struct mm_struct *mm;
> int err;
>
> - err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> + err = mutex_lock_killable(&task->signal->exec_update_mutex);
> if (err)
> return ERR_PTR(err);
>
> @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode)
> mmput(mm);
> mm = ERR_PTR(-EACCES);
> }
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + mutex_unlock(&task->signal->exec_update_mutex);
>
> return mm;
> }
> --
> 1.9.1
--
Kees Cook
Jann Horn <[email protected]> writes:
> On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <[email protected]> wrote:
>> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <[email protected]> wrote:
>> > Jann Horn <[email protected]> writes:
>> > > On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
>> > >> During exec some file descriptors are closed and the files struct is
>> > >> unshared. But all of that can happen at other times and it has the
>> > >> same protections during exec as at ordinary times. So stop taking the
>> > >> cred_guard_mutex as it is useless.
>> > >>
>> > >> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
>> > >> prone, as it is held in several places while waiting possibly indefinitely
>> > >> for userspace to do something.
> [...]
>> > > If you make this change, then if this races with execution of a setuid
>> > > program that afterwards e.g. opens a unix domain socket, an attacker
>> > > will be able to steal that socket and inject messages into
>> > > communication with things like DBus. procfs currently has the same
>> > > race, and that still needs to be fixed, but at least procfs doesn't
>> > > let you open things like sockets because they don't have a working
>> > > ->open handler, and it enforces the normal permission check for
>> > > opening files.
>> >
>> > It isn't only exec that can change credentials. Do we need a lock for
>> > changing credentials?
> [...]
>> > If we need a lock around credential change let's design and build that.
>> > Having a mismatch between what a lock is designed to do, and what
>> > people use it for can only result in other bugs as people get confused.
>>
>> Hmm... what benefits do we get from making it a separate lock? I guess
>> it would allow us to make it a per-task lock instead of a
>> signal_struct-wide one? That might be helpful...
>
> But actually, isn't the core purpose of the cred_guard_mutex to guard
> against concurrent credential changes anyway? That's what almost
> everyone uses it for, and it's in the name...
Having been through all of the users, nope.
Maybe someone tried to repurpose it for that. I haven't traced through
when it was renamed from cred_exec_mutex to
cred_guard_mutex.
The original purpose was to make exec and ptrace deadlock. But it
was seen as being there to allow safely calculating the new credentials
before the point of no return. Because whether a process is ptraced or not
affects the new credential calculations. Unfortunately offering that
guarantee fundamentally leads to deadlock.
So ptrace_attach and seccomp use the cred_guard_mutex to guarantee
a deadlock.
The common use is to take cred_guard_mutex to guard the window when
credentials and process details are out of sync in exec. But there
is at least do_io_accounting that seems to have the same justification
for holding __pidfd_fget.
With effort I suspect we can replace exec_change_mutex with task_lock.
When we are guaranteed to be single threaded placing exec_change_mutex
in signal_struct doesn't really help us (except maybe in some races?).
The deep problem is no one really understands cred_guard_mutex so it is
a mess. Code with poorly defined semantics is always wrong somewhere
for someone. Which is part of why I am attacking this and having the
conversations to make certain I understand what is going on.
I see your point about commit_creds making a process undumpable. So in
practice it really is only exec that changes creds in a way that
ptrace_may_access will allow the process to be inspected.
So I guess for now the practical non-regressing course is to change
everything to my exec_change_mutex, removing the deadlock. Then we
figure out how to cleanly deal with the races inspecting a process with
changing credentials has.
Eric
Kees Cook <[email protected]> writes:
> On Mon, Mar 09, 2020 at 02:02:37PM -0500, Eric W. Biederman wrote:
>> exec: Add exec_update_mutex to replace cred_guard_mutex
>>
>> The cred_guard_mutex is problematic as it is held over possibly
>> indefinite waits for userspace. The possible indefinite waits for
>> userspace that I have identified are: The cred_guard_mutex is held in
>> PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
>> held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
>> cred_guard_mutex is held over "get_user(futex_offset, ...)" in
>> exit_robust_list. The cred_guard_mutex is held over copy_strings.
>
> I suspect you're not trying to make a comprehensive list here, but do
> you want to mention seccomp too (since it's yet another weird case).
I was calling out all of the places I have found so far where
cred_guard_mutex is held over waiting for userspace to maybe do
something. Those places are what cause our deadlocks.
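To make that reasoning concrete, here is a minimal userspace model, not kernel code. The phase names and helper functions are hypothetical, chosen only to mirror the waits listed in the quoted commit message; the model assumes cred_guard_mutex covers all of exec while exec_update_mutex covers only the update of mm and creds.

```c
#include <stdbool.h>

/* Hypothetical phases of execve, named after the waits listed above. */
enum exec_phase {
	PHASE_COPY_STRINGS,	/* may page-fault on userspace memory */
	PHASE_DE_THREAD,	/* may wait for a tracer on PTRACE_EVENT_EXIT */
	PHASE_UPDATE_MM_CREDS,	/* mm and creds are briefly out of sync */
	NR_PHASES
};

enum exec_lock { CRED_GUARD_MUTEX, EXEC_UPDATE_MUTEX };

/* True if the phase can block indefinitely waiting for userspace. */
bool phase_waits_for_userspace(enum exec_phase p)
{
	return p == PHASE_COPY_STRINGS || p == PHASE_DE_THREAD;
}

/* Which phases each lock covers in this model (an assumption). */
bool lock_held_in_phase(enum exec_lock l, enum exec_phase p)
{
	if (l == CRED_GUARD_MUTEX)
		return true;			/* held across all of exec */
	return p == PHASE_UPDATE_MM_CREDS;	/* only the update itself */
}

/* A lock is deadlock prone if it is ever held across a userspace wait. */
bool lock_is_deadlock_prone(enum exec_lock l)
{
	for (int p = 0; p < NR_PHASES; p++)
		if (lock_held_in_phase(l, (enum exec_phase)p) &&
		    phase_waits_for_userspace((enum exec_phase)p))
			return true;
	return false;
}
```

In this model lock_is_deadlock_prone(CRED_GUARD_MUTEX) is true and lock_is_deadlock_prone(EXEC_UPDATE_MUTEX) is false, which is the property the new mutex is meant to have.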
>> [...]
>> Holding a mutex over any of those possibly indefinite waits for
>> userspace does not appear necessary. Add exec_update_mutex that will
>> just cover updating the process during exec where the permissions and
>> the objects pointed to by the task struct may be out of sync.
>
> Should the specific resources be pointed out here? creds, mm, ... ?
>
> But otherwise, yup, looks sane:
Probably not. The design is if exec changes it we will hold the
cred_guard_mutex over it, so things are semi-atomic.
> Reviewed-by: Kees Cook <[email protected]>
Eric
Jann Horn <[email protected]> writes:
> On Sun, Mar 8, 2020 at 10:39 PM Eric W. Biederman <[email protected]> wrote:
>> These functions have very little to do with de_thread; move them out
>> of de_thread and into flush_old_exec proper so it can be more clearly
>> seen what flush_old_exec is doing.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/exec.c | 10 +++++-----
>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index ff74b9a74d34..215d86f77b63 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1189,11 +1189,6 @@ static int de_thread(struct task_struct *tsk)
>> /* we have changed execution domain */
>> tsk->exit_signal = SIGCHLD;
>>
>> -#ifdef CONFIG_POSIX_TIMERS
>> - exit_itimers(sig);
>> - flush_itimer_signals();
>> -#endif
>> -
>> BUG_ON(!thread_group_leader(tsk));
>> return 0;
>>
>> @@ -1277,6 +1272,11 @@ int flush_old_exec(struct linux_binprm * bprm)
>> if (retval)
>> goto out;
>>
>> +#ifdef CONFIG_POSIX_TIMERS
>> + exit_itimers(me->signal);
>> + flush_itimer_signals();
>> +#endif
>
> nit: exit_itimers() has a comment referring to de_thread, that should
> probably be updated
Good point.
Eric
Kees Cook <[email protected]> writes:
> On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
>> Further this consolidates all of the possible indefinite waits for
>> userspace together at the top of flush_old_exec. The possible wait
>> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>> to be resolved in clear_child_tid, and the possible wait for a page
>> fault in exit_robust_list.
>
> I forgot to mention, just as a point of clarity, there are lots of
> other page faults possible, but they're _before_ flush_old_exec()
> (i.e. all the copy_strings() calls). Is it worth clarifying this to
> "before or at the top of flush_old_exec()" or do you mean something
> else? (And as always: perhaps expand flush_old_exec()'s comment to
> describe the newly intended state.)
Yes. Before or at the start of flush_old_exec where the mutex
is taken. That is the point. I will see if I can come up with
an appropriate comment.
Eric
On Sun, Mar 08, 2020 at 04:35:26PM -0500, Eric W. Biederman wrote:
>
> Make it clear that current only needs to be computed once in
> flush_old_exec. This may have some efficiency improvements and it
> makes the code easier to change.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Acked-by: Christian Brauner <[email protected]>
On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <[email protected]> wrote:
> The cred_guard_mutex is problematic. The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
> threads are killed. The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over possibly indefinite waits for userspace.
>
> Add exec_update_mutex that is only held while exec updates the process
> with the new contents of exec, for use by code that must not be
> confused by exec changing the mm and the cred in ways that cannot
> happen during ordinary execution of a process.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_update_mutex one by one. This lets us move forward while still
> being careful and not introducing any regressions.
[...]
> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
> return -EINTR;
> }
> }
> +
> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> + if (ret)
> + return ret;
We're already holding the old mmap_sem, and now nest the
exec_update_mutex inside it; but then while still holding the
exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
at least lockdep will be unhappy, and I'm not sure whether it's an
actual problem or not.
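The ordering concern can be sketched as a tiny lock-class order tracker, loosely in the spirit of what lockdep checks. This is a userspace model only: model_acquire() and model_release() are hypothetical helpers and nothing here is the kernel's lockdep API. The sketch also assumes the old mmap_sem is dropped before mmput() runs; the inversion is reported either way.

```c
#include <stdbool.h>

enum lock_class { MMAP_SEM, EXEC_UPDATE_MUTEX, NR_CLASSES };

/* order_seen[a][b] is true once class b was taken while class a was held. */
bool order_seen[NR_CLASSES][NR_CLASSES];
int held_count[NR_CLASSES];

/* Returns false if taking `next` inverts a previously recorded order. */
bool model_acquire(enum lock_class next)
{
	bool ok = true;

	for (int h = 0; h < NR_CLASSES; h++) {
		if (!held_count[h] || h == (int)next)
			continue;
		if (order_seen[next][h])
			ok = false;	/* h was earlier taken under next */
		order_seen[h][next] = true;
	}
	held_count[next]++;
	return ok;
}

void model_release(enum lock_class c)
{
	held_count[c]--;
}
```

Replaying the sequence described above: model_acquire(MMAP_SEM), then model_acquire(EXEC_UPDATE_MUTEX), then model_release(MMAP_SEM), then model_acquire(MMAP_SEM) for the mmput() path. The last call returns false because both orders between the two classes have now been recorded. Whether that is a real deadlock or only a class-level false positive is exactly the open question in the mail.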
On Sun, Mar 08, 2020 at 04:36:17PM -0500, Eric W. Biederman wrote:
>
> This makes the code clearer and makes it easier to implement a mutex
> that is not taken over any locations that may block indefinitely waiting
> for userspace.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/exec.c | 39 ++++++++++++++++++++++++++-------------
> 1 file changed, 26 insertions(+), 13 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index c3f34791f2f0..ff74b9a74d34 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1194,6 +1194,23 @@ static int de_thread(struct task_struct *tsk)
> flush_itimer_signals();
> #endif
>
> + BUG_ON(!thread_group_leader(tsk));
> + return 0;
> +
> +killed:
> + /* protects against exit_notify() and __exit_signal() */
> + read_lock(&tasklist_lock);
> + sig->group_exit_task = NULL;
> + sig->notify_count = 0;
> + read_unlock(&tasklist_lock);
> + return -EAGAIN;
> +}
> +
> +
> +static int unshare_sighand(struct task_struct *me)
> +{
> + struct sighand_struct *oldsighand = me->sighand;
> +
> if (refcount_read(&oldsighand->count) != 1) {
> struct sighand_struct *newsighand;
> /*
> @@ -1210,23 +1227,13 @@ static int de_thread(struct task_struct *tsk)
>
> write_lock_irq(&tasklist_lock);
> spin_lock(&oldsighand->siglock);
> - rcu_assign_pointer(tsk->sighand, newsighand);
> + rcu_assign_pointer(me->sighand, newsighand);
> spin_unlock(&oldsighand->siglock);
> write_unlock_irq(&tasklist_lock);
>
> __cleanup_sighand(oldsighand);
> }
This is fine for now but we share an awful lot of code with
copy_sighand(). We should earmark this to look into consolidating the
core operations into a common helper called from both copy_sighand() and
unshare_sighand() maybe even dumbing it down to one helper. But not
needed for now.
Otherwise:
Acked-by: Christian Brauner <[email protected]>
Kees Cook <[email protected]> writes:
> On Sun, Mar 08, 2020 at 04:38:00PM -0500, Eric W. Biederman wrote:
>>
>> I have read through the code in exec_mmap and I do not see anything
>> that depends on sighand or the sighand lock, or on signals in any way
>> so this should be safe.
>>
>> This rearrangement of code has two significant benefits. It makes
>> the determination of passing the point of no return by testing bprm->mm
>> accurate. All failures prior to that point in flush_old_exec are
>> either truly recoverable or they are fatal.
>
> Agreed. Though I see a use of "current", which maybe you want to
> parameterize to a "me" argument in acct_arg_size(). (Though looking at
> the callers, perhaps there is no benefit?)
My testing suggests there is a small benefit on x86.
The code is just "#define current get_current()"
and get_current() resolves into a read of "%gs:current_task".
Looking at the code, I find gcc can sometimes optimize the
read away when the reads are close together in the source code. But
otherwise gcc does not manage to optimize the extra read of
"%gs:current_task" away.
So I think things are much much better than they used to be,
code generation wise. But it still helps to cache current
in a local variable.
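A small userspace model of the difference, with the per-CPU read replaced by a counted function call. Everything here is a hypothetical stand-in: struct task, get_current_model() and the percpu_reads counter are not kernel names, and the counter only models the worst case where the compiler merges none of the reads.

```c
/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct task {
	unsigned int flags;
	int exit_signal;
};

struct task fake_task;
unsigned long percpu_reads;	/* counts modelled reads of the per-CPU slot */

/* Models get_current(): each call stands for one "%gs:current_task" read. */
struct task *get_current_model(void)
{
	percpu_reads++;
	return &fake_task;
}

/* Looks the task up at every use: three modelled reads. */
void update_uncached(void)
{
	get_current_model()->flags = 0;
	get_current_model()->exit_signal = 17;
	get_current_model()->flags |= 1;
}

/* Looks the task up once and reuses the local, as flush_old_exec does with "me". */
void update_cached(void)
{
	struct task *me = get_current_model();

	me->flags = 0;
	me->exit_signal = 17;
	me->flags |= 1;
}
```

After resetting percpu_reads to zero, update_uncached() leaves it at 3 and update_cached() leaves it at 1, with the same final contents in fake_task.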
>> Further this consolidates all of the possible indefinite waits for
>> userspace together at the top of flush_old_exec. The possible wait
>> for a ptracer on PTRACE_EVENT_EXIT, the possible wait for a page fault
>> to be resolved in clear_child_tid, and the possible wait for a page
>> fault in exit_robust_list.
>>
>> This consolidation allows the creation of a mutex to replace
>> cred_guard_mutex that is not held over possible indefinite userspace
>> waits. Which will allow removing deadlock scenarios from the kernel.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/exec.c | 24 ++++++++++++------------
>> 1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 215d86f77b63..d820a7272a76 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1272,18 +1272,6 @@ int flush_old_exec(struct linux_binprm * bprm)
>> if (retval)
>> goto out;
>>
>> -#ifdef CONFIG_POSIX_TIMERS
>> - exit_itimers(me->signal);
>> - flush_itimer_signals();
>> -#endif
>
> I think this comment:
>
> /*
> * This is called by do_exit or de_thread, only when there are no more
> * references to the shared signal_struct.
> */
> void exit_itimers(struct signal_struct *sig)
>
> Refers to there being other threads, yes? Not that the signal table is
> private yet?
The signal table is in sighand_struct.
So yes that refers to the other threads being gone.
>> -
>> - /*
>> - * Make the signal table private.
>> - */
>> - retval = unshare_sighand(me);
>> - if (retval)
>> - goto out;
>> -
>> /*
>> * Must be called _before_ exec_mmap() as bprm->mm is
>> * not visibile until then. This also enables the update
>> @@ -1307,6 +1295,18 @@ int flush_old_exec(struct linux_binprm * bprm)
>> */
>> bprm->mm = NULL;
>>
>> +#ifdef CONFIG_POSIX_TIMERS
>> + exit_itimers(me->signal);
>> + flush_itimer_signals();
>> +#endif
>
> I've mostly convinced myself that there are no "side-effects" from having
> these timers expire as the mm is going away. I think some kind of comment
> of that intent should be explicitly stated here above the timer work.
The timers can at most generate signals. And we are not handling
signals in the middle of exec.
So the only possible interaction would be to set a timeout and then try
exec, and have the timer kill the caller.
Maybe we get a killable signal from a scenario like that and maybe this
changes the time before the timer expires into the dangerous zone.
But that is all I can think of.
We have to return to the edge of userspace before any signals are
delivered.
> Beyond that:
>
> Reviewed-by: Kees Cook <[email protected]>
>
> -Kees
>
>> +
>> + /*
>> + * Make the signal table private.
>> + */
>> + retval = unshare_sighand(me);
>> + if (retval)
>> + goto out;
>> +
>> set_fs(USER_DS);
>> me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
>> PF_NOFREEZE | PF_NO_SETAFFINITY);
>> --
>> 2.25.0
>>
Eric
On Sun, Mar 08, 2020 at 04:36:55PM -0500, Eric W. Biederman wrote:
>
> These functions have very little to do with de_thread; move them out
> of de_thread and into flush_old_exec proper so it can be more clearly
> seen what flush_old_exec is doing.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Bernd Edlinger <[email protected]> writes:
> On 3/10/20 8:06 PM, Eric W. Biederman wrote:
>> Bernd Edlinger <[email protected]> writes:
>>
>>> This changes do_io_accounting to use the new exec_update_mutex
>>> instead of cred_guard_mutex.
>>>
>>> This fixes possible deadlocks when the tracer is accessing
>>> /proc/$pid/io for instance.
>>>
>>> This should be safe, as the credentials are only used for reading.
>>
>> This is an improvement.
>>
>> We probably want to do this just as an incremental step in making things
>> better, but perhaps I am blind: I am not finding the reason for
>> guarding this with the cred_guard_mutex to be at all persuasive.
>>
>> I think moving the ptrace_may_access check down to after the
>> unlock_task_sighand would be just as effective at addressing the
>> concerns raised in the original commit. I think the task_lock provides
>> all of the barrier we need to make it safe to move the ptrace_may_access
>> checks.
>>
>> The reason I say this is I don't see exec changing ->ioac. Just
>> performing some I/O which would update the io accounting statistics.
>>
>
> Maybe the suid executable is starting up and doing io or not,
> and what the program does immediately at startup is a secret
> that we want to keep, but evil eve wants to find out.
> eve is using /proc/alice/io to do that.
>
> It is a bit constructed, but seems like a security concern.
> When we keep the exec_update_mutex while collecting the data, we
> cannot see any io of the new process when the new credentials
> don't allow that.
Jann Horn has convinced me we should just convert these to the
exec_change_mutex today. Because while not 100% correct in theory, the
only really interesting case is exec. So the code does something
interesting and worthwhile, and mostly correct. The last thing I want
to do is to cause an unnecessary regression.
Eric
On Tue, Mar 10, 2020 at 03:57:35PM -0500, Eric W. Biederman wrote:
> Jann Horn <[email protected]> writes:
>
> > On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <[email protected]> wrote:
> >> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <[email protected]> wrote:
> >> > Jann Horn <[email protected]> writes:
> >> > > On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
> >> > >> During exec some file descriptors are closed and the files struct is
> >> > >> unshared. But all of that can happen at other times and it has the
> >> > >> same protections during exec as at ordinary times. So stop taking the
> >> > >> cred_guard_mutex as it is useless.
> >> > >>
> >> > >> Furthermore the cred_guard_mutex is a bad idea because it is deadlock
> >> > >> prone, as it is held in several places while waiting possibly indefinitely
> >> > >> for userspace to do something.
> > [...]
> >> > > If you make this change, then if this races with execution of a setuid
> >> > > program that afterwards e.g. opens a unix domain socket, an attacker
> >> > > will be able to steal that socket and inject messages into
> >> > > communication with things like DBus. procfs currently has the same
> >> > > race, and that still needs to be fixed, but at least procfs doesn't
> >> > > let you open things like sockets because they don't have a working
> >> > > ->open handler, and it enforces the normal permission check for
> >> > > opening files.
> >> >
> >> > It isn't only exec that can change credentials. Do we need a lock for
> >> > changing credentials?
> > [...]
> >> > If we need a lock around credential change let's design and build that.
> >> > Having a mismatch between what a lock is designed to do, and what
> >> > people use it for can only result in other bugs as people get confused.
> >>
> >> Hmm... what benefits do we get from making it a separate lock? I guess
> >> it would allow us to make it a per-task lock instead of a
> >> signal_struct-wide one? That might be helpful...
> >
> > But actually, isn't the core purpose of the cred_guard_mutex to guard
> > against concurrent credential changes anyway? That's what almost
> > everyone uses it for, and it's in the name...
>
> Having been through all of the users, nope.
>
> Maybe someone tried to repurpose it for that. I haven't traced through
> when it was renamed from cred_exec_mutex to
> cred_guard_mutex.
>
> The original purpose was to make exec and ptrace deadlock. But it
> was seen as being there to allow safely calculating the new credentials
> before the point of no return. Because whether a process is ptraced or not
> affects the new credential calculations. Unfortunately offering that
> guarantee fundamentally leads to deadlock.
>
> So ptrace_attach and seccomp use the cred_guard_mutex to guarantee
> a deadlock.
>
> The common use is to take cred_guard_mutex to guard the window when
> credentials and process details are out of sync in exec. But there
> is at least do_io_accounting that seems to have the same justification
> for holding __pidfd_fget.
>
> With effort I suspect we can replace exec_change_mutex with task_lock.
> When we are guaranteed to be single threaded placing exec_change_mutex
> in signal_struct doesn't really help us (except maybe in some races?).
>
> The deep problem is no one really understands cred_guard_mutex so it is
> a mess. Code with poorly defined semantics is always wrong somewhere
This is a good point. When discussing patches sensitive to credential
changes, cred_guard_mutex was always introduced as having the purpose of
guarding against concurrent credential changes. And I'm pretty sure that
that's how most people have been using it for quite a long time. I mean,
it's at least the case for seccomp and proc and probably quite a few
more. So the problem seems to me that it has clear _intended_ semantics
that run into issues in all sorts of cases. So if cred_guard_mutex is
not that, then we seem to need to provide something that serves its
intended purpose.
Jann Horn <[email protected]> writes:
> On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <[email protected]> wrote:
>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>> threads are killed. The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over a possible indefinite userspace waits for userspace.
>>
>> Add exec_update_mutex that is only held over exec updating process
>> with the new contents of exec, so that code that needs not to be
>> confused by exec changing the mm and the cred in ways that can not
>> happen during ordinary execution of a process.
>>
>> The plan is to switch the users of cred_guard_mutex to
>> exec_udpate_mutex one by one. This lets us move forward while still
>> being careful and not introducing any regressions.
> [...]
>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>> return -EINTR;
>> }
>> }
>> +
>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> + if (ret)
>> + return ret;
>
> We're already holding the old mmap_sem, and now nest the
> exec_update_mutex inside it; but then while still holding the
> exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
> which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
> at least lockdep will be unhappy, and I'm not sure whether it's an
> actual problem or not.
Good point. I should double check the lock ordering here with mmap_sem.
It doesn't look like mmput takes mmap_sem, but there might still be a
lock inversion of some kind here, at least as far as lockdep is
concerned, and we don't want anything like that.
Eric
On Tue, Mar 10, 2020 at 02:44:01PM +0100, Bernd Edlinger wrote:
> This adds test cases for ptrace deadlocks.
>
> Additionally fixes a compile problem in get_syscall_info.c,
> observed with gcc-4.8.4:
>
> get_syscall_info.c: In function 'get_syscall_info':
> get_syscall_info.c:93:3: error: 'for' loop initial declarations are only
> allowed in C99 mode
> for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) {
> ^
> get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile
> your code
*discomfort noises* (see below)
>
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> tools/testing/selftests/ptrace/Makefile | 4 +-
> tools/testing/selftests/ptrace/vmaccess.c | 86 +++++++++++++++++++++++++++++++
> 2 files changed, 88 insertions(+), 2 deletions(-)
> create mode 100644 tools/testing/selftests/ptrace/vmaccess.c
>
> diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
> index c0b7f89..2f1f532 100644
> --- a/tools/testing/selftests/ptrace/Makefile
> +++ b/tools/testing/selftests/ptrace/Makefile
> @@ -1,6 +1,6 @@
> # SPDX-License-Identifier: GPL-2.0-only
> -CFLAGS += -iquote../../../../include/uapi -Wall
> +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
This isn't the common solution in the kernel (the variable declaration
would just be lifted out of the loop), but as it's selftest code, which
does lots of special things ... I *guess* this is okay.
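For reference, the lifted-declaration form mentioned above might look like the
following minimal sketch. The helper name sum_args and its parameters are
hypothetical and not taken from get_syscall_info.c; the only point is that a
counter declared at the top of the block compiles in pre-C99 modes as well.

```c
#include <stddef.h>

/* Hypothetical helper for illustration; not from get_syscall_info.c. */
unsigned long sum_args(const unsigned long *args, size_t n)
{
	size_t i;		/* declared before the loop, valid in C89 too */
	unsigned long sum = 0;

	for (i = 0; i < n; ++i)
		sum += args[i];

	return sum;
}
```

Written this way, the file would not need -std=c99 for the loop itself.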
>
> -TEST_GEN_PROGS := get_syscall_info peeksiginfo
> +TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
I love having this deadlock test added to the selftests.
I think I need to make an improvement to the test harness, though, as
the failure mode right now just blows up after the 30 second timeout
and leaves this deadlocked:
$ ./vmaccess
[==========] Running 2 tests from 1 test cases.
[ RUN ] global.vmaccess
Alarm clock
$ ps
PID TTY TIME CMD
2605 pts/0 00:00:00 bash
23360 pts/0 00:00:00 vmaccess
23361 pts/0 00:00:00 vmaccess
23363 pts/0 00:00:00 ps
But that's mostly unrelated to this code.
Reviewed-by: Kees Cook <[email protected]>
-Kees
>
> include ../lib.mk
> diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
> new file mode 100644
> index 0000000..4db327b
> --- /dev/null
> +++ b/tools/testing/selftests/ptrace/vmaccess.c
> @@ -0,0 +1,86 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (c) 2020 Bernd Edlinger <[email protected]>
> + * All rights reserved.
> + *
> + * Check whether /proc/$pid/mem can be accessed without causing deadlocks
> + * when de_thread is blocked with ->cred_guard_mutex held.
> + */
> +
> +#include "../kselftest_harness.h"
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <unistd.h>
> +#include <sys/ptrace.h>
> +
> +static void *thread(void *arg)
> +{
> + ptrace(PTRACE_TRACEME, 0, 0L, 0L);
> + return NULL;
> +}
> +
> +TEST(vmaccess)
> +{
> + int f, pid = fork();
> + char mm[64];
> +
> + if (!pid) {
> + pthread_t pt;
> +
> + pthread_create(&pt, NULL, thread, NULL);
> + pthread_join(pt, NULL);
> + execlp("true", "true", NULL);
> + }
> +
> + sleep(1);
> + sprintf(mm, "/proc/%d/mem", pid);
> + f = open(mm, O_RDONLY);
> + ASSERT_GE(f, 0);
> + close(f);
> + f = kill(pid, SIGCONT);
> + ASSERT_EQ(f, 0);
> +}
> +
> +TEST(attach)
> +{
> + int s, k, pid = fork();
> +
> + if (!pid) {
> + pthread_t pt;
> +
> + pthread_create(&pt, NULL, thread, NULL);
> + pthread_join(pt, NULL);
> + execlp("sleep", "sleep", "2", NULL);
> + }
> +
> + sleep(1);
> + k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> + ASSERT_EQ(errno, EAGAIN);
> + ASSERT_EQ(k, -1);
> + k = waitpid(-1, &s, WNOHANG);
> + ASSERT_NE(k, -1);
> + ASSERT_NE(k, 0);
> + ASSERT_NE(k, pid);
> + ASSERT_EQ(WIFEXITED(s), 1);
> + ASSERT_EQ(WEXITSTATUS(s), 0);
> + sleep(1);
> + k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
> + ASSERT_EQ(k, 0);
> + k = waitpid(-1, &s, 0);
> + ASSERT_EQ(k, pid);
> + ASSERT_EQ(WIFSTOPPED(s), 1);
> + ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
> + k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
> + ASSERT_EQ(k, 0);
> + k = waitpid(-1, &s, 0);
> + ASSERT_EQ(k, pid);
> + ASSERT_EQ(WIFEXITED(s), 1);
> + ASSERT_EQ(WEXITSTATUS(s), 0);
> + k = waitpid(-1, NULL, 0);
> + ASSERT_EQ(k, -1);
> + ASSERT_EQ(errno, ECHILD);
> +}
> +
> +TEST_HARNESS_MAIN
> --
> 1.9.1
--
Kees Cook
On Tue, Mar 10, 2020 at 02:44:01PM +0100, Bernd Edlinger wrote:
> This adds test cases for ptrace deadlocks.
>
> Additionally fixes a compile problem in get_syscall_info.c,
> observed with gcc-4.8.4:
>
> get_syscall_info.c: In function 'get_syscall_info':
> get_syscall_info.c:93:3: error: 'for' loop initial declarations are only
> allowed in C99 mode
> for (unsigned int i = 0; i < ARRAY_SIZE(args); ++i) {
> ^
> get_syscall_info.c:93:3: note: use option -std=c99 or -std=gnu99 to compile
> your code
[...]
> @@ -1,6 +1,6 @@
> # SPDX-License-Identifier: GPL-2.0-only
> -CFLAGS += -iquote../../../../include/uapi -Wall
> +CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
Wouldn't it be better to choose -std=gnu99 over -std=c99?
--
ldv
On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
<[email protected]> wrote:
> Jann Horn <[email protected]> writes:
> > On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <[email protected]> wrote:
> >> The cred_guard_mutex is problematic. The cred_guard_mutex is held
> >> over the userspace accesses as the arguments from userspace are read.
> >> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
> >> threads are killed. The cred_guard_mutex is held over
> >> "put_user(0, tsk->clear_child_tid)" in exit_mm().
> >>
> >> Any of those can result in deadlock, as the cred_guard_mutex is held
> >> over a possible indefinite userspace waits for userspace.
> >>
> >> Add exec_update_mutex that is only held over exec updating process
> >> with the new contents of exec, so that code that needs not to be
> >> confused by exec changing the mm and the cred in ways that can not
> >> happen during ordinary execution of a process.
> >>
> >> The plan is to switch the users of cred_guard_mutex to
> >> exec_udpate_mutex one by one. This lets us move forward while still
> >> being careful and not introducing any regressions.
> > [...]
> >> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
> >> return -EINTR;
> >> }
> >> }
> >> +
> >> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> >> + if (ret)
> >> + return ret;
> >
> > We're already holding the old mmap_sem, and now nest the
> > exec_update_mutex inside it; but then while still holding the
> > exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
> > which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
> > at least lockdep will be unhappy, and I'm not sure whether it's an
> > actual problem or not.
>
> Good point. I should double check the lock ordering here with mmap_sem.
> It doesn't look like mmput takes mmap_sem
You sure about that? mmput() -> __mmput() -> ksm_exit() ->
__ksm_exit() -> down_write(&mm->mmap_sem)
Or also: mmput() -> __mmput() -> khugepaged_exit() ->
__khugepaged_exit() -> down_write(&mm->mmap_sem)
Or is there a reason why those paths can't happen?
Jann Horn <[email protected]> writes:
> On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
> <[email protected]> wrote:
>> Jann Horn <[email protected]> writes:
>> > On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <[email protected]> wrote:
>> >> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>> >> over the userspace accesses as the arguments from userspace are read.
>> >> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>> >> threads are killed. The cred_guard_mutex is held over
>> >> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>> >>
>> >> Any of those can result in deadlock, as the cred_guard_mutex is held
>> >> over a possible indefinite userspace waits for userspace.
>> >>
>> >> Add exec_update_mutex that is only held over exec updating process
>> >> with the new contents of exec, so that code that needs not to be
>> >> confused by exec changing the mm and the cred in ways that can not
>> >> happen during ordinary execution of a process.
>> >>
>> >> The plan is to switch the users of cred_guard_mutex to
>> >> exec_udpate_mutex one by one. This lets us move forward while still
>> >> being careful and not introducing any regressions.
>> > [...]
>> >> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>> >> return -EINTR;
>> >> }
>> >> }
>> >> +
>> >> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> >> + if (ret)
>> >> + return ret;
>> >
>> > We're already holding the old mmap_sem, and now nest the
>> > exec_update_mutex inside it; but then while still holding the
>> > exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
>> > which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
>> > at least lockdep will be unhappy, and I'm not sure whether it's an
>> > actual problem or not.
>>
>> Good point. I should double check the lock ordering here with mmap_sem.
>> It doesn't look like mmput takes mmap_sem
>
> You sure about that? mmput() -> __mmput() -> ksm_exit() ->
> __ksm_exit() -> down_write(&mm->mmap_sem)
>
> Or also: mmput() -> __mmput() -> khugepaged_exit() ->
> __khugepaged_exit() -> down_write(&mm->mmap_sem)
>
> Or is there a reason why those paths can't happen?
Clearly I didn't look far enough.
I will adjust this so that exec_update_mutex is taken before mmap_sem.
Anything else is just asking for trouble.
Eric
On 3/10/20 9:22 PM, Bernd Edlinger wrote:
> On 3/10/20 9:10 PM, Jann Horn wrote:
>> On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <[email protected]> wrote:
>>> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <[email protected]> wrote:
>>>> Jann Horn <[email protected]> writes:
>>>>> On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
>>>>>> During exec some file descriptors are closed and the files struct is
>>>>>> unshared. But all of that can happen at other times and it has the
>>>>>> same protections during exec as at ordinary times. So stop taking the
>>>>>> cred_guard_mutex as it is useless.
>>>>>>
>>>>>> Furthermore he cred_guard_mutex is a bad idea because it is deadlock
>>>>>> prone, as it is held in serveral while waiting possibly indefinitely
>>>>>> for userspace to do something.
>> [...]
>>>>> If you make this change, then if this races with execution of a setuid
>>>>> program that afterwards e.g. opens a unix domain socket, an attacker
>>>>> will be able to steal that socket and inject messages into
>>>>> communication with things like DBus. procfs currently has the same
>>>>> race, and that still needs to be fixed, but at least procfs doesn't
>>>>> let you open things like sockets because they don't have a working
>>>>> ->open handler, and it enforces the normal permission check for
>>>>> opening files.
>>>>
>>>> It isn't only exec that can change credentials. Do we need a lock for
>>>> changing credentials?
>> [...]
>>>> If we need a lock around credential change let's design and build that.
>>>> Having a mismatch between what a lock is designed to do, and what
>>>> people use it for can only result in other bugs as people get confused.
>>>
>>> Hmm... what benefits do we get from making it a separate lock? I guess
>>> it would allow us to make it a per-task lock instead of a
>>> signal_struct-wide one? That might be helpful...
>>
>> But actually, isn't the core purpose of the cred_guard_mutex to guard
>> against concurrent credential changes anyway? That's what almost
>> everyone uses it for, and it's in the name...
>>
>
> The main raison d'etre of exec_update_mutex is to get a consistent
> view of task->mm and task credentials.
>
> The reason why you want the cred_guard_mutex is that some action
> is changing the resulting credentials that the execve is about
> to install, and that is the data flow in the opposite direction.
>
So in other words, you need the exec_update_mutex when you
access another thread's credentials and possibly the mmap at the
same time.
You need the cred_guard_mutex when you *change* the credentials
of another thread. (Where you cannot be sure that the other thread
just started to execve something)
You need no mutex at all when you are just accessing or
even changing the credentials of the current thread. (If another
thread is doing execve, your task will be killed, and whether
or not the credentials were changed does not matter any more)
>
> Bernd.
>
On 3/11/20 1:15 AM, Eric W. Biederman wrote:
> Jann Horn <[email protected]> writes:
>
>> On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
>> <[email protected]> wrote:
>>> Jann Horn <[email protected]> writes:
>>>> On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <[email protected]> wrote:
>>>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>>> threads are killed. The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>> over a possible indefinite userspace waits for userspace.
>>>>>
>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>> with the new contents of exec, so that code that needs not to be
>>>>> confused by exec changing the mm and the cred in ways that can not
>>>>> happen during ordinary execution of a process.
>>>>>
>>>>> The plan is to switch the users of cred_guard_mutex to
>>>>> exec_udpate_mutex one by one. This lets us move forward while still
>>>>> being careful and not introducing any regressions.
>>>> [...]
>>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>> return -EINTR;
>>>>> }
>>>>> }
>>>>> +
>>>>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>>> + if (ret)
>>>>> + return ret;
>>>>
>>>> We're already holding the old mmap_sem, and now nest the
>>>> exec_update_mutex inside it; but then while still holding the
>>>> exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
>>>> which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
>>>> at least lockdep will be unhappy, and I'm not sure whether it's an
>>>> actual problem or not.
>>>
>>> Good point. I should double check the lock ordering here with mmap_sem.
>>> It doesn't look like mmput takes mmap_sem
>>
>> You sure about that? mmput() -> __mmput() -> ksm_exit() ->
>> __ksm_exit() -> down_write(&mm->mmap_sem)
>>
>> Or also: mmput() -> __mmput() -> khugepaged_exit() ->
>> __khugepaged_exit() -> down_write(&mm->mmap_sem)
>>
>> Or is there a reason why those paths can't happen?
>
> Clearly I didn't look far enough.
>
> I will adjust this so that exec_update_mutex is taken before mmap_sem.
> Anything else is just asking for trouble.
>
Note that vm_access does also mmput under the exec_update_mutex.
So I don't see a huge problem here.
But maybe I missed something.
Bernd.
On Sun, 2020-03-08 at 16:38 -0500, Eric W. Biederman wrote:
> The cred_guard_mutex is problematic. The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
> threads are killed. The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over possibly indefinite waits for userspace.
>
> Add exec_update_mutex that is only held over exec updating process
> with the new contents of exec, so that code that needs not to be
> confused by exec changing the mm and the cred in ways that can not
> happen during ordinary execution of a process can take it instead.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_update_mutex one by one. This lets us move forward while still
> being careful and not introducing any regressions.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
This patch will trigger a warning during boot,
[   19.707214][    T1] pci 0035:01:00.0: enabling device (0545 -> 0547)
[   19.707287][    T1] EEH: Capable adapter found: recovery enabled.
[   19.732541][    T1] cpuidle-powernv: Default stop: psscr = 0x0000000000000330,mask=0x00000000003003ff
[   19.732567][    T1] cpuidle-powernv: Deepest stop: psscr = 0x0000000000300375,mask=0x00000000003003ff
[   19.732598][    T1] cpuidle-powernv: First stop level that may lose SPRs = 0x4
[   19.732617][    T1] cpuidle-powernv: First stop level that may lose timebase = 0x10
[   19.769784][    T1] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[   19.769810][    T1] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[   19.789344][  T718]
[   19.789367][  T718] =====================================
[   19.789379][  T718] WARNING: bad unlock balance detected!
[   19.789393][  T718] 5.6.0-rc5-next-20200311+ #4 Not tainted
[   19.789414][  T718] -------------------------------------
[   19.789426][  T718] kworker/u257:0/718 is trying to release lock (&sig->exec_update_mutex) at:
[   19.789459][  T718] [<c0000000004c6770>] free_bprm+0xe0/0xf0
[   19.789481][  T718] but there are no more locks to release!
[   19.789502][  T718]
[   19.789502][  T718] other info that might help us debug this:
[   19.789537][  T718] 1 lock held by kworker/u257:0/718:
[   19.789558][  T718]  #0: c000001fa8842808 (&sig->cred_guard_mutex){+.+.}, at: __do_execve_file.isra.33+0x1b0/0xda0
[   19.789611][  T718]
[   19.789611][  T718] stack backtrace:
[   19.789645][  T718] CPU: 8 PID: 718 Comm: kworker/u257:0 Not tainted 5.6.0-rc5-next-20200311+ #4
[   19.789681][  T718] Call Trace:
[   19.789703][  T718] [c000000dad8cfa70] [c000000000979b40] dump_stack+0xf4/0x164 (unreliable)
[   19.789742][  T718] [c000000dad8cfac0] [c0000000001c1d78] print_unlock_imbalance_bug+0x118/0x140
[   19.789780][  T718] [c000000dad8cfb40] [c0000000001ceaa0] lock_release+0x270/0x520
[   19.789817][  T718] [c000000dad8cfbf0] [c0000000009a2898] __mutex_unlock_slowpath+0x68/0x400
[   19.789854][  T718] [c000000dad8cfcc0] [c0000000004c6770] free_bprm+0xe0/0xf0
[   19.789900][  T718] [c000000dad8cfcf0] [c0000000004c845c] __do_execve_file.isra.33+0x44c/0xda0
__do_execve_file at fs/exec.c:1904
[   19.789938][  T718] [c000000dad8cfde0] [c0000000001391d8] call_usermodehelper_exec_async+0x218/0x250
[   19.789977][  T718] [c000000dad8cfe20] [c00000000000b748] ret_from_kernel_thread+0x5c/0x74
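One possible reading of the splat, assuming free_bprm() can be reached on an
early failure path where exec_mmap() never ran (bprm->mm would still be NULL
there), is that the "!bprm->mm" test cannot tell "never locked" apart from
"locked and mm handed over". The userspace sketch below only illustrates the
general remedy of recording explicitly that the lock was taken; struct
exec_state and its helpers are hypothetical names, not kernel code and not a
proposed patch.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the exec state; not the kernel's linux_binprm. */
struct exec_state {
	pthread_mutex_t update_lock;
	bool update_locked;	/* set only after the lock was really taken */
};

void exec_state_init(struct exec_state *st)
{
	pthread_mutex_init(&st->update_lock, NULL);
	st->update_locked = false;
}

/* Models the step that takes the lock partway through exec. */
int exec_state_begin_update(struct exec_state *st)
{
	int err = pthread_mutex_lock(&st->update_lock);

	if (err == 0)
		st->update_locked = true;
	return err;
}

/*
 * Cleanup that is safe on every exit path: unlock only what was locked.
 * Returns true if it actually released the lock.
 */
bool exec_state_cleanup(struct exec_state *st)
{
	if (!st->update_locked)
		return false;
	st->update_locked = false;
	pthread_mutex_unlock(&st->update_lock);
	return true;
}
```

With an explicit flag, a cleanup call that happens before the lock was ever
taken is a no-op instead of an unbalanced unlock.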
> ---
> fs/exec.c | 9 +++++++++
> include/linux/sched/signal.h | 9 ++++++++-
> init/init_task.c | 1 +
> kernel/fork.c | 1 +
> 4 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index d820a7272a76..ffeebb1f167b 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
> {
> struct task_struct *tsk;
> struct mm_struct *old_mm, *active_mm;
> + int ret;
>
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
> return -EINTR;
> }
> }
> +
> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> + if (ret)
> + return ret;
> +
> task_lock(tsk);
> active_mm = tsk->active_mm;
> membarrier_exec_mmap(mm);
> @@ -1438,6 +1444,8 @@ static void free_bprm(struct linux_binprm *bprm)
> {
> free_arg_pages(bprm);
> if (bprm->cred) {
> + if (!bprm->mm)
> + mutex_unlock(&current->signal->exec_update_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> abort_creds(bprm->cred);
> }
> @@ -1487,6 +1495,7 @@ void install_exec_creds(struct linux_binprm *bprm)
> * credentials; any time after this it may be unlocked.
> */
> security_bprm_committed_creds(bprm);
> + mutex_unlock(&current->signal->exec_update_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> }
> EXPORT_SYMBOL(install_exec_creds);
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 88050259c466..a29df79540ce 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -224,7 +224,14 @@ struct signal_struct {
>
> struct mutex cred_guard_mutex; /* guard against foreign influences on
> * credential calculations
> - * (notably. ptrace) */
> + * (notably. ptrace)
> + * Deprecated do not use in new code.
> + * Use exec_update_mutex instead.
> + */
> + struct mutex exec_update_mutex; /* Held while task_struct is being
> + * updated during exec, and may have
> + * inconsistent permissions.
> + */
> } __randomize_layout;
>
> /*
> diff --git a/init/init_task.c b/init/init_task.c
> index 9e5cbe5eab7b..bd403ed3e418 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -26,6 +26,7 @@ static struct signal_struct init_signals = {
> .multiprocess = HLIST_HEAD_INIT,
> .rlim = INIT_RLIMITS,
> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> + .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
> #ifdef CONFIG_POSIX_TIMERS
> .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
> .cputimer = {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 60a1295f4384..12896a6ecee6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
> sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>
> mutex_init(&sig->cred_guard_mutex);
> + mutex_init(&sig->exec_update_mutex);
>
> return 0;
> }
On Wed, Mar 11, 2020 at 7:12 AM Bernd Edlinger
<[email protected]> wrote:
> On 3/10/20 9:22 PM, Bernd Edlinger wrote:
> > On 3/10/20 9:10 PM, Jann Horn wrote:
> >> On Tue, Mar 10, 2020 at 9:00 PM Jann Horn <[email protected]> wrote:
> >>> On Tue, Mar 10, 2020 at 8:29 PM Eric W. Biederman <[email protected]> wrote:
> >>>> Jann Horn <[email protected]> writes:
> >>>>> On Tue, Mar 10, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
> >>>>>> During exec some file descriptors are closed and the files struct is
> >>>>>> unshared. But all of that can happen at other times and it has the
> >>>>>> same protections during exec as at ordinary times. So stop taking the
> >>>>>> cred_guard_mutex as it is useless.
> >>>>>>
> >>>>>> Furthermore he cred_guard_mutex is a bad idea because it is deadlock
> >>>>>> prone, as it is held in serveral while waiting possibly indefinitely
> >>>>>> for userspace to do something.
> >> [...]
> >>>>> If you make this change, then if this races with execution of a setuid
> >>>>> program that afterwards e.g. opens a unix domain socket, an attacker
> >>>>> will be able to steal that socket and inject messages into
> >>>>> communication with things like DBus. procfs currently has the same
> >>>>> race, and that still needs to be fixed, but at least procfs doesn't
> >>>>> let you open things like sockets because they don't have a working
> >>>>> ->open handler, and it enforces the normal permission check for
> >>>>> opening files.
> >>>>
> >>>> It isn't only exec that can change credentials. Do we need a lock for
> >>>> changing credentials?
> >> [...]
> >>>> If we need a lock around credential change let's design and build that.
> >>>> Having a mismatch between what a lock is designed to do, and what
> >>>> people use it for can only result in other bugs as people get confused.
> >>>
> >>> Hmm... what benefits do we get from making it a separate lock? I guess
> >>> it would allow us to make it a per-task lock instead of a
> >>> signal_struct-wide one? That might be helpful...
> >>
> >> But actually, isn't the core purpose of the cred_guard_mutex to guard
> >> against concurrent credential changes anyway? That's what almost
> >> everyone uses it for, and it's in the name...
> >>
> >
> > The main reason d'etre of exec_update_mutex is to get a consitent
> > view of task->mm and task credentials.
> > > The reason why you want the cred_guard_mutex, is that some action
> > is changing the resulting credentials that the execve is about
> > to install, and that is the data flow in the opposite direction.
> >
>
> So in other words, you need the exec_update_mutex when you
> access another thread's credentials and possibly the mmap at the
> same time.
Or the file descriptor table, or register state, ...
> You need no mutex at all when you are just accessing or
> even changing the credentials of the current thread. (If another
> thread is doing execve, your task will be killed, and wether
> or not the credentials were changed does not matter any more)
Only if the only access checks you care about are those related to mm access.
Bernd Edlinger <[email protected]> writes:
> On 3/11/20 1:15 AM, Eric W. Biederman wrote:
>> Jann Horn <[email protected]> writes:
>>
>>> On Tue, Mar 10, 2020 at 10:33 PM Eric W. Biederman
>>> <[email protected]> wrote:
>>>> Jann Horn <[email protected]> writes:
>>>>> On Sun, Mar 8, 2020 at 10:41 PM Eric W. Biederman <[email protected]> wrote:
>>>>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>>>> threads are killed. The cred_guard_mutex is held over
>>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>>
>>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>>> over a possible indefinite userspace waits for userspace.
>>>>>>
>>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>>> with the new contents of exec, so that code that needs not to be
>>>>>> confused by exec changing the mm and the cred in ways that can not
>>>>>> happen during ordinary execution of a process.
>>>>>>
>>>>>> The plan is to switch the users of cred_guard_mutex to
>>>>>> exec_udpate_mutex one by one. This lets us move forward while still
>>>>>> being careful and not introducing any regressions.
>>>>> [...]
>>>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>> return -EINTR;
>>>>>> }
>>>>>> }
>>>>>> +
>>>>>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>>>> + if (ret)
>>>>>> + return ret;
>>>>>
>>>>> We're already holding the old mmap_sem, and now nest the
>>>>> exec_update_mutex inside it; but then while still holding the
>>>>> exec_update_mutex, we do mmput(), which can e.g. end up in ksm_exit(),
>>>>> which can do down_write(&mm->mmap_sem) from __ksm_exit(). So I think
>>>>> at least lockdep will be unhappy, and I'm not sure whether it's an
>>>>> actual problem or not.
>>>>
>>>> Good point. I should double check the lock ordering here with mmap_sem.
>>>> It doesn't look like mmput takes mmap_sem
>>>
>>> You sure about that? mmput() -> __mmput() -> ksm_exit() ->
>>> __ksm_exit() -> down_write(&mm->mmap_sem)
>>>
>>> Or also: mmput() -> __mmput() -> khugepaged_exit() ->
>>> __khugepaged_exit() -> down_write(&mm->mmap_sem)
>>>
>>> Or is there a reason why those paths can't happen?
>>
>> Clearly I didn't look far enough.
>>
>> I will adjust this so that exec_update_mutex is taken before mmap_sem.
>> Anything else is just asking for trouble.
>>
>
> Note that vm_access does also mmput under the exec_update_mutex.
> So I don't see a huge problem here.
> But maybe I missed something.
The issue is that, to prevent deadlock, locks must always be taken
in the same order.
Taking mmap_sem then exec_update_mutex at the start of the function,
and then taking exec_update_mutex then mmap_sem in mmput, takes the
two locks in two different orders. Which means that in the right
set of circumstances:
thread1:                        thread2:
obtain mmap_sem                 obtain exec_update_mutex
wait for exec_update_mutex      wait for mmap_sem
Which guarantees that neither thread will make progress.
The fix is easy: I just need to take exec_update_mutex a few lines
earlier.
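To make the ordering rule concrete, here is a minimal userspace sketch of
the kind of check lockdep performs. It is only an illustration: the lock
class names and the order_*() helpers are made up for this mail and are not
kernel APIs. It records which lock was taken while another was held and
flags the first acquisition that inverts a previously seen order.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes standing in for the two kernel locks. */
enum lock_class { MMAP_SEM, EXEC_UPDATE_MUTEX, NR_CLASSES };

/* seen_before[a][b] is true once b has been acquired while a was held. */
static bool seen_before[NR_CLASSES][NR_CLASSES];
static bool held[NR_CLASSES];

/* Forget all recorded orderings and held locks. */
void order_reset(void)
{
	memset(seen_before, 0, sizeof(seen_before));
	memset(held, 0, sizeof(held));
}

/*
 * Record an acquisition of class c. Returns false if it inverts an
 * order recorded earlier, i.e. an ABBA deadlock is possible.
 */
bool order_acquire(enum lock_class c)
{
	bool ok = true;

	assert(c < NR_CLASSES);
	for (int h = 0; h < NR_CLASSES; h++) {
		if (!held[h])
			continue;
		if (seen_before[c][h])
			ok = false;	/* c -> h was seen before, now h -> c */
		seen_before[h][c] = true;
	}
	held[c] = true;
	return ok;
}

/* Record that class c is no longer held. */
void order_release(enum lock_class c)
{
	assert(c < NR_CLASSES);
	held[c] = false;
}
```

Taking MMAP_SEM then EXEC_UPDATE_MUTEX on one path and the reverse on
another makes the second order_acquire() on the later path return false,
which is the situation described above.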
Eric
On Tue, Mar 10, 2020 at 03:57:35PM -0500, Eric W. Biederman wrote:
> So ptrace_attach and seccomp use the cred_guard_mutex to guarantee
> a deadlock.
Well, that's the result, but seccomp uses it because it wants to
be certain that credentials and no_new_privs are changed together
"atomically".
--
Kees Cook
On Tue, Mar 10, 2020 at 02:44:10PM +0100, Bernd Edlinger wrote:
> This removes a duplicate "a" in the comment in process_vm_rw_core.
>
> Signed-off-by: Bernd Edlinger <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
-Kees
> ---
> mm/process_vm_access.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 357aa7b..b3e6eb5 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -204,7 +204,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
> if (!mm || IS_ERR(mm)) {
> rc = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> /*
> - * Explicitly map EACCES to EPERM as EPERM is a more a
> + * Explicitly map EACCES to EPERM as EPERM is a more
> * appropriate error code for process_vw_readv/writev
> */
> if (rc == -EACCES)
> --
> 1.9.1
--
Kees Cook
On Tue, Mar 10, 2020 at 02:44:18PM +0100, Bernd Edlinger wrote:
> This removes an outdated comment in prepare_kernel_cred.
>
> There is no "cred_replace_mutex" any more, so the comment must
> go away.
>
> Signed-off-by: Bernd Edlinger <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
-Kees
> ---
> kernel/cred.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 809a985..71a7926 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -675,8 +675,6 @@ void __init cred_init(void)
> * The caller may change these controls afterwards if desired.
> *
> * Returns the new credentials or NULL if out of memory.
> - *
> - * Does not take, and does not return holding current->cred_replace_mutex.
> */
> struct cred *prepare_kernel_cred(struct task_struct *daemon)
> {
> --
> 1.9.1
--
Kees Cook
On Tue, Mar 10, 2020 at 06:45:32PM +0100, Bernd Edlinger wrote:
> This changes lock_trace to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This fixes possible deadlocks when the trace is accessing
> /proc/$pid/stack for instance.
>
> This should be safe, as the credentials are only used for reading,
> and task->mm is updated on execve under the new exec_update_mutex.
>
> Signed-off-by: Bernd Edlinger <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
-Kees
> ---
> fs/proc/base.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ebea950..4fdfe4f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -403,11 +403,11 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
>
> static int lock_trace(struct task_struct *task)
> {
> - int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> + int err = mutex_lock_killable(&task->signal->exec_update_mutex);
> if (err)
> return err;
> if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + mutex_unlock(&task->signal->exec_update_mutex);
> return -EPERM;
> }
> return 0;
> @@ -415,7 +415,7 @@ static int lock_trace(struct task_struct *task)
>
> static void unlock_trace(struct task_struct *task)
> {
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + mutex_unlock(&task->signal->exec_update_mutex);
> }
>
> #ifdef CONFIG_STACKTRACE
> --
> 1.9.1
--
Kees Cook
On Tue, Mar 10, 2020 at 06:45:47PM +0100, Bernd Edlinger wrote:
> This changes do_io_accounting to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This fixes possible deadlocks when the trace is accessing
> /proc/$pid/io for instance.
>
> This should be safe, as the credentials are only used for reading.
I'd like to see the rationale described better here for why it should be
safe. I'm still not seeing why this is safe here, as we might check
ptrace_may_access() with one cred and then iterate io accounting with a
different credential...
What am I missing?
-Kees
>
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> fs/proc/base.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4fdfe4f..529d0c6 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
> unsigned long flags;
> int result;
>
> - result = mutex_lock_killable(&task->signal->cred_guard_mutex);
> + result = mutex_lock_killable(&task->signal->exec_update_mutex);
> if (result)
> return result;
>
> @@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
> result = 0;
>
> out_unlock:
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + mutex_unlock(&task->signal->exec_update_mutex);
> return result;
> }
>
> --
> 1.9.1
--
Kees Cook
On Tue, Mar 10, 2020 at 06:45:32PM +0100, Bernd Edlinger wrote:
> This changes lock_trace to use the new exec_update_mutex
> instead of cred_guard_mutex.
>
> This fixes possible deadlocks when the trace is accessing
> /proc/$pid/stack for instance.
>
> This should be safe, as the credentials are only used for reading,
> and task->mm is updated on execve under the new exec_update_mutex.
>
> Signed-off-by: Bernd Edlinger <[email protected]>
I have the same question here as in 3/4. I should probably rescind my
Reviewed-by until I'm convinced about the security-safety of this -- why
is this not a race against cred changes?
-Kees
> ---
> fs/proc/base.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ebea950..4fdfe4f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -403,11 +403,11 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
>
> static int lock_trace(struct task_struct *task)
> {
> - int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
> + int err = mutex_lock_killable(&task->signal->exec_update_mutex);
> if (err)
> return err;
> if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + mutex_unlock(&task->signal->exec_update_mutex);
> return -EPERM;
> }
> return 0;
> @@ -415,7 +415,7 @@ static int lock_trace(struct task_struct *task)
>
> static void unlock_trace(struct task_struct *task)
> {
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + mutex_unlock(&task->signal->exec_update_mutex);
> }
>
> #ifdef CONFIG_STACKTRACE
> --
> 1.9.1
--
Kees Cook
On 3/11/20 8:10 PM, Kees Cook wrote:
> On Tue, Mar 10, 2020 at 06:45:32PM +0100, Bernd Edlinger wrote:
>> This changes lock_trace to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This fixes possible deadlocks when the trace is accessing
>> /proc/$pid/stack for instance.
>>
>> This should be safe, as the credentials are only used for reading,
>> and task->mm is updated on execve under the new exec_update_mutex.
>>
>> Signed-off-by: Bernd Edlinger <[email protected]>
>
> I have the same question here as in 3/4. I should probably rescind my
> Reviewed-by until I'm convinced about the security-safety of this -- why
> is this not a race against cred changes?
>
The credentials of a thread that is currently executing execve are already
set in the bprm structure; however, the credentials in the task structure
are not yet changed, and the process memory map stays stable
until the exec_update_mutex is acquired.
What these functions do is access the call stack of the
process before the new executable is actually started.
There would immediately be a severe security problem if we did
not use any mutex, as the check would then be done with the old credentials,
but the stack trace would potentially reveal secret function
calls that are done by a setuid program when it starts up.
Bernd.
> -Kees
>
>> ---
>> fs/proc/base.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index ebea950..4fdfe4f 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -403,11 +403,11 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
>>
>> static int lock_trace(struct task_struct *task)
>> {
>> - int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> + int err = mutex_lock_killable(&task->signal->exec_update_mutex);
>> if (err)
>> return err;
>> if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS)) {
>> - mutex_unlock(&task->signal->cred_guard_mutex);
>> + mutex_unlock(&task->signal->exec_update_mutex);
>> return -EPERM;
>> }
>> return 0;
>> @@ -415,7 +415,7 @@ static int lock_trace(struct task_struct *task)
>>
>> static void unlock_trace(struct task_struct *task)
>> {
>> - mutex_unlock(&task->signal->cred_guard_mutex);
>> + mutex_unlock(&task->signal->exec_update_mutex);
>> }
>>
>> #ifdef CONFIG_STACKTRACE
>> --
>> 1.9.1
>
On 3/11/20 8:08 PM, Kees Cook wrote:
> On Tue, Mar 10, 2020 at 06:45:47PM +0100, Bernd Edlinger wrote:
>> This changes do_io_accounting to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This fixes possible deadlocks when the trace is accessing
>> /proc/$pid/io for instance.
>>
>> This should be safe, as the credentials are only used for reading.
>
> I'd like to see the rationale described better here for why it should be
> safe. I'm still not seeing why this is safe here, as we might check
> ptrace_may_access() with one cred and then iterate io accounting with a
> different credential...
>
> What am I missing?
>
The same applies here: even if execve has already started, the credentials
are not actually changed until execve has acquired the exec_update_mutex.
The data flow is from task->cred => do_io_accounting;
if the data flow were from do_io_accounting => the task's no_new_privs,
you would see an entirely different patch.
I am open to suggestions on how to improve the description, or even
to add a comment from time to time :)
Thanks
Bernd.
> -Kees
>
>>
>> Signed-off-by: Bernd Edlinger <[email protected]>
>> ---
>> fs/proc/base.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 4fdfe4f..529d0c6 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>> unsigned long flags;
>> int result;
>>
>> - result = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> + result = mutex_lock_killable(&task->signal->exec_update_mutex);
>> if (result)
>> return result;
>>
>> @@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>> result = 0;
>>
>> out_unlock:
>> - mutex_unlock(&task->signal->cred_guard_mutex);
>> + mutex_unlock(&task->signal->exec_update_mutex);
>> return result;
>> }
>>
>> --
>> 1.9.1
>
Kees Cook <[email protected]> writes:
> On Tue, Mar 10, 2020 at 06:45:47PM +0100, Bernd Edlinger wrote:
>> This changes do_io_accounting to use the new exec_update_mutex
>> instead of cred_guard_mutex.
>>
>> This fixes possible deadlocks when the trace is accessing
>> /proc/$pid/io for instance.
>>
>> This should be safe, as the credentials are only used for reading.
>
> I'd like to see the rationale described better here for why it should be
> safe. I'm still not seeing why this is safe here, as we might check
> ptrace_may_access() with one cred and then iterate io accounting with a
> different credential...
>
> What am I missing?
The rationale for non-regression is that exec_update_mutex covers all
of the same tsk->cred changes as cred_guard_mutex. Therefore we are not
any worse off, and we avoid the deadlock.
As for safety: Jann's argument that the only interesting credential
change is in exec applies. All other credential changes that have any
effect on permission checks make the new cred non-dumpable (exceptions
apply; see the code).
So I think this is a non-regressing change. A safe change.
I don't think either version of this code is fully correct.
Eric
>> Signed-off-by: Bernd Edlinger <[email protected]>
>> ---
>> fs/proc/base.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 4fdfe4f..529d0c6 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -2770,7 +2770,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>> unsigned long flags;
>> int result;
>>
>> - result = mutex_lock_killable(&task->signal->cred_guard_mutex);
>> + result = mutex_lock_killable(&task->signal->exec_update_mutex);
>> if (result)
>> return result;
>>
>> @@ -2806,7 +2806,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh
>> result = 0;
>>
>> out_unlock:
>> - mutex_unlock(&task->signal->cred_guard_mutex);
>> + mutex_unlock(&task->signal->exec_update_mutex);
>> return result;
>> }
>>
>> --
>> 1.9.1
On 09.03.2020 00:38, Eric W. Biederman wrote:
>
> The cred_guard_mutex is problematic. The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
> threads are killed. The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possible indefinite userspace waits for userspace.
>
> Add exec_update_mutex that is only held over exec updating process
> with the new contents of exec, so that code that needs not to be
> confused by exec changing the mm and the cred in ways that can not
> happen during ordinary execution of a process.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_udpate_mutex one by one. This lets us move forward while still
> being careful and not introducing any regressions.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/exec.c | 9 +++++++++
> include/linux/sched/signal.h | 9 ++++++++-
> init/init_task.c | 1 +
> kernel/fork.c | 1 +
> 4 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index d820a7272a76..ffeebb1f167b 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
> {
> struct task_struct *tsk;
> struct mm_struct *old_mm, *active_mm;
> + int ret;
>
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
> return -EINTR;
> }
> }
> +
> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> + if (ret)
> + return ret;
You missed old_mm->mmap_sem unlock. See here:
diff --git a/fs/exec.c b/fs/exec.c
index 47582cd97f86..d557bac3e862 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1063,8 +1063,11 @@ static int exec_mmap(struct mm_struct *mm)
}
ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
- if (ret)
+ if (ret) {
+ if (old_mm)
+ up_read(&old_mm->mmap_sem);
return ret;
+ }
task_lock(tsk);
active_mm = tsk->active_mm;
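For illustration only, the same error-path rule in a tiny userspace model.
The toy_lock type and take_both() are hypothetical and merely stand in for
mmap_sem and exec_update_mutex; the fail_second flag simulates
mutex_lock_killable() returning -EINTR.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy lock that only remembers whether it is currently held. */
struct toy_lock {
	bool held;
};

void toy_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

void toy_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/*
 * Takes 'first', then 'second'. When taking 'second' fails, 'first'
 * must be dropped before returning, otherwise it stays held forever.
 */
int take_both(struct toy_lock *first, struct toy_lock *second, bool fail_second)
{
	toy_acquire(first);
	if (fail_second) {
		toy_release(first);	/* the unlock the hunk above adds */
		return -EINTR;
	}
	toy_acquire(second);
	return 0;
}
```

After a failed take_both() neither lock is held, so the caller can simply
propagate the error.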
Kirill Tkhai <[email protected]> writes:
> On 09.03.2020 00:38, Eric W. Biederman wrote:
>>
>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>> threads are killed. The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over a possible indefinite userspace waits for userspace.
>>
>> Add exec_update_mutex that is only held over exec updating process
>> with the new contents of exec, so that code that needs not to be
>> confused by exec changing the mm and the cred in ways that can not
>> happen during ordinary execution of a process.
>>
>> The plan is to switch the users of cred_guard_mutex to
>> exec_udpate_mutex one by one. This lets us move forward while still
>> being careful and not introducing any regressions.
>>
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> fs/exec.c | 9 +++++++++
>> include/linux/sched/signal.h | 9 ++++++++-
>> init/init_task.c | 1 +
>> kernel/fork.c | 1 +
>> 4 files changed, 19 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index d820a7272a76..ffeebb1f167b 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>> {
>> struct task_struct *tsk;
>> struct mm_struct *old_mm, *active_mm;
>> + int ret;
>>
>> /* Notify parent that we're no longer interested in the old VM */
>> tsk = current;
>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>> return -EINTR;
>> }
>> }
>> +
>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> + if (ret)
>> + return ret;
>
> You missed old_mm->mmap_sem unlock. See here:
Duh. Thank you.
I actually need to switch the lock ordering here, and I haven't yet
because my son was sick yesterday.
Something like this.
diff --git a/fs/exec.c b/fs/exec.c
index 96f89401b4d1..03d50c27ec01 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1020,9 +1020,14 @@ static int exec_mmap(struct mm_struct *mm)
tsk = current;
old_mm = current->mm;
exec_mm_release(tsk, old_mm);
+ if (old_mm)
+ sync_mm_rss(old_mm);
+
+ ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+ if (ret)
+ return ret;
if (old_mm) {
- sync_mm_rss(old_mm);
/*
* Make sure that if there is a core dump in progress
* for the old mm, we get out and die instead of going
@@ -1032,14 +1037,11 @@ static int exec_mmap(struct mm_struct *mm)
down_read(&old_mm->mmap_sem);
if (unlikely(old_mm->core_state)) {
up_read(&old_mm->mmap_sem);
+ mutex_unlock(&tsk->signal->exec_update_mutex);
return -EINTR;
}
}
- ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
- if (ret)
- return ret;
-
task_lock(tsk);
active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
> diff --git a/fs/exec.c b/fs/exec.c
> index 47582cd97f86..d557bac3e862 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1063,8 +1063,11 @@ static int exec_mmap(struct mm_struct *mm)
> }
>
> ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> - if (ret)
> + if (ret) {
> + if (old_mm)
> + up_read(&old_mm->mmap_sem);
> return ret;
> + }
>
> task_lock(tsk);
> active_mm = tsk->active_mm;
Eric
On 12.03.2020 15:24, Eric W. Biederman wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> On 09.03.2020 00:38, Eric W. Biederman wrote:
>>>
>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>> over the userspace accesses as the arguments from userspace are read.
>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>> threads are killed. The cred_guard_mutex is held over
>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>
>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>> over a possible indefinite userspace waits for userspace.
>>>
>>> Add exec_update_mutex that is only held over exec updating process
>>> with the new contents of exec, so that code that needs not to be
>>> confused by exec changing the mm and the cred in ways that can not
>>> happen during ordinary execution of a process.
>>>
>>> The plan is to switch the users of cred_guard_mutex to
>>> exec_udpate_mutex one by one. This lets us move forward while still
>>> being careful and not introducing any regressions.
>>>
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>> ---
>>> fs/exec.c | 9 +++++++++
>>> include/linux/sched/signal.h | 9 ++++++++-
>>> init/init_task.c | 1 +
>>> kernel/fork.c | 1 +
>>> 4 files changed, 19 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> index d820a7272a76..ffeebb1f167b 100644
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>>> {
>>> struct task_struct *tsk;
>>> struct mm_struct *old_mm, *active_mm;
>>> + int ret;
>>>
>>> /* Notify parent that we're no longer interested in the old VM */
>>> tsk = current;
>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>> return -EINTR;
>>> }
>>> }
>>> +
>>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>> + if (ret)
>>> + return ret;
>>
>> You missed old_mm->mmap_sem unlock. See here:
>
> Duh. Thank you.
>
> I actually need to switch the lock ordering here, and I haven't yet
> because my son was sick yesterday.
There is some fundamental problem with your patch, since the below fires in 100% of cases
on current linux-next:
[ 22.838717] kernel BUG at fs/exec.c:1474!
diff --git a/fs/exec.c b/fs/exec.c
index 47582cd97f86..0f77f8c94905 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1470,8 +1470,10 @@ static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
- if (!bprm->mm)
+ if (!bprm->mm) {
+ BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
mutex_unlock(&current->signal->exec_update_mutex);
+ }
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
@@ -1521,6 +1523,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
+ BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
mutex_unlock(&current->signal->exec_update_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
}
---------------------------------------------------------------------------------------------
First time the mutex is unlocked in:
exec_binprm()->search_binary_handler()->.load_binary->install_exec_creds()
Then exec_binprm()->search_binary_handler()->.load_binary->flush_old_exec() clears mm:
bprm->mm = NULL;
Second time the mutex is unlocked in free_bprm():
if (bprm->cred) {
if (!bprm->mm)
mutex_unlock(&current->signal->exec_update_mutex);
My opinion is that we should not rely on side indicators like bprm->mm. It would be better to
introduce struct linux_binprm::exec_update_mutex_is_locked, so that the next person dealing
with this after you won't waste much time digging into it. Also, if someone decides
to change the place where bprm->mm is set to NULL, that person will run into a hell
of dependencies between unrelated components like your newly introduced mutex.
So, I'm strongly for *struct linux_binprm::exec_update_mutex_is_locked*, since this improves
modularity.
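A rough userspace sketch of what I mean, not a patch. Everything here
(toy_mutex, toy_bprm and the helpers) is a hypothetical model, with only
the flag name taken from the proposal above. The flag is set when the lock
is taken and cleared on the first unlock, so a second unlock from the
cleanup path becomes a no-op instead of a double unlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy mutex: counts unlock calls that arrive while it is not locked. */
struct toy_mutex {
	bool locked;
	int bad_unlocks;
};

/* Hypothetical reduced bprm carrying the explicit flag. */
struct toy_bprm {
	struct toy_mutex *exec_update_mutex;
	bool exec_update_mutex_is_locked;
};

void toy_mutex_unlock(struct toy_mutex *m)
{
	if (!m->locked)
		m->bad_unlocks++;	/* a real mutex would be corrupted here */
	m->locked = false;
}

void toy_bprm_lock(struct toy_bprm *bprm)
{
	assert(!bprm->exec_update_mutex->locked);
	bprm->exec_update_mutex->locked = true;
	bprm->exec_update_mutex_is_locked = true;
}

/* Callable from both the success path and the cleanup path. */
void toy_bprm_unlock(struct toy_bprm *bprm)
{
	if (!bprm->exec_update_mutex_is_locked)
		return;
	bprm->exec_update_mutex_is_locked = false;
	toy_mutex_unlock(bprm->exec_update_mutex);
}
```

With this shape, neither the success path nor the cleanup path has to guess
the lock state from bprm->mm.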
Kirill Tkhai <[email protected]> writes:
> On 12.03.2020 15:24, Eric W. Biederman wrote:
>> Kirill Tkhai <[email protected]> writes:
>>
>>> On 09.03.2020 00:38, Eric W. Biederman wrote:
>>>>
>>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>>> over the userspace accesses as the arguments from userspace are read.
>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>> threads are killed. The cred_guard_mutex is held over
>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>
>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>> over a possible indefinite userspace waits for userspace.
>>>>
>>>> Add exec_update_mutex that is only held over exec updating process
>>>> with the new contents of exec, so that code that needs not to be
>>>> confused by exec changing the mm and the cred in ways that can not
>>>> happen during ordinary execution of a process.
>>>>
>>>> The plan is to switch the users of cred_guard_mutex to
>>>> exec_udpate_mutex one by one. This lets us move forward while still
>>>> being careful and not introducing any regressions.
>>>>
>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>>> ---
>>>> fs/exec.c | 9 +++++++++
>>>> include/linux/sched/signal.h | 9 ++++++++-
>>>> init/init_task.c | 1 +
>>>> kernel/fork.c | 1 +
>>>> 4 files changed, 19 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>> index d820a7272a76..ffeebb1f167b 100644
>>>> --- a/fs/exec.c
>>>> +++ b/fs/exec.c
>>>> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>> {
>>>> struct task_struct *tsk;
>>>> struct mm_struct *old_mm, *active_mm;
>>>> + int ret;
>>>>
>>>> /* Notify parent that we're no longer interested in the old VM */
>>>> tsk = current;
>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>> return -EINTR;
>>>> }
>>>> }
>>>> +
>>>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>> + if (ret)
>>>> + return ret;
>>>
>>> You missed old_mm->mmap_sem unlock. See here:
>>
>> Duh. Thank you.
>>
>> I actually need to switch the lock ordering here, and I haven't yet
>> because my son was sick yesterday.
>
> There is some fundamental problem with your patch, since the below fires in 100% cases
> on current linux-next:
Thank you.
I have just backed this out of linux-next for now because it is clearly
flawed.
You make some good points about the recursion. I will go back to the
drawing board and see what I can work out.
> [ 22.838717] kernel BUG at fs/exec.c:1474!
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 47582cd97f86..0f77f8c94905 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1470,8 +1470,10 @@ static void free_bprm(struct linux_binprm *bprm)
> {
> free_arg_pages(bprm);
> if (bprm->cred) {
> - if (!bprm->mm)
> + if (!bprm->mm) {
> + BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
> mutex_unlock(&current->signal->exec_update_mutex);
> + }
> mutex_unlock(&current->signal->cred_guard_mutex);
> abort_creds(bprm->cred);
> }
> @@ -1521,6 +1523,7 @@ void install_exec_creds(struct linux_binprm *bprm)
> * credentials; any time after this it may be unlocked.
> */
> security_bprm_committed_creds(bprm);
> + BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
> mutex_unlock(&current->signal->exec_update_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> }
>
> ---------------------------------------------------------------------------------------------
>
> First time the mutex is unlocked in:
>
> exec_binprm()->search_binary_handler()->.load_binary->install_exec_creds()
>
> Then exec_binprm()->search_binary_handler()->.load_binary->flush_old_exec() clears mm:
>
> bprm->mm = NULL;
>
> Second time the mutex is unlocked in free_bprm():
>
> if (bprm->cred) {
> if (!bprm->mm)
> mutex_unlock(&current->signal->exec_update_mutex);
>
> My opinion is we should not relay on side indicators like bprm->mm. Better you may
> introduce struct linux_binprm::exec_update_mutex_is_locked. So the next person dealing
> with this after you won't waste much time on diving into this. Also, if someone decides
> to change the place, where bprm->mm is set into NULL, this person will bump into hell
> of dependences between unrelated components like your newly introduced mutex.
>
> So, I'm strongly for *struct linux_binprm::exec_update_mutex_is_locked*, since this improves
> modularity.
Am I wrong or is that also a problem with cred_guard_mutex?
Eric
On 12.03.2020 17:38, Eric W. Biederman wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> On 12.03.2020 15:24, Eric W. Biederman wrote:
>>> Kirill Tkhai <[email protected]> writes:
>>>
>>>> On 09.03.2020 00:38, Eric W. Biederman wrote:
>>>>>
>>>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held of PTRACE_EVENT_EXIT as the the other
>>>>> threads are killed. The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>> over a possible indefinite userspace waits for userspace.
>>>>>
>>>>> Add exec_update_mutex that is only held over exec updating process
>>>>> with the new contents of exec, so that code that needs not to be
>>>>> confused by exec changing the mm and the cred in ways that can not
>>>>> happen during ordinary execution of a process.
>>>>>
>>>>> The plan is to switch the users of cred_guard_mutex to
>>>>> exec_udpate_mutex one by one. This lets us move forward while still
>>>>> being careful and not introducing any regressions.
>>>>>
>>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>>>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>>>> ---
>>>>> fs/exec.c | 9 +++++++++
>>>>> include/linux/sched/signal.h | 9 ++++++++-
>>>>> init/init_task.c | 1 +
>>>>> kernel/fork.c | 1 +
>>>>> 4 files changed, 19 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>>> index d820a7272a76..ffeebb1f167b 100644
>>>>> --- a/fs/exec.c
>>>>> +++ b/fs/exec.c
>>>>> @@ -1014,6 +1014,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>>> {
>>>>> struct task_struct *tsk;
>>>>> struct mm_struct *old_mm, *active_mm;
>>>>> + int ret;
>>>>>
>>>>> /* Notify parent that we're no longer interested in the old VM */
>>>>> tsk = current;
>>>>> @@ -1034,6 +1035,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>> return -EINTR;
>>>>> }
>>>>> }
>>>>> +
>>>>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>>> + if (ret)
>>>>> + return ret;
>>>>
>>>> You missed old_mm->mmap_sem unlock. See here:
>>>
>>> Duh. Thank you.
>>>
>>> I actually need to switch the lock ordering here, and I haven't yet
>>> because my son was sick yesterday.
>>
>> There is some fundamental problem with your patch, since the below fires in 100% cases
>> on current linux-next:
>
> Thank you.
>
> I have just backed this out of linux-next for now because it is clearly
> flawed.
>
> You make some good points about the recursion. I will go back to the
> drawing board and see what I can work out.
>
>
>> [ 22.838717] kernel BUG at fs/exec.c:1474!
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 47582cd97f86..0f77f8c94905 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1470,8 +1470,10 @@ static void free_bprm(struct linux_binprm *bprm)
>> {
>> free_arg_pages(bprm);
>> if (bprm->cred) {
>> - if (!bprm->mm)
>> + if (!bprm->mm) {
>> + BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
>> mutex_unlock(&current->signal->exec_update_mutex);
>> + }
>> mutex_unlock(&current->signal->cred_guard_mutex);
>> abort_creds(bprm->cred);
>> }
>> @@ -1521,6 +1523,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>> * credentials; any time after this it may be unlocked.
>> */
>> security_bprm_committed_creds(bprm);
>> + BUG_ON(!mutex_is_locked(&current->signal->exec_update_mutex));
>> mutex_unlock(&current->signal->exec_update_mutex);
>> mutex_unlock(&current->signal->cred_guard_mutex);
>> }
>>
>> ---------------------------------------------------------------------------------------------
>>
>> First time the mutex is unlocked in:
>>
>> exec_binprm()->search_binary_handler()->.load_binary->install_exec_creds()
>>
>> Then exec_binprm()->search_binary_handler()->.load_binary->flush_old_exec() clears mm:
>>
>> bprm->mm = NULL;
>>
>> Second time the mutex is unlocked in free_bprm():
>>
>> if (bprm->cred) {
>> if (!bprm->mm)
>> mutex_unlock(&current->signal->exec_update_mutex);
>>
>> My opinion is we should not relay on side indicators like bprm->mm. Better you may
>> introduce struct linux_binprm::exec_update_mutex_is_locked. So the next person dealing
>> with this after you won't waste much time on diving into this. Also, if someone decides
>> to change the place, where bprm->mm is set into NULL, this person will bump into hell
>> of dependences between unrelated components like your newly introduced mutex.
>>
>> So, I'm strongly for *struct linux_binprm::exec_update_mutex_is_locked*, since this improves
>> modularity.
>
> Am I wrong or is that also a problem with cred_guard_mutex?
No, there is no problem there.
cred_guard_mutex is locked together with the bprm->cred = prepare_exec_creds() assignment.
cred_guard_mutex is unlocked together with clearing bprm->cred to NULL (see install_exec_creds()).
Furthermore, free_bprm() skips the unlock when bprm->cred is NULL.
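
The pairing rule described above can be sketched in plain userspace C. This is only a toy model, not kernel code: `toy_mutex`, `toy_bprm` and the `toy_*` functions are hypothetical stand-ins, and the "mutex" merely records whether it is held so that an unbalanced unlock trips an assertion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a kernel mutex; it only tracks whether it is held. */
struct toy_mutex { bool locked; };

static void toy_lock(struct toy_mutex *m)   { assert(!m->locked); m->locked = true; }
static void toy_unlock(struct toy_mutex *m) { assert(m->locked);  m->locked = false; }

/* Hypothetical, stripped-down linux_binprm: only the field that matters here. */
struct toy_bprm { void *cred; };

static struct toy_mutex cred_guard = { false };
static int dummy_cred;

/* The lock and the pointer assignment always happen together. */
static void toy_prepare_creds(struct toy_bprm *bprm)
{
	toy_lock(&cred_guard);
	bprm->cred = &dummy_cred;
}

/* Success path: the pointer is cleared and the lock dropped together. */
static void toy_install_creds(struct toy_bprm *bprm)
{
	bprm->cred = NULL;
	toy_unlock(&cred_guard);
}

/* Cleanup: a non-NULL cred means the lock is still held by us. */
static void toy_free_bprm(struct toy_bprm *bprm)
{
	if (bprm->cred) {
		toy_unlock(&cred_guard);
		bprm->cred = NULL;
	}
}

/* Runs one exec attempt; returns true if the mutex ends up released. */
static bool toy_exec(bool succeed)
{
	struct toy_bprm bprm = { NULL };

	toy_prepare_creds(&bprm);
	if (succeed)
		toy_install_creds(&bprm);
	toy_free_bprm(&bprm);
	return !cred_guard.locked;
}
```

Calling `toy_exec(true)` and `toy_exec(false)` exercises the success and the failure path; in both cases the lock is released exactly once, because the pointer alone says whether it is held.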
On 3/12/20 3:38 PM, Eric W. Biederman wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> On 12.03.2020 15:24, Eric W. Biederman wrote:
>>>
>>> I actually need to switch the lock ordering here, and I haven't yet
>>> because my son was sick yesterday.
All the best wishes to you and your son. I hope he will get well soon.
And sorry for missing the issue in the review. The reason turns
out to be that bprm_mm_init is called after prepare_bprm_creds, but there
are error paths between those where free_bprm is called with
cred != NULL and mm == NULL, but the mutex not locked.
I figured out a possible fix for the problem that was pointed out:
From ceb6f65b52b3a7f0280f4f20509a1564a439edf6 Mon Sep 17 00:00:00 2001
From: Bernd Edlinger <[email protected]>
Date: Wed, 11 Mar 2020 15:31:07 +0100
Subject: [PATCH] Fix issues with exec_update_mutex
Signed-off-by: Bernd Edlinger <[email protected]>
---
fs/exec.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index ffeebb1..cde4937 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1021,8 +1021,14 @@ static int exec_mmap(struct mm_struct *mm)
old_mm = current->mm;
exec_mm_release(tsk, old_mm);
- if (old_mm) {
+ if (old_mm)
sync_mm_rss(old_mm);
+
+ ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+ if (ret)
+ return ret;
+
+ if (old_mm) {
/*
* Make sure that if there is a core dump in progress
* for the old mm, we get out and die instead of going
@@ -1032,14 +1038,11 @@ static int exec_mmap(struct mm_struct *mm)
down_read(&old_mm->mmap_sem);
if (unlikely(old_mm->core_state)) {
up_read(&old_mm->mmap_sem);
+ mutex_unlock(&tsk->signal->exec_update_mutex);
return -EINTR;
}
}
- ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
- if (ret)
- return ret;
-
task_lock(tsk);
active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
@@ -1444,8 +1447,6 @@ static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
- if (!bprm->mm)
- mutex_unlock(&current->signal->exec_update_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
@@ -1846,6 +1847,8 @@ static int __do_execve_file(int fd, struct filename *filename,
would_dump(bprm, bprm->file);
retval = exec_binprm(bprm);
+ if (bprm->cred && !bprm->mm)
+ mutex_unlock(&current->signal->exec_update_mutex);
if (retval < 0)
goto out;
--
1.9.1
On 13.03.2020 04:05, Bernd Edlinger wrote:
> On 3/12/20 3:38 PM, Eric W. Biederman wrote:
>> Kirill Tkhai <[email protected]> writes:
>>
>>> On 12.03.2020 15:24, Eric W. Biederman wrote:
>>>>
>>>> I actually need to switch the lock ordering here, and I haven't yet
>>>> because my son was sick yesterday.
>
> All the best wishes to you and your son. I hope he will get well soon.
>
> And sorry for missing the issue in the review. The reason turns
> out to be that bprm_mm_init is called after prepare_bprm_creds, but there
> are error paths between those where free_bprm is called with
> cred != NULL and mm == NULL, but the mutex not locked.
>
> I figured out a possible fix for the problem that was pointed out:
>
>
> From ceb6f65b52b3a7f0280f4f20509a1564a439edf6 Mon Sep 17 00:00:00 2001
> From: Bernd Edlinger <[email protected]>
> Date: Wed, 11 Mar 2020 15:31:07 +0100
> Subject: [PATCH] Fix issues with exec_update_mutex
>
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> fs/exec.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index ffeebb1..cde4937 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1021,8 +1021,14 @@ static int exec_mmap(struct mm_struct *mm)
> old_mm = current->mm;
> exec_mm_release(tsk, old_mm);
>
> - if (old_mm) {
> + if (old_mm)
> sync_mm_rss(old_mm);
> +
> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> + if (ret)
> + return ret;
> +
> + if (old_mm) {
> /*
> * Make sure that if there is a core dump in progress
> * for the old mm, we get out and die instead of going
> @@ -1032,14 +1038,11 @@ static int exec_mmap(struct mm_struct *mm)
> down_read(&old_mm->mmap_sem);
> if (unlikely(old_mm->core_state)) {
> up_read(&old_mm->mmap_sem);
> + mutex_unlock(&tsk->signal->exec_update_mutex);
> return -EINTR;
> }
> }
>
> - ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> - if (ret)
> - return ret;
> -
> task_lock(tsk);
> active_mm = tsk->active_mm;
> membarrier_exec_mmap(mm);
> @@ -1444,8 +1447,6 @@ static void free_bprm(struct linux_binprm *bprm)
> {
> free_arg_pages(bprm);
> if (bprm->cred) {
> - if (!bprm->mm)
> - mutex_unlock(&current->signal->exec_update_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> abort_creds(bprm->cred);
> }
> @@ -1846,6 +1847,8 @@ static int __do_execve_file(int fd, struct filename *filename,
> would_dump(bprm, bprm->file);
>
> retval = exec_binprm(bprm);
> + if (bprm->cred && !bprm->mm)
> + mutex_unlock(&current->signal->exec_update_mutex);
Although this should fix the problem, it looks like a broken puzzle.
We can't use bprm->cred as an indicator of whether the mutex was locked or not.
We can check bprm->cred with regard to cred_guard_mutex, because there is a
strong rule: "cred_guard_mutex becomes locked together with the bprm->cred assignment
(see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing".
Note the modularity of all this: there are no dependencies on anything else.
With regard to the newly introduced exec_update_mutex, both your fix and the original patch
look like an obfuscation. The mutex becomes tightly glued to the unrelated bprm->cred and
bprm->mm, and this will cause problems for future modification and maintenance of all the
involved entities. If someone wants to move some functions in relation to each other, it will
be painful, and that person will have to walk the same path of dependencies and bugs that
Eric stepped on in the original patch.
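
To make the objection concrete, here is a small userspace sketch of the failure mode: inferring "the mutex is held" from bprm->mm == NULL goes wrong on an error path taken before bprm_mm_init(), while an explicit flag does not. Everything here is a toy model under stated assumptions: `toy_mutex`, `toy_bprm`, `update_locked`, `enum fail_at` and `toy_run()` are hypothetical names, and the mutex only counts unlocks of an unheld lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy mutex that counts unlocks performed while it was not held. */
struct toy_mutex { bool locked; int bad_unlocks; };

static void toy_lock(struct toy_mutex *m)
{
	assert(!m->locked);
	m->locked = true;
}

static void toy_unlock(struct toy_mutex *m)
{
	if (!m->locked)
		m->bad_unlocks++;
	m->locked = false;
}

/* Hypothetical, stripped-down linux_binprm. */
struct toy_bprm {
	void *cred;
	void *mm;
	bool update_locked;	/* explicit "exec_update_mutex is held" flag */
};

/* Where the simulated execve fails. */
enum fail_at { FAIL_BEFORE_MM_INIT, FAIL_AFTER_EXEC_MMAP };

/* Returns the number of bogus unlocks seen during cleanup. */
static int toy_run(enum fail_at where, bool use_flag)
{
	static int cred, mm;
	struct toy_mutex m = { false, 0 };
	struct toy_bprm b = { NULL, NULL, false };
	bool held;

	b.cred = &cred;			/* models prepare_bprm_creds() */
	if (where != FAIL_BEFORE_MM_INIT) {
		b.mm = &mm;		/* models bprm_mm_init() */
		toy_lock(&m);		/* models exec_mmap() taking the mutex */
		b.update_locked = true;
		b.mm = NULL;		/* models flush_old_exec() */
	}

	/* models free_bprm() deciding whether to unlock */
	if (b.cred) {
		held = use_flag ? b.update_locked : (b.mm == NULL);
		if (held)
			toy_unlock(&m);
	}
	return m.bad_unlocks;
}
```

With `use_flag == false`, the early-failure case unlocks a mutex that was never taken; with the explicit flag, both failure points clean up correctly.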
The cred_guard_mutex is problematic. The cred_guard_mutex is held
over the userspace accesses as the arguments from userspace are read.
The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
threads are killed. The cred_guard_mutex is held over
"put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held
over a possibly indefinite wait for userspace.
Add exec_update_mutex that is only held while exec is updating the process
with the new contents of exec, so that code that needs not to be
confused by exec changing the mm and the cred in ways that can not
happen during ordinary execution of a process can take this mutex instead.
The plan is to switch the users of cred_guard_mutex to
exec_update_mutex one by one. This lets us move forward while still
being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Bernd Edlinger <[email protected]>
---
fs/exec.c | 17 ++++++++++++++---
include/linux/binfmts.h | 8 +++++++-
include/linux/sched/signal.h | 9 ++++++++-
init/init_task.c | 1 +
kernel/fork.c | 1 +
5 files changed, 31 insertions(+), 5 deletions(-)
v3: this update fixes lock-order and adds an explicit data member in linux_binprm
diff --git a/fs/exec.c b/fs/exec.c
index d820a72..11974a1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1014,12 +1014,17 @@ static int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
struct mm_struct *old_mm, *active_mm;
+ int ret;
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;
exec_mm_release(tsk, old_mm);
+ ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+ if (ret)
+ return ret;
+
if (old_mm) {
sync_mm_rss(old_mm);
/*
@@ -1031,9 +1036,11 @@ static int exec_mmap(struct mm_struct *mm)
down_read(&old_mm->mmap_sem);
if (unlikely(old_mm->core_state)) {
up_read(&old_mm->mmap_sem);
+ mutex_unlock(&tsk->signal->exec_update_mutex);
return -EINTR;
}
}
+
task_lock(tsk);
active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
@@ -1288,11 +1295,12 @@ int flush_old_exec(struct linux_binprm * bprm)
goto out;
/*
- * After clearing bprm->mm (to mark that current is using the
- * prepared mm now), we have nothing left of the original
+ * After setting bprm->called_exec_mmap (to mark that current is
+ * using the prepared mm now), we have nothing left of the original
* process. If anything from here on returns an error, the check
* in search_binary_handler() will SEGV current.
*/
+ bprm->called_exec_mmap = 1;
bprm->mm = NULL;
#ifdef CONFIG_POSIX_TIMERS
@@ -1438,6 +1446,8 @@ static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
+ if (bprm->called_exec_mmap)
+ mutex_unlock(&current->signal->exec_update_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
@@ -1487,6 +1497,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
+ mutex_unlock(&current->signal->exec_update_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
@@ -1678,7 +1689,7 @@ int search_binary_handler(struct linux_binprm *bprm)
read_lock(&binfmt_lock);
put_binfmt(fmt);
- if (retval < 0 && !bprm->mm) {
+ if (retval < 0 && bprm->called_exec_mmap) {
/* we got to flush_old_exec() and failed after it */
read_unlock(&binfmt_lock);
force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..a345d9f 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,13 @@ struct linux_binprm {
* exec has happened. Used to sanitize execution environment
* and to set AT_SECURE auxv for glibc.
*/
- secureexec:1;
+ secureexec:1,
+ /*
+ * Set by flush_old_exec, when exec_mmap has been called.
+ * This is past the point of no return, when the
+ * exec_update_mutex has been taken.
+ */
+ called_exec_mmap:1;
#ifdef __alpha__
unsigned int taso:1;
#endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..a29df79 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
- * (notably. ptrace) */
+ * (notably. ptrace)
+ * Deprecated do not use in new code.
+ * Use exec_update_mutex instead.
+ */
+ struct mutex exec_update_mutex; /* Held while task_struct is being
+ * updated during exec, and may have
+ * inconsistent permissions.
+ */
} __randomize_layout;
/*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..bd403ed 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
.multiprocess = HLIST_HEAD_INIT,
.rlim = INIT_RLIMITS,
.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+ .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
.cputimer = {
diff --git a/kernel/fork.c b/kernel/fork.c
index 8642530..036b692 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
mutex_init(&sig->cred_guard_mutex);
+ mutex_init(&sig->exec_update_mutex);
return 0;
}
--
1.9.1
On 3/14/20 10:57 AM, Bernd Edlinger wrote:
> On 3/13/20 10:13 AM, Kirill Tkhai wrote:
>>
>> Although this should fix the problem, it looks like a broken puzzle.
>>
>> We can't use bprm->cred as an indicator of whether the mutex was locked or not.
>> We can check bprm->cred with regard to cred_guard_mutex, because there is a
>> strong rule: "cred_guard_mutex becomes locked together with the bprm->cred assignment
>> (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing".
>> Note the modularity of all this: there are no dependencies on anything else.
>>
>> With regard to the newly introduced exec_update_mutex, both your fix and the original patch
>> look like an obfuscation. The mutex becomes tightly glued to the unrelated bprm->cred and
>> bprm->mm, and this will cause problems for future modification and maintenance of all the
>> involved entities. If someone wants to move some functions in relation to each other, it will
>> be painful, and that person will have to walk the same path of dependencies and bugs that
>> Eric stepped on in the original patch.
>>
>
> Okay, yes, valid points you make, thanks.
> I just wanted to understand what was exactly wrong with this patch,
> since the failure mode looked a lot like it was failing because of
> something clobbering the data unexpectedly.
>
>
> So I have posted a few updated patches for the failed one here:
>
> [PATCH v3 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
> [PATCH] pidfd: Use new infrastructure to fix deadlocks in execve
>
> which replaces these:
> [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
> https://lore.kernel.org/lkml/[email protected]/
>
> [PATCH] pidfd: Stop taking cred_guard_mutex
> https://lore.kernel.org/lkml/[email protected]/
>
>
> and a new patch series to fix deadlock in ptrace_attach and update doc:
> [PATCH 0/2] exec: Fix dead-lock in de_thread with ptrace_attach
> [PATCH 1/2] exec: Fix dead-lock in de_thread with ptrace_attach
> [PATCH 2/2] doc: Update documentation of ->exec_*_mutex
>
>
> Other patches needed, still valid:
>
> [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
> https://lore.kernel.org/lkml/[email protected]/
>
> [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
> https://lore.kernel.org/lkml/[email protected]/
>
Ah, sorry, forgot this one:
[PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
https://lore.kernel.org/lkml/[email protected]/
> [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
> https://lore.kernel.org/lkml/[email protected]/
>
> [PATCH 1/4] exec: Fix a deadlock in ptrace
> https://lore.kernel.org/lkml/AM6PR03MB517033EAD25BED15CC84E17DE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>
> [PATCH 2/4] selftests/ptrace: add test cases for dead-locks
> https://lore.kernel.org/lkml/AM6PR03MB51703199741A2C27A78980FFE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>
> [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core
> https://lore.kernel.org/lkml/AM6PR03MB5170ED6D4D216EEEEF400136E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>
> [PATCH 4/4] kernel: doc: remove outdated comment cred.c
> https://lore.kernel.org/lkml/AM6PR03MB517039DB07AB641C194FEA57E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>
> [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
> https://lore.kernel.org/lkml/AM6PR03MB517057A2269C3A4FB287B76EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>
> [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
> https://lore.kernel.org/lkml/AM6PR03MB51705D211EC8E7EA270627B1E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>
> [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
> https://lore.kernel.org/lkml/AM6PR03MB5170BD2476E35068E182EFA4E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>
> [PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve
> https://lore.kernel.org/lkml/AM6PR03MB517035DEEDB9C8699CB6B34EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>
>
> I think most of the existing patches are already approved, but if
> there are still change requests, please let me know.
>
>
> Thanks
> Bernd.
>
Hope it is correct now.
I haven't seen the new patches on the kernel archives yet,
so I cannot add URLs for them.
Bernd.
This brings the outdated Documentation/security/credentials.rst
back in line with the current implementation, and describes the
purpose of current->signal->exec_update_mutex,
current->signal->exec_guard_mutex and
current->signal->unsafe_execve_in_progress.
Signed-off-by: Bernd Edlinger <[email protected]>
---
Documentation/security/credentials.rst | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 282e79f..fe4cd76 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -437,15 +437,30 @@ new set of credentials by calling::
struct cred *prepare_creds(void);
-this locks current->cred_replace_mutex and then allocates and constructs a
-duplicate of the current process's credentials, returning with the mutex still
-held if successful. It returns NULL if not successful (out of memory).
+this allocates and constructs a duplicate of the current process's credentials.
+It returns NULL if not successful (out of memory).
+
+If called from __do_execve_file, the mutex current->signal->exec_guard_mutex
+is acquired before this function gets called, and usually released after
+the new process mmap and credentials are installed. However if one of the
+sibling threads is being traced when the execve is invoked, there is no
+guarantee how long it takes to terminate all sibling threads, and therefore
+the variable current->signal->unsafe_execve_in_progress is set, and the
+exec_guard_mutex is released immediately. Functions that may have effect
+on the credentials of a different thread need to lock the exec_guard_mutex
+and additionally check the unsafe_execve_in_progress status, and fail with
+-EAGAIN if that variable is set.
The mutex prevents ``ptrace()`` from altering the ptrace state of a process
while security checks on credentials construction and changing is taking place
as the ptrace state may alter the outcome, particularly in the case of
``execve()``.
+The mutex current->signal->exec_update_mutex is acquired when only a single
+thread is remaining, and the credentials and the process mmap are actually
+changed. Functions that only need access to a consistent state of the
+credentials and the process mmap only need to acquire this mutex.
+
The new credentials set should be altered appropriately, and any security
checks and hooks done. Both the current and the proposed sets of credentials
are available for this purpose as current_cred() will return the current set
@@ -466,9 +481,8 @@ by calling::
This will alter various aspects of the credentials and the process, giving the
LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
-actually commit the new credentials to ``current->cred``, it will release
-``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
-will notify the scheduler and others of the changes.
+actually commit the new credentials to ``current->cred``, and it will notify
+the scheduler and others of the changes.
This function is guaranteed to return 0, so that it can be tail-called at the
end of such functions as ``sys_setresuid()``.
@@ -486,8 +500,7 @@ invoked::
void abort_creds(struct cred *new);
-This releases the lock on ``current->cred_replace_mutex`` that
-``prepare_creds()`` got and then releases the new credentials.
+This releases the new credentials.
A typical credentials alteration function would look something like this::
--
1.9.1
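
The -EAGAIN rule described in the documentation update above can be illustrated with a toy userspace model. This is a sketch under stated assumptions, not the kernel implementation: `toy_signal` and the `toy_*` functions are hypothetical, the mutex is reduced to a boolean, and -EBUSY is used purely as a stand-in for "the caller would sleep on the mutex".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of the two signal_struct members discussed above. */
struct toy_signal {
	bool guard_locked;		/* stands in for exec_guard_mutex */
	bool unsafe_execve_in_progress;
};

/* execve entry: take the guard mutex. */
static void toy_exec_begin(struct toy_signal *sig)
{
	assert(!sig->guard_locked);
	sig->guard_locked = true;
}

/* de_thread: if a sibling is traced, set the flag and drop the guard early. */
static void toy_de_thread(struct toy_signal *sig, bool sibling_traced)
{
	assert(sig->guard_locked);
	if (sibling_traced) {
		sig->unsafe_execve_in_progress = true;
		sig->guard_locked = false;
	}
}

/* execve exit: retake the guard if it was dropped, clear the flag, unlock. */
static void toy_exec_end(struct toy_signal *sig)
{
	if (sig->unsafe_execve_in_progress)
		sig->guard_locked = true;
	sig->unsafe_execve_in_progress = false;
	sig->guard_locked = false;
}

/*
 * Tracer side. -EBUSY only models "would block on the mutex"; real code
 * would sleep in mutex_lock_interruptible() instead of returning.
 */
static int toy_ptrace_attach(struct toy_signal *sig)
{
	int ret;

	if (sig->guard_locked)
		return -EBUSY;
	sig->guard_locked = true;
	ret = sig->unsafe_execve_in_progress ? -EAGAIN : 0;
	sig->guard_locked = false;
	return ret;
}

/* Attach while exec sits in de_thread; returns what the tracer would see. */
static int toy_attach_during_exec(bool sibling_traced)
{
	struct toy_signal sig = { false, false };

	toy_exec_begin(&sig);
	toy_de_thread(&sig, sibling_traced);
	return toy_ptrace_attach(&sig);
}

/* Attach after exec has completed; the flag must be clear again. */
static int toy_attach_after_exec(bool sibling_traced)
{
	struct toy_signal sig = { false, false };

	toy_exec_begin(&sig);
	toy_de_thread(&sig, sibling_traced);
	toy_exec_end(&sig);
	return toy_ptrace_attach(&sig);
}
```

In this model the tracer gets -EAGAIN only in the window where exec has dropped the guard because a sibling is traced; otherwise it either waits (modelled as -EBUSY) or proceeds normally once exec has finished.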
This changes __pidfd_fget to use the new exec_update_mutex
instead of cred_guard_mutex.
This should be safe, as the credentials do not change
before exec_update_mutex is locked. Therefore whatever
file access is possible while holding the cred_guard_mutex
here is also possible with the exec_update_mutex.
Signed-off-by: Bernd Edlinger <[email protected]>
---
kernel/pid.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
This replaces Eric's "[PATCH] pidfd: Stop taking cred_guard_mutex"
diff --git a/kernel/pid.c b/kernel/pid.c
index 0f4ecb5..04821f4 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -584,7 +584,7 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
struct file *file;
int ret;
- ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ ret = mutex_lock_killable(&task->signal->exec_update_mutex);
if (ret)
return ERR_PTR(ret);
@@ -593,7 +593,7 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd)
else
file = ERR_PTR(-EPERM);
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->exec_update_mutex);
return file ?: ERR_PTR(-EBADF);
}
--
1.9.1
This removes the last users of cred_guard_mutex
and replaces it with a new mutex exec_guard_mutex,
and a boolean unsafe_execve_in_progress.
This addresses the case when at least one of the
sibling threads is traced, and therefore the tracer
may dead-lock in ptrace_attach, while de_thread
needs to wait for the tracer to continue execution.
The solution is to detect this situation and make
ptrace_attach and similar functions return -EAGAIN,
but only in a situation where a dead-lock is imminent.
This means this is an API change, but only when the
process is traced while execve happens in a
multi-threaded application.
See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.
Signed-off-by: Bernd Edlinger <[email protected]>
---
fs/exec.c | 44 +++++++++++++++++++++++++++++++++++---------
fs/proc/base.c | 13 ++++++++-----
include/linux/sched/signal.h | 14 +++++++++-----
init/init_task.c | 2 +-
kernel/cred.c | 2 +-
kernel/fork.c | 2 +-
kernel/ptrace.c | 20 +++++++++++++++++---
kernel/seccomp.c | 15 +++++++++------
8 files changed, 81 insertions(+), 31 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 11974a1..6b78518 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1073,14 +1073,26 @@ static int de_thread(struct task_struct *tsk)
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
+ struct task_struct *t = tsk;
if (thread_group_empty(tsk))
goto no_thread_group;
+ spin_lock_irq(lock);
+ while_each_thread(tsk, t) {
+ if (unlikely(t->ptrace))
+ sig->unsafe_execve_in_progress = true;
+ }
+
+ if (unlikely(sig->unsafe_execve_in_progress)) {
+ spin_unlock_irq(lock);
+ mutex_unlock(&sig->exec_guard_mutex);
+ spin_lock_irq(lock);
+ }
+
/*
* Kill all other threads in the thread group.
*/
- spin_lock_irq(lock);
if (signal_group_exit(sig)) {
/*
* Another group action in progress, just
@@ -1424,22 +1436,30 @@ void finalize_exec(struct linux_binprm *bprm)
EXPORT_SYMBOL(finalize_exec);
/*
- * Prepare credentials and lock ->cred_guard_mutex.
+ * Prepare credentials and lock ->exec_guard_mutex.
* install_exec_creds() commits the new creds and drops the lock.
* Or, if exec fails before, free_bprm() should release ->cred and
* and unlock.
*/
static int prepare_bprm_creds(struct linux_binprm *bprm)
{
- if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+ int ret;
+
+ if (mutex_lock_interruptible(&current->signal->exec_guard_mutex))
return -ERESTARTNOINTR;
+ ret = -EAGAIN;
+ if (unlikely(current->signal->unsafe_execve_in_progress))
+ goto out;
+
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
- mutex_unlock(&current->signal->cred_guard_mutex);
- return -ENOMEM;
+ ret = -ENOMEM;
+out:
+ mutex_unlock(&current->signal->exec_guard_mutex);
+ return ret;
}
static void free_bprm(struct linux_binprm *bprm)
@@ -1448,7 +1468,10 @@ static void free_bprm(struct linux_binprm *bprm)
if (bprm->cred) {
if (bprm->called_exec_mmap)
mutex_unlock(&current->signal->exec_update_mutex);
- mutex_unlock(&current->signal->cred_guard_mutex);
+ if (unlikely(current->signal->unsafe_execve_in_progress))
+ mutex_lock(&current->signal->exec_guard_mutex);
+ current->signal->unsafe_execve_in_progress = false;
+ mutex_unlock(&current->signal->exec_guard_mutex);
abort_creds(bprm->cred);
}
if (bprm->file) {
@@ -1492,19 +1515,22 @@ void install_exec_creds(struct linux_binprm *bprm)
if (get_dumpable(current->mm) != SUID_DUMP_USER)
perf_event_exit_task(current);
/*
- * cred_guard_mutex must be held at least to this point to prevent
+ * exec_guard_mutex must be held at least to this point to prevent
* ptrace_attach() from altering our determination of the task's
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
mutex_unlock(&current->signal->exec_update_mutex);
- mutex_unlock(&current->signal->cred_guard_mutex);
+ if (unlikely(current->signal->unsafe_execve_in_progress))
+ mutex_lock(&current->signal->exec_guard_mutex);
+ current->signal->unsafe_execve_in_progress = false;
+ mutex_unlock(&current->signal->exec_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
/*
* determine how safe it is to execute the proposed program
- * - the caller must hold ->cred_guard_mutex to protect against
+ * - the caller must hold ->exec_guard_mutex to protect against
* PTRACE_ATTACH or seccomp thread-sync
*/
static void check_unsafe_exec(struct linux_binprm *bprm)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6b13fc4..a428536 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2680,14 +2680,17 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
}
/* Guard against adverse ptrace interaction */
- rv = mutex_lock_interruptible(&current->signal->cred_guard_mutex);
+ rv = mutex_lock_interruptible(&current->signal->exec_guard_mutex);
if (rv < 0)
goto out_free;
- rv = security_setprocattr(PROC_I(inode)->op.lsm,
- file->f_path.dentry->d_name.name, page,
- count);
- mutex_unlock(&current->signal->cred_guard_mutex);
+ if (unlikely(current->signal->unsafe_execve_in_progress))
+ rv = -EAGAIN;
+ else
+ rv = security_setprocattr(PROC_I(inode)->op.lsm,
+ file->f_path.dentry->d_name.name,
+ page, count);
+ mutex_unlock(&current->signal->exec_guard_mutex);
out_free:
kfree(page);
out:
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a29df79..e83cef2 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -212,6 +212,13 @@ struct signal_struct {
#endif
/*
+ * Set while execve is executing but is *not* holding
+ * exec_guard_mutex to avoid possible dead-locks.
+ * Only valid when exec_guard_mutex is held.
+ */
+ bool unsafe_execve_in_progress;
+
+ /*
* Thread is the potential origin of an oom condition; kill first on
* oom
*/
@@ -222,11 +229,8 @@ struct signal_struct {
struct mm_struct *oom_mm; /* recorded mm when the thread group got
* killed by the oom killer */
- struct mutex cred_guard_mutex; /* guard against foreign influences on
- * credential calculations
- * (notably. ptrace)
- * Deprecated do not use in new code.
- * Use exec_update_mutex instead.
+ struct mutex exec_guard_mutex; /* Held while execve runs, except when
+ * a sibling thread is being traced.
*/
struct mutex exec_update_mutex; /* Held while task_struct is being
* updated during exec, and may have
diff --git a/init/init_task.c b/init/init_task.c
index bd403ed..6f96327 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -25,7 +25,7 @@
},
.multiprocess = HLIST_HEAD_INIT,
.rlim = INIT_RLIMITS,
- .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+ .exec_guard_mutex = __MUTEX_INITIALIZER(init_signals.exec_guard_mutex),
.exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
diff --git a/kernel/cred.c b/kernel/cred.c
index 71a7926..341ca59 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -295,7 +295,7 @@ struct cred *prepare_creds(void)
/*
* Prepare credentials for current to perform an execve()
- * - The caller must hold ->cred_guard_mutex
+ * - The caller must hold ->exec_guard_mutex
*/
struct cred *prepare_exec_creds(void)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index e23ccac..98012f7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1593,7 +1593,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
- mutex_init(&sig->cred_guard_mutex);
+ mutex_init(&sig->exec_guard_mutex);
mutex_init(&sig->exec_update_mutex);
return 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179..221759e 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -392,9 +392,13 @@ static int ptrace_attach(struct task_struct *task, long request,
* under ptrace.
*/
retval = -ERESTARTNOINTR;
- if (mutex_lock_interruptible(&task->signal->cred_guard_mutex))
+ if (mutex_lock_interruptible(&task->signal->exec_guard_mutex))
goto out;
+ retval = -EAGAIN;
+ if (unlikely(task->signal->unsafe_execve_in_progress))
+ goto unlock_creds;
+
task_lock(task);
retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
task_unlock(task);
@@ -447,7 +451,7 @@ static int ptrace_attach(struct task_struct *task, long request,
unlock_tasklist:
write_unlock_irq(&tasklist_lock);
unlock_creds:
- mutex_unlock(&task->signal->cred_guard_mutex);
+ mutex_unlock(&task->signal->exec_guard_mutex);
out:
if (!retval) {
/*
@@ -472,10 +476,18 @@ static int ptrace_attach(struct task_struct *task, long request,
*/
static int ptrace_traceme(void)
{
- int ret = -EPERM;
+ int ret;
+
+ if (mutex_lock_interruptible(&current->signal->exec_guard_mutex))
+ return -ERESTARTNOINTR;
+
+ ret = -EAGAIN;
+ if (unlikely(current->signal->unsafe_execve_in_progress))
+ goto unlock_creds;
write_lock_irq(&tasklist_lock);
/* Are we already being traced? */
+ ret = -EPERM;
if (!current->ptrace) {
ret = security_ptrace_traceme(current->parent);
/*
@@ -490,6 +502,8 @@ static int ptrace_traceme(void)
}
write_unlock_irq(&tasklist_lock);
+unlock_creds:
+ mutex_unlock(&current->signal->exec_guard_mutex);
return ret;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b6ea3dc..acd6960 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -329,7 +329,7 @@ static int is_ancestor(struct seccomp_filter *parent,
/**
* seccomp_can_sync_threads: checks if all threads can be synchronized
*
- * Expects sighand and cred_guard_mutex locks to be held.
+ * Expects sighand and exec_guard_mutex locks to be held.
*
* Returns 0 on success, -ve on error, or the pid of a thread which was
* either not in the correct seccomp mode or did not have an ancestral
@@ -339,9 +339,12 @@ static inline pid_t seccomp_can_sync_threads(void)
{
struct task_struct *thread, *caller;
- BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+ BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex));
assert_spin_locked(&current->sighand->siglock);
+ if (unlikely(current->signal->unsafe_execve_in_progress))
+ return -EAGAIN;
+
/* Validate all threads being eligible for synchronization. */
caller = current;
for_each_thread(caller, thread) {
@@ -371,7 +374,7 @@ static inline pid_t seccomp_can_sync_threads(void)
/**
* seccomp_sync_threads: sets all threads to use current's filter
*
- * Expects sighand and cred_guard_mutex locks to be held, and for
+ * Expects sighand and exec_guard_mutex locks to be held, and for
* seccomp_can_sync_threads() to have returned success already
* without dropping the locks.
*
@@ -380,7 +383,7 @@ static inline void seccomp_sync_threads(unsigned long flags)
{
struct task_struct *thread, *caller;
- BUG_ON(!mutex_is_locked(&current->signal->cred_guard_mutex));
+ BUG_ON(!mutex_is_locked(&current->signal->exec_guard_mutex));
assert_spin_locked(&current->sighand->siglock);
/* Synchronize all threads. */
@@ -1319,7 +1322,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
* while another thread is in the middle of calling exec.
*/
if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
- mutex_lock_killable(&current->signal->cred_guard_mutex))
+ mutex_lock_killable(&current->signal->exec_guard_mutex))
goto out_put_fd;
spin_lock_irq(&current->sighand->siglock);
@@ -1337,7 +1340,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
out:
spin_unlock_irq(&current->sighand->siglock);
if (flags & SECCOMP_FILTER_FLAG_TSYNC)
- mutex_unlock(&current->signal->cred_guard_mutex);
+ mutex_unlock(&current->signal->exec_guard_mutex);
out_put_fd:
if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
if (ret) {
--
1.9.1
On 3/13/20 10:13 AM, Kirill Tkhai wrote:
>
> Although this should fix the problem, it looks like a broken puzzle.
>
> We can't use bprm->cred as an indicator of whether the mutex was locked or not.
> We can check bprm->cred with regard to cred_guard_mutex, because there is a
> strong rule: "cred_guard_mutex becomes locked together with the bprm->cred assignment
> (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing".
> Pay attention to the modularity of all this: there are no dependencies on anything else.
>
> With regard to the newly introduced exec_update_mutex, both your fix and the original
> patch look like an obfuscation. The mutex becomes tightly glued to the unrelated
> bprm->cred and bprm->mm, and this will cause problems for future modification and
> maintenance of all the involved entities. If someone wants to move some functions
> relative to each other, it will be painful, and this person will have to walk through
> the same dependencies and hit the same bug that Eric stepped on in the original patch.
>
Okay, yes, valid points you make, thanks.
I just wanted to understand what was exactly wrong with this patch,
since the failure mode looked a lot like it was failing because of
something clobbering the data unexpectedly.
So I have posted a few updated patches for the failed one here:
[PATCH v3 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
[PATCH] pidfd: Use new infrastructure to fix deadlocks in execve
which replaces these:
[PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
https://lore.kernel.org/lkml/[email protected]/
[PATCH] pidfd: Stop taking cred_guard_mutex
https://lore.kernel.org/lkml/[email protected]/
and a new patch series to fix deadlock in ptrace_attach and update doc:
[PATCH 0/2] exec: Fix dead-lock in de_thread with ptrace_attach
[PATCH 1/2] exec: Fix dead-lock in de_thread with ptrace_attach
[PATCH 2/2] doc: Update documentation of ->exec_*_mutex
Other patches needed, still valid:
[PATCH v2 1/5] exec: Only compute current once in flush_old_exec
https://lore.kernel.org/lkml/[email protected]/
[PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
https://lore.kernel.org/lkml/[email protected]/
[PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
https://lore.kernel.org/lkml/[email protected]/
[PATCH 1/4] exec: Fix a deadlock in ptrace
https://lore.kernel.org/lkml/AM6PR03MB517033EAD25BED15CC84E17DE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH 2/4] selftests/ptrace: add test cases for dead-locks
https://lore.kernel.org/lkml/AM6PR03MB51703199741A2C27A78980FFE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core
https://lore.kernel.org/lkml/AM6PR03MB5170ED6D4D216EEEEF400136E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH 4/4] kernel: doc: remove outdated comment cred.c
https://lore.kernel.org/lkml/AM6PR03MB517039DB07AB641C194FEA57E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB517057A2269C3A4FB287B76EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB51705D211EC8E7EA270627B1E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB5170BD2476E35068E182EFA4E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB517035DEEDB9C8699CB6B34EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
I think most of the existing patches are already approved, but if
there are still change requests, please let me know.
Thanks
Bernd.
This completes the new infrastructure patch. It replaces the
cred_guard_mutex with an exec_guard_mutex and a boolean that
is set when a dead-lock situation is detected.
I also changed ptrace_traceme to use the new mutex. I consider
it a bug that it did not take any mutex previously, since it calls
security_ptrace_traceme, and all the security modules operate under
the assumption that execve is not running in parallel.
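To make the "mutex plus boolean" pattern concrete, here is a minimal userspace sketch using pthreads. It only models the checking side seen in the ptrace_attach() and ptrace_traceme() hunks: take the guard mutex, bail out with -EAGAIN if the flag is set, otherwise do the work. The names struct signal_model, signal_model_init(), model_set_unsafe_execve() and model_attach() are hypothetical and not part of the patch, and where exactly execve sets the flag is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the fields this series touches. */
struct signal_model {
	pthread_mutex_t exec_guard_mutex;
	bool unsafe_execve_in_progress;	/* only accessed with the mutex held */
	bool traced;			/* stands in for task->ptrace != 0 */
};

void signal_model_init(struct signal_model *sig)
{
	int err = pthread_mutex_init(&sig->exec_guard_mutex, NULL);

	assert(err == 0);
	(void)err;
	sig->unsafe_execve_in_progress = false;
	sig->traced = false;
}

/* Assumed exec-side hook: the real place where the flag is set is not modeled. */
void model_set_unsafe_execve(struct signal_model *sig, bool in_progress)
{
	pthread_mutex_lock(&sig->exec_guard_mutex);
	sig->unsafe_execve_in_progress = in_progress;
	pthread_mutex_unlock(&sig->exec_guard_mutex);
}

/* Same shape as the ptrace hunks: lock, check the flag, then do the work. */
int model_attach(struct signal_model *sig)
{
	int ret;

	pthread_mutex_lock(&sig->exec_guard_mutex);

	ret = -EAGAIN;
	if (sig->unsafe_execve_in_progress)
		goto unlock;

	ret = -EPERM;
	if (!sig->traced) {
		sig->traced = true;
		ret = 0;
	}
unlock:
	pthread_mutex_unlock(&sig->exec_guard_mutex);
	return ret;
}
```

The point of the sketch is that the caller never sleeps waiting for the exec to finish: it either gets the mutex and proceeds, or it sees the flag and returns immediately.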
This patch fixes the test case tools/testing/selftests/ptrace/vmaccess:
[==========] Running 2 tests from 1 test cases.
[ RUN ] global.vmaccess
[ OK ] global.vmaccess
[ RUN ] global.attach
[ OK ] global.attach <= this was still failing
[==========] 2 / 2 tests passed.
[ PASSED ]
Yes, it is an API change, but only in a very special case,
so I would expect this to be unnoticeable to user space applications.
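For illustration only, this is roughly how a tracer could cope with the new -EAGAIN result if it ever did notice it. The helper attach_with_retry(), the attach_fn type, the 10 ms delay and the fake_attach() test double are all hypothetical. Only the ptrace(PTRACE_ATTACH, ...) call in real_ptrace_attach() is an existing API, and whether retrying is the right policy for a given tracer is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <time.h>

/* Attach primitive, injectable so the retry loop can be exercised in a test. */
typedef long (*attach_fn)(pid_t pid);

long real_ptrace_attach(pid_t pid)
{
	return ptrace(PTRACE_ATTACH, pid, NULL, NULL);
}

/* Test double: fails with EAGAIN for the first busy_calls_left calls. */
int busy_calls_left;

long fake_attach(pid_t pid)
{
	(void)pid;
	if (busy_calls_left > 0) {
		busy_calls_left--;
		errno = EAGAIN;
		return -1;
	}
	return 0;
}

/*
 * Hypothetical helper: retry the attach while it fails with EAGAIN.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int attach_with_retry(attach_fn attach, pid_t pid, int max_tries)
{
	const struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int i;

	assert(max_tries > 0);
	for (i = 0; i < max_tries; i++) {
		errno = 0;
		if (attach(pid) == 0)
			return 0;
		if (errno != EAGAIN)
			return -1;	/* a real error, do not retry */
		nanosleep(&delay, NULL);
	}
	errno = EAGAIN;
	return -1;
}
```

A tracer would call attach_with_retry(real_ptrace_attach, pid, n), while the fake lets the loop be checked without a live tracee.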
Bernd Edlinger (2):
exec: Fix dead-lock in de_thread with ptrace_attach
doc: Update documentation of ->exec_*_mutex
Documentation/security/credentials.rst | 29 +++++++++++++++-------
fs/exec.c | 44 +++++++++++++++++++++++++++-------
fs/proc/base.c | 13 ++++++----
include/linux/sched/signal.h | 14 +++++++----
init/init_task.c | 2 +-
kernel/cred.c | 2 +-
kernel/fork.c | 2 +-
kernel/ptrace.c | 20 +++++++++++++---
kernel/seccomp.c | 15 +++++++-----
9 files changed, 102 insertions(+), 39 deletions(-)
--
1.9.1
On 14.03.2020 12:11, Bernd Edlinger wrote:
> The cred_guard_mutex is problematic. The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
> threads are killed. The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possibly indefinite wait for userspace.
>
> Add exec_update_mutex, which is only held while exec is updating the process
> with the new contents of exec, for use by code that must not be
> confused by exec changing the mm and the cred in ways that cannot
> happen during ordinary execution of a process.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_update_mutex one by one. This lets us move forward while still
> being careful and not introducing any regressions.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> fs/exec.c | 17 ++++++++++++++---
> include/linux/binfmts.h | 8 +++++++-
> include/linux/sched/signal.h | 9 ++++++++-
> init/init_task.c | 1 +
> kernel/fork.c | 1 +
> 5 files changed, 31 insertions(+), 5 deletions(-)
>
> v3: this update fixes lock-order and adds an explicit data member in linux_binprm
>
> diff --git a/fs/exec.c b/fs/exec.c
> index d820a72..11974a1 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1014,12 +1014,17 @@ static int exec_mmap(struct mm_struct *mm)
> {
> struct task_struct *tsk;
> struct mm_struct *old_mm, *active_mm;
> + int ret;
>
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> old_mm = current->mm;
> exec_mm_release(tsk, old_mm);
>
> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> + if (ret)
> + return ret;
> +
> if (old_mm) {
> sync_mm_rss(old_mm);
> /*
> @@ -1031,9 +1036,11 @@ static int exec_mmap(struct mm_struct *mm)
> down_read(&old_mm->mmap_sem);
> if (unlikely(old_mm->core_state)) {
> up_read(&old_mm->mmap_sem);
> + mutex_unlock(&tsk->signal->exec_update_mutex);
> return -EINTR;
> }
> }
> +
> task_lock(tsk);
> active_mm = tsk->active_mm;
> membarrier_exec_mmap(mm);
> @@ -1288,11 +1295,12 @@ int flush_old_exec(struct linux_binprm * bprm)
> goto out;
>
> /*
> - * After clearing bprm->mm (to mark that current is using the
> - * prepared mm now), we have nothing left of the original
> + * After setting bprm->called_exec_mmap (to mark that current is
> + * using the prepared mm now), we have nothing left of the original
> * process. If anything from here on returns an error, the check
> * in search_binary_handler() will SEGV current.
> */
> + bprm->called_exec_mmap = 1;
The two lines below form an inseparable pair:
exec_mmap(bprm->mm);
bprm->called_exec_mmap = 1;
Why not move this into exec_mmap(), so that nobody can insert anything
between them?
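A rough userspace sketch of what is meant, with pthreads standing in for the kernel mutex: the function that takes the lock also records that fact, so the flag and the lock state cannot drift apart. struct binprm_model, model_exec_mmap() and model_free_bprm() are hypothetical names, and passing the bprm into the function is an assumption (the real exec_mmap() takes only the mm).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one linux_binprm field of interest. */
struct binprm_model {
	bool called_exec_mmap;	/* true exactly while we hold exec_update_mutex */
};

pthread_mutex_t exec_update_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * On success the mutex stays held and the flag is set in the same place.
 * On failure the mutex is released and the flag is left untouched.
 * simulate_failure models the old_mm->core_state error path.
 */
int model_exec_mmap(struct binprm_model *bprm, bool simulate_failure)
{
	assert(bprm != NULL);

	pthread_mutex_lock(&exec_update_mutex);
	if (simulate_failure) {
		pthread_mutex_unlock(&exec_update_mutex);
		return -EINTR;
	}
	/* ... the switch to the new mm would happen here ... */
	bprm->called_exec_mmap = true;
	return 0;
}

/* Cleanup only needs to look at the flag to know whether to unlock. */
void model_free_bprm(struct binprm_model *bprm)
{
	if (bprm->called_exec_mmap) {
		pthread_mutex_unlock(&exec_update_mutex);
		bprm->called_exec_mmap = false;
	}
}
```

With this shape, a caller cannot insert code between taking the lock and setting the flag, because both happen inside one function.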
> bprm->mm = NULL;
>
> #ifdef CONFIG_POSIX_TIMERS
> @@ -1438,6 +1446,8 @@ static void free_bprm(struct linux_binprm *bprm)
> {
> free_arg_pages(bprm);
> if (bprm->cred) {
> + if (bprm->called_exec_mmap)
> + mutex_unlock(&current->signal->exec_update_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> abort_creds(bprm->cred);
> }
> @@ -1487,6 +1497,7 @@ void install_exec_creds(struct linux_binprm *bprm)
> * credentials; any time after this it may be unlocked.
> */
> security_bprm_committed_creds(bprm);
> + mutex_unlock(&current->signal->exec_update_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> }
> EXPORT_SYMBOL(install_exec_creds);
> @@ -1678,7 +1689,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>
> read_lock(&binfmt_lock);
> put_binfmt(fmt);
> - if (retval < 0 && !bprm->mm) {
> + if (retval < 0 && bprm->called_exec_mmap) {
> /* we got to flush_old_exec() and failed after it */
> read_unlock(&binfmt_lock);
> force_sigsegv(SIGSEGV);
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index b40fc63..a345d9f 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -44,7 +44,13 @@ struct linux_binprm {
> * exec has happened. Used to sanitize execution environment
> * and to set AT_SECURE auxv for glibc.
> */
> - secureexec:1;
> + secureexec:1,
> + /*
> + * Set by flush_old_exec, when exec_mmap has been called.
> + * This is past the point of no return, when the
> + * exec_update_mutex has been taken.
> + */
> + called_exec_mmap:1;
> #ifdef __alpha__
> unsigned int taso:1;
> #endif
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 8805025..a29df79 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -224,7 +224,14 @@ struct signal_struct {
>
> struct mutex cred_guard_mutex; /* guard against foreign influences on
> * credential calculations
> - * (notably. ptrace) */
> + * (notably. ptrace)
> + * Deprecated do not use in new code.
> + * Use exec_update_mutex instead.
> + */
> + struct mutex exec_update_mutex; /* Held while task_struct is being
> + * updated during exec, and may have
> + * inconsistent permissions.
> + */
> } __randomize_layout;
>
> /*
> diff --git a/init/init_task.c b/init/init_task.c
> index 9e5cbe5..bd403ed 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -26,6 +26,7 @@
> .multiprocess = HLIST_HEAD_INIT,
> .rlim = INIT_RLIMITS,
> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> + .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
> #ifdef CONFIG_POSIX_TIMERS
> .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
> .cputimer = {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8642530..036b692 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
> sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>
> mutex_init(&sig->cred_guard_mutex);
> + mutex_init(&sig->exec_update_mutex);
>
> return 0;
> }
>
On 14.03.2020 13:02, Bernd Edlinger wrote:
> On 3/14/20 10:57 AM, Bernd Edlinger wrote:
>> On 3/13/20 10:13 AM, Kirill Tkhai wrote:
>>>
>>> Although this should fix the problem, it looks like a broken puzzle.
>>>
>>> We can't use bprm->cred as an indicator of whether the mutex was locked or not.
>>> We can check bprm->cred with regard to cred_guard_mutex, because there is a
>>> strong rule: "cred_guard_mutex becomes locked together with the bprm->cred assignment
>>> (see prepare_bprm_creds()), and it becomes unlocked together with bprm->cred zeroing".
>>> Pay attention to the modularity of all this: there are no dependencies on anything else.
>>>
>>> With regard to the newly introduced exec_update_mutex, both your fix and the original
>>> patch look like an obfuscation. The mutex becomes tightly glued to the unrelated
>>> bprm->cred and bprm->mm, and this will cause problems for future modification and
>>> maintenance of all the involved entities. If someone wants to move some functions
>>> relative to each other, it will be painful, and this person will have to walk through
>>> the same dependencies and hit the same bug that Eric stepped on in the original patch.
>>>
>>
>> Okay, yes, valid points you make, thanks.
>> I just wanted to understand what was exactly wrong with this patch,
>> since the failure mode looked a lot like it was failing because of
>> something clobbering the data unexpectedly.
>>
>>
>> So I have posted a few updated patches for the failed one here:
>>
>> [PATCH v3 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
>> [PATCH] pidfd: Use new infrastructure to fix deadlocks in execve
>>
>> which replaces these:
>> [PATCH v2 5/5] exec: Add a exec_update_mutex to replace cred_guard_mutex
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> [PATCH] pidfd: Stop taking cred_guard_mutex
>> https://lore.kernel.org/lkml/[email protected]/
>>
>>
>> and a new patch series to fix deadlock in ptrace_attach and update doc:
>> [PATCH 0/2] exec: Fix dead-lock in de_thread with ptrace_attach
>> [PATCH 1/2] exec: Fix dead-lock in de_thread with ptrace_attach
>> [PATCH 2/2] doc: Update documentation of ->exec_*_mutex
>>
>>
>> Other patches needed, still valid:
>>
>> [PATCH v2 1/5] exec: Only compute current once in flush_old_exec
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> [PATCH v2 2/5] exec: Factor unshare_sighand out of de_thread and call it separately
>> https://lore.kernel.org/lkml/[email protected]/
>>
>
> Ah, sorry, forgot this one:
> [PATCH v2 3/5] exec: Move cleanup of posix timers on exec out of de_thread
> https://lore.kernel.org/lkml/[email protected]/
>
>> [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec
>> https://lore.kernel.org/lkml/[email protected]/
1-4/5 look OK for me. You may add my
Reviewed-by: Kirill Tkhai <[email protected]>
>> [PATCH 1/4] exec: Fix a deadlock in ptrace
>> https://lore.kernel.org/lkml/AM6PR03MB517033EAD25BED15CC84E17DE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 2/4] selftests/ptrace: add test cases for dead-locks
>> https://lore.kernel.org/lkml/AM6PR03MB51703199741A2C27A78980FFE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 3/4] mm: docs: Fix a comment in process_vm_rw_core
>> https://lore.kernel.org/lkml/AM6PR03MB5170ED6D4D216EEEEF400136E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 4/4] kernel: doc: remove outdated comment cred.c
>> https://lore.kernel.org/lkml/AM6PR03MB517039DB07AB641C194FEA57E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 1/4] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
>> https://lore.kernel.org/lkml/AM6PR03MB517057A2269C3A4FB287B76EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 2/4] proc: Use new infrastructure to fix deadlocks in execve
>> https://lore.kernel.org/lkml/AM6PR03MB51705D211EC8E7EA270627B1E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 3/4] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
>> https://lore.kernel.org/lkml/AM6PR03MB5170BD2476E35068E182EFA4E4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>> [PATCH 4/4] perf: Use new infrastructure to fix deadlocks in execve
>> https://lore.kernel.org/lkml/AM6PR03MB517035DEEDB9C8699CB6B34EE4FF0@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>
>>
>> I think most of the existing patches are already approved, but if
>> there are still change requests, please let me know.
>>
>>
>> Thanks
>> Bernd.
>>
>
> Hope it is correct now.
> I haven't seen the new patches on the kernel archives yet,
> so I cannot add URLs for them.
>
> Bernd.
>
On 3/17/20 9:56 AM, Kirill Tkhai wrote:
> On 14.03.2020 12:11, Bernd Edlinger wrote:
>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>> over the userspace accesses as the arguments from userspace are read.
>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>> threads are killed. The cred_guard_mutex is held over
>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>
>> Any of those can result in deadlock, as the cred_guard_mutex is held
>> over a possibly indefinite wait for userspace.
>>
>> Add exec_update_mutex, which is only held while exec is updating the process
>> with the new contents of exec, for use by code that must not be
>> confused by exec changing the mm and the cred in ways that cannot
>> happen during ordinary execution of a process.
>>
>> The plan is to switch the users of cred_guard_mutex to
>> exec_update_mutex one by one. This lets us move forward while still
>> being careful and not introducing any regressions.
>>
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> Signed-off-by: Bernd Edlinger <[email protected]>
>> ---
>> fs/exec.c | 17 ++++++++++++++---
>> include/linux/binfmts.h | 8 +++++++-
>> include/linux/sched/signal.h | 9 ++++++++-
>> init/init_task.c | 1 +
>> kernel/fork.c | 1 +
>> 5 files changed, 31 insertions(+), 5 deletions(-)
>>
>> v3: this update fixes lock-order and adds an explicit data member in linux_binprm
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index d820a72..11974a1 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1014,12 +1014,17 @@ static int exec_mmap(struct mm_struct *mm)
>> {
>> struct task_struct *tsk;
>> struct mm_struct *old_mm, *active_mm;
>> + int ret;
>>
>> /* Notify parent that we're no longer interested in the old VM */
>> tsk = current;
>> old_mm = current->mm;
>> exec_mm_release(tsk, old_mm);
>>
>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> + if (ret)
>> + return ret;
>> +
>> if (old_mm) {
>> sync_mm_rss(old_mm);
>> /*
>> @@ -1031,9 +1036,11 @@ static int exec_mmap(struct mm_struct *mm)
>> down_read(&old_mm->mmap_sem);
>> if (unlikely(old_mm->core_state)) {
>> up_read(&old_mm->mmap_sem);
>> + mutex_unlock(&tsk->signal->exec_update_mutex);
>> return -EINTR;
>> }
>> }
>> +
>> task_lock(tsk);
>> active_mm = tsk->active_mm;
>> membarrier_exec_mmap(mm);
>> @@ -1288,11 +1295,12 @@ int flush_old_exec(struct linux_binprm * bprm)
>> goto out;
>>
>> /*
>> - * After clearing bprm->mm (to mark that current is using the
>> - * prepared mm now), we have nothing left of the original
>> + * After setting bprm->called_exec_mmap (to mark that current is
>> + * using the prepared mm now), we have nothing left of the original
>> * process. If anything from here on returns an error, the check
>> * in search_binary_handler() will SEGV current.
>> */
>> + bprm->called_exec_mmap = 1;
>
> The two lines below form an inseparable pair:
>
> exec_mmap(bprm->mm);
> bprm->called_exec_mmap = 1;
>
> Why not move this into exec_mmap(), so that nobody can insert anything
> between them?
>
Hmm, could be done, but then I would probably need a different name than
"called_exec_mmap".
How about adding a nice function comment to exec_mmap that calls out the
changed behaviour that the exec_update_mutex is taken unless the function
fails?
Bernd.
>> bprm->mm = NULL;
>>
>> #ifdef CONFIG_POSIX_TIMERS
>> @@ -1438,6 +1446,8 @@ static void free_bprm(struct linux_binprm *bprm)
>> {
>> free_arg_pages(bprm);
>> if (bprm->cred) {
>> + if (bprm->called_exec_mmap)
>> + mutex_unlock(&current->signal->exec_update_mutex);
>> mutex_unlock(&current->signal->cred_guard_mutex);
>> abort_creds(bprm->cred);
>> }
>> @@ -1487,6 +1497,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>> * credentials; any time after this it may be unlocked.
>> */
>> security_bprm_committed_creds(bprm);
>> + mutex_unlock(&current->signal->exec_update_mutex);
>> mutex_unlock(&current->signal->cred_guard_mutex);
>> }
>> EXPORT_SYMBOL(install_exec_creds);
>> @@ -1678,7 +1689,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>>
>> read_lock(&binfmt_lock);
>> put_binfmt(fmt);
>> - if (retval < 0 && !bprm->mm) {
>> + if (retval < 0 && bprm->called_exec_mmap) {
>> /* we got to flush_old_exec() and failed after it */
>> read_unlock(&binfmt_lock);
>> force_sigsegv(SIGSEGV);
>> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
>> index b40fc63..a345d9f 100644
>> --- a/include/linux/binfmts.h
>> +++ b/include/linux/binfmts.h
>> @@ -44,7 +44,13 @@ struct linux_binprm {
>> * exec has happened. Used to sanitize execution environment
>> * and to set AT_SECURE auxv for glibc.
>> */
>> - secureexec:1;
>> + secureexec:1,
>> + /*
>> + * Set by flush_old_exec, when exec_mmap has been called.
>> + * This is past the point of no return, when the
>> + * exec_update_mutex has been taken.
>> + */
>> + called_exec_mmap:1;
>> #ifdef __alpha__
>> unsigned int taso:1;
>> #endif
>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>> index 8805025..a29df79 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -224,7 +224,14 @@ struct signal_struct {
>>
>> struct mutex cred_guard_mutex; /* guard against foreign influences on
>> * credential calculations
>> - * (notably. ptrace) */
>> + * (notably. ptrace)
>> + * Deprecated do not use in new code.
>> + * Use exec_update_mutex instead.
>> + */
>> + struct mutex exec_update_mutex; /* Held while task_struct is being
>> + * updated during exec, and may have
>> + * inconsistent permissions.
>> + */
>> } __randomize_layout;
>>
>> /*
>> diff --git a/init/init_task.c b/init/init_task.c
>> index 9e5cbe5..bd403ed 100644
>> --- a/init/init_task.c
>> +++ b/init/init_task.c
>> @@ -26,6 +26,7 @@
>> .multiprocess = HLIST_HEAD_INIT,
>> .rlim = INIT_RLIMITS,
>> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>> + .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
>> #ifdef CONFIG_POSIX_TIMERS
>> .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
>> .cputimer = {
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 8642530..036b692 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>> sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>>
>> mutex_init(&sig->cred_guard_mutex);
>> + mutex_init(&sig->exec_update_mutex);
>>
>> return 0;
>> }
>>
>
On 18.03.2020 00:53, Bernd Edlinger wrote:
> On 3/17/20 9:56 AM, Kirill Tkhai wrote:
>> On 14.03.2020 12:11, Bernd Edlinger wrote:
>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>> over the userspace accesses as the arguments from userspace are read.
>>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>>> threads are killed. The cred_guard_mutex is held over
>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>
>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>> over a possibly indefinite wait for userspace.
>>>
>>> Add exec_update_mutex, which is only held while exec is updating the process
>>> with the new contents of exec, for use by code that must not be
>>> confused by exec changing the mm and the cred in ways that cannot
>>> happen during ordinary execution of a process.
>>>
>>> The plan is to switch the users of cred_guard_mutex to
>>> exec_update_mutex one by one. This lets us move forward while still
>>> being careful and not introducing any regressions.
>>>
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>> Signed-off-by: Bernd Edlinger <[email protected]>
>>> ---
>>> fs/exec.c | 17 ++++++++++++++---
>>> include/linux/binfmts.h | 8 +++++++-
>>> include/linux/sched/signal.h | 9 ++++++++-
>>> init/init_task.c | 1 +
>>> kernel/fork.c | 1 +
>>> 5 files changed, 31 insertions(+), 5 deletions(-)
>>>
>>> v3: this update fixes lock-order and adds an explicit data member in linux_binprm
>>>
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> index d820a72..11974a1 100644
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -1014,12 +1014,17 @@ static int exec_mmap(struct mm_struct *mm)
>>> {
>>> struct task_struct *tsk;
>>> struct mm_struct *old_mm, *active_mm;
>>> + int ret;
>>>
>>> /* Notify parent that we're no longer interested in the old VM */
>>> tsk = current;
>>> old_mm = current->mm;
>>> exec_mm_release(tsk, old_mm);
>>>
>>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>> + if (ret)
>>> + return ret;
>>> +
>>> if (old_mm) {
>>> sync_mm_rss(old_mm);
>>> /*
>>> @@ -1031,9 +1036,11 @@ static int exec_mmap(struct mm_struct *mm)
>>> down_read(&old_mm->mmap_sem);
>>> if (unlikely(old_mm->core_state)) {
>>> up_read(&old_mm->mmap_sem);
>>> + mutex_unlock(&tsk->signal->exec_update_mutex);
>>> return -EINTR;
>>> }
>>> }
>>> +
>>> task_lock(tsk);
>>> active_mm = tsk->active_mm;
>>> membarrier_exec_mmap(mm);
>>> @@ -1288,11 +1295,12 @@ int flush_old_exec(struct linux_binprm * bprm)
>>> goto out;
>>>
>>> /*
>>> - * After clearing bprm->mm (to mark that current is using the
>>> - * prepared mm now), we have nothing left of the original
>>> + * After setting bprm->called_exec_mmap (to mark that current is
>>> + * using the prepared mm now), we have nothing left of the original
>>> * process. If anything from here on returns an error, the check
>>> * in search_binary_handler() will SEGV current.
>>> */
>>> + bprm->called_exec_mmap = 1;
>>
>> The two lines below form an inseparable pair:
>>
>> exec_mmap(bprm->mm);
>> bprm->called_exec_mmap = 1;
>>
>> Why not move this into exec_mmap(), so that nobody can insert anything
>> between them?
>>
>
> Hmm, could be done, but then I would probably need a different name than
> "called_exec_mmap".
>
> How about adding a nice function comment to exec_mmap that calls out the
> changed behaviour that the exec_update_mutex is taken unless the function
> fails?
I'm not sure I understand correctly.
Could you post this as a small patch hunk (on top of anything you want)?
> Bernd.
>
>
>>> bprm->mm = NULL;
>>>
>>> #ifdef CONFIG_POSIX_TIMERS
>>> @@ -1438,6 +1446,8 @@ static void free_bprm(struct linux_binprm *bprm)
>>> {
>>> free_arg_pages(bprm);
>>> if (bprm->cred) {
>>> + if (bprm->called_exec_mmap)
>>> + mutex_unlock(&current->signal->exec_update_mutex);
>>> mutex_unlock(&current->signal->cred_guard_mutex);
>>> abort_creds(bprm->cred);
>>> }
>>> @@ -1487,6 +1497,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>>> * credentials; any time after this it may be unlocked.
>>> */
>>> security_bprm_committed_creds(bprm);
>>> + mutex_unlock(&current->signal->exec_update_mutex);
>>> mutex_unlock(&current->signal->cred_guard_mutex);
>>> }
>>> EXPORT_SYMBOL(install_exec_creds);
>>> @@ -1678,7 +1689,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>>>
>>> read_lock(&binfmt_lock);
>>> put_binfmt(fmt);
>>> - if (retval < 0 && !bprm->mm) {
>>> + if (retval < 0 && bprm->called_exec_mmap) {
>>> /* we got to flush_old_exec() and failed after it */
>>> read_unlock(&binfmt_lock);
>>> force_sigsegv(SIGSEGV);
>>> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
>>> index b40fc63..a345d9f 100644
>>> --- a/include/linux/binfmts.h
>>> +++ b/include/linux/binfmts.h
>>> @@ -44,7 +44,13 @@ struct linux_binprm {
>>> * exec has happened. Used to sanitize execution environment
>>> * and to set AT_SECURE auxv for glibc.
>>> */
>>> - secureexec:1;
>>> + secureexec:1,
>>> + /*
>>> + * Set by flush_old_exec, when exec_mmap has been called.
>>> + * This is past the point of no return, when the
>>> + * exec_update_mutex has been taken.
>>> + */
>>> + called_exec_mmap:1;
>>> #ifdef __alpha__
>>> unsigned int taso:1;
>>> #endif
>>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>>> index 8805025..a29df79 100644
>>> --- a/include/linux/sched/signal.h
>>> +++ b/include/linux/sched/signal.h
>>> @@ -224,7 +224,14 @@ struct signal_struct {
>>>
>>> struct mutex cred_guard_mutex; /* guard against foreign influences on
>>> * credential calculations
>>> - * (notably. ptrace) */
>>> + * (notably. ptrace)
>>> + * Deprecated do not use in new code.
>>> + * Use exec_update_mutex instead.
>>> + */
>>> + struct mutex exec_update_mutex; /* Held while task_struct is being
>>> + * updated during exec, and may have
>>> + * inconsistent permissions.
>>> + */
>>> } __randomize_layout;
>>>
>>> /*
>>> diff --git a/init/init_task.c b/init/init_task.c
>>> index 9e5cbe5..bd403ed 100644
>>> --- a/init/init_task.c
>>> +++ b/init/init_task.c
>>> @@ -26,6 +26,7 @@
>>> .multiprocess = HLIST_HEAD_INIT,
>>> .rlim = INIT_RLIMITS,
>>> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>>> + .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
>>> #ifdef CONFIG_POSIX_TIMERS
>>> .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
>>> .cputimer = {
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 8642530..036b692 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>>> sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>>>
>>> mutex_init(&sig->cred_guard_mutex);
>>> + mutex_init(&sig->exec_update_mutex);
>>>
>>> return 0;
>>> }
>>>
>>
On 3/18/20 1:22 PM, Kirill Tkhai wrote:
> On 18.03.2020 00:53, Bernd Edlinger wrote:
>> On 3/17/20 9:56 AM, Kirill Tkhai wrote:
>>> On 14.03.2020 12:11, Bernd Edlinger wrote:
>>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>>> over the userspace accesses as the arguments from userspace are read.
>>>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>>>> threads are killed. The cred_guard_mutex is held over
>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>
>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>> over a possibly indefinite wait for userspace.
>>>>
>>>> Add exec_update_mutex, which is held only while exec updates the
>>>> process with the new contents of exec, for use by code that must not
>>>> be confused by exec changing the mm and the cred in ways that cannot
>>>> happen during ordinary execution of a process.
>>>>
>>>> The plan is to switch the users of cred_guard_mutex to
>>>> exec_update_mutex one by one. This lets us move forward while still
>>>> being careful and not introducing any regressions.
>>>>
>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>>> Signed-off-by: Bernd Edlinger <[email protected]>
>>>> ---
>>>> fs/exec.c | 17 ++++++++++++++---
>>>> include/linux/binfmts.h | 8 +++++++-
>>>> include/linux/sched/signal.h | 9 ++++++++-
>>>> init/init_task.c | 1 +
>>>> kernel/fork.c | 1 +
>>>> 5 files changed, 31 insertions(+), 5 deletions(-)
>>>>
>>>> v3: this update fixes lock-order and adds an explicit data member in linux_binprm
>>>>
>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>> index d820a72..11974a1 100644
>>>> --- a/fs/exec.c
>>>> +++ b/fs/exec.c
>>>> @@ -1014,12 +1014,17 @@ static int exec_mmap(struct mm_struct *mm)
>>>> {
>>>> struct task_struct *tsk;
>>>> struct mm_struct *old_mm, *active_mm;
>>>> + int ret;
>>>>
>>>> /* Notify parent that we're no longer interested in the old VM */
>>>> tsk = current;
>>>> old_mm = current->mm;
>>>> exec_mm_release(tsk, old_mm);
>>>>
>>>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> if (old_mm) {
>>>> sync_mm_rss(old_mm);
>>>> /*
>>>> @@ -1031,9 +1036,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>> down_read(&old_mm->mmap_sem);
>>>> if (unlikely(old_mm->core_state)) {
>>>> up_read(&old_mm->mmap_sem);
>>>> + mutex_unlock(&tsk->signal->exec_update_mutex);
>>>> return -EINTR;
>>>> }
>>>> }
>>>> +
>>>> task_lock(tsk);
>>>> active_mm = tsk->active_mm;
>>>> membarrier_exec_mmap(mm);
>>>> @@ -1288,11 +1295,12 @@ int flush_old_exec(struct linux_binprm * bprm)
>>>> goto out;
>>>>
>>>> /*
>>>> - * After clearing bprm->mm (to mark that current is using the
>>>> - * prepared mm now), we have nothing left of the original
>>>> + * After setting bprm->called_exec_mmap (to mark that current is
>>>> + * using the prepared mm now), we have nothing left of the original
>>>> * process. If anything from here on returns an error, the check
>>>> * in search_binary_handler() will SEGV current.
>>>> */
>>>> + bprm->called_exec_mmap = 1;
>>>
>>> The two lines below form an inseparable pair:
>>>
>>> exec_mmap(bprm->mm);
>>> bprm->called_exec_mmap = 1;
>>>
>>> Why not move the assignment into exec_mmap(), so that nobody can insert
>>> anything between them?
>>>
>>
>> Hmm, could be done, but then I would probably need a different name than
>> "called_exec_mmap".
>>
>> How about adding a nice function comment to exec_mmap that calls out the
>> changed behaviour that the exec_update_mutex is taken unless the function
>> fails?
>
> Not sure I understand correctly.
>
> Could you post this as a small patch hunk (on top of anything you want)?
>
I was thinking of something like that:
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1010,6 +1010,11 @@ ssize_t read_code(struct file *file, unsigned long addr,
}
EXPORT_SYMBOL(read_code);
+/*
+ * Maps the mm_struct mm into the current task struct.
+ * On success, this function returns with the mutex
+ * exec_update_mutex locked.
+ */
static int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
>> Bernd.
>>
>>
>>>> bprm->mm = NULL;
>>>>
>>>> #ifdef CONFIG_POSIX_TIMERS
>>>> @@ -1438,6 +1446,8 @@ static void free_bprm(struct linux_binprm *bprm)
>>>> {
>>>> free_arg_pages(bprm);
>>>> if (bprm->cred) {
>>>> + if (bprm->called_exec_mmap)
>>>> + mutex_unlock(&current->signal->exec_update_mutex);
>>>> mutex_unlock(&current->signal->cred_guard_mutex);
>>>> abort_creds(bprm->cred);
>>>> }
>>>> @@ -1487,6 +1497,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>>>> * credentials; any time after this it may be unlocked.
>>>> */
>>>> security_bprm_committed_creds(bprm);
>>>> + mutex_unlock(&current->signal->exec_update_mutex);
>>>> mutex_unlock(&current->signal->cred_guard_mutex);
>>>> }
>>>> EXPORT_SYMBOL(install_exec_creds);
>>>> @@ -1678,7 +1689,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>>>>
>>>> read_lock(&binfmt_lock);
>>>> put_binfmt(fmt);
>>>> - if (retval < 0 && !bprm->mm) {
>>>> + if (retval < 0 && bprm->called_exec_mmap) {
>>>> /* we got to flush_old_exec() and failed after it */
>>>> read_unlock(&binfmt_lock);
>>>> force_sigsegv(SIGSEGV);
>>>> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
>>>> index b40fc63..a345d9f 100644
>>>> --- a/include/linux/binfmts.h
>>>> +++ b/include/linux/binfmts.h
>>>> @@ -44,7 +44,13 @@ struct linux_binprm {
>>>> * exec has happened. Used to sanitize execution environment
>>>> * and to set AT_SECURE auxv for glibc.
>>>> */
>>>> - secureexec:1;
>>>> + secureexec:1,
>>>> + /*
>>>> + * Set by flush_old_exec, when exec_mmap has been called.
>>>> + * This is past the point of no return, when the
>>>> + * exec_update_mutex has been taken.
>>>> + */
>>>> + called_exec_mmap:1;
>>>> #ifdef __alpha__
>>>> unsigned int taso:1;
>>>> #endif
>>>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>>>> index 8805025..a29df79 100644
>>>> --- a/include/linux/sched/signal.h
>>>> +++ b/include/linux/sched/signal.h
>>>> @@ -224,7 +224,14 @@ struct signal_struct {
>>>>
>>>> struct mutex cred_guard_mutex; /* guard against foreign influences on
>>>> * credential calculations
>>>> - * (notably. ptrace) */
>>>> + * (notably. ptrace)
>>>> + * Deprecated do not use in new code.
>>>> + * Use exec_update_mutex instead.
>>>> + */
>>>> + struct mutex exec_update_mutex; /* Held while task_struct is being
>>>> + * updated during exec, and may have
>>>> + * inconsistent permissions.
>>>> + */
>>>> } __randomize_layout;
>>>>
>>>> /*
>>>> diff --git a/init/init_task.c b/init/init_task.c
>>>> index 9e5cbe5..bd403ed 100644
>>>> --- a/init/init_task.c
>>>> +++ b/init/init_task.c
>>>> @@ -26,6 +26,7 @@
>>>> .multiprocess = HLIST_HEAD_INIT,
>>>> .rlim = INIT_RLIMITS,
>>>> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>>>> + .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
>>>> #ifdef CONFIG_POSIX_TIMERS
>>>> .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
>>>> .cputimer = {
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index 8642530..036b692 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>>>> sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>>>>
>>>> mutex_init(&sig->cred_guard_mutex);
>>>> + mutex_init(&sig->exec_update_mutex);
>>>>
>>>> return 0;
>>>> }
>>>>
>>>
>
On 18.03.2020 23:06, Bernd Edlinger wrote:
> On 3/18/20 1:22 PM, Kirill Tkhai wrote:
>> On 18.03.2020 00:53, Bernd Edlinger wrote:
>>> On 3/17/20 9:56 AM, Kirill Tkhai wrote:
>>>> On 14.03.2020 12:11, Bernd Edlinger wrote:
>>>>> The cred_guard_mutex is problematic. The cred_guard_mutex is held
>>>>> over the userspace accesses as the arguments from userspace are read.
>>>>> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
>>>>> threads are killed. The cred_guard_mutex is held over
>>>>> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>>>>>
>>>>> Any of those can result in deadlock, as the cred_guard_mutex is held
>>>>> over a possibly indefinite wait for userspace.
>>>>>
>>>>> Add exec_update_mutex, which is held only while exec updates the
>>>>> process with the new contents of exec, for use by code that must not
>>>>> be confused by exec changing the mm and the cred in ways that cannot
>>>>> happen during ordinary execution of a process.
>>>>>
>>>>> The plan is to switch the users of cred_guard_mutex to
>>>>> exec_update_mutex one by one. This lets us move forward while still
>>>>> being careful and not introducing any regressions.
>>>>>
>>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>>> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
>>>>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>>> Link: https://lore.kernel.org/lkml/[email protected]/
>>>>> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
>>>>> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
>>>>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>>>>> Signed-off-by: Bernd Edlinger <[email protected]>
>>>>> ---
>>>>> fs/exec.c | 17 ++++++++++++++---
>>>>> include/linux/binfmts.h | 8 +++++++-
>>>>> include/linux/sched/signal.h | 9 ++++++++-
>>>>> init/init_task.c | 1 +
>>>>> kernel/fork.c | 1 +
>>>>> 5 files changed, 31 insertions(+), 5 deletions(-)
>>>>>
>>>>> v3: this update fixes lock-order and adds an explicit data member in linux_binprm
>>>>>
>>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>>> index d820a72..11974a1 100644
>>>>> --- a/fs/exec.c
>>>>> +++ b/fs/exec.c
>>>>> @@ -1014,12 +1014,17 @@ static int exec_mmap(struct mm_struct *mm)
>>>>> {
>>>>> struct task_struct *tsk;
>>>>> struct mm_struct *old_mm, *active_mm;
>>>>> + int ret;
>>>>>
>>>>> /* Notify parent that we're no longer interested in the old VM */
>>>>> tsk = current;
>>>>> old_mm = current->mm;
>>>>> exec_mm_release(tsk, old_mm);
>>>>>
>>>>> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>>>>> + if (ret)
>>>>> + return ret;
>>>>> +
>>>>> if (old_mm) {
>>>>> sync_mm_rss(old_mm);
>>>>> /*
>>>>> @@ -1031,9 +1036,11 @@ static int exec_mmap(struct mm_struct *mm)
>>>>> down_read(&old_mm->mmap_sem);
>>>>> if (unlikely(old_mm->core_state)) {
>>>>> up_read(&old_mm->mmap_sem);
>>>>> + mutex_unlock(&tsk->signal->exec_update_mutex);
>>>>> return -EINTR;
>>>>> }
>>>>> }
>>>>> +
>>>>> task_lock(tsk);
>>>>> active_mm = tsk->active_mm;
>>>>> membarrier_exec_mmap(mm);
>>>>> @@ -1288,11 +1295,12 @@ int flush_old_exec(struct linux_binprm * bprm)
>>>>> goto out;
>>>>>
>>>>> /*
>>>>> - * After clearing bprm->mm (to mark that current is using the
>>>>> - * prepared mm now), we have nothing left of the original
>>>>> + * After setting bprm->called_exec_mmap (to mark that current is
>>>>> + * using the prepared mm now), we have nothing left of the original
>>>>> * process. If anything from here on returns an error, the check
>>>>> * in search_binary_handler() will SEGV current.
>>>>> */
>>>>> + bprm->called_exec_mmap = 1;
>>>>
>>>> The two lines below form an inseparable pair:
>>>>
>>>> exec_mmap(bprm->mm);
>>>> bprm->called_exec_mmap = 1;
>>>>
>>>> Why not move the assignment into exec_mmap(), so that nobody can insert
>>>> anything between them?
>>>>
>>>
>>> Hmm, could be done, but then I would probably need a different name than
>>> "called_exec_mmap".
>>>
>>> How about adding a nice function comment to exec_mmap that calls out the
>>> changed behaviour that the exec_update_mutex is taken unless the function
>>> fails?
>>
>> Not sure I understand correctly.
>>
>> Could you post this as a small patch hunk (on top of anything you want)?
>>
>
> I was thinking of something like that:
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1010,6 +1010,11 @@ ssize_t read_code(struct file *file, unsigned long addr,
> }
> EXPORT_SYMBOL(read_code);
>
> +/*
> + * Maps the mm_struct mm into the current task struct.
> + * On success, this function returns with the mutex
> + * exec_update_mutex locked.
> + */
Looks OK to me.
> static int exec_mmap(struct mm_struct *mm)
> {
> struct task_struct *tsk;
>
>
>>> Bernd.
>>>
>>>
>>>>> bprm->mm = NULL;
>>>>>
>>>>> #ifdef CONFIG_POSIX_TIMERS
>>>>> @@ -1438,6 +1446,8 @@ static void free_bprm(struct linux_binprm *bprm)
>>>>> {
>>>>> free_arg_pages(bprm);
>>>>> if (bprm->cred) {
>>>>> + if (bprm->called_exec_mmap)
>>>>> + mutex_unlock(&current->signal->exec_update_mutex);
>>>>> mutex_unlock(&current->signal->cred_guard_mutex);
>>>>> abort_creds(bprm->cred);
>>>>> }
>>>>> @@ -1487,6 +1497,7 @@ void install_exec_creds(struct linux_binprm *bprm)
>>>>> * credentials; any time after this it may be unlocked.
>>>>> */
>>>>> security_bprm_committed_creds(bprm);
>>>>> + mutex_unlock(&current->signal->exec_update_mutex);
>>>>> mutex_unlock(&current->signal->cred_guard_mutex);
>>>>> }
>>>>> EXPORT_SYMBOL(install_exec_creds);
>>>>> @@ -1678,7 +1689,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>>>>>
>>>>> read_lock(&binfmt_lock);
>>>>> put_binfmt(fmt);
>>>>> - if (retval < 0 && !bprm->mm) {
>>>>> + if (retval < 0 && bprm->called_exec_mmap) {
>>>>> /* we got to flush_old_exec() and failed after it */
>>>>> read_unlock(&binfmt_lock);
>>>>> force_sigsegv(SIGSEGV);
>>>>> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
>>>>> index b40fc63..a345d9f 100644
>>>>> --- a/include/linux/binfmts.h
>>>>> +++ b/include/linux/binfmts.h
>>>>> @@ -44,7 +44,13 @@ struct linux_binprm {
>>>>> * exec has happened. Used to sanitize execution environment
>>>>> * and to set AT_SECURE auxv for glibc.
>>>>> */
>>>>> - secureexec:1;
>>>>> + secureexec:1,
>>>>> + /*
>>>>> + * Set by flush_old_exec, when exec_mmap has been called.
>>>>> + * This is past the point of no return, when the
>>>>> + * exec_update_mutex has been taken.
>>>>> + */
>>>>> + called_exec_mmap:1;
>>>>> #ifdef __alpha__
>>>>> unsigned int taso:1;
>>>>> #endif
>>>>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>>>>> index 8805025..a29df79 100644
>>>>> --- a/include/linux/sched/signal.h
>>>>> +++ b/include/linux/sched/signal.h
>>>>> @@ -224,7 +224,14 @@ struct signal_struct {
>>>>>
>>>>> struct mutex cred_guard_mutex; /* guard against foreign influences on
>>>>> * credential calculations
>>>>> - * (notably. ptrace) */
>>>>> + * (notably. ptrace)
>>>>> + * Deprecated do not use in new code.
>>>>> + * Use exec_update_mutex instead.
>>>>> + */
>>>>> + struct mutex exec_update_mutex; /* Held while task_struct is being
>>>>> + * updated during exec, and may have
>>>>> + * inconsistent permissions.
>>>>> + */
>>>>> } __randomize_layout;
>>>>>
>>>>> /*
>>>>> diff --git a/init/init_task.c b/init/init_task.c
>>>>> index 9e5cbe5..bd403ed 100644
>>>>> --- a/init/init_task.c
>>>>> +++ b/init/init_task.c
>>>>> @@ -26,6 +26,7 @@
>>>>> .multiprocess = HLIST_HEAD_INIT,
>>>>> .rlim = INIT_RLIMITS,
>>>>> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
>>>>> + .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
>>>>> #ifdef CONFIG_POSIX_TIMERS
>>>>> .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
>>>>> .cputimer = {
>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>> index 8642530..036b692 100644
>>>>> --- a/kernel/fork.c
>>>>> +++ b/kernel/fork.c
>>>>> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>>>>> sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>>>>>
>>>>> mutex_init(&sig->cred_guard_mutex);
>>>>> + mutex_init(&sig->exec_update_mutex);
>>>>>
>>>>> return 0;
>>>>> }
>>>>>
>>>>
>>
On 3/19/20 8:13 AM, Kirill Tkhai wrote:
> On 18.03.2020 23:06, Bernd Edlinger wrote:
>>
>> I was thinking of something like that:
>>
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1010,6 +1010,11 @@ ssize_t read_code(struct file *file, unsigned long addr,
>> }
>> EXPORT_SYMBOL(read_code);
>>
>> +/*
>> + * Maps the mm_struct mm into the current task struct.
>> + * On success, this function returns with the mutex
>> + * exec_update_mutex locked.
>> + */
>
> Looks OK to me.
>
Cool, yeah, then I will post an updated patch in a moment.
Thanks
Bernd.
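The convention agreed on above (exec_mmap returns with exec_update_mutex
held on success, and called_exec_mmap tells the cleanup path whether to
unlock) can be illustrated outside the kernel. This is only a minimal
single-threaded sketch and not the kernel code: struct model_mutex,
struct model_bprm and all model_* functions are hypothetical stand-ins
invented for the illustration, and the core_dump_pending parameter merely
mimics the old_mm->core_state failure path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical single-threaded stand-in for a kernel mutex. */
struct model_mutex {
	bool held;
};

static int model_lock(struct model_mutex *m)
{
	if (m->held)
		return -EINTR;	/* stands in for a killed waiter */
	m->held = true;
	return 0;
}

static void model_unlock(struct model_mutex *m)
{
	assert(m->held);	/* unlocking a free mutex is a bug */
	m->held = false;
}

/* Hypothetical stand-in for the relevant part of struct linux_binprm. */
struct model_bprm {
	struct model_mutex *exec_update_mutex;
	bool called_exec_mmap;
};

/* Like exec_mmap(): returns 0 with the mutex held, or an error without it. */
static int model_exec_mmap(struct model_mutex *m, bool core_dump_pending)
{
	int ret = model_lock(m);

	if (ret)
		return ret;
	if (core_dump_pending) {
		model_unlock(m);	/* failure path gives the lock back */
		return -EINTR;
	}
	return 0;
}

/* Like the flush_old_exec() step: record success right after the call. */
static int model_flush_old_exec(struct model_bprm *bprm, bool core_dump_pending)
{
	int ret = model_exec_mmap(bprm->exec_update_mutex, core_dump_pending);

	if (ret)
		return ret;
	bprm->called_exec_mmap = true;
	return 0;
}

/* Like free_bprm(): unlock only if exec_mmap really took the mutex. */
static void model_free_bprm(struct model_bprm *bprm)
{
	if (bprm->called_exec_mmap)
		model_unlock(bprm->exec_update_mutex);
}
```

The point of the flag is that the cleanup path cannot tell from bprm->mm
alone whether the mutex is held, so the unlock is keyed on an explicit
member that is only set after the lock has been taken successfully.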
The cred_guard_mutex is problematic. The cred_guard_mutex is held
over the userspace accesses as the arguments from userspace are read.
The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
threads are killed. The cred_guard_mutex is held over
"put_user(0, tsk->clear_child_tid)" in exit_mm().
Any of those can result in deadlock, as the cred_guard_mutex is held
over a possibly indefinite wait for userspace.
Add exec_update_mutex, which is held only while exec updates the
process with the new contents of exec, for use by code that must not
be confused by exec changing the mm and the cred in ways that cannot
happen during ordinary execution of a process.
The plan is to switch the users of cred_guard_mutex to
exec_update_mutex one by one. This lets us move forward while still
being careful and not introducing any regressions.
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Signed-off-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Bernd Edlinger <[email protected]>
---
fs/exec.c | 22 +++++++++++++++++++---
include/linux/binfmts.h | 8 +++++++-
include/linux/sched/signal.h | 9 ++++++++-
init/init_task.c | 1 +
kernel/fork.c | 1 +
5 files changed, 36 insertions(+), 5 deletions(-)
v3: this update fixes lock-order and adds an explicit data member in linux_binprm
v4: add a function comment to exec_mmap
diff --git a/fs/exec.c b/fs/exec.c
index d820a72..0e46ec5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1010,16 +1010,26 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
}
EXPORT_SYMBOL(read_code);
+/*
+ * Maps the mm_struct mm into the current task struct.
+ * On success, this function returns with the mutex
+ * exec_update_mutex locked.
+ */
static int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
struct mm_struct *old_mm, *active_mm;
+ int ret;
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;
exec_mm_release(tsk, old_mm);
+ ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
+ if (ret)
+ return ret;
+
if (old_mm) {
sync_mm_rss(old_mm);
/*
@@ -1031,9 +1041,11 @@ static int exec_mmap(struct mm_struct *mm)
down_read(&old_mm->mmap_sem);
if (unlikely(old_mm->core_state)) {
up_read(&old_mm->mmap_sem);
+ mutex_unlock(&tsk->signal->exec_update_mutex);
return -EINTR;
}
}
+
task_lock(tsk);
active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
@@ -1288,11 +1300,12 @@ int flush_old_exec(struct linux_binprm * bprm)
goto out;
/*
- * After clearing bprm->mm (to mark that current is using the
- * prepared mm now), we have nothing left of the original
+ * After setting bprm->called_exec_mmap (to mark that current is
+ * using the prepared mm now), we have nothing left of the original
* process. If anything from here on returns an error, the check
* in search_binary_handler() will SEGV current.
*/
+ bprm->called_exec_mmap = 1;
bprm->mm = NULL;
#ifdef CONFIG_POSIX_TIMERS
@@ -1438,6 +1451,8 @@ static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
+ if (bprm->called_exec_mmap)
+ mutex_unlock(&current->signal->exec_update_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
@@ -1487,6 +1502,7 @@ void install_exec_creds(struct linux_binprm *bprm)
* credentials; any time after this it may be unlocked.
*/
security_bprm_committed_creds(bprm);
+ mutex_unlock(&current->signal->exec_update_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
}
EXPORT_SYMBOL(install_exec_creds);
@@ -1678,7 +1694,7 @@ int search_binary_handler(struct linux_binprm *bprm)
read_lock(&binfmt_lock);
put_binfmt(fmt);
- if (retval < 0 && !bprm->mm) {
+ if (retval < 0 && bprm->called_exec_mmap) {
/* we got to flush_old_exec() and failed after it */
read_unlock(&binfmt_lock);
force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b40fc63..a345d9f 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -44,7 +44,13 @@ struct linux_binprm {
* exec has happened. Used to sanitize execution environment
* and to set AT_SECURE auxv for glibc.
*/
- secureexec:1;
+ secureexec:1,
+ /*
+ * Set by flush_old_exec, when exec_mmap has been called.
+ * This is past the point of no return, when the
+ * exec_update_mutex has been taken.
+ */
+ called_exec_mmap:1;
#ifdef __alpha__
unsigned int taso:1;
#endif
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 8805025..a29df79 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -224,7 +224,14 @@ struct signal_struct {
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
- * (notably. ptrace) */
+ * (notably. ptrace)
+ * Deprecated do not use in new code.
+ * Use exec_update_mutex instead.
+ */
+ struct mutex exec_update_mutex; /* Held while task_struct is being
+ * updated during exec, and may have
+ * inconsistent permissions.
+ */
} __randomize_layout;
/*
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5..bd403ed 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -26,6 +26,7 @@
.multiprocess = HLIST_HEAD_INIT,
.rlim = INIT_RLIMITS,
.cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
+ .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
#ifdef CONFIG_POSIX_TIMERS
.posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
.cputimer = {
diff --git a/kernel/fork.c b/kernel/fork.c
index 8642530..036b692 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
mutex_init(&sig->cred_guard_mutex);
+ mutex_init(&sig->exec_update_mutex);
return 0;
}
--
1.9.1
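For readers following the thread, the reasoning behind the second mutex
can be written down as a tiny model. This is only a sketch under stated
assumptions, not kernel code: the three phases, the enum and all function
names are hypothetical, and the model assumes (as the series intends) that
cred_guard_mutex is held for the whole exec while exec_update_mutex is
held only during the final update step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical phases of an execve() in the tracee, in order. */
enum exec_phase {
	EXEC_NONE,		/* no exec in progress */
	EXEC_KILL_THREADS,	/* waiting for sibling threads to exit */
	EXEC_UPDATE,		/* installing the new mm and credentials */
};

/* In this model, cred_guard_mutex is held for the whole exec. */
static bool cred_guard_held(enum exec_phase p)
{
	return p != EXEC_NONE;
}

/* exec_update_mutex is held only while the task is actually updated. */
static bool exec_update_held(enum exec_phase p)
{
	return p == EXEC_UPDATE;
}

/*
 * Would a tracer calling mm_access() on this task have to wait?
 * use_exec_update selects which lock mm_access() takes in the model.
 */
static bool mm_access_blocks(enum exec_phase p, bool use_exec_update)
{
	return use_exec_update ? exec_update_held(p) : cred_guard_held(p);
}

/*
 * The exec can only leave EXEC_KILL_THREADS once the tracer has handled
 * the exit events, so a tracer that blocks in that phase never wakes up.
 */
static bool deadlocks(bool use_exec_update)
{
	/* With no exec in progress, neither lock can block the tracer. */
	assert(!mm_access_blocks(EXEC_NONE, use_exec_update));
	return mm_access_blocks(EXEC_KILL_THREADS, use_exec_update);
}
```

In this model deadlocks(false) is true and deadlocks(true) is false: the
tracer may still have to wait during EXEC_UPDATE, but that phase does not
depend on the tracer making progress.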
Ah, sorry, this is actually v4 5/5.
Should I send a new version or can you handle it?
On 3/19/20 10:11 AM, Bernd Edlinger wrote:
> The cred_guard_mutex is problematic. The cred_guard_mutex is held
> over the userspace accesses as the arguments from userspace are read.
> The cred_guard_mutex is held over PTRACE_EVENT_EXIT as the other
> threads are killed. The cred_guard_mutex is held over
> "put_user(0, tsk->clear_child_tid)" in exit_mm().
>
> Any of those can result in deadlock, as the cred_guard_mutex is held
> over a possibly indefinite wait for userspace.
>
> Add exec_update_mutex, which is held only while exec updates the
> process with the new contents of exec, for use by code that must not
> be confused by exec changing the mm and the cred in ways that cannot
> happen during ordinary execution of a process.
>
> The plan is to switch the users of cred_guard_mutex to
> exec_update_mutex one by one. This lets us move forward while still
> being careful and not introducing any regressions.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
> Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> Signed-off-by: Bernd Edlinger <[email protected]>
> ---
> fs/exec.c | 22 +++++++++++++++++++---
> include/linux/binfmts.h | 8 +++++++-
> include/linux/sched/signal.h | 9 ++++++++-
> init/init_task.c | 1 +
> kernel/fork.c | 1 +
> 5 files changed, 36 insertions(+), 5 deletions(-)
>
> v3: this update fixes lock-order and adds an explicit data member in linux_binprm
> v4: add a function comment to exec_mmap
>
> diff --git a/fs/exec.c b/fs/exec.c
> index d820a72..0e46ec5 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1010,16 +1010,26 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len)
> }
> EXPORT_SYMBOL(read_code);
>
> +/*
> + * Maps the mm_struct mm into the current task struct.
> + * On success, this function returns with the mutex
> + * exec_update_mutex locked.
> + */
> static int exec_mmap(struct mm_struct *mm)
> {
> struct task_struct *tsk;
> struct mm_struct *old_mm, *active_mm;
> + int ret;
>
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> old_mm = current->mm;
> exec_mm_release(tsk, old_mm);
>
> + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> + if (ret)
> + return ret;
> +
> if (old_mm) {
> sync_mm_rss(old_mm);
> /*
> @@ -1031,9 +1041,11 @@ static int exec_mmap(struct mm_struct *mm)
> down_read(&old_mm->mmap_sem);
> if (unlikely(old_mm->core_state)) {
> up_read(&old_mm->mmap_sem);
> + mutex_unlock(&tsk->signal->exec_update_mutex);
> return -EINTR;
> }
> }
> +
> task_lock(tsk);
> active_mm = tsk->active_mm;
> membarrier_exec_mmap(mm);
> @@ -1288,11 +1300,12 @@ int flush_old_exec(struct linux_binprm * bprm)
> goto out;
>
> /*
> - * After clearing bprm->mm (to mark that current is using the
> - * prepared mm now), we have nothing left of the original
> + * After setting bprm->called_exec_mmap (to mark that current is
> + * using the prepared mm now), we have nothing left of the original
> * process. If anything from here on returns an error, the check
> * in search_binary_handler() will SEGV current.
> */
> + bprm->called_exec_mmap = 1;
> bprm->mm = NULL;
>
> #ifdef CONFIG_POSIX_TIMERS
> @@ -1438,6 +1451,8 @@ static void free_bprm(struct linux_binprm *bprm)
> {
> free_arg_pages(bprm);
> if (bprm->cred) {
> + if (bprm->called_exec_mmap)
> + mutex_unlock(&current->signal->exec_update_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> abort_creds(bprm->cred);
> }
> @@ -1487,6 +1502,7 @@ void install_exec_creds(struct linux_binprm *bprm)
> * credentials; any time after this it may be unlocked.
> */
> security_bprm_committed_creds(bprm);
> + mutex_unlock(&current->signal->exec_update_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> }
> EXPORT_SYMBOL(install_exec_creds);
> @@ -1678,7 +1694,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>
> read_lock(&binfmt_lock);
> put_binfmt(fmt);
> - if (retval < 0 && !bprm->mm) {
> + if (retval < 0 && bprm->called_exec_mmap) {
> /* we got to flush_old_exec() and failed after it */
> read_unlock(&binfmt_lock);
> force_sigsegv(SIGSEGV);
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index b40fc63..a345d9f 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -44,7 +44,13 @@ struct linux_binprm {
> * exec has happened. Used to sanitize execution environment
> * and to set AT_SECURE auxv for glibc.
> */
> - secureexec:1;
> + secureexec:1,
> + /*
> + * Set by flush_old_exec, when exec_mmap has been called.
> + * This is past the point of no return, when the
> + * exec_update_mutex has been taken.
> + */
> + called_exec_mmap:1;
> #ifdef __alpha__
> unsigned int taso:1;
> #endif
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 8805025..a29df79 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -224,7 +224,14 @@ struct signal_struct {
>
> struct mutex cred_guard_mutex; /* guard against foreign influences on
> * credential calculations
> - * (notably. ptrace) */
> + * (notably. ptrace)
> + * Deprecated do not use in new code.
> + * Use exec_update_mutex instead.
> + */
> + struct mutex exec_update_mutex; /* Held while task_struct is being
> + * updated during exec, and may have
> + * inconsistent permissions.
> + */
> } __randomize_layout;
>
> /*
> diff --git a/init/init_task.c b/init/init_task.c
> index 9e5cbe5..bd403ed 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -26,6 +26,7 @@
> .multiprocess = HLIST_HEAD_INIT,
> .rlim = INIT_RLIMITS,
> .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex),
> + .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex),
> #ifdef CONFIG_POSIX_TIMERS
> .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers),
> .cputimer = {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8642530..036b692 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
> sig->oom_score_adj_min = current->signal->oom_score_adj_min;
>
> mutex_init(&sig->cred_guard_mutex);
> + mutex_init(&sig->exec_update_mutex);
>
> return 0;
> }
>
On Thu, Mar 19, 2020 at 10:13:20AM +0100, Bernd Edlinger wrote:
> Ah, sorry, this is actually v4 5/5.
> Should I send a new version or can you handle it?
This thread is a total crazy mess of different versions.
I know I can't unwind any of this, so I _STRONGLY_ suggest resending the
whole series, properly versioned, as a new thread.
Would you want to try to pick out the proper patches from this pile?
thanks,
greg k-h
On 3/19/20 10:19 AM, Greg Kroah-Hartman wrote:
> On Thu, Mar 19, 2020 at 10:13:20AM +0100, Bernd Edlinger wrote:
>> Ah, sorry, this is actually v4 5/5.
>> Should I send a new version or can you handle it?
>
> This thread is a total crazy mess of different versions.
>
> I know I can't unwind any of this, so I _STRONGLY_ suggest resending the
> whole series, properly versioned, as a new thread.
>
> Would you want to try to pick out the proper patches from this pile?
>
> thanks,
>
> greg k-h
>
Yes, thanks, good suggestion.
I will do that in the evening.
On 3/19/20 10:19 AM, Greg Kroah-Hartman wrote:
> On Thu, Mar 19, 2020 at 10:13:20AM +0100, Bernd Edlinger wrote:
>> Ah, sorry, this is actually v4 5/5.
>> Should I send a new version or can you handle it?
>
> This thread is a total crazy mess of different versions.
>
> I know I can't unwind any of this, so I _STRONGLY_ suggest resending the
> whole series, properly versioned, as a new thread.
>
> Would you want to try to pick out the proper patches from this pile?
>
> thanks,
>
> greg k-h
>
Okay, meanwhile I collected everything I could find from this thread
and sent it again:
[PATCH v6 00/16] Infrastructure to allow fixing exec deadlocks
https://lore.kernel.org/lkml/AM6PR03MB5170B2F5BE24A28980D05780E4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 01/16] exec: Only compute current once in flush_old_exec
https://lore.kernel.org/lkml/AM6PR03MB5170FC93B158EB8179F91D6AE4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 02/16] exec: Factor unshare_sighand out of de_thread and call it separately
https://lore.kernel.org/lkml/AM6PR03MB51708AECEA6E05CAE2FDC166E4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 03/16] exec: Move cleanup of posix timers on exec out of de_thread
https://lore.kernel.org/lkml/AM6PR03MB5170CCB8D8B36F6002446FBDE4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 04/16] exec: Move exec_mmap right after de_thread in flush_old_exec
https://lore.kernel.org/lkml/AM6PR03MB5170FDB2C9B5225224B76398E4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 05/16] exec: Add exec_update_mutex to replace cred_guard_mutex
https://lore.kernel.org/lkml/AM6PR03MB5170739C1B582B37E637279EE4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 06/16] exec: Fix a deadlock in strace
https://lore.kernel.org/lkml/AM6PR03MB51709A321EBA829CC36EE1F8E4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 07/16] selftests/ptrace: add test cases for dead-locks
https://lore.kernel.org/lkml/AM6PR03MB517022530A9BECDBCAADC8D2E4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 08/16] mm: docs: Fix a comment in process_vm_rw_core
https://lore.kernel.org/lkml/AM6PR03MB517027F6ACBB4CF2D9BF014CE4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 09/16] kernel: doc: remove outdated comment cred.c
https://lore.kernel.org/lkml/AM6PR03MB51705CEFAB7D02E6EA6CEBA6E4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 10/16] kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB5170FFDE1D7BF09DD2663EDEE4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 11/16] proc: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB5170C4D177DD76E3C65E8033E4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 12/16] proc: io_accounting: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB51701CB541B08F21D56DCAC9E4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 13/16] perf: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/AM6PR03MB51704A188C3A1FA02B76B9EFE4F50@AM6PR03MB5170.eurprd03.prod.outlook.com/
[PATCH v6 14/16] pidfd: Use new infrastructure to fix deadlocks in execve
https://lore.kernel.org/lkml/[email protected]/
[PATCH v6 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
https://lore.kernel.org/lkml/[email protected]/
[PATCH v6 16/16] doc: Update documentation of ->exec_*_mutex
https://lore.kernel.org/lkml/[email protected]/
Each patch in this series builds on the previous one and is independent of the patches that follow.
So if one or more of them turn out to be controversial, the preceding patches are still an
improvement, especially [PATCH v6 06/16], which fixes the deadlock in strace and thereby the most
important tracing deadlocks.
Thanks
Bernd.