2020-04-02 19:36:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Wed, Apr 1, 2020 at 9:16 AM Eric W. Biederman <[email protected]> wrote:
>
> The work on exec starts solving a long standing issue with exec that
> it takes mutexes of blocking userspace applications, which makes exec
> extremely deadlock prone. For the moment this adds a second mutex
> with a narrower scope that handles all of the easy cases. Which
> makes the tricky cases easy to spot. With a little luck the code to
> solve those deadlocks will be ready by next merge window.

So this worries me.

I've pulled it, but I'm not entirely happy about some of it.

For example, the "rationale" for some of the changes is

This should be safe, as the credentials are only used for reading.

which is just nonsensical. "Only used for reading" is immaterial, and
there's no explanation for why that would matter at all. Most of the
credentials are ever used for reading, and the worry about execve() is
that the credentials can change, and people race with them and use the
new 'suid' credentials and allow things that shouldn't be allowed. So
the rationale makes no sense at all.

Btw, if "this only takes it for reading" is such a big deal, why is
that mutex not an rw-semaphore?

The pidfd change at least has a rationale that makes sense:

This should be safe, as the credentials do not change
before exec_update_mutex is locked. Therefore whatever
file access is possible with holding the cred_guard_mutex
here is also possbile with the exec_update_mutex.

so now you at least have an explanation of why that particular lock
makes sense and works and is equivalent.

It's still not a *great* explanation for why it would be equivalent,
because cred_guard_mutex ends up not just guarding the write of the
credentials, but makes it atomic wrt *other* data. That's the same
problem as "I'm only reading".

Locking is not about *one* value being consistent - if that was the
case, then you could just do a "get refcount on the credentials, now I
have a stable set of creds I can read forever". No lock needed.

So locking is about *multiple* values being consistent, which is why
that "I'm only reading" is not an explanation for why you can change
the lock.

It's also why that second one is questionable: it's a _better_ attempt
at explaining things, but the point is really that cred_guard_mutex
protects *other* things too.

A real explanation would have absolutely *nothing* to do with
"reading" or "same value of credentials". Both of those are entirely
immaterial, since - as mentioned - you could just get a snapshot of
the creds instead.

A real explanation would be about how there is no other state that
cred_guard_mutex protects that matters.

See what I'm saying?

This code is subtle as h*ll, and we've had bugs in it, and it has a
series of tens of patches to fix them. But that also means that the
explanations for the patches should take the subtleties into account,
and not gloss over them with things like this.

Ok, enough about the explanations. The actual _code_ is kind of odd
too. For example, you have that "bprm->called_exec_mmap" flag to say
"I've taken the exec_update_mutex, and need to drop it".

But that flag is not set anywhere _near_ actually taking the lock.
Sure, it is taken after exec_mmap() returns successfully, and that
makes sense from a naming standpoint, but wouldn't it have been a
_lot_ more obvious if you just set the flag when you took that lock,
and instead of naming it by some magical code sequence, you named it
for what it does?

Again, this looks all technically correct, but it's written in a way
that doesn't seem to make a lot of sense. Why is the code literally
written with a magical assumption of "calling exec_mmap takes this
lock, so if the flag named called_exec_mmap is set, I have to free
that lock that is not named that at all".

I hate conditional locking in the first place, but if it has to exist,
then the conditional should be named after the lock, and the lock
getting should be very very explicitly tied to it.

Wouldn't it have been much clearer if you called that flag
"exec_update_mutex_taken", and set it WHEN YOU TAKE IT?

In fact, then you could drop the

mutex_unlock(&tsk->signal->exec_update_mutex);

in the error case of exec_mmap(), because now the error handling in
free_bprm() would do the cleanup automatically.

See what I'm saying? You've made the locking more complex and subtle
than it needed to be. And since the whole point of the *new* lock is
that it should replace an old lock that was really complex and subtle,
that's a problem.

Linus


2020-04-02 19:48:40

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/2/20 9:04 PM, Linus Torvalds wrote:
> On Wed, Apr 1, 2020 at 9:16 AM Eric W. Biederman <[email protected]> wrote:
>>
>> The work on exec starts solving a long standing issue with exec that
>> it takes mutexes of blocking userspace applications, which makes exec
>> extremely deadlock prone. For the moment this adds a second mutex
>> with a narrower scope that handles all of the easy cases. Which
>> makes the tricky cases easy to spot. With a little luck the code to
>> solve those deadlocks will be ready by next merge window.
>
> So this worries me.
>
> I've pulled it, but I'm not entirely happy about some of it.
>
> For example, the "rationale" for some of the changes is
>
> This should be safe, as the credentials are only used for reading.
>

What I meant, but did probably not find a good way to say it.

There are places where credentials of other threads are written,
e.g. set no new privs on a thread, that already started to execve
a setuid process.

You always have the right to change the credentials of the own thread,
you dont need a mutex for it.

This is at least what is my impression how the existing mutexes are used,
a mutex called "cred_guard_mutex" is a not very good self explaining name,
in my opinion, it is totally unclear what it does "guard", and why.


Bernd.

> which is just nonsensical. "Only used for reading" is immaterial, and
> there's no explanation for why that would matter at all. Most of the
> credentials are ever used for reading, and the worry about execve() is
> that the credentials can change, and people race with them and use the
> new 'suid' credentials and allow things that shouldn't be allowed. So
> the rationale makes no sense at all.
>
> Btw, if "this only takes it for reading" is such a big deal, why is
> that mutex not an rw-semaphore?
>
> The pidfd change at least has a rationale that makes sense:
>
> This should be safe, as the credentials do not change
> before exec_update_mutex is locked. Therefore whatever
> file access is possible with holding the cred_guard_mutex
> here is also possbile with the exec_update_mutex.
>
> so now you at least have an explanation of why that particular lock
> makes sense and works and is equivalent.
>
> It's still not a *great* explanation for why it would be equivalent,
> because cred_guard_mutex ends up not just guarding the write of the
> credentials, but makes it atomic wrt *other* data. That's the same
> problem as "I'm only reading".
>
> Locking is not about *one* value being consistent - if that was the
> case, then you could just do a "get refcount on the credentials, now I
> have a stable set of creds I can read forever". No lock needed.
>
> So locking is about *multiple* values being consistent, which is why
> that "I'm only reading" is not an explanation for why you can change
> the lock.
>
> It's also why that second one is questionable: it's a _better_ attempt
> at explaining things, but the point is really that cred_guard_mutex
> protects *other* things too.
>
> A real explanation would have absolutely *nothing* to do with
> "reading" or "same value of credentials". Both of those are entirely
> immaterial, since - as mentioned - you could just get a snapshot of
> the creds instead.
>
> A real explanation would be about how there is no other state that
> cred_guard_mutex protects that matters.
>
> See what I'm saying?
>
> This code is subtle as h*ll, and we've had bugs in it, and it has a
> series of tens of patches to fix them. But that also means that the
> explanations for the patches should take the subtleties into account,
> and not gloss over them with things like this.
>
> Ok, enough about the explanations. The actual _code_ is kind of odd
> too. For example, you have that "bprm->called_exec_mmap" flag to say
> "I've taken the exec_update_mutex, and need to drop it".
>
> But that flag is not set anywhere _near_ actually taking the lock.
> Sure, it is taken after exec_mmap() returns successfully, and that
> makes sense from a naming standpoint, but wouldn't it have been a
> _lot_ more obvious if you just set the flag when you took that lock,
> and instead of naming it by some magical code sequence, you named it
> for what it does?
>
> Again, this looks all technically correct, but it's written in a way
> that doesn't seem to make a lot of sense. Why is the code literally
> written with a magical assumption of "calling exec_mmap takes this
> lock, so if the flag named called_exec_mmap is set, I have to free
> that lock that is not named that at all".
>
> I hate conditional locking in the first place, but if it has to exist,
> then the conditional should be named after the lock, and the lock
> getting should be very very explicitly tied to it.
>
> Wouldn't it have been much clearer if you called that flag
> "exec_update_mutex_taken", and set it WHEN YOU TAKE IT?
>
> In fact, then you could drop the
>
> mutex_unlock(&tsk->signal->exec_update_mutex);
>
> in the error case of exec_mmap(), because now the error handling in
> free_bprm() would do the cleanup automatically.
>
> See what I'm saying? You've made the locking more complex and subtle
> than it needed to be. And since the whole point of the *new* lock is
> that it should replace an old lock that was really complex and subtle,
> that's a problem.
>
> Linus
>

2020-04-02 19:54:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 2, 2020 at 12:31 PM Bernd Edlinger
<[email protected]> wrote:
>
> This is at least what is my impression how the existing mutexes are used,
> a mutex called "cred_guard_mutex" is a not very good self explaining name,
> in my opinion, it is totally unclear what it does "guard", and why.

Oh, I absolutely agree that cred_guard_mutex is a horrible lock.

It actually _used_ to be a lot more understandable, and the name used
to make more sense in the context it was used.

See commit

a2a8474c3fff ("exec: do not sleep in TASK_TRACED under ->cred_guard_mutex")

for when it changed from "somewhat understandable" to "really hard to follow".

Don't get me wrong - that commit has a very good reason for it, but it
does make the locking really hard to understand.

It all used to be in one function - do_execve() - and it was holding
the lock over a fairly obvious range, starting at

bprm->cred = prepare_exec_creds();

and ending at basically "we're done with execve()".

So basically, cred_guard_mutex ends up being the thing that is held
all the way from the "before execve looks at the old creds" to "execve
is done, and has changed the creds".

The reason it's needed is exactly that there are some nasty situations
where execve() itself does things with creds to determine that the new
creds are ok. And it uses the old creds to do that, but it also uses
the task->flags and task->ptrace.

So think of cred_guard_mutex as a lock around not just the creds, but
the combination of creds and the task flags/ptrace.

Anybody who changes the task ptrace setting needs to serialize with
execve(). Or anybody who tests for "dumpable()", for example.

If *all* you care about is just the creds, then you don't need it.
It's really only users that do more checks than just credentials.
"dumpable()" is I think the common one.

And that's why cred_guard_mutex has that big range - it starts when we
read the original creds (because it will use those creds to determine
how the *new* creds will affect dumpability etc), and it ends when it
has updated not only to the new creds, but it has set all those other
flags too.

So I'm not at all against splitting the lock up, and trying to make it
more directed and specific.

My complaints were about how the new lock wasn't much better. It was
still completely incomprehensible, the conditional unlocking was hard
to follow, and it really wasn't obvious that the converted users were
fine.

See?

Linus

2020-04-02 21:05:49

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/2/20 9:52 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 12:31 PM Bernd Edlinger
> <[email protected]> wrote:
>>
>> This is at least what is my impression how the existing mutexes are used,
>> a mutex called "cred_guard_mutex" is a not very good self explaining name,
>> in my opinion, it is totally unclear what it does "guard", and why.
>
> Oh, I absolutely agree that cred_guard_mutex is a horrible lock.
>
> It actually _used_ to be a lot more understandable, and the name used
> to make more sense in the context it was used.
>
> See commit
>
> a2a8474c3fff ("exec: do not sleep in TASK_TRACED under ->cred_guard_mutex")
>
> for when it changed from "somewhat understandable" to "really hard to follow".
>
> Don't get me wrong - that commit has a very good reason for it, but it
> does make the locking really hard to understand.
>
> It all used to be in one function - do_execve() - and it was holding
> the lock over a fairly obvious range, starting at
>
> bprm->cred = prepare_exec_creds();
>
> and ending at basically "we're done with execve()".
>
> So basically, cred_guard_mutex ends up being the thing that is held
> all the way from the "before execve looks at the old creds" to "execve
> is done, and has changed the creds".
>
> The reason it's needed is exactly that there are some nasty situations
> where execve() itself does things with creds to determine that the new
> creds are ok. And it uses the old creds to do that, but it also uses
> the task->flags and task->ptrace.
>
> So think of cred_guard_mutex as a lock around not just the creds, but
> the combination of creds and the task flags/ptrace.
>
> Anybody who changes the task ptrace setting needs to serialize with
> execve(). Or anybody who tests for "dumpable()", for example.
>
> If *all* you care about is just the creds, then you don't need it.
> It's really only users that do more checks than just credentials.
> "dumpable()" is I think the common one.
>
> And that's why cred_guard_mutex has that big range - it starts when we
> read the original creds (because it will use those creds to determine
> how the *new* creds will affect dumpability etc), and it ends when it
> has updated not only to the new creds, but it has set all those other
> flags too.
>
> So I'm not at all against splitting the lock up, and trying to make it
> more directed and specific.
>
> My complaints were about how the new lock wasn't much better. It was
> still completely incomprehensible, the conditional unlocking was hard
> to follow, and it really wasn't obvious that the converted users were
> fine.
>
> See?
>

Understand completely. The change is in a way mechanic, that is we
have the following sequence:

1 execve starts
|
| access args, may fault, deadlock in user mode fault handler
| de_thread, may block waiting for strace to call wait and so
|
| exec_mm_release, may also falut, deadlock un user mode fault handler
v
2 process update begins
|
| should not block, to our current knowledge (except when loading a nfs image probably ?)
| credentials may change at any time.
|
v
3 process update done
|
v
4 execve done


So we have functions that access the process memory map. they use the credentials
and need to access the correct image, not to reveal secrets of the new about to be
loaded image. they need the inner mutex from 2 .. 3

Also when you want to read credentials of another thread, it is probably better
to have a consistent state of the credentials, and no new privs for instance.
That also needs the inner mutex from 2 .. 3

And then we have things that change the credentials or no new privs, in general
all ptrace_attach and security modules are of that kind. They need the
mutex from 1 .. 4, but I want to change the name, in the two patches below, and
I want to break the dead-lock from ptrace in a API-incompatible way, but in a
very limited breaking change, that only breaks what is already broken.


There are two more patches, which might be of interest for you, just to
make the picture complete.
It is not clear if we go that way, or if Eric has a yet better idea.

[PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
https://www.spinics.net/lists/kernel/msg3459067.html

[PATCH v6 16/16] doc: Update documentation of ->exec_*_mutex
https://www.spinics.net/lists/kernel/msg3449539.html



Thanks
Bernd.



> Linus
>

2020-04-02 21:48:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <[email protected]> wrote:
>
> There are two more patches, which might be of interest for you, just to
> make the picture complete.
> It is not clear if we go that way, or if Eric has a yet better idea.
>
> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
> https://www.spinics.net/lists/kernel/msg3459067.html

There is no way I would ever take that patch.

The amount of confusion in that patch is not acceptable. Randomly
unlocking the new lock?

That code makes everything worse, it's completely incomprehensible,
the locking rules make no sense ahwt-so-ever.

I'm seriously starting to feel like I should not have pulled this
code, because the future looks _worse_ than what we used to have.

No. No no no. Eric, this is not an acceptable direction.

Linus

2020-04-02 23:03:41

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/2/20 11:46 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <[email protected]> wrote:
>>
>> There are two more patches, which might be of interest for you, just to
>> make the picture complete.
>> It is not clear if we go that way, or if Eric has a yet better idea.
>>
>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>> https://www.spinics.net/lists/kernel/msg3459067.html
>
> There is no way I would ever take that patch.
>
> The amount of confusion in that patch is not acceptable. Randomly
> unlocking the new lock?
>
> That code makes everything worse, it's completely incomprehensible,
> the locking rules make no sense ahwt-so-ever.
>
> I'm seriously starting to feel like I should not have pulled this
> code, because the future looks _worse_ than what we used to have.
>
> No. No no no. Eric, this is not an acceptable direction.
>

Seriously, Linus,

nobody is forcing anything on you.

That would be quite a stupid idea (to try to force you per e-mail :-) )

The future is not yet written.

I think Eric has an alternative idea for the next step (he did not tell
me more but I am curious). Maybe that will be better, maybe not. And of
course I do not try to win a battle here, and I am willing to take advice.
So I am sure, we can work together to understand the problem better when
we take the time to analyze the problem better (I have not
yet read everything in your last mail completely, and followed
every link you have given, so what I write is just preliminary.

I just would like to know one thing,
how did you like my "big fat warning" comments?

If it turns out to be the wrong direction,
is it too late to turn back now?



Thanks
Bernd.


> Linus
>

2020-04-02 23:06:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

Linus Torvalds <[email protected]> writes:

> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <[email protected]> wrote:
>>
>> There are two more patches, which might be of interest for you, just to
>> make the picture complete.
>> It is not clear if we go that way, or if Eric has a yet better idea.
>>
>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>> https://www.spinics.net/lists/kernel/msg3459067.html
>
> There is no way I would ever take that patch.
>
> The amount of confusion in that patch is not acceptable. Randomly
> unlocking the new lock?
>
> That code makes everything worse, it's completely incomprehensible,
> the locking rules make no sense ahwt-so-ever.
>
> I'm seriously starting to feel like I should not have pulled this
> code, because the future looks _worse_ than what we used to have.
>
> No. No no no. Eric, this is not an acceptable direction.

That is not the direction I intend to take either.

I was hoping I could put off replying to this thread for a bit because
I only managed to get 4 hours of sleep last night and I am not as alert
to technical details as I would like to be.

Long story short:

The exec_update_mutex is to be a strict subset of today's
cred_guard_mutex. I tried to copy cred_guard_mutex's unlock style so
that was obvious and that turns out was messier than I intended.

I thought the changes to the individual locking changes were
sufficiently unsubtle that they did not need my personal attention.
Especially as they are just a substitution of one lock for another
with a slightly smaller scope.

I started working on the the series of changes that reorganizes
the changes in exec.

It was reported that something had gone wrong with my introduction
of exec_update_mutex and I pulled it from linux-next.

By the time I was ready to start putting humpty dumpty back together
again Bernd had collected everything up and had it working. I had seen
that he had been given the feedback about better change descriptions.

I had looked at the code of his patches earlier and the basic changes
were trivial.

Since I thought I already knew what was in the patches and the worst
problem was the missing unlock of cred_guard_mutex, and I know Bernd's
patches had been tested I applied them. I missed that Bernd had added
the exec_mmap_called flag into my patch. I thought he had only added
the missing unlock.

I spotted the weirdness in unlocking exec_update_mutex, and because it
does fix a real world deadlock with ptrace I did not back it out from my
tree.

I have been much laxer on the details than I like to be my apologies.

The plan is:
exec_udpate_mutex will cover just the subset of cred_guard_mutex
after the point of no return, and after we do any actions that
might block waiting for userspace to do anything.

So exec_update_mutex will just cover those things that exec
is updating, so if you want an atomic snapshot of them
plus the appropriate struct cred you can grab exec_update_mutex.

I added a new mutex instead of just fixing cred_guard_mutex so
that we can update or revert the individual code paths if it
is found that something is wrong.

The cred_guard_mutex also prevents other tasks from starting
to ptrace the task that is exec'ing, and other tasks from
setting no_new_privs on the task that is exec'ing.

My plan is to carefully refactor the code so it can perform
both the ptrace and no_new_privs checks after the point of
no return.

I have uncovered all kinds of surprises while working in that direction
and I realize it is going to take a very delicate and careful touch to
achieve my goal.

There are silly things like normal linux exec when you are ptraced and
exec changes the credentials the ordinary code will continue with the
old credentials, but the an LSM enabled your process is likely to be
killed instead.

There is the personal mind blowing scenario where selinux will increase
your credentials upon exec but if a magic directive is supplied in it's
rules will avoid setting AT_SECURE, so that userspace won't protect
itself from hostile takeover from the pre credential change environment.
Much to my surprise "noatsecure" is a known and documented feature of
selinux. I am not certain but I think I even spotted it in use on
production.

I will catch up on my sleep before I allow any more changes, and I will
see replacing the called_exec_mmap flag with something saner.

Eric

2020-04-02 23:43:12

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/3/20 1:01 AM, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
>> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <[email protected]> wrote:
>>>
>>> There are two more patches, which might be of interest for you, just to
>>> make the picture complete.
>>> It is not clear if we go that way, or if Eric has a yet better idea.
>>>
>>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>>> https://www.spinics.net/lists/kernel/msg3459067.html
>>
>> There is no way I would ever take that patch.
>>
>> The amount of confusion in that patch is not acceptable. Randomly
>> unlocking the new lock?
>>
>> That code makes everything worse, it's completely incomprehensible,
>> the locking rules make no sense ahwt-so-ever.
>>
>> I'm seriously starting to feel like I should not have pulled this
>> code, because the future looks _worse_ than what we used to have.
>>
>> No. No no no. Eric, this is not an acceptable direction.
>
> That is not the direction I intend to take either.
>
> I was hoping I could put off replying to this thread for a bit because
> I only managed to get 4 hours of sleep last night and I am not as alert
> to technical details as I would like to be.
>
> Long story short:
>
> The exec_update_mutex is to be a strict subset of today's
> cred_guard_mutex. I tried to copy cred_guard_mutex's unlock style so
> that was obvious and that turns out was messier than I intended.
>
> I thought the changes to the individual locking changes were
> sufficiently unsubtle that they did not need my personal attention.
> Especially as they are just a substitution of one lock for another
> with a slightly smaller scope.
>
> I started working on the the series of changes that reorganizes
> the changes in exec.
>
> It was reported that something had gone wrong with my introduction
> of exec_update_mutex and I pulled it from linux-next.
>
> By the time I was ready to start putting humpty dumpty back together
> again Bernd had collected everything up and had it working. I had seen
> that he had been given the feedback about better change descriptions.
>
> I had looked at the code of his patches earlier and the basic changes
> were trivial.
>
> Since I thought I already knew what was in the patches and the worst
> problem was the missing unlock of cred_guard_mutex, and I know Bernd's
> patches had been tested I applied them. I missed that Bernd had added
> the exec_mmap_called flag into my patch. I thought he had only added
> the missing unlock.
>

Hi Eric,

oh, sorry for that, that was requested in the peer review, I could not
get a patch approved that does not have such a boolean, that simplified
the error handling.

Actually I had sent you an e-mail with that patch 24H before I posted
the update, then Greg asked me to re-post the whole series, that
took at least another two days, so at that time I was seriously
concerned how you are doing, since I head nothing from you about the
updated patch with the exec_mmap_called.

Linus that is not the boolean I was talking in the other mail.
That boolean is called unsafe_execve_in_progres.

So, and now I also try to get some sleep....


Thanks
Bernd.

> I spotted the weirdness in unlocking exec_update_mutex, and because it
> does fix a real world deadlock with ptrace I did not back it out from my
> tree.
>
> I have been much laxer on the details than I like to be my apologies.
>
> The plan is:
> exec_udpate_mutex will cover just the subset of cred_guard_mutex
> after the point of no return, and after we do any actions that
> might block waiting for userspace to do anything.
>
> So exec_update_mutex will just cover those things that exec
> is updating, so if you want an atomic snapshot of them
> plus the appropriate struct cred you can grab exec_update_mutex.
>
> I added a new mutex instead of just fixing cred_guard_mutex so
> that we can update or revert the individual code paths if it
> is found that something is wrong.
>
> The cred_guard_mutex also prevents other tasks from starting
> to ptrace the task that is exec'ing, and other tasks from
> setting no_new_privs on the task that is exec'ing.
>
> My plan is to carefully refactor the code so it can perform
> both the ptrace and no_new_privs checks after the point of
> no return.
>
> I have uncovered all kinds of surprises while working in that direction
> and I realize it is going to take a very delicate and careful touch to
> achieve my goal.
>
> There are silly things like normal linux exec when you are ptraced and
> exec changes the credentials the ordinary code will continue with the
> old credentials, but the an LSM enabled your process is likely to be
> killed instead.
>
> There is the personal mind blowing scenario where selinux will increase
> your credentials upon exec but if a magic directive is supplied in it's
> rules will avoid setting AT_SECURE, so that userspace won't protect
> itself from hostile takeover from the pre credential change environment.
> Much to my surprise "noatsecure" is a known and documented feature of
> selinux. I am not certain but I think I even spotted it in use on
> production.
>
> I will catch up on my sleep before I allow any more changes, and I will
> see replacing the called_exec_mmap flag with something saner.
>
> Eric
>

2020-04-03 00:27:33

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/2/20 11:46 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <[email protected]> wrote:
>>
>> There are two more patches, which might be of interest for you, just to
>> make the picture complete.
>> It is not clear if we go that way, or if Eric has a yet better idea.
>>
>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>> https://www.spinics.net/lists/kernel/msg3459067.html
>
> There is no way I would ever take that patch.
>
> The amount of confusion in that patch is not acceptable. Randomly
> unlocking the new lock?
>
> That code makes everything worse, it's completely incomprehensible,
> the locking rules make no sense ahwt-so-ever.
>

Linus,

let me explain what the locking here does.

It is a kind of soft mutex, which is normally strong, so taken
from 1 .. 4. and nothing changes from how it was before.

But it can also be weak.

So if we detect that another thread is being ptraced, we drop
the lock, and keep the boolean set to true, which makes the ptrace_attach
acquire the lock, and the boolean is true, that make the
ptrace_attach return -EAGAIN. release the lock immediatly,
the deadlock is broken, the thread can handle the deadly signal
from de_thread, de_thread continues. And just
at the end of the execve, when the boolean has to be set
to false again, we have to lock the mutex, set the boolean to
false, and unlock the mutex. It is very important for the
correctness that the boolean is only changed when the mutex
is held.

Once again, please give Eric the time to catch up with his
sleep, that can be more serious as you would think to have
too less sleep. Then I am looking forward to see his idea,
usually that may be something worth do consider. But
we have all the time we want for that.


Thanks
Bernd.


> I'm seriously starting to feel like I should not have pulled this
> code, because the future looks _worse_ than what we used to have.
>
> No. No no no. Eric, this is not an acceptable direction.
>
> Linus
>

2020-04-03 00:29:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 2, 2020 at 4:42 PM Bernd Edlinger <[email protected]> wrote:
>
> Linus that is not the boolean I was talking in the other mail.
> That boolean is called unsafe_execve_in_progres.

Yeah, I tracked what you were trying to say - I did understand how
that patch worked.

I just absolutely and utterly hated it ;/

Linus

2020-04-03 00:29:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

Bernd Edlinger <[email protected]> writes:

> oh, sorry for that, that was requested in the peer review, I could not
> get a patch approved that does not have such a boolean, that simplified
> the error handling.

If you had included a note in your changlog when you respun my patch I
probably would have realized what you had done I would have spotted it
faster.

When I glanced at the patch quickly I thought you had just added the
missing unlock.

Eric

2020-04-03 00:30:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 2, 2020 at 4:04 PM Eric W. Biederman <[email protected]> wrote:
>
> That is not the direction I intend to take either.

Ahh, good. Because those kinds of "play games with dropping locks in
the middle based on conditionals" really have been horrible.

Yes, we've done it, and it's almost always been asource of truly subtle bugs.

> The exec_update_mutex is to be a strict subset of today's
> cred_guard_mutex. I tried to copy cred_guard_mutex's unlock style so
> that was obvious and that turns out was messier than I intended.

Yes. That is why I had no problem pulling that subset, and my worries
were mainly about the explanations and that flag use.

> Since I thought I already knew what was in the patches and the worst
> problem was the missing unlock of cred_guard_mutex, and I know Bernd's
> patches had been tested I applied them. I missed that Bernd had added
> the exec_mmap_called flag into my patch. I thought he had only added
> the missing unlock.

Ahh, so you meant for all of that to be entirely static refactoring,
rather than the conditional unlock depending on just how far we had
gotten.

Good, that's generally the much superior approach.

I absolutely _hate_ the "drop and retake" model, unless it's a very
local case with a very explicit retry path.

In contrast, the "we have a flag that shows how far we've gotten"
_has_ been a successful model, and while I much prefer a static "lock
pairs with unlock", that "I have done this, so you need to unlock" is
not entirely out of the question when the static rules become too
complex to think about.

The vfs code has something similar in FMODE_OPENED which is basically
a flag saying "I actually made it all the way to the ->open()"
callback. We used to have a static model, but the rules for when we
can use fput(), and when we have to use fdrop() were too hard for
people.

> I spotted the weirdness in unlocking exec_update_mutex, and because it
> does fix a real world deadlock with ptrace I did not back it out from my
> tree.
>
> I have been much laxer on the details than I like to be my apologies.

Ok, as long as we have a sane plan..

And

> The plan is:
> exec_udpate_mutex will cover just the subset of cred_guard_mutex
> after the point of no return, and after we do any actions that
> might block waiting for userspace to do anything.
>
> So exec_update_mutex will just cover those things that exec
> is updating, so if you want an atomic snapshot of them
> plus the appropriate struct cred you can grab exec_update_mutex.
>
> I added a new mutex instead of just fixing cred_guard_mutex so
> that we can update or revert the individual code paths if it
> is found that something is wrong.
>
> The cred_guard_mutex also prevents other tasks from starting
> to ptrace the task that is exec'ing, and other tasks from
> setting no_new_privs on the task that is exec'ing.
>
> My plan is to carefully refactor the code so it can perform
> both the ptrace and no_new_privs checks after the point of
> no return.

Ok. Sounds good.

> I have uncovered all kinds of surprises while working in that direction
> and I realize it is going to take a very delicate and careful touch to
> achieve my goal.
>
> There are silly things like normal linux exec when you are ptraced and
> exec changes the credentials the ordinary code will continue with the
> old credentials, but the an LSM enabled your process is likely to be
> killed instead.

Yeah. The "continue with old credentials" is actually very traditional
and the original behavior, and is useful for handling the case of
debugging something that is suid, but doesn't necessarily _require_
it.

But the LSM's just say yes/no.

I have this dim memory that it also triggers when you do the debugging
as root, but that may be some medication-induced memory.

> There is the personal mind blowing scenario where selinux will increase
> your credentials upon exec but if a magic directive is supplied in it's
> rules will avoid setting AT_SECURE, so that userspace won't protect
> itself from hostile takeover from the pre credential change environment.
> Much to my surprise "noatsecure" is a known and documented feature of
> selinux. I am not certain but I think I even spotted it in use on
> production.

We have had a _ton_ of random small rules so that people could enable
SElinux in legacy environments.

They are _probably_ effectively dead code in this day and age, but
it's hard to tell...

Linus

2020-04-03 05:12:03

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/3/20 1:01 AM, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
>> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <[email protected]> wrote:
>>>
>>> There are two more patches, which might be of interest for you, just to
>>> make the picture complete.
>>> It is not clear if we go that way, or if Eric has a yet better idea.
>>>
>>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>>> https://www.spinics.net/lists/kernel/msg3459067.html
>>
>> There is no way I would ever take that patch.
>>
>> The amount of confusion in that patch is not acceptable. Randomly
>> unlocking the new lock?
>>
>> That code makes everything worse, it's completely incomprehensible,
>> the locking rules make no sense ahwt-so-ever.
>>
>> I'm seriously starting to feel like I should not have pulled this
>> code, because the future looks _worse_ than what we used to have.
>>
>> No. No no no. Eric, this is not an acceptable direction.
>
> That is not the direction I intend to take either.
>
> I was hoping I could put off replying to this thread for a bit because
> I only managed to get 4 hours of sleep last night and I am not as alert
> to technical details as I would like to be.
>
> Long story short:
>
> The exec_update_mutex is to be a strict subset of today's
> cred_guard_mutex. I tried to copy cred_guard_mutex's unlock style so
> that was obvious and that turns out was messier than I intended.
>
> I thought the changes to the individual locking changes were
> sufficiently unsubtle that they did not need my personal attention.
> Especially as they are just a substitution of one lock for another
> with a slightly smaller scope.
>
> I started working on the the series of changes that reorganizes
> the changes in exec.
>
> It was reported that something had gone wrong with my introduction
> of exec_update_mutex and I pulled it from linux-next.
>
> By the time I was ready to start putting humpty dumpty back together
> again Bernd had collected everything up and had it working. I had seen
> that he had been given the feedback about better change descriptions.
>

Sorry, I did it as slowly as I could possibly do.
I wanted to wait for you, but....


> I had looked at the code of his patches earlier and the basic changes
> were trivial.
>
> Since I thought I already knew what was in the patches and the worst
> problem was the missing unlock of cred_guard_mutex, and I know Bernd's
> patches had been tested I applied them. I missed that Bernd had added
> the exec_mmap_called flag into my patch. I thought he had only added
> the missing unlock.
>
> I spotted the weirdness in unlocking exec_update_mutex, and because it
> does fix a real world deadlock with ptrace I did not back it out from my
> tree.
>
> I have been much laxer on the details than I like to be my apologies.
>
> The plan is:
> exec_udpate_mutex will cover just the subset of cred_guard_mutex
> after the point of no return, and after we do any actions that
> might block waiting for userspace to do anything.
>
> So exec_update_mutex will just cover those things that exec
> is updating, so if you want an atomic snapshot of them
> plus the appropriate struct cred you can grab exec_update_mutex.
>
> I added a new mutex instead of just fixing cred_guard_mutex so
> that we can update or revert the individual code paths if it
> is found that something is wrong.
>
> The cred_guard_mutex also prevents other tasks from starting
> to ptrace the task that is exec'ing, and other tasks from
> setting no_new_privs on the task that is exec'ing.
>
> My plan is to carefully refactor the code so it can perform
> both the ptrace and no_new_privs checks after the point of
> no return.
>
> I have uncovered all kinds of surprises while working in that direction
> and I realize it is going to take a very delicate and careful touch to
> achieve my goal.
>

That worries me a bit.
Could you please share details of the failed attempts with us,
Leaning from failures could help us better understand the issue.



> There are silly things like normal linux exec when you are ptraced and
> exec changes the credentials the ordinary code will continue with the
> old credentials, but the an LSM enabled your process is likely to be
> killed instead.
>

Please elaborate on the details.

> There is the personal mind blowing scenario where selinux will increase
> your credentials upon exec but if a magic directive is supplied in it's
> rules will avoid setting AT_SECURE, so that userspace won't protect
> itself from hostile takeover from the pre credential change environment.
> Much to my surprise "noatsecure" is a known and documented feature of
> selinux. I am not certain but I think I even spotted it in use on
> production.
>

Also here, it might help to make us aware of the problems you face.

I also considered moving all the credentials to the inner block,
but had the impression that is probably a really tough problem instead.

I wondered what happens if a ptraced execve process executes a
suid program that is. Don't you need different credentials
when you are pthraced, I mean, doesn't that override the suid bit,
while when not ptraced, you be root user, and have all the root
powers to load the image in the new vm?

Isn't there a race when execve starts, and ptrace attach happens later?


Thanks
Bernd.

> I will catch up on my sleep before I allow any more changes, and I will
> see replacing the called_exec_mmap flag with something saner.
>
> Eric
>

2020-04-03 07:53:52

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/2/20 11:46 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 2:00 PM Bernd Edlinger <[email protected]> wrote:
>>
>> There are two more patches, which might be of interest for you, just to
>> make the picture complete.
>> It is not clear if we go that way, or if Eric has a yet better idea.
>>
>> [PATCH v7 15/16] exec: Fix dead-lock in de_thread with ptrace_attach
>> https://www.spinics.net/lists/kernel/msg3459067.html
>
> There is no way I would ever take that patch.
>
> The amount of confusion in that patch is not acceptable. Randomly
> unlocking the new lock?
>
> That code makes everything worse, it's completely incomprehensible,
> the locking rules make no sense ahwt-so-ever.
>
> I'm seriously starting to feel like I should not have pulled this
> code, because the future looks _worse_ than what we used to have.
>

No problem, sometimes they say the cure is worse than the disease,
and I would not rule out the possibility that this is also
an example for that.

My initial proposal was much smaller and probably more on the issue,
but in peer review it turned out that we want to solve the problem
from ground up. Otherwise I saw no possibility how to get it approved.
That forced me in that direction that this took.

I just try to help with that. But I do not insist in a specific
direction (-:

This is what I initially proposed:

[PATCH] exec: Fix a deadlock in ptrace
https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/


Thanks
Bernd.

> No. No no no. Eric, this is not an acceptable direction.
>
> Linus
>

2020-04-03 15:10:08

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/2/20 9:04 PM, Linus Torvalds wrote:
> On Wed, Apr 1, 2020 at 9:16 AM Eric W. Biederman <[email protected]> wrote:
>>
>> The work on exec starts solving a long standing issue with exec that
>> it takes mutexes of blocking userspace applications, which makes exec
>> extremely deadlock prone. For the moment this adds a second mutex
>> with a narrower scope that handles all of the easy cases. Which
>> makes the tricky cases easy to spot. With a little luck the code to
>> solve those deadlocks will be ready by next merge window.
>
> So this worries me.
>
> I've pulled it, but I'm not entirely happy about some of it.
>
> For example, the "rationale" for some of the changes is
>
> This should be safe, as the credentials are only used for reading.
>
> which is just nonsensical. "Only used for reading" is immaterial, and
> there's no explanation for why that would matter at all. Most of the
> credentials are ever used for reading, and the worry about execve() is
> that the credentials can change, and people race with them and use the
> new 'suid' credentials and allow things that shouldn't be allowed. So
> the rationale makes no sense at all.
>
> Btw, if "this only takes it for reading" is such a big deal, why is
> that mutex not an rw-semaphore?
>
> The pidfd change at least has a rationale that makes sense:
>
> This should be safe, as the credentials do not change
> before exec_update_mutex is locked. Therefore whatever
> file access is possible with holding the cred_guard_mutex
> here is also possbile with the exec_update_mutex.
>
> so now you at least have an explanation of why that particular lock
> makes sense and works and is equivalent.
>
> It's still not a *great* explanation for why it would be equivalent,
> because cred_guard_mutex ends up not just guarding the write of the
> credentials, but makes it atomic wrt *other* data. That's the same
> problem as "I'm only reading".
>
> Locking is not about *one* value being consistent - if that was the
> case, then you could just do a "get refcount on the credentials, now I
> have a stable set of creds I can read forever". No lock needed.
>
> So locking is about *multiple* values being consistent, which is why
> that "I'm only reading" is not an explanation for why you can change
> the lock.
>
> It's also why that second one is questionable: it's a _better_ attempt
> at explaining things, but the point is really that cred_guard_mutex
> protects *other* things too.
>

Can we still edit the change logs, maybe that is a clear indication
that they are not sufficiently clear, when one don't understand the
patch without following the whole email thread.


> A real explanation would have absolutely *nothing* to do with
> "reading" or "same value of credentials". Both of those are entirely
> immaterial, since - as mentioned - you could just get a snapshot of
> the creds instead.
>

The problem we have here is that *another* thread can change no_new_privs
of the execve thread, that is a write. I think that must be avoided
whatever it costs. Those are the hard issues,
and reading another thread's credentials, an taking a reference of the
vm need to be consistent, so should just not happen while the vm
is updated, but the credentials not yet.

Or am I missing something here?

> A real explanation would be about how there is no other state that
> cred_guard_mutex protects that matters.
>
> See what I'm saying?
>
> This code is subtle as h*ll, and we've had bugs in it, and it has a
> series of tens of patches to fix them. But that also means that the
> explanations for the patches should take the subtleties into account,
> and not gloss over them with things like this.
>

:-)

> Ok, enough about the explanations. The actual _code_ is kind of odd
> too. For example, you have that "bprm->called_exec_mmap" flag to say
> "I've taken the exec_update_mutex, and need to drop it".
>

previously that was bprm->mm == NULL, that is even more hacky.

> But that flag is not set anywhere _near_ actually taking the lock.
> Sure, it is taken after exec_mmap() returns successfully, and that
> makes sense from a naming standpoint, but wouldn't it have been a
> _lot_ more obvious if you just set the flag when you took that lock,
> and instead of naming it by some magical code sequence, you named it
> for what it does?
>

Linus, I take full responsibility for this part of the patch.
In this case, I just did not want to change the name again.
That name was in a previous version of my patch, that I merged
with Eric's and at the same time had to fix the mutex-lock-order
issue in Eric's original patch. But if anybody would have
suggested a better name, and advised me to pass a parameter to
exec_mmap that would have happened.
So a kind of laziness on my side, and unfortunately I forgot to
point to all the changes in a revision log, I usually do that,
but this time I forgot it somehow. This was a 16-part patch
series at that time, so I just was really busy with following
each mail of the previous patch version, and also get the latest
revision of the change log (I use the mail maybe I should have
pulled Eric's tree, but I am still a newbie here ... :-) ).
Anyhow I was surprised that Eric did not see my changes by
looking at them, but that is the human nature, nothing to be
blamed for.


> Again, this looks all technically correct, but it's written in a way
> that doesn't seem to make a lot of sense. Why is the code literally
> written with a magical assumption of "calling exec_mmap takes this
> lock, so if the flag named called_exec_mmap is set, I have to free
> that lock that is not named that at all".
>

Names can be changed. In the peer review everybody was happy with it.
But that is not set in stone.

Initially I only wanted to address the ptrace attach, but Eric
came up with the user mode page fault handler, that made the patch
a lot more complicated, if that goal is dropped, also the place
where the mutex need to be taken could be a different one.


> I hate conditional locking in the first place, but if it has to exist,
> then the conditional should be named after the lock, and the lock
> getting should be very very explicitly tied to it.
>
> Wouldn't it have been much clearer if you called that flag
> "exec_update_mutex_taken", and set it WHEN YOU TAKE IT?
>

Can be done. I don't care. It is one additional register taken
with a parameter to exec_mmap and it is probably inlined, nothing
more nothing less.


> In fact, then you could drop the
>
> mutex_unlock(&tsk->signal->exec_update_mutex);
>
> in the error case of exec_mmap(), because now the error handling in
> free_bprm() would do the cleanup automatically.
>

The error handling is sometimes called when the exec_update_mutex is
not taken, in fact even de_thread not called.

Can you say how you would suggest that to be done?


> See what I'm saying? You've made the locking more complex and subtle
> than it needed to be. And since the whole point of the *new* lock is
> that it should replace an old lock that was really complex and subtle,
> that's a problem.
>
> Linus
>

2020-04-03 16:06:04

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/2/20 9:52 PM, Linus Torvalds wrote:
> On Thu, Apr 2, 2020 at 12:31 PM Bernd Edlinger
> <[email protected]> wrote:
>>
>> This is at least what is my impression how the existing mutexes are used,
>> a mutex called "cred_guard_mutex" is a not very good self explaining name,
>> in my opinion, it is totally unclear what it does "guard", and why.
>
> Oh, I absolutely agree that cred_guard_mutex is a horrible lock.
>
> It actually _used_ to be a lot more understandable, and the name used
> to make more sense in the context it was used.
>
> See commit
>
> a2a8474c3fff ("exec: do not sleep in TASK_TRACED under ->cred_guard_mutex")
> > for when it changed from "somewhat understandable" to "really hard to follow".
>

Ah, yes, there it was introduced.

That fixed only the case of a single-threaded process doing execve,
but missed to fix the case of a multi-threaded process doing execve,
and the other threads racing with the execve. That is what happened
on my laptop, again and again, when I tried to fix a bug in the
gcc testsuite, that is while I wanted to track down another bug,
that is why the gcc testsuite left loads of temp-files in /tmp,
until I decided to go on a little bug-hunt in the linux kernel
instead :-/

And I had no idea what was happening at all. But that way this bug
bit me again and again, until I realized the nature of the strace
problem, when I was really baffled.

Before I considered a linux patch for that I tried to fix it in the
strace code instead, and in fact I had tried two approaches,
one is wait in a signal handler, that did not work.
The second one is use another thread that does the wait, and that
did only work when I disable the PTRACE_O_TRACEEXIT flags.

I posted the two patches on lkml, just for reference.
Maybe you are amused by those patches. I consider that a craziness myself,
but it was indeed able to avoid the deadlock, with a user space change alone:

https://lore.kernel.org/lkml/AM6PR03MB5170D68B5010FCA627A603F8E4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/


so that is more or less for your amusement, sincerely I would not propose
that as the way to fix the strace deadlock.


Bernd.

> Don't get me wrong - that commit has a very good reason for it, but it
> does make the locking really hard to understand.
>
> It all used to be in one function - do_execve() - and it was holding
> the lock over a fairly obvious range, starting at
>
> bprm->cred = prepare_exec_creds();
>
> and ending at basically "we're done with execve()".
>
> So basically, cred_guard_mutex ends up being the thing that is held
> all the way from the "before execve looks at the old creds" to "execve
> is done, and has changed the creds".
>
> The reason it's needed is exactly that there are some nasty situations
> where execve() itself does things with creds to determine that the new
> creds are ok. And it uses the old creds to do that, but it also uses
> the task->flags and task->ptrace.
>
> So think of cred_guard_mutex as a lock around not just the creds, but
> the combination of creds and the task flags/ptrace.
>
> Anybody who changes the task ptrace setting needs to serialize with
> execve(). Or anybody who tests for "dumpable()", for example.
>
> If *all* you care about is just the creds, then you don't need it.
> It's really only users that do more checks than just credentials.
> "dumpable()" is I think the common one.
>
> And that's why cred_guard_mutex has that big range - it starts when we
> read the original creds (because it will use those creds to determine
> how the *new* creds will affect dumpability etc), and it ends when it
> has updated not only to the new creds, but it has set all those other
> flags too.
>
> So I'm not at all against splitting the lock up, and trying to make it
> more directed and specific.
>
> My complaints were about how the new lock wasn't much better. It was
> still completely incomprehensible, the conditional unlocking was hard
> to follow, and it really wasn't obvious that the converted users were
> fine.
>
> See?
>
> Linus
>

2020-04-03 16:52:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Fri, Apr 3, 2020 at 8:09 AM Bernd Edlinger <[email protected]> wrote:
>
> On 4/2/20 9:04 PM, Linus Torvalds wrote:
> > In fact, then you could drop the
> >
> > mutex_unlock(&tsk->signal->exec_update_mutex);
> >
> > in the error case of exec_mmap(), because now the error handling in
> > free_bprm() would do the cleanup automatically.
> >
>
> The error handling is sometimes called when the exec_update_mutex is
> not taken, in fact even de_thread not called.

But that's the whole point of the flag. Make the flag be about "do I
hold the mutex", and then the error handling does the right thing
regardless.

> Can you say how you would suggest that to be done?

I think the easiest thing to do to explain is to just write the patch.

This is entirely untested, but see what the difference is? I make the
flag be about exactly where I take the lock, not about some "I have
called exec_mmap".

Which means that now exec_mmap() doesn't even need to unlock it in the
error case, because the unlocking will happen properly in the
bprm_exit regardless.

This makes that unconditional unlocking logic much more obvious.

That said, Eric says he can make it all properly static so that it
doesn't need that kind of dynamic "if (x) unlock()" logic at all,
which is much better.

So this patch is not for consumption, it's purely for "look, something
like this"

Linus


Attachments:
patch.diff (2.68 kB)

2020-04-03 19:27:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

[ For Waiman & co - the problem is that the current cred_guard_mutex
is horrendous and has problems with execve() deadlocking against
various users. We've had this bug before, there's a new one, it's just
nasty ]

On Thu, Apr 2, 2020 at 4:04 PM Eric W. Biederman <[email protected]> wrote:
>
> That is not the direction I intend to take either.
>
> I was hoping I could put off replying to this thread for a bit because
> I only managed to get 4 hours of sleep last night and I am not as alert
> to technical details as I would like to be.

Hmm.. So I've been looking at this cred_guard_mutex, and I wonder...

This is a bit hand-wavy, because I haven't walker through all the
paths, but could we perhaps work around a lot of the problems a
different way., namely:

- make the "cred_guard_mutex" an rwsem-like thing instead of being a mutex.

- make the ptrace_attach() case get it for writing - not because
ptrace changes the creds, but because ptrace changes 'task->ptrace'
and depends on dumpability etc.

- change the *name* of that damn thing. Not because it's now
rwsem'ish rather than a mutex, but because it was never really about
just "creds". It was about creds+ptrace+dumpable flags etc.

- make all the ones that read the creds to just take it for reading
(IOW, the cases that were basically switched over to
exec_update_mutex).

- and finally: make "execve()" take it just for reading too, but
introduce a "upgrade to write" at the very end (when it actually is
all done and then finally changes the creds and dumpability)

Wouldn't that solve all problems? We wouldn't get deadlocks wrt
execve(), simply because execve() doesn't need it to be writable, and
the things execve() does and can deadlock all only want readability.

But hear me out, because the above is fundamentally broken in a couple
of ways, so let me address that brokenness before you tell me I'm a
complete nincompoop and an idiot.

I'm including some locking people here because of these issues, so
that they can maybe verify my thinking.

(a) our rwsem's are fair

So the whole "execve takes it for reading, so now others can take
it for reading too without deadlocks" is simply not true - if you use
the existing rwsem.

Because a concurrent (blocked) writer will then block other
readers for fairness reasons, and holding it for reading doesn't
guarantee that others can get it for reading.

So clearly, the above doesn't even *fix* the deadlocks - unless
we have an unfair mode (or just a special lock for just this that is
not our standard rwsem, but a special unfair one).

So I'm suggesting we use a special unfair rwsem here (we can make
a simple spinlock-based one - it doesn't need to be as clever or
optimized as the real rwsems are)

(b) similarly, our rwsem's don't actually have a "upgrade from read
to write", because that's also a fundamentally deadlocky operation.

Again, that's true. Except execve() is special, and we know
there's only _one_ execve() at a time that will complete, since we're
serializing them. So for this particular use, "upgrade to write" would
be possible without the general-case deadlock issues.

(c) I didn't think things through, and even with these special
semantics, my idea is complete garbage

Ok, this may well be true.

Anyway, the advantage of this (if it works) is that it would allow us
to go back to the _really_ simple original model of just taking this
lock for reading at the beginning of execve(), and not worrying so
much about complex nesting or very complex rules for exactly when we
got the lock and error handling.

The final part when we actually update the credentials and dumpability
and stuff in execve() is actually fairly simple. So the "upgrade to a
write lock" phase doesn't worry me too much. It's the interaction
with all the previous parts (which happen with it held just for
reading) that tend to be the nastier ones.

And ptrace_attach() really is special, and I think it would be the
only one that really needs that write lock.

The disadvantage, of course, is that it would require that
special-case lock semantic, and I might also be missing some thing
that makes it not work anyway.

Comments? Am I just dreaming of a simpler model without my medications again?

Linus

2020-04-03 20:42:33

by Waiman Long

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/3/20 3:26 PM, Linus Torvalds wrote:
> I'm including some locking people here because of these issues, so
> that they can maybe verify my thinking.
>
> (a) our rwsem's are fair
>
> So the whole "execve takes it for reading, so now others can take
> it for reading too without deadlocks" is simply not true - if you use
> the existing rwsem.
>
> Because a concurrent (blocked) writer will then block other
> readers for fairness reasons, and holding it for reading doesn't
> guarantee that others can get it for reading.
>
> So clearly, the above doesn't even *fix* the deadlocks - unless
> we have an unfair mode (or just a special lock for just this that is
> not our standard rwsem, but a special unfair one).
>
> So I'm suggesting we use a special unfair rwsem here (we can make
> a simple spinlock-based one - it doesn't need to be as clever or
> optimized as the real rwsems are)
>
> (b) similarly, our rwsem's don't actually have a "upgrade from read
> to write", because that's also a fundamentally deadlocky operation.
>
> Again, that's true. Except execve() is special, and we know
> there's only _one_ execve() at a time that will complete, since we're
> serializing them. So for this particular use, "upgrade to write" would
> be possible without the general-case deadlock issues.
>
> (c) I didn't think things through, and even with these special
> semantics, my idea is complete garbage
>
> Ok, this may well be true.
>
> Anyway, the advantage of this (if it works) is that it would allow us
> to go back to the _really_ simple original model of just taking this
> lock for reading at the beginning of execve(), and not worrying so
> much about complex nesting or very complex rules for exactly when we
> got the lock and error handling.
>
> The final part when we actually update the credentials and dumpability
> and stuff in execve() is actually fairly simple. So the "upgrade to a
> write lock" phase doesn't worry me too much. It's the interaction
> with all the previous parts (which happen with it held just for
> reading) that tend to be the nastier ones.
>
> And ptrace_attach() really is special, and I think it would be the
> only one that really needs that write lock.

Making an unfair rwsem that prefer readers (like the original rwlock
semantics) is certainly doable. I don't think that is hard to do. I can
think of 2 possible ways to do that. We  could make the unfairness
globally applies to all the readers of a rwsem by defining the fairness
state at init time. That will require keeping the state in the rwsem
structure increasing its size.

Another alternative is to add new functions like down_read_unfair() that
perform unfair read locking for its callers. That will require less code
change, but the calling functions have to make the right choice.

Cheers,
Longman


2020-04-03 21:01:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Fri, Apr 3, 2020 at 1:41 PM Waiman Long <[email protected]> wrote:
>
> Another alternative is to add new functions like down_read_unfair() that
> perform unfair read locking for its callers. That will require less code
> change, but the calling functions have to make the right choice.

I'd prefer the static choice model - and I'd hide this in some
"task_cred_read_lock()" function anyway rather than have the users do
"mutex_lock_killable(&task->signal->cred_guard_mutex)" like they do
now.

How nasty would it be to add the "upgrade" op? I took a quick look,
but that just made me go "Waiman would know" ;)

Linus

2020-04-03 23:17:00

by Waiman Long

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/3/20 4:59 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 1:41 PM Waiman Long <[email protected]> wrote:
>> Another alternative is to add new functions like down_read_unfair() that
>> perform unfair read locking for its callers. That will require less code
>> change, but the calling functions have to make the right choice.
> I'd prefer the static choice model - and I'd hide this in some
> "task_cred_read_lock()" function anyway rather than have the users do
> "mutex_lock_killable(&task->signal->cred_guard_mutex)" like they do
> now.
>
> How nasty would it be to add the "upgrade" op? I took a quick look,
> but that just made me go "Waiman would know" ;)
>
> Linus
>
With static choice, you mean defined at init time. Right? In that case,
you don't really need a special encapsulation function.

With upgrade, if there is only one reader, it is pretty straight
forward. With more than one readers, it gets more complicated as we have
to wait for other readers to unlock. We can spin for a certain period of
time. After that, that reader can use the handoff mechanism by queuing
itself in front the wait queue before releasing the read lock and go to
sleep. That will make sure that it will get the lock once all the other
readers exits. For an unfair rwsem, the writer cannot assert the handoff
bit and so it shouldn't interfere with this upgrade process.

If there are multiple upgrade readers, only one can win the race. The
others have to release the read lock and queue themselves as writers.
Will that be acceptable?

Cheers,
Longman



Cheers,
Longman

2020-04-03 23:24:43

by Waiman Long

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/3/20 7:16 PM, Waiman Long wrote:
> On 4/3/20 4:59 PM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 1:41 PM Waiman Long <[email protected]> wrote:
>>> Another alternative is to add new functions like down_read_unfair() that
>>> perform unfair read locking for its callers. That will require less code
>>> change, but the calling functions have to make the right choice.
>> I'd prefer the static choice model - and I'd hide this in some
>> "task_cred_read_lock()" function anyway rather than have the users do
>> "mutex_lock_killable(&task->signal->cred_guard_mutex)" like they do
>> now.
>>
>> How nasty would it be to add the "upgrade" op? I took a quick look,
>> but that just made me go "Waiman would know" ;)
>>
>> Linus
>>
> With static choice, you mean defined at init time. Right? In that case,
> you don't really need a special encapsulation function.
>
> With upgrade, if there is only one reader, it is pretty straight
> forward. With more than one readers, it gets more complicated as we have
> to wait for other readers to unlock. We can spin for a certain period of
> time. After that, that reader can use the handoff mechanism by queuing
> itself in front the wait queue before releasing the read lock and go to
> sleep. That will make sure that it will get the lock once all the other
> readers exits. For an unfair rwsem, the writer cannot assert the handoff
> bit and so it shouldn't interfere with this upgrade process.
>
> If there are multiple upgrade readers, only one can win the race. The
> others have to release the read lock and queue themselves as writers.
> Will that be acceptable?

Alternatively, we could assert that only one reader can do the upgrade
and do a WARN_ON_ONCE() if multiple concurrent upgrade attempts is detected.

Regards,
Longman

2020-04-04 01:32:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Fri, Apr 3, 2020 at 4:23 PM Waiman Long <[email protected]> wrote:
>
> Alternatively, we could assert that only one reader can do the upgrade
> and do a WARN_ON_ONCE() if multiple concurrent upgrade attempts is detected.

Ack, that would be best.

[ And since I'm not on mobile any more, and my html email got thrown
out by the list, I'll just repeat that by "static choice" I mean "no
runtime decisions or flags": code that needs the unfair behavior would
use a special unfair function ]

Linus

2020-04-04 02:08:03

by Waiman Long

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/3/20 9:30 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 4:23 PM Waiman Long <[email protected]> wrote:
>> Alternatively, we could assert that only one reader can do the upgrade
>> and do a WARN_ON_ONCE() if multiple concurrent upgrade attempts is detected.
> Ack, that would be best.
>
> [ And since I'm not on mobile any more, and my html email got thrown
> out by the list, I'll just repeat that by "static choice" I mean "no
> runtime decisions or flags": code that needs the unfair behavior would
> use a special unfair function ]
>
> Linus
>
Got it.

So in term of priority, my current thinking is

    upgrading unfair reader > unfair reader > reader/writer

A higher priority locker will block other lockers from acquiring the lock.

Thought?

Cheers,
Longman

2020-04-04 02:31:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <[email protected]> wrote:
>
> So in term of priority, my current thinking is
>
> upgrading unfair reader > unfair reader > reader/writer
>
> A higher priority locker will block other lockers from acquiring the lock.

An alternative option might be to have readers normally be 100% normal
(ie with fairness wrt writers), and not really introduce any special
"unfair reader" lock.

Instead, all the unfairness would come into play only when the special
case - execve() - does it's special "lock for reading with intent to
upgrade".

But when it enters that kind of "intent to upgrade" lock state, it
would not only block all subsequent writers, it would also guarantee
that all other readers can continue to go).

So then the new rwsem operations would be

- read_with_write_intent_lock_interruptible()

This is the beginning of "execve()", and waits for all writers to
exit, and puts the lock into "all readers can go" mode.

You could think of it as a "I'm queuing myself for a write lock,
but I'm allowing readers to go ahead" state.

- read_lock_to_write_upgrade()

This is the "now this turns into a regular write lock". It needs to
wait for all other readers to exit, of course.

- read_with_write_intent_unlock()

This is the "I'm unqueuing myself, I aborted and will not become a
write lock after all" operation.

NOTE! In this model, there may be multiple threads that do that
initial queuing thing. We only guarantee that only one of them will
get to the actual write lock stage, and the others will abort before
that happens.

If that is a more natural state machine, then that should work fine
too. And it has some advantages, in that it keeps the readers normally
fair, and only turns them unfair when we get to that special
read-for-write stage.

But whatever it most natural for the rwsem code. Entirely up to you.

Linus

2020-04-04 04:25:57

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1



On 4/4/20 1:16 AM, Waiman Long wrote:
> On 4/3/20 4:59 PM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 1:41 PM Waiman Long <[email protected]> wrote:
>>> Another alternative is to add new functions like down_read_unfair() that
>>> perform unfair read locking for its callers. That will require less code
>>> change, but the calling functions have to make the right choice.
>> I'd prefer the static choice model - and I'd hide this in some
>> "task_cred_read_lock()" function anyway rather than have the users do
>> "mutex_lock_killable(&task->signal->cred_guard_mutex)" like they do
>> now.
>>
>> How nasty would it be to add the "upgrade" op? I took a quick look,
>> but that just made me go "Waiman would know" ;)
>>
>> Linus
>>
> With static choice, you mean defined at init time. Right? In that case,
> you don't really need a special encapsulation function.
>
> With upgrade, if there is only one reader, it is pretty straight
> forward. With more than one readers, it gets more complicated as we have
> to wait for other readers to unlock. We can spin for a certain period of
> time. After that, that reader can use the handoff mechanism by queuing
> itself in front the wait queue before releasing the read lock and go to
> sleep. That will make sure that it will get the lock once all the other
> readers exits. For an unfair rwsem, the writer cannot assert the handoff
> bit and so it shouldn't interfere with this upgrade process.
>
> If there are multiple upgrade readers, only one can win the race. The
> others have to release the read lock and queue themselves as writers.
> Will that be acceptable?
>

Someone pointer out prevoiosly I think
that with the real time linux
the rwmutex are just mutex and we
better not base our desing on that.

To me linux_rt is a must.

Thanks
Bernd.

> Cheers,
> Longman
>
>
>
> Cheers,
> Longman
>

2020-04-04 05:48:13

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1



On 4/3/20 6:23 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 8:09 AM Bernd Edlinger <[email protected]> wrote:
>>
>> On 4/2/20 9:04 PM, Linus Torvalds wrote:
>>> In fact, then you could drop the
>>>
>>> mutex_unlock(&tsk->signal->exec_update_mutex);
>>>
>>> in the error case of exec_mmap(), because now the error handling in
>>> free_bprm() would do the cleanup automatically.
>>>
>>
>> The error handling is sometimes called when the exec_update_mutex is
>> not taken, in fact even de_thread not called.
>
> But that's the whole point of the flag. Make the flag be about "do I
> hold the mutex", and then the error handling does the right thing
> regardless.
>
>> Can you say how you would suggest that to be done?
>
> I think the easiest thing to do to explain is to just write the patch.
>
> This is entirely untested, but see what the difference is? I make the
> flag be about exactly where I take the lock, not about some "I have
> called exec_mmap".
>
> Which means that now exec_mmap() doesn't even need to unlock it in the
> error case, because the unlocking will happen properly in the
> bprm_exit regardless.
>
> This makes that unconditional unlocking logic much more obvious.
>
> That said, Eric says he can make it all properly static so that it
> doesn't need that kind of dynamic "if (x) unlock()" logic at all,
> which is much better.
>
> So this patch is not for consumption, it's purely for "look, something
> like this"
>


Just one suggestion, in general It would feel pretty much okay if you
like to improve the naming, and the consistency in any of my patches.

> @@ -1067,7 +1069,6 @@ static int exec_mmap(struct mm_struct *mm)
> down_read(&old_mm->mmap_sem);
> if (unlikely(old_mm->core_state)) {
> up_read(&old_mm->mmap_sem);
> - mutex_unlock(&tsk->signal->exec_update_mutex);

I was trying to replicate the behavior of prepare_bprm_creds
which also unlocks the mutex in the error case, therefore it felt
okay to unlock the mutex here, but it will work either way.

I should further note, that the mutex would be locked if this
error exit is taken, and unlocked if this error happens:

ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
if (ret)
return ret;

so at least the function comment I introduced above should be updated:
* Maps the mm_struct mm into the current task struct.
* On success, this function returns with the mutex
* exec_update_mutex locked.


> put_binfmt(fmt);
> - if (retval < 0 && bprm->called_exec_mmap) {
> + if (retval < 0 && !bprm->mm) {

Using bprm->mm like this feels like a hack to me. It works here,
but nowhere else. Therefore I changed this line.

Using !bprm->mm in the error handling code made Eric's patch fail.


Thanks
Bernd.


> Linus
>

2020-04-04 05:51:08

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1



On 4/4/20 7:43 AM, Bernd Edlinger wrote:
>
>
> On 4/3/20 6:23 PM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 8:09 AM Bernd Edlinger <[email protected]> wrote:
>>>
>>> On 4/2/20 9:04 PM, Linus Torvalds wrote:
>>>> In fact, then you could drop the
>>>>
>>>> mutex_unlock(&tsk->signal->exec_update_mutex);
>>>>
>>>> in the error case of exec_mmap(), because now the error handling in
>>>> free_bprm() would do the cleanup automatically.
>>>>
>>>
>>> The error handling is sometimes called when the exec_update_mutex is
>>> not taken, in fact even de_thread not called.
>>
>> But that's the whole point of the flag. Make the flag be about "do I
>> hold the mutex", and then the error handling does the right thing
>> regardless.
>>
>>> Can you say how you would suggest that to be done?
>>
>> I think the easiest thing to do to explain is to just write the patch.
>>
>> This is entirely untested, but see what the difference is? I make the
>> flag be about exactly where I take the lock, not about some "I have
>> called exec_mmap".
>>
>> Which means that now exec_mmap() doesn't even need to unlock it in the
>> error case, because the unlocking will happen properly in the
>> bprm_exit regardless.
>>
>> This makes that unconditional unlocking logic much more obvious.
>>
>> That said, Eric says he can make it all properly static so that it
>> doesn't need that kind of dynamic "if (x) unlock()" logic at all,
>> which is much better.
>>
>> So this patch is not for consumption, it's purely for "look, something
>> like this"
>>
>
>
> Just one suggestion, in general It would feel pretty much okay if you
> like to improve the naming, and the consistency in any of my patches.
>
>> @@ -1067,7 +1069,6 @@ static int exec_mmap(struct mm_struct *mm)
>> down_read(&old_mm->mmap_sem);
>> if (unlikely(old_mm->core_state)) {
>> up_read(&old_mm->mmap_sem);
>> - mutex_unlock(&tsk->signal->exec_update_mutex);
>
> I was trying to replicate the behavior of prepare_bprm_creds
> which also unlocks the mutex in the error case, therefore it felt
> okay to unlock the mutex here, but it will work either way.
>
> I should further note, that the mutex would be locked if this
> error exit is taken, and unlocked if this error happens:
>
> ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
> if (ret)
> return ret;
>
> so at least the function comment I introduced above should be updated:
> * Maps the mm_struct mm into the current task struct.
> * On success, this function returns with the mutex
> * exec_update_mutex locked.
>
>
>> put_binfmt(fmt);
>> - if (retval < 0 && bprm->called_exec_mmap) {
>> + if (retval < 0 && !bprm->mm) {
>
> Using bprm->mm like this feels like a hack to me. It works here,
> but nowhere else. Therefore I changed this line.
>
> Using !bprm->mm in the error handling code made Eric's patch fail.
>

That does probably work better it the boolean is named
after_the_point_of_no_return or something....


>
> Thanks
> Bernd.
>
>
>> Linus
>>

2020-04-04 06:37:55

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1



On 4/4/20 4:28 AM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <[email protected]> wrote:
>>
>> So in term of priority, my current thinking is
>>
>> upgrading unfair reader > unfair reader > reader/writer
>>
>> A higher priority locker will block other lockers from acquiring the lock.
>
> An alternative option might be to have readers normally be 100% normal
> (ie with fairness wrt writers), and not really introduce any special
> "unfair reader" lock.
>
> Instead, all the unfairness would come into play only when the special
> case - execve() - does it's special "lock for reading with intent to
> upgrade".
>
> But when it enters that kind of "intent to upgrade" lock state, it
> would not only block all subsequent writers, it would also guarantee
> that all other readers can continue to go).
>
> So then the new rwsem operations would be
>
> - read_with_write_intent_lock_interruptible()
>
> This is the beginning of "execve()", and waits for all writers to
> exit, and puts the lock into "all readers can go" mode.
>
> You could think of it as a "I'm queuing myself for a write lock,
> but I'm allowing readers to go ahead" state.
>
> - read_lock_to_write_upgrade()
>
> This is the "now this turns into a regular write lock". It needs to
> wait for all other readers to exit, of course.
>
> - read_with_write_intent_unlock()
>
> This is the "I'm unqueuing myself, I aborted and will not become a
> write lock after all" operation.
>
> NOTE! In this model, there may be multiple threads that do that
> initial queuing thing. We only guarantee that only one of them will
> get to the actual write lock stage, and the others will abort before
> that happens.

One of the problems that add to the current situation, is that sometimes
the cred_guard_mutex is locked killable, so can be killed by de_thread.
But in other places cred_guard_mutex is not killable. So cannot be
locked and cannot be killed either -> dead-lock.


But Fear Not!

Overall we are pretty much in a good position to defeat the
enemy now, once an forever.

- We have my ugly-crazy patch that just works.

- We will have Eric's patch that is even better.

- We can try to put something togeter with creative new rw-type semaphores.

- We can merge ideas from one of the patches to another.


So it is impossible we not succeed to fix it this time :-)


Bernd.

>
> If that is a more natural state machine, then that should work fine
> too. And it has some advantages, in that it keeps the readers normally
> fair, and only turns them unfair when we get to that special
> read-for-write stage.
>
> But whatever it most natural for the rwsem code. Entirely up to you.
>
> Linus
>

2020-04-05 02:48:30

by Waiman Long

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/3/20 10:28 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <[email protected]> wrote:
>> So in term of priority, my current thinking is
>>
>> upgrading unfair reader > unfair reader > reader/writer
>>
>> A higher priority locker will block other lockers from acquiring the lock.
> An alternative option might be to have readers normally be 100% normal
> (ie with fairness wrt writers), and not really introduce any special
> "unfair reader" lock.
A regular down_read() caller will be handled normally.
> Instead, all the unfairness would come into play only when the special
> case - execve() - does it's special "lock for reading with intent to
> upgrade".
>
> But when it enters that kind of "intent to upgrade" lock state, it
> would not only block all subsequent writers, it would also guarantee
> that all other readers can continue to go).

Yes, that shouldn't be hard to do. If that is what is required, we may
only need a special upgrade function to drain the OSQ and then wake up
all the readers in the wait queue. I will add a flags argument to that
special upgrade function so that we may be able to select different
behavior in the future.

The regular down_read_interruptible() can be used unless we want to
designate only some readers are allowed to do upgrade by calling a
special down_read() function.
>
> So then the new rwsem operations would be
>
> - read_with_write_intent_lock_interruptible()
>
> This is the beginning of "execve()", and waits for all writers to
> exit, and puts the lock into "all readers can go" mode.
>
> You could think of it as a "I'm queuing myself for a write lock,
> but I'm allowing readers to go ahead" state.
>
> - read_lock_to_write_upgrade()
>
> This is the "now this turns into a regular write lock". It needs to
> wait for all other readers to exit, of course.
>
> - read_with_write_intent_unlock()
>
> This is the "I'm unqueuing myself, I aborted and will not become a
> write lock after all" operation.
>
> NOTE! In this model, there may be multiple threads that do that
> initial queuing thing. We only guarantee that only one of them will
> get to the actual write lock stage, and the others will abort before
> that happens.
>
> If that is a more natural state machine, then that should work fine
> too. And it has some advantages, in that it keeps the readers normally
> fair, and only turns them unfair when we get to that special
> read-for-write stage.
>
> But whatever it most natural for the rwsem code. Entirely up to you.

To be symmetric with the existing downgrade_write() function, I will
choose the name upgrade_read() for the upgrade function.

Will that work for you?

Cheers,
Longman

2020-04-05 03:39:20

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/5/20 4:42 AM, Waiman Long wrote:
> On 4/3/20 10:28 PM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <[email protected]> wrote:
>>> So in term of priority, my current thinking is
>>>
>>> upgrading unfair reader > unfair reader > reader/writer
>>>
>>> A higher priority locker will block other lockers from acquiring the lock.
>> An alternative option might be to have readers normally be 100% normal
>> (ie with fairness wrt writers), and not really introduce any special
>> "unfair reader" lock.
> A regular down_read() caller will be handled normally.
>> Instead, all the unfairness would come into play only when the special
>> case - execve() - does it's special "lock for reading with intent to
>> upgrade".
>>
>> But when it enters that kind of "intent to upgrade" lock state, it
>> would not only block all subsequent writers, it would also guarantee
>> that all other readers can continue to go).
>
> Yes, that shouldn't be hard to do. If that is what is required, we may
> only need a special upgrade function to drain the OSQ and then wake up
> all the readers in the wait queue. I will add a flags argument to that
> special upgrade function so that we may be able to select different
> behavior in the future.
>
> The regular down_read_interruptible() can be used unless we want to
> designate only some readers are allowed to do upgrade by calling a
> special down_read() function.
>>
>> So then the new rwsem operations would be
>>
>> - read_with_write_intent_lock_interruptible()
>>
>> This is the beginning of "execve()", and waits for all writers to
>> exit, and puts the lock into "all readers can go" mode.
>>
>> You could think of it as a "I'm queuing myself for a write lock,
>> but I'm allowing readers to go ahead" state.
>>
>> - read_lock_to_write_upgrade()
>>
>> This is the "now this turns into a regular write lock". It needs to
>> wait for all other readers to exit, of course.
>>
>> - read_with_write_intent_unlock()
>>
>> This is the "I'm unqueuing myself, I aborted and will not become a
>> write lock after all" operation.
>>
>> NOTE! In this model, there may be multiple threads that do that
>> initial queuing thing. We only guarantee that only one of them will
>> get to the actual write lock stage, and the others will abort before
>> that happens.
>>
>> If that is a more natural state machine, then that should work fine
>> too. And it has some advantages, in that it keeps the readers normally
>> fair, and only turns them unfair when we get to that special
>> read-for-write stage.
>>
>> But whatever it most natural for the rwsem code. Entirely up to you.
>
> To be symmetric with the existing downgrade_write() function, I will
> choose the name upgrade_read() for the upgrade function.
>
> Will that work for you?
>

May I ask, if the proposed rwsem will also work for RT-linux,
or will it be a normal mutex there?


Thanks
Bernd.


> Cheers,
> Longman
>

2020-04-05 03:46:52

by Waiman Long

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/4/20 11:35 PM, Bernd Edlinger wrote:
>> To be symmetric with the existing downgrade_write() function, I will
>> choose the name upgrade_read() for the upgrade function.
>>
>> Will that work for you?
>>
> May I ask, if the proposed rwsem will also work for RT-linux,
> or will it be a normal mutex there?

Good question. RT have their own special code for rwsem. I need to take
a look at that to see if something like that is possible.

Cheers,
Longman

2020-04-05 06:38:24

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1



On 4/4/20 8:34 AM, Bernd Edlinger wrote:
>
>
> On 4/4/20 4:28 AM, Linus Torvalds wrote:
>> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <[email protected]> wrote:
>>>
>>> So in term of priority, my current thinking is
>>>
>>> upgrading unfair reader > unfair reader > reader/writer
>>>
>>> A higher priority locker will block other lockers from acquiring the lock.
>>
>> An alternative option might be to have readers normally be 100% normal
>> (ie with fairness wrt writers), and not really introduce any special
>> "unfair reader" lock.
>>
>> Instead, all the unfairness would come into play only when the special
>> case - execve() - does it's special "lock for reading with intent to
>> upgrade".
>>
>> But when it enters that kind of "intent to upgrade" lock state, it
>> would not only block all subsequent writers, it would also guarantee
>> that all other readers can continue to go).
>>
>> So then the new rwsem operations would be
>>
>> - read_with_write_intent_lock_interruptible()
>>
>> This is the beginning of "execve()", and waits for all writers to
>> exit, and puts the lock into "all readers can go" mode.
>>
>> You could think of it as a "I'm queuing myself for a write lock,
>> but I'm allowing readers to go ahead" state.
>>
>> - read_lock_to_write_upgrade()
>>
>> This is the "now this turns into a regular write lock". It needs to
>> wait for all other readers to exit, of course.
>>
>> - read_with_write_intent_unlock()
>>
>> This is the "I'm unqueuing myself, I aborted and will not become a
>> write lock after all" operation.
>>
>> NOTE! In this model, there may be multiple threads that do that
>> initial queuing thing. We only guarantee that only one of them will
>> get to the actual write lock stage, and the others will abort before
>> that happens.
>
> One of the problems that add to the current situation, is that sometimes
> the cred_guard_mutex is locked killable, so can be killed by de_thread.
> But in other places cred_guard_mutex is not killable. So cannot be
> locked and cannot be killed either -> dead-lock.
>
>
> But Fear Not!
>
> Overall we are pretty much in a good position to defeat the
> enemy now, once an forever.
>
> - We have my ugly-crazy patch that just works.
>
> - We will have Eric's patch that is even better.
>
> - We can try to put something togeter with creative new rw-type semaphores.
>
> - We can merge ideas from one of the patches to another.
>
>
> So it is impossible we not succeed to fix it this time :-)
>

BTW there is one other independent thing that came to my attention when I tried
to fix the ptrace deadlock from user space, which I tried first.

In order to break the deadlock from user space the strace program would have
to be rewitten to be multi-threaded. I tried that, but the problem was,
that an event from the tracee can be received either in the main thread,
or a signal handler, or another thread. I tried to implement both possibilities,
see my strace-patches which I pointed out previously here.

The signal handler approach completely failed, and the second thread
approach did not completely fail but was just definitely insane.

What really makes it impossible to write a multi-threaded strace program,
is that *only* the tread that made PTRACE_ATTACH can do all the other
PTRACE-APIs, but for a multi-treaded strace, any thread should be able
to call PTRACE-APIs as long as we are in the same process.

I don't know if that is really hard to achieve, but it seems like something
that would allow user space much more flexibility.

What do you think?


Thanks
Bernd.


>
> Bernd.
>
>>
>> If that is a more natural state machine, then that should work fine
>> too. And it has some advantages, in that it keeps the readers normally
>> fair, and only turns them unfair when we get to that special
>> read-for-write stage.
>>
>> But whatever it most natural for the rwsem code. Entirely up to you.
>>
>> Linus
>>

2020-04-05 19:38:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Sat, Apr 4, 2020 at 11:34 PM Bernd Edlinger
<[email protected]> wrote:
>
> What really makes it impossible to write a multi-threaded strace program,
> is that *only* the tread that made PTRACE_ATTACH can do all the other
> PTRACE-APIs, but for a multi-treaded strace, any thread should be able
> to call PTRACE-APIs as long as we are in the same process.

I agree that the ptrace model is broken, and no, you can't do a
threaded ptrace the way things are now.

Some of that is really fundamental to how we do things (ie the ptracer
is the parent), and our data structures really make that be
per-thread.

I'm not sure how easy it would be to fix. Some of it is probably
really painful. For example, right now we know we can't race between
different pthread operations, because only the thread that did
PTRACE_ATTACH is allowed to do most of them.

So it could be very painful indeed to try to fix it so that you can do
threaded tracing. It woudl probably be a good thing to have, but it
might not be worth the pain.

Some daring person could try...

Linus

2020-04-06 06:42:16

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1


On 4/4/20 7:48 AM, Bernd Edlinger wrote:
>
>
> On 4/4/20 7:43 AM, Bernd Edlinger wrote:
>>
>>
>> On 4/3/20 6:23 PM, Linus Torvalds wrote:
>>> On Fri, Apr 3, 2020 at 8:09 AM Bernd Edlinger <[email protected]> wrote:
>>>>
>>>> On 4/2/20 9:04 PM, Linus Torvalds wrote:
>>>>> In fact, then you could drop the
>>>>>
>>>>> mutex_unlock(&tsk->signal->exec_update_mutex);
>>>>>
>>>>> in the error case of exec_mmap(), because now the error handling in
>>>>> free_bprm() would do the cleanup automatically.
>>>>>
>>>>
>>>> The error handling is sometimes called when the exec_update_mutex is
>>>> not taken, in fact even de_thread not called.
>>>
>>> But that's the whole point of the flag. Make the flag be about "do I
>>> hold the mutex", and then the error handling does the right thing
>>> regardless.
>>>
>>>> Can you say how you would suggest that to be done?
>>>
>>> I think the easiest thing to do to explain is to just write the patch.
>>>
>>> This is entirely untested, but see what the difference is? I make the
>>> flag be about exactly where I take the lock, not about some "I have
>>> called exec_mmap".
>>>
>>> Which means that now exec_mmap() doesn't even need to unlock it in the
>>> error case, because the unlocking will happen properly in the
>>> bprm_exit regardless.
>>>
>>> This makes that unconditional unlocking logic much more obvious.
>>>
>>> That said, Eric says he can make it all properly static so that it
>>> doesn't need that kind of dynamic "if (x) unlock()" logic at all,
>>> which is much better.
>>>
>>> So this patch is not for consumption, it's purely for "look, something
>>> like this"
>>>
>>
>>
>> Just one suggestion, in general It would feel pretty much okay if you
>> like to improve the naming, and the consistency in any of my patches.
>>

I mean it, I could not imagine a greater honor, than You improving
one of my patches.

Just please consider what I said below before you do it.


Thanks
Bernd.

>>> @@ -1067,7 +1069,6 @@ static int exec_mmap(struct mm_struct *mm)
>>> down_read(&old_mm->mmap_sem);
>>> if (unlikely(old_mm->core_state)) {
>>> up_read(&old_mm->mmap_sem);
>>> - mutex_unlock(&tsk->signal->exec_update_mutex);
>>
>> I was trying to replicate the behavior of prepare_bprm_creds
>> which also unlocks the mutex in the error case, therefore it felt
>> okay to unlock the mutex here, but it will work either way.
>>
>> I should further note, that the mutex would be locked if this
>> error exit is taken, and unlocked if this error happens:
>>
>> ret = mutex_lock_killable(&tsk->signal->exec_update_mutex);
>> if (ret)
>> return ret;
>>
>> so at least the function comment I introduced above should be updated:
>> * Maps the mm_struct mm into the current task struct.
>> * On success, this function returns with the mutex
>> * exec_update_mutex locked.
>>
>>
>>> put_binfmt(fmt);
>>> - if (retval < 0 && bprm->called_exec_mmap) {
>>> + if (retval < 0 && !bprm->mm) {
>>
>> Using bprm->mm like this feels like a hack to me. It works here,
>> but nowhere else. Therefore I changed this line.
>>
>> Using !bprm->mm in the error handling code made Eric's patch fail.
>>
>
> That does probably work better it the boolean is named
> after_the_point_of_no_return or something....
>
>
>>
>> Thanks
>> Bernd.
>>
>>
>>> Linus
>>>

2020-04-06 13:14:29

by Will Deacon

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

[+Peter]

On Fri, Apr 03, 2020 at 07:28:36PM -0700, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 7:02 PM Waiman Long <[email protected]> wrote:
> >
> > So in term of priority, my current thinking is
> >
> > upgrading unfair reader > unfair reader > reader/writer
> >
> > A higher priority locker will block other lockers from acquiring the lock.
>
> An alternative option might be to have readers normally be 100% normal
> (ie with fairness wrt writers), and not really introduce any special
> "unfair reader" lock.
>
> Instead, all the unfairness would come into play only when the special
> case - execve() - does it's special "lock for reading with intent to
> upgrade".
>
> But when it enters that kind of "intent to upgrade" lock state, it
> would not only block all subsequent writers, it would also guarantee
> that all other readers can continue to go).
>
> So then the new rwsem operations would be
>
> - read_with_write_intent_lock_interruptible()
>
> This is the beginning of "execve()", and waits for all writers to
> exit, and puts the lock into "all readers can go" mode.
>
> You could think of it as a "I'm queuing myself for a write lock,
> but I'm allowing readers to go ahead" state.
>
> - read_lock_to_write_upgrade()
>
> This is the "now this turns into a regular write lock". It needs to
> wait for all other readers to exit, of course.

... and at this point, subsequent readers queue behind the upgrader so we
can't run into the usual "stream of readers prevents forward progress"
issue, which was my initial worry when I started reading the thread. Makes
sense.

> - read_with_write_intent_unlock()
>
> This is the "I'm unqueuing myself, I aborted and will not become a
> write lock after all" operation.
>
> NOTE! In this model, there may be multiple threads that do that
> initial queuing thing. We only guarantee that only one of them will
> get to the actual write lock stage, and the others will abort before
> that happens.

I do worry a bit about how much of this we can enforce, but I suppose I'll
wait for the patches. For example, it would nice for
read_lock_to_write_upgrade() to return -EBUSY if there was a concurrent
(successful) upgrade rather than some pathological failure mode like
deadlock, but that feels like it might be a pain to do. It would probably
also be nice to scream if read_lock_to_write_upgrade() is called on a lock
where the upgrade *did* go ahead. Maybe some of this is food for lockdep.

That said, if this all ends up being spelled task_cred_*() then perhaps
it doesn't matter.

Will

2020-04-06 22:21:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

Linus Torvalds <[email protected]> writes:

> [ For Waiman & co - the problem is that the current cred_guard_mutex
> is horrendous and has problems with execve() deadlocking against
> various users. We've had this bug before, there's a new one, it's just
> nasty ]
>
> On Thu, Apr 2, 2020 at 4:04 PM Eric W. Biederman <[email protected]> wrote:
>>
>> That is not the direction I intend to take either.
>>
>> I was hoping I could put off replying to this thread for a bit because
>> I only managed to get 4 hours of sleep last night and I am not as alert
>> to technical details as I would like to be.
>
> Hmm.. So I've been looking at this cred_guard_mutex, and I wonder...
>
> This is a bit hand-wavy, because I haven't walker through all the
> paths, but could we perhaps work around a lot of the problems a
> different way., namely:
>
> - make the "cred_guard_mutex" an rwsem-like thing instead of being a mutex.
>
> - make the ptrace_attach() case get it for writing - not because
> ptrace changes the creds, but because ptrace changes 'task->ptrace'
> and depends on dumpability etc.
>
> - change the *name* of that damn thing. Not because it's now
> rwsem'ish rather than a mutex, but because it was never really about
> just "creds". It was about creds+ptrace+dumpable flags etc.
>
> - make all the ones that read the creds to just take it for reading
> (IOW, the cases that were basically switched over to
> exec_update_mutex).
>
> - and finally: make "execve()" take it just for reading too, but
> introduce a "upgrade to write" at the very end (when it actually is
> all done and then finally changes the creds and dumpability)
>
> Wouldn't that solve all problems? We wouldn't get deadlocks wrt
> execve(), simply because execve() doesn't need it to be writable, and
> the things execve() does and can deadlock all only want readability.
>
> But hear me out, because the above is fundamentally broken in a couple
> of ways, so let me address that brokenness before you tell me I'm a
> complete nincompoop and an idiot.
>
> I'm including some locking people here because of these issues, so
> that they can maybe verify my thinking.
>
> (a) our rwsem's are fair
>
> So the whole "execve takes it for reading, so now others can take
> it for reading too without deadlocks" is simply not true - if you use
> the existing rwsem.
>
> Because a concurrent (blocked) writer will then block other
> readers for fairness reasons, and holding it for reading doesn't
> guarantee that others can get it for reading.
>
> So clearly, the above doesn't even *fix* the deadlocks - unless
> we have an unfair mode (or just a special lock for just this that is
> not our standard rwsem, but a special unfair one).
>
> So I'm suggesting we use a special unfair rwsem here (we can make
> a simple spinlock-based one - it doesn't need to be as clever or
> optimized as the real rwsems are)
>
> (b) similarly, our rwsem's don't actually have a "upgrade from read
> to write", because that's also a fundamentally deadlocky operation.
>
> Again, that's true. Except execve() is special, and we know
> there's only _one_ execve() at a time that will complete, since we're
> serializing them. So for this particular use, "upgrade to write" would
> be possible without the general-case deadlock issues.
>
> (c) I didn't think things through, and even with these special
> semantics, my idea is complete garbage
>
> Ok, this may well be true.
>
> Anyway, the advantage of this (if it works) is that it would allow us
> to go back to the _really_ simple original model of just taking this
> lock for reading at the beginning of execve(), and not worrying so
> much about complex nesting or very complex rules for exactly when we
> got the lock and error handling.
>
> The final part when we actually update the credentials and dumpability
> and stuff in execve() is actually fairly simple. So the "upgrade to a
> write lock" phase doesn't worry me too much. It's the interaction
> with all the previous parts (which happen with it held just for
> reading) that tend to be the nastier ones.
>
> And ptrace_attach() really is special, and I think it would be the
> only one that really needs that write lock.
>
> The disadvantage, of course, is that it would require that
> special-case lock semantic, and I might also be missing some thing
> that makes it not work anyway.
>
> Comments? Am I just dreaming of a simpler model without my medications
> again?

Withough reading everything through at least.

* There is also security_setprocattr which needs ptrace and nnp state not to
change it needs to set something that at least selinux's cred
calculations needs to remain constant (like nnp and ptrace).

Which means one thread calling security_setprocattr and another thread
calling exec can deadlock in de_thread.

* Even with your lock and just the ptrace case I can deadlock.
Ptracer: Thread A Tread B
ptrace_attach A
exec
ptrace_attach B
uprade R to RW
---------------------- DEADLOCKED -------------------------

Those are the first two cases I have thought of. There are probably
more.



But fundamentally the only reason we need this information stable
before the point of no return is so that we can return a nice error
code to the process calling exec. Instead of terminating the
process with SIGSEGV.

These are for the most part unlikely scenarios or people would have been
complaining much more loudly about deadlock.

So my plan is to perform the relevant calculations effectively twice.
Once just before the point of no return, and give a graceful return code
if necessary and possible. Once just afer the point of no return, and
SIGSEGV if necessary.



Of course this all only applies to LSMs that refuse to continue under
NNP or ptrace without changing the cred. Linux without those LSMs
enabled will just continue with the original credentials.


So I don't think we will noticably be sacraficing the quality of
the user experience with my plan. In the worst case a deadlock
will become a SIGSEGV killing the execing program.

Eric

p.s. Yes we can do better than a mutex that makes everything mutually
exclusive. I am just starting there for simplicity, and to
see if we need anything better. Unfortuantely too many things are
changing simultaneously for rcu to cover all of the read side
cases.

2020-04-07 01:33:32

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 0/3] exec_update_mutex related cleanups


Linus,

Since you rightly pointed out the code in fs/exec.c is less readable
than it should be right now. Here is where I currently sit on making
that code static where possible and as obvious as possible.

I will resend this after the merge window for a proper review when
people are less likely to be distrcacted but I figured I might as well
send this out now so I can see if anyone runs screaming from this code.

Eric W. Biederman (3):
binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf
exec: Make unlocking exec_update_mutex explict
exec: Rename the flag called_exec_mmap point_of_no_return

arch/x86/ia32/ia32_aout.c | 3 +--
fs/binfmt_aout.c | 2 +-
fs/binfmt_elf_fdpic.c | 2 +-
fs/binfmt_flat.c | 3 +--
fs/exec.c | 18 +++++++++---------
include/linux/binfmts.h | 7 +++----
6 files changed, 16 insertions(+), 19 deletions(-)


Eric

2020-04-07 01:35:23

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf


In 2016 Linus moved install_exec_creds immediately after
setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
potential information leak.

Perform the same cleanup for the other binary formats.

Different binary formats doing the same things the same way makes exec
easier to reason about and easier to maintain.

Putting install_exec_creds immediate after setup_new_exec makes many
simplifications possible in the code.

Ref: 9f834ec18def ("binfmt_elf: switch to new creds when switching to new mm")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/x86/ia32/ia32_aout.c | 3 +--
fs/binfmt_aout.c | 2 +-
fs/binfmt_elf_fdpic.c | 2 +-
fs/binfmt_flat.c | 3 +--
4 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 9bb71abd66bd..37b36a8ce5fa 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -140,6 +140,7 @@ static int load_aout_binary(struct linux_binprm *bprm)
set_personality_ia32(false);

setup_new_exec(bprm);
+ install_exec_creds(bprm);

regs->cs = __USER32_CS;
regs->r8 = regs->r9 = regs->r10 = regs->r11 = regs->r12 =
@@ -156,8 +157,6 @@ static int load_aout_binary(struct linux_binprm *bprm)
if (retval < 0)
return retval;

- install_exec_creds(bprm);
-
if (N_MAGIC(ex) == OMAGIC) {
unsigned long text_addr, map_size;

diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 8e8346a81723..ace587b66904 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -162,6 +162,7 @@ static int load_aout_binary(struct linux_binprm * bprm)
set_personality(PER_LINUX);
#endif
setup_new_exec(bprm);
+ install_exec_creds(bprm);

current->mm->end_code = ex.a_text +
(current->mm->start_code = N_TXTADDR(ex));
@@ -174,7 +175,6 @@ static int load_aout_binary(struct linux_binprm * bprm)
if (retval < 0)
return retval;

- install_exec_creds(bprm);

if (N_MAGIC(ex) == OMAGIC) {
unsigned long text_addr, map_size;
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 240f66663543..6c94c6d53d97 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -353,6 +353,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
current->personality |= READ_IMPLIES_EXEC;

setup_new_exec(bprm);
+ install_exec_creds(bprm);

set_binfmt(&elf_fdpic_format);

@@ -434,7 +435,6 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
current->mm->start_stack = current->mm->start_brk + stack_size;
#endif

- install_exec_creds(bprm);
if (create_elf_fdpic_tables(bprm, current->mm,
&exec_params, &interp_params) < 0)
goto error;
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index 831a2b25ba79..1a1d1fcb893f 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -541,6 +541,7 @@ static int load_flat_file(struct linux_binprm *bprm,
/* OK, This is the point of no return */
set_personality(PER_LINUX_32BIT);
setup_new_exec(bprm);
+ install_exec_creds(bprm);
}

/*
@@ -963,8 +964,6 @@ static int load_flat_binary(struct linux_binprm *bprm)
}
}

- install_exec_creds(bprm);
-
set_binfmt(&flat_format);

#ifdef CONFIG_MMU
--
2.25.0

2020-04-07 01:36:22

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return


Update the comments and make the code easier to understand by
renaming this flag.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 12 ++++++------
include/linux/binfmts.h | 6 +++---
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 28c87020da9b..a61987d6dc33 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1300,12 +1300,12 @@ int flush_old_exec(struct linux_binprm * bprm)
goto out;

/*
- * After setting bprm->called_exec_mmap (to mark that current is
- * using the prepared mm now), we have nothing left of the original
- * process. If anything from here on returns an error, the check
- * in search_binary_handler() will SEGV current.
+ * With the new mm installed it is completely impossible to
+ * fail and return to the original process. If anything from
+ * here on returns an error, the check in
+ * search_binary_handler() will SEGV current.
*/
- bprm->called_exec_mmap = 1;
+ bprm->point_of_no_return = true;
bprm->mm = NULL;

#ifdef CONFIG_POSIX_TIMERS
@@ -1694,7 +1694,7 @@ int search_binary_handler(struct linux_binprm *bprm)

read_lock(&binfmt_lock);
put_binfmt(fmt);
- if (retval < 0 && bprm->called_exec_mmap) {
+ if (retval < 0 && bprm->point_of_no_return) {
/* we got to flush_old_exec() and failed after it */
read_unlock(&binfmt_lock);
force_sigsegv(SIGSEGV);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 6f564b9ad882..8f479dad7931 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -46,10 +46,10 @@ struct linux_binprm {
*/
secureexec:1,
/*
- * Set by flush_old_exec, when exec_mmap has been called.
- * This is past the point of no return.
+ * Set when errors can no longer be returned to the
+ * original userspace.
*/
- called_exec_mmap:1;
+ point_of_no_return:1;
#ifdef __alpha__
unsigned int taso:1;
#endif
--
2.25.0

2020-04-07 01:36:24

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 2/3] exec: Make unlocking exec_update_mutex explict


With install_exec_creds updated to follow immediately after
setup_new_exec, the failure of unshare_sighand is the only
code path where exec_update_mutex is held but not explicitly
unlocked.

Update that code path to explicitly unlock exec_update_mutex.

Remove the unlocking of exec_update_mutex from free_bprm.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 6 +++---
include/linux/binfmts.h | 3 +--
2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index d55710a36056..28c87020da9b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1318,7 +1318,7 @@ int flush_old_exec(struct linux_binprm * bprm)
*/
retval = unshare_sighand(me);
if (retval)
- goto out;
+ goto out_unlock;

set_fs(USER_DS);
me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
@@ -1335,6 +1335,8 @@ int flush_old_exec(struct linux_binprm * bprm)
do_close_on_exec(me->files);
return 0;

+out_unlock:
+ mutex_unlock(&me->signal->exec_update_mutex);
out:
return retval;
}
@@ -1451,8 +1453,6 @@ static void free_bprm(struct linux_binprm *bprm)
{
free_arg_pages(bprm);
if (bprm->cred) {
- if (bprm->called_exec_mmap)
- mutex_unlock(&current->signal->exec_update_mutex);
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index a345d9fed3d8..6f564b9ad882 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -47,8 +47,7 @@ struct linux_binprm {
secureexec:1,
/*
* Set by flush_old_exec, when exec_mmap has been called.
- * This is past the point of no return, when the
- * exec_update_mutex has been taken.
+ * This is past the point of no return.
*/
called_exec_mmap:1;
#ifdef __alpha__
--
2.25.0

2020-04-07 15:59:41

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf

On Mon, Apr 06, 2020 at 08:31:25PM -0500, Eric W. Biederman wrote:
>
> In 2016 Linus moved install_exec_creds immediately after
> setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
> potential information leak.
>
> Perform the same cleanup for the other binary formats.
>
> Different binary formats doing the same things the same way makes exec
> easier to reason about and easier to maintain.
>
> Putting install_exec_creds immediate after setup_new_exec makes many
> simplifications possible in the code.
>
> Ref: 9f834ec18def ("binfmt_elf: switch to new creds when switching to new mm")
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Acked-by: Kees Cook <[email protected]>

-Kees

> ---
> arch/x86/ia32/ia32_aout.c | 3 +--
> fs/binfmt_aout.c | 2 +-
> fs/binfmt_elf_fdpic.c | 2 +-
> fs/binfmt_flat.c | 3 +--
> 4 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
> index 9bb71abd66bd..37b36a8ce5fa 100644
> --- a/arch/x86/ia32/ia32_aout.c
> +++ b/arch/x86/ia32/ia32_aout.c
> @@ -140,6 +140,7 @@ static int load_aout_binary(struct linux_binprm *bprm)
> set_personality_ia32(false);
>
> setup_new_exec(bprm);
> + install_exec_creds(bprm);
>
> regs->cs = __USER32_CS;
> regs->r8 = regs->r9 = regs->r10 = regs->r11 = regs->r12 =
> @@ -156,8 +157,6 @@ static int load_aout_binary(struct linux_binprm *bprm)
> if (retval < 0)
> return retval;
>
> - install_exec_creds(bprm);
> -
> if (N_MAGIC(ex) == OMAGIC) {
> unsigned long text_addr, map_size;
>
> diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
> index 8e8346a81723..ace587b66904 100644
> --- a/fs/binfmt_aout.c
> +++ b/fs/binfmt_aout.c
> @@ -162,6 +162,7 @@ static int load_aout_binary(struct linux_binprm * bprm)
> set_personality(PER_LINUX);
> #endif
> setup_new_exec(bprm);
> + install_exec_creds(bprm);
>
> current->mm->end_code = ex.a_text +
> (current->mm->start_code = N_TXTADDR(ex));
> @@ -174,7 +175,6 @@ static int load_aout_binary(struct linux_binprm * bprm)
> if (retval < 0)
> return retval;
>
> - install_exec_creds(bprm);
>
> if (N_MAGIC(ex) == OMAGIC) {
> unsigned long text_addr, map_size;
> diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
> index 240f66663543..6c94c6d53d97 100644
> --- a/fs/binfmt_elf_fdpic.c
> +++ b/fs/binfmt_elf_fdpic.c
> @@ -353,6 +353,7 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
> current->personality |= READ_IMPLIES_EXEC;
>
> setup_new_exec(bprm);
> + install_exec_creds(bprm);
>
> set_binfmt(&elf_fdpic_format);
>
> @@ -434,7 +435,6 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm)
> current->mm->start_stack = current->mm->start_brk + stack_size;
> #endif
>
> - install_exec_creds(bprm);
> if (create_elf_fdpic_tables(bprm, current->mm,
> &exec_params, &interp_params) < 0)
> goto error;
> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
> index 831a2b25ba79..1a1d1fcb893f 100644
> --- a/fs/binfmt_flat.c
> +++ b/fs/binfmt_flat.c
> @@ -541,6 +541,7 @@ static int load_flat_file(struct linux_binprm *bprm,
> /* OK, This is the point of no return */
> set_personality(PER_LINUX_32BIT);
> setup_new_exec(bprm);
> + install_exec_creds(bprm);
> }
>
> /*
> @@ -963,8 +964,6 @@ static int load_flat_binary(struct linux_binprm *bprm)
> }
> }
>
> - install_exec_creds(bprm);
> -
> set_binfmt(&flat_format);
>
> #ifdef CONFIG_MMU
> --
> 2.25.0
>

--
Kees Cook

2020-04-07 16:04:21

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 2/3] exec: Make unlocking exec_update_mutex explict

On Mon, Apr 06, 2020 at 08:31:52PM -0500, Eric W. Biederman wrote:
>
> With install_exec_creds updated to follow immediately after
> setup_new_exec, the failure of unshare_sighand is the only
> code path where exec_update_mutex is held but not explicitly
> unlocked.
>
> Update that code path to explicitly unlock exec_update_mutex.
>
> Remove the unlocking of exec_update_mutex from free_bprm.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Kees Cook <[email protected]>

-Kees

> ---
> fs/exec.c | 6 +++---
> include/linux/binfmts.h | 3 +--
> 2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index d55710a36056..28c87020da9b 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1318,7 +1318,7 @@ int flush_old_exec(struct linux_binprm * bprm)
> */
> retval = unshare_sighand(me);
> if (retval)
> - goto out;
> + goto out_unlock;
>
> set_fs(USER_DS);
> me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
> @@ -1335,6 +1335,8 @@ int flush_old_exec(struct linux_binprm * bprm)
> do_close_on_exec(me->files);
> return 0;
>
> +out_unlock:
> + mutex_unlock(&me->signal->exec_update_mutex);
> out:
> return retval;
> }
> @@ -1451,8 +1453,6 @@ static void free_bprm(struct linux_binprm *bprm)
> {
> free_arg_pages(bprm);
> if (bprm->cred) {
> - if (bprm->called_exec_mmap)
> - mutex_unlock(&current->signal->exec_update_mutex);
> mutex_unlock(&current->signal->cred_guard_mutex);
> abort_creds(bprm->cred);
> }
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index a345d9fed3d8..6f564b9ad882 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -47,8 +47,7 @@ struct linux_binprm {
> secureexec:1,
> /*
> * Set by flush_old_exec, when exec_mmap has been called.
> - * This is past the point of no return, when the
> - * exec_update_mutex has been taken.
> + * This is past the point of no return.
> */
> called_exec_mmap:1;
> #ifdef __alpha__
> --
> 2.25.0
>

--
Kees Cook

2020-04-07 16:06:55

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return

On Mon, Apr 06, 2020 at 08:32:23PM -0500, Eric W. Biederman wrote:
>
> Update the comments and make the code easier to understand by
> renaming this flag.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

I like it, yes!

Acked-by: Kees Cook <[email protected]>

-Kees

> ---
> fs/exec.c | 12 ++++++------
> include/linux/binfmts.h | 6 +++---
> 2 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 28c87020da9b..a61987d6dc33 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1300,12 +1300,12 @@ int flush_old_exec(struct linux_binprm * bprm)
> goto out;
>
> /*
> - * After setting bprm->called_exec_mmap (to mark that current is
> - * using the prepared mm now), we have nothing left of the original
> - * process. If anything from here on returns an error, the check
> - * in search_binary_handler() will SEGV current.
> + * With the new mm installed it is completely impossible to
> + * fail and return to the original process. If anything from
> + * here on returns an error, the check in
> + * search_binary_handler() will SEGV current.
> */
> - bprm->called_exec_mmap = 1;
> + bprm->point_of_no_return = true;
> bprm->mm = NULL;
>
> #ifdef CONFIG_POSIX_TIMERS
> @@ -1694,7 +1694,7 @@ int search_binary_handler(struct linux_binprm *bprm)
>
> read_lock(&binfmt_lock);
> put_binfmt(fmt);
> - if (retval < 0 && bprm->called_exec_mmap) {
> + if (retval < 0 && bprm->point_of_no_return) {
> /* we got to flush_old_exec() and failed after it */
> read_unlock(&binfmt_lock);
> force_sigsegv(SIGSEGV);
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 6f564b9ad882..8f479dad7931 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -46,10 +46,10 @@ struct linux_binprm {
> */
> secureexec:1,
> /*
> - * Set by flush_old_exec, when exec_mmap has been called.
> - * This is past the point of no return.
> + * Set when errors can no longer be returned to the
> + * original userspace.
> */
> - called_exec_mmap:1;
> + point_of_no_return:1;
> #ifdef __alpha__
> unsigned int taso:1;
> #endif
> --
> 2.25.0
>

--
Kees Cook

2020-04-07 16:12:46

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf

On Mon, Apr 06, 2020 at 08:31:25PM -0500, Eric W. Biederman wrote:
>
> In 2016 Linus moved install_exec_creds immediately after
> setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
> potential information leak.
>
> Perform the same cleanup for the other binary formats.
>
> Different binary formats doing the same things the same way makes exec
> easier to reason about and easier to maintain.
>
> Putting install_exec_creds immediate after setup_new_exec makes many
> simplifications possible in the code.
>
> Ref: 9f834ec18def ("binfmt_elf: switch to new creds when switching to new mm")
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Sure, why not.
Acked-by: Christian Brauner <[email protected]>

2020-04-07 16:18:33

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 2/3] exec: Make unlocking exec_update_mutex explict

On Mon, Apr 06, 2020 at 08:31:52PM -0500, Eric W. Biederman wrote:
>
> With install_exec_creds updated to follow immediately after
> setup_new_exec, the failure of unshare_sighand is the only
> code path where exec_update_mutex is held but not explicitly
> unlocked.
>
> Update that code path to explicitly unlock exec_update_mutex.
>
> Remove the unlocking of exec_update_mutex from free_bprm.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Yeah, assuming that I didn't miss any subtleties just now.
By "explicit" I assume you mean not conditionally unlocked, i.e. we
don't need to check any condition in free_binprm().

Acked-by: Christian Brauner <[email protected]>

2020-04-07 16:22:11

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 3/3] exec: Rename the flag called_exec_mmap point_of_no_return

On Mon, Apr 06, 2020 at 08:32:23PM -0500, Eric W. Biederman wrote:
>
> Update the comments and make the code easier to understand by
> renaming this flag.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

There were only 4 calls where called_exec_mmap was referenced, I could
find. The last one being removed in 2/3 in this series so this seems
fine.

Acked-by: Christian Brauner <[email protected]>

2020-04-07 16:24:29

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] exec_update_mutex related cleanups

On Mon, Apr 06, 2020 at 08:29:50PM -0500, Eric W. Biederman wrote:
>
> Linus,
>
> Since you rightly pointed out the code in fs/exec.c is less readable
> than it should be right now. Here is where I currently sit on making
> that code static where possible and as obvious as possible.
>
> I will resend this after the merge window for a proper review when
> people are less likely to be distrcacted but I figured I might as well
> send this out now so I can see if anyone runs screaming from this code.
>
> Eric W. Biederman (3):
> binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf
> exec: Make unlocking exec_update_mutex explict
> exec: Rename the flag called_exec_mmap point_of_no_return

Under the assumption that we go forward with this approach this seems
like a good cleanup.

Acked-by: Christian Brauner <[email protected]>

2020-04-07 16:26:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2/3] exec: Make unlocking exec_update_mutex explict

Christian Brauner <[email protected]> writes:

> On Mon, Apr 06, 2020 at 08:31:52PM -0500, Eric W. Biederman wrote:
>>
>> With install_exec_creds updated to follow immediately after
>> setup_new_exec, the failure of unshare_sighand is the only
>> code path where exec_update_mutex is held but not explicitly
>> unlocked.
>>
>> Update that code path to explicitly unlock exec_update_mutex.
>>
>> Remove the unlocking of exec_update_mutex from free_bprm.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> Yeah, assuming that I didn't miss any subtleties just now.
> By "explicit" I assume you mean not conditionally unlocked, i.e. we
> don't need to check any condition in free_binprm().

Yes. Not conditionally unlocked is what I meant.

> Acked-by: Christian Brauner <[email protected]>

Eric

2020-04-07 19:52:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Mon, Apr 6, 2020 at 3:20 PM Eric W. Biederman <[email protected]> wrote:
>
> But fundamentally the only reason we need this information stable
> before the point of no return is so that we can return a nice error
> code to the process calling exec. Instead of terminating the
> process with SIGSEGV.

I'd suggest doing it the other way around instead: let the thread that
does the security_setprocattr() die, since execve() is terminating
other threads anyway.

And the easy way to do that is to just make the rule be that anybody
who waits for this thing for write needs to use a killable wait.

So if the execve() got started earlier, and already took the cred lock
(whatever we'll call it) for reading, then zap_other_threads() will
take care of another thread doing setprocattr().

That sounds like a really simple model, no?

Linus

2020-04-07 20:30:43

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1



On 4/7/20 9:50 PM, Linus Torvalds wrote:
> On Mon, Apr 6, 2020 at 3:20 PM Eric W. Biederman <[email protected]> wrote:
>>
>> But fundamentally the only reason we need this information stable
>> before the point of no return is so that we can return a nice error
>> code to the process calling exec. Instead of terminating the
>> process with SIGSEGV.
>
> I'd suggest doing it the other way around instead: let the thread that
> does the security_setprocattr() die, since execve() is terminating
> other threads anyway.
>
> And the easy way to do that is to just make the rule be that anybody
> who waits for this thing for write needs to use a killable wait.
>
> So if the execve() got started earlier, and already took the cred lock
> (whatever we'll call it) for reading, then zap_other_threads() will
> take care of another thread doing setprocattr().
>
> That sounds like a really simple model, no?
>

Maybe, actually I considered this, but I was anxious that making something
that is so far not killable suddenly killable might break other things.

But I am a wimp :-)


Bernd.




> Linus
>

2020-04-07 20:48:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Tue, Apr 7, 2020 at 1:29 PM Bernd Edlinger <[email protected]> wrote:
>
> Maybe, actually I considered this, but I was anxious that making something
> that is so far not killable suddenly killable might break other things.

I don't think it can.

Basically, if you have a execve() and a setprocattr() racing, one or
the other starts first.

And if the execve() started first, then the setprocattr() thread would
get killed by the execve(), and there's no serialization. So you might
as well just say "it got killed before it even started to wait".

So semantically, having a killable wait is basically exactly the same
as losing the race - which wasn't ordered to begin with.

It's not like anybody will see the return value - the thread that
would have gotten the value got killed.

So doing

if (down_writel_killable(&credlock))
return -EINTR;

may *look* like it's new semantics, but it isn't really. That EINTR
error isn't visible to anybody, and everything looks absolutely
identical to "execve() in the other thread started earlier and killed
the thread even before it got to the system call".

Linus

2020-04-08 17:00:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

Linus Torvalds <[email protected]> writes:

> On Mon, Apr 6, 2020 at 3:20 PM Eric W. Biederman <[email protected]> wrote:
>>
>> But fundamentally the only reason we need this information stable
>> before the point of no return is so that we can return a nice error
>> code to the process calling exec. Instead of terminating the
>> process with SIGSEGV.
>
> I'd suggest doing it the other way around instead: let the thread that
> does the security_setprocattr() die, since execve() is terminating
> other threads anyway.
>
> And the easy way to do that is to just make the rule be that anybody
> who waits for this thing for write needs to use a killable wait.
>
> So if the execve() got started earlier, and already took the cred lock
> (whatever we'll call it) for reading, then zap_other_threads() will
> take care of another thread doing setprocattr().
>
> That sounds like a really simple model, no?

Yes. I missed the fact that we could take the lock killable.
We still unfortunately have the deadlock with ptrace.

It might be simpler to make whichever lock we are dealing with per
task_struct instead of per signal_struct. Then we don't even have to
think about what de_thread does or if the lock is taken killable.


Looking at the code in binfmt_elf.c there are about 11 other places
after install_exec_creds where we can fail and would be forced to
terminate the application with SIGSEGV instead of causing fork to fail.




I keep wondering if we could do something similar to vfork. That is
allocate an new task_struct and fully set it up for the post exec
process, and then make it visible under tasklist_lock. Finally we could
free the old process.

That would appear as if everything happened atomically from
the point of view of the rest of the kernel.

As well as fixing all of the deadlocks and making it easy
to ensure we don't have any more weird failures in the future.

Eric

p.s. For tasklist_lock I suspect we can put a lock in struct pid
and use that to guard the task lists in struct pid. Which would
allow for tasklist_lock to be take much less. Then we would
just need a solution for task->parent and task->real_parent and
I think all of the major users of tasklist_lock would be gone.


2020-04-08 17:19:17

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/8/20 5:14 PM, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
>> On Mon, Apr 6, 2020 at 3:20 PM Eric W. Biederman <[email protected]> wrote:
>>>
>>> But fundamentally the only reason we need this information stable
>>> before the point of no return is so that we can return a nice error
>>> code to the process calling exec. Instead of terminating the
>>> process with SIGSEGV.
>>
>> I'd suggest doing it the other way around instead: let the thread that
>> does the security_setprocattr() die, since execve() is terminating
>> other threads anyway.
>>
>> And the easy way to do that is to just make the rule be that anybody
>> who waits for this thing for write needs to use a killable wait.
>>
>> So if the execve() got started earlier, and already took the cred lock
>> (whatever we'll call it) for reading, then zap_other_threads() will
>> take care of another thread doing setprocattr().
>>
>> That sounds like a really simple model, no?
>
> Yes. I missed the fact that we could take the lock killable.
> We still unfortunately have the deadlock with ptrace.
>
> It might be simpler to make whichever lock we are dealing with per
> task_struct instead of per signal_struct. Then we don't even have to
> think about what de_thread does or if the lock is taken killable.
>

I think you said that already, but I did not understand the difference,
could you please give some more details about your idea?


Thanks
Bernd.

>
> Looking at the code in binfmt_elf.c there are about 11 other places
> after install_exec_creds where we can fail and would be forced to
> terminate the application with SIGSEGV instead of causing fork to fail.
>
>
>
>
> I keep wondering if we could do something similar to vfork. That is
> allocate an new task_struct and fully set it up for the post exec
> process, and then make it visible under tasklist_lock. Finally we could
> free the old process.
>
> That would appear as if everything happened atomically from
> the point of view of the rest of the kernel.
>
> As well as fixing all of the deadlocks and making it easy
> to ensure we don't have any more weird failures in the future.
>
> Eric
>
> p.s. For tasklist_lock I suspect we can put a lock in struct pid
> and use that to guard the task lists in struct pid. Which would
> allow for tasklist_lock to be take much less. Then we would
> just need a solution for task->parent and task->real_parent and
> I think all of the major users of tasklist_lock would be gone.
>
>

2020-04-08 18:30:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Wed, Apr 8, 2020 at 8:17 AM Eric W. Biederman <[email protected]> wrote:
>
> Yes. I missed the fact that we could take the lock killable.
> We still unfortunately have the deadlock with ptrace.

That, I feel, is similarly trivial.

Again, anybody who takes the lock for writing should just do so
killably. So you have three cases:

- ptrace wins the race and gets the lock.

Fine, the execve will wait until afterwards.

- ptrace loses the race and is not a thread with execve.

Fine, the execve() won, and the ptrace will wait until after execve.

- ptrace loses the race and is a thread with execve.

Fine, the execve() will kill the thing in dethread() and the ptrace
thread will release the lock and die.

So all three cases are fine, and none of them have any behavioral
differences (as mentioned, the killing is "invisible" to users since
it's fundamentally a race, and you can consider the kill to have
happened before the ptrace started).

> It might be simpler to make whichever lock we are dealing with per
> task_struct instead of per signal_struct. Then we don't even have to
> think about what de_thread does or if the lock is taken killable.

Well, yes, but I think the dethread behavior of killing threads is
required anyway, so..

> I keep wondering if we could do something similar to vfork. That is
> allocate an new task_struct and fully set it up for the post exec
> process, and then make it visible under tasklist_lock. Finally we could
> free the old process.
>
> That would appear as if everything happened atomically from
> the point of view of the rest of the kernel.

I do think that would have been a lovely design originally, and would
avoid a lot of things. So "execve()" would basically look like an exit
and a thread creation with the same pid (without the SIGCHILD to the
parent, obviously)

That would also naturally handle the "flush pending signals" etc issues.

The fact that we created a whole new mm-struct ended up fixing a lot
of problems (even if it was painful to do). This might be similar.

But it's not what we've ever done, and I do suspect you'd run into a
lot of odd small special cases if we were to try to do it now.

So I think it's simpler to just start making the "cred lock waiters
have to be killable" rule. It's not like that's a very complex rule.

Linus

2020-04-08 19:51:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] exec_update_mutex related cleanups

On Mon, Apr 6, 2020 at 6:32 PM Eric W. Biederman <[email protected]> wrote:
>
> I will resend this after the merge window for a proper review when
> people are less likely to be distrcacted but I figured I might as well
> send this out now so I can see if anyone runs screaming from this code.

Ack. It looks sane to me. I had that one question about a further
simplification, but even without that it looks like an improvement.

Linus

2020-04-08 21:21:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf

On Mon, Apr 6, 2020 at 6:34 PM Eric W. Biederman <[email protected]> wrote:
>
> In 2016 Linus moved install_exec_creds immediately after
> setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
> potential information leak.
>
> Perform the same cleanup for the other binary formats

Can we not move it _into_ setup_new_exec() now if you've changed all
the binfmt handlers?

The fewer cases of "this gets called by the low-level handler at
different points" that we have, the better off we'd be, I think. One
of the complexities of our execve() code is that some of it gets
called directly, and some of it gets called by the binfmt handler, and
it's often very hard to see the logic when it jumps out to the binfmt
code and then back to the generic fs/exec.c code..

Linus

2020-04-08 21:56:31

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/3] binfmt: Move install_exec_creds after setup_new_exec to match binfmt_elf

Linus Torvalds <[email protected]> writes:

> On Mon, Apr 6, 2020 at 6:34 PM Eric W. Biederman <[email protected]> wrote:
>>
>> In 2016 Linus moved install_exec_creds immediately after
>> setup_new_exec, in binfmt_elf as a cleanup and as part of closing a
>> potential information leak.
>>
>> Perform the same cleanup for the other binary formats
>
> Can we not move it _into_ setup_new_exec() now if you've changed all
> the binfmt handlers?
>
> The fewer cases of "this gets called by the low-level handler at
> different points" that we have, the better off we'd be, I think. One
> of the complexities of our execve() code is that some of it gets
> called directly, and some of it gets called by the binfmt handler, and
> it's often very hard to see the logic when it jumps out to the binfmt
> code and then back to the generic fs/exec.c code..

Yes. I already have merging of setup_new_exec and install_exec_creds in
my working tree. I just posted the simplest set of patches to get the
idea across.

We can almost merge those two with flush_old_exec as well except for the
code that sets the personality between flush_old_exec and and
setup_new_exec. I am wondering if maybe setting the personality should
be a callback.

Eric

2020-04-09 15:02:32

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

Linus Torvalds <[email protected]> writes:

> On Wed, Apr 8, 2020 at 8:17 AM Eric W. Biederman <[email protected]> wrote:
>>
>> Yes. I missed the fact that we could take the lock killable.
>> We still unfortunately have the deadlock with ptrace.
>
> That, I feel, is similarly trivial.
>
> Again, anybody who takes the lock for writing should just do so
> killably. So you have three cases:
>
> - ptrace wins the race and gets the lock.
>
> Fine, the execve will wait until afterwards.
>
> - ptrace loses the race and is not a thread with execve.
>
> Fine, the execve() won, and the ptrace will wait until after execve.
>
> - ptrace loses the race and is a thread with execve.
>
> Fine, the execve() will kill the thing in dethread() and the ptrace
> thread will release the lock and die.

That would be nice.

That is unfortunately not how ptrace_event(PTRACE_EVENT_EXIT, ...) works.

When a thread going about it's ordinary business receives the SIGKILL
from de_thread the thread changes course and finds it's way to do_exit.
In do_exit the thread calls ptrace_event(PTRACE_EVENT_EXIT, ...) and
blocks waiting for the tracer to let it continue.

Further from previous attempts to fix this we know that there
are pieces of userspace expect that stop to happen. So if the
PTRACE_EVENT_EXIT stop does not happen userspace which is already
attached breaks.

Further this case with ptrace is something we know userspace
does and is is just a matter of bad timing of attaching to the
threads when one thread is exec'ing. So we don't even need to wonder if
userspace would do such a silling thing.



There are a lot similar cases that can happen if userspace inserts
itself into the path of page faults, directly or indirectly,
as long as some wait somewhere ultimately waits for a ptrace attach.


> So all three cases are fine, and none of them have any behavioral
> differences (as mentioned, the killing is "invisible" to users since
> it's fundamentally a race, and you can consider the kill to have
> happened before the ptrace started).

See above.


>> It might be simpler to make whichever lock we are dealing with per
>> task_struct instead of per signal_struct. Then we don't even have to
>> think about what de_thread does or if the lock is taken killable.
>
> Well, yes, but I think the dethread behavior of killing threads is
> required anyway, so..

It is, but it is actually part of the problem.

I think making some of this thread local might solve another easy case
and let us focus more on the really hard problem.

>> I keep wondering if we could do something similar to vfork. That is
>> allocate an new task_struct and fully set it up for the post exec
>> process, and then make it visible under tasklist_lock. Finally we could
>> free the old process.
>>
>> That would appear as if everything happened atomically from
>> the point of view of the rest of the kernel.
>
> I do think that would have been a lovely design originally, and would
> avoid a lot of things. So "execve()" would basically look like an exit
> and a thread creation with the same pid (without the SIGCHILD to the
> parent, obviously)
>
> That would also naturally handle the "flush pending signals" etc issues.
>
> The fact that we created a whole new mm-struct ended up fixing a lot
> of problems (even if it was painful to do). This might be similar.
>
> But it's not what we've ever done, and I do suspect you'd run into a
> lot of odd small special cases if we were to try to do it now.

I completely agree, which is why I haven't been rushing to do that.
But this remains the only idea that I have thought of that would solve all
of the issues.

> So I think it's simpler to just start making the "cred lock waiters
> have to be killable" rule. It's not like that's a very complex rule.

I just looked at the remaining users of cred_guard_mutex and they are
all killable or interruptible. Further all of the places that have been
converted to use the exec_update_mutex are also all killable or
interruptible.

So where we came in is that we had the killable rule and that has what
has allowed this to remain on the backburner for so long. At least you
could kill the affected process from userspace. Unfortunately the
deadlocks still happen.

Eric

2020-04-09 16:55:16

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On 4/9/20 4:58 PM, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
>> On Wed, Apr 8, 2020 at 8:17 AM Eric W. Biederman <[email protected]> wrote:
>>>
>>> Yes. I missed the fact that we could take the lock killable.
>>> We still unfortunately have the deadlock with ptrace.
>>
>> That, I feel, is similarly trivial.
>>
>> Again, anybody who takes the lock for writing should just do so
>> killably. So you have three cases:
>>
>> - ptrace wins the race and gets the lock.
>>
>> Fine, the execve will wait until afterwards.
>>
>> - ptrace loses the race and is not a thread with execve.
>>
>> Fine, the execve() won, and the ptrace will wait until after execve.
>>
>> - ptrace loses the race and is a thread with execve.
>>
>> Fine, the execve() will kill the thing in dethread() and the ptrace
>> thread will release the lock and die.
>
> That would be nice.
>
> That is unfortunately not how ptrace_event(PTRACE_EVENT_EXIT, ...) works.
>
> When a thread going about it's ordinary business receives the SIGKILL
> from de_thread the thread changes course and finds it's way to do_exit.
> In do_exit the thread calls ptrace_event(PTRACE_EVENT_EXIT, ...) and
> blocks waiting for the tracer to let it continue.
>
> Further from previous attempts to fix this we know that there
> are pieces of userspace expect that stop to happen. So if the
> PTRACE_EVENT_EXIT stop does not happen userspace which is already
> attached breaks.
>
> Further this case with ptrace is something we know userspace
> does and is is just a matter of bad timing of attaching to the
> threads when one thread is exec'ing. So we don't even need to wonder if
> userspace would do such a silling thing.
>
>
>
> There are a lot similar cases that can happen if userspace inserts
> itself into the path of page faults, directly or indirectly,
> as long as some wait somewhere ultimately waits for a ptrace attach.
>
>

Remember, as a last resort there is my "insane" 15/16 patch, which
Linus admittedly hates, but it works. If we find a cleaner solution
it can always be reverted, that is just fine for me.

Thanks
Bernd.

>> So all three cases are fine, and none of them have any behavioral
>> differences (as mentioned, the killing is "invisible" to users since
>> it's fundamentally a race, and you can consider the kill to have
>> happened before the ptrace started).
>
> See above.
>
>
>>> It might be simpler to make whichever lock we are dealing with per
>>> task_struct instead of per signal_struct. Then we don't even have to
>>> think about what de_thread does or if the lock is taken killable.
>>
>> Well, yes, but I think the dethread behavior of killing threads is
>> required anyway, so..
>
> It is, but it is actually part of the problem.
>
> I think making some of this thread local might solve another easy case
> and let us focus more on the really hard problem.
>
>>> I keep wondering if we could do something similar to vfork. That is
>>> allocate an new task_struct and fully set it up for the post exec
>>> process, and then make it visible under tasklist_lock. Finally we could
>>> free the old process.
>>>
>>> That would appear as if everything happened atomically from
>>> the point of view of the rest of the kernel.
>>
>> I do think that would have been a lovely design originally, and would
>> avoid a lot of things. So "execve()" would basically look like an exit
>> and a thread creation with the same pid (without the SIGCHILD to the
>> parent, obviously)
>>
>> That would also naturally handle the "flush pending signals" etc issues.
>>
>> The fact that we created a whole new mm-struct ended up fixing a lot
>> of problems (even if it was painful to do). This might be similar.
>>
>> But it's not what we've ever done, and I do suspect you'd run into a
>> lot of odd small special cases if we were to try to do it now.
>
> I completely agree, which is why I haven't been rushing to do that.
> But this remains the only idea that I have thought of that would solve all
> of the issues.
>
>> So I think it's simpler to just start making the "cred lock waiters
>> have to be killable" rule. It's not like that's a very complex rule.
>
> I just looked at the remaining users of cred_guard_mutex and they are
> all killable or interruptible. Further all of the places that have been
> converted to use the exec_update_mutex are also all killable or
> interruptible.
>
> So where we came in is that we had the killable rule and that has what
> has allowed this to remain on the backburner for so long. At least you
> could kill the affected process from userspace. Unfortunately the
> deadlocks still happen.
>
> Eric
>

2020-04-09 17:27:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

On Thu, Apr 9, 2020 at 8:01 AM Eric W. Biederman <[email protected]> wrote:
>
> When a thread going about it's ordinary business receives the SIGKILL
> from de_thread the thread changes course and finds it's way to do_exit.
> In do_exit the thread calls ptrace_event(PTRACE_EVENT_EXIT, ...) and
> blocks waiting for the tracer to let it continue.

Hah.

That code isn't _supposed_ to block.

may_ptrace_stop() is supposed to stop the blocking exactly so that it
doesn't deadlock.

I wonder why that doesn't work..

[ Goes and look ]

Oh. I see.

That ptrace_may_stop() only ever considered core-dumping, not execve().

But if _that_ is the reason for the deadlock, then it's trivially fixed.

Famous last words..

Linus