Souptick and I have been auditing the various page fault handler routines
and we've noticed that graphics drivers assume that a signal should be
able to interrupt a page fault. In contrast, the page cache takes great
care to allow only fatal signals to interrupt a page fault.
I believe (but have not verified) that a non-fatal signal being delivered
to a task which is in the middle of a page fault may well end up in an
infinite loop, attempting to handle the page fault and failing forever.
Here's one of the simpler ones:
ret = mutex_lock_interruptible(&etnaviv_obj->lock);
if (ret)
        return VM_FAULT_NOPAGE;
(many other drivers do essentially the same thing including i915)
On seeing NOPAGE, the fault handler believes the PTE is in the page
table, so does nothing before it returns to arch code at which point
I get lost in the magic assembler macros. I believe it will end up
returning to userspace if the signal is non-fatal, at which point it'll
go right back into the page fault handler, and mutex_lock_interruptible()
will immediately fail. So we've converted a sleeping lock into the most
expensive spinlock.
I don't think the graphics drivers really want to be interrupted by
any signal. I think they want to be interruptible by fatal signals
and should use the mutex_lock_killable / fatal_signal_pending family of
functions. That's going to be a bit of churn, funnelling TASK_KILLABLE
/ TASK_INTERRUPTIBLE all the way down into the dma-fence code. Before
anyone gets started on that, I want to be sure that my analysis is
correct, and the drivers are doing the wrong thing by using interruptible
waits in a page fault handler.
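To make the proposal concrete, here's an untested sketch of the killable
variant of the snippet above (illustrative only, not a real patch):

ret = mutex_lock_killable(&etnaviv_obj->lock);
if (ret)
        /*
         * Only reached when a fatal signal is pending; returning
         * NOPAGE without installing a PTE is harmless because the
         * task is dying and never returns to userspace to retry.
         */
        return VM_FAULT_NOPAGE;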
Quoting Matthew Wilcox (2018-04-02 15:10:58)
>
> Souptick and I have been auditing the various page fault handler routines
> and we've noticed that graphics drivers assume that a signal should be
> able to interrupt a page fault. In contrast, the page cache takes great
> care to allow only fatal signals to interrupt a page fault.
>
> I believe (but have not verified) that a non-fatal signal being delivered
> to a task which is in the middle of a page fault may well end up in an
> infinite loop, attempting to handle the page fault and failing forever.
>
> Here's one of the simpler ones:
>
> ret = mutex_lock_interruptible(&etnaviv_obj->lock);
> if (ret)
>         return VM_FAULT_NOPAGE;
>
> (many other drivers do essentially the same thing including i915)
>
> On seeing NOPAGE, the fault handler believes the PTE is in the page
> table, so does nothing before it returns to arch code at which point
> I get lost in the magic assembler macros. I believe it will end up
> returning to userspace if the signal is non-fatal, at which point it'll
> go right back into the page fault handler, and mutex_lock_interruptible()
> will immediately fail. So we've converted a sleeping lock into the most
> expensive spinlock.
I'll ask the obvious question: why isn't the signal handled on return to
userspace?
> I don't think the graphics drivers really want to be interrupted by
> any signal.
Assume the worst case and we may block for 10s. Even a 10ms delay may be
unacceptable to some signal handlers (one presumes). For the number one
^C usecase, yes that may be reduced to only bother if it's killable, but
I wonder if there are not timing loops (e.g. setitimer in Xorg < 1.19)
that want to be able to interrupt random blockages.
-Chris
On Tue, Apr 03, 2018 at 01:33:15PM +0100, Chris Wilson wrote:
> Quoting Matthew Wilcox (2018-04-02 15:10:58)
> > Souptick and I have been auditing the various page fault handler routines
> > and we've noticed that graphics drivers assume that a signal should be
> > able to interrupt a page fault. In contrast, the page cache takes great
> > care to allow only fatal signals to interrupt a page fault.
> >
> > I believe (but have not verified) that a non-fatal signal being delivered
> > to a task which is in the middle of a page fault may well end up in an
> > infinite loop, attempting to handle the page fault and failing forever.
> >
> > Here's one of the simpler ones:
> >
> > ret = mutex_lock_interruptible(&etnaviv_obj->lock);
> > if (ret)
> >         return VM_FAULT_NOPAGE;
> >
> > (many other drivers do essentially the same thing including i915)
> >
> > On seeing NOPAGE, the fault handler believes the PTE is in the page
> > table, so does nothing before it returns to arch code at which point
> > I get lost in the magic assembler macros. I believe it will end up
> > returning to userspace if the signal is non-fatal, at which point it'll
> > go right back into the page fault handler, and mutex_lock_interruptible()
> > will immediately fail. So we've converted a sleeping lock into the most
> > expensive spinlock.
>
> I'll ask the obvious question: why isn't the signal handled on return to
> userspace?
As I said, I don't know exactly how it works due to getting lost in the
assembler parts of kernel entry/exit. But if it did work, then wouldn't
the page cache use it, rather than taking such great pains to only handle
fatal signals? See commit 37b23e0525d393d48a7d59f870b3bc061a30ccdb
which introduced it.
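For reference, the pattern that commit introduced looks roughly like
this (paraphrased from mm/filemap.c and the x86 fault handler, not
quoted verbatim):

/* filemap: the wait for the page lock is killable, never interruptible */
if (flags & FAULT_FLAG_KILLABLE) {
        if (__lock_page_killable(page)) {
                /* fatal signal: drop mmap_sem, caller sees VM_FAULT_RETRY */
                up_read(&mm->mmap_sem);
                return 0;
        }
} else
        __lock_page(page);

/* arch fault handler: only a fatal signal aborts the retry loop */
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
        return;        /* task is dying; the faulting insn never re-runs */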
One thing I did come up with is: What if the signal handler is the
one touching the page that needs to be faulted in? And for the page
cache case, what if the page that needs to be faulted in is the page
containing the text of the signal handler? (I don't think we support
mapping graphics memory PROT_EXEC ;-)
> > I don't think the graphics drivers really want to be interrupted by
> > any signal.
>
> Assume the worst case and we may block for 10s. Even a 10ms delay may be
> unacceptable to some signal handlers (one presumes). For the number one
> ^C usecase, yes that may be reduced to only bother if it's killable, but
> I wonder if there are not timing loops (e.g. setitimer in Xorg < 1.19)
> that want to be able to interrupt random blockages.
Ah, setitimer / SIGALRM. So what do we want to have happen if that
signal handler touches the mmaped device memory?
On 04/03/2018 02:33 PM, Chris Wilson wrote:
> Quoting Matthew Wilcox (2018-04-02 15:10:58)
>> Souptick and I have been auditing the various page fault handler routines
>> and we've noticed that graphics drivers assume that a signal should be
>> able to interrupt a page fault. In contrast, the page cache takes great
>> care to allow only fatal signals to interrupt a page fault.
>>
>> I believe (but have not verified) that a non-fatal signal being delivered
>> to a task which is in the middle of a page fault may well end up in an
>> infinite loop, attempting to handle the page fault and failing forever.
>>
>> Here's one of the simpler ones:
>>
>> ret = mutex_lock_interruptible(&etnaviv_obj->lock);
>> if (ret)
>>         return VM_FAULT_NOPAGE;
>>
>> (many other drivers do essentially the same thing including i915)
>>
>> On seeing NOPAGE, the fault handler believes the PTE is in the page
>> table, so does nothing before it returns to arch code at which point
>> I get lost in the magic assembler macros. I believe it will end up
>> returning to userspace if the signal is non-fatal, at which point it'll
>> go right back into the page fault handler, and mutex_lock_interruptible()
>> will immediately fail. So we've converted a sleeping lock into the most
>> expensive spinlock.
> I'll ask the obvious question: why isn't the signal handled on return to
> userspace?
+1
>
>> I don't think the graphics drivers really want to be interrupted by
>> any signal.
> Assume the worst case and we may block for 10s. Even a 10ms delay may be
> unacceptable to some signal handlers (one presumes). For the number one
> ^C usecase, yes that may be reduced to only bother if it's killable, but
> I wonder if there are not timing loops (e.g. setitimer in Xorg < 1.19)
> that want to be able to interrupt random blockages.
> -Chris
I think the TTM page fault handler originally set the standard for this.
First, IMO any critical section that waits for the GPU (like typically
the page fault handler does), should be locked at least killable. The
need for interruptible locks came from the X server's silken mouse
relying on signals for smooth mouse operations: You didn't want the X
server to be stuck in the kernel waiting for GPU completion when it
should handle the cursor move request. Now that doesn't seem to be the
case anymore, but to reiterate Chris' question, why would the signal
persist once returned to user-space?
/Thomas
On Tue, Apr 03, 2018 at 02:20:43PM +0100, Chris Wilson wrote:
> Quoting Matthew Wilcox (2018-04-03 14:10:25)
> > On Tue, Apr 03, 2018 at 01:33:15PM +0100, Chris Wilson wrote:
> > > Quoting Matthew Wilcox (2018-04-02 15:10:58)
> > > > I don't think the graphics drivers really want to be interrupted by
> > > > any signal.
> > >
> > > Assume the worst case and we may block for 10s. Even a 10ms delay may be
> > > unacceptable to some signal handlers (one presumes). For the number one
> > > ^C usecase, yes that may be reduced to only bother if it's killable, but
> > > I wonder if there are not timing loops (e.g. setitimer in Xorg < 1.19)
> > > that want to be able to interrupt random blockages.
> >
> > Ah, setitimer / SIGALRM. So what do we want to have happen if that
> > signal handler touches the mmaped device memory?
>
> Burn in a great ball of fire :) Isn't that what usually happens if you
> do anything in a signal handler?
I don't know. My mummy and daddy don't let me play with sharp things
like signals.
> Hmm, if SIGBUS has a handler does that count as a killable signal? The
> ddx does have code to service SIGBUS emitted when accessing the mmapped
> pointer that may result from the page insertion failing with no memory
> (or other random error). There we stop accessing via the pointer and
> use another indirect method.
Any signal with a handler is non-fatal, and so a call to
mutex_lock_killable() would not return if SIGBUS was delivered to a thread
blocking in a page fault. mutex_lock_interruptible() would return -EINTR.
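To illustrate the difference (toy example, not code from any driver):

err = mutex_lock_interruptible(&obj->lock);
/* returns -EINTR for any pending signal, a handled SIGBUS included */

err = mutex_lock_killable(&obj->lock);
/* sleeps through a handled SIGBUS; returns -EINTR only for SIGKILL or
 * an unhandled signal whose default action kills the task */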
On Tue, Apr 03, 2018 at 03:12:35PM +0200, Thomas Hellstrom wrote:
> I think the TTM page fault handler originally set the standard for this.
> First, IMO any critical section that waits for the GPU (like typically the
> page fault handler does), should be locked at least killable. The need for
> interruptible locks came from the X server's silken mouse relying on signals
> for smooth mouse operations: You didn't want the X server to be stuck in the
> kernel waiting for GPU completion when it should handle the cursor move
> request. Now that doesn't seem to be the case anymore, but to reiterate
> Chris' question, why would the signal persist once returned to user-space?
Yeah, you graphics people have had to deal with much more recalcitrant
hardware than most of the rest of us ... and less reasonable user
expectations ("My graphics card was doing something and I expected
everything else to keep going" vs "My hard drive died and my kernel
panicked, oh well.")
I don't know exactly how the signal code works at the delivery end;
I'm not sure when TIF_SIGPENDING gets cleared. I just get concerned
when I see one bit of kernel code doing things in a very complicated
and careful manner and another bit of kernel code blithely assuming
that everything's going to be OK.
On Tue, Apr 03, 2018 at 07:48:29AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 03, 2018 at 03:12:35PM +0200, Thomas Hellstrom wrote:
> > I think the TTM page fault handler originally set the standard for this.
> > First, IMO any critical section that waits for the GPU (like typically the
> > page fault handler does), should be locked at least killable. The need for
> > interruptible locks came from the X server's silken mouse relying on signals
> > for smooth mouse operations: You didn't want the X server to be stuck in the
> > kernel waiting for GPU completion when it should handle the cursor move
> > request. Now that doesn't seem to be the case anymore, but to reiterate
> > Chris' question, why would the signal persist once returned to user-space?
>
> Yeah, you graphics people have had to deal with much more recalcitrant
> hardware than most of the rest of us ... and less reasonable user
> expectations ("My graphics card was doing something and I expected
> everything else to keep going" vs "My hard drive died and my kernel
> panicked, oh well.")
>
> I don't know exactly how the signal code works at the delivery end;
> I'm not sure when TIF_SIGPENDING gets cleared. I just get concerned
> when I see one bit of kernel code doing things in a very complicated
> and careful manner and another bit of kernel code blithely assuming
> that everything's going to be OK.
I think your last line pretty much sums up the proper attitude when writing
gpu drivers:
https://i.imgflip.com/27nm7w.jpg
Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Mon, Apr 02, 2018 at 07:10:58AM -0700, Matthew Wilcox wrote:
>
> Souptick and I have been auditing the various page fault handler routines
> and we've noticed that graphics drivers assume that a signal should be
> able to interrupt a page fault. In contrast, the page cache takes great
> care to allow only fatal signals to interrupt a page fault.
>
> I believe (but have not verified) that a non-fatal signal being delivered
> to a task which is in the middle of a page fault may well end up in an
> infinite loop, attempting to handle the page fault and failing forever.
>
> Here's one of the simpler ones:
>
> ret = mutex_lock_interruptible(&etnaviv_obj->lock);
> if (ret)
>         return VM_FAULT_NOPAGE;
>
> (many other drivers do essentially the same thing including i915)
>
> On seeing NOPAGE, the fault handler believes the PTE is in the page
> table, so does nothing before it returns to arch code at which point
> I get lost in the magic assembler macros. I believe it will end up
> returning to userspace if the signal is non-fatal, at which point it'll
> go right back into the page fault handler, and mutex_lock_interruptible()
> will immediately fail. So we've converted a sleeping lock into the most
> expensive spinlock.
>
> I don't think the graphics drivers really want to be interrupted by
> any signal. I think they want to be interruptible by fatal signals
> and should use the mutex_lock_killable / fatal_signal_pending family of
> functions. That's going to be a bit of churn, funnelling TASK_KILLABLE
> / TASK_INTERRUPTIBLE all the way down into the dma-fence code. Before
> anyone gets started on that, I want to be sure that my analysis is
> correct, and the drivers are doing the wrong thing by using interruptible
> waits in a page fault handler.
So we've done some experiments for the case where the fault originated
from kernel context (copy_to|from_user and friends). The fixup code seems
to retry the copy once after the fault (in copy_user_handle_tail); if that
fails again we get a short read/write. This might result in an -EFAULT, a
short read()/write(), or anything else really, depending upon the syscall
API.
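For illustration, a made-up call site of the shape we tested (not the
actual test case):

/* user_buf happens to point into a GPU BO mmap */
if (copy_to_user(user_buf, kernel_data, len)) {
        /*
         * The GPU fault handler refused to install the PTE because a
         * signal was pending; the fixup retried once via
         * copy_user_handle_tail(), faulted again, and we got a short
         * copy. The signal is still pending, but the syscall won't be
         * restarted after returning -EFAULT.
         */
        return -EFAULT;
}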
Except in some code paths in gpu drivers, where we convert anything into
-ERESTARTSYS/EINTR if there's a signal pending, it won't ever result in the
syscall getting restarted (well, except maybe short reads/writes, if
userspace bothers with that).
So I guess gpu fault handlers indeed break the kernel's expectations, but
then I think we're getting away with that because the inner workings of
gpu memory objects are all heavily abstracted away by opengl/vulkan and
friends.
I guess what we could do is try to only do killable sleeps if it's a
kernel fault, but that means wiring a flag through all the callchains. Not
pretty, unless there's a magic set of functions that would convert
all interruptible sleeps to killable ones only for us.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Wed, Apr 04, 2018 at 11:32:54AM +0200, Daniel Vetter wrote:
> So we've done some experiments for the case where the fault originated
> from kernel context (copy_to|from_user and friends). The fixup code seems
> to retry the copy once after the fault (in copy_user_handle_tail); if that
> fails again we get a short read/write. This might result in an -EFAULT, a
> short read()/write(), or anything else really, depending upon the syscall
> API.
>
> Except in some code paths in gpu drivers, where we convert anything into
> -ERESTARTSYS/EINTR if there's a signal pending, it won't ever result in the
> syscall getting restarted (well, except maybe short reads/writes, if
> userspace bothers with that).
>
> So I guess gpu fault handlers indeed break the kernel's expectations, but
> then I think we're getting away with that because the inner workings of
> gpu memory objects are all heavily abstracted away by opengl/vulkan and
> friends.
>
> I guess what we could do is try to only do killable sleeps if it's a
> kernel fault, but that means wiring a flag through all the callchains. Not
> pretty, unless there's a magic set of functions that would convert
> all interruptible sleeps to killable ones only for us.
I actually have plans to allow mutex_lock_{interruptible,killable} to
return -EWOULDBLOCK if a flag is set. So this doesn't seem entirely
unrelated. Something like this perhaps:
struct task_struct {
+        unsigned int sleep_state;
};

static noinline int __sched
-__mutex_lock_interruptible_slowpath(struct mutex *lock)
+__mutex_lock_slowpath(struct mutex *lock, long state)
{
-        return __mutex_lock(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
+        if (state == TASK_NOBLOCK)
+                return -EWOULDBLOCK;
+        return __mutex_lock(lock, state, 0, NULL, _RET_IP_);
}

+int __sched mutex_lock_state(struct mutex *lock, long state)
+{
+        might_sleep();
+
+        if (__mutex_trylock_fast(lock))
+                return 0;
+
+        return __mutex_lock_slowpath(lock, state);
+}
+EXPORT_SYMBOL(mutex_lock_state);
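The existing entry points would then become thin wrappers around it
(sketch, not part of the diff above):

int __sched mutex_lock_interruptible(struct mutex *lock)
{
        return mutex_lock_state(lock, TASK_INTERRUPTIBLE);
}

int __sched mutex_lock_killable(struct mutex *lock)
{
        return mutex_lock_state(lock, TASK_KILLABLE);
}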
Then the page fault handler can do something like:
old_state = current->sleep_state;
current->sleep_state = TASK_INTERRUPTIBLE;
...
current->sleep_state = old_state;
This has the page-fault-in-a-signal-handler problem. I don't know if
there's a way to determine if we're already in a signal handler and use
a different sleep_state ...?
On Wed, Apr 4, 2018 at 4:39 PM, Matthew Wilcox <[email protected]> wrote:
> On Wed, Apr 04, 2018 at 11:32:54AM +0200, Daniel Vetter wrote:
>> So we've done some experiments for the case where the fault originated
>> from kernel context (copy_to|from_user and friends). The fixup code seems
>> to retry the copy once after the fault (in copy_user_handle_tail); if that
>> fails again we get a short read/write. This might result in an -EFAULT, a
>> short read()/write(), or anything else really, depending upon the syscall
>> API.
>>
>> Except in some code paths in gpu drivers, where we convert anything into
>> -ERESTARTSYS/EINTR if there's a signal pending, it won't ever result in the
>> syscall getting restarted (well, except maybe short reads/writes, if
>> userspace bothers with that).
>>
>> So I guess gpu fault handlers indeed break the kernel's expectations, but
>> then I think we're getting away with that because the inner workings of
>> gpu memory objects are all heavily abstracted away by opengl/vulkan and
>> friends.
>>
>> I guess what we could do is try to only do killable sleeps if it's a
>> kernel fault, but that means wiring a flag through all the callchains. Not
>> pretty, unless there's a magic set of functions that would convert
>> all interruptible sleeps to killable ones only for us.
>
> I actually have plans to allow mutex_lock_{interruptible,killable} to
> return -EWOULDBLOCK if a flag is set. So this doesn't seem entirely
> unrelated. Something like this perhaps:
>
> struct task_struct {
> +        unsigned int sleep_state;
> };
>
> static noinline int __sched
> -__mutex_lock_interruptible_slowpath(struct mutex *lock)
> +__mutex_lock_slowpath(struct mutex *lock, long state)
> {
> -        return __mutex_lock(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
> +        if (state == TASK_NOBLOCK)
> +                return -EWOULDBLOCK;
> +        return __mutex_lock(lock, state, 0, NULL, _RET_IP_);
> }
>
> +int __sched mutex_lock_state(struct mutex *lock, long state)
> +{
> +        might_sleep();
> +
> +        if (__mutex_trylock_fast(lock))
> +                return 0;
> +
> +        return __mutex_lock_slowpath(lock, state);
> +}
> +EXPORT_SYMBOL(mutex_lock_state);
>
> Then the page fault handler can do something like:
>
> old_state = current->sleep_state;
> current->sleep_state = TASK_INTERRUPTIBLE;
> ...
> current->sleep_state = old_state;
>
>
> This has the page-fault-in-a-signal-handler problem. I don't know if
> there's a way to determine if we're already in a signal handler and use
> a different sleep_state ...?
Not sure what problem you're trying to solve, but I don't think that's
the one we have. The only way what we do goes wrong is if the fault
originates from kernel context. For faults from the signal handler I
think you just get to keep the pieces. Faults from kernel we can
detect through FAULT_FLAG_USER.
The issue I'm seeing is the following:
1. Some kernel code does copy_*_user, and it points at a gpu mmap region.
2. We fault and go into the gpu driver fault handler. That refuses to
insert the pte because a signal is pending (because of all the
interruptible waits and locks).
3. Fixup section runs, which afaict tries to do the copy once more
using copy_user_handle_tail.
4. We fault again, because the pte is still not present.
5. GPU driver is still refusing to install the pte because signals are pending.
6. Fixup section for copy_user_handle_tail just bails out.
7. copy_*_user returns and indicates that not all bytes have been copied.
8. syscall (or whatever it is) bails out and returns to userspace,
most likely with -EFAULT (but this ofc depends upon the syscall and
what it should do when userspace access faults).
9. Signal finally gets handled, but the syscall already failed, and no
one will restart it. If userspace is prudent, it might fail (or maybe
hit an assert or something).
Or maybe I'm confused by your diff, since nothing seems to use
current->sleep_state. The problem is also that it's any sleep we do
(they all tend to be interruptible, at least when waiting for the gpu
or taking any locks that might be held while waiting for the gpu, or
anything else that might be blocked waiting for the gpu really). So
only patching mutex_lock won't fix this.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Wed, Apr 04, 2018 at 05:15:46PM +0200, Daniel Vetter wrote:
> On Wed, Apr 4, 2018 at 4:39 PM, Matthew Wilcox <[email protected]> wrote:
> > I actually have plans to allow mutex_lock_{interruptible,killable} to
> > return -EWOULDBLOCK if a flag is set. So this doesn't seem entirely
> > unrelated. Something like this perhaps:
> >
> > struct task_struct {
> > +        unsigned int sleep_state;
> > };
> >
> > static noinline int __sched
> > -__mutex_lock_interruptible_slowpath(struct mutex *lock)
> > +__mutex_lock_slowpath(struct mutex *lock, long state)
> > {
> > -        return __mutex_lock(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
> > +        if (state == TASK_NOBLOCK)
> > +                return -EWOULDBLOCK;
> > +        return __mutex_lock(lock, state, 0, NULL, _RET_IP_);
> > }
> >
> > +int __sched mutex_lock_state(struct mutex *lock, long state)
> > +{
> > +        might_sleep();
> > +
> > +        if (__mutex_trylock_fast(lock))
> > +                return 0;
> > +
> > +        return __mutex_lock_slowpath(lock, state);
> > +}
> > +EXPORT_SYMBOL(mutex_lock_state);
> >
> > Then the page fault handler can do something like:
> >
> > old_state = current->sleep_state;
> > current->sleep_state = TASK_INTERRUPTIBLE;
> > ...
> > current->sleep_state = old_state;
> >
> >
> > This has the page-fault-in-a-signal-handler problem. I don't know if
> > there's a way to determine if we're already in a signal handler and use
> > a different sleep_state ...?
>
> Not sure what problem you're trying to solve, but I don't think that's
> the one we have. The only way what we do goes wrong is if the fault
> originates from kernel context. For faults from the signal handler I
> think you just get to keep the pieces. Faults from kernel we can
> detect through FAULT_FLAG_USER.
Gah, I didn't explain well enough ;-(
From the get_user_pages (and similar) handlers, we'd do
old_state = current->sleep_state;
current->sleep_state = TASK_KILLABLE;
...
current->sleep_state = old_state;
So you wouldn't need to discriminate on whether FAULT_FLAG_USER was set,
but could just use current->sleep_state.
> The issue I'm seeing is the following:
> 1. Some kernel code does copy_*_user, and it points at a gpu mmap region.
> 2. We fault and go into the gpu driver fault handler. That refuses to
> insert the pte because a signal is pending (because of all the
> interruptible waits and locks).
> 3. Fixup section runs, which afaict tries to do the copy once more
> using copy_user_handle_tail.
> 4. We fault again, because the pte is still not present.
> 5. GPU driver is still refusing to install the pte because signals are pending.
> 6. Fixup section for copy_user_handle_tail just bails out.
> 7. copy_*_user returns and indicates that not all bytes have been copied.
> 8. syscall (or whatever it is) bails out and returns to userspace,
> most likely with -EFAULT (but this ofc depends upon the syscall and
> what it should do when userspace access faults).
> 9. Signal finally gets handled, but the syscall already failed, and no
> one will restart it. If userspace is prudent, it might fail (or maybe
> hit an assert or something).
I think my patch above fixes this. It makes the syscall killable rather
than interruptible, so it can never observe the short read / -EFAULT
return if it gets a fatal signal, and the non-fatal signal will be held
off until the syscall completes.
> Or maybe I'm confused by your diff, since nothing seems to use
> current->sleep_state. The problem is also that it's any sleep we do
> (they all tend to be interruptible, at least when waiting for the gpu
> or taking any locks that might be held while waiting for the gpu, or
> anything else that might be blocked waiting for the gpu really). So
> only patching mutex_lock won't fix this.
Sure, I was only patching mutex_lock_state in as an illustration.
I've also got a 'state' equivalent for wait_on_page_bit() (although
I'm not sure you care ...).
Looks like you'd need wait_for_completion_state() and
wait_event_state_timeout() as well.
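Hypothetical usage, assuming those grow the obvious extra 'state'
argument (signature and names invented here, nothing like this exists
yet):

/* a fence-style wait with the sleep state supplied by the caller
 * instead of being hardcoded to TASK_INTERRUPTIBLE; 'wq' and
 * FENCE_FLAG_SIGNALED are made-up names */
ret = wait_event_state_timeout(fence->wq,
                               test_bit(FENCE_FLAG_SIGNALED, &fence->flags),
                               timeout, state);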
On Wed, Apr 4, 2018 at 6:24 PM, Matthew Wilcox <[email protected]> wrote:
> On Wed, Apr 04, 2018 at 05:15:46PM +0200, Daniel Vetter wrote:
>> On Wed, Apr 4, 2018 at 4:39 PM, Matthew Wilcox <[email protected]> wrote:
>> > I actually have plans to allow mutex_lock_{interruptible,killable} to
>> > return -EWOULDBLOCK if a flag is set. So this doesn't seem entirely
>> > unrelated. Something like this perhaps:
>> >
>> > struct task_struct {
>> > +        unsigned int sleep_state;
>> > };
>> >
>> > static noinline int __sched
>> > -__mutex_lock_interruptible_slowpath(struct mutex *lock)
>> > +__mutex_lock_slowpath(struct mutex *lock, long state)
>> > {
>> > -        return __mutex_lock(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
>> > +        if (state == TASK_NOBLOCK)
>> > +                return -EWOULDBLOCK;
>> > +        return __mutex_lock(lock, state, 0, NULL, _RET_IP_);
>> > }
>> >
>> > +int __sched mutex_lock_state(struct mutex *lock, long state)
>> > +{
>> > +        might_sleep();
>> > +
>> > +        if (__mutex_trylock_fast(lock))
>> > +                return 0;
>> > +
>> > +        return __mutex_lock_slowpath(lock, state);
>> > +}
>> > +EXPORT_SYMBOL(mutex_lock_state);
>> >
>> > Then the page fault handler can do something like:
>> >
>> > old_state = current->sleep_state;
>> > current->sleep_state = TASK_INTERRUPTIBLE;
>> > ...
>> > current->sleep_state = old_state;
>> >
>> >
>> > This has the page-fault-in-a-signal-handler problem. I don't know if
>> > there's a way to determine if we're already in a signal handler and use
>> > a different sleep_state ...?
>>
>> Not sure what problem you're trying to solve, but I don't think that's
>> the one we have. The only way what we do goes wrong is if the fault
>> originates from kernel context. For faults from the signal handler I
>> think you just get to keep the pieces. Faults from kernel we can
>> detect through FAULT_FLAG_USER.
>
> Gah, I didn't explain well enough ;-(
>
> From the get_user_pages (and similar) handlers, we'd do
>
> old_state = current->sleep_state;
> current->sleep_state = TASK_KILLABLE;
> ...
> current->sleep_state = old_state;
>
> So you wouldn't need to discriminate on whether FAULT_FLAG_USER was set,
> but could just use current->sleep_state.
>
>> The issue I'm seeing is the following:
>> 1. Some kernel code does copy_*_user, and it points at a gpu mmap region.
>> 2. We fault and go into the gpu driver fault handler. That refuses to
>> insert the pte because a signal is pending (because of all the
>> interruptible waits and locks).
>> 3. Fixup section runs, which afaict tries to do the copy once more
>> using copy_user_handle_tail.
>> 4. We fault again, because the pte is still not present.
>> 5. GPU driver is still refusing to install the pte because signals are pending.
>> 6. Fixup section for copy_user_handle_tail just bails out.
>> 7. copy_*_user returns and indicates that not all bytes have been copied.
>> 8. syscall (or whatever it is) bails out and returns to userspace,
>> most likely with -EFAULT (but this ofc depends upon the syscall and
>> what it should do when userspace access faults).
>> 9. Signal finally gets handled, but the syscall already failed, and no
>> one will restart it. If userspace is prudent, it might fail (or maybe
>> hit an assert or something).
>
> I think my patch above fixes this. It makes the syscall killable rather
> than interruptible, so it can never observe the short read / -EFAULT
> return if it gets a fatal signal, and the non-fatal signal will be held
> off until the syscall completes.
>
>> Or maybe I'm confused by your diff, since nothing seems to use
>> current->sleep_state. The problem is also that it's any sleep we do
>> (they all tend to be interruptible, at least when waiting for the gpu
>> or taking any locks that might be held while waiting for the gpu, or
>> anything else that might be blocked waiting for the gpu really). So
>> only patching mutex_lock won't fix this.
>
> Sure, I was only patching mutex_lock_state in as an illustration.
> I've also got a 'state' equivalent for wait_on_page_bit() (although
> I'm not sure you care ...).
>
> Looks like you'd need wait_for_completion_state() and
> wait_event_state_timeout() as well.
Yeah, plus ww_mutex_lock_state, and same for the dma_fence wait
functions, but we can patch those once the core stuff has landed.
The part I'm wondering about, that you didn't really illustrate: Are
mutex_lock_interruptible&friends supposed to automatically use
current->sleep_state, or are we somehow supposed to pass that through
the call stack? Or is the idea that we'd mass-convert all the gpu code
that can be reached from the fault handlers to use the new _state
variants, which will internally consult sleep_state?
Asking since the former would be best for us: Most paths can do
interruptible, but some can't, so changing the current->state logic to
take the most limiting of both the explicitly passed sleep_state and
the one set in current->sleep_state would be best. Otherwise, if you
have a path that can at most do killable waits, but its context might
either allow fully interruptible or killable or even neither, then
there's still some fragile code around.
Bonus points if you could even nest assignments to
current->sleep_state (even when there's not really all that many
options), similar to how preempt_disable and other atomic sections
nest.
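Something like this hypothetical helper is what I have in mind (names
entirely made up):

/*
 * The effective sleep state is the more restrictive of what the call
 * site asks for and what the context recorded in current->sleep_state
 * allows.
 */
static inline long effective_sleep_state(long requested)
{
        if (current->sleep_state == TASK_KILLABLE &&
            requested == TASK_INTERRUPTIBLE)
                return TASK_KILLABLE;
        return requested;
}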
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch