LinuxLists.cc - [BUG] completely bonkers use of set_need_resched + VM_FAULT

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

Op 12-09-13 17:06, Peter Zijlstra schreef:
> Hi Dave,
>
> So I'm poking around the preemption code and stumbled upon:
>
> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>
> All these sites basically do:
>
> while (!trylock())
> yield();
>
> which is a horrible and broken locking pattern.
>
> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
> task that preempted the lock holder at FIFOn.
>
> Secondly the implementation is worse than usual by abusing
> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
> doesn't retry, but you're using it as a get out of fault path. And
> you're using set_need_resched() which is not something a driver should
> _ever_ touch.
>
> Now I'm going to take away set_need_resched() -- and while you can
> 'reimplement' it using set_thread_flag() you're not going to do that
> because it will be broken due to changes to the preempt code.
>
> So please as to fix ASAP and don't allow anybody to trick you into
> merging silly things like that again ;-)
>
Agreed, but this is a case of locking inversion. How do you propose to get around that?

~Maarten

2013-09-12 15:14:29

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 05:11:25PM +0200, Maarten Lankhorst wrote:
> Op 12-09-13 17:06, Peter Zijlstra schreef:
> > Hi Dave,
> >
> > So I'm poking around the preemption code and stumbled upon:
> >
> > drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
> > drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
> > drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
> > drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
> >
> > All these sites basically do:
> >
> > while (!trylock())
> > yield();
> >
> > which is a horrible and broken locking pattern.
> >
> > Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
> > task that preempted the lock holder at FIFOn.
> >
> > Secondly the implementation is worse than usual by abusing
> > VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
> > doesn't retry, but you're using it as a get out of fault path. And
> > you're using set_need_resched() which is not something a driver should
> > _ever_ touch.
> >
> > Now I'm going to take away set_need_resched() -- and while you can
> > 'reimplement' it using set_thread_flag() you're not going to do that
> > because it will be broken due to changes to the preempt code.
> >
> > So please as to fix ASAP and don't allow anybody to trick you into
> > merging silly things like that again ;-)
> >
> Agreed, but this is a case of locking inversion. How do you propose to get around that?

me? No idea, I've never looked at the actual locking in DRM. Someone
who's familiar with that code would have to tackle that.

I just spotted the fail because I was going to remove set_need_resched()
and had a WTF moment when I tripped over this stuff.

2013-09-12 15:36:45

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>
> So I'm poking around the preemption code and stumbled upon:
>
> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>
> All these sites basically do:
>
> while (!trylock())
> yield();
>
> which is a horrible and broken locking pattern.
>
> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
> task that preempted the lock holder at FIFOn.
>
> Secondly the implementation is worse than usual by abusing
> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
> doesn't retry, but you're using it as a get out of fault path. And
> you're using set_need_resched() which is not something a driver should
> _ever_ touch.
>
> Now I'm going to take away set_need_resched() -- and while you can
> 'reimplement' it using set_thread_flag() you're not going to do that
> because it will be broken due to changes to the preempt code.
>
> So please as to fix ASAP and don't allow anybody to trick you into
> merging silly things like that again ;-)

The set_need_resched in i915_gem.c:i915_gem_fault can actually be
removed. It was there to give the error handler a chance to sneak in
and reset the hw/sw tracking when the gpu is dead. That hack goes back
to the days when the locking around our error handler was somewhere
between nonexistent and totally broken, nowadays we keep things from
live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
whip up a patch to rip this out. I'll also check that our testsuite
properly exercises this path (needs a bit of work on a quick look for
better coverage).

The one in ttm is just bonghits to shut up lockdep: ttm can recurse
into it's own pagefault handler and then deadlock, the trylock just
keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
fun userspace did and now have testcases for them. The right solution
to fix this is to use copy_to|from_user_atomic in ttm everywhere it
holds locks and have slowpaths which drops locks, copies stuff into a
temp allocation and then continues. At least that's how we've fixed
all those inversions in i915-gem. I'm not volunteering to fix this ;-)

The one in udl just looks like copypasta from i915, without any
justification (at least I don't see any) for why it's there. Probably
can die, too, since there isn't any gpu to reset on usb display-link
devices ...

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-12 15:43:42

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 05:36:43PM +0200, Daniel Vetter wrote:
> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
> removed. It was there to give the error handler a chance to sneak in
> and reset the hw/sw tracking when the gpu is dead. That hack goes back
> to the days when the locking around our error handler was somewhere
> between nonexistent and totally broken, nowadays we keep things from
> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
> whip up a patch to rip this out. I'll also check that our testsuite
> properly exercises this path (needs a bit of work on a quick look for
> better coverage).
>
> The one in udl just looks like copypasta from i915, without any
> justification (at least I don't see any) for why it's there. Probably
> can die, too, since there isn't any gpu to reset on usb display-link
> devices ...

OK, awesome that. 2 down.

> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
> into it's own pagefault handler and then deadlock, the trylock just
> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
> fun userspace did and now have testcases for them. The right solution
> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
> holds locks and have slowpaths which drops locks, copies stuff into a
> temp allocation and then continues. At least that's how we've fixed
> all those inversions in i915-gem. I'm not volunteering to fix this ;-)

Yikes.. so how common is it? If I simply rip the set_need_resched() out
it will 'spin' on the fault a little longer until a 'natural' preemption
point -- if such a thing is every going to happen.

It would make that path take longer, but not be more or less broken.

So if its a rare path, I'll just rip the set_need_resched() out and you
DRM people can then fix up at your own pace.

2013-09-12 15:45:44

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

Op 12-09-13 17:36, Daniel Vetter schreef:
> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>> So I'm poking around the preemption code and stumbled upon:
>>
>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>
>> All these sites basically do:
>>
>> while (!trylock())
>> yield();
>>
>> which is a horrible and broken locking pattern.
>>
>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>> task that preempted the lock holder at FIFOn.
>>
>> Secondly the implementation is worse than usual by abusing
>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>> doesn't retry, but you're using it as a get out of fault path. And
>> you're using set_need_resched() which is not something a driver should
>> _ever_ touch.
>>
>> Now I'm going to take away set_need_resched() -- and while you can
>> 'reimplement' it using set_thread_flag() you're not going to do that
>> because it will be broken due to changes to the preempt code.
>>
>> So please as to fix ASAP and don't allow anybody to trick you into
>> merging silly things like that again ;-)
> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
> removed. It was there to give the error handler a chance to sneak in
> and reset the hw/sw tracking when the gpu is dead. That hack goes back
> to the days when the locking around our error handler was somewhere
> between nonexistent and totally broken, nowadays we keep things from
> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
> whip up a patch to rip this out. I'll also check that our testsuite
> properly exercises this path (needs a bit of work on a quick look for
> better coverage).
>
> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
> into it's own pagefault handler and then deadlock, the trylock just
> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
> fun userspace did and now have testcases for them. The right solution
> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
> holds locks and have slowpaths which drops locks, copies stuff into a
> temp allocation and then continues. At least that's how we've fixed
> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
Ah the case where a mmap'd address is passed to the execbuf ioctl? :P

Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
> The one in udl just looks like copypasta from i915, without any
> justification (at least I don't see any) for why it's there. Probably
> can die, too, since there isn't any gpu to reset on usb display-link
> devices ...
>
> Cheers, Daniel

2013-09-12 15:58:50

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 5:43 PM, Peter Zijlstra <[email protected]> wrote:
>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>> into it's own pagefault handler and then deadlock, the trylock just
>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>> fun userspace did and now have testcases for them. The right solution
>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>> holds locks and have slowpaths which drops locks, copies stuff into a
>> temp allocation and then continues. At least that's how we've fixed
>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>
> Yikes.. so how common is it? If I simply rip the set_need_resched() out
> it will 'spin' on the fault a little longer until a 'natural' preemption
> point -- if such a thing is every going to happen.

It's a case of "our userspace doesn't do this", so as long as you're
not evil and frob the drm device nodes of ttm drivers directly the
deadlock will never happen. No idea how much contention actually
happens on e.g. shared buffer objects - in i915 we have just one lock
and so suffer quite a bit more from contention. So no idea how much
removing the yield would hurt.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-12 16:22:21

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 05:58:49PM +0200, Daniel Vetter wrote:
> On Thu, Sep 12, 2013 at 5:43 PM, Peter Zijlstra <[email protected]> wrote:
> >> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
> >> into it's own pagefault handler and then deadlock, the trylock just
> >> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
> >> fun userspace did and now have testcases for them. The right solution
> >> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
> >> holds locks and have slowpaths which drops locks, copies stuff into a
> >> temp allocation and then continues. At least that's how we've fixed
> >> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
> >
> > Yikes.. so how common is it? If I simply rip the set_need_resched() out
> > it will 'spin' on the fault a little longer until a 'natural' preemption
> > point -- if such a thing is every going to happen.
>
> It's a case of "our userspace doesn't do this", so as long as you're
> not evil and frob the drm device nodes of ttm drivers directly the
> deadlock will never happen. No idea how much contention actually
> happens on e.g. shared buffer objects - in i915 we have just one lock
> and so suffer quite a bit more from contention. So no idea how much
> removing the yield would hurt.

If 'sane' userspace is never supposed to do this, then only insane
userspace is going to hurt from this and that's a GOOD (tm) thing,
right? ;-)

And it won't actually deadlock if you don't use FIFO, for the regular
scheduler class it'll just spin a little longer before getting preempted
so no real worries there.

2013-09-12 16:39:28

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On 09/12/2013 05:58 PM, Daniel Vetter wrote:
> On Thu, Sep 12, 2013 at 5:43 PM, Peter Zijlstra <[email protected]> wrote:
>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>> into it's own pagefault handler and then deadlock, the trylock just
>>> keeps lockdep quiet.

Could you describe how it could recurse into it's own pagefault handler?
IIRC the VM flags of the TTM VMAs makes get_user_pages() refrain from
touching these VMAs,
hence I don't think this code can deadlock, but admittedly it's far from
the optimal solution.

Never mind, more on the set_need_resched() below.

>>> We've had that bug arise in drm/i915 due to some
>>> fun userspace did and now have testcases for them. The right solution
>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>> temp allocation and then continues. At least that's how we've fixed
>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>> Yikes.. so how common is it? If I simply rip the set_need_resched() out
>> it will 'spin' on the fault a little longer until a 'natural' preemption
>> point -- if such a thing is every going to happen.

A typical case is if a process is throwing out a buffer from the GPU or
system memory while another
process pagefaults while writing to it. It's not a common situation, and
it's by no means a fastpath situation.
For correctness purposes, I think set_need_resched() can be safely removed.

> It's a case of "our userspace doesn't do this", so as long as you're
> not evil and frob the drm device nodes of ttm drivers directly the
> deadlock will never happen. No idea how much contention actually
> happens on e.g. shared buffer objects - in i915 we have just one lock
> and so suffer quite a bit more from contention. So no idea how much
> removing the yield would hurt.
> -Daniel

/Thomas

2013-09-12 16:44:18

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
> Op 12-09-13 17:36, Daniel Vetter schreef:
>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>> So I'm poking around the preemption code and stumbled upon:
>>>
>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>
>>> All these sites basically do:
>>>
>>> while (!trylock())
>>> yield();
>>>
>>> which is a horrible and broken locking pattern.
>>>
>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>> task that preempted the lock holder at FIFOn.
>>>
>>> Secondly the implementation is worse than usual by abusing
>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>> doesn't retry, but you're using it as a get out of fault path. And
>>> you're using set_need_resched() which is not something a driver should
>>> _ever_ touch.
>>>
>>> Now I'm going to take away set_need_resched() -- and while you can
>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>> because it will be broken due to changes to the preempt code.
>>>
>>> So please as to fix ASAP and don't allow anybody to trick you into
>>> merging silly things like that again ;-)
>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>> removed. It was there to give the error handler a chance to sneak in
>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>> to the days when the locking around our error handler was somewhere
>> between nonexistent and totally broken, nowadays we keep things from
>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>> whip up a patch to rip this out. I'll also check that our testsuite
>> properly exercises this path (needs a bit of work on a quick look for
>> better coverage).
>>
>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>> into it's own pagefault handler and then deadlock, the trylock just
>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>> fun userspace did and now have testcases for them. The right solution
>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>> holds locks and have slowpaths which drops locks, copies stuff into a
>> temp allocation and then continues. At least that's how we've fixed
>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>
> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..

I think a possible fix would be if fault() were allowed to return an
error and drop the mmap_sem() before returning.

Otherwise we need to track down all copy_to_user / copy_from_user which
happen with bo::reserve held.

/Thomas

2013-09-12 17:02:11

by Chris Wilson

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 06:22:10PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 12, 2013 at 05:58:49PM +0200, Daniel Vetter wrote:
> > On Thu, Sep 12, 2013 at 5:43 PM, Peter Zijlstra <[email protected]> wrote:
> > >> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
> > >> into it's own pagefault handler and then deadlock, the trylock just
> > >> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
> > >> fun userspace did and now have testcases for them. The right solution
> > >> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
> > >> holds locks and have slowpaths which drops locks, copies stuff into a
> > >> temp allocation and then continues. At least that's how we've fixed
> > >> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
> > >
> > > Yikes.. so how common is it? If I simply rip the set_need_resched() out
> > > it will 'spin' on the fault a little longer until a 'natural' preemption
> > > point -- if such a thing is every going to happen.
> >
> > It's a case of "our userspace doesn't do this", so as long as you're
> > not evil and frob the drm device nodes of ttm drivers directly the
> > deadlock will never happen. No idea how much contention actually
> > happens on e.g. shared buffer objects - in i915 we have just one lock
> > and so suffer quite a bit more from contention. So no idea how much
> > removing the yield would hurt.
>
> If 'sane' userspace is never supposed to do this, then only insane
> userspace is going to hurt from this and that's a GOOD (tm) thing,
> right? ;-)

Not quite, as it would be possible for the evil userspace to trigger a
GPU hang that would cause the sane userspace to spin indefinitely
waiting for the error recovery to kick in.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2013-09-12 19:48:24

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 6:44 PM, Thomas Hellstrom <[email protected]> wrote:
>
> I think a possible fix would be if fault() were allowed to return an error
> and drop the mmap_sem() before returning.
>
> Otherwise we need to track down all copy_to_user / copy_from_user which
> happen with bo::reserve held.

For maximal evilness submit the relocation list (or whatever data
execbuf slurps in with copy_from_user while holding bo::reserve) of a
bo in the execbuf list. At least that's the testcase we have for
drm/i915. Then make sure that the execbuf wants the bo somewhere it
can't be mmaped from userspace, so needs to be moved both in the fault
handler and then back for the execbuf to continue ;-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-12 19:52:26

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 6:22 PM, Peter Zijlstra <[email protected]> wrote:
> If 'sane' userspace is never supposed to do this, then only insane
> userspace is going to hurt from this and that's a GOOD (tm) thing,
> right? ;-)

Afaik sane userspace doesn't hit the _deadlock_ (or lifelock if we
have the set_need_resched in there). drm/i915 is a bit different since
we have just one lock, and so the same design would actually deadlock
even for sane userspace. But hitting contention there and yielding is
somewhat expected. Obviously shouldn't happen too often since it'll
hurt performance, with either blocking or the yield spinning loop.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-12 19:58:41

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, 12 Sep 2013, Daniel Vetter wrote:

> On Thu, Sep 12, 2013 at 6:22 PM, Peter Zijlstra <[email protected]> wrote:
> > If 'sane' userspace is never supposed to do this, then only insane
> > userspace is going to hurt from this and that's a GOOD (tm) thing,
> > right? ;-)
>
> Afaik sane userspace doesn't hit the _deadlock_ (or lifelock if we
> have the set_need_resched in there). drm/i915 is a bit different since
> we have just one lock, and so the same design would actually deadlock
> even for sane userspace. But hitting contention there and yielding is
> somewhat expected. Obviously shouldn't happen too often since it'll
> hurt performance, with either blocking or the yield spinning loop.

So this is actually a non priviledged DoS interface, right?

Thanks,

tglx

2013-09-12 20:04:09

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 9:58 PM, Thomas Gleixner <[email protected]> wrote:
>
>> On Thu, Sep 12, 2013 at 6:22 PM, Peter Zijlstra <[email protected]> wrote:
>> > If 'sane' userspace is never supposed to do this, then only insane
>> > userspace is going to hurt from this and that's a GOOD (tm) thing,
>> > right? ;-)
>>
>> Afaik sane userspace doesn't hit the _deadlock_ (or lifelock if we
>> have the set_need_resched in there). drm/i915 is a bit different since
>> we have just one lock, and so the same design would actually deadlock
>> even for sane userspace. But hitting contention there and yielding is
>> somewhat expected. Obviously shouldn't happen too often since it'll
>> hurt performance, with either blocking or the yield spinning loop.
>
> So this is actually a non priviledged DoS interface, right?

I think for ttm drivers it's just execbuf being exploitable. But on
drm/i915 we've
had the same issue with the pwrite/pread ioctls, so a simple
glBufferData(glMap) kind of recursion from gl clients blew the kernel
to pieces ...
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-12 20:21:14

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, 12 Sep 2013, Daniel Vetter wrote:

> On Thu, Sep 12, 2013 at 9:58 PM, Thomas Gleixner <[email protected]> wrote:
> >
> >> On Thu, Sep 12, 2013 at 6:22 PM, Peter Zijlstra <[email protected]> wrote:
> >> > If 'sane' userspace is never supposed to do this, then only insane
> >> > userspace is going to hurt from this and that's a GOOD (tm) thing,
> >> > right? ;-)
> >>
> >> Afaik sane userspace doesn't hit the _deadlock_ (or lifelock if we
> >> have the set_need_resched in there). drm/i915 is a bit different since
> >> we have just one lock, and so the same design would actually deadlock
> >> even for sane userspace. But hitting contention there and yielding is
> >> somewhat expected. Obviously shouldn't happen too often since it'll
> >> hurt performance, with either blocking or the yield spinning loop.
> >
> > So this is actually a non priviledged DoS interface, right?
>
> I think for ttm drivers it's just execbuf being exploitable. But on
> drm/i915 we've
> had the same issue with the pwrite/pread ioctls, so a simple
> glBufferData(glMap) kind of recursion from gl clients blew the kernel
> to pieces ...

And the only answer you folks came up with is set_need_resched() and
yield()? Oh well....

Thanks,

tglx

2013-09-12 20:23:07

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 10:20 PM, Thomas Gleixner <[email protected]> wrote:
>> I think for ttm drivers it's just execbuf being exploitable. But on
>> drm/i915 we've
>> had the same issue with the pwrite/pread ioctls, so a simple
>> glBufferData(glMap) kind of recursion from gl clients blew the kernel
>> to pieces ...
>
> And the only answer you folks came up with is set_need_resched() and
> yield()? Oh well....

The yield was for a different lifelock, and that one is also fixed by
now. The fault handler deadlock was fixed in the usual "drop locks and
jump into slowpath" fasion, at least in drm/i915.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-12 20:30:46

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, Sep 12, 2013 at 05:35:43PM +0100, Chris Wilson wrote:
> Not quite, as it would be possible for the evil userspace to trigger a
> GPU hang that would cause the sane userspace to spin indefinitely
> waiting for the error recovery to kick in.

So with FIFOn+1 preempting FIFOn its a live-lock because the faulting
thread will forever keep yielding to itself since its the highest
priority task around, therefore the set_need_resched() is an absolute
NOP in that case.

For OTHER it might run another task with set_need_resched(), without
set_need_resched() it'll simply spin on the fault until it runs out of
time and gets force preempted and another task gets to run.

So for either case, the set_need_resched() doesn't make an appreciable
difference.

Removing it will not make evil userspace much worse -- at worst it will
cause slightly more wasted cycles.

2013-09-12 20:38:33

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, 12 Sep 2013, Peter Zijlstra wrote:

> On Thu, Sep 12, 2013 at 05:35:43PM +0100, Chris Wilson wrote:
> > Not quite, as it would be possible for the evil userspace to trigger a
> > GPU hang that would cause the sane userspace to spin indefinitely
> > waiting for the error recovery to kick in.
>
> So with FIFOn+1 preempting FIFOn its a live-lock because the faulting
> thread will forever keep yielding to itself since its the highest
> priority task around, therefore the set_need_resched() is an absolute
> NOP in that case.
>
> For OTHER it might run another task with set_need_resched(), without
> set_need_resched() it'll simply spin on the fault until it runs out of
> time and gets force preempted and another task gets to run.
>
> So for either case, the set_need_resched() doesn't make an appreciable
> difference.
>
> Removing it will not make evil userspace much worse -- at worst it will
> cause slightly more wasted cycles.

Well, yield() is a completely doomed concept by definition no matter
whether you add set_need_resched() or not.

We really should put a

schedule_timeout_uninterruptible(1hour);

into the yield() implementation to get finally rid of it.

Thanks,

tglx

2013-09-12 20:39:36

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Thu, 12 Sep 2013, Daniel Vetter wrote:

> On Thu, Sep 12, 2013 at 10:20 PM, Thomas Gleixner <[email protected]> wrote:
> >> I think for ttm drivers it's just execbuf being exploitable. But on
> >> drm/i915 we've
> >> had the same issue with the pwrite/pread ioctls, so a simple
> >> glBufferData(glMap) kind of recursion from gl clients blew the kernel
> >> to pieces ...
> >
> > And the only answer you folks came up with is set_need_resched() and
> > yield()? Oh well....
>
> The yield was for a different lifelock, and that one is also fixed by
> now. The fault handler deadlock was fixed in the usual "drop locks and
> jump into slowpath" fasion, at least in drm/i915.

So we can remove that whole yield/set_need_resched() mess completely ?

Thanks,

tglx

2013-09-12 20:48:27

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On 09/12/2013 10:39 PM, Thomas Gleixner wrote:
> On Thu, 12 Sep 2013, Daniel Vetter wrote:
>
>> On Thu, Sep 12, 2013 at 10:20 PM, Thomas Gleixner <[email protected]> wrote:
>>>> I think for ttm drivers it's just execbuf being exploitable. But on
>>>> drm/i915 we've
>>>> had the same issue with the pwrite/pread ioctls, so a simple
>>>> glBufferData(glMap) kind of recursion from gl clients blew the kernel
>>>> to pieces ...
>>> And the only answer you folks came up with is set_need_resched() and
>>> yield()? Oh well....
>> The yield was for a different lifelock, and that one is also fixed by
>> now. The fault handler deadlock was fixed in the usual "drop locks and
>> jump into slowpath" fasion, at least in drm/i915.
> So we can remove that whole yield/set_need_resched() mess completely ?
>
> Thanks,
>
> tglx
No.

The while(trylock) is there to address a potential locking inversion
deadlock. If the trylock fails, the code returns to user-space which
retries the fault. This code needs to stay until we can come up with
either a way to drop the mmap_sem and sleep before returning to
user-space, or a bunch of code is fixed with a different locking order.

The set_need_resched() can (and should according to Peter) go.

/Thomas

2013-09-12 21:50:58

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

Op 12-09-13 18:44, Thomas Hellstrom schreef:
> On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
>> Op 12-09-13 17:36, Daniel Vetter schreef:
>>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>>> So I'm poking around the preemption code and stumbled upon:
>>>>
>>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>>
>>>> All these sites basically do:
>>>>
>>>> while (!trylock())
>>>> yield();
>>>>
>>>> which is a horrible and broken locking pattern.
>>>>
>>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>>> task that preempted the lock holder at FIFOn.
>>>>
>>>> Secondly the implementation is worse than usual by abusing
>>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>>> doesn't retry, but you're using it as a get out of fault path. And
>>>> you're using set_need_resched() which is not something a driver should
>>>> _ever_ touch.
>>>>
>>>> Now I'm going to take away set_need_resched() -- and while you can
>>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>>> because it will be broken due to changes to the preempt code.
>>>>
>>>> So please as to fix ASAP and don't allow anybody to trick you into
>>>> merging silly things like that again ;-)
>>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>>> removed. It was there to give the error handler a chance to sneak in
>>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>>> to the days when the locking around our error handler was somewhere
>>> between nonexistent and totally broken, nowadays we keep things from
>>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>>> whip up a patch to rip this out. I'll also check that our testsuite
>>> properly exercises this path (needs a bit of work on a quick look for
>>> better coverage).
>>>
>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>> into it's own pagefault handler and then deadlock, the trylock just
>>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>>> fun userspace did and now have testcases for them. The right solution
>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>> temp allocation and then continues. At least that's how we've fixed
>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>>
>> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
>
> I think a possible fix would be if fault() were allowed to return an error and drop the mmap_sem() before returning.
>
> Otherwise we need to track down all copy_to_user / copy_from_user which happen with bo::reserve held.
CONFIG_PROVE_LOCKING=y

and hard grab that reserve lock within the fault handler, done.. lockdep will spit it out for you :p

~Maarten

2013-09-13 05:33:26

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On 09/12/2013 11:50 PM, Maarten Lankhorst wrote:
> Op 12-09-13 18:44, Thomas Hellstrom schreef:
>> On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
>>> Op 12-09-13 17:36, Daniel Vetter schreef:
>>>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>>>> So I'm poking around the preemption code and stumbled upon:
>>>>>
>>>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>>>
>>>>> All these sites basically do:
>>>>>
>>>>> while (!trylock())
>>>>> yield();
>>>>>
>>>>> which is a horrible and broken locking pattern.
>>>>>
>>>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>>>> task that preempted the lock holder at FIFOn.
>>>>>
>>>>> Secondly the implementation is worse than usual by abusing
>>>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>>>> doesn't retry, but you're using it as a get out of fault path. And
>>>>> you're using set_need_resched() which is not something a driver should
>>>>> _ever_ touch.
>>>>>
>>>>> Now I'm going to take away set_need_resched() -- and while you can
>>>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>>>> because it will be broken due to changes to the preempt code.
>>>>>
>>>>> So please as to fix ASAP and don't allow anybody to trick you into
>>>>> merging silly things like that again ;-)
>>>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>>>> removed. It was there to give the error handler a chance to sneak in
>>>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>>>> to the days when the locking around our error handler was somewhere
>>>> between nonexistent and totally broken, nowadays we keep things from
>>>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>>>> whip up a patch to rip this out. I'll also check that our testsuite
>>>> properly exercises this path (needs a bit of work on a quick look for
>>>> better coverage).
>>>>
>>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>>> into it's own pagefault handler and then deadlock, the trylock just
>>>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>>>> fun userspace did and now have testcases for them. The right solution
>>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>>> temp allocation and then continues. At least that's how we've fixed
>>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>>> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>>>
>>> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
>> I think a possible fix would be if fault() were allowed to return an error and drop the mmap_sem() before returning.
>>
>> Otherwise we need to track down all copy_to_user / copy_from_user which happen with bo::reserve held.
> CONFIG_PROVE_LOCKING=y
>
> and hard grab that reserve lock within the fault handler, done.. lockdep will spit it out for you :p
>
> ~Maarten

Given that all copy_to_user / copy_from_user paths are actually hit
during testing, right?

/Thomas

2013-09-13 07:16:59

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

Op 13-09-13 08:44, Thomas Hellstrom schreef:
> On 09/12/2013 11:50 PM, Maarten Lankhorst wrote:
>> Op 12-09-13 18:44, Thomas Hellstrom schreef:
>>> On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
>>>> Op 12-09-13 17:36, Daniel Vetter schreef:
>>>>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>>>>> So I'm poking around the preemption code and stumbled upon:
>>>>>>
>>>>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>>>>
>>>>>> All these sites basically do:
>>>>>>
>>>>>> while (!trylock())
>>>>>> yield();
>>>>>>
>>>>>> which is a horrible and broken locking pattern.
>>>>>>
>>>>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>>>>> task that preempted the lock holder at FIFOn.
>>>>>>
>>>>>> Secondly the implementation is worse than usual by abusing
>>>>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>>>>> doesn't retry, but you're using it as a get out of fault path. And
>>>>>> you're using set_need_resched() which is not something a driver should
>>>>>> _ever_ touch.
>>>>>>
>>>>>> Now I'm going to take away set_need_resched() -- and while you can
>>>>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>>>>> because it will be broken due to changes to the preempt code.
>>>>>>
>>>>>> So please as to fix ASAP and don't allow anybody to trick you into
>>>>>> merging silly things like that again ;-)
>>>>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>>>>> removed. It was there to give the error handler a chance to sneak in
>>>>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>>>>> to the days when the locking around our error handler was somewhere
>>>>> between nonexistent and totally broken, nowadays we keep things from
>>>>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>>>>> whip up a patch to rip this out. I'll also check that our testsuite
>>>>> properly exercises this path (needs a bit of work on a quick look for
>>>>> better coverage).
>>>>>
>>>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>>>> into it's own pagefault handler and then deadlock, the trylock just
>>>>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>>>>> fun userspace did and now have testcases for them. The right solution
>>>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>>>> temp allocation and then continues. At least that's how we've fixed
>>>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>>>> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>>>>
>>>> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
>>> I think a possible fix would be if fault() were allowed to return an error and drop the mmap_sem() before returning.
>>>
>>> Otherwise we need to track down all copy_to_user / copy_from_user which happen with bo::reserve held.
>
> Actually, from looking at the mm code, it seems OK to do the following:
>
> if (!bo_tryreserve()) {
> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
> bo_reserve(); // Wait for the BO to become available (interruptible)
> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
> }
Is this meant as a jab at me? You're doing locking wrong here! Again!

> Somebody conveniently added a VM_FAULT_RETRY, but for a different purpose.
>
> If possible, I suggest to take this route for now to avoid the mess of changing locking order in all TTM drivers, with
> all give-up-locking slowpaths that comes with it. IIRC it took some time for i915 to get that right, and completely get rid of all lockdep warnings.
Sorry, but it's still the right thing to do. I can convert nouveau and take a look at radeon. Locking
slowpaths are easy to test too with CONFIG_DEBUG_WW_MUTEX_SLOWPATH.
Just because it's harder, doesn't mean we have to avoid doing it.

The might_fault function will verify the usage of mmap_sem with lockdep automatically when PROVE_LOCKING=y.
This means that any copy_from_user / copy_to_user will always check mmap_sem.

> This will keep the official locking order
> bo::reserve
> mmap_sem
Disagree, fix the order and the trylock and 'wait for unreserved' half assed locking will disappear.

~Maarten

2013-09-13 07:51:12

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

Op 13-09-13 09:46, Thomas Hellstrom schreef:
> On 09/13/2013 09:16 AM, Maarten Lankhorst wrote:
>> Op 13-09-13 08:44, Thomas Hellstrom schreef:
>>> On 09/12/2013 11:50 PM, Maarten Lankhorst wrote:
>>>> Op 12-09-13 18:44, Thomas Hellstrom schreef:
>>>>> On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
>>>>>> Op 12-09-13 17:36, Daniel Vetter schreef:
>>>>>>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>>>>>>> So I'm poking around the preemption code and stumbled upon:
>>>>>>>>
>>>>>>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>>>>>>
>>>>>>>> All these sites basically do:
>>>>>>>>
>>>>>>>> while (!trylock())
>>>>>>>> yield();
>>>>>>>>
>>>>>>>> which is a horrible and broken locking pattern.
>>>>>>>>
>>>>>>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>>>>>>> task that preempted the lock holder at FIFOn.
>>>>>>>>
>>>>>>>> Secondly the implementation is worse than usual by abusing
>>>>>>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>>>>>>> doesn't retry, but you're using it as a get out of fault path. And
>>>>>>>> you're using set_need_resched() which is not something a driver should
>>>>>>>> _ever_ touch.
>>>>>>>>
>>>>>>>> Now I'm going to take away set_need_resched() -- and while you can
>>>>>>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>>>>>>> because it will be broken due to changes to the preempt code.
>>>>>>>>
>>>>>>>> So please as to fix ASAP and don't allow anybody to trick you into
>>>>>>>> merging silly things like that again ;-)
>>>>>>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>>>>>>> removed. It was there to give the error handler a chance to sneak in
>>>>>>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>>>>>>> to the days when the locking around our error handler was somewhere
>>>>>>> between nonexistent and totally broken, nowadays we keep things from
>>>>>>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>>>>>>> whip up a patch to rip this out. I'll also check that our testsuite
>>>>>>> properly exercises this path (needs a bit of work on a quick look for
>>>>>>> better coverage).
>>>>>>>
>>>>>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>>>>>> into it's own pagefault handler and then deadlock, the trylock just
>>>>>>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>>>>>>> fun userspace did and now have testcases for them. The right solution
>>>>>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>>>>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>>>>>> temp allocation and then continues. At least that's how we've fixed
>>>>>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>>>>>> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>>>>>>
>>>>>> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
>>>>> I think a possible fix would be if fault() were allowed to return an error and drop the mmap_sem() before returning.
>>>>>
>>>>> Otherwise we need to track down all copy_to_user / copy_from_user which happen with bo::reserve held.
>>> Actually, from looking at the mm code, it seems OK to do the following:
>>>
>>> if (!bo_tryreserve()) {
>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>> }
>> Is this meant as a jab at me? You're doing locking wrong here! Again!
>
> It's not meant as a jab at you. I'm sorry if it came out that way. It was meant as a joke. I wasn't aware the topic was sensitive.
>
> Anyway, could you describe what is wrong, with the above solution, because it seems perfectly legal to me.
> There is no substantial overhead, and there is no risc of deadlocks. Or do you mean it's bad because it confuses lockdep?
Evil userspace can pass a bo as pointer to use for relocation lists, lockdep will warn when that locks up, but still..
This is already a problem now, and your fixing will only cause lockdep to explicitly warn on it.

You can make a complicated user program to test this, or simply use this function for debugging:
void ttm_might_fault(void) { struct reservation_object obj; reservation_object_init(&obj); ww_mutex_lock(&obj.lock, NULL); ww_mutex_unlock(&obj.lock); reservation_object_fini(&obj); }

Put it near every instance of copy_to_user/copy_from_user and you'll find the bugs. :)

~Maarten

2013-09-13 07:46:14

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On 09/13/2013 09:16 AM, Maarten Lankhorst wrote:
> Op 13-09-13 08:44, Thomas Hellstrom schreef:
>> On 09/12/2013 11:50 PM, Maarten Lankhorst wrote:
>>> Op 12-09-13 18:44, Thomas Hellstrom schreef:
>>>> On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
>>>>> Op 12-09-13 17:36, Daniel Vetter schreef:
>>>>>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>>>>>> So I'm poking around the preemption code and stumbled upon:
>>>>>>>
>>>>>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>>>>>
>>>>>>> All these sites basically do:
>>>>>>>
>>>>>>> while (!trylock())
>>>>>>> yield();
>>>>>>>
>>>>>>> which is a horrible and broken locking pattern.
>>>>>>>
>>>>>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>>>>>> task that preempted the lock holder at FIFOn.
>>>>>>>
>>>>>>> Secondly the implementation is worse than usual by abusing
>>>>>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>>>>>> doesn't retry, but you're using it as a get out of fault path. And
>>>>>>> you're using set_need_resched() which is not something a driver should
>>>>>>> _ever_ touch.
>>>>>>>
>>>>>>> Now I'm going to take away set_need_resched() -- and while you can
>>>>>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>>>>>> because it will be broken due to changes to the preempt code.
>>>>>>>
>>>>>>> So please as to fix ASAP and don't allow anybody to trick you into
>>>>>>> merging silly things like that again ;-)
>>>>>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>>>>>> removed. It was there to give the error handler a chance to sneak in
>>>>>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>>>>>> to the days when the locking around our error handler was somewhere
>>>>>> between nonexistent and totally broken, nowadays we keep things from
>>>>>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>>>>>> whip up a patch to rip this out. I'll also check that our testsuite
>>>>>> properly exercises this path (needs a bit of work on a quick look for
>>>>>> better coverage).
>>>>>>
>>>>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>>>>> into it's own pagefault handler and then deadlock, the trylock just
>>>>>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>>>>>> fun userspace did and now have testcases for them. The right solution
>>>>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>>>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>>>>> temp allocation and then continues. At least that's how we've fixed
>>>>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>>>>> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>>>>>
>>>>> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
>>>> I think a possible fix would be if fault() were allowed to return an error and drop the mmap_sem() before returning.
>>>>
>>>> Otherwise we need to track down all copy_to_user / copy_from_user which happen with bo::reserve held.
>> Actually, from looking at the mm code, it seems OK to do the following:
>>
>> if (!bo_tryreserve()) {
>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>> bo_reserve(); // Wait for the BO to become available (interruptible)
>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>> }
> Is this meant as a jab at me? You're doing locking wrong here! Again!

It's not meant as a jab at you. I'm sorry if it came out that way. It
was meant as a joke. I wasn't aware the topic was sensitive.

Anyway, could you describe what is wrong, with the above solution,
because it seems perfectly legal to me.
There is no substantial overhead, and there is no risc of deadlocks. Or
do you mean it's bad because it confuses lockdep?

/Thomas

2013-09-13 08:29:50

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
> >>if (!bo_tryreserve()) {
> >> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
> >> bo_reserve(); // Wait for the BO to become available (interruptible)
> >> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
> >> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
> >>}
>
> Anyway, could you describe what is wrong, with the above solution, because
> it seems perfectly legal to me.

Luckily the rule of law doesn't have anything to do with this stuff --
at least I sincerely hope so.

The thing that's wrong with that pattern is that its still not
deterministic - although its a lot better than the pure trylock. Because
you have to release and re-acquire with the trylock another user might
have gotten in again. Its utterly prone to starvation.

The acquire+release does remove the dead/life-lock scenario from the
FIFO case, since blocking on the acquire will allow the other task to
run (or even get boosted on -rt).

Aside from that there's nothing particularly wrong with it and lockdep
should be happy afaict (but I haven't had my morning juice yet).

2013-09-13 08:23:18

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On 09/13/2013 09:51 AM, Maarten Lankhorst wrote:
> Op 13-09-13 09:46, Thomas Hellstrom schreef:
>> On 09/13/2013 09:16 AM, Maarten Lankhorst wrote:
>>> Op 13-09-13 08:44, Thomas Hellstrom schreef:
>>>> On 09/12/2013 11:50 PM, Maarten Lankhorst wrote:
>>>>> Op 12-09-13 18:44, Thomas Hellstrom schreef:
>>>>>> On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
>>>>>>> Op 12-09-13 17:36, Daniel Vetter schreef:
>>>>>>>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>>>>>>>> So I'm poking around the preemption code and stumbled upon:
>>>>>>>>>
>>>>>>>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>>>>>>>
>>>>>>>>> All these sites basically do:
>>>>>>>>>
>>>>>>>>> while (!trylock())
>>>>>>>>> yield();
>>>>>>>>>
>>>>>>>>> which is a horrible and broken locking pattern.
>>>>>>>>>
>>>>>>>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>>>>>>>> task that preempted the lock holder at FIFOn.
>>>>>>>>>
>>>>>>>>> Secondly the implementation is worse than usual by abusing
>>>>>>>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>>>>>>>> doesn't retry, but you're using it as a get out of fault path. And
>>>>>>>>> you're using set_need_resched() which is not something a driver should
>>>>>>>>> _ever_ touch.
>>>>>>>>>
>>>>>>>>> Now I'm going to take away set_need_resched() -- and while you can
>>>>>>>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>>>>>>>> because it will be broken due to changes to the preempt code.
>>>>>>>>>
>>>>>>>>> So please as to fix ASAP and don't allow anybody to trick you into
>>>>>>>>> merging silly things like that again ;-)
>>>>>>>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>>>>>>>> removed. It was there to give the error handler a chance to sneak in
>>>>>>>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>>>>>>>> to the days when the locking around our error handler was somewhere
>>>>>>>> between nonexistent and totally broken, nowadays we keep things from
>>>>>>>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>>>>>>>> whip up a patch to rip this out. I'll also check that our testsuite
>>>>>>>> properly exercises this path (needs a bit of work on a quick look for
>>>>>>>> better coverage).
>>>>>>>>
>>>>>>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>>>>>>> into it's own pagefault handler and then deadlock, the trylock just
>>>>>>>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>>>>>>>> fun userspace did and now have testcases for them. The right solution
>>>>>>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>>>>>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>>>>>>> temp allocation and then continues. At least that's how we've fixed
>>>>>>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>>>>>>> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>>>>>>>
>>>>>>> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
>>>>>> I think a possible fix would be if fault() were allowed to return an error and drop the mmap_sem() before returning.
>>>>>>
>>>>>> Otherwise we need to track down all copy_to_user / copy_from_user which happen with bo::reserve held.
>>>> Actually, from looking at the mm code, it seems OK to do the following:
>>>>
>>>> if (!bo_tryreserve()) {
>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>> }
>>> Is this meant as a jab at me? You're doing locking wrong here! Again!
>> It's not meant as a jab at you. I'm sorry if it came out that way. It was meant as a joke. I wasn't aware the topic was sensitive.
>>
>> Anyway, could you describe what is wrong, with the above solution, because it seems perfectly legal to me.
>> There is no substantial overhead, and there is no risc of deadlocks. Or do you mean it's bad because it confuses lockdep?
> Evil userspace can pass a bo as pointer to use for relocation lists, lockdep will warn when that locks up, but still..
> This is already a problem now, and your fixing will only cause lockdep to explicitly warn on it.

As previously mentioned, copy_from_user should return -EFAULT, since the
VMAs are marked with VM_IO. It should not recurse into fault(), so evil
user-space looses.

>
> You can make a complicated user program to test this, or simply use this function for debugging:
> void ttm_might_fault(void) { struct reservation_object obj; reservation_object_init(&obj); ww_mutex_lock(&obj.lock, NULL); ww_mutex_unlock(&obj.lock); reservation_object_fini(&obj); }
>
> Put it near every instance of copy_to_user/copy_from_user and you'll find the bugs. :)

I'm still not convinced that there are any problems with this solution.
Did you take what's said above into account?

Now, could we try to approach this based on pros and cons? Let's say we
would be able to choose locking order without doing anything ugly. I'd
put it like this:

mmap_sem->bo_reserve:
Good: Native locking order of VM subsystem. Good if we in the future
will need to reserve in mmap().
Bad: pwrite, pread, copy_to user, copy_from_user usage needs a slowpath
that releases all locking, which has to be done in multiple places in
multiple drivers. Grabbing the mmap_sem and then waiting for multiple
possibly sleeping bo_reserves in slow paths will stall VMA write
operations for this MM.

bo_reserve->mmap_sem:
Good: Natural locking order for all driver ioctls. Slowpath needs to be
done in a single place, in common code.
Bad: Bad if we ever need to perform bo_reserve in mmap.

In my view we have a clear winner. Given the problems i915 had when
converting their driver, and the bashing they had to withstand, we have
an even clearer winner.

And then we need to take into account that, (given that I understand
things correctly) lockdep will complain because it thinks there is a
recursion that will never happen.
That will make the bo_reserve->mmap_sem solution look bad, but is this
really enough to justify giving it up?

/Thomas

2013-09-13 08:39:59

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On 09/13/2013 10:32 AM, Daniel Vetter wrote:
> On Fri, Sep 13, 2013 at 10:23 AM, Thomas Hellstrom
> <[email protected]> wrote:
>> As previously mentioned, copy_from_user should return -EFAULT, since the
>> VMAs are marked with VM_IO. It should not recurse into fault(), so evil
>> user-space looses.
> I haven't put a printk in the code to prove this, but gem mmap also
> sets VM_IO in drm_gem_mmap_obj. And we can very much hit our own fault
> handler and deadlock ....

If this is indeed true, I guess I need to accept the fact that my
solution is bad.
(and worse I can't blame not having my morning coffee).

I'll take a deeper look.
/Thomas

2013-09-13 08:32:53

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Fri, Sep 13, 2013 at 10:23 AM, Thomas Hellstrom
<[email protected]> wrote:
> As previously mentioned, copy_from_user should return -EFAULT, since the
> VMAs are marked with VM_IO. It should not recurse into fault(), so evil
> user-space looses.

I haven't put a printk in the code to prove this, but gem mmap also
sets VM_IO in drm_gem_mmap_obj. And we can very much hit our own fault
handler and deadlock ....

On a _very_ quick reading (and definitely not enough coffee yet for
reading mm/* stuff) it looks like it's get_user_pages that will return
an -EFAULT when hitting upon a VM_IO mapping (which makes sense since
there's really no page backing it). Actually using get_user_pages was
the original slowpath we've had in a few places until we've noticed
that for pwrite that breaks legit userspace (the glBufferData(glMap))
use-case), so we've switched to lock dropping and proper slowpaths
using copy_*_user everywhere instead of trying to pin the userspace
storage with get_user_pages.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-13 08:26:52

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Fri, Sep 13, 2013 at 7:33 AM, Thomas Hellstrom <[email protected]> wrote:
> Given that all copy_to_user / copy_from_user paths are actually hit during
> testing, right?

Ime it requires a bit of ingenuity to properly test this from
userspace. We're using a few tricks in drm/i915 kernel testing:
- When we hand a gtt mmap pointer to execbuf or other ioctls we upload
the data in there through pwrite (or if you don't have that use the
gpu to blt it there). This way you can careful control when the
pagefault will happen. Also since we supply correct data we can make
sure that the kernel actually does the right thing and not just
whether it'll blow up.
- We have a module parameter which can be changed at runtime to
disable all the prefaulting we're doing.
- We have a debugfs interface to drop caches/evict lrus. If you have a
parallel thread that regularly forces the inactive list to be evicted
we can force a refault even after the first fault already happend.
That's useful to test the slowpath after a slowpath already happened,
e.g. when trying to copy reloc offset out to userspace after execbuf
completed.

With these tricks we have imo great test coverage for i915.ko and more
important good assurance that any regressions in this tricky code will
get caught.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-13 06:44:40

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On 09/12/2013 11:50 PM, Maarten Lankhorst wrote:
> Op 12-09-13 18:44, Thomas Hellstrom schreef:
>> On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
>>> Op 12-09-13 17:36, Daniel Vetter schreef:
>>>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>>>> So I'm poking around the preemption code and stumbled upon:
>>>>>
>>>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>>>
>>>>> All these sites basically do:
>>>>>
>>>>> while (!trylock())
>>>>> yield();
>>>>>
>>>>> which is a horrible and broken locking pattern.
>>>>>
>>>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>>>> task that preempted the lock holder at FIFOn.
>>>>>
>>>>> Secondly the implementation is worse than usual by abusing
>>>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>>>> doesn't retry, but you're using it as a get out of fault path. And
>>>>> you're using set_need_resched() which is not something a driver should
>>>>> _ever_ touch.
>>>>>
>>>>> Now I'm going to take away set_need_resched() -- and while you can
>>>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>>>> because it will be broken due to changes to the preempt code.
>>>>>
>>>>> So please as to fix ASAP and don't allow anybody to trick you into
>>>>> merging silly things like that again ;-)
>>>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>>>> removed. It was there to give the error handler a chance to sneak in
>>>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>>>> to the days when the locking around our error handler was somewhere
>>>> between nonexistent and totally broken, nowadays we keep things from
>>>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>>>> whip up a patch to rip this out. I'll also check that our testsuite
>>>> properly exercises this path (needs a bit of work on a quick look for
>>>> better coverage).
>>>>
>>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>>> into it's own pagefault handler and then deadlock, the trylock just
>>>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>>>> fun userspace did and now have testcases for them. The right solution
>>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>>> temp allocation and then continues. At least that's how we've fixed
>>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>>> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>>>
>>> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
>> I think a possible fix would be if fault() were allowed to return an error and drop the mmap_sem() before returning.
>>
>> Otherwise we need to track down all copy_to_user / copy_from_user which happen with bo::reserve held.

Actually, from looking at the mm code, it seems OK to do the following:

if (!bo_tryreserve()) {
up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
bo_reserve(); // Wait for the BO to become available
(interruptible)
bo_unreserve(); // Where is bo_wait_unreserved() when we
need it, Maarten :P
return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after
regrabbing
}

Somebody conveniently added a VM_FAULT_RETRY, but for a different purpose.

If possible, I suggest to take this route for now to avoid the mess of
changing locking order in all TTM drivers, with
all give-up-locking slowpaths that comes with it. IIRC it took some time
for i915 to get that right, and completely get rid of all lockdep warnings.

This will keep the official locking order
bo::reserve
mmap_sem

/Thomas

2013-09-13 08:41:59

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>> >>if (!bo_tryreserve()) {
>> >> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>> >> bo_reserve(); // Wait for the BO to become available (interruptible)
>> >> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>> >> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>> >>}
>>
>> Anyway, could you describe what is wrong, with the above solution, because
>> it seems perfectly legal to me.
>
> Luckily the rule of law doesn't have anything to do with this stuff --
> at least I sincerely hope so.
>
> The thing that's wrong with that pattern is that its still not
> deterministic - although its a lot better than the pure trylock. Because
> you have to release and re-acquire with the trylock another user might
> have gotten in again. Its utterly prone to starvation.
>
> The acquire+release does remove the dead/life-lock scenario from the
> FIFO case, since blocking on the acquire will allow the other task to
> run (or even get boosted on -rt).
>
> Aside from that there's nothing particularly wrong with it and lockdep
> should be happy afaict (but I haven't had my morning juice yet).

bo_reserve internally maps to a ww-mutex and task can already hold
ww-mutex (potentially even the same for especially nasty userspace).
So lockdep will complain and I think the only way to properly solve
this is to have lock-dropping slowpaths around all copy_*_user
callsites that already hold a bo_reserve ww_mutex. At least that's
been my conclusion after much head-banging against this issue for
drm/i915, and we've tried a lot approaches ;-)
-Daniel

--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-13 08:58:18

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

Op 13-09-13 10:23, Thomas Hellstrom schreef:
> On 09/13/2013 09:51 AM, Maarten Lankhorst wrote:
>> Op 13-09-13 09:46, Thomas Hellstrom schreef:
>>> On 09/13/2013 09:16 AM, Maarten Lankhorst wrote:
>>>> Op 13-09-13 08:44, Thomas Hellstrom schreef:
>>>>> On 09/12/2013 11:50 PM, Maarten Lankhorst wrote:
>>>>>> Op 12-09-13 18:44, Thomas Hellstrom schreef:
>>>>>>> On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
>>>>>>>> Op 12-09-13 17:36, Daniel Vetter schreef:
>>>>>>>>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>>>>>>>>> So I'm poking around the preemption code and stumbled upon:
>>>>>>>>>>
>>>>>>>>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>>>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>>>>>>>>
>>>>>>>>>> All these sites basically do:
>>>>>>>>>>
>>>>>>>>>> while (!trylock())
>>>>>>>>>> yield();
>>>>>>>>>>
>>>>>>>>>> which is a horrible and broken locking pattern.
>>>>>>>>>>
>>>>>>>>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>>>>>>>>> task that preempted the lock holder at FIFOn.
>>>>>>>>>>
>>>>>>>>>> Secondly the implementation is worse than usual by abusing
>>>>>>>>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>>>>>>>>> doesn't retry, but you're using it as a get out of fault path. And
>>>>>>>>>> you're using set_need_resched() which is not something a driver should
>>>>>>>>>> _ever_ touch.
>>>>>>>>>>
>>>>>>>>>> Now I'm going to take away set_need_resched() -- and while you can
>>>>>>>>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>>>>>>>>> because it will be broken due to changes to the preempt code.
>>>>>>>>>>
>>>>>>>>>> So please as to fix ASAP and don't allow anybody to trick you into
>>>>>>>>>> merging silly things like that again ;-)
>>>>>>>>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>>>>>>>>> removed. It was there to give the error handler a chance to sneak in
>>>>>>>>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>>>>>>>>> to the days when the locking around our error handler was somewhere
>>>>>>>>> between nonexistent and totally broken, nowadays we keep things from
>>>>>>>>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>>>>>>>>> whip up a patch to rip this out. I'll also check that our testsuite
>>>>>>>>> properly exercises this path (needs a bit of work on a quick look for
>>>>>>>>> better coverage).
>>>>>>>>>
>>>>>>>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>>>>>>>> into it's own pagefault handler and then deadlock, the trylock just
>>>>>>>>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>>>>>>>>> fun userspace did and now have testcases for them. The right solution
>>>>>>>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>>>>>>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>>>>>>>> temp allocation and then continues. At least that's how we've fixed
>>>>>>>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>>>>>>>> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>>>>>>>>
>>>>>>>> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
>>>>>>> I think a possible fix would be if fault() were allowed to return an error and drop the mmap_sem() before returning.
>>>>>>>
>>>>>>> Otherwise we need to track down all copy_to_user / copy_from_user which happen with bo::reserve held.
>>>>> Actually, from looking at the mm code, it seems OK to do the following:
>>>>>
>>>>> if (!bo_tryreserve()) {
>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>> }
>>>> Is this meant as a jab at me? You're doing locking wrong here! Again!
>>> It's not meant as a jab at you. I'm sorry if it came out that way. It was meant as a joke. I wasn't aware the topic was sensitive.
>>>
>>> Anyway, could you describe what is wrong, with the above solution, because it seems perfectly legal to me.
>>> There is no substantial overhead, and there is no risc of deadlocks. Or do you mean it's bad because it confuses lockdep?
>> Evil userspace can pass a bo as pointer to use for relocation lists, lockdep will warn when that locks up, but still..
>> This is already a problem now, and your fixing will only cause lockdep to explicitly warn on it.
>
> As previously mentioned, copy_from_user should return -EFAULT, since the VMAs are marked with VM_IO. It should not recurse into fault(), so evil user-space looses.
>
>>
>> You can make a complicated user program to test this, or simply use this function for debugging:
>> void ttm_might_fault(void) { struct reservation_object obj; reservation_object_init(&obj); ww_mutex_lock(&obj.lock, NULL); ww_mutex_unlock(&obj.lock); reservation_object_fini(&obj); }
>>
>> Put it near every instance of copy_to_user/copy_from_user and you'll find the bugs. :)
>
> I'm still not convinced that there are any problems with this solution. Did you take what's said above into account?
>
>
> Now, could we try to approach this based on pros and cons? Let's say we would be able to choose locking order without doing anything ugly. I'd put it like this:
>
> mmap_sem->bo_reserve:
> Good: Native locking order of VM subsystem. Good if we in the future will need to reserve in mmap().
> Bad: pwrite, pread, copy_to user, copy_from_user usage needs a slowpath that releases all locking, which has to be done in multiple places in multiple drivers. Grabbing the mmap_sem and then waiting for multiple possibly sleeping bo_reserves in slow paths will stall VMA write operations for this MM.
I think the good offsets the bad a million times here. Just because it's harder.

> bo_reserve->mmap_sem:
> Good: Natural locking order for all driver ioctls. Slowpath needs to be done in a single place, in common code.
> Bad: Bad if we ever need to perform bo_reserve in mmap.
Considering you're open coding a mutex_lock with the reserve/unreserve+trylock, I think this is a horrible approach. The possibility of a deadlock still exists too. :(
> In my view we have a clear winner. Given the problems i915 had when converting their driver, and the bashing they had to withstand, we have an even clearer winner.
>
> And then we need to take into account that, (given that I understand things correctly) lockdep will complain because it thinks there is a recursion that will never happen.
> That will make the bo_reserve->mmap_sem solution look bad, but is this really enough to justify giving it up?
>
> /Thomas
>

2013-09-13 09:00:17

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
> >> >>if (!bo_tryreserve()) {
> >> >> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
> >> >> bo_reserve(); // Wait for the BO to become available (interruptible)
> >> >> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
> >> >> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
> >> >>}
> >>
> >> Anyway, could you describe what is wrong, with the above solution, because
> >> it seems perfectly legal to me.
> >
> > Luckily the rule of law doesn't have anything to do with this stuff --
> > at least I sincerely hope so.
> >
> > The thing that's wrong with that pattern is that its still not
> > deterministic - although its a lot better than the pure trylock. Because
> > you have to release and re-acquire with the trylock another user might
> > have gotten in again. Its utterly prone to starvation.
> >
> > The acquire+release does remove the dead/life-lock scenario from the
> > FIFO case, since blocking on the acquire will allow the other task to
> > run (or even get boosted on -rt).
> >
> > Aside from that there's nothing particularly wrong with it and lockdep
> > should be happy afaict (but I haven't had my morning juice yet).
>
> bo_reserve internally maps to a ww-mutex and task can already hold
> ww-mutex (potentially even the same for especially nasty userspace).

OK, yes I wasn't aware of that. Yes in that case you're quite right.

2013-09-13 09:21:18

[permalink] [raw]

Subject: Re: [BUG] completely bonkers use of set_need_resched + VM_FAULT_NOPAGE

On 09/13/2013 10:58 AM, Maarten Lankhorst wrote:
> Op 13-09-13 10:23, Thomas Hellstrom schreef:
>> On 09/13/2013 09:51 AM, Maarten Lankhorst wrote:
>>> Op 13-09-13 09:46, Thomas Hellstrom schreef:
>>>> On 09/13/2013 09:16 AM, Maarten Lankhorst wrote:
>>>>> Op 13-09-13 08:44, Thomas Hellstrom schreef:
>>>>>> On 09/12/2013 11:50 PM, Maarten Lankhorst wrote:
>>>>>>> Op 12-09-13 18:44, Thomas Hellstrom schreef:
>>>>>>>> On 09/12/2013 05:45 PM, Maarten Lankhorst wrote:
>>>>>>>>> Op 12-09-13 17:36, Daniel Vetter schreef:
>>>>>>>>>> On Thu, Sep 12, 2013 at 5:06 PM, Peter Zijlstra <[email protected]> wrote:
>>>>>>>>>>> So I'm poking around the preemption code and stumbled upon:
>>>>>>>>>>>
>>>>>>>>>>> drivers/gpu/drm/i915/i915_gem.c: set_need_resched();
>>>>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c: set_need_resched();
>>>>>>>>>>> drivers/gpu/drm/udl/udl_gem.c: set_need_resched();
>>>>>>>>>>>
>>>>>>>>>>> All these sites basically do:
>>>>>>>>>>>
>>>>>>>>>>> while (!trylock())
>>>>>>>>>>> yield();
>>>>>>>>>>>
>>>>>>>>>>> which is a horrible and broken locking pattern.
>>>>>>>>>>>
>>>>>>>>>>> Firstly its deadlock prone, suppose the faulting process is a FIFOn+1
>>>>>>>>>>> task that preempted the lock holder at FIFOn.
>>>>>>>>>>>
>>>>>>>>>>> Secondly the implementation is worse than usual by abusing
>>>>>>>>>>> VM_FAULT_NOPAGE, which is supposed to install a PTE so that the fault
>>>>>>>>>>> doesn't retry, but you're using it as a get out of fault path. And
>>>>>>>>>>> you're using set_need_resched() which is not something a driver should
>>>>>>>>>>> _ever_ touch.
>>>>>>>>>>>
>>>>>>>>>>> Now I'm going to take away set_need_resched() -- and while you can
>>>>>>>>>>> 'reimplement' it using set_thread_flag() you're not going to do that
>>>>>>>>>>> because it will be broken due to changes to the preempt code.
>>>>>>>>>>>
>>>>>>>>>>> So please as to fix ASAP and don't allow anybody to trick you into
>>>>>>>>>>> merging silly things like that again ;-)
>>>>>>>>>> The set_need_resched in i915_gem.c:i915_gem_fault can actually be
>>>>>>>>>> removed. It was there to give the error handler a chance to sneak in
>>>>>>>>>> and reset the hw/sw tracking when the gpu is dead. That hack goes back
>>>>>>>>>> to the days when the locking around our error handler was somewhere
>>>>>>>>>> between nonexistent and totally broken, nowadays we keep things from
>>>>>>>>>> live-locking by a bit of magic in i915_mutex_lock_interruptible. I'll
>>>>>>>>>> whip up a patch to rip this out. I'll also check that our testsuite
>>>>>>>>>> properly exercises this path (needs a bit of work on a quick look for
>>>>>>>>>> better coverage).
>>>>>>>>>>
>>>>>>>>>> The one in ttm is just bonghits to shut up lockdep: ttm can recurse
>>>>>>>>>> into it's own pagefault handler and then deadlock, the trylock just
>>>>>>>>>> keeps lockdep quiet. We've had that bug arise in drm/i915 due to some
>>>>>>>>>> fun userspace did and now have testcases for them. The right solution
>>>>>>>>>> to fix this is to use copy_to|from_user_atomic in ttm everywhere it
>>>>>>>>>> holds locks and have slowpaths which drops locks, copies stuff into a
>>>>>>>>>> temp allocation and then continues. At least that's how we've fixed
>>>>>>>>>> all those inversions in i915-gem. I'm not volunteering to fix this ;-)
>>>>>>>>> Ah the case where a mmap'd address is passed to the execbuf ioctl? :P
>>>>>>>>>
>>>>>>>>> Fine I'll look into it a bit, hopefully before tuesday. Else it might take a bit longer since I'll be on my way to plumbers..
>>>>>>>> I think a possible fix would be if fault() were allowed to return an error and drop the mmap_sem() before returning.
>>>>>>>>
>>>>>>>> Otherwise we need to track down all copy_to_user / copy_from_user which happen with bo::reserve held.
>>>>>> Actually, from looking at the mm code, it seems OK to do the following:
>>>>>>
>>>>>> if (!bo_tryreserve()) {
>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>> }
>>>>> Is this meant as a jab at me? You're doing locking wrong here! Again!
>>>> It's not meant as a jab at you. I'm sorry if it came out that way. It was meant as a joke. I wasn't aware the topic was sensitive.
>>>>
>>>> Anyway, could you describe what is wrong, with the above solution, because it seems perfectly legal to me.
>>>> There is no substantial overhead, and there is no risc of deadlocks. Or do you mean it's bad because it confuses lockdep?
>>> Evil userspace can pass a bo as pointer to use for relocation lists, lockdep will warn when that locks up, but still..
>>> This is already a problem now, and your fixing will only cause lockdep to explicitly warn on it.
>> As previously mentioned, copy_from_user should return -EFAULT, since the VMAs are marked with VM_IO. It should not recurse into fault(), so evil user-space looses.
>>
>>> You can make a complicated user program to test this, or simply use this function for debugging:
>>> void ttm_might_fault(void) { struct reservation_object obj; reservation_object_init(&obj); ww_mutex_lock(&obj.lock, NULL); ww_mutex_unlock(&obj.lock); reservation_object_fini(&obj); }
>>>
>>> Put it near every instance of copy_to_user/copy_from_user and you'll find the bugs. :)
>> I'm still not convinced that there are any problems with this solution. Did you take what's said above into account?

8<---------------------------------------------------------------------------------------

>>
>> Now, could we try to approach this based on pros and cons? Let's say we would be able to choose locking order without doing anything ugly. I'd put it like this:
>>
>> mmap_sem->bo_reserve:
>> Good: Native locking order of VM subsystem. Good if we in the future will need to reserve in mmap().
>> Bad: pwrite, pread, copy_to user, copy_from_user usage needs a slowpath that releases all locking, which has to be done in multiple places in multiple drivers. Grabbing the mmap_sem and then waiting for multiple possibly sleeping bo_reserves in slow paths will stall VMA write operations for this MM.
> I think the good offsets the bad a million times here. Just because it's harder.
>
>> bo_reserve->mmap_sem:
>> Good: Natural locking order for all driver ioctls. Slowpath needs to be done in a single place, in common code.
>> Bad: Bad if we ever need to perform bo_reserve in mmap.
> Considering you're open coding a mutex_lock with the reserve/unreserve+trylock, I think this is a horrible approach. The possibility of a deadlock still exists too. :(

8<---------------------------------------------------------------------------

Maarten, the above section was intended to discuss the benefits of
different locking orders, assuming, as stated, that there
was a legitimatate solution to either choice.
Wording like "it's better bacause it's harder", "your implementation is
horrible", "half-assed" doesn't really add much.

I still think, from a code maintenance point of view that the
bo:reserve->mmap_sem locking order is superior, since we don't have the
same tools that the Intel guys have. But implementation doesn't seem
feasible ATM, both because as Peter pointed out, the pattern is
non-deterministic, and as Daniel correctly pointed out, the copy_xx_user
paths use regular page-faults, which makes recursion possible.

So I guess we need to take the mmap_sem->bo:reserve route after all.

/Thomas

>> In my view we have a clear winner. Given the problems i915 had when converting their driver, and the bashing they had to withstand, we have an even clearer winner.
>>
>> And then we need to take into account that, (given that I understand things correctly) lockdep will complain because it thinks there is a recursion that will never happen.
>> That will make the bo_reserve->mmap_sem solution look bad, but is this really enough to justify giving it up?
>>
>> /Thomas
>>

2013-09-23 15:33:31

[permalink] [raw]

Subject: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

Hey,

Op 13-09-13 11:00, Peter Zijlstra schreef:
> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>> if (!bo_tryreserve()) {
>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>> }
>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>> it seems perfectly legal to me.
>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>> at least I sincerely hope so.
>>>
>>> The thing that's wrong with that pattern is that its still not
>>> deterministic - although its a lot better than the pure trylock. Because
>>> you have to release and re-acquire with the trylock another user might
>>> have gotten in again. Its utterly prone to starvation.
>>>
>>> The acquire+release does remove the dead/life-lock scenario from the
>>> FIFO case, since blocking on the acquire will allow the other task to
>>> run (or even get boosted on -rt).
>>>
>>> Aside from that there's nothing particularly wrong with it and lockdep
>>> should be happy afaict (but I haven't had my morning juice yet).
>> bo_reserve internally maps to a ww-mutex and task can already hold
>> ww-mutex (potentially even the same for especially nasty userspace).
> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>
I added a RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.

This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
Nouveau was a bit of a headache, but afaict it should work.

In almost all cases relocs are not updated, so I kept intact the fastpath
of not copying relocs from userspace. The slowpath tries to copy it atomically,
and if that fails it will unreserve all bo's and copy everything.

One thing to note is that the command submission ioctl may fail now with -EFAULT
if presumed cannot be updated, while the commands are submitted succesfully.

I'm not sure what the right behavior was here, and this can only happen if you
touch the memory during the ioctl or use a read-only page. Both of them are not done
in the common case.

Reviews welcome. :P

8<---

diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index e4d60e7..2964bb7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -445,8 +445,6 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
uint64_t user_pbbo_ptr)
{
struct nouveau_drm *drm = chan->drm;
- struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
- (void __force __user *)(uintptr_t)user_pbbo_ptr;
struct nouveau_bo *nvbo;
int ret, relocs = 0;

@@ -475,7 +473,7 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
return ret;
}

- if (nv_device(drm->device)->card_type < NV_50) {
+ if (nv_device(drm->device)->card_type < NV_50 && !relocs) {
if (nvbo->bo.offset == b->presumed.offset &&
((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
@@ -483,53 +481,86 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART)))
continue;

- if (nvbo->bo.mem.mem_type == TTM_PL_TT)
- b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
- else
- b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
- b->presumed.offset = nvbo->bo.offset;
- b->presumed.valid = 0;
- relocs++;
-
- if (DRM_COPY_TO_USER(&upbbo[nvbo->pbbo_index].presumed,
- &b->presumed, sizeof(b->presumed)))
- return -EFAULT;
+ relocs = 1;
}
}

return relocs;
}

+static inline void
+u_free(void *addr)
+{
+ if (!is_vmalloc_addr(addr))
+ kfree(addr);
+ else
+ vfree(addr);
+}
+
+static inline void *
+u_memcpya(uint64_t user, unsigned nmemb, unsigned size, unsigned inatomic)
+{
+ void *mem;
+ void __user *userptr = (void __force __user *)(uintptr_t)user;
+
+ size *= nmemb;
+
+ mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
+ if (!mem)
+ mem = vmalloc(size);
+ if (!mem)
+ return ERR_PTR(-ENOMEM);
+
+ if (inatomic && (!access_ok(VERIFY_READ, userptr, size) ||
+ __copy_from_user_inatomic(mem, userptr, size))) {
+ u_free(mem);
+ return ERR_PTR(-EFAULT);
+ } else if (!inatomic && copy_from_user(mem, userptr, size)) {
+ u_free(mem);
+ return ERR_PTR(-EFAULT);
+ }
+
+ return mem;
+}
+
+static int
+nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
+ struct drm_nouveau_gem_pushbuf *req,
+ struct drm_nouveau_gem_pushbuf_bo *bo,
+ struct drm_nouveau_gem_pushbuf_reloc *reloc);
+
static int
nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
struct drm_file *file_priv,
struct drm_nouveau_gem_pushbuf_bo *pbbo,
+ struct drm_nouveau_gem_pushbuf *req,
uint64_t user_buffers, int nr_buffers,
- struct validate_op *op, int *apply_relocs)
+ struct validate_op *op, int *do_reloc)
{
struct nouveau_cli *cli = nouveau_cli(file_priv);
int ret, relocs = 0;
+ struct drm_nouveau_gem_pushbuf_reloc *reloc = NULL;
+
+ if (nr_buffers == 0)
+ return 0;

+restart:
INIT_LIST_HEAD(&op->vram_list);
INIT_LIST_HEAD(&op->gart_list);
INIT_LIST_HEAD(&op->both_list);

- if (nr_buffers == 0)
- return 0;
-
ret = validate_init(chan, file_priv, pbbo, nr_buffers, op);
if (unlikely(ret)) {
if (ret != -ERESTARTSYS)
NV_ERROR(cli, "validate_init\n");
- return ret;
+ goto err;
}

ret = validate_list(chan, cli, &op->vram_list, pbbo, user_buffers);
if (unlikely(ret < 0)) {
if (ret != -ERESTARTSYS)
NV_ERROR(cli, "validate vram_list\n");
- validate_fini(op, NULL);
- return ret;
+ goto err_fini;
}
relocs += ret;

@@ -537,8 +568,7 @@ nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
if (unlikely(ret < 0)) {
if (ret != -ERESTARTSYS)
NV_ERROR(cli, "validate gart_list\n");
- validate_fini(op, NULL);
- return ret;
+ goto err_fini;
}
relocs += ret;

@@ -546,58 +576,93 @@ nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
if (unlikely(ret < 0)) {
if (ret != -ERESTARTSYS)
NV_ERROR(cli, "validate both_list\n");
- validate_fini(op, NULL);
- return ret;
+ goto err_fini;
}
relocs += ret;
+ if (relocs) {
+ if (!reloc) {
+ //reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc), 1);
+ reloc = ERR_PTR(-EFAULT); NV_ERROR(cli, "slowpath!\n");
+ }
+ if (IS_ERR(reloc)) {
+ validate_fini(op, NULL);
+
+ if (PTR_ERR(reloc) == -EFAULT)
+ reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc), 0);
+
+ if (IS_ERR(reloc))
+ return PTR_ERR(reloc);
+ goto restart;
+ }
+
+ ret = nouveau_gem_pushbuf_reloc_apply(cli, req, pbbo, reloc);
+ if (ret) {
+ NV_ERROR(cli, "reloc apply: %d\n", ret);
+ /* No validate_fini, already called. */
+ return ret;
+ }
+ u_free(reloc);
+ *do_reloc = 1;
+ }

- *apply_relocs = relocs;
return 0;
-}

-static inline void
-u_free(void *addr)
-{
- if (!is_vmalloc_addr(addr))
- kfree(addr);
- else
- vfree(addr);
+err_fini:
+ validate_fini(op, NULL);
+err:
+ if (reloc)
+ u_free(reloc);
+ return ret;
}

-static inline void *
-u_memcpya(uint64_t user, unsigned nmemb, unsigned size)
+static int
+nouveau_gem_pushbuf_reloc_copy_to_user(struct drm_nouveau_gem_pushbuf *req,
+ struct drm_nouveau_gem_pushbuf_bo *bo)
{
- void *mem;
- void __user *userptr = (void __force __user *)(uintptr_t)user;
+ struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
+ (void __force __user *)(uintptr_t)req->buffers;
+ unsigned i;

- size *= nmemb;
+ for (i = 0; i < req->nr_buffers; ++i) {
+ struct drm_nouveau_gem_pushbuf_bo *b = &bo[i];

- mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
- if (!mem)
- mem = vmalloc(size);
- if (!mem)
- return ERR_PTR(-ENOMEM);
-
- if (DRM_COPY_FROM_USER(mem, userptr, size)) {
- u_free(mem);
- return ERR_PTR(-EFAULT);
+ if (!b->presumed.valid &&
+ copy_to_user(&upbbo[i].presumed, &b->presumed, sizeof(b->presumed)))
+ return -EFAULT;
}
-
- return mem;
+ return 0;
}

static int
nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
struct drm_nouveau_gem_pushbuf *req,
- struct drm_nouveau_gem_pushbuf_bo *bo)
+ struct drm_nouveau_gem_pushbuf_bo *bo,
+ struct drm_nouveau_gem_pushbuf_reloc *reloc)
{
- struct drm_nouveau_gem_pushbuf_reloc *reloc = NULL;
int ret = 0;
unsigned i;

- reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc));
- if (IS_ERR(reloc))
- return PTR_ERR(reloc);
+ for (i = 0; i < req->nr_buffers; ++i) {
+ struct drm_nouveau_gem_pushbuf_bo *b = &bo[i];
+ struct nouveau_bo *nvbo = (void *)(unsigned long)
+ bo[i].user_priv;
+
+ if (nvbo->bo.offset == b->presumed.offset &&
+ ((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
+ b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
+ (nvbo->bo.mem.mem_type == TTM_PL_TT &&
+ b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART))) {
+ b->presumed.valid = 1;
+ continue;
+ }
+
+ if (nvbo->bo.mem.mem_type == TTM_PL_TT)
+ b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
+ else
+ b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
+ b->presumed.offset = nvbo->bo.offset;
+ b->presumed.valid = 0;
+ }

for (i = 0; i < req->nr_relocs; i++) {
struct drm_nouveau_gem_pushbuf_reloc *r = &reloc[i];
@@ -664,8 +729,6 @@ nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,

nouveau_bo_wr32(nvbo, r->reloc_bo_offset >> 2, data);
}
-
- u_free(reloc);
return ret;
}

@@ -721,11 +784,11 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
return nouveau_abi16_put(abi16, -EINVAL);
}

- push = u_memcpya(req->push, req->nr_push, sizeof(*push));
+ push = u_memcpya(req->push, req->nr_push, sizeof(*push), 0);
if (IS_ERR(push))
return nouveau_abi16_put(abi16, PTR_ERR(push));

- bo = u_memcpya(req->buffers, req->nr_buffers, sizeof(*bo));
+ bo = u_memcpya(req->buffers, req->nr_buffers, sizeof(*bo), 0);
if (IS_ERR(bo)) {
u_free(push);
return nouveau_abi16_put(abi16, PTR_ERR(bo));
@@ -741,7 +804,7 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
}

/* Validate buffer list */
- ret = nouveau_gem_pushbuf_validate(chan, file_priv, bo, req->buffers,
+ ret = nouveau_gem_pushbuf_validate(chan, file_priv, bo, req, req->buffers,
req->nr_buffers, &op, &do_reloc);
if (ret) {
if (ret != -ERESTARTSYS)
@@ -749,15 +812,6 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
goto out_prevalid;
}

- /* Apply any relocations that are required */
- if (do_reloc) {
- ret = nouveau_gem_pushbuf_reloc_apply(cli, req, bo);
- if (ret) {
- NV_ERROR(cli, "reloc apply: %d\n", ret);
- goto out;
- }
- }
-
if (chan->dma.ib_max) {
ret = nouveau_dma_wait(chan, req->nr_push + 1, 16);
if (ret) {
@@ -837,6 +891,17 @@ out:
validate_fini(&op, fence);
nouveau_fence_unref(&fence);

+ if (do_reloc && !ret) {
+ ret = nouveau_gem_pushbuf_reloc_copy_to_user(req, bo);
+ if (ret) {
+ NV_ERROR(cli, "error copying presumed back to userspace: %d\n", ret);
+ /*
+ * XXX: We return -EFAULT, but command submission
+ * has already been completed.
+ */
+ }
+ }
+
out_prevalid:
u_free(bo);
u_free(push);
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 1006c15..829e911 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -64,12 +64,9 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
* for reserve, and if it fails, retry the fault after scheduling.
*/

- ret = ttm_bo_reserve(bo, true, true, false, 0);
- if (unlikely(ret != 0)) {
- if (ret == -EBUSY)
- set_need_resched();
+ ret = ttm_bo_reserve(bo, true, false, false, 0);
+ if (unlikely(ret != 0))
return VM_FAULT_NOPAGE;
- }

if (bdev->driver->fault_reserve_notify) {
ret = bdev->driver->fault_reserve_notify(bo);
@@ -77,6 +74,7 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
case 0:
break;
case -EBUSY:
+ WARN_ON(1);
set_need_resched();
case -ERESTARTSYS:
retval = VM_FAULT_NOPAGE;

2013-09-24 07:22:25

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
> Hey,
>
> Op 13-09-13 11:00, Peter Zijlstra schreef:
>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>> if (!bo_tryreserve()) {
>>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>> }
>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>> it seems perfectly legal to me.
>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>> at least I sincerely hope so.
>>>>
>>>> The thing that's wrong with that pattern is that its still not
>>>> deterministic - although its a lot better than the pure trylock. Because
>>>> you have to release and re-acquire with the trylock another user might
>>>> have gotten in again. Its utterly prone to starvation.
>>>>
>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>> run (or even get boosted on -rt).
>>>>
>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>> should be happy afaict (but I haven't had my morning juice yet).
>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>> ww-mutex (potentially even the same for especially nasty userspace).
>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>
> I added a RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>
> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
> Nouveau was a bit of a headache, but afaict it should work.
>
> In almost all cases relocs are not updated, so I kept intact the fastpath
> of not copying relocs from userspace. The slowpath tries to copy it atomically,
> and if that fails it will unreserve all bo's and copy everything.
>
> One thing to note is that the command submission ioctl may fail now with -EFAULT
> if presumed cannot be updated, while the commands are submitted succesfully.

I think the Nouveau guys need to comment further on this, but returning
-EFAULT might break existing user-space, and that's not allowed, but
IIRC the return value of "presumed" is only a hint, and if it's
incorrect will only trigger future command stream patching.

Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell the
vmwgfx driver doesn't need any fixups.
>
> I'm not sure what the right behavior was here, and this can only happen if you
> touch the memory during the ioctl or use a read-only page. Both of them are not done
> in the common case.
>
> Reviews welcome. :P
>
> 8<---
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
> index e4d60e7..2964bb7 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
> @@ -445,8 +445,6 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
> uint64_t user_pbbo_ptr)
> {
> struct nouveau_drm *drm = chan->drm;
> - struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
> - (void __force __user *)(uintptr_t)user_pbbo_ptr;
> struct nouveau_bo *nvbo;
> int ret, relocs = 0;
>
> @@ -475,7 +473,7 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
> return ret;
> }
>
> - if (nv_device(drm->device)->card_type < NV_50) {
> + if (nv_device(drm->device)->card_type < NV_50 && !relocs) {
> if (nvbo->bo.offset == b->presumed.offset &&
> ((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
> b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
> @@ -483,53 +481,86 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
> b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART)))
> continue;
>
> - if (nvbo->bo.mem.mem_type == TTM_PL_TT)
> - b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
> - else
> - b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
> - b->presumed.offset = nvbo->bo.offset;
> - b->presumed.valid = 0;
> - relocs++;
> -
> - if (DRM_COPY_TO_USER(&upbbo[nvbo->pbbo_index].presumed,
> - &b->presumed, sizeof(b->presumed)))
> - return -EFAULT;
> + relocs = 1;
> }
> }
>
> return relocs;
> }
>
> +static inline void
> +u_free(void *addr)
> +{
> + if (!is_vmalloc_addr(addr))
> + kfree(addr);
> + else
> + vfree(addr);
> +}

Isn't there a DRM utilty for this?

> +
> +static inline void *
> +u_memcpya(uint64_t user, unsigned nmemb, unsigned size, unsigned inatomic)
> +{
> + void *mem;
> + void __user *userptr = (void __force __user *)(uintptr_t)user;
> +
> + size *= nmemb;
> +
> + mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> + if (!mem)
> + mem = vmalloc(size);
And for the above as well?

> + if (!mem)
> + return ERR_PTR(-ENOMEM);
> +
> + if (inatomic && (!access_ok(VERIFY_READ, userptr, size) ||
> + __copy_from_user_inatomic(mem, userptr, size))) {
> + u_free(mem);
> + return ERR_PTR(-EFAULT);
> + } else if (!inatomic && copy_from_user(mem, userptr, size)) {
> + u_free(mem);
> + return ERR_PTR(-EFAULT);
> + }
> +
> + return mem;
> +}
> +
> +static int
> +nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
> + struct drm_nouveau_gem_pushbuf *req,
> + struct drm_nouveau_gem_pushbuf_bo *bo,
> + struct drm_nouveau_gem_pushbuf_reloc *reloc);
> +
> static int
> nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
> struct drm_file *file_priv,
> struct drm_nouveau_gem_pushbuf_bo *pbbo,
> + struct drm_nouveau_gem_pushbuf *req,
> uint64_t user_buffers, int nr_buffers,
> - struct validate_op *op, int *apply_relocs)
> + struct validate_op *op, int *do_reloc)
> {
> struct nouveau_cli *cli = nouveau_cli(file_priv);
> int ret, relocs = 0;
> + struct drm_nouveau_gem_pushbuf_reloc *reloc = NULL;
> +
> + if (nr_buffers == 0)
> + return 0;
>
> +restart:
> INIT_LIST_HEAD(&op->vram_list);
> INIT_LIST_HEAD(&op->gart_list);
> INIT_LIST_HEAD(&op->both_list);
>
> - if (nr_buffers == 0)
> - return 0;
> -
> ret = validate_init(chan, file_priv, pbbo, nr_buffers, op);
> if (unlikely(ret)) {
> if (ret != -ERESTARTSYS)
> NV_ERROR(cli, "validate_init\n");
> - return ret;
> + goto err;
> }
>
> ret = validate_list(chan, cli, &op->vram_list, pbbo, user_buffers);
> if (unlikely(ret < 0)) {
> if (ret != -ERESTARTSYS)
> NV_ERROR(cli, "validate vram_list\n");
> - validate_fini(op, NULL);
> - return ret;
> + goto err_fini;
> }
> relocs += ret;
>
> @@ -537,8 +568,7 @@ nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
> if (unlikely(ret < 0)) {
> if (ret != -ERESTARTSYS)
> NV_ERROR(cli, "validate gart_list\n");
> - validate_fini(op, NULL);
> - return ret;
> + goto err_fini;
> }
> relocs += ret;
>
> @@ -546,58 +576,93 @@ nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
> if (unlikely(ret < 0)) {
> if (ret != -ERESTARTSYS)
> NV_ERROR(cli, "validate both_list\n");
> - validate_fini(op, NULL);
> - return ret;
> + goto err_fini;
> }
> relocs += ret;
> + if (relocs) {
> + if (!reloc) {
> + //reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc), 1);
> + reloc = ERR_PTR(-EFAULT); NV_ERROR(cli, "slowpath!\n");
> + }
> + if (IS_ERR(reloc)) {
> + validate_fini(op, NULL);
> +
> + if (PTR_ERR(reloc) == -EFAULT)
> + reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc), 0);
> +
> + if (IS_ERR(reloc))
> + return PTR_ERR(reloc);
> + goto restart;
> + }
> +
> + ret = nouveau_gem_pushbuf_reloc_apply(cli, req, pbbo, reloc);
> + if (ret) {
> + NV_ERROR(cli, "reloc apply: %d\n", ret);
> + /* No validate_fini, already called. */
> + return ret;
> + }
> + u_free(reloc);
> + *do_reloc = 1;
> + }
>
> - *apply_relocs = relocs;
> return 0;
> -}
>
> -static inline void
> -u_free(void *addr)
> -{
> - if (!is_vmalloc_addr(addr))
> - kfree(addr);
> - else
> - vfree(addr);
> +err_fini:
> + validate_fini(op, NULL);
> +err:
> + if (reloc)
> + u_free(reloc);
> + return ret;
> }
>
> -static inline void *
> -u_memcpya(uint64_t user, unsigned nmemb, unsigned size)
> +static int
> +nouveau_gem_pushbuf_reloc_copy_to_user(struct drm_nouveau_gem_pushbuf *req,
> + struct drm_nouveau_gem_pushbuf_bo *bo)
> {
> - void *mem;
> - void __user *userptr = (void __force __user *)(uintptr_t)user;
> + struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
> + (void __force __user *)(uintptr_t)req->buffers;
> + unsigned i;
>
> - size *= nmemb;
> + for (i = 0; i < req->nr_buffers; ++i) {
> + struct drm_nouveau_gem_pushbuf_bo *b = &bo[i];
>
> - mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
> - if (!mem)
> - mem = vmalloc(size);
> - if (!mem)
> - return ERR_PTR(-ENOMEM);
> -
> - if (DRM_COPY_FROM_USER(mem, userptr, size)) {
> - u_free(mem);
> - return ERR_PTR(-EFAULT);
> + if (!b->presumed.valid &&
> + copy_to_user(&upbbo[i].presumed, &b->presumed, sizeof(b->presumed)))
> + return -EFAULT;
> }
> -
> - return mem;
> + return 0;
> }
>
> static int
> nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
> struct drm_nouveau_gem_pushbuf *req,
> - struct drm_nouveau_gem_pushbuf_bo *bo)
> + struct drm_nouveau_gem_pushbuf_bo *bo,
> + struct drm_nouveau_gem_pushbuf_reloc *reloc)
> {
> - struct drm_nouveau_gem_pushbuf_reloc *reloc = NULL;
> int ret = 0;
> unsigned i;
>
> - reloc = u_memcpya(req->relocs, req->nr_relocs, sizeof(*reloc));
> - if (IS_ERR(reloc))
> - return PTR_ERR(reloc);
> + for (i = 0; i < req->nr_buffers; ++i) {
> + struct drm_nouveau_gem_pushbuf_bo *b = &bo[i];
> + struct nouveau_bo *nvbo = (void *)(unsigned long)
> + bo[i].user_priv;
> +
> + if (nvbo->bo.offset == b->presumed.offset &&
> + ((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
> + b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
> + (nvbo->bo.mem.mem_type == TTM_PL_TT &&
> + b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART))) {
> + b->presumed.valid = 1;
> + continue;
> + }
> +
> + if (nvbo->bo.mem.mem_type == TTM_PL_TT)
> + b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
> + else
> + b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
> + b->presumed.offset = nvbo->bo.offset;
> + b->presumed.valid = 0;
> + }
>
> for (i = 0; i < req->nr_relocs; i++) {
> struct drm_nouveau_gem_pushbuf_reloc *r = &reloc[i];
> @@ -664,8 +729,6 @@ nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
>
> nouveau_bo_wr32(nvbo, r->reloc_bo_offset >> 2, data);
> }
> -
> - u_free(reloc);
> return ret;
> }
>
> @@ -721,11 +784,11 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
> return nouveau_abi16_put(abi16, -EINVAL);
> }
>
> - push = u_memcpya(req->push, req->nr_push, sizeof(*push));
> + push = u_memcpya(req->push, req->nr_push, sizeof(*push), 0);
> if (IS_ERR(push))
> return nouveau_abi16_put(abi16, PTR_ERR(push));
>
> - bo = u_memcpya(req->buffers, req->nr_buffers, sizeof(*bo));
> + bo = u_memcpya(req->buffers, req->nr_buffers, sizeof(*bo), 0);
> if (IS_ERR(bo)) {
> u_free(push);
> return nouveau_abi16_put(abi16, PTR_ERR(bo));
> @@ -741,7 +804,7 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
> }
>
> /* Validate buffer list */
> - ret = nouveau_gem_pushbuf_validate(chan, file_priv, bo, req->buffers,
> + ret = nouveau_gem_pushbuf_validate(chan, file_priv, bo, req, req->buffers,
> req->nr_buffers, &op, &do_reloc);
> if (ret) {
> if (ret != -ERESTARTSYS)
> @@ -749,15 +812,6 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data,
> goto out_prevalid;
> }
>
> - /* Apply any relocations that are required */
> - if (do_reloc) {
> - ret = nouveau_gem_pushbuf_reloc_apply(cli, req, bo);
> - if (ret) {
> - NV_ERROR(cli, "reloc apply: %d\n", ret);
> - goto out;
> - }
> - }
> -
> if (chan->dma.ib_max) {
> ret = nouveau_dma_wait(chan, req->nr_push + 1, 16);
> if (ret) {
> @@ -837,6 +891,17 @@ out:
> validate_fini(&op, fence);
> nouveau_fence_unref(&fence);
>
> + if (do_reloc && !ret) {
> + ret = nouveau_gem_pushbuf_reloc_copy_to_user(req, bo);
> + if (ret) {
> + NV_ERROR(cli, "error copying presumed back to userspace: %d\n", ret);
> + /*
> + * XXX: We return -EFAULT, but command submission
> + * has already been completed.
> + */
> + }
> + }
> +
> out_prevalid:
> u_free(bo);
> u_free(push);
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 1006c15..829e911 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -64,12 +64,9 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> * for reserve, and if it fails, retry the fault after scheduling.
> */
>
> - ret = ttm_bo_reserve(bo, true, true, false, 0);
> - if (unlikely(ret != 0)) {
> - if (ret == -EBUSY)
> - set_need_resched();
> + ret = ttm_bo_reserve(bo, true, false, false, 0);
> + if (unlikely(ret != 0))
> return VM_FAULT_NOPAGE;
> - }
>

Actually, it's not the locking order bo:reserve -> mmap_sem that
triggers the lockdep warning, right? but the fact that copy_from_user
could recurse into the fault handler? Grabbing bo:reseves recursively?
which should leave us free to choose locking order at this point.

> if (bdev->driver->fault_reserve_notify) {
> ret = bdev->driver->fault_reserve_notify(bo);
> @@ -77,6 +74,7 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> case 0:
> break;
> case -EBUSY:
> + WARN_ON(1);
> set_need_resched();

I don't think this is used anymore, so set_need_resched might go.

> case -ERESTARTSYS:
> retval = VM_FAULT_NOPAGE;
/Thomas

2013-09-24 07:35:07

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

Op 24-09-13 09:22, Thomas Hellstrom schreef:
> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>> Hey,
>>
>> Op 13-09-13 11:00, Peter Zijlstra schreef:
>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>> }
>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>> it seems perfectly legal to me.
>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>> at least I sincerely hope so.
>>>>>
>>>>> The thing that's wrong with that pattern is that its still not
>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>> you have to release and re-acquire with the trylock another user might
>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>
>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>> run (or even get boosted on -rt).
>>>>>
>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>
>> I added a RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>
>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>> Nouveau was a bit of a headache, but afaict it should work.
>>
>> In almost all cases relocs are not updated, so I kept intact the fastpath
>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>> and if that fails it will unreserve all bo's and copy everything.
>>
>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>> if presumed cannot be updated, while the commands are submitted succesfully.
>
> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>
> Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell the vmwgfx driver doesn't need any fixups.
Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
which would probably result in libdrm crashing shortly afterwards anyway.

So I don't know whether to swallow that -EFAULT or not, which is what I asked.

And for vmwgfx just run it through PROVE_LOCKING and make sure all the copy_to/from_user call sites are called at least once, and then you can be certain it doesn't need fixups. ;)
Lockdep ftw..

>>
>> I'm not sure what the right behavior was here, and this can only happen if you
>> touch the memory during the ioctl or use a read-only page. Both of them are not done
>> in the common case.
>>
>> Reviews welcome. :P
>>
>> 8<---
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
>> index e4d60e7..2964bb7 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
>> @@ -445,8 +445,6 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
>> uint64_t user_pbbo_ptr)
>> {
>> struct nouveau_drm *drm = chan->drm;
>> - struct drm_nouveau_gem_pushbuf_bo __user *upbbo =
>> - (void __force __user *)(uintptr_t)user_pbbo_ptr;
>> struct nouveau_bo *nvbo;
>> int ret, relocs = 0;
>> @@ -475,7 +473,7 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
>> return ret;
>> }
>> - if (nv_device(drm->device)->card_type < NV_50) {
>> + if (nv_device(drm->device)->card_type < NV_50 && !relocs) {
>> if (nvbo->bo.offset == b->presumed.offset &&
>> ((nvbo->bo.mem.mem_type == TTM_PL_VRAM &&
>> b->presumed.domain & NOUVEAU_GEM_DOMAIN_VRAM) ||
>> @@ -483,53 +481,86 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli,
>> b->presumed.domain & NOUVEAU_GEM_DOMAIN_GART)))
>> continue;
>> - if (nvbo->bo.mem.mem_type == TTM_PL_TT)
>> - b->presumed.domain = NOUVEAU_GEM_DOMAIN_GART;
>> - else
>> - b->presumed.domain = NOUVEAU_GEM_DOMAIN_VRAM;
>> - b->presumed.offset = nvbo->bo.offset;
>> - b->presumed.valid = 0;
>> - relocs++;
>> -
>> - if (DRM_COPY_TO_USER(&upbbo[nvbo->pbbo_index].presumed,
>> - &b->presumed, sizeof(b->presumed)))
>> - return -EFAULT;
>> + relocs = 1;
>> }
>> }
>> return relocs;
>> }
>> +static inline void
>> +u_free(void *addr)
>> +{
>> + if (!is_vmalloc_addr(addr))
>> + kfree(addr);
>> + else
>> + vfree(addr);
>> +}
>
> Isn't there a DRM utilty for this?
Oops, seems you're right..

Those functions can be replaced with drm_malloc_ab and drm_free_large.
>> +
>> +static inline void *
>> +u_memcpya(uint64_t user, unsigned nmemb, unsigned size, unsigned inatomic)
>> +{
>> + void *mem;
>> + void __user *userptr = (void __force __user *)(uintptr_t)user;
>> +
>> + size *= nmemb;
>> +
>> + mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
>> + if (!mem)
>> + mem = vmalloc(size);
> And for the above as well?
Indeed..
>
>> ...
>> out_prevalid:
>> u_free(bo);
>> u_free(push);
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 1006c15..829e911 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -64,12 +64,9 @@ static int ttm_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> * for reserve, and if it fails, retry the fault after scheduling.
>> */
>> - ret = ttm_bo_reserve(bo, true, true, false, 0);
>> - if (unlikely(ret != 0)) {
>> - if (ret == -EBUSY)
>> - set_need_resched();
>> + ret = ttm_bo_reserve(bo, true, false, false, 0);
>> + if (unlikely(ret != 0))
>> return VM_FAULT_NOPAGE;
>> - }
>>
>
> Actually, it's not the locking order bo:reserve -> mmap_sem that triggers the lockdep warning, right? but the fact that copy_from_user could recurse into the fault handler? Grabbing bo:reseves recursively? which should leave us free to choose locking order at this point.
Same thing.

When PROVE_LOCKING is enabled the might_fault calls in copy_to/from_user do a fake locking of mmap_sem, which means all locking errors, provided that the reservation lock is taken at least once with mmap_sem held (eg the ttm fault handler is called at least once, it can be done separately, it doesn't need to be done in the same syscall). So any bugs will be found. The only thing to worry about is that lockdep turns off after finding 1 error, so you have to make sure it doesn't bomb out before completing tests, which is sometimes a problem on early rc kernels. ;)

~Maarten

2013-09-24 08:20:54

by Ingo Molnar

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

* Maarten Lankhorst <[email protected]> wrote:

> > I think the Nouveau guys need to comment further on this, but
> > returning -EFAULT might break existing user-space, and that's not
> > allowed, but IIRC the return value of "presumed" is only a hint, and
> > if it's incorrect will only trigger future command stream patching.
> >
> > Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell
> > the vmwgfx driver doesn't need any fixups.
>
> Well because we read the list of buffer objects the presumed offsets are
> at least read-mapped. Although I guess in the worst case the mapping
> might disappear before the syscall copies back the data. So if -EFAULT
> happens here then userspace messed up in some way, either by forgetting
> to map the offsets read-write, which cannot happen with libdrm or
> free'ing the bo list before the syscall returns, which would probably
> result in libdrm crashing shortly afterwards anyway.
>
> So I don't know whether to swallow that -EFAULT or not, which is what I
> asked.

In such a case returning -EFAULT is very strongly preferred.

If there's a user-space bug, such as a context life time race between
graphics context creation/destruction and user threads making use of the
graphics context, then getting the -EFAULT would be very helpful to
user-space debugging as well.

Thanks,

Ingo

2013-09-24 09:03:47

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
> Op 24-09-13 09:22, Thomas Hellstrom schreef:
>> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>>> Hey,
>>>
>>> Op 13-09-13 11:00, Peter Zijlstra schreef:
>>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
>>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>>> }
>>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>>> it seems perfectly legal to me.
>>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>>> at least I sincerely hope so.
>>>>>>
>>>>>> The thing that's wrong with that pattern is that its still not
>>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>>> you have to release and re-acquire with the trylock another user might
>>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>>
>>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>>> run (or even get boosted on -rt).
>>>>>>
>>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>>
>>> I added a RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>>
>>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>>> Nouveau was a bit of a headache, but afaict it should work.
>>>
>>> In almost all cases relocs are not updated, so I kept intact the fastpath
>>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>>> and if that fails it will unreserve all bo's and copy everything.
>>>
>>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>>> if presumed cannot be updated, while the commands are submitted succesfully.
>> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>>
>> Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell the vmwgfx driver doesn't need any fixups.
> Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
> So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
> which would probably result in libdrm crashing shortly afterwards anyway.

Hmm, is the list of buffer objects (and the "presumed" members) really
in DRM memory? Because if it resides or may reside in anonymous system
memory, it may well be paged out between reading and writing, in which
case the -EFAULT return is incorrect.

In fact failures of pushbuf / execbuf *after* commands are successfully
submitted are generally very hard to recover from. That's why the kernel
should do whatever it takes to recover from such failures, and
user-space should do whatever it takes to recover from copy-to-user
failures of needed info from the kernel, and it really depends on the
user-space usage pattern of "presumed". IIRC the original reason for
copying it back to user-space was, that if a relocation offsets were
patched up by the kernel, and then the process was sent a signal causing
it to retry execbuf, then "presumed" had to be updated, otherwise it
would be inconsistent with what's currently in the command stream, which
is very bad. If "presumed" is, however, only used by user-space to guess
an offset, the correct action would be to swallow the -EFAULT.

>
> So I don't know whether to swallow that -EFAULT or not, which is what I asked.
>
> And for vmwgfx just run it through PROVE_LOCKING and make sure all the copy_to/from_user call sites are called at least once, and then you can be certain it doesn't need fixups. ;)
> Lockdep ftw..
Will do that.

>
>> Actually, it's not the locking order bo:reserve -> mmap_sem that triggers the lockdep warning, right? but the fact that copy_from_user could recurse into the fault handler? Grabbing bo:reseves recursively? which should leave us free to choose locking order at this point.
> Same thing.
>
> When PROVE_LOCKING is enabled the might_fault calls in copy_to/from_user do a fake locking of mmap_sem, which means all locking errors, provided that the reservation lock is taken at least once with mmap_sem held (eg the ttm fault handler is called at least once, it can be done separately, it doesn't need to be done in the same syscall). So any bugs will be found. The only thing to worry about is that lockdep turns off after finding 1 error, so you have to make sure it doesn't bomb out before completing tests, which is sometimes a problem on early rc kernels. ;)
>
> ~Maarten

My point was that when we only have copy_[to|from]_user_inatomic inside
any bo:reservations, the might_fault would never be called inside any
reservations and we should, in principle, be free to choose locking
order, provided of course it could be done cleanly in fault()?

/Thomas

2013-09-24 09:36:34

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

On Tue, Sep 24, 2013 at 11:03:37AM +0200, Thomas Hellstrom wrote:
> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
> >Op 24-09-13 09:22, Thomas Hellstrom schreef:
> >>On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
> >>>Hey,
> >>>
> >>>Op 13-09-13 11:00, Peter Zijlstra schreef:
> >>>>On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
> >>>>>On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
> >>>>>>On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
> >>>>>>>>>if (!bo_tryreserve()) {
> >>>>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
> >>>>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
> >>>>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
> >>>>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
> >>>>>>>>>}
> >>>>>>>Anyway, could you describe what is wrong, with the above solution, because
> >>>>>>>it seems perfectly legal to me.
> >>>>>>Luckily the rule of law doesn't have anything to do with this stuff --
> >>>>>>at least I sincerely hope so.
> >>>>>>
> >>>>>>The thing that's wrong with that pattern is that its still not
> >>>>>>deterministic - although its a lot better than the pure trylock. Because
> >>>>>>you have to release and re-acquire with the trylock another user might
> >>>>>>have gotten in again. Its utterly prone to starvation.
> >>>>>>
> >>>>>>The acquire+release does remove the dead/life-lock scenario from the
> >>>>>>FIFO case, since blocking on the acquire will allow the other task to
> >>>>>>run (or even get boosted on -rt).
> >>>>>>
> >>>>>>Aside from that there's nothing particularly wrong with it and lockdep
> >>>>>>should be happy afaict (but I haven't had my morning juice yet).
> >>>>>bo_reserve internally maps to a ww-mutex and task can already hold
> >>>>>ww-mutex (potentially even the same for especially nasty userspace).
> >>>>OK, yes I wasn't aware of that. Yes in that case you're quite right.
> >>>>
> >>>I added a RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
> >>>
> >>>This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
> >>>Nouveau was a bit of a headache, but afaict it should work.
> >>>
> >>>In almost all cases relocs are not updated, so I kept intact the fastpath
> >>>of not copying relocs from userspace. The slowpath tries to copy it atomically,
> >>>and if that fails it will unreserve all bo's and copy everything.
> >>>
> >>>One thing to note is that the command submission ioctl may fail now with -EFAULT
> >>>if presumed cannot be updated, while the commands are submitted succesfully.
> >>I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
> >>
> >>Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell the vmwgfx driver doesn't need any fixups.
> >Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
> >So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
> >which would probably result in libdrm crashing shortly afterwards anyway.
>
> Hmm, is the list of buffer objects (and the "presumed" members)
> really in DRM memory? Because if it resides or may reside in
> anonymous system memory, it may well be paged out between reading
> and writing, in which case the -EFAULT return is incorrect.
>
> In fact failures of pushbuf / execbuf *after* commands are
> successfully submitted are generally very hard to recover from.
> That's why the kernel should do whatever it takes to recover from
> such failures, and user-space should do whatever it takes to recover
> from copy-to-user failures of needed info from the kernel, and it
> really depends on the user-space usage pattern of "presumed". IIRC
> the original reason for copying it back to user-space was, that if a
> relocation offsets were patched up by the kernel, and then the
> process was sent a signal causing it to retry execbuf, then
> "presumed" had to be updated, otherwise it would be inconsistent
> with what's currently in the command stream, which is very bad. If
> "presumed" is, however, only used by user-space to guess an offset,
> the correct action would be to swallow the -EFAULT.

In i915 we've had tons of fun with a regression in 3.7 where exactly this
blows up: Some of our userspace (UXA ddx specifically) retains
relocations-trees partially accross an execbuf. Which means if the kernel
updates the relocations it _must_ update the presumed offset for
otherwise things will go haywire on the next execbuf. So we can't return
-EFAULT if the userspace memory needs to be just refaulted but still need
to guarante a "correct" presumed offset.

Since we didn't come up with a way to make sure this will work in all
cases when we get an -EFAULT when writing back presumed offsets we have a
rather tricky piece of fallback logic.
- Any -EFAULT error in the fastpath will drop us into the relocation
slowpath. The slowpath completly processes relocs anew, so any updates
done by the fastpath are irrelevant.

- The first thing the slowpath does is set the presumed offset in the
userspace reloc lists to an invalid address (we pick -1) to make sure
that any subsequent execbuf with the same partial reloc tree will again
go through the relocation update code.

- Then we do the usual slowpath, i.e. copy relocs from userspace, re-grab
locks and then process them. The copy-back of the presumed offset
happens with an copy_to_user_inatomic, and we ignore any errors.

Of course we try really hard to make sure that we never hit the reloc
slowpath ;-) But nowadays this is all fully tested with some nasty
testcases (and a small kernel option to disable prefaulting).

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-09-24 09:44:10

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

Op 24-09-13 11:03, Thomas Hellstrom schreef:
> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
>> Op 24-09-13 09:22, Thomas Hellstrom schreef:
>>> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>>>> Hey,
>>>>
>>>> Op 13-09-13 11:00, Peter Zijlstra schreef:
>>>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
>>>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>>>> }
>>>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>>>> it seems perfectly legal to me.
>>>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>>>> at least I sincerely hope so.
>>>>>>>
>>>>>>> The thing that's wrong with that pattern is that its still not
>>>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>>>> you have to release and re-acquire with the trylock another user might
>>>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>>>
>>>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>>>> run (or even get boosted on -rt).
>>>>>>>
>>>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>>>
>>>> I added a RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>>>
>>>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>>>> Nouveau was a bit of a headache, but afaict it should work.
>>>>
>>>> In almost all cases relocs are not updated, so I kept intact the fastpath
>>>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>>>> and if that fails it will unreserve all bo's and copy everything.
>>>>
>>>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>>>> if presumed cannot be updated, while the commands are submitted succesfully.
>>> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>>>
>>> Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell the vmwgfx driver doesn't need any fixups.
>> Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
>> So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
>> which would probably result in libdrm crashing shortly afterwards anyway.
>
> Hmm, is the list of buffer objects (and the "presumed" members) really in DRM memory? Because if it resides or may reside in anonymous system memory, it may well be paged out between reading and writing, in which case the -EFAULT return is incorrect.
It may be swapped out, but I don't use read.*atomic copy calls, faults are allowed to happen at that point and no -EFAULT will be returned from
that. The only -EFAULT that can happen is if the memory ends up being read-only, or ceased to exist.

> In fact failures of pushbuf / execbuf *after* commands are successfully submitted are generally very hard to recover from. That's why the kernel should do whatever it takes to recover from such failures, and user-space should do whatever it takes to recover from copy-to-user failures of needed info from the kernel, and it really depends on the user-space usage pattern of "presumed". IIRC the original reason for copying it back to user-space was, that if a relocation offsets were patched up by the kernel, and then the process was sent a signal causing it to retry execbuf, then "presumed" had to be updated, otherwise it would be inconsistent with what's currently in the command stream, which is very bad. If "presumed" is, however, only used by user-space to guess an offset, the correct action would be to swallow the -EFAULT.
Yeah it's a guess from userspace. But it's probably still a programming error by userspace if it uses a read-only
mapping for presumed, or a mapping that's invalidated. :P But there's nothing I can do at that point,
I can't undo something I just committed. This is why I asked whether I should swallow it or not, because it's
unclear to me. :) There's not much of an impact to userspace if I swallow it, but userspace did deliberately
mess up the call.

>>
>> So I don't know whether to swallow that -EFAULT or not, which is what I asked.
>>
>> And for vmwgfx just run it through PROVE_LOCKING and make sure all the copy_to/from_user call sites are called at least once, and then you can be certain it doesn't need fixups. ;)
>> Lockdep ftw..
> Will do that.
>
>>
>>> Actually, it's not the locking order bo:reserve -> mmap_sem that triggers the lockdep warning, right? but the fact that copy_from_user could recurse into the fault handler? Grabbing bo:reseves recursively? which should leave us free to choose locking order at this point.
>> Same thing.
>>
>> When PROVE_LOCKING is enabled the might_fault calls in copy_to/from_user do a fake locking of mmap_sem, which means all locking errors, provided that the reservation lock is taken at least once with mmap_sem held (eg the ttm fault handler is called at least once, it can be done separately, it doesn't need to be done in the same syscall). So any bugs will be found. The only thing to worry about is that lockdep turns off after finding 1 error, so you have to make sure it doesn't bomb out before completing tests, which is sometimes a problem on early rc kernels. ;)
>
> My point was that when we only have copy_[to|from]_user_inatomic inside any bo:reservations, the might_fault would never be called inside any reservations and we should, in principle, be free to choose locking order, provided of course it could be done cleanly in fault()?
I think it makes sense to keep mmap_sem the outer lock here, and not do scary things in fault. Especially if the mm locking is going to be changed in the future. But you're correct, if holding reservations only inatomic should be used.

~Maarten

2013-09-24 10:11:45

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

Op 24-09-13 11:36, Daniel Vetter schreef:
> On Tue, Sep 24, 2013 at 11:03:37AM +0200, Thomas Hellstrom wrote:
>> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
>>> Op 24-09-13 09:22, Thomas Hellstrom schreef:
>>>> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>>>>> Hey,
>>>>>
>>>>> Op 13-09-13 11:00, Peter Zijlstra schreef:
>>>>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
>>>>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>>>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>>>>> }
>>>>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>>>>> it seems perfectly legal to me.
>>>>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>>>>> at least I sincerely hope so.
>>>>>>>>
>>>>>>>> The thing that's wrong with that pattern is that its still not
>>>>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>>>>> you have to release and re-acquire with the trylock another user might
>>>>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>>>>
>>>>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>>>>> run (or even get boosted on -rt).
>>>>>>>>
>>>>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>>>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>>>>
>>>>> I added a RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>>>>
>>>>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>>>>> Nouveau was a bit of a headache, but afaict it should work.
>>>>>
>>>>> In almost all cases relocs are not updated, so I kept intact the fastpath
>>>>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>>>>> and if that fails it will unreserve all bo's and copy everything.
>>>>>
>>>>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>>>>> if presumed cannot be updated, while the commands are submitted succesfully.
>>>> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>>>>
>>>> Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell the vmwgfx driver doesn't need any fixups.
>>> Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
>>> So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
>>> which would probably result in libdrm crashing shortly afterwards anyway.
>> Hmm, is the list of buffer objects (and the "presumed" members)
>> really in DRM memory? Because if it resides or may reside in
>> anonymous system memory, it may well be paged out between reading
>> and writing, in which case the -EFAULT return is incorrect.
>>
>> In fact failures of pushbuf / execbuf *after* commands are
>> successfully submitted are generally very hard to recover from.
>> That's why the kernel should do whatever it takes to recover from
>> such failures, and user-space should do whatever it takes to recover
>> from copy-to-user failures of needed info from the kernel, and it
>> really depends on the user-space usage pattern of "presumed". IIRC
>> the original reason for copying it back to user-space was, that if a
>> relocation offsets were patched up by the kernel, and then the
>> process was sent a signal causing it to retry execbuf, then
>> "presumed" had to be updated, otherwise it would be inconsistent
>> with what's currently in the command stream, which is very bad. If
>> "presumed" is, however, only used by user-space to guess an offset,
>> the correct action would be to swallow the -EFAULT.
> In i915 we've had tons of fun with a regression in 3.7 where exactly this
> blows up: Some of our userspace (UXA ddx specifically) retains
> relocations-trees partially accross an execbuf. Which means if the kernel
> updates the relocations it _must_ update the presumed offset for
> otherwise things will go haywire on the next execbuf. So we can't return
> -EFAULT if the userspace memory needs to be just refaulted but still need
> to guarante a "correct" presumed offset.
>
> Since we didn't come up with a way to make sure this will work in all
> cases when we get an -EFAULT when writing back presumed offsets we have a
> rather tricky piece of fallback logic.
> - Any -EFAULT error in the fastpath will drop us into the relocation
> slowpath. The slowpath completly processes relocs anew, so any updates
> done by the fastpath are irrelevant.
>
> - The first thing the slowpath does is set the presumed offset in the
> userspace reloc lists to an invalid address (we pick -1) to make sure
> that any subsequent execbuf with the same partial reloc tree will again
> go through the relocation update code.
>
> - Then we do the usual slowpath, i.e. copy relocs from userspace, re-grab
> locks and then process them. The copy-back of the presumed offset
> happens with an copy_to_user_inatomic, and we ignore any errors.
>
> Of course we try really hard to make sure that we never hit the reloc
> slowpath ;-) But nowadays this is all fully tested with some nasty
> testcases (and a small kernel option to disable prefaulting).
>
It seems userspace only updates offset and domain in nouveau. If it fails to update
it would result in the same affect as when the buffer gets moved around by TTM.
But hey maybe I'll have some fun, I'll lie to userspace, hardcode userspace offset
to 0x40000000, always force domain to be different and see what breaks.

My guess is absolutely nothing, except it might expose some bugs where we forgot annotation..

~Maarten

2013-09-24 10:34:06

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

On 09/24/2013 12:11 PM, Maarten Lankhorst wrote:
> Op 24-09-13 11:36, Daniel Vetter schreef:
>> On Tue, Sep 24, 2013 at 11:03:37AM +0200, Thomas Hellstrom wrote:
>>> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
>>>> Op 24-09-13 09:22, Thomas Hellstrom schreef:
>>>>> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>>>>>> Hey,
>>>>>>
>>>>>> Op 13-09-13 11:00, Peter Zijlstra schreef:
>>>>>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>>>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
>>>>>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>>>>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>>>>>> }
>>>>>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>>>>>> it seems perfectly legal to me.
>>>>>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>>>>>> at least I sincerely hope so.
>>>>>>>>>
>>>>>>>>> The thing that's wrong with that pattern is that its still not
>>>>>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>>>>>> you have to release and re-acquire with the trylock another user might
>>>>>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>>>>>
>>>>>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>>>>>> run (or even get boosted on -rt).
>>>>>>>>>
>>>>>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>>>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>>>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>>>>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>>>>>
>>>>>> I added a RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>>>>>
>>>>>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>>>>>> Nouveau was a bit of a headache, but afaict it should work.
>>>>>>
>>>>>> In almost all cases relocs are not updated, so I kept intact the fastpath
>>>>>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>>>>>> and if that fails it will unreserve all bo's and copy everything.
>>>>>>
>>>>>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>>>>>> if presumed cannot be updated, while the commands are submitted succesfully.
>>>>> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>>>>>
>>>>> Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell the vmwgfx driver doesn't need any fixups.
>>>> Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
>>>> So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
>>>> which would probably result in libdrm crashing shortly afterwards anyway.
>>> Hmm, is the list of buffer objects (and the "presumed" members)
>>> really in DRM memory? Because if it resides or may reside in
>>> anonymous system memory, it may well be paged out between reading
>>> and writing, in which case the -EFAULT return is incorrect.
>>>
>>> In fact failures of pushbuf / execbuf *after* commands are
>>> successfully submitted are generally very hard to recover from.
>>> That's why the kernel should do whatever it takes to recover from
>>> such failures, and user-space should do whatever it takes to recover
>>> from copy-to-user failures of needed info from the kernel, and it
>>> really depends on the user-space usage pattern of "presumed". IIRC
>>> the original reason for copying it back to user-space was, that if a
>>> relocation offsets were patched up by the kernel, and then the
>>> process was sent a signal causing it to retry execbuf, then
>>> "presumed" had to be updated, otherwise it would be inconsistent
>>> with what's currently in the command stream, which is very bad. If
>>> "presumed" is, however, only used by user-space to guess an offset,
>>> the correct action would be to swallow the -EFAULT.
>> In i915 we've had tons of fun with a regression in 3.7 where exactly this
>> blows up: Some of our userspace (UXA ddx specifically) retains
>> relocations-trees partially accross an execbuf. Which means if the kernel
>> updates the relocations it _must_ update the presumed offset for
>> otherwise things will go haywire on the next execbuf. So we can't return
>> -EFAULT if the userspace memory needs to be just refaulted but still need
>> to guarante a "correct" presumed offset.
>>
>> Since we didn't come up with a way to make sure this will work in all
>> cases when we get an -EFAULT when writing back presumed offsets we have a
>> rather tricky piece of fallback logic.
>> - Any -EFAULT error in the fastpath will drop us into the relocation
>> slowpath. The slowpath completly processes relocs anew, so any updates
>> done by the fastpath are irrelevant.
>>
>> - The first thing the slowpath does is set the presumed offset in the
>> userspace reloc lists to an invalid address (we pick -1) to make sure
>> that any subsequent execbuf with the same partial reloc tree will again
>> go through the relocation update code.
>>
>> - Then we do the usual slowpath, i.e. copy relocs from userspace, re-grab
>> locks and then process them. The copy-back of the presumed offset
>> happens with an copy_to_user_inatomic, and we ignore any errors.
>>
>> Of course we try really hard to make sure that we never hit the reloc
>> slowpath ;-) But nowadays this is all fully tested with some nasty
>> testcases (and a small kernel option to disable prefaulting).
>>
> It seems userspace only updates offset and domain in nouveau. If it fails to update
> it would result in the same affect as when the buffer gets moved around by TTM.
> But hey maybe I'll have some fun, I'll lie to userspace, hardcode userspace offset
> to 0x40000000, always force domain to be different and see what breaks.
>
> My guess is absolutely nothing, except it might expose some bugs where we forgot annotation..
I think that would certainly break if your return an -ERESTARTSYS after
applying relocations but
before submitting the command stream to hardware....

/Thomas

>
> ~Maarten

2013-09-24 11:33:07

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

Op 24-09-13 12:33, Thomas Hellstrom schreef:
> On 09/24/2013 12:11 PM, Maarten Lankhorst wrote:
>> Op 24-09-13 11:36, Daniel Vetter schreef:
>>> On Tue, Sep 24, 2013 at 11:03:37AM +0200, Thomas Hellstrom wrote:
>>>> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
>>>>> Op 24-09-13 09:22, Thomas Hellstrom schreef:
>>>>>> On 09/23/2013 05:33 PM, Maarten Lankhorst wrote:
>>>>>>> Hey,
>>>>>>>
>>>>>>> Op 13-09-13 11:00, Peter Zijlstra schreef:
>>>>>>>> On Fri, Sep 13, 2013 at 10:41:54AM +0200, Daniel Vetter wrote:
>>>>>>>>> On Fri, Sep 13, 2013 at 10:29 AM, Peter Zijlstra <[email protected]> wrote:
>>>>>>>>>> On Fri, Sep 13, 2013 at 09:46:03AM +0200, Thomas Hellstrom wrote:
>>>>>>>>>>>>> if (!bo_tryreserve()) {
>>>>>>>>>>>>> up_read mmap_sem(); // Release the mmap_sem to avoid deadlocks.
>>>>>>>>>>>>> bo_reserve(); // Wait for the BO to become available (interruptible)
>>>>>>>>>>>>> bo_unreserve(); // Where is bo_wait_unreserved() when we need it, Maarten :P
>>>>>>>>>>>>> return VM_FAULT_RETRY; // Go ahead and retry the VMA walk, after regrabbing
>>>>>>>>>>>>> }
>>>>>>>>>>> Anyway, could you describe what is wrong, with the above solution, because
>>>>>>>>>>> it seems perfectly legal to me.
>>>>>>>>>> Luckily the rule of law doesn't have anything to do with this stuff --
>>>>>>>>>> at least I sincerely hope so.
>>>>>>>>>>
>>>>>>>>>> The thing that's wrong with that pattern is that its still not
>>>>>>>>>> deterministic - although its a lot better than the pure trylock. Because
>>>>>>>>>> you have to release and re-acquire with the trylock another user might
>>>>>>>>>> have gotten in again. Its utterly prone to starvation.
>>>>>>>>>>
>>>>>>>>>> The acquire+release does remove the dead/life-lock scenario from the
>>>>>>>>>> FIFO case, since blocking on the acquire will allow the other task to
>>>>>>>>>> run (or even get boosted on -rt).
>>>>>>>>>>
>>>>>>>>>> Aside from that there's nothing particularly wrong with it and lockdep
>>>>>>>>>> should be happy afaict (but I haven't had my morning juice yet).
>>>>>>>>> bo_reserve internally maps to a ww-mutex and task can already hold
>>>>>>>>> ww-mutex (potentially even the same for especially nasty userspace).
>>>>>>>> OK, yes I wasn't aware of that. Yes in that case you're quite right.
>>>>>>>>
>>>>>>> I added a RFC patch below. I only tested with PROVE_LOCKING, and always forced the slowpath for debugging.
>>>>>>>
>>>>>>> This fixes nouveau and core ttm to always use blocking acquisition in fastpath.
>>>>>>> Nouveau was a bit of a headache, but afaict it should work.
>>>>>>>
>>>>>>> In almost all cases relocs are not updated, so I kept intact the fastpath
>>>>>>> of not copying relocs from userspace. The slowpath tries to copy it atomically,
>>>>>>> and if that fails it will unreserve all bo's and copy everything.
>>>>>>>
>>>>>>> One thing to note is that the command submission ioctl may fail now with -EFAULT
>>>>>>> if presumed cannot be updated, while the commands are submitted succesfully.
>>>>>> I think the Nouveau guys need to comment further on this, but returning -EFAULT might break existing user-space, and that's not allowed, but IIRC the return value of "presumed" is only a hint, and if it's incorrect will only trigger future command stream patching.
>>>>>>
>>>>>> Otherwise reviewing mostly the TTM stuff. FWIW, from wat I can tell the vmwgfx driver doesn't need any fixups.
>>>>> Well because we read the list of buffer objects the presumed offsets are at least read-mapped. Although I guess in the worst case the mapping might disappear before the syscall copies back the data.
>>>>> So if -EFAULT happens here then userspace messed up in some way, either by forgetting to map the offsets read-write, which cannot happen with libdrm or free'ing the bo list before the syscall returns,
>>>>> which would probably result in libdrm crashing shortly afterwards anyway.
>>>> Hmm, is the list of buffer objects (and the "presumed" members)
>>>> really in DRM memory? Because if it resides or may reside in
>>>> anonymous system memory, it may well be paged out between reading
>>>> and writing, in which case the -EFAULT return is incorrect.
>>>>
>>>> In fact failures of pushbuf / execbuf *after* commands are
>>>> successfully submitted are generally very hard to recover from.
>>>> That's why the kernel should do whatever it takes to recover from
>>>> such failures, and user-space should do whatever it takes to recover
>>>> from copy-to-user failures of needed info from the kernel, and it
>>>> really depends on the user-space usage pattern of "presumed". IIRC
>>>> the original reason for copying it back to user-space was, that if a
>>>> relocation offsets were patched up by the kernel, and then the
>>>> process was sent a signal causing it to retry execbuf, then
>>>> "presumed" had to be updated, otherwise it would be inconsistent
>>>> with what's currently in the command stream, which is very bad. If
>>>> "presumed" is, however, only used by user-space to guess an offset,
>>>> the correct action would be to swallow the -EFAULT.
>>> In i915 we've had tons of fun with a regression in 3.7 where exactly this
>>> blows up: Some of our userspace (UXA ddx specifically) retains
>>> relocations-trees partially accross an execbuf. Which means if the kernel
>>> updates the relocations it _must_ update the presumed offset for
>>> otherwise things will go haywire on the next execbuf. So we can't return
>>> -EFAULT if the userspace memory needs to be just refaulted but still need
>>> to guarante a "correct" presumed offset.
>>>
>>> Since we didn't come up with a way to make sure this will work in all
>>> cases when we get an -EFAULT when writing back presumed offsets we have a
>>> rather tricky piece of fallback logic.
>>> - Any -EFAULT error in the fastpath will drop us into the relocation
>>> slowpath. The slowpath completly processes relocs anew, so any updates
>>> done by the fastpath are irrelevant.
>>>
>>> - The first thing the slowpath does is set the presumed offset in the
>>> userspace reloc lists to an invalid address (we pick -1) to make sure
>>> that any subsequent execbuf with the same partial reloc tree will again
>>> go through the relocation update code.
>>>
>>> - Then we do the usual slowpath, i.e. copy relocs from userspace, re-grab
>>> locks and then process them. The copy-back of the presumed offset
>>> happens with an copy_to_user_inatomic, and we ignore any errors.
>>>
>>> Of course we try really hard to make sure that we never hit the reloc
>>> slowpath ;-) But nowadays this is all fully tested with some nasty
>>> testcases (and a small kernel option to disable prefaulting).
>>>
>> It seems userspace only updates offset and domain in nouveau. If it fails to update
>> it would result in the same affect as when the buffer gets moved around by TTM.
>> But hey maybe I'll have some fun, I'll lie to userspace, hardcode userspace offset
>> to 0x40000000, always force domain to be different and see what breaks.
>>
>> My guess is absolutely nothing, except it might expose some bugs where we forgot annotation..
> I think that would certainly break if your return an -ERESTARTSYS after applying relocations but
> before submitting the command stream to hardware....
The relocations are updated before submitting the command stream, but it's copied back to userspace
after submitting the command stream. I'm not sure what -ERESTARTSYS would change, the syscall
is not in an interruptible state.

Oh and after some testing with relocations it seems they always break nouveau. Originally I've set domain
and offset differently, but that crashed the card. In my second try I kept domain and offset correct, and hacked
the kernel to force apply all relocations. Unsurprisingly nouveau still breaks, even if the relocations themselves
should have been a noop. :/

Swallowing -EFAULT is fine theoretically though, I wouldn't worry about it..

~Maarten

2013-09-24 17:04:17

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

On 09/24/2013 01:32 PM, Maarten Lankhorst wrote:
> Op 24-09-13 12:33, Thomas Hellstrom schreef:
>> On 09/24/2013 12:11 PM, Maarten Lankhorst wrote:
>>>
>>> It seems userspace only updates offset and domain in nouveau. If it fails to update
>>> it would result in the same affect as when the buffer gets moved around by TTM.
>>> But hey maybe I'll have some fun, I'll lie to userspace, hardcode userspace offset
>>> to 0x40000000, always force domain to be different and see what breaks.
>>>
>>> My guess is absolutely nothing, except it might expose some bugs where we forgot annotation..
>> I think that would certainly break if your return an -ERESTARTSYS after applying relocations but
>> before submitting the command stream to hardware....
> The relocations are updated before submitting the command stream, but it's copied back to userspace
> after submitting the command stream. I'm not sure what -ERESTARTSYS would change, the syscall
> is not in an interruptible state.
Hmm, I was assuming without looking at the code that there was an
interruptible wait for pipe/ring space after
relocation application.

/Thomas

2013-09-24 17:54:05

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/nouveau: fix nested locking in mmap handler

On 09/24/2013 11:43 AM, Maarten Lankhorst wrote:
> Op 24-09-13 11:03, Thomas Hellstrom schreef:
>> On 09/24/2013 09:34 AM, Maarten Lankhorst wrote:
>>> Op 24-09-13 09:22, Thomas Hellstrom schreef:
>>>> Actually, it's not the locking order bo:reserve -> mmap_sem that triggers the lockdep warning, right? but the fact that copy_from_user could recurse into the fault handler? Grabbing bo:reseves recursively? which should leave us free to choose locking order at this point.
>>> Same thing.
>>>
>>> When PROVE_LOCKING is enabled the might_fault calls in copy_to/from_user do a fake locking of mmap_sem, which means all locking errors, provided that the reservation lock is taken at least once with mmap_sem held (eg the ttm fault handler is called at least once, it can be done separately, it doesn't need to be done in the same syscall). So any bugs will be found. The only thing to worry about is that lockdep turns off after finding 1 error, so you have to make sure it doesn't bomb out before completing tests, which is sometimes a problem on early rc kernels. ;)
>> My point was that when we only have copy_[to|from]_user_inatomic inside any bo:reservations, the might_fault would never be called inside any reservations and we should, in principle, be free to choose locking order, provided of course it could be done cleanly in fault()?
> I think it makes sense to keep mmap_sem the outer lock here, and not do scary things in fault. Especially if the mm locking is going to be changed in the future. But you're correct, if holding reservations only inatomic should be used.
>
> ~Maarten

Now that we after all were forced to enter the dark realm of copy-user
slowpaths, I don't really have a strong opinion anymore, other than that
we should try to avoid building too much code in that depends on either
locking order, so that we could change it if _really_ needed.

/Thomas

2013-10-08 14:14:50

[permalink] [raw]

Subject: [RFC PATCH] drm/radeon: fixup locking inversion between mmap_sem and reservations

Allocate and copy all kernel memory before doing reservations. This prevents a locking
inversion between mmap_sem and reservation_class, and allows us to drop the trylocking
in ttm_bo_vm_fault without upsetting lockdep.

Signed-off-by: Maarten Lankhorst <[email protected]>
---
diff --git a/drivers/gpu/drm/radeon/r600_cs.c b/drivers/gpu/drm/radeon/r600_cs.c
index 01a3ec8..efa9bca 100644
--- a/drivers/gpu/drm/radeon/r600_cs.c
+++ b/drivers/gpu/drm/radeon/r600_cs.c
@@ -2391,18 +2391,13 @@ int r600_cs_legacy(struct drm_device *dev, void *data, struct drm_file *filp,
ib_chunk = &parser.chunks[parser.chunk_ib_idx];
parser.ib.length_dw = ib_chunk->length_dw;
*l = parser.ib.length_dw;
+ memcpy(ib, ib_chunk->kdata, ib_chunk->length_dw * 4);
r = r600_cs_parse(&parser);
if (r) {
DRM_ERROR("Invalid command stream !\n");
r600_cs_parser_fini(&parser, r);
return r;
}
- r = radeon_cs_finish_pages(&parser);
- if (r) {
- DRM_ERROR("Invalid command stream !\n");
- r600_cs_parser_fini(&parser, r);
- return r;
- }
r600_cs_parser_fini(&parser, r);
return r;
}
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index e067024..c52bb5e 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -983,12 +983,7 @@ struct radeon_cs_reloc {
struct radeon_cs_chunk {
uint32_t chunk_id;
uint32_t length_dw;
- int kpage_idx[2];
- uint32_t *kpage[2];
uint32_t *kdata;
- void __user *user_ptr;
- int last_copied_page;
- int last_page_index;
};

struct radeon_cs_parser {
@@ -1027,8 +1022,13 @@ struct radeon_cs_parser {
struct radeon_sa_bo *cpu_sema;
};

-extern int radeon_cs_finish_pages(struct radeon_cs_parser *p);
-extern u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx);
+static inline u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx)
+{
+ struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
+
+ return ibc->kdata[idx];
+}
+

struct radeon_cs_packet {
unsigned idx;
diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
index 7d7322e..98d8898 100644
--- a/drivers/gpu/drm/radeon/radeon_cs.c
+++ b/drivers/gpu/drm/radeon/radeon_cs.c
@@ -217,9 +217,7 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
return -EFAULT;
}
p->chunks[i].length_dw = user_chunk.length_dw;
- p->chunks[i].kdata = NULL;
p->chunks[i].chunk_id = user_chunk.chunk_id;
- p->chunks[i].user_ptr = (void __user *)(unsigned long)user_chunk.chunk_data;
if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_RELOCS) {
p->chunk_relocs_idx = i;
}
@@ -242,25 +240,22 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
return -EINVAL;
}

- cdata = (uint32_t *)(unsigned long)user_chunk.chunk_data;
- if ((p->chunks[i].chunk_id == RADEON_CHUNK_ID_RELOCS) ||
- (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS)) {
- size = p->chunks[i].length_dw * sizeof(uint32_t);
- p->chunks[i].kdata = kmalloc(size, GFP_KERNEL);
- if (p->chunks[i].kdata == NULL) {
- return -ENOMEM;
- }
- if (DRM_COPY_FROM_USER(p->chunks[i].kdata,
- p->chunks[i].user_ptr, size)) {
- return -EFAULT;
- }
- if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS) {
- p->cs_flags = p->chunks[i].kdata[0];
- if (p->chunks[i].length_dw > 1)
- ring = p->chunks[i].kdata[1];
- if (p->chunks[i].length_dw > 2)
- priority = (s32)p->chunks[i].kdata[2];
- }
+ size = p->chunks[i].length_dw;
+ p->chunks[i].kdata = drm_malloc_ab(size, sizeof(uint32_t));
+ size *= sizeof(uint32_t);
+ if (p->chunks[i].kdata == NULL) {
+ return -ENOMEM;
+ }
+ cdata = (void __user *)(unsigned long)user_chunk.chunk_data;
+ if (DRM_COPY_FROM_USER(p->chunks[i].kdata, cdata, size)) {
+ return -EFAULT;
+ }
+ if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS) {
+ p->cs_flags = p->chunks[i].kdata[0];
+ if (p->chunks[i].length_dw > 1)
+ ring = p->chunks[i].kdata[1];
+ if (p->chunks[i].length_dw > 2)
+ priority = (s32)p->chunks[i].kdata[2];
}
}

@@ -283,34 +278,6 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
}
}

- /* deal with non-vm */
- if ((p->chunk_ib_idx != -1) &&
- ((p->cs_flags & RADEON_CS_USE_VM) == 0) &&
- (p->chunks[p->chunk_ib_idx].chunk_id == RADEON_CHUNK_ID_IB)) {
- if (p->chunks[p->chunk_ib_idx].length_dw > (16 * 1024)) {
- DRM_ERROR("cs IB too big: %d\n",
- p->chunks[p->chunk_ib_idx].length_dw);
- return -EINVAL;
- }
- if (p->rdev && (p->rdev->flags & RADEON_IS_AGP)) {
- p->chunks[p->chunk_ib_idx].kpage[0] = kmalloc(PAGE_SIZE, GFP_KERNEL);
- p->chunks[p->chunk_ib_idx].kpage[1] = kmalloc(PAGE_SIZE, GFP_KERNEL);
- if (p->chunks[p->chunk_ib_idx].kpage[0] == NULL ||
- p->chunks[p->chunk_ib_idx].kpage[1] == NULL) {
- kfree(p->chunks[p->chunk_ib_idx].kpage[0]);
- kfree(p->chunks[p->chunk_ib_idx].kpage[1]);
- p->chunks[p->chunk_ib_idx].kpage[0] = NULL;
- p->chunks[p->chunk_ib_idx].kpage[1] = NULL;
- return -ENOMEM;
- }
- }
- p->chunks[p->chunk_ib_idx].kpage_idx[0] = -1;
- p->chunks[p->chunk_ib_idx].kpage_idx[1] = -1;
- p->chunks[p->chunk_ib_idx].last_copied_page = -1;
- p->chunks[p->chunk_ib_idx].last_page_index =
- ((p->chunks[p->chunk_ib_idx].length_dw * 4) - 1) / PAGE_SIZE;
- }
-
return 0;
}

@@ -450,13 +417,8 @@ static void radeon_cs_parser_fini(struct radeon_cs_parser *parser, int error, bo
kfree(parser->track);
kfree(parser->relocs);
kfree(parser->relocs_ptr);
- for (i = 0; i < parser->nchunks; i++) {
- kfree(parser->chunks[i].kdata);
- if ((parser->rdev->flags & RADEON_IS_AGP)) {
- kfree(parser->chunks[i].kpage[0]);
- kfree(parser->chunks[i].kpage[1]);
- }
- }
+ for (i = 0; i < parser->nchunks; i++)
+ drm_free_large(parser->chunks[i].kdata);
kfree(parser->chunks);
kfree(parser->chunks_array);
radeon_ib_free(parser->rdev, &parser->ib);
@@ -483,17 +445,15 @@ static int radeon_cs_ib_chunk(struct radeon_device *rdev,
DRM_ERROR("Failed to get ib !\n");
return r;
}
+
+ memcpy(parser->ib.ptr, ib_chunk->kdata, ib_chunk->length_dw * 4);
parser->ib.length_dw = ib_chunk->length_dw;
+
r = radeon_cs_parse(rdev, parser->ring, parser);
if (r || parser->parser_error) {
DRM_ERROR("Invalid command stream !\n");
return r;
}
- r = radeon_cs_finish_pages(parser);
- if (r) {
- DRM_ERROR("Invalid command stream !\n");
- return r;
- }

if (parser->ring == R600_RING_TYPE_UVD_INDEX)
radeon_uvd_note_usage(rdev);
@@ -555,10 +515,8 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
parser->const_ib.is_const_ib = true;
parser->const_ib.length_dw = ib_chunk->length_dw;
/* Copy the packet into the IB */
- if (DRM_COPY_FROM_USER(parser->const_ib.ptr, ib_chunk->user_ptr,
- ib_chunk->length_dw * 4)) {
- return -EFAULT;
- }
+ memcpy(parser->const_ib.ptr, ib_chunk->kdata,
+ ib_chunk->length_dw * 4);
r = radeon_ring_ib_parse(rdev, parser->ring, &parser->const_ib);
if (r) {
return r;
@@ -578,10 +536,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
}
parser->ib.length_dw = ib_chunk->length_dw;
/* Copy the packet into the IB */
- if (DRM_COPY_FROM_USER(parser->ib.ptr, ib_chunk->user_ptr,
- ib_chunk->length_dw * 4)) {
- return -EFAULT;
- }
+ memcpy(parser->ib.ptr, ib_chunk->kdata, ib_chunk->length_dw * 4);
r = radeon_ring_ib_parse(rdev, parser->ring, &parser->ib);
if (r) {
return r;
@@ -704,97 +659,6 @@ int radeon_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
return r;
}

-int radeon_cs_finish_pages(struct radeon_cs_parser *p)
-{
- struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
- int i;
- int size = PAGE_SIZE;
-
- for (i = ibc->last_copied_page + 1; i <= ibc->last_page_index; i++) {
- if (i == ibc->last_page_index) {
- size = (ibc->length_dw * 4) % PAGE_SIZE;
- if (size == 0)
- size = PAGE_SIZE;
- }
-
- if (DRM_COPY_FROM_USER(p->ib.ptr + (i * (PAGE_SIZE/4)),
- ibc->user_ptr + (i * PAGE_SIZE),
- size))
- return -EFAULT;
- }
- return 0;
-}
-
-static int radeon_cs_update_pages(struct radeon_cs_parser *p, int pg_idx)
-{
- int new_page;
- struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
- int i;
- int size = PAGE_SIZE;
- bool copy1 = (p->rdev && (p->rdev->flags & RADEON_IS_AGP)) ?
- false : true;
-
- for (i = ibc->last_copied_page + 1; i < pg_idx; i++) {
- if (DRM_COPY_FROM_USER(p->ib.ptr + (i * (PAGE_SIZE/4)),
- ibc->user_ptr + (i * PAGE_SIZE),
- PAGE_SIZE)) {
- p->parser_error = -EFAULT;
- return 0;
- }
- }
-
- if (pg_idx == ibc->last_page_index) {
- size = (ibc->length_dw * 4) % PAGE_SIZE;
- if (size == 0)
- size = PAGE_SIZE;
- }
-
- new_page = ibc->kpage_idx[0] < ibc->kpage_idx[1] ? 0 : 1;
- if (copy1)
- ibc->kpage[new_page] = p->ib.ptr + (pg_idx * (PAGE_SIZE / 4));
-
- if (DRM_COPY_FROM_USER(ibc->kpage[new_page],
- ibc->user_ptr + (pg_idx * PAGE_SIZE),
- size)) {
- p->parser_error = -EFAULT;
- return 0;
- }
-
- /* copy to IB for non single case */
- if (!copy1)
- memcpy((void *)(p->ib.ptr+(pg_idx*(PAGE_SIZE/4))), ibc->kpage[new_page], size);
-
- ibc->last_copied_page = pg_idx;
- ibc->kpage_idx[new_page] = pg_idx;
-
- return new_page;
-}
-
-u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx)
-{
- struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
- u32 pg_idx, pg_offset;
- u32 idx_value = 0;
- int new_page;
-
- pg_idx = (idx * 4) / PAGE_SIZE;
- pg_offset = (idx * 4) % PAGE_SIZE;
-
- if (ibc->kpage_idx[0] == pg_idx)
- return ibc->kpage[0][pg_offset/4];
- if (ibc->kpage_idx[1] == pg_idx)
- return ibc->kpage[1][pg_offset/4];
-
- new_page = radeon_cs_update_pages(p, pg_idx);
- if (new_page < 0) {
- p->parser_error = new_page;
- return 0;
- }
-
- idx_value = ibc->kpage[new_page][pg_offset/4];
- return idx_value;
-}
-
/**
* radeon_cs_packet_parse() - parse cp packet and point ib index to next packet
* @parser: parser structure holding parsing context.

2013-10-08 14:34:16

by Jerome Glisse

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/radeon: fixup locking inversion between mmap_sem and reservations

On Tue, Oct 08, 2013 at 04:14:40PM +0200, Maarten Lankhorst wrote:
> Allocate and copy all kernel memory before doing reservations. This prevents a locking
> inversion between mmap_sem and reservation_class, and allows us to drop the trylocking
> in ttm_bo_vm_fault without upsetting lockdep.
>
> Signed-off-by: Maarten Lankhorst <[email protected]>

I would say NAK. Current code only allocate temporary page in AGP case.
So AGP case is userspace -> temp page -> cs checker -> radeon ib.

Non AGP is directly memcpy to radeon IB.

Your patch allocate memory memcpy userspace to it and it will then be
memcpy to IB. Which means you introduce an extra memcpy in the process
not something we want.

Cheers,
Jerome

> ---
> diff --git a/drivers/gpu/drm/radeon/r600_cs.c b/drivers/gpu/drm/radeon/r600_cs.c
> index 01a3ec8..efa9bca 100644
> --- a/drivers/gpu/drm/radeon/r600_cs.c
> +++ b/drivers/gpu/drm/radeon/r600_cs.c
> @@ -2391,18 +2391,13 @@ int r600_cs_legacy(struct drm_device *dev, void *data, struct drm_file *filp,
> ib_chunk = &parser.chunks[parser.chunk_ib_idx];
> parser.ib.length_dw = ib_chunk->length_dw;
> *l = parser.ib.length_dw;
> + memcpy(ib, ib_chunk->kdata, ib_chunk->length_dw * 4);
> r = r600_cs_parse(&parser);
> if (r) {
> DRM_ERROR("Invalid command stream !\n");
> r600_cs_parser_fini(&parser, r);
> return r;
> }
> - r = radeon_cs_finish_pages(&parser);
> - if (r) {
> - DRM_ERROR("Invalid command stream !\n");
> - r600_cs_parser_fini(&parser, r);
> - return r;
> - }
> r600_cs_parser_fini(&parser, r);
> return r;
> }
> diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
> index e067024..c52bb5e 100644
> --- a/drivers/gpu/drm/radeon/radeon.h
> +++ b/drivers/gpu/drm/radeon/radeon.h
> @@ -983,12 +983,7 @@ struct radeon_cs_reloc {
> struct radeon_cs_chunk {
> uint32_t chunk_id;
> uint32_t length_dw;
> - int kpage_idx[2];
> - uint32_t *kpage[2];
> uint32_t *kdata;
> - void __user *user_ptr;
> - int last_copied_page;
> - int last_page_index;
> };
>
> struct radeon_cs_parser {
> @@ -1027,8 +1022,13 @@ struct radeon_cs_parser {
> struct radeon_sa_bo *cpu_sema;
> };
>
> -extern int radeon_cs_finish_pages(struct radeon_cs_parser *p);
> -extern u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx);
> +static inline u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx)
> +{
> + struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
> +
> + return ibc->kdata[idx];
> +}
> +
>
> struct radeon_cs_packet {
> unsigned idx;
> diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
> index 7d7322e..98d8898 100644
> --- a/drivers/gpu/drm/radeon/radeon_cs.c
> +++ b/drivers/gpu/drm/radeon/radeon_cs.c
> @@ -217,9 +217,7 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
> return -EFAULT;
> }
> p->chunks[i].length_dw = user_chunk.length_dw;
> - p->chunks[i].kdata = NULL;
> p->chunks[i].chunk_id = user_chunk.chunk_id;
> - p->chunks[i].user_ptr = (void __user *)(unsigned long)user_chunk.chunk_data;
> if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_RELOCS) {
> p->chunk_relocs_idx = i;
> }
> @@ -242,25 +240,22 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
> return -EINVAL;
> }
>
> - cdata = (uint32_t *)(unsigned long)user_chunk.chunk_data;
> - if ((p->chunks[i].chunk_id == RADEON_CHUNK_ID_RELOCS) ||
> - (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS)) {
> - size = p->chunks[i].length_dw * sizeof(uint32_t);
> - p->chunks[i].kdata = kmalloc(size, GFP_KERNEL);
> - if (p->chunks[i].kdata == NULL) {
> - return -ENOMEM;
> - }
> - if (DRM_COPY_FROM_USER(p->chunks[i].kdata,
> - p->chunks[i].user_ptr, size)) {
> - return -EFAULT;
> - }
> - if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS) {
> - p->cs_flags = p->chunks[i].kdata[0];
> - if (p->chunks[i].length_dw > 1)
> - ring = p->chunks[i].kdata[1];
> - if (p->chunks[i].length_dw > 2)
> - priority = (s32)p->chunks[i].kdata[2];
> - }
> + size = p->chunks[i].length_dw;
> + p->chunks[i].kdata = drm_malloc_ab(size, sizeof(uint32_t));
> + size *= sizeof(uint32_t);
> + if (p->chunks[i].kdata == NULL) {
> + return -ENOMEM;
> + }
> + cdata = (void __user *)(unsigned long)user_chunk.chunk_data;
> + if (DRM_COPY_FROM_USER(p->chunks[i].kdata, cdata, size)) {
> + return -EFAULT;
> + }
> + if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS) {
> + p->cs_flags = p->chunks[i].kdata[0];
> + if (p->chunks[i].length_dw > 1)
> + ring = p->chunks[i].kdata[1];
> + if (p->chunks[i].length_dw > 2)
> + priority = (s32)p->chunks[i].kdata[2];
> }
> }
>
> @@ -283,34 +278,6 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
> }
> }
>
> - /* deal with non-vm */
> - if ((p->chunk_ib_idx != -1) &&
> - ((p->cs_flags & RADEON_CS_USE_VM) == 0) &&
> - (p->chunks[p->chunk_ib_idx].chunk_id == RADEON_CHUNK_ID_IB)) {
> - if (p->chunks[p->chunk_ib_idx].length_dw > (16 * 1024)) {
> - DRM_ERROR("cs IB too big: %d\n",
> - p->chunks[p->chunk_ib_idx].length_dw);
> - return -EINVAL;
> - }
> - if (p->rdev && (p->rdev->flags & RADEON_IS_AGP)) {
> - p->chunks[p->chunk_ib_idx].kpage[0] = kmalloc(PAGE_SIZE, GFP_KERNEL);
> - p->chunks[p->chunk_ib_idx].kpage[1] = kmalloc(PAGE_SIZE, GFP_KERNEL);
> - if (p->chunks[p->chunk_ib_idx].kpage[0] == NULL ||
> - p->chunks[p->chunk_ib_idx].kpage[1] == NULL) {
> - kfree(p->chunks[p->chunk_ib_idx].kpage[0]);
> - kfree(p->chunks[p->chunk_ib_idx].kpage[1]);
> - p->chunks[p->chunk_ib_idx].kpage[0] = NULL;
> - p->chunks[p->chunk_ib_idx].kpage[1] = NULL;
> - return -ENOMEM;
> - }
> - }
> - p->chunks[p->chunk_ib_idx].kpage_idx[0] = -1;
> - p->chunks[p->chunk_ib_idx].kpage_idx[1] = -1;
> - p->chunks[p->chunk_ib_idx].last_copied_page = -1;
> - p->chunks[p->chunk_ib_idx].last_page_index =
> - ((p->chunks[p->chunk_ib_idx].length_dw * 4) - 1) / PAGE_SIZE;
> - }
> -
> return 0;
> }
>
> @@ -450,13 +417,8 @@ static void radeon_cs_parser_fini(struct radeon_cs_parser *parser, int error, bo
> kfree(parser->track);
> kfree(parser->relocs);
> kfree(parser->relocs_ptr);
> - for (i = 0; i < parser->nchunks; i++) {
> - kfree(parser->chunks[i].kdata);
> - if ((parser->rdev->flags & RADEON_IS_AGP)) {
> - kfree(parser->chunks[i].kpage[0]);
> - kfree(parser->chunks[i].kpage[1]);
> - }
> - }
> + for (i = 0; i < parser->nchunks; i++)
> + drm_free_large(parser->chunks[i].kdata);
> kfree(parser->chunks);
> kfree(parser->chunks_array);
> radeon_ib_free(parser->rdev, &parser->ib);
> @@ -483,17 +445,15 @@ static int radeon_cs_ib_chunk(struct radeon_device *rdev,
> DRM_ERROR("Failed to get ib !\n");
> return r;
> }
> +
> + memcpy(parser->ib.ptr, ib_chunk->kdata, ib_chunk->length_dw * 4);
> parser->ib.length_dw = ib_chunk->length_dw;
> +
> r = radeon_cs_parse(rdev, parser->ring, parser);
> if (r || parser->parser_error) {
> DRM_ERROR("Invalid command stream !\n");
> return r;
> }
> - r = radeon_cs_finish_pages(parser);
> - if (r) {
> - DRM_ERROR("Invalid command stream !\n");
> - return r;
> - }
>
> if (parser->ring == R600_RING_TYPE_UVD_INDEX)
> radeon_uvd_note_usage(rdev);
> @@ -555,10 +515,8 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
> parser->const_ib.is_const_ib = true;
> parser->const_ib.length_dw = ib_chunk->length_dw;
> /* Copy the packet into the IB */
> - if (DRM_COPY_FROM_USER(parser->const_ib.ptr, ib_chunk->user_ptr,
> - ib_chunk->length_dw * 4)) {
> - return -EFAULT;
> - }
> + memcpy(parser->const_ib.ptr, ib_chunk->kdata,
> + ib_chunk->length_dw * 4);
> r = radeon_ring_ib_parse(rdev, parser->ring, &parser->const_ib);
> if (r) {
> return r;
> @@ -578,10 +536,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
> }
> parser->ib.length_dw = ib_chunk->length_dw;
> /* Copy the packet into the IB */
> - if (DRM_COPY_FROM_USER(parser->ib.ptr, ib_chunk->user_ptr,
> - ib_chunk->length_dw * 4)) {
> - return -EFAULT;
> - }
> + memcpy(parser->ib.ptr, ib_chunk->kdata, ib_chunk->length_dw * 4);
> r = radeon_ring_ib_parse(rdev, parser->ring, &parser->ib);
> if (r) {
> return r;
> @@ -704,97 +659,6 @@ int radeon_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> return r;
> }
>
> -int radeon_cs_finish_pages(struct radeon_cs_parser *p)
> -{
> - struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
> - int i;
> - int size = PAGE_SIZE;
> -
> - for (i = ibc->last_copied_page + 1; i <= ibc->last_page_index; i++) {
> - if (i == ibc->last_page_index) {
> - size = (ibc->length_dw * 4) % PAGE_SIZE;
> - if (size == 0)
> - size = PAGE_SIZE;
> - }
> -
> - if (DRM_COPY_FROM_USER(p->ib.ptr + (i * (PAGE_SIZE/4)),
> - ibc->user_ptr + (i * PAGE_SIZE),
> - size))
> - return -EFAULT;
> - }
> - return 0;
> -}
> -
> -static int radeon_cs_update_pages(struct radeon_cs_parser *p, int pg_idx)
> -{
> - int new_page;
> - struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
> - int i;
> - int size = PAGE_SIZE;
> - bool copy1 = (p->rdev && (p->rdev->flags & RADEON_IS_AGP)) ?
> - false : true;
> -
> - for (i = ibc->last_copied_page + 1; i < pg_idx; i++) {
> - if (DRM_COPY_FROM_USER(p->ib.ptr + (i * (PAGE_SIZE/4)),
> - ibc->user_ptr + (i * PAGE_SIZE),
> - PAGE_SIZE)) {
> - p->parser_error = -EFAULT;
> - return 0;
> - }
> - }
> -
> - if (pg_idx == ibc->last_page_index) {
> - size = (ibc->length_dw * 4) % PAGE_SIZE;
> - if (size == 0)
> - size = PAGE_SIZE;
> - }
> -
> - new_page = ibc->kpage_idx[0] < ibc->kpage_idx[1] ? 0 : 1;
> - if (copy1)
> - ibc->kpage[new_page] = p->ib.ptr + (pg_idx * (PAGE_SIZE / 4));
> -
> - if (DRM_COPY_FROM_USER(ibc->kpage[new_page],
> - ibc->user_ptr + (pg_idx * PAGE_SIZE),
> - size)) {
> - p->parser_error = -EFAULT;
> - return 0;
> - }
> -
> - /* copy to IB for non single case */
> - if (!copy1)
> - memcpy((void *)(p->ib.ptr+(pg_idx*(PAGE_SIZE/4))), ibc->kpage[new_page], size);
> -
> - ibc->last_copied_page = pg_idx;
> - ibc->kpage_idx[new_page] = pg_idx;
> -
> - return new_page;
> -}
> -
> -u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx)
> -{
> - struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
> - u32 pg_idx, pg_offset;
> - u32 idx_value = 0;
> - int new_page;
> -
> - pg_idx = (idx * 4) / PAGE_SIZE;
> - pg_offset = (idx * 4) % PAGE_SIZE;
> -
> - if (ibc->kpage_idx[0] == pg_idx)
> - return ibc->kpage[0][pg_offset/4];
> - if (ibc->kpage_idx[1] == pg_idx)
> - return ibc->kpage[1][pg_offset/4];
> -
> - new_page = radeon_cs_update_pages(p, pg_idx);
> - if (new_page < 0) {
> - p->parser_error = new_page;
> - return 0;
> - }
> -
> - idx_value = ibc->kpage[new_page][pg_offset/4];
> - return idx_value;
> -}
> -
> /**
> * radeon_cs_packet_parse() - parse cp packet and point ib index to next packet
> * @parser: parser structure holding parsing context.
>
> _______________________________________________
> dri-devel mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/dri-devel

2013-10-08 14:45:39

by Christian König

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/radeon: fixup locking inversion between mmap_sem and reservations

Am 08.10.2013 16:33, schrieb Jerome Glisse:
> On Tue, Oct 08, 2013 at 04:14:40PM +0200, Maarten Lankhorst wrote:
>> Allocate and copy all kernel memory before doing reservations. This prevents a locking
>> inversion between mmap_sem and reservation_class, and allows us to drop the trylocking
>> in ttm_bo_vm_fault without upsetting lockdep.
>>
>> Signed-off-by: Maarten Lankhorst <[email protected]>
> I would say NAK. Current code only allocate temporary page in AGP case.
> So AGP case is userspace -> temp page -> cs checker -> radeon ib.
>
> Non AGP is directly memcpy to radeon IB.
>
> Your patch allocate memory memcpy userspace to it and it will then be
> memcpy to IB. Which means you introduce an extra memcpy in the process
> not something we want.

Totally agree. Additional to that there is no good reason to provide
anything else than anonymous system memory to the CS ioctl, so the
dependency between the mmap_sem and reservations are not really clear to me.

Christian.

> Cheers,
> Jerome
>
>> ---
>> diff --git a/drivers/gpu/drm/radeon/r600_cs.c b/drivers/gpu/drm/radeon/r600_cs.c
>> index 01a3ec8..efa9bca 100644
>> --- a/drivers/gpu/drm/radeon/r600_cs.c
>> +++ b/drivers/gpu/drm/radeon/r600_cs.c
>> @@ -2391,18 +2391,13 @@ int r600_cs_legacy(struct drm_device *dev, void *data, struct drm_file *filp,
>> ib_chunk = &parser.chunks[parser.chunk_ib_idx];
>> parser.ib.length_dw = ib_chunk->length_dw;
>> *l = parser.ib.length_dw;
>> + memcpy(ib, ib_chunk->kdata, ib_chunk->length_dw * 4);
>> r = r600_cs_parse(&parser);
>> if (r) {
>> DRM_ERROR("Invalid command stream !\n");
>> r600_cs_parser_fini(&parser, r);
>> return r;
>> }
>> - r = radeon_cs_finish_pages(&parser);
>> - if (r) {
>> - DRM_ERROR("Invalid command stream !\n");
>> - r600_cs_parser_fini(&parser, r);
>> - return r;
>> - }
>> r600_cs_parser_fini(&parser, r);
>> return r;
>> }
>> diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
>> index e067024..c52bb5e 100644
>> --- a/drivers/gpu/drm/radeon/radeon.h
>> +++ b/drivers/gpu/drm/radeon/radeon.h
>> @@ -983,12 +983,7 @@ struct radeon_cs_reloc {
>> struct radeon_cs_chunk {
>> uint32_t chunk_id;
>> uint32_t length_dw;
>> - int kpage_idx[2];
>> - uint32_t *kpage[2];
>> uint32_t *kdata;
>> - void __user *user_ptr;
>> - int last_copied_page;
>> - int last_page_index;
>> };
>>
>> struct radeon_cs_parser {
>> @@ -1027,8 +1022,13 @@ struct radeon_cs_parser {
>> struct radeon_sa_bo *cpu_sema;
>> };
>>
>> -extern int radeon_cs_finish_pages(struct radeon_cs_parser *p);
>> -extern u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx);
>> +static inline u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx)
>> +{
>> + struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
>> +
>> + return ibc->kdata[idx];
>> +}
>> +
>>
>> struct radeon_cs_packet {
>> unsigned idx;
>> diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
>> index 7d7322e..98d8898 100644
>> --- a/drivers/gpu/drm/radeon/radeon_cs.c
>> +++ b/drivers/gpu/drm/radeon/radeon_cs.c
>> @@ -217,9 +217,7 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
>> return -EFAULT;
>> }
>> p->chunks[i].length_dw = user_chunk.length_dw;
>> - p->chunks[i].kdata = NULL;
>> p->chunks[i].chunk_id = user_chunk.chunk_id;
>> - p->chunks[i].user_ptr = (void __user *)(unsigned long)user_chunk.chunk_data;
>> if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_RELOCS) {
>> p->chunk_relocs_idx = i;
>> }
>> @@ -242,25 +240,22 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
>> return -EINVAL;
>> }
>>
>> - cdata = (uint32_t *)(unsigned long)user_chunk.chunk_data;
>> - if ((p->chunks[i].chunk_id == RADEON_CHUNK_ID_RELOCS) ||
>> - (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS)) {
>> - size = p->chunks[i].length_dw * sizeof(uint32_t);
>> - p->chunks[i].kdata = kmalloc(size, GFP_KERNEL);
>> - if (p->chunks[i].kdata == NULL) {
>> - return -ENOMEM;
>> - }
>> - if (DRM_COPY_FROM_USER(p->chunks[i].kdata,
>> - p->chunks[i].user_ptr, size)) {
>> - return -EFAULT;
>> - }
>> - if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS) {
>> - p->cs_flags = p->chunks[i].kdata[0];
>> - if (p->chunks[i].length_dw > 1)
>> - ring = p->chunks[i].kdata[1];
>> - if (p->chunks[i].length_dw > 2)
>> - priority = (s32)p->chunks[i].kdata[2];
>> - }
>> + size = p->chunks[i].length_dw;
>> + p->chunks[i].kdata = drm_malloc_ab(size, sizeof(uint32_t));
>> + size *= sizeof(uint32_t);
>> + if (p->chunks[i].kdata == NULL) {
>> + return -ENOMEM;
>> + }
>> + cdata = (void __user *)(unsigned long)user_chunk.chunk_data;
>> + if (DRM_COPY_FROM_USER(p->chunks[i].kdata, cdata, size)) {
>> + return -EFAULT;
>> + }
>> + if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS) {
>> + p->cs_flags = p->chunks[i].kdata[0];
>> + if (p->chunks[i].length_dw > 1)
>> + ring = p->chunks[i].kdata[1];
>> + if (p->chunks[i].length_dw > 2)
>> + priority = (s32)p->chunks[i].kdata[2];
>> }
>> }
>>
>> @@ -283,34 +278,6 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
>> }
>> }
>>
>> - /* deal with non-vm */
>> - if ((p->chunk_ib_idx != -1) &&
>> - ((p->cs_flags & RADEON_CS_USE_VM) == 0) &&
>> - (p->chunks[p->chunk_ib_idx].chunk_id == RADEON_CHUNK_ID_IB)) {
>> - if (p->chunks[p->chunk_ib_idx].length_dw > (16 * 1024)) {
>> - DRM_ERROR("cs IB too big: %d\n",
>> - p->chunks[p->chunk_ib_idx].length_dw);
>> - return -EINVAL;
>> - }
>> - if (p->rdev && (p->rdev->flags & RADEON_IS_AGP)) {
>> - p->chunks[p->chunk_ib_idx].kpage[0] = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> - p->chunks[p->chunk_ib_idx].kpage[1] = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> - if (p->chunks[p->chunk_ib_idx].kpage[0] == NULL ||
>> - p->chunks[p->chunk_ib_idx].kpage[1] == NULL) {
>> - kfree(p->chunks[p->chunk_ib_idx].kpage[0]);
>> - kfree(p->chunks[p->chunk_ib_idx].kpage[1]);
>> - p->chunks[p->chunk_ib_idx].kpage[0] = NULL;
>> - p->chunks[p->chunk_ib_idx].kpage[1] = NULL;
>> - return -ENOMEM;
>> - }
>> - }
>> - p->chunks[p->chunk_ib_idx].kpage_idx[0] = -1;
>> - p->chunks[p->chunk_ib_idx].kpage_idx[1] = -1;
>> - p->chunks[p->chunk_ib_idx].last_copied_page = -1;
>> - p->chunks[p->chunk_ib_idx].last_page_index =
>> - ((p->chunks[p->chunk_ib_idx].length_dw * 4) - 1) / PAGE_SIZE;
>> - }
>> -
>> return 0;
>> }
>>
>> @@ -450,13 +417,8 @@ static void radeon_cs_parser_fini(struct radeon_cs_parser *parser, int error, bo
>> kfree(parser->track);
>> kfree(parser->relocs);
>> kfree(parser->relocs_ptr);
>> - for (i = 0; i < parser->nchunks; i++) {
>> - kfree(parser->chunks[i].kdata);
>> - if ((parser->rdev->flags & RADEON_IS_AGP)) {
>> - kfree(parser->chunks[i].kpage[0]);
>> - kfree(parser->chunks[i].kpage[1]);
>> - }
>> - }
>> + for (i = 0; i < parser->nchunks; i++)
>> + drm_free_large(parser->chunks[i].kdata);
>> kfree(parser->chunks);
>> kfree(parser->chunks_array);
>> radeon_ib_free(parser->rdev, &parser->ib);
>> @@ -483,17 +445,15 @@ static int radeon_cs_ib_chunk(struct radeon_device *rdev,
>> DRM_ERROR("Failed to get ib !\n");
>> return r;
>> }
>> +
>> + memcpy(parser->ib.ptr, ib_chunk->kdata, ib_chunk->length_dw * 4);
>> parser->ib.length_dw = ib_chunk->length_dw;
>> +
>> r = radeon_cs_parse(rdev, parser->ring, parser);
>> if (r || parser->parser_error) {
>> DRM_ERROR("Invalid command stream !\n");
>> return r;
>> }
>> - r = radeon_cs_finish_pages(parser);
>> - if (r) {
>> - DRM_ERROR("Invalid command stream !\n");
>> - return r;
>> - }
>>
>> if (parser->ring == R600_RING_TYPE_UVD_INDEX)
>> radeon_uvd_note_usage(rdev);
>> @@ -555,10 +515,8 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
>> parser->const_ib.is_const_ib = true;
>> parser->const_ib.length_dw = ib_chunk->length_dw;
>> /* Copy the packet into the IB */
>> - if (DRM_COPY_FROM_USER(parser->const_ib.ptr, ib_chunk->user_ptr,
>> - ib_chunk->length_dw * 4)) {
>> - return -EFAULT;
>> - }
>> + memcpy(parser->const_ib.ptr, ib_chunk->kdata,
>> + ib_chunk->length_dw * 4);
>> r = radeon_ring_ib_parse(rdev, parser->ring, &parser->const_ib);
>> if (r) {
>> return r;
>> @@ -578,10 +536,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
>> }
>> parser->ib.length_dw = ib_chunk->length_dw;
>> /* Copy the packet into the IB */
>> - if (DRM_COPY_FROM_USER(parser->ib.ptr, ib_chunk->user_ptr,
>> - ib_chunk->length_dw * 4)) {
>> - return -EFAULT;
>> - }
>> + memcpy(parser->ib.ptr, ib_chunk->kdata, ib_chunk->length_dw * 4);
>> r = radeon_ring_ib_parse(rdev, parser->ring, &parser->ib);
>> if (r) {
>> return r;
>> @@ -704,97 +659,6 @@ int radeon_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>> return r;
>> }
>>
>> -int radeon_cs_finish_pages(struct radeon_cs_parser *p)
>> -{
>> - struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
>> - int i;
>> - int size = PAGE_SIZE;
>> -
>> - for (i = ibc->last_copied_page + 1; i <= ibc->last_page_index; i++) {
>> - if (i == ibc->last_page_index) {
>> - size = (ibc->length_dw * 4) % PAGE_SIZE;
>> - if (size == 0)
>> - size = PAGE_SIZE;
>> - }
>> -
>> - if (DRM_COPY_FROM_USER(p->ib.ptr + (i * (PAGE_SIZE/4)),
>> - ibc->user_ptr + (i * PAGE_SIZE),
>> - size))
>> - return -EFAULT;
>> - }
>> - return 0;
>> -}
>> -
>> -static int radeon_cs_update_pages(struct radeon_cs_parser *p, int pg_idx)
>> -{
>> - int new_page;
>> - struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
>> - int i;
>> - int size = PAGE_SIZE;
>> - bool copy1 = (p->rdev && (p->rdev->flags & RADEON_IS_AGP)) ?
>> - false : true;
>> -
>> - for (i = ibc->last_copied_page + 1; i < pg_idx; i++) {
>> - if (DRM_COPY_FROM_USER(p->ib.ptr + (i * (PAGE_SIZE/4)),
>> - ibc->user_ptr + (i * PAGE_SIZE),
>> - PAGE_SIZE)) {
>> - p->parser_error = -EFAULT;
>> - return 0;
>> - }
>> - }
>> -
>> - if (pg_idx == ibc->last_page_index) {
>> - size = (ibc->length_dw * 4) % PAGE_SIZE;
>> - if (size == 0)
>> - size = PAGE_SIZE;
>> - }
>> -
>> - new_page = ibc->kpage_idx[0] < ibc->kpage_idx[1] ? 0 : 1;
>> - if (copy1)
>> - ibc->kpage[new_page] = p->ib.ptr + (pg_idx * (PAGE_SIZE / 4));
>> -
>> - if (DRM_COPY_FROM_USER(ibc->kpage[new_page],
>> - ibc->user_ptr + (pg_idx * PAGE_SIZE),
>> - size)) {
>> - p->parser_error = -EFAULT;
>> - return 0;
>> - }
>> -
>> - /* copy to IB for non single case */
>> - if (!copy1)
>> - memcpy((void *)(p->ib.ptr+(pg_idx*(PAGE_SIZE/4))), ibc->kpage[new_page], size);
>> -
>> - ibc->last_copied_page = pg_idx;
>> - ibc->kpage_idx[new_page] = pg_idx;
>> -
>> - return new_page;
>> -}
>> -
>> -u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx)
>> -{
>> - struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
>> - u32 pg_idx, pg_offset;
>> - u32 idx_value = 0;
>> - int new_page;
>> -
>> - pg_idx = (idx * 4) / PAGE_SIZE;
>> - pg_offset = (idx * 4) % PAGE_SIZE;
>> -
>> - if (ibc->kpage_idx[0] == pg_idx)
>> - return ibc->kpage[0][pg_offset/4];
>> - if (ibc->kpage_idx[1] == pg_idx)
>> - return ibc->kpage[1][pg_offset/4];
>> -
>> - new_page = radeon_cs_update_pages(p, pg_idx);
>> - if (new_page < 0) {
>> - p->parser_error = new_page;
>> - return 0;
>> - }
>> -
>> - idx_value = ibc->kpage[new_page][pg_offset/4];
>> - return idx_value;
>> -}
>> -
>> /**
>> * radeon_cs_packet_parse() - parse cp packet and point ib index to next packet
>> * @parser: parser structure holding parsing context.
>>
>> _______________________________________________
>> dri-devel mailing list
>> [email protected]
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
> _______________________________________________
> dri-devel mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/dri-devel

2013-10-08 14:45:55

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/radeon: fixup locking inversion between mmap_sem and reservations

op 08-10-13 16:33, Jerome Glisse schreef:
> On Tue, Oct 08, 2013 at 04:14:40PM +0200, Maarten Lankhorst wrote:
>> Allocate and copy all kernel memory before doing reservations. This prevents a locking
>> inversion between mmap_sem and reservation_class, and allows us to drop the trylocking
>> in ttm_bo_vm_fault without upsetting lockdep.
>>
>> Signed-off-by: Maarten Lankhorst <[email protected]>
> I would say NAK. Current code only allocate temporary page in AGP case.
> So AGP case is userspace -> temp page -> cs checker -> radeon ib.
>
> Non AGP is directly memcpy to radeon IB.
>
> Your patch allocate memory memcpy userspace to it and it will then be
> memcpy to IB. Which means you introduce an extra memcpy in the process
> not something we want.
>
Can we move up the call to radeon_ib_get then to be done before calling radeon_cs_parser_relocs?

~Maarten

2013-10-08 14:55:50

by Jerome Glisse

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/radeon: fixup locking inversion between mmap_sem and reservations

On Tue, Oct 08, 2013 at 04:45:18PM +0200, Christian K?nig wrote:
> Am 08.10.2013 16:33, schrieb Jerome Glisse:
> >On Tue, Oct 08, 2013 at 04:14:40PM +0200, Maarten Lankhorst wrote:
> >>Allocate and copy all kernel memory before doing reservations. This prevents a locking
> >>inversion between mmap_sem and reservation_class, and allows us to drop the trylocking
> >>in ttm_bo_vm_fault without upsetting lockdep.
> >>
> >>Signed-off-by: Maarten Lankhorst <[email protected]>
> >I would say NAK. Current code only allocate temporary page in AGP case.
> >So AGP case is userspace -> temp page -> cs checker -> radeon ib.
> >
> >Non AGP is directly memcpy to radeon IB.
> >
> >Your patch allocate memory memcpy userspace to it and it will then be
> >memcpy to IB. Which means you introduce an extra memcpy in the process
> >not something we want.
>
> Totally agree. Additional to that there is no good reason to provide
> anything else than anonymous system memory to the CS ioctl, so the
> dependency between the mmap_sem and reservations are not really
> clear to me.
>
> Christian.

I think is that in other code path you take mmap_sem first then reserve
bo. But here we reserve bo and then we take mmap_sem because of copy
from user.

Cheers,
Jerome

>
> >Cheers,
> >Jerome
> >
> >>---
> >>diff --git a/drivers/gpu/drm/radeon/r600_cs.c b/drivers/gpu/drm/radeon/r600_cs.c
> >>index 01a3ec8..efa9bca 100644
> >>--- a/drivers/gpu/drm/radeon/r600_cs.c
> >>+++ b/drivers/gpu/drm/radeon/r600_cs.c
> >>@@ -2391,18 +2391,13 @@ int r600_cs_legacy(struct drm_device *dev, void *data, struct drm_file *filp,
> >> ib_chunk = &parser.chunks[parser.chunk_ib_idx];
> >> parser.ib.length_dw = ib_chunk->length_dw;
> >> *l = parser.ib.length_dw;
> >>+ memcpy(ib, ib_chunk->kdata, ib_chunk->length_dw * 4);
> >> r = r600_cs_parse(&parser);
> >> if (r) {
> >> DRM_ERROR("Invalid command stream !\n");
> >> r600_cs_parser_fini(&parser, r);
> >> return r;
> >> }
> >>- r = radeon_cs_finish_pages(&parser);
> >>- if (r) {
> >>- DRM_ERROR("Invalid command stream !\n");
> >>- r600_cs_parser_fini(&parser, r);
> >>- return r;
> >>- }
> >> r600_cs_parser_fini(&parser, r);
> >> return r;
> >> }
> >>diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
> >>index e067024..c52bb5e 100644
> >>--- a/drivers/gpu/drm/radeon/radeon.h
> >>+++ b/drivers/gpu/drm/radeon/radeon.h
> >>@@ -983,12 +983,7 @@ struct radeon_cs_reloc {
> >> struct radeon_cs_chunk {
> >> uint32_t chunk_id;
> >> uint32_t length_dw;
> >>- int kpage_idx[2];
> >>- uint32_t *kpage[2];
> >> uint32_t *kdata;
> >>- void __user *user_ptr;
> >>- int last_copied_page;
> >>- int last_page_index;
> >> };
> >> struct radeon_cs_parser {
> >>@@ -1027,8 +1022,13 @@ struct radeon_cs_parser {
> >> struct radeon_sa_bo *cpu_sema;
> >> };
> >>-extern int radeon_cs_finish_pages(struct radeon_cs_parser *p);
> >>-extern u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx);
> >>+static inline u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx)
> >>+{
> >>+ struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
> >>+
> >>+ return ibc->kdata[idx];
> >>+}
> >>+
> >> struct radeon_cs_packet {
> >> unsigned idx;
> >>diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
> >>index 7d7322e..98d8898 100644
> >>--- a/drivers/gpu/drm/radeon/radeon_cs.c
> >>+++ b/drivers/gpu/drm/radeon/radeon_cs.c
> >>@@ -217,9 +217,7 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
> >> return -EFAULT;
> >> }
> >> p->chunks[i].length_dw = user_chunk.length_dw;
> >>- p->chunks[i].kdata = NULL;
> >> p->chunks[i].chunk_id = user_chunk.chunk_id;
> >>- p->chunks[i].user_ptr = (void __user *)(unsigned long)user_chunk.chunk_data;
> >> if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_RELOCS) {
> >> p->chunk_relocs_idx = i;
> >> }
> >>@@ -242,25 +240,22 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
> >> return -EINVAL;
> >> }
> >>- cdata = (uint32_t *)(unsigned long)user_chunk.chunk_data;
> >>- if ((p->chunks[i].chunk_id == RADEON_CHUNK_ID_RELOCS) ||
> >>- (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS)) {
> >>- size = p->chunks[i].length_dw * sizeof(uint32_t);
> >>- p->chunks[i].kdata = kmalloc(size, GFP_KERNEL);
> >>- if (p->chunks[i].kdata == NULL) {
> >>- return -ENOMEM;
> >>- }
> >>- if (DRM_COPY_FROM_USER(p->chunks[i].kdata,
> >>- p->chunks[i].user_ptr, size)) {
> >>- return -EFAULT;
> >>- }
> >>- if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS) {
> >>- p->cs_flags = p->chunks[i].kdata[0];
> >>- if (p->chunks[i].length_dw > 1)
> >>- ring = p->chunks[i].kdata[1];
> >>- if (p->chunks[i].length_dw > 2)
> >>- priority = (s32)p->chunks[i].kdata[2];
> >>- }
> >>+ size = p->chunks[i].length_dw;
> >>+ p->chunks[i].kdata = drm_malloc_ab(size, sizeof(uint32_t));
> >>+ size *= sizeof(uint32_t);
> >>+ if (p->chunks[i].kdata == NULL) {
> >>+ return -ENOMEM;
> >>+ }
> >>+ cdata = (void __user *)(unsigned long)user_chunk.chunk_data;
> >>+ if (DRM_COPY_FROM_USER(p->chunks[i].kdata, cdata, size)) {
> >>+ return -EFAULT;
> >>+ }
> >>+ if (p->chunks[i].chunk_id == RADEON_CHUNK_ID_FLAGS) {
> >>+ p->cs_flags = p->chunks[i].kdata[0];
> >>+ if (p->chunks[i].length_dw > 1)
> >>+ ring = p->chunks[i].kdata[1];
> >>+ if (p->chunks[i].length_dw > 2)
> >>+ priority = (s32)p->chunks[i].kdata[2];
> >> }
> >> }
> >>@@ -283,34 +278,6 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
> >> }
> >> }
> >>- /* deal with non-vm */
> >>- if ((p->chunk_ib_idx != -1) &&
> >>- ((p->cs_flags & RADEON_CS_USE_VM) == 0) &&
> >>- (p->chunks[p->chunk_ib_idx].chunk_id == RADEON_CHUNK_ID_IB)) {
> >>- if (p->chunks[p->chunk_ib_idx].length_dw > (16 * 1024)) {
> >>- DRM_ERROR("cs IB too big: %d\n",
> >>- p->chunks[p->chunk_ib_idx].length_dw);
> >>- return -EINVAL;
> >>- }
> >>- if (p->rdev && (p->rdev->flags & RADEON_IS_AGP)) {
> >>- p->chunks[p->chunk_ib_idx].kpage[0] = kmalloc(PAGE_SIZE, GFP_KERNEL);
> >>- p->chunks[p->chunk_ib_idx].kpage[1] = kmalloc(PAGE_SIZE, GFP_KERNEL);
> >>- if (p->chunks[p->chunk_ib_idx].kpage[0] == NULL ||
> >>- p->chunks[p->chunk_ib_idx].kpage[1] == NULL) {
> >>- kfree(p->chunks[p->chunk_ib_idx].kpage[0]);
> >>- kfree(p->chunks[p->chunk_ib_idx].kpage[1]);
> >>- p->chunks[p->chunk_ib_idx].kpage[0] = NULL;
> >>- p->chunks[p->chunk_ib_idx].kpage[1] = NULL;
> >>- return -ENOMEM;
> >>- }
> >>- }
> >>- p->chunks[p->chunk_ib_idx].kpage_idx[0] = -1;
> >>- p->chunks[p->chunk_ib_idx].kpage_idx[1] = -1;
> >>- p->chunks[p->chunk_ib_idx].last_copied_page = -1;
> >>- p->chunks[p->chunk_ib_idx].last_page_index =
> >>- ((p->chunks[p->chunk_ib_idx].length_dw * 4) - 1) / PAGE_SIZE;
> >>- }
> >>-
> >> return 0;
> >> }
> >>@@ -450,13 +417,8 @@ static void radeon_cs_parser_fini(struct radeon_cs_parser *parser, int error, bo
> >> kfree(parser->track);
> >> kfree(parser->relocs);
> >> kfree(parser->relocs_ptr);
> >>- for (i = 0; i < parser->nchunks; i++) {
> >>- kfree(parser->chunks[i].kdata);
> >>- if ((parser->rdev->flags & RADEON_IS_AGP)) {
> >>- kfree(parser->chunks[i].kpage[0]);
> >>- kfree(parser->chunks[i].kpage[1]);
> >>- }
> >>- }
> >>+ for (i = 0; i < parser->nchunks; i++)
> >>+ drm_free_large(parser->chunks[i].kdata);
> >> kfree(parser->chunks);
> >> kfree(parser->chunks_array);
> >> radeon_ib_free(parser->rdev, &parser->ib);
> >>@@ -483,17 +445,15 @@ static int radeon_cs_ib_chunk(struct radeon_device *rdev,
> >> DRM_ERROR("Failed to get ib !\n");
> >> return r;
> >> }
> >>+
> >>+ memcpy(parser->ib.ptr, ib_chunk->kdata, ib_chunk->length_dw * 4);
> >> parser->ib.length_dw = ib_chunk->length_dw;
> >>+
> >> r = radeon_cs_parse(rdev, parser->ring, parser);
> >> if (r || parser->parser_error) {
> >> DRM_ERROR("Invalid command stream !\n");
> >> return r;
> >> }
> >>- r = radeon_cs_finish_pages(parser);
> >>- if (r) {
> >>- DRM_ERROR("Invalid command stream !\n");
> >>- return r;
> >>- }
> >> if (parser->ring == R600_RING_TYPE_UVD_INDEX)
> >> radeon_uvd_note_usage(rdev);
> >>@@ -555,10 +515,8 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
> >> parser->const_ib.is_const_ib = true;
> >> parser->const_ib.length_dw = ib_chunk->length_dw;
> >> /* Copy the packet into the IB */
> >>- if (DRM_COPY_FROM_USER(parser->const_ib.ptr, ib_chunk->user_ptr,
> >>- ib_chunk->length_dw * 4)) {
> >>- return -EFAULT;
> >>- }
> >>+ memcpy(parser->const_ib.ptr, ib_chunk->kdata,
> >>+ ib_chunk->length_dw * 4);
> >> r = radeon_ring_ib_parse(rdev, parser->ring, &parser->const_ib);
> >> if (r) {
> >> return r;
> >>@@ -578,10 +536,7 @@ static int radeon_cs_ib_vm_chunk(struct radeon_device *rdev,
> >> }
> >> parser->ib.length_dw = ib_chunk->length_dw;
> >> /* Copy the packet into the IB */
> >>- if (DRM_COPY_FROM_USER(parser->ib.ptr, ib_chunk->user_ptr,
> >>- ib_chunk->length_dw * 4)) {
> >>- return -EFAULT;
> >>- }
> >>+ memcpy(parser->ib.ptr, ib_chunk->kdata, ib_chunk->length_dw * 4);
> >> r = radeon_ring_ib_parse(rdev, parser->ring, &parser->ib);
> >> if (r) {
> >> return r;
> >>@@ -704,97 +659,6 @@ int radeon_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
> >> return r;
> >> }
> >>-int radeon_cs_finish_pages(struct radeon_cs_parser *p)
> >>-{
> >>- struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
> >>- int i;
> >>- int size = PAGE_SIZE;
> >>-
> >>- for (i = ibc->last_copied_page + 1; i <= ibc->last_page_index; i++) {
> >>- if (i == ibc->last_page_index) {
> >>- size = (ibc->length_dw * 4) % PAGE_SIZE;
> >>- if (size == 0)
> >>- size = PAGE_SIZE;
> >>- }
> >>-
> >>- if (DRM_COPY_FROM_USER(p->ib.ptr + (i * (PAGE_SIZE/4)),
> >>- ibc->user_ptr + (i * PAGE_SIZE),
> >>- size))
> >>- return -EFAULT;
> >>- }
> >>- return 0;
> >>-}
> >>-
> >>-static int radeon_cs_update_pages(struct radeon_cs_parser *p, int pg_idx)
> >>-{
> >>- int new_page;
> >>- struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
> >>- int i;
> >>- int size = PAGE_SIZE;
> >>- bool copy1 = (p->rdev && (p->rdev->flags & RADEON_IS_AGP)) ?
> >>- false : true;
> >>-
> >>- for (i = ibc->last_copied_page + 1; i < pg_idx; i++) {
> >>- if (DRM_COPY_FROM_USER(p->ib.ptr + (i * (PAGE_SIZE/4)),
> >>- ibc->user_ptr + (i * PAGE_SIZE),
> >>- PAGE_SIZE)) {
> >>- p->parser_error = -EFAULT;
> >>- return 0;
> >>- }
> >>- }
> >>-
> >>- if (pg_idx == ibc->last_page_index) {
> >>- size = (ibc->length_dw * 4) % PAGE_SIZE;
> >>- if (size == 0)
> >>- size = PAGE_SIZE;
> >>- }
> >>-
> >>- new_page = ibc->kpage_idx[0] < ibc->kpage_idx[1] ? 0 : 1;
> >>- if (copy1)
> >>- ibc->kpage[new_page] = p->ib.ptr + (pg_idx * (PAGE_SIZE / 4));
> >>-
> >>- if (DRM_COPY_FROM_USER(ibc->kpage[new_page],
> >>- ibc->user_ptr + (pg_idx * PAGE_SIZE),
> >>- size)) {
> >>- p->parser_error = -EFAULT;
> >>- return 0;
> >>- }
> >>-
> >>- /* copy to IB for non single case */
> >>- if (!copy1)
> >>- memcpy((void *)(p->ib.ptr+(pg_idx*(PAGE_SIZE/4))), ibc->kpage[new_page], size);
> >>-
> >>- ibc->last_copied_page = pg_idx;
> >>- ibc->kpage_idx[new_page] = pg_idx;
> >>-
> >>- return new_page;
> >>-}
> >>-
> >>-u32 radeon_get_ib_value(struct radeon_cs_parser *p, int idx)
> >>-{
> >>- struct radeon_cs_chunk *ibc = &p->chunks[p->chunk_ib_idx];
> >>- u32 pg_idx, pg_offset;
> >>- u32 idx_value = 0;
> >>- int new_page;
> >>-
> >>- pg_idx = (idx * 4) / PAGE_SIZE;
> >>- pg_offset = (idx * 4) % PAGE_SIZE;
> >>-
> >>- if (ibc->kpage_idx[0] == pg_idx)
> >>- return ibc->kpage[0][pg_offset/4];
> >>- if (ibc->kpage_idx[1] == pg_idx)
> >>- return ibc->kpage[1][pg_offset/4];
> >>-
> >>- new_page = radeon_cs_update_pages(p, pg_idx);
> >>- if (new_page < 0) {
> >>- p->parser_error = new_page;
> >>- return 0;
> >>- }
> >>-
> >>- idx_value = ibc->kpage[new_page][pg_offset/4];
> >>- return idx_value;
> >>-}
> >>-
> >> /**
> >> * radeon_cs_packet_parse() - parse cp packet and point ib index to next packet
> >> * @parser: parser structure holding parsing context.
> >>
> >>_______________________________________________
> >>dri-devel mailing list
> >>[email protected]
> >>http://lists.freedesktop.org/mailman/listinfo/dri-devel
> >_______________________________________________
> >dri-devel mailing list
> >[email protected]
> >http://lists.freedesktop.org/mailman/listinfo/dri-devel
>

2013-10-08 16:30:05

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/radeon: fixup locking inversion between mmap_sem and reservations

On 10/08/2013 04:55 PM, Jerome Glisse wrote:
> On Tue, Oct 08, 2013 at 04:45:18PM +0200, Christian K?nig wrote:
>> Am 08.10.2013 16:33, schrieb Jerome Glisse:
>>> On Tue, Oct 08, 2013 at 04:14:40PM +0200, Maarten Lankhorst wrote:
>>>> Allocate and copy all kernel memory before doing reservations. This prevents a locking
>>>> inversion between mmap_sem and reservation_class, and allows us to drop the trylocking
>>>> in ttm_bo_vm_fault without upsetting lockdep.
>>>>
>>>> Signed-off-by: Maarten Lankhorst <[email protected]>
>>> I would say NAK. Current code only allocate temporary page in AGP case.
>>> So AGP case is userspace -> temp page -> cs checker -> radeon ib.
>>>
>>> Non AGP is directly memcpy to radeon IB.
>>>
>>> Your patch allocate memory memcpy userspace to it and it will then be
>>> memcpy to IB. Which means you introduce an extra memcpy in the process
>>> not something we want.
>> Totally agree. Additional to that there is no good reason to provide
>> anything else than anonymous system memory to the CS ioctl, so the
>> dependency between the mmap_sem and reservations are not really
>> clear to me.
>>
>> Christian.
> I think is that in other code path you take mmap_sem first then reserve
> bo. But here we reserve bo and then we take mmap_sem because of copy
> from user.
>
> Cheers,
> Jerome
>
Actually the log message is a little confusing. I think the mmap_sem
locking inversion problem is orthogonal to what's being fixed here.

This patch fixes the possible recursive bo::reserve caused by malicious
user-space handing a pointer to ttm memory so that the ttm fault handler
is called when bos are already reserved. That may cause a (possibly
interruptible) livelock.

Once that is fixed, we are free to choose the mmap_sem -> bo::reserve
locking order. Currently it's bo::reserve->mmap_sem(), but the hack
required in the ttm fault handler is admittedly a bit ugly. The plan is
to change the locking order to mmap_sem->bo::reserve

I'm not sure if it applies to this particular case, but it should be
possible to make sure that copy_from_user_inatomic() will always
succeed, by making sure the pages are present using get_user_pages(),
and release the pages after copy_from_user_inatomic() is done. That way
there's no need for a double memcpy slowpath, but if the copied data is
very fragmented I guess the resulting code may look ugly. The
get_user_pages() function will return an error if it hits TTM pages.

/Thomas

2013-10-08 16:47:49

by Jerome Glisse

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/radeon: fixup locking inversion between mmap_sem and reservations

On Tue, Oct 08, 2013 at 06:29:35PM +0200, Thomas Hellstrom wrote:
> On 10/08/2013 04:55 PM, Jerome Glisse wrote:
> >On Tue, Oct 08, 2013 at 04:45:18PM +0200, Christian K?nig wrote:
> >>Am 08.10.2013 16:33, schrieb Jerome Glisse:
> >>>On Tue, Oct 08, 2013 at 04:14:40PM +0200, Maarten Lankhorst wrote:
> >>>>Allocate and copy all kernel memory before doing reservations. This prevents a locking
> >>>>inversion between mmap_sem and reservation_class, and allows us to drop the trylocking
> >>>>in ttm_bo_vm_fault without upsetting lockdep.
> >>>>
> >>>>Signed-off-by: Maarten Lankhorst <[email protected]>
> >>>I would say NAK. Current code only allocate temporary page in AGP case.
> >>>So AGP case is userspace -> temp page -> cs checker -> radeon ib.
> >>>
> >>>Non AGP is directly memcpy to radeon IB.
> >>>
> >>>Your patch allocate memory memcpy userspace to it and it will then be
> >>>memcpy to IB. Which means you introduce an extra memcpy in the process
> >>>not something we want.
> >>Totally agree. Additional to that there is no good reason to provide
> >>anything else than anonymous system memory to the CS ioctl, so the
> >>dependency between the mmap_sem and reservations are not really
> >>clear to me.
> >>
> >>Christian.
> >I think is that in other code path you take mmap_sem first then reserve
> >bo. But here we reserve bo and then we take mmap_sem because of copy
> >from user.
> >
> >Cheers,
> >Jerome
> >
> Actually the log message is a little confusing. I think the mmap_sem
> locking inversion problem is orthogonal to what's being fixed here.
>
> This patch fixes the possible recursive bo::reserve caused by
> malicious user-space handing a pointer to ttm memory so that the ttm
> fault handler is called when bos are already reserved. That may
> cause a (possibly interruptible) livelock.
>
> Once that is fixed, we are free to choose the mmap_sem ->
> bo::reserve locking order. Currently it's bo::reserve->mmap_sem(),
> but the hack required in the ttm fault handler is admittedly a bit
> ugly. The plan is to change the locking order to
> mmap_sem->bo::reserve
>
> I'm not sure if it applies to this particular case, but it should be
> possible to make sure that copy_from_user_inatomic() will always
> succeed, by making sure the pages are present using
> get_user_pages(), and release the pages after
> copy_from_user_inatomic() is done. That way there's no need for a
> double memcpy slowpath, but if the copied data is very fragmented I
> guess the resulting code may look ugly. The get_user_pages()
> function will return an error if it hits TTM pages.
>
> /Thomas

get_user_pages + copy_from_user_inatomic is overkill. We should just
do get_user_pages which fails with ttm memory and then use copy_highpage
helper.

Cheers,
Jerome

2013-10-08 16:58:17

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/radeon: fixup locking inversion between mmap_sem and reservations

On 10/08/2013 06:47 PM, Jerome Glisse wrote:
> On Tue, Oct 08, 2013 at 06:29:35PM +0200, Thomas Hellstrom wrote:
>> On 10/08/2013 04:55 PM, Jerome Glisse wrote:
>>> On Tue, Oct 08, 2013 at 04:45:18PM +0200, Christian K?nig wrote:
>>>> Am 08.10.2013 16:33, schrieb Jerome Glisse:
>>>>> On Tue, Oct 08, 2013 at 04:14:40PM +0200, Maarten Lankhorst wrote:
>>>>>> Allocate and copy all kernel memory before doing reservations. This prevents a locking
>>>>>> inversion between mmap_sem and reservation_class, and allows us to drop the trylocking
>>>>>> in ttm_bo_vm_fault without upsetting lockdep.
>>>>>>
>>>>>> Signed-off-by: Maarten Lankhorst <[email protected]>
>>>>> I would say NAK. Current code only allocate temporary page in AGP case.
>>>>> So AGP case is userspace -> temp page -> cs checker -> radeon ib.
>>>>>
>>>>> Non AGP is directly memcpy to radeon IB.
>>>>>
>>>>> Your patch allocate memory memcpy userspace to it and it will then be
>>>>> memcpy to IB. Which means you introduce an extra memcpy in the process
>>>>> not something we want.
>>>> Totally agree. Additional to that there is no good reason to provide
>>>> anything else than anonymous system memory to the CS ioctl, so the
>>>> dependency between the mmap_sem and reservations are not really
>>>> clear to me.
>>>>
>>>> Christian.
>>> I think is that in other code path you take mmap_sem first then reserve
>>> bo. But here we reserve bo and then we take mmap_sem because of copy
>> >from user.
>>> Cheers,
>>> Jerome
>>>
>> Actually the log message is a little confusing. I think the mmap_sem
>> locking inversion problem is orthogonal to what's being fixed here.
>>
>> This patch fixes the possible recursive bo::reserve caused by
>> malicious user-space handing a pointer to ttm memory so that the ttm
>> fault handler is called when bos are already reserved. That may
>> cause a (possibly interruptible) livelock.
>>
>> Once that is fixed, we are free to choose the mmap_sem ->
>> bo::reserve locking order. Currently it's bo::reserve->mmap_sem(),
>> but the hack required in the ttm fault handler is admittedly a bit
>> ugly. The plan is to change the locking order to
>> mmap_sem->bo::reserve
>>
>> I'm not sure if it applies to this particular case, but it should be
>> possible to make sure that copy_from_user_inatomic() will always
>> succeed, by making sure the pages are present using
>> get_user_pages(), and release the pages after
>> copy_from_user_inatomic() is done. That way there's no need for a
>> double memcpy slowpath, but if the copied data is very fragmented I
>> guess the resulting code may look ugly. The get_user_pages()
>> function will return an error if it hits TTM pages.
>>
>> /Thomas
> get_user_pages + copy_from_user_inatomic is overkill. We should just
> do get_user_pages which fails with ttm memory and then use copy_highpage
> helper.
>
> Cheers,
> Jerome
Yeah, it may well be that that's the preferred solution.

/Thomas

2013-10-09 10:58:46

[permalink] [raw]

Subject: Re: [RFC PATCH] drm/radeon: fixup locking inversion between mmap_sem and reservations

op 08-10-13 18:47, Jerome Glisse schreef:
> On Tue, Oct 08, 2013 at 06:29:35PM +0200, Thomas Hellstrom wrote:
>> On 10/08/2013 04:55 PM, Jerome Glisse wrote:
>>> On Tue, Oct 08, 2013 at 04:45:18PM +0200, Christian K?nig wrote:
>>>> Am 08.10.2013 16:33, schrieb Jerome Glisse:
>>>>> On Tue, Oct 08, 2013 at 04:14:40PM +0200, Maarten Lankhorst wrote:
>>>>>> Allocate and copy all kernel memory before doing reservations. This prevents a locking
>>>>>> inversion between mmap_sem and reservation_class, and allows us to drop the trylocking
>>>>>> in ttm_bo_vm_fault without upsetting lockdep.
>>>>>>
>>>>>> Signed-off-by: Maarten Lankhorst <[email protected]>
>>>>> I would say NAK. Current code only allocate temporary page in AGP case.
>>>>> So AGP case is userspace -> temp page -> cs checker -> radeon ib.
>>>>>
>>>>> Non AGP is directly memcpy to radeon IB.
>>>>>
>>>>> Your patch allocate memory memcpy userspace to it and it will then be
>>>>> memcpy to IB. Which means you introduce an extra memcpy in the process
>>>>> not something we want.
>>>> Totally agree. Additional to that there is no good reason to provide
>>>> anything else than anonymous system memory to the CS ioctl, so the
>>>> dependency between the mmap_sem and reservations are not really
>>>> clear to me.
>>>>
>>>> Christian.
>>> I think is that in other code path you take mmap_sem first then reserve
>>> bo. But here we reserve bo and then we take mmap_sem because of copy
>> >from user.
>>> Cheers,
>>> Jerome
>>>
>> Actually the log message is a little confusing. I think the mmap_sem
>> locking inversion problem is orthogonal to what's being fixed here.
>>
>> This patch fixes the possible recursive bo::reserve caused by
>> malicious user-space handing a pointer to ttm memory so that the ttm
>> fault handler is called when bos are already reserved. That may
>> cause a (possibly interruptible) livelock.
>>
>> Once that is fixed, we are free to choose the mmap_sem ->
>> bo::reserve locking order. Currently it's bo::reserve->mmap_sem(),
>> but the hack required in the ttm fault handler is admittedly a bit
>> ugly. The plan is to change the locking order to
>> mmap_sem->bo::reserve
>>
>> I'm not sure if it applies to this particular case, but it should be
>> possible to make sure that copy_from_user_inatomic() will always
>> succeed, by making sure the pages are present using
>> get_user_pages(), and release the pages after
>> copy_from_user_inatomic() is done. That way there's no need for a
>> double memcpy slowpath, but if the copied data is very fragmented I
>> guess the resulting code may look ugly. The get_user_pages()
>> function will return an error if it hits TTM pages.
>>
>> /Thomas
> get_user_pages + copy_from_user_inatomic is overkill. We should just
> do get_user_pages which fails with ttm memory and then use copy_highpage
> helper.
I don't think we have to do anything that complicated, the easiest solution seems to be to
call radeon_ib_get before calling radeon_cs_parser_relocs, and copy everything to the ib
buffer before taking the reserve lock.

~Maarten

2013-10-09 12:37:09