2020-06-19 16:23:14

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 5:15 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Fri, Jun 19, 2020 at 05:06:04PM +0200, Daniel Vetter wrote:
> > On Fri, Jun 19, 2020 at 1:39 PM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > On Fri, Jun 19, 2020 at 09:22:09AM +0200, Daniel Vetter wrote:
> > > > > As I've understood GPU that means you need to show that the commands
> > > > > associated with the buffer have completed. This is all local stuff
> > > > > within the driver, right? Why use fence (other than it already exists)
> > > >
> > > > Because that's the end-of-dma thing. And it's cross-driver for the
> > > > above reasons, e.g.
> > > > - device A renders some stuff. Userspace gets dma_fence A out of that
> > > > (well sync_file or one of the other uapi interfaces, but you get the
> > > > idea)
> > > > - userspace (across process or just different driver) issues more
> > > > rendering for device B, which depends upon the rendering done on
> > > > device A. So dma_fence A is an dependency and will block this dma
> > > > operation. Userspace (and the kernel) gets dma_fence B out of this
> > > > - because unfortunate reasons, the same rendering on device B also
> > > > needs a userptr buffer, which means that dma_fence B is also the one
> > > > that the mmu_range_notifier needs to wait on before it can tell core
> > > > mm that it can go ahead and release those pages
> > >
> > > I was afraid you'd say this - this is complete madness for other DMA
> > > devices to borrow the notifier hook of the first device!
> >
> > The first device might not even have a notifier. This is the 2nd
> > device, waiting on a dma_fence of its own, but which happens to be
> > queued up as a dma operation behind something else.
> >
> > > What if the first device is a page faulting device and doesn't call
> > > dma_fence??
> >
> > Not sure what you mean with this ... even if it does page-faulting for
> > some other reasons, it'll emit a dma_fence which the 2nd device can
> > consume as a dependency.
>
> At some point the pages under the buffer have to be either pinned
> or protected by mmu notifier. So each and every single device doing
> DMA to these pages must either pin, or use mmu notifier.
>
> Driver A should never 'borrow' a notifier from B

It doesn't. I guess this would be great topic for lpc with a seriously
big white-board, but I guess we don't have that this year again, so
let me try again. Simplified example ofc, but should be the gist.

Ingredients:
Device A and Device B
A dma-buf, shared between device A and device B, let's call that shared_buf
A userptr buffer, which userspace created on device B to hopefully
somewhat track a virtual memory range, let's call that userptr_buf.
A pile of other buffers, but we pretend they don't exist (because they
kinda don't matter.

Sequence of events as userspace issues them to the kernel.
1. dma operation on device A, which fills some interesting stuff into
shared_buf. Userspace gets back a handle to dma_fence fence_A. No mmu
notifier anywhere to be seen in the driver for device A.

2. userspace passes fence_A around to some other place

3. other places takes the handle for shared_buf and fence_A and
userptr_buf and starts a dma operation on device B. It's one dma
operation, maybe device B is taking the data from shared_buf and
compresses it into userptr_buf, so that userspace can then send it
over the network or to disk or whatever. device B has a mmu_notifier.
Userspace gets back fence_B, which represents this dma operation. The
kernel also stuffs this fence_B into the mmu_range_notifier for
userptr_buf.

-> at this point device A might still be crunching the numbers

4. device A is finally done doing whatever it was supposed to do, and
fence_A completes

5. device B wakes up (this might or might not involve the kernel,
usually it does) since fence_A has completed, and now starts doing its
own crunching.

6. once device B is also done, it signals fence_B

In all this device A has never borrowed the mmu notifier or even
accessd the memory in userptr_buf or had access to that buffer handle.

The madness is only that device B's mmu notifier might need to wait
for fence_B so that the dma operation finishes. Which in turn has to
wait for device A to finish first.

> If each driver controls its own lifetime of the buffers, why can't the
> driver locally wait for its device to finish?
>
> Can't the GPUs cancel work that is waiting on a DMA fence? Ie if
> Driver A detects that work completed and wants to trigger a DMA fence,
> but it now knows the buffer is invalidated, can't it tell driver B to
> give up?

We can (usually, the shitty hw where we can't has generally
disappeared) with gpu reset. Users make really sad faces when that
happens though, and generally they're only ok with that if it's indeed
a nasty gpu program that resulted in the crash (there's some webgl
shaders that run too long for quick&easy testing of how good the gpu
reset is, don't do that if you care about the data in your desktop
session ...).

The trouble is that userspace assembles the work that's queued up on
the gpu. After submission everyone has forgotten enough that just
canceling stuff and re-issuing everything isn't on the table.

Some hw is better, with real hw page faults and stuff, but those also
don't need dma_fence to track their memory. But generally just not
possible.

> > The problem is that there's piles of other dependencies for a dma job.
> > GPU doesn't just consume a single buffer each time, it consumes entire
> > lists of buffers and mixes them all up in funny ways. Some of these
> > buffers are userptr, entirely local to the device. Other buffers are
> > just normal device driver allocations (and managed with some shrinker
> > to keep them in check). And then there's the actually shared dma-buf
> > with other devices. The trouble is that they're all bundled up
> > together.
>
> But why does this matter? Does the GPU itself consume some work and
> then stall internally waiting for an external DMA fence?

Yup, see above, that's what's going on. Userspace queues up
distributed work across engines & drivers, and then just waits for the
entire thing to cascade and finish.

> Otherwise I would expect this dependency chain should be breakable by
> aborting work waiting on fences upon invalidation (without stalling)

Yup, it would. Now on some hw you have a gpu work scheduler that sits
in some kthread, and you could probably unschedule the work if there's
some external dependency and you get an mmu notifier callback. Then
put it on some queue, re-acquire the user pages and then reschedule
it.

It's still as horrible, since you still have the wait for the
completion in there, the only benefit is that other device drivers
without userptr support don't have to live with that specific
constraint. dma_fence rules are still very strict and easy to
deadlock, so we'd still want some lockdep checks, but now you'd have
to somehow annotate whether you're a driver with userptr or a driver
without userptr and make sure everyone gets it right.

Also a scheduler which can unschedule and reschedule is mighty more
complex than one which cannot, plus it needs to do that from mmu
notifier callback (not the nicest calling context we have in the
kernel by far). And if you have a single driver which doesn't
unschedule, you're still screwed from an overall subsystem pov.

So lots of code, lots of work, and not that much motivation to roll it
out consistently across the board since there's no incremental payoff.
Plus the thing is, the drivers without userptr are generally the
really simple ones. Much easier to just fix those than to change the
big complex render beasts which want userptr :-)

E.g. the atomic modeset framework we've rolled out in the past few
years and that almost all display drivers now use pulls any (sleeping)
locks and memory allocations out of the critical async work section by
design. Some drivers still managed to butcher it (the annotations
caught some locking bugs already, not just memory allocations in the
wrong spot), but generally easy to fix those.

> > > Do not need to wait on dma_fence in notifiers.
> >
> > Maybe :-) The goal of this series is more to document current rules
> > and make them more consistent. Fixing them if we don't like them might
> > be a follow-up task, but that would likely be a pile more work. First
> > we need to know what the exact shape of the problem even is.
>
> Fair enough

Full disclosure: We are aware that we've designed ourselves into an
impressive corner here, and there's lots of talks going on about
untangling the dma synchronization from the memory management
completely. But

- that needs minimally reliable preempt support for gpu work, and hw
engineers seem to have a hard time with that (or just don't want to do
it). hw page faults would be even better, and even more wishlist than
reality if you expect it to work everywhere.

- it'd be a complete break of the established userspace abi, including
all the cross driver stuff. Which means it's not just some in-kernel
refactoring, we need to rev the entire ecosystem. And that takes a
very long time, and needs serious pressure to get people moving.

E.g. the atomic modeset rework is still not yet rolled out to major
linux desktop environments, and it's over 5 years old, and it's
starting to seriously hurt because lots of performance features
require atomic modeset in userspace to be able to use them. I think
rev'ing the entire memory management support will take as long. Plus I
don't think we can ditch the old ways - even if all the hw currently
using this would be dead (and we can delete the drivers) there's still
the much smaller gpus in SoC that also need to go through the entire
evolution.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


2020-06-20 04:16:02

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:

> The madness is only that device B's mmu notifier might need to wait
> for fence_B so that the dma operation finishes. Which in turn has to
> wait for device A to finish first.

So, it sound, fundamentally you've got this graph of operations across
an unknown set of drivers and the kernel cannot insert itself in
dma_fence hand offs to re-validate any of the buffers involved?
Buffers which by definition cannot be touched by the hardware yet.

That really is a pretty horrible place to end up..

Pinning really is right answer for this kind of work flow. I think
converting pinning to notifers should not be done unless notifier
invalidation is relatively bounded.

I know people like notifiers because they give a bit nicer performance
in some happy cases, but this cripples all the bad cases..

If pinning doesn't work for some reason maybe we should address that?

> Full disclosure: We are aware that we've designed ourselves into an
> impressive corner here, and there's lots of talks going on about
> untangling the dma synchronization from the memory management
> completely. But

I think the documenting is really important: only GPU should be using
this stuff and driving notifiers this way. Complete NO for any
totally-not-a-GPU things in drivers/accel for sure.

Jason

2020-06-20 04:23:42

by Jerome Glisse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
>
> > The madness is only that device B's mmu notifier might need to wait
> > for fence_B so that the dma operation finishes. Which in turn has to
> > wait for device A to finish first.
>
> So, it sound, fundamentally you've got this graph of operations across
> an unknown set of drivers and the kernel cannot insert itself in
> dma_fence hand offs to re-validate any of the buffers involved?
> Buffers which by definition cannot be touched by the hardware yet.
>
> That really is a pretty horrible place to end up..
>
> Pinning really is right answer for this kind of work flow. I think
> converting pinning to notifers should not be done unless notifier
> invalidation is relatively bounded.
>
> I know people like notifiers because they give a bit nicer performance
> in some happy cases, but this cripples all the bad cases..
>
> If pinning doesn't work for some reason maybe we should address that?

Note that the dma fence is only true for user ptr buffer which predate
any HMM work and thus were using mmu notifier already. You need the
mmu notifier there because of fork and other corner cases.

For nouveau the notifier do not need to wait for anything it can update
the GPU page table right away. Modulo needing to write to GPU memory
using dma engine if the GPU page table is in GPU memory that is not
accessible from the CPU but that's never the case for nouveau so far
(but i expect it will be at one point).


So i see this as 2 different cases, the user ptr case, which does pin
pages by the way, where things are synchronous. Versus the HMM cases
where everything is asynchronous.


I probably need to warn AMD folks again that using HMM means that you
must be able to update the GPU page table asynchronously without
fence wait. The issue for AMD is that they already update their GPU
page table using DMA engine. I believe this is still doable if they
use a kernel only DMA engine context, where only kernel can queue up
jobs so that you do not need to wait for unrelated things and you can
prioritize GPU page table update which should translate in fast GPU
page table update without DMA fence.


> > Full disclosure: We are aware that we've designed ourselves into an
> > impressive corner here, and there's lots of talks going on about
> > untangling the dma synchronization from the memory management
> > completely. But
>
> I think the documenting is really important: only GPU should be using
> this stuff and driving notifiers this way. Complete NO for any
> totally-not-a-GPU things in drivers/accel for sure.

Yes for user that expect HMM they need to be asynchronous. But it is
hard to revert user ptr has it was done a long time ago.

Cheers,
J?r?me

2020-06-20 04:24:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> >
> > > The madness is only that device B's mmu notifier might need to wait
> > > for fence_B so that the dma operation finishes. Which in turn has to
> > > wait for device A to finish first.
> >
> > So, it sound, fundamentally you've got this graph of operations across
> > an unknown set of drivers and the kernel cannot insert itself in
> > dma_fence hand offs to re-validate any of the buffers involved?
> > Buffers which by definition cannot be touched by the hardware yet.
> >
> > That really is a pretty horrible place to end up..
> >
> > Pinning really is right answer for this kind of work flow. I think
> > converting pinning to notifers should not be done unless notifier
> > invalidation is relatively bounded.
> >
> > I know people like notifiers because they give a bit nicer performance
> > in some happy cases, but this cripples all the bad cases..
> >
> > If pinning doesn't work for some reason maybe we should address that?
>
> Note that the dma fence is only true for user ptr buffer which predate
> any HMM work and thus were using mmu notifier already. You need the
> mmu notifier there because of fork and other corner cases.

I wonder if we should try to fix the fork case more directly - RDMA
has this same problem and added MADV_DONTFORK a long time ago as a
hacky way to deal with it.

Some crazy page pin that resolved COW in a way that always kept the
physical memory with the mm that initiated the pin?

(isn't this broken for O_DIRECT as well anyhow?)

How does mmu_notifiers help the fork case anyhow? Block fork from
progressing?

> I probably need to warn AMD folks again that using HMM means that you
> must be able to update the GPU page table asynchronously without
> fence wait.

It is kind of unrelated to HMM, it just shouldn't be using mmu
notifiers to replace page pinning..

> The issue for AMD is that they already update their GPU page table
> using DMA engine. I believe this is still doable if they use a
> kernel only DMA engine context, where only kernel can queue up jobs
> so that you do not need to wait for unrelated things and you can
> prioritize GPU page table update which should translate in fast GPU
> page table update without DMA fence.

Make sense

I'm not sure I saw this in the AMD hmm stuff - it would be good if
someone would look at that. Every time I do it looks like the locking
is wrong.

Jason

2020-06-20 04:30:22

by Alex Deucher

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse <[email protected]> wrote:
>
> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> >
> > > The madness is only that device B's mmu notifier might need to wait
> > > for fence_B so that the dma operation finishes. Which in turn has to
> > > wait for device A to finish first.
> >
> > So, it sound, fundamentally you've got this graph of operations across
> > an unknown set of drivers and the kernel cannot insert itself in
> > dma_fence hand offs to re-validate any of the buffers involved?
> > Buffers which by definition cannot be touched by the hardware yet.
> >
> > That really is a pretty horrible place to end up..
> >
> > Pinning really is right answer for this kind of work flow. I think
> > converting pinning to notifers should not be done unless notifier
> > invalidation is relatively bounded.
> >
> > I know people like notifiers because they give a bit nicer performance
> > in some happy cases, but this cripples all the bad cases..
> >
> > If pinning doesn't work for some reason maybe we should address that?
>
> Note that the dma fence is only true for user ptr buffer which predate
> any HMM work and thus were using mmu notifier already. You need the
> mmu notifier there because of fork and other corner cases.
>
> For nouveau the notifier do not need to wait for anything it can update
> the GPU page table right away. Modulo needing to write to GPU memory
> using dma engine if the GPU page table is in GPU memory that is not
> accessible from the CPU but that's never the case for nouveau so far
> (but i expect it will be at one point).
>
>
> So i see this as 2 different cases, the user ptr case, which does pin
> pages by the way, where things are synchronous. Versus the HMM cases
> where everything is asynchronous.
>
>
> I probably need to warn AMD folks again that using HMM means that you
> must be able to update the GPU page table asynchronously without
> fence wait. The issue for AMD is that they already update their GPU
> page table using DMA engine. I believe this is still doable if they
> use a kernel only DMA engine context, where only kernel can queue up
> jobs so that you do not need to wait for unrelated things and you can
> prioritize GPU page table update which should translate in fast GPU
> page table update without DMA fence.

All devices which support recoverable page faults also have a
dedicated paging engine for the kernel driver which the driver already
makes use of. We can also update the GPU page tables with the CPU.

Alex

>
>
> > > Full disclosure: We are aware that we've designed ourselves into an
> > > impressive corner here, and there's lots of talks going on about
> > > untangling the dma synchronization from the memory management
> > > completely. But
> >
> > I think the documenting is really important: only GPU should be using
> > this stuff and driving notifiers this way. Complete NO for any
> > totally-not-a-GPU things in drivers/accel for sure.
>
> Yes for user that expect HMM they need to be asynchronous. But it is
> hard to revert user ptr has it was done a long time ago.
>
> Cheers,
> Jérôme
>
> _______________________________________________
> amd-gfx mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

2020-06-20 04:33:36

by Felix Kuehling

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations


Am 2020-06-19 um 3:11 p.m. schrieb Alex Deucher:
> On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse <[email protected]> wrote:
>> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
>>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
>>>
>>>> The madness is only that device B's mmu notifier might need to wait
>>>> for fence_B so that the dma operation finishes. Which in turn has to
>>>> wait for device A to finish first.
>>> So, it sound, fundamentally you've got this graph of operations across
>>> an unknown set of drivers and the kernel cannot insert itself in
>>> dma_fence hand offs to re-validate any of the buffers involved?
>>> Buffers which by definition cannot be touched by the hardware yet.
>>>
>>> That really is a pretty horrible place to end up..
>>>
>>> Pinning really is right answer for this kind of work flow. I think
>>> converting pinning to notifers should not be done unless notifier
>>> invalidation is relatively bounded.
>>>
>>> I know people like notifiers because they give a bit nicer performance
>>> in some happy cases, but this cripples all the bad cases..
>>>
>>> If pinning doesn't work for some reason maybe we should address that?
>> Note that the dma fence is only true for user ptr buffer which predate
>> any HMM work and thus were using mmu notifier already. You need the
>> mmu notifier there because of fork and other corner cases.
>>
>> For nouveau the notifier do not need to wait for anything it can update
>> the GPU page table right away. Modulo needing to write to GPU memory
>> using dma engine if the GPU page table is in GPU memory that is not
>> accessible from the CPU but that's never the case for nouveau so far
>> (but i expect it will be at one point).
>>
>>
>> So i see this as 2 different cases, the user ptr case, which does pin
>> pages by the way, where things are synchronous. Versus the HMM cases
>> where everything is asynchronous.
>>
>>
>> I probably need to warn AMD folks again that using HMM means that you
>> must be able to update the GPU page table asynchronously without
>> fence wait. The issue for AMD is that they already update their GPU
>> page table using DMA engine. I believe this is still doable if they
>> use a kernel only DMA engine context, where only kernel can queue up
>> jobs so that you do not need to wait for unrelated things and you can
>> prioritize GPU page table update which should translate in fast GPU
>> page table update without DMA fence.
> All devices which support recoverable page faults also have a
> dedicated paging engine for the kernel driver which the driver already
> makes use of. We can also update the GPU page tables with the CPU.

We have a potential problem with CPU updating page tables while the GPU
is retrying on page table entries because 64 bit CPU transactions don't
arrive in device memory atomically.

We are using SDMA for page table updates. This currently goes through a
the DRM GPU scheduler to a special SDMA queue that's used by kernel-mode
only. But since it's based on the DRM GPU scheduler, we do use dma-fence
to wait for completion.

Regards,
  Felix


>
> Alex
>
>>
>>>> Full disclosure: We are aware that we've designed ourselves into an
>>>> impressive corner here, and there's lots of talks going on about
>>>> untangling the dma synchronization from the memory management
>>>> completely. But
>>> I think the documenting is really important: only GPU should be using
>>> this stuff and driving notifiers this way. Complete NO for any
>>> totally-not-a-GPU things in drivers/accel for sure.
>> Yes for user that expect HMM they need to be asynchronous. But it is
>> hard to revert user ptr has it was done a long time ago.
>>
>> Cheers,
>> Jérôme
>>
>> _______________________________________________
>> amd-gfx mailing list
>> [email protected]
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

2020-06-20 04:37:18

by Felix Kuehling

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
> On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
>> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
>>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
>>>
>>>> The madness is only that device B's mmu notifier might need to wait
>>>> for fence_B so that the dma operation finishes. Which in turn has to
>>>> wait for device A to finish first.
>>> So, it sound, fundamentally you've got this graph of operations across
>>> an unknown set of drivers and the kernel cannot insert itself in
>>> dma_fence hand offs to re-validate any of the buffers involved?
>>> Buffers which by definition cannot be touched by the hardware yet.
>>>
>>> That really is a pretty horrible place to end up..
>>>
>>> Pinning really is right answer for this kind of work flow. I think
>>> converting pinning to notifers should not be done unless notifier
>>> invalidation is relatively bounded.
>>>
>>> I know people like notifiers because they give a bit nicer performance
>>> in some happy cases, but this cripples all the bad cases..
>>>
>>> If pinning doesn't work for some reason maybe we should address that?
>> Note that the dma fence is only true for user ptr buffer which predate
>> any HMM work and thus were using mmu notifier already. You need the
>> mmu notifier there because of fork and other corner cases.
> I wonder if we should try to fix the fork case more directly - RDMA
> has this same problem and added MADV_DONTFORK a long time ago as a
> hacky way to deal with it.
>
> Some crazy page pin that resolved COW in a way that always kept the
> physical memory with the mm that initiated the pin?
>
> (isn't this broken for O_DIRECT as well anyhow?)
>
> How does mmu_notifiers help the fork case anyhow? Block fork from
> progressing?

How much the mmu_notifier blocks fork progress depends, on quickly we
can preempt GPU jobs accessing affected memory. If we don't have
fine-grained preemption capability (graphics), the best we can do is
wait for the GPU jobs to complete. We can also delay submission of new
GPU jobs to the same memory until the MMU notifier is done. Future jobs
would use the new page addresses.

With fine-grained preemption (ROCm compute), we can preempt GPU work on
the affected adders space to minimize the delay seen by fork.

With recoverable device page faults, we can invalidate GPU page table
entries, so device access to the affected pages stops immediately.

In all cases, the end result is, that the device page table gets updated
with the address of the copied pages before the GPU accesses the COW
memory again.Without the MMU notifier, we'd end up with the GPU
corrupting memory of the other process.

Regards,
  Felix


>
>> I probably need to warn AMD folks again that using HMM means that you
>> must be able to update the GPU page table asynchronously without
>> fence wait.
> It is kind of unrelated to HMM, it just shouldn't be using mmu
> notifiers to replace page pinning..
>
>> The issue for AMD is that they already update their GPU page table
>> using DMA engine. I believe this is still doable if they use a
>> kernel only DMA engine context, where only kernel can queue up jobs
>> so that you do not need to wait for unrelated things and you can
>> prioritize GPU page table update which should translate in fast GPU
>> page table update without DMA fence.
> Make sense
>
> I'm not sure I saw this in the AMD hmm stuff - it would be good if
> someone would look at that. Every time I do it looks like the locking
> is wrong.
>
> Jason
> _______________________________________________
> amd-gfx mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

2020-06-20 04:37:54

by Jerome Glisse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
>
> Am 2020-06-19 um 3:11 p.m. schrieb Alex Deucher:
> > On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse <[email protected]> wrote:
> >> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> >>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> >>>
> >>>> The madness is only that device B's mmu notifier might need to wait
> >>>> for fence_B so that the dma operation finishes. Which in turn has to
> >>>> wait for device A to finish first.
> >>> So, it sound, fundamentally you've got this graph of operations across
> >>> an unknown set of drivers and the kernel cannot insert itself in
> >>> dma_fence hand offs to re-validate any of the buffers involved?
> >>> Buffers which by definition cannot be touched by the hardware yet.
> >>>
> >>> That really is a pretty horrible place to end up..
> >>>
> >>> Pinning really is right answer for this kind of work flow. I think
> >>> converting pinning to notifers should not be done unless notifier
> >>> invalidation is relatively bounded.
> >>>
> >>> I know people like notifiers because they give a bit nicer performance
> >>> in some happy cases, but this cripples all the bad cases..
> >>>
> >>> If pinning doesn't work for some reason maybe we should address that?
> >> Note that the dma fence is only true for user ptr buffer which predate
> >> any HMM work and thus were using mmu notifier already. You need the
> >> mmu notifier there because of fork and other corner cases.
> >>
> >> For nouveau the notifier do not need to wait for anything it can update
> >> the GPU page table right away. Modulo needing to write to GPU memory
> >> using dma engine if the GPU page table is in GPU memory that is not
> >> accessible from the CPU but that's never the case for nouveau so far
> >> (but i expect it will be at one point).
> >>
> >>
> >> So i see this as 2 different cases, the user ptr case, which does pin
> >> pages by the way, where things are synchronous. Versus the HMM cases
> >> where everything is asynchronous.
> >>
> >>
> >> I probably need to warn AMD folks again that using HMM means that you
> >> must be able to update the GPU page table asynchronously without
> >> fence wait. The issue for AMD is that they already update their GPU
> >> page table using DMA engine. I believe this is still doable if they
> >> use a kernel only DMA engine context, where only kernel can queue up
> >> jobs so that you do not need to wait for unrelated things and you can
> >> prioritize GPU page table update which should translate in fast GPU
> >> page table update without DMA fence.
> > All devices which support recoverable page faults also have a
> > dedicated paging engine for the kernel driver which the driver already
> > makes use of. We can also update the GPU page tables with the CPU.
>
> We have a potential problem with CPU updating page tables while the GPU
> is retrying on page table entries because 64 bit CPU transactions don't
> arrive in device memory atomically.
>
> We are using SDMA for page table updates. This currently goes through a
> the DRM GPU scheduler to a special SDMA queue that's used by kernel-mode
> only. But since it's based on the DRM GPU scheduler, we do use dma-fence
> to wait for completion.

Yeah my worry is mostly that some cross dma fence leak into it but
it should never happen realy, maybe there is a way to catch if it
does and print a warning.

So yes you can use dma fence, as long as they do not have cross-dep.
Another expectation is that they complete quickly and usualy page
table update do.

Cheers,
J?r?me

2020-06-20 04:38:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
> Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
> > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> >> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> >>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> >>>
> >>>> The madness is only that device B's mmu notifier might need to wait
> >>>> for fence_B so that the dma operation finishes. Which in turn has to
> >>>> wait for device A to finish first.
> >>> So, it sound, fundamentally you've got this graph of operations across
> >>> an unknown set of drivers and the kernel cannot insert itself in
> >>> dma_fence hand offs to re-validate any of the buffers involved?
> >>> Buffers which by definition cannot be touched by the hardware yet.
> >>>
> >>> That really is a pretty horrible place to end up..
> >>>
> >>> Pinning really is right answer for this kind of work flow. I think
> >>> converting pinning to notifers should not be done unless notifier
> >>> invalidation is relatively bounded.
> >>>
> >>> I know people like notifiers because they give a bit nicer performance
> >>> in some happy cases, but this cripples all the bad cases..
> >>>
> >>> If pinning doesn't work for some reason maybe we should address that?
> >> Note that the dma fence is only true for user ptr buffer which predate
> >> any HMM work and thus were using mmu notifier already. You need the
> >> mmu notifier there because of fork and other corner cases.
> > I wonder if we should try to fix the fork case more directly - RDMA
> > has this same problem and added MADV_DONTFORK a long time ago as a
> > hacky way to deal with it.
> >
> > Some crazy page pin that resolved COW in a way that always kept the
> > physical memory with the mm that initiated the pin?
> >
> > (isn't this broken for O_DIRECT as well anyhow?)
> >
> > How does mmu_notifiers help the fork case anyhow? Block fork from
> > progressing?
>
> How much the mmu_notifier blocks fork progress depends, on quickly we
> can preempt GPU jobs accessing affected memory. If we don't have
> fine-grained preemption capability (graphics), the best we can do is
> wait for the GPU jobs to complete. We can also delay submission of new
> GPU jobs to the same memory until the MMU notifier is done. Future jobs
> would use the new page addresses.
>
> With fine-grained preemption (ROCm compute), we can preempt GPU work on
> the affected adders space to minimize the delay seen by fork.
>
> With recoverable device page faults, we can invalidate GPU page table
> entries, so device access to the affected pages stops immediately.
>
> In all cases, the end result is, that the device page table gets updated
> with the address of the copied pages before the GPU accesses the COW
> memory again.Without the MMU notifier, we'd end up with the GPU
> corrupting memory of the other process.

The model here in fork has been wrong for a long time, and I do wonder
how O_DIRECT manages to not be broken too.. I guess the time windows
there are too small to get unlucky.

If you have a write pin on a page then it should not be COW'd into the
fork'd process but copied with the originating page remaining with the
original mm.

I wonder if there is some easy way to achive that - if that is the
main reason to use notifiers then it would be a better solution.

Jason

2020-06-20 04:39:42

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
> We have a potential problem with CPU updating page tables while the GPU
> is retrying on page table entries because 64 bit CPU transactions don't
> arrive in device memory atomically.

Except for 32 bit platforms atomicity is guarenteed if you use uncached
writeq() to aligned addresses..

The linux driver model breaks of the writeX() stuff is not atomic.

Jason

2020-06-20 04:41:42

by Felix Kuehling

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations


Am 2020-06-19 um 3:55 p.m. schrieb Jason Gunthorpe:
> On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
>> Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
>>> On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
>>>> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
>>>>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
>>>>>
>>>>>> The madness is only that device B's mmu notifier might need to wait
>>>>>> for fence_B so that the dma operation finishes. Which in turn has to
>>>>>> wait for device A to finish first.
>>>>> So, it sound, fundamentally you've got this graph of operations across
>>>>> an unknown set of drivers and the kernel cannot insert itself in
>>>>> dma_fence hand offs to re-validate any of the buffers involved?
>>>>> Buffers which by definition cannot be touched by the hardware yet.
>>>>>
>>>>> That really is a pretty horrible place to end up..
>>>>>
>>>>> Pinning really is right answer for this kind of work flow. I think
>>>>> converting pinning to notifers should not be done unless notifier
>>>>> invalidation is relatively bounded.
>>>>>
>>>>> I know people like notifiers because they give a bit nicer performance
>>>>> in some happy cases, but this cripples all the bad cases..
>>>>>
>>>>> If pinning doesn't work for some reason maybe we should address that?
>>>> Note that the dma fence is only true for user ptr buffer which predate
>>>> any HMM work and thus were using mmu notifier already. You need the
>>>> mmu notifier there because of fork and other corner cases.
>>> I wonder if we should try to fix the fork case more directly - RDMA
>>> has this same problem and added MADV_DONTFORK a long time ago as a
>>> hacky way to deal with it.
>>>
>>> Some crazy page pin that resolved COW in a way that always kept the
>>> physical memory with the mm that initiated the pin?
>>>
>>> (isn't this broken for O_DIRECT as well anyhow?)
>>>
>>> How does mmu_notifiers help the fork case anyhow? Block fork from
>>> progressing?
>> How much the mmu_notifier blocks fork progress depends, on quickly we
>> can preempt GPU jobs accessing affected memory. If we don't have
>> fine-grained preemption capability (graphics), the best we can do is
>> wait for the GPU jobs to complete. We can also delay submission of new
>> GPU jobs to the same memory until the MMU notifier is done. Future jobs
>> would use the new page addresses.
>>
>> With fine-grained preemption (ROCm compute), we can preempt GPU work on
>> the affected adders space to minimize the delay seen by fork.
>>
>> With recoverable device page faults, we can invalidate GPU page table
>> entries, so device access to the affected pages stops immediately.
>>
>> In all cases, the end result is, that the device page table gets updated
>> with the address of the copied pages before the GPU accesses the COW
>> memory again.Without the MMU notifier, we'd end up with the GPU
>> corrupting memory of the other process.
> The model here in fork has been wrong for a long time, and I do wonder
> how O_DIRECT manages to not be broken too.. I guess the time windows
> there are too small to get unlucky.
>
> If you have a write pin on a page then it should not be COW'd into the
> fork'd process but copied with the originating page remaining with the
> original mm.
>
> I wonder if there is some easy way to achive that - if that is the
> main reason to use notifiers then it would be a better solution.

Other than the application changing its own virtual address mappings
(mprotect, munmap, etc.), triggering MMU notifiers, we also get MMU
notifiers from THP worker threads, and NUMA balancing.

When we start doing migration to DEVICE_PRIVATE memory with HMM, we also
get MMU notifiers during those driver-initiated migrations.

Regards,
  Felix


>
> Jason

2020-06-20 04:42:14

by Jerome Glisse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > >
> > > > The madness is only that device B's mmu notifier might need to wait
> > > > for fence_B so that the dma operation finishes. Which in turn has to
> > > > wait for device A to finish first.
> > >
> > > So, it sound, fundamentally you've got this graph of operations across
> > > an unknown set of drivers and the kernel cannot insert itself in
> > > dma_fence hand offs to re-validate any of the buffers involved?
> > > Buffers which by definition cannot be touched by the hardware yet.
> > >
> > > That really is a pretty horrible place to end up..
> > >
> > > Pinning really is right answer for this kind of work flow. I think
> > > converting pinning to notifers should not be done unless notifier
> > > invalidation is relatively bounded.
> > >
> > > I know people like notifiers because they give a bit nicer performance
> > > in some happy cases, but this cripples all the bad cases..
> > >
> > > If pinning doesn't work for some reason maybe we should address that?
> >
> > Note that the dma fence is only true for user ptr buffer which predate
> > any HMM work and thus were using mmu notifier already. You need the
> > mmu notifier there because of fork and other corner cases.
>
> I wonder if we should try to fix the fork case more directly - RDMA
> has this same problem and added MADV_DONTFORK a long time ago as a
> hacky way to deal with it.
>
> Some crazy page pin that resolved COW in a way that always kept the
> physical memory with the mm that initiated the pin?

Just no way to deal with it easily, i thought about forcing the
anon_vma (page->mapping for anonymous page) to the anon_vma that
belongs to the vma against which the GUP was done but it would
break things if page is already in other branch of a fork tree.
Also this forbid fast GUP.

Quite frankly the fork was not the main motivating factor. GPU
can pin potentialy GBytes of memory thus we wanted to be able
to release it but since Michal changes to reclaim code this is
no longer effective.

User buffer should never end up in those weird corner case, iirc
the first usage was for xorg exa texture upload, then generalize
to texture upload in mesa and latter on to more upload cases
(vertices, ...). At least this is what i remember today. So in
those cases we do not expect fork, splice, mremap, mprotect, ...

Maybe we can audit how user ptr buffer are use today and see if
we can define a usage pattern that would allow to cut corner in
kernel. For instance we could use mmu notifier just to block CPU
pte update while we do GUP and thus never wait on dma fence.

Then GPU driver just keep the GUP pin around until they are done
with the page. They can also use the mmu notifier to keep a flag
so that the driver know if it needs to redo a GUP ie:

The notifier path:
GPU_mmu_notifier_start_callback(range)
gpu_lock_cpu_pagetable(range)
for_each_bo_in(bo, range) {
bo->need_gup = true;
}
gpu_unlock_cpu_pagetable(range)

GPU_validate_buffer_pages(bo)
if (!bo->need_gup)
return;
put_pages(bo->pages);
range = bo_vaddr_range(bo)
gpu_lock_cpu_pagetable(range)
GUP(bo->pages, range)
gpu_unlock_cpu_pagetable(range)


Depending on how user_ptr are use today this could work.


> (isn't this broken for O_DIRECT as well anyhow?)

Yes it can in theory, if you have an application that does O_DIRECT
and fork concurrently (ie O_DIRECT in one thread and fork in another).
Note that O_DIRECT after fork is fine, it is an issue only if GUP_fast
was able to lookup a page with write permission before fork had the
chance to update it to read only for COW.

But doing O_DIRECT (or anything that use GUP fast) in one thread and
fork in another is inherently broken ie there is no way to fix it.

See 17839856fd588f4ab6b789f482ed3ffd7c403e1f

>
> How does mmu_notifiers help the fork case anyhow? Block fork from
> progressing?

It enforce ordering between fork and GUP, if fork is first it blocks
GUP and if forks is last then fork waits on GUP and then user buffer
get invalidated.

>
> > I probably need to warn AMD folks again that using HMM means that you
> > must be able to update the GPU page table asynchronously without
> > fence wait.
>
> It is kind of unrelated to HMM, it just shouldn't be using mmu
> notifiers to replace page pinning..

Well my POV is that if you abide by rules HMM defined then you do
not need to pin pages. The rule is asynchronous device page table
update.

Pinning pages is problematic it blocks many core mm features and
it is just bad all around. Also it is inherently broken in front
of fork/mremap/splice/...

Cheers,
J?r?me

2020-06-20 04:42:33

by Jerome Glisse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 04:55:38PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
> > Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
> > > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > >> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > >>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > >>>
> > >>>> The madness is only that device B's mmu notifier might need to wait
> > >>>> for fence_B so that the dma operation finishes. Which in turn has to
> > >>>> wait for device A to finish first.
> > >>> So, it sound, fundamentally you've got this graph of operations across
> > >>> an unknown set of drivers and the kernel cannot insert itself in
> > >>> dma_fence hand offs to re-validate any of the buffers involved?
> > >>> Buffers which by definition cannot be touched by the hardware yet.
> > >>>
> > >>> That really is a pretty horrible place to end up..
> > >>>
> > >>> Pinning really is right answer for this kind of work flow. I think
> > >>> converting pinning to notifers should not be done unless notifier
> > >>> invalidation is relatively bounded.
> > >>>
> > >>> I know people like notifiers because they give a bit nicer performance
> > >>> in some happy cases, but this cripples all the bad cases..
> > >>>
> > >>> If pinning doesn't work for some reason maybe we should address that?
> > >> Note that the dma fence is only true for user ptr buffer which predate
> > >> any HMM work and thus were using mmu notifier already. You need the
> > >> mmu notifier there because of fork and other corner cases.
> > > I wonder if we should try to fix the fork case more directly - RDMA
> > > has this same problem and added MADV_DONTFORK a long time ago as a
> > > hacky way to deal with it.
> > >
> > > Some crazy page pin that resolved COW in a way that always kept the
> > > physical memory with the mm that initiated the pin?
> > >
> > > (isn't this broken for O_DIRECT as well anyhow?)
> > >
> > > How does mmu_notifiers help the fork case anyhow? Block fork from
> > > progressing?
> >
> > How much the mmu_notifier blocks fork progress depends, on quickly we
> > can preempt GPU jobs accessing affected memory. If we don't have
> > fine-grained preemption capability (graphics), the best we can do is
> > wait for the GPU jobs to complete. We can also delay submission of new
> > GPU jobs to the same memory until the MMU notifier is done. Future jobs
> > would use the new page addresses.
> >
> > With fine-grained preemption (ROCm compute), we can preempt GPU work on
> > the affected adders space to minimize the delay seen by fork.
> >
> > With recoverable device page faults, we can invalidate GPU page table
> > entries, so device access to the affected pages stops immediately.
> >
> > In all cases, the end result is, that the device page table gets updated
> > with the address of the copied pages before the GPU accesses the COW
> > memory again.Without the MMU notifier, we'd end up with the GPU
> > corrupting memory of the other process.
>
> The model here in fork has been wrong for a long time, and I do wonder
> how O_DIRECT manages to not be broken too.. I guess the time windows
> there are too small to get unlucky.

This was discuss extensively in the GUP works John have been doing.
Yes O_DIRECT can potentialy break but only if you are writting to
COW pages and you initiated the O_DIRECT right before the fork and
GUP happen before fork was able to write protect the pages.

If you O_DIRECT but use memory as input ie you are writting the
memory to the file not reading from the file. Then fork is harmless
as you are just reading memory. You can still face the COW uncertainty
(the process against which you did the O_DIRECT get "new" pages but your
O_DIRECT goes on with the "old" pages) but doing O_DIRECT and fork
concurently is asking for trouble.

>
> If you have a write pin on a page then it should not be COW'd into the
> fork'd process but copied with the originating page remaining with the
> original mm.
>
> I wonder if there is some easy way to achive that - if that is the
> main reason to use notifiers then it would be a better solution.

Not doable as page refcount can change for things unrelated to GUP, with
John changes we can identify GUP and we could potentialy copy GUPed page
instead of COW but this can potentialy slow down fork() and i am not sure
how acceptable this would be. Also this does not solve GUP against page
that are already in fork tree ie page P0 is in process A which forks,
we now have page P0 in process A and B. Now we have process A which forks
again and we have page P0 in A, B, and C. Here B and C are two branches
with root in A. B and/or C can keep forking and grow the fork tree.

Now if read only GUP on P0 happens in C (or B everything is symetrical in
respect to root A) then P0 might not be the page that is in C after the
GUP ie if something in C write to the virtual address corresponding to P0
then a new page might get allocated and the virtual address will no longer
point to P0 for C.

Semantic was change with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to some
what "fix" that but GUP fast is still succeptible to this.

Note that above commit only address the GUP after/while forking. GUP
before fork() need mmu notifier (or forcing page copy instead of COW).

Cheers,
J?r?me

2020-06-20 04:45:28

by Daniel Vetter

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 10:10 PM Jerome Glisse <[email protected]> wrote:
>
> On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > > >
> > > > > The madness is only that device B's mmu notifier might need to wait
> > > > > for fence_B so that the dma operation finishes. Which in turn has to
> > > > > wait for device A to finish first.
> > > >
> > > > So, it sound, fundamentally you've got this graph of operations across
> > > > an unknown set of drivers and the kernel cannot insert itself in
> > > > dma_fence hand offs to re-validate any of the buffers involved?
> > > > Buffers which by definition cannot be touched by the hardware yet.
> > > >
> > > > That really is a pretty horrible place to end up..
> > > >
> > > > Pinning really is right answer for this kind of work flow. I think
> > > > converting pinning to notifers should not be done unless notifier
> > > > invalidation is relatively bounded.
> > > >
> > > > I know people like notifiers because they give a bit nicer performance
> > > > in some happy cases, but this cripples all the bad cases..
> > > >
> > > > If pinning doesn't work for some reason maybe we should address that?
> > >
> > > Note that the dma fence is only true for user ptr buffer which predate
> > > any HMM work and thus were using mmu notifier already. You need the
> > > mmu notifier there because of fork and other corner cases.
> >
> > I wonder if we should try to fix the fork case more directly - RDMA
> > has this same problem and added MADV_DONTFORK a long time ago as a
> > hacky way to deal with it.
> >
> > Some crazy page pin that resolved COW in a way that always kept the
> > physical memory with the mm that initiated the pin?
>
> Just no way to deal with it easily, i thought about forcing the
> anon_vma (page->mapping for anonymous page) to the anon_vma that
> belongs to the vma against which the GUP was done but it would
> break things if page is already in other branch of a fork tree.
> Also this forbid fast GUP.
>
> Quite frankly the fork was not the main motivating factor. GPU
> can pin potentialy GBytes of memory thus we wanted to be able
> to release it but since Michal changes to reclaim code this is
> no longer effective.

What where how? My patch to annote reclaim paths with mmu notifier
possibility just landed in -mm, so if direct reclaim can't reclaim mmu
notifier'ed stuff anymore we need to know.

Also this would resolve the entire pain we're discussing in this
thread about dma_fence_wait deadlocking against anything that's not
GFP_ATOMIC ...
-Daniel

>
> User buffer should never end up in those weird corner case, iirc
> the first usage was for xorg exa texture upload, then generalize
> to texture upload in mesa and latter on to more upload cases
> (vertices, ...). At least this is what i remember today. So in
> those cases we do not expect fork, splice, mremap, mprotect, ...
>
> Maybe we can audit how user ptr buffer are use today and see if
> we can define a usage pattern that would allow to cut corner in
> kernel. For instance we could use mmu notifier just to block CPU
> pte update while we do GUP and thus never wait on dma fence.
>
> Then GPU driver just keep the GUP pin around until they are done
> with the page. They can also use the mmu notifier to keep a flag
> so that the driver know if it needs to redo a GUP ie:
>
> The notifier path:
> GPU_mmu_notifier_start_callback(range)
> gpu_lock_cpu_pagetable(range)
> for_each_bo_in(bo, range) {
> bo->need_gup = true;
> }
> gpu_unlock_cpu_pagetable(range)
>
> GPU_validate_buffer_pages(bo)
> if (!bo->need_gup)
> return;
> put_pages(bo->pages);
> range = bo_vaddr_range(bo)
> gpu_lock_cpu_pagetable(range)
> GUP(bo->pages, range)
> gpu_unlock_cpu_pagetable(range)
>
>
> Depending on how user_ptr are use today this could work.
>
>
> > (isn't this broken for O_DIRECT as well anyhow?)
>
> Yes it can in theory, if you have an application that does O_DIRECT
> and fork concurrently (ie O_DIRECT in one thread and fork in another).
> Note that O_DIRECT after fork is fine, it is an issue only if GUP_fast
> was able to lookup a page with write permission before fork had the
> chance to update it to read only for COW.
>
> But doing O_DIRECT (or anything that use GUP fast) in one thread and
> fork in another is inherently broken ie there is no way to fix it.
>
> See 17839856fd588f4ab6b789f482ed3ffd7c403e1f
>
> >
> > How does mmu_notifiers help the fork case anyhow? Block fork from
> > progressing?
>
> It enforce ordering between fork and GUP, if fork is first it blocks
> GUP and if forks is last then fork waits on GUP and then user buffer
> get invalidated.
>
> >
> > > I probably need to warn AMD folks again that using HMM means that you
> > > must be able to update the GPU page table asynchronously without
> > > fence wait.
> >
> > It is kind of unrelated to HMM, it just shouldn't be using mmu
> > notifiers to replace page pinning..
>
> Well my POV is that if you abide by rules HMM defined then you do
> not need to pin pages. The rule is asynchronous device page table
> update.
>
> Pinning pages is problematic it blocks many core mm features and
> it is just bad all around. Also it is inherently broken in front
> of fork/mremap/splice/...
>
> Cheers,
> Jérôme
>


--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2020-06-20 04:45:39

by Jerome Glisse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 10:43:20PM +0200, Daniel Vetter wrote:
> On Fri, Jun 19, 2020 at 10:10 PM Jerome Glisse <[email protected]> wrote:
> >
> > On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > > > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > > > >
> > > > > > The madness is only that device B's mmu notifier might need to wait
> > > > > > for fence_B so that the dma operation finishes. Which in turn has to
> > > > > > wait for device A to finish first.
> > > > >
> > > > > So, it sound, fundamentally you've got this graph of operations across
> > > > > an unknown set of drivers and the kernel cannot insert itself in
> > > > > dma_fence hand offs to re-validate any of the buffers involved?
> > > > > Buffers which by definition cannot be touched by the hardware yet.
> > > > >
> > > > > That really is a pretty horrible place to end up..
> > > > >
> > > > > Pinning really is right answer for this kind of work flow. I think
> > > > > converting pinning to notifers should not be done unless notifier
> > > > > invalidation is relatively bounded.
> > > > >
> > > > > I know people like notifiers because they give a bit nicer performance
> > > > > in some happy cases, but this cripples all the bad cases..
> > > > >
> > > > > If pinning doesn't work for some reason maybe we should address that?
> > > >
> > > > Note that the dma fence is only true for user ptr buffer which predate
> > > > any HMM work and thus were using mmu notifier already. You need the
> > > > mmu notifier there because of fork and other corner cases.
> > >
> > > I wonder if we should try to fix the fork case more directly - RDMA
> > > has this same problem and added MADV_DONTFORK a long time ago as a
> > > hacky way to deal with it.
> > >
> > > Some crazy page pin that resolved COW in a way that always kept the
> > > physical memory with the mm that initiated the pin?
> >
> > Just no way to deal with it easily, i thought about forcing the
> > anon_vma (page->mapping for anonymous page) to the anon_vma that
> > belongs to the vma against which the GUP was done but it would
> > break things if page is already in other branch of a fork tree.
> > Also this forbid fast GUP.
> >
> > Quite frankly the fork was not the main motivating factor. GPU
> > can pin potentialy GBytes of memory thus we wanted to be able
> > to release it but since Michal changes to reclaim code this is
> > no longer effective.
>
> What where how? My patch to annote reclaim paths with mmu notifier
> possibility just landed in -mm, so if direct reclaim can't reclaim mmu
> notifier'ed stuff anymore we need to know.
>
> Also this would resolve the entire pain we're discussing in this
> thread about dma_fence_wait deadlocking against anything that's not
> GFP_ATOMIC ...

Sorry my bad, reclaim still works, only oom skip. It was couple
years ago and i thought that some of the things discuss while
back did make it upstream.

It is probably a good time to also point out that what i wanted
to do is have all the mmu notifier callback provide some kind
of fence (not dma fence) so that we can split the notification
into step:
A- schedule notification on all devices/system get fences
this step should minimize lock dependency and should
not have to wait for anything also best if you can avoid
memory allocation for instance by pre-allocating what
you need for notification.
B- mm can do things like unmap but can not map new page
so write special swap pte to cpu page table
C- wait on each fences from A
... resume old code ie replace pte or finish unmap ...

The idea here is that at step C the core mm can decide to back
off if any fence returned from A have to wait. This means that
every device is invalidating for nothing but if we get there
then it might still be a good thing as next time around maybe
the kernel would be successfull without a wait.

This would allow things like reclaim to make forward progress
and skip over or limit wait time to given timeout.

Also I thought to extend this even to multi-cpu tlb flush so
that device and CPUs follow same pattern and we can make //
progress on each.


Getting to such scheme is a lot of work. My plan was to first
get the fence as part of the notifier user API and hide it from
mm inside notifier common code. Then update each core mm path to
new model and see if there is any benefit from it. Reclaim would
be first candidate.

Cheers,
J?r?me

2020-06-22 11:48:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
> Not doable as page refcount can change for things unrelated to GUP, with
> John changes we can identify GUP and we could potentialy copy GUPed page
> instead of COW but this can potentialy slow down fork() and i am not sure
> how acceptable this would be. Also this does not solve GUP against page
> that are already in fork tree ie page P0 is in process A which forks,
> we now have page P0 in process A and B. Now we have process A which forks
> again and we have page P0 in A, B, and C. Here B and C are two branches
> with root in A. B and/or C can keep forking and grow the fork tree.

For a long time now RDMA has broken COW pages when creating user DMA
regions.

The problem has been that fork re-COW's regions that had their COW
broken.

So, if you break the COW upon mapping and prevent fork (and others)
from copying DMA pinned then you'd cover the cases.

> Semantic was change with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to some
> what "fix" that but GUP fast is still succeptible to this.

Ah, so everyone breaks the COW now, not just RDMA..

What do you mean 'GUP fast is still succeptible to this' ?

Jason

2020-06-22 20:18:01

by Jerome Glisse

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Mon, Jun 22, 2020 at 08:46:17AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
> > Not doable as page refcount can change for things unrelated to GUP, with
> > John changes we can identify GUP and we could potentialy copy GUPed page
> > instead of COW but this can potentialy slow down fork() and i am not sure
> > how acceptable this would be. Also this does not solve GUP against page
> > that are already in fork tree ie page P0 is in process A which forks,
> > we now have page P0 in process A and B. Now we have process A which forks
> > again and we have page P0 in A, B, and C. Here B and C are two branches
> > with root in A. B and/or C can keep forking and grow the fork tree.
>
> For a long time now RDMA has broken COW pages when creating user DMA
> regions.
>
> The problem has been that fork re-COW's regions that had their COW
> broken.
>
> So, if you break the COW upon mapping and prevent fork (and others)
> from copying DMA pinned then you'd cover the cases.

I am not sure we want to prevent COW for pinned GUP pages, this would
change current semantic and potentialy break/slow down existing apps.

Anyway i think we focus too much on fork/COW, it is just an unfixable
broken corner cases, mmu notifier allows you to avoid it. Forcing real
copy on fork would likely be seen as regression by most people.


> > Semantic was change with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to some
> > what "fix" that but GUP fast is still succeptible to this.
>
> Ah, so everyone breaks the COW now, not just RDMA..
>
> What do you mean 'GUP fast is still succeptible to this' ?

Not all GUP fast path are updated (intentionaly) __get_user_pages_fast()
for instance still keeps COW intact. People using GUP should really knows
what they are doing.

Cheers,
J?r?me

2020-06-23 00:06:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Mon, Jun 22, 2020 at 04:15:40PM -0400, Jerome Glisse wrote:
> On Mon, Jun 22, 2020 at 08:46:17AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
> > > Not doable as page refcount can change for things unrelated to GUP, with
> > > John changes we can identify GUP and we could potentialy copy GUPed page
> > > instead of COW but this can potentialy slow down fork() and i am not sure
> > > how acceptable this would be. Also this does not solve GUP against page
> > > that are already in fork tree ie page P0 is in process A which forks,
> > > we now have page P0 in process A and B. Now we have process A which forks
> > > again and we have page P0 in A, B, and C. Here B and C are two branches
> > > with root in A. B and/or C can keep forking and grow the fork tree.
> >
> > For a long time now RDMA has broken COW pages when creating user DMA
> > regions.
> >
> > The problem has been that fork re-COW's regions that had their COW
> > broken.
> >
> > So, if you break the COW upon mapping and prevent fork (and others)
> > from copying DMA pinned then you'd cover the cases.
>
> I am not sure we want to prevent COW for pinned GUP pages, this would
> change current semantic and potentialy break/slow down existing apps.

Isn't that basically exactly what 17839856fd588 does? It looks like it
uses the same approach RDMA does by sticking FOLL_WRITE even though it
is a read action.

After that change the reamining bug is that fork can re-establish a
COW./

> Anyway i think we focus too much on fork/COW, it is just an unfixable
> broken corner cases, mmu notifier allows you to avoid it. Forcing real
> copy on fork would likely be seen as regression by most people.

If you don't copy the there are data corruption bugs though. Real apps
probably don't hit a problem here as they are not forking while GUP's
are active (RDMA excluded, which does do this)

I think that implementing page pinning by blocking mmu notifiers for
the duration of the pin is a particularly good idea either, that
actually seems a lot worse than just having the pin in the first
place.

Particularly if it is only being done to avoid corner case bugs that
already afflict other GUP cases :(

> > What do you mean 'GUP fast is still succeptible to this' ?
>
> Not all GUP fast path are updated (intentionaly) __get_user_pages_fast()

Sure, that is is the 'raw' accessor

Jason

2020-06-23 02:03:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

On Fri, Jun 19, 2020 at 04:10:11PM -0400, Jerome Glisse wrote:

> Maybe we can audit how user ptr buffer are use today and see if
> we can define a usage pattern that would allow to cut corner in
> kernel. For instance we could use mmu notifier just to block CPU
> pte update while we do GUP and thus never wait on dma fence.

The DMA fence is the main problem, if you can think of a way to avoid
it then it would be great!

> Then GPU driver just keep the GUP pin around until they are done
> with the page. They can also use the mmu notifier to keep a flag
> so that the driver know if it needs to redo a GUP ie:
>
> The notifier path:
> GPU_mmu_notifier_start_callback(range)
> gpu_lock_cpu_pagetable(range)
> for_each_bo_in(bo, range) {
> bo->need_gup = true;
> }
> gpu_unlock_cpu_pagetable(range)

So some kind of invalidation tracking? But this doesn't solve COW and
Fork problem?

> > It is kind of unrelated to HMM, it just shouldn't be using mmu
> > notifiers to replace page pinning..
>
> Well my POV is that if you abide by rules HMM defined then you do
> not need to pin pages. The rule is asynchronous device page table
> update.

I think one of the hmm rules is to not block notifiers for a long
time, which these scheme seem to violate already.

Pinning for a long time is less bad than blocing notifiers for a long
time, IMHO

Jason