2016-11-21 20:36:33

by Deucher, Alexander

Subject: Enabling peer to peer device transactions for PCIe devices

This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward. Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory. Also in cases where both devices are behind a switch, it avoids the CPU entirely. Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based. Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.

Some use cases:
1. Storage devices streaming directly to GPU device memory
2. GPU device memory to GPU device memory streaming
3. DVB/V4L/SDI devices streaming directly to GPU device memory
4. DVB/V4L/SDI devices streaming directly to storage devices

Here is a relatively simple example of how this could work for testing. This is obviously not a complete solution.
- Device memory will be registered with the Linux memory sub-system by creating corresponding struct page structures for device memory
- get_user_pages_fast() will return corresponding struct pages when CPU address points to the device memory
- put_page() will deal with struct pages for device memory
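
The three steps above can be pictured with a small userspace toy model. This is only a sketch and not kernel code: `toy_page`, `toy_region`, `toy_register_device_memory()`, `toy_get_page()` and `toy_put_page()` are hypothetical names that stand in for struct page, the registered device range, get_user_pages_fast() and put_page(); the real functions do far more (page tables, IOMMU, locking).

```c
/* Userspace toy model only; every name below is hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  ((uintptr_t)1 << TOY_PAGE_SHIFT)

struct toy_page {
    int refcount;   /* outstanding references to this page */
    int is_device;  /* 1 if the page is backed by device memory */
};

struct toy_region {
    uintptr_t base;          /* first address covered by the region */
    size_t npages;           /* number of pages in the region */
    struct toy_page *pages;  /* one descriptor per page */
};

/* Step 1: create page descriptors for a device memory range. */
int toy_register_device_memory(struct toy_region *r, uintptr_t base,
                               size_t npages)
{
    r->pages = calloc(npages, sizeof(*r->pages));
    if (!r->pages)
        return -1;
    for (size_t i = 0; i < npages; i++)
        r->pages[i].is_device = 1;
    r->base = base;
    r->npages = npages;
    return 0;
}

/* Step 2: translate an address to its descriptor and take a reference. */
struct toy_page *toy_get_page(struct toy_region *r, uintptr_t addr)
{
    if (addr < r->base)
        return NULL;
    size_t idx = (size_t)((addr - r->base) >> TOY_PAGE_SHIFT);
    if (idx >= r->npages)
        return NULL;
    r->pages[idx].refcount++;
    return &r->pages[idx];
}

/* Step 3: drop the reference taken in step 2. */
void toy_put_page(struct toy_page *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* Tear down the descriptors again. */
void toy_unregister_device_memory(struct toy_region *r)
{
    free(r->pages);
    r->pages = NULL;
    r->npages = 0;
}
```

The point of the model is only that once descriptors exist for the device range, address lookup and reference counting look the same as for ordinary memory.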

Previously proposed solutions and related proposals:
1. P2P DMA
DMA-API/PCI map_peer_resource support for peer-to-peer (http://www.spinics.net/lists/linux-pci/msg44560.html)
Pros: Low impact, already largely reviewed.
Cons: requires explicit support in all drivers that want to support it, doesn't handle S/G in device memory.

2. ZONE_DEVICE IO
Direct I/O and DMA for persistent memory (https://lwn.net/Articles/672457/)
Add support for ZONE_DEVICE IO memory with struct pages. (https://patchwork.kernel.org/patch/8583221/)
Pros: Doesn't waste system memory for ZONE metadata.
Cons: CPU access to ZONE metadata is slow, and the metadata may be lost or corrupted on device reset.

3. DMA-BUF
RDMA subsystem DMA-BUF support (http://www.spinics.net/lists/linux-rdma/msg38748.html)
Pros: uses existing dma-buf interface
Cons: dma-buf is handle based, requires explicit dma-buf support in drivers.

4. iopmem
iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)

5. HMM
Heterogeneous Memory Management (http://lkml.iu.edu/hypermail/linux/kernel/1611.2/02473.html)

6. Some new mmap-like interface that takes a userptr and a length and returns a dma-buf and offset?

Alex



2016-11-22 18:11:56

by Dan Williams

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Nov 21, 2016 at 12:36 PM, Deucher, Alexander
<[email protected]> wrote:
> This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward. Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory. Also in cases where both devices are behind a switch, it avoids the CPU entirely. Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based. Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc. Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
>
> Some use cases:
> 1. Storage devices streaming directly to GPU device memory
> 2. GPU device memory to GPU device memory streaming
> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
> 4. DVB/V4L/SDI devices streaming directly to storage devices
>
> Here is a relatively simple example of how this could work for testing. This is obviously not a complete solution.
> - Device memory will be registered with the Linux memory sub-system by creating corresponding struct page structures for device memory
> - get_user_pages_fast() will return corresponding struct pages when CPU address points to the device memory
> - put_page() will deal with struct pages for device memory
>
[..]
> 4. iopmem
> iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)

The change I suggest for this particular approach is to switch to
"device-DAX" [1]. I.e. a character device for establishing DAX
mappings rather than a block device plus a DAX filesystem. The pro of
this approach is standard user pointers and struct pages rather than a
new construct. The con is that this is done via an interface separate
from the existing gpu and storage device. For example it would require
a /dev/dax instance alongside a /dev/nvme interface, but I don't see
that as a significant blocking concern.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-October/007496.html
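
From userspace, "a character device for establishing DAX mappings" boils down to open() plus a shared mmap(). A minimal sketch follows; `map_shared_range()` and `unmap_range()` are hypothetical helper names, and a path such as /dev/dax0.0 is an assumption about how the device node would be named. A real device-DAX node may also require the length to honor the device's mapping alignment, which this sketch does not check.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes from offset 0 of the character device (or file) at
 * `path` as a shared read/write mapping. Returns NULL on failure.
 */
void *map_shared_range(const char *path, size_t len)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    return p == MAP_FAILED ? NULL : p;
}

/* Remove a mapping created by map_shared_range(). Returns 0 on success. */
int unmap_range(void *p, size_t len)
{
    return munmap(p, len);
}
```

The returned pointer is an ordinary user pointer, which is what makes this approach compatible with pointer-based APIs.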

2016-11-22 20:01:22

by Dan Williams

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
<[email protected]> wrote:
> Dan,
>
> I personally like "device-DAX" idea but my concerns are:
>
> - How well it will co-exists with the DRM infrastructure / implementations
> in part dealing with CPU pointers?

Inside the kernel a device-DAX range is "just memory" in the sense
that you can perform pfn_to_page() on it and issue I/O, but the vma is
not migratable. To be honest I do not know how well that co-exists
with drm infrastructure.

> - How well we will be able to handle case when we need to "move"/"evict"
> memory/data to the new location so CPU pointer should point to the new
> physical location/address
> (and may be not in PCI device memory at all)?

So, device-DAX deliberately avoids support for in-kernel migration or
overcommit. Those cases are left to the core mm or drm. The device-dax
interface is for cases where all that is needed is a direct-mapping to
a statically-allocated physical-address range be it persistent memory
or some other special reserved memory range.

2016-11-22 20:24:37

by Dan Williams

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Tue, Nov 22, 2016 at 12:10 PM, Daniel Vetter <[email protected]> wrote:
> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <[email protected]> wrote:
>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>> <[email protected]> wrote:
>>> I personally like "device-DAX" idea but my concerns are:
>>>
>>> - How well it will co-exists with the DRM infrastructure / implementations
>>> in part dealing with CPU pointers?
>>
>> Inside the kernel a device-DAX range is "just memory" in the sense
>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>> not migratable. To be honest I do not know how well that co-exists
>> with drm infrastructure.
>>
>>> - How well we will be able to handle case when we need to "move"/"evict"
>>> memory/data to the new location so CPU pointer should point to the new
>>> physical location/address
>>> (and may be not in PCI device memory at all)?
>>
>> So, device-DAX deliberately avoids support for in-kernel migration or
>> overcommit. Those cases are left to the core mm or drm. The device-dax
>> interface is for cases where all that is needed is a direct-mapping to
>> a statically-allocated physical-address range be it persistent memory
>> or some other special reserved memory range.
>
> For some of the fancy use-cases (e.g. to be comparable to what HMM can
> pull off) I think we want all the magic in core mm, i.e. migration and
> overcommit. At least that seems to be the very strong drive in all
> general-purpose gpu abstractions and implementations, where memory is
> allocated with malloc, and then mapped/moved into vram/gpu address
> space through some magic, but still visible on both the cpu and gpu
> side in some form. Special device to allocate memory, and not being
> able to migrate stuff around sound like misfeatures from that pov.

Agreed. For general purpose P2P use cases where all you want is
direct-I/O to a memory range that happens to be on a PCIe device then
I think a special device fits the bill. For gpu P2P use cases that
already have migration/overcommit expectations then it is not a good
fit.

2016-11-22 20:30:43

by Daniel Vetter

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <[email protected]> wrote:
> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
> <[email protected]> wrote:
>> I personally like "device-DAX" idea but my concerns are:
>>
>> - How well it will co-exists with the DRM infrastructure / implementations
>> in part dealing with CPU pointers?
>
> Inside the kernel a device-DAX range is "just memory" in the sense
> that you can perform pfn_to_page() on it and issue I/O, but the vma is
> not migratable. To be honest I do not know how well that co-exists
> with drm infrastructure.
>
>> - How well we will be able to handle case when we need to "move"/"evict"
>> memory/data to the new location so CPU pointer should point to the new
>> physical location/address
>> (and may be not in PCI device memory at all)?
>
> So, device-DAX deliberately avoids support for in-kernel migration or
> overcommit. Those cases are left to the core mm or drm. The device-dax
> interface is for cases where all that is needed is a direct-mapping to
> a statically-allocated physical-address range be it persistent memory
> or some other special reserved memory range.

For some of the fancy use-cases (e.g. to be comparable to what HMM can
pull off) I think we want all the magic in core mm, i.e. migration and
overcommit. At least that seems to be the very strong drive in all
general-purpose gpu abstractions and implementations, where memory is
allocated with malloc, and then mapped/moved into vram/gpu address
space through some magic, but still visible on both the cpu and gpu
side in some form. Special device to allocate memory, and not being
able to migrate stuff around sound like misfeatures from that pov.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2016-11-22 20:35:28

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 2016-11-22 03:10 PM, Daniel Vetter wrote:
> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <[email protected]> wrote:
>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>> <[email protected]> wrote:
>>> I personally like "device-DAX" idea but my concerns are:
>>>
>>> - How well it will co-exists with the DRM infrastructure / implementations
>>> in part dealing with CPU pointers?
>> Inside the kernel a device-DAX range is "just memory" in the sense
>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>> not migratable. To be honest I do not know how well that co-exists
>> with drm infrastructure.
>>
>>> - How well we will be able to handle case when we need to "move"/"evict"
>>> memory/data to the new location so CPU pointer should point to the new
>>> physical location/address
>>> (and may be not in PCI device memory at all)?
>> So, device-DAX deliberately avoids support for in-kernel migration or
>> overcommit. Those cases are left to the core mm or drm. The device-dax
>> interface is for cases where all that is needed is a direct-mapping to
>> a statically-allocated physical-address range be it persistent memory
>> or some other special reserved memory range.
> For some of the fancy use-cases (e.g. to be comparable to what HMM can
> pull off) I think we want all the magic in core mm, i.e. migration and
> overcommit. At least that seems to be the very strong drive in all
> general-purpose gpu abstractions and implementations, where memory is
> allocated with malloc, and then mapped/moved into vram/gpu address
> space through some magic,
It is also possible that it is the other way around: memory is requested to
be allocated and should be kept in vram for performance reasons, but due
to a possible overcommit we need, at least temporarily, to "move" such an
allocation to system memory.
> but still visible on both the cpu and gpu
> side in some form. Special device to allocate memory, and not being
> able to migrate stuff around sound like misfeatures from that pov.
> -Daniel

2016-11-22 21:16:26

by Daniel Vetter

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
<[email protected]> wrote:
>
> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>
>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <[email protected]>
>> wrote:
>>>
>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>> <[email protected]> wrote:
>>>>
>>>> I personally like "device-DAX" idea but my concerns are:
>>>>
>>>> - How well it will co-exists with the DRM infrastructure /
>>>> implementations
>>>> in part dealing with CPU pointers?
>>>
>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>> not migratable. To be honest I do not know how well that co-exists
>>> with drm infrastructure.
>>>
>>>> - How well we will be able to handle case when we need to
>>>> "move"/"evict"
>>>> memory/data to the new location so CPU pointer should point to the
>>>> new
>>>> physical location/address
>>>> (and may be not in PCI device memory at all)?
>>>
>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>> interface is for cases where all that is needed is a direct-mapping to
>>> a statically-allocated physical-address range be it persistent memory
>>> or some other special reserved memory range.
>>
>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>> pull off) I think we want all the magic in core mm, i.e. migration and
>> overcommit. At least that seems to be the very strong drive in all
>> general-purpose gpu abstractions and implementations, where memory is
>> allocated with malloc, and then mapped/moved into vram/gpu address
>> space through some magic,
>
> It is possible that there is other way around: memory is requested to be
> allocated and should be kept in vram for performance reason but due
> to possible overcommit case we need at least temporarily to "move" such
> allocation to system memory.

With migration I meant migrating both ways of course. And with stuff
like numactl we can also influence where exactly the malloc'ed memory
is allocated originally, at least if we'd expose the vram range as a
very special numa node that happens to be far away and not hold any
cpu cores.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2016-11-22 21:21:40

by Dan Williams

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <[email protected]> wrote:
> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
> <[email protected]> wrote:
>>
>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>
>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <[email protected]>
>>> wrote:
>>>>
>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>> <[email protected]> wrote:
>>>>>
>>>>> I personally like "device-DAX" idea but my concerns are:
>>>>>
>>>>> - How well it will co-exists with the DRM infrastructure /
>>>>> implementations
>>>>> in part dealing with CPU pointers?
>>>>
>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>>> not migratable. To be honest I do not know how well that co-exists
>>>> with drm infrastructure.
>>>>
>>>>> - How well we will be able to handle case when we need to
>>>>> "move"/"evict"
>>>>> memory/data to the new location so CPU pointer should point to the
>>>>> new
>>>>> physical location/address
>>>>> (and may be not in PCI device memory at all)?
>>>>
>>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>>> interface is for cases where all that is needed is a direct-mapping to
>>>> a statically-allocated physical-address range be it persistent memory
>>>> or some other special reserved memory range.
>>>
>>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>>> pull off) I think we want all the magic in core mm, i.e. migration and
>>> overcommit. At least that seems to be the very strong drive in all
>>> general-purpose gpu abstractions and implementations, where memory is
>>> allocated with malloc, and then mapped/moved into vram/gpu address
>>> space through some magic,
>>
>> It is possible that there is other way around: memory is requested to be
>> allocated and should be kept in vram for performance reason but due
>> to possible overcommit case we need at least temporarily to "move" such
>> allocation to system memory.
>
> With migration I meant migrating both ways of course. And with stuff
> like numactl we can also influence where exactly the malloc'ed memory
> is allocated originally, at least if we'd expose the vram range as a
> very special numa node that happens to be far away and not hold any
> cpu cores.

I don't think we should be using numa distance to reverse engineer a
certain allocation behavior. The latency data should be truthful, but
you're right we'll need a mechanism to keep general purpose
allocations out of that range by default. Btw, strict isolation is
another design point of device-dax, but I think in this case we're
describing something between the two extremes of full isolation and
full compatibility with existing numactl apis.

2016-11-22 22:21:20

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices

> I don't think we should be using numa distance to reverse engineer a
> certain allocation behavior. The latency data should be truthful, but
> you're right we'll need a mechanism to keep general purpose
> allocations out of that range by default.

Just to clarify: are you proposing to use the NUMA API for
such (VRAM) allocations?





2016-11-23 07:49:09

by Daniel Vetter

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <[email protected]> wrote:
> > On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
> > <[email protected]> wrote:
> >>
> >> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
> >>>
> >>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <[email protected]>
> >>> wrote:
> >>>>
> >>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> I personally like "device-DAX" idea but my concerns are:
> >>>>>
> >>>>> - How well it will co-exists with the DRM infrastructure /
> >>>>> implementations
> >>>>> in part dealing with CPU pointers?
> >>>>
> >>>> Inside the kernel a device-DAX range is "just memory" in the sense
> >>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
> >>>> not migratable. To be honest I do not know how well that co-exists
> >>>> with drm infrastructure.
> >>>>
> >>>>> - How well we will be able to handle case when we need to
> >>>>> "move"/"evict"
> >>>>> memory/data to the new location so CPU pointer should point to the
> >>>>> new
> >>>>> physical location/address
> >>>>> (and may be not in PCI device memory at all)?
> >>>>
> >>>> So, device-DAX deliberately avoids support for in-kernel migration or
> >>>> overcommit. Those cases are left to the core mm or drm. The device-dax
> >>>> interface is for cases where all that is needed is a direct-mapping to
> >>>> a statically-allocated physical-address range be it persistent memory
> >>>> or some other special reserved memory range.
> >>>
> >>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
> >>> pull off) I think we want all the magic in core mm, i.e. migration and
> >>> overcommit. At least that seems to be the very strong drive in all
> >>> general-purpose gpu abstractions and implementations, where memory is
> >>> allocated with malloc, and then mapped/moved into vram/gpu address
> >>> space through some magic,
> >>
> >> It is possible that there is other way around: memory is requested to be
> >> allocated and should be kept in vram for performance reason but due
> >> to possible overcommit case we need at least temporarily to "move" such
> >> allocation to system memory.
> >
> > With migration I meant migrating both ways of course. And with stuff
> > like numactl we can also influence where exactly the malloc'ed memory
> > is allocated originally, at least if we'd expose the vram range as a
> > very special numa node that happens to be far away and not hold any
> > cpu cores.
>
> I don't think we should be using numa distance to reverse engineer a
> certain allocation behavior. The latency data should be truthful, but
> you're right we'll need a mechanism to keep general purpose
> allocations out of that range by default. Btw, strict isolation is
> another design point of device-dax, but I think in this case we're
> describing something between the two extremes of full isolation and
> full compatibility with existing numactl apis.

Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
to reuse all the existing allocation policies directly, those won't work.
So at boot-up your default numa policy would exclude any vram nodes.

But I think (as an -mm layman) that numa gives us a lot of the tools and
policy interface that we need to implement what we want for gpus.

Wrt isolation: There's a sliding scale of what different users expect,
from full-auto everything (including migrating pages around if needed) to
full isolation, and all of it seems to be on the table. As long as we keep
vram nodes out of any default allocation numasets, full isolation should be
possible.
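
The policy idea above (vram exposed as nodes, but excluded from the default allocation set unless a caller opts in) can be sketched with plain bitmasks. This is a toy model under stated assumptions, not the kernel's NUMA code: `toy_nodemask_t`, `toy_topology`, `toy_default_policy()`, `toy_bind_policy()` and `toy_pick_node()` are hypothetical names.

```c
/* Toy model of node selection; all names are hypothetical. */
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_nodemask_t; /* bit n set => node n is allowed */

struct toy_topology {
    toy_nodemask_t online; /* every node known to the system */
    toy_nodemask_t vram;   /* subset of `online` that is device memory */
};

/* Default policy: every online node except the vram nodes. */
toy_nodemask_t toy_default_policy(const struct toy_topology *t)
{
    assert((t->vram & ~t->online) == 0); /* vram nodes must be online */
    return t->online & ~t->vram;
}

/* Explicit opt-in: the caller names nodes, which may include vram. */
toy_nodemask_t toy_bind_policy(const struct toy_topology *t,
                               toy_nodemask_t requested)
{
    return requested & t->online;
}

/* Pick the lowest-numbered allowed node, or -1 if none is allowed. */
int toy_pick_node(toy_nodemask_t allowed)
{
    for (int n = 0; n < 64; n++) {
        if (allowed & ((toy_nodemask_t)1 << n))
            return n;
    }
    return -1;
}
```

With nodes 0 and 1 as system memory and node 2 as vram, the default policy never returns node 2, while an explicit bind to node 2 does.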
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2016-11-23 08:51:44

by Christian König

Subject: Re: Enabling peer to peer device transactions for PCIe devices

Am 23.11.2016 um 08:49 schrieb Daniel Vetter:
> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <[email protected]> wrote:
>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
>>> <[email protected]> wrote:
>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <[email protected]>
>>>>> wrote:
>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>>>> <[email protected]> wrote:
>>>>>>> I personally like "device-DAX" idea but my concerns are:
>>>>>>>
>>>>>>> - How well it will co-exists with the DRM infrastructure /
>>>>>>> implementations
>>>>>>> in part dealing with CPU pointers?
>>>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>>>>> not migratable. To be honest I do not know how well that co-exists
>>>>>> with drm infrastructure.
>>>>>>
>>>>>>> - How well we will be able to handle case when we need to
>>>>>>> "move"/"evict"
>>>>>>> memory/data to the new location so CPU pointer should point to the
>>>>>>> new
>>>>>>> physical location/address
>>>>>>> (and may be not in PCI device memory at all)?
>>>>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>>>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>>>>> interface is for cases where all that is needed is a direct-mapping to
>>>>>> a statically-allocated physical-address range be it persistent memory
>>>>>> or some other special reserved memory range.
>>>>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>>>>> pull off) I think we want all the magic in core mm, i.e. migration and
>>>>> overcommit. At least that seems to be the very strong drive in all
>>>>> general-purpose gpu abstractions and implementations, where memory is
>>>>> allocated with malloc, and then mapped/moved into vram/gpu address
>>>>> space through some magic,
>>>> It is possible that there is other way around: memory is requested to be
>>>> allocated and should be kept in vram for performance reason but due
>>>> to possible overcommit case we need at least temporarily to "move" such
>>>> allocation to system memory.
>>> With migration I meant migrating both ways of course. And with stuff
>>> like numactl we can also influence where exactly the malloc'ed memory
>>> is allocated originally, at least if we'd expose the vram range as a
>>> very special numa node that happens to be far away and not hold any
>>> cpu cores.
>> I don't think we should be using numa distance to reverse engineer a
>> certain allocation behavior. The latency data should be truthful, but
>> you're right we'll need a mechanism to keep general purpose
>> allocations out of that range by default. Btw, strict isolation is
>> another design point of device-dax, but I think in this case we're
>> describing something between the two extremes of full isolation and
>> full compatibility with existing numactl apis.
> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
> to reuse all the existing allocation policies directly, those won't work.
> So at boot-up your default numa policy would exclude any vram nodes.
>
> But I think (as an -mm layman) that numa gives us a lot of the tools and
> policy interface that we need to implement what we want for gpus.

Agree completely. From a ten-mile-high view, our GPUs are just command
processors with local memory as well.

Basically this is also the whole idea behind what AMD has been pushing
with HSA for a while.

It's just that a lot of problems start to pop up when you look at all
the nasty details. For example only part of the GPU memory is usually
accessible by the CPU.

So even if numa nodes provide a good foundation for this, I think there
is still a lot of code to write.

BTW: I should probably start to read into the numa code of the kernel.
Any good pointers for that?

Regards,
Christian.

> Wrt isolation: There's a sliding scale of what different users expect,
> from full auto everything, including migrating pages around if needed to
> full isolation all seems to be on the table. As long as we keep vram nodes
> out of any default allocation numasets, full isolation should be possible.
> -Daniel


2016-11-23 17:03:39

by Dave Hansen

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 11/22/2016 11:49 PM, Daniel Vetter wrote:
> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
> to reuse all the existing allocation policies directly, those won't work.
> So at boot-up your default numa policy would exclude any vram nodes.
>
> But I think (as an -mm layman) that numa gives us a lot of the tools and
> policy interface that we need to implement what we want for gpus.

Are you suggesting creating NUMA nodes for video RAM (I assume that's
what you mean by vram) where that RAM is not at all CPU-accessible?

2016-11-23 17:13:13

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

Hey,

On 22/11/16 11:59 AM, Serguei Sagalovitch wrote:
> - How well we will be able to handle case when we need to "move"/"evict"
> memory/data to the new location so CPU pointer should point to the
> new physical location/address
> (and may be not in PCI device memory at all)?

IMO any memory that has been registered for a P2P transaction should be
locked from being evicted. So if there's a get_user_pages call it needs
to be pinned until the put_page. The main issue being with the RDMA
case: handling an eviction when a chunk of memory has been registered as
an MR would be very tricky. The MR may be relied upon by another host
and the kernel would have to inform user-space the MR was invalid then
user-space would have to tell the remote application. This seems like a
lot of burden to place on applications and may be subject to timing
issues. Either that or all RDMA applications need to be written with the
assumption that their target memory could go away at any time.

More generally, if you tell one PCI device to do a DMA transfer to
another PCI device's BAR space, and the target memory gets evicted, then
the DMA transaction needs to be aborted, which means every driver doing the
transfer would need special support for this. If the memory can be
relied on not to be evicted, then existing drivers should work unmodified
(i.e. O_DIRECT to/from an NVMe card would just work).

I feel the better approach is to pin memory subject to P2P transactions
as is typically done with DMA transfers to main memory.
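
The pinning rule argued for here can be stated as a tiny invariant: eviction is refused while any pin is outstanding. The following is a toy sketch only; `toy_buf`, `toy_pin()`, `toy_unpin()` and `toy_try_evict()` are hypothetical names standing in for the get_user_pages / put_page pairing and for whatever eviction path a driver would have.

```c
/* Toy model of pin-before-DMA; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct toy_buf {
    int pin_count; /* number of in-flight users (e.g. DMA transfers) */
    bool resident; /* true while the data sits in device memory */
};

/* Take a pin before starting a transfer; the buffer must be resident. */
void toy_pin(struct toy_buf *b)
{
    assert(b->resident);
    b->pin_count++;
}

/* Drop the pin once the transfer has completed. */
void toy_unpin(struct toy_buf *b)
{
    assert(b->pin_count > 0);
    b->pin_count--;
}

/* Eviction succeeds only when nobody holds a pin. */
bool toy_try_evict(struct toy_buf *b)
{
    if (b->pin_count > 0)
        return false;
    b->resident = false;
    return true;
}
```

Under this rule a driver issuing the DMA never sees its target disappear, which is why existing drivers could stay unmodified.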

Logan

2016-11-23 17:28:57

by Bart Van Assche

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
> IMO any memory that has been registered for a P2P transaction should be
> locked from being evicted. So if there's a get_user_pages call it needs
> to be pinned until the put_page. The main issue being with the RDMA
> case: handling an eviction when a chunk of memory has been registered as
> an MR would be very tricky. The MR may be relied upon by another host
> and the kernel would have to inform user-space the MR was invalid then
> user-space would have to tell the remote application.

Hello Logan,

Are you aware that the Linux kernel already supports ODP (On Demand
Paging)? See also the output of git grep -nHi on.demand.paging. See also
https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.

Bart.

2016-11-23 18:40:51

by Dan Williams

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 9:27 AM, Bart Van Assche
<[email protected]> wrote:
> On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
>>
>> IMO any memory that has been registered for a P2P transaction should be
>> locked from being evicted. So if there's a get_user_pages call it needs
>> to be pinned until the put_page. The main issue being with the RDMA
>> case: handling an eviction when a chunk of memory has been registered as
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid then
>> user-space would have to tell the remote application.
>
>
> Hello Logan,
>
> Are you aware that the Linux kernel already supports ODP (On Demand Paging)?
> See also the output of git grep -nHi on.demand.paging. See also
> https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.
>

I don't think that was designed for the case where the backing memory
is a special/static physical address range rather than anonymous
"System RAM", right?

I think we should handle the graphics P2P concerns separately from the
general P2P-DMA case since the latter does not require the higher
order memory management facilities. Using ZONE_DEVICE/DAX mappings to
avoid changes to every driver that wants to support P2P-DMA separately
from typical DMA still seems the path of least resistance.

2016-11-23 19:06:07

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:

> an MR would be very tricky. The MR may be relied upon by another host
> and the kernel would have to inform user-space the MR was invalid then
> user-space would have to tell the remote application.

As Bart says, it would be best to be combined with something like
Mellanox's ODP MRs, which allows a page to be evicted and then trigger
a CPU interrupt if a DMA is attempted so it can be brought back. This
includes the usual fencing mechanism so the CPU can block, flush, and
then evict a page coherently.

This is the general direction the industry is going in: Link PCI DMA
directly to dynamic user page tables, including support for demand
faulting and synchronicity.

Mellanox ODP is a rough implementation of mirroring a process's page
table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is
probably a good example of where this is ultimately headed.

CAPI allows a PCI DMA to directly target an ASID associated with a
user process and then use the usual CPU machinery to do the page
translation for the DMA. This includes page faults for evicted pages,
and obviously allows eviction and migration..

So, of all the solutions in the original list, I would discard
anything that isn't VMA focused. Emulating what CAPI does in hardware
with software is probably the best choice, or we have to do it all
again when CAPI style hardware broadly rolls out :(

DAX and GPU allocators should create VMAs and manipulate them in the
usual way to achieve migration, windowing, cache&mirror, movement or
swap of the potentially peer-peer memory pages. They would have to
respect the usual rules for a VMA, including pinning.

DMA drivers would use the usual approaches for dealing with DMA from
a VMA: short term pin or long term coherent translation mirror.

So, to my view (looking from RDMA), the main problem with peer-peer is
how do you DMA translate VMA's that point at non struct page memory?

Does HMM solve the peer-peer problem? Does it do it generically or
only for drivers that are mirroring translation tables?

From an RDMA perspective we could use something other than
get_user_pages() to pin and DMA translate a VMA if the core community
could decide on an API. E.g. get_user_dma_sg() would probably be quite
usable.
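
To make the idea concrete, here is a toy userspace model in Python (not kernel code) of what a call like get_user_dma_sg() might hand back: a scatter-gather list of DMA segments rather than a list of struct pages. The helper name `coalesce_dma_segments`, the 4 KiB page size and the (start, length) tuple layout are assumptions for illustration only; no such API has been agreed on.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


def coalesce_dma_segments(page_dma_addrs):
    """Merge per-page DMA addresses into (start, length) segments.

    Pages that are adjacent in DMA address space collapse into one
    scatter-gather entry, so the caller never needs a struct page.
    """
    segments = []
    for addr in page_dma_addrs:
        if segments and segments[-1][0] + segments[-1][1] == addr:
            # This page directly follows the previous segment: extend it.
            start, length = segments[-1]
            segments[-1] = (start, length + PAGE_SIZE)
        else:
            # Gap in the address space: start a new segment.
            segments.append((addr, PAGE_SIZE))
    return segments
```

With this shape, whether an entry points at system RAM or at a peer device BAR is hidden behind the returned addresses.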

Jason

2016-11-23 19:12:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 10:40:47AM -0800, Dan Williams wrote:

> I don't think that was designed for the case where the backing memory
> is a special/static physical address range rather than anonymous
> "System RAM", right?

The hardware doesn't care where the memory is. ODP is just a generic
mechanism to provide demand-fault behavior for a mirrored page table.

ODP has the same issue as everything else, it needs to translate a
page table entry into a DMA address, and we have no API to do that
when the page table points to peer-peer memory.

Jason

2016-11-23 19:14:53

by Sagalovitch, Serguei

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices


On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:
>
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid then
>> user-space would have to tell the remote application.
> As Bart says, it would be best to be combined with something like
> Mellanox's ODP MRs, which allows a page to be evicted and then trigger
> a CPU interrupt if a DMA is attempted so it can be brought back.
Please note that in the general case (including the MR one) we could get a
"page fault" from a different PCIe device, so all PCIe devices must
be synchronized.
> includes the usual fencing mechanism so the CPU can block, flush, and
> then evict a page coherently.
>
> This is the general direction the industry is going in: Link PCI DMA
> directly to dynamic user page tables, including support for demand
> faulting and synchronicity.
>
> Mellanox ODP is a rough implementation of mirroring a process's page
> table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is
> probably a good example of where this is ultimately headed.
>
> CAPI allows a PCI DMA to directly target an ASID associated with a
> user process and then use the usual CPU machinery to do the page
> translation for the DMA. This includes page faults for evicted pages,
> and obviously allows eviction and migration..
>
> So, of all the solutions in the original list, I would discard
> anything that isn't VMA focused. Emulating what CAPI does in hardware
> with software is probably the best choice, or we have to do it all
> again when CAPI style hardware broadly rolls out :(
>
> DAX and GPU allocators should create VMAs and manipulate them in the
> usual way to achieve migration, windowing, cache&mirror, movement or
> swap of the potentially peer-peer memory pages. They would have to
> respect the usual rules for a VMA, including pinning.
>
> DMA drivers would use the usual approaches for dealing with DMA from
> a VMA: short term pin or long term coherent translation mirror.
>
> So, to my view (looking from RDMA), the main problem with peer-peer is
> how do you DMA translate VMA's that point at non struct page memory?
>
> Does HMM solve the peer-peer problem? Does it do it generically or
> only for drivers that are mirroring translation tables?
In its current form HMM doesn't solve the peer-peer problem. Currently it allows
"mirroring" of "malloc" memory on the GPU, which is not always what is needed.
Additionally, we need to be able to share VRAM allocations
between different processes.
> From a RDMA perspective we could use something other than
> get_user_pages() to pin and DMA translate a VMA if the core community
> could decide on an API. eg get_user_dma_sg() would probably be quite
> usable.
>
> Jason

2016-11-23 19:23:14

by Sagalovitch, Serguei

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 2016-11-23 12:27 PM, Bart Van Assche wrote:
> On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
>> IMO any memory that has been registered for a P2P transaction should be
>> locked from being evicted. So if there's a get_user_pages call it needs
>> to be pinned until the put_page. The main issue being with the RDMA
>> case: handling an eviction when a chunk of memory has been registered as
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid then
>> user-space would have to tell the remote application.
>
> Hello Logan,
>
> Are you aware that the Linux kernel already supports ODP (On Demand
> Paging)? See also the output of git grep -nHi on.demand.paging. See
> also
> https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.
>
> Bart.
My understanding is that the main problems are (a) h/w support and (b)
compatibility with IB Verbs semantics.

2016-11-23 19:32:39

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 02:14:40PM -0500, Serguei Sagalovitch wrote:
>
> On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:

> >As Bart says, it would be best to be combined with something like
> >Mellanox's ODP MRs, which allows a page to be evicted and then trigger
> >a CPU interrupt if a DMA is attempted so it can be brought back.

> Please note that in the general case (including MR one) we could have
> "page fault" from the different PCIe device. So all PCIe device must
> be synchronized.

Standard RDMA MRs require pinned pages, the DMA address cannot change
while the MR exists (there is no hardware support for this at all), so
page faulting from any other device is out of the question while they
exist. This is the same requirement as typical simple driver DMA which
requires pages pinned until the simple device completes DMA.

ODP RDMA MRs do not require that, they just page fault like the CPU or
really anything and the kernel has to make sense of concurrent page
faults from multiple sources.

The upshot is that GPU scenarios that rely on highly dynamic
virtual->physical translation cannot sanely be combined with standard
long-life RDMA MRs.

Certainly, any solution for GPUs must follow the typical page pinning
semantics, changing the DMA address of a page must be blocked while
any DMA is in progress.
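
The pinning rule can be stated as a tiny model. This is a Python toy, not kernel code: the `Page` class and its methods are invented for illustration, with `pin()` and `unpin()` standing in for what get_user_pages() and put_page() do for a DMA user.

```python
class Page:
    """Toy page: a DMA address plus a count of active pins."""

    def __init__(self, dma_addr):
        self.dma_addr = dma_addr
        self.pin_count = 0

    def pin(self):
        # Stand-in for get_user_pages(): a device is about to DMA here.
        self.pin_count += 1
        return self.dma_addr

    def unpin(self):
        # Stand-in for put_page(): the device has finished its DMA.
        if self.pin_count == 0:
            raise RuntimeError("unpin without matching pin")
        self.pin_count -= 1

    def migrate(self, new_dma_addr):
        """Move the page; refuse while any DMA user holds a pin."""
        if self.pin_count:
            return False
        self.dma_addr = new_dma_addr
        return True
```

A long-lived MR is simply a pin that is never dropped, which is why it blocks any allocator that wants to move the page.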

> >Does HMM solve the peer-peer problem? Does it do it generically or
> >only for drivers that are mirroring translation tables?

> In current form HMM doesn't solve peer-peer problem. Currently it allow
> "mirroring" of "malloc" memory on GPU which is not always what needed.
> Additionally there is need to have opportunity to share VRAM allocations
> between different processes.

Humm, so it can be removed from Alexander's list then :\

As Dan suggested, maybe we need to do both. Some kind of fix for
get_user_pages() for smaller mappings (eg ZONE_DEVICE) and a mandatory
API conversion to get_user_dma_sg() for other cases?

Jason

2016-11-23 19:43:14

by Sagalovitch, Serguei

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices


On 2016-11-23 03:51 AM, Christian König wrote:
> Am 23.11.2016 um 08:49 schrieb Daniel Vetter:
>> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <[email protected]> wrote:
>>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
>>>> <[email protected]> wrote:
>>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams
>>>>>> <[email protected]>
>>>>>> wrote:
>>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>>>>> <[email protected]> wrote:
>>>>>>>> I personally like "device-DAX" idea but my concerns are:
>>>>>>>>
>>>>>>>> - How well it will co-exists with the DRM infrastructure /
>>>>>>>> implementations
>>>>>>>> in part dealing with CPU pointers?
>>>>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>>>>> that you can perform pfn_to_page() on it and issue I/O, but the
>>>>>>> vma is
>>>>>>> not migratable. To be honest I do not know how well that co-exists
>>>>>>> with drm infrastructure.
>>>>>>>
>>>>>>>> - How well we will be able to handle case when we need to
>>>>>>>> "move"/"evict"
>>>>>>>> memory/data to the new location so CPU pointer should
>>>>>>>> point to the
>>>>>>>> new
>>>>>>>> physical location/address
>>>>>>>> (and may be not in PCI device memory at all)?
>>>>>>> So, device-DAX deliberately avoids support for in-kernel
>>>>>>> migration or
>>>>>>> overcommit. Those cases are left to the core mm or drm. The
>>>>>>> device-dax
>>>>>>> interface is for cases where all that is needed is a
>>>>>>> direct-mapping to
>>>>>>> a statically-allocated physical-address range be it persistent
>>>>>>> memory
>>>>>>> or some other special reserved memory range.
>>>>>> For some of the fancy use-cases (e.g. to be comparable to what
>>>>>> HMM can
>>>>>> pull off) I think we want all the magic in core mm, i.e.
>>>>>> migration and
>>>>>> overcommit. At least that seems to be the very strong drive in all
>>>>>> general-purpose gpu abstractions and implementations, where
>>>>>> memory is
>>>>>> allocated with malloc, and then mapped/moved into vram/gpu address
>>>>>> space through some magic,
>>>>> It is possible that there is other way around: memory is requested
>>>>> to be
>>>>> allocated and should be kept in vram for performance reason but due
>>>>> to possible overcommit case we need at least temporally to "move"
>>>>> such
>>>>> allocation to system memory.
>>>> With migration I meant migrating both ways of course. And with stuff
>>>> like numactl we can also influence where exactly the malloc'ed memory
>>>> is allocated originally, at least if we'd expose the vram range as a
>>>> very special numa node that happens to be far away and not hold any
>>>> cpu cores.
>>> I don't think we should be using numa distance to reverse engineer a
>>> certain allocation behavior. The latency data should be truthful, but
>>> you're right we'll need a mechanism to keep general purpose
>>> allocations out of that range by default. Btw, strict isolation is
>>> another design point of device-dax, but I think in this case we're
>>> describing something between the two extremes of full isolation and
>>> full compatibility with existing numactl apis.
>> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
>> to reuse all the existing allocation policies directly, those won't
>> work.
>> So at boot-up your default numa policy would exclude any vram nodes.
>>
>> But I think (as an -mm layman) that numa gives us a lot of the tools and
>> policy interface that we need to implement what we want for gpus.
>
> Agree completely. From a ten mile high view our GPUs are just command
> processors with local memory as well .
>
> Basically this is also the whole idea of what AMD is pushing with HSA
> for a while.
>
> It's just that a lot of problems start to pop up when you look at all
> the nasty details. For example only part of the GPU memory is usually
> accessible by the CPU.
>
> So even when numa nodes expose a good foundation for this I think
> there is still a lot of code to write.
>
> BTW: I should probably start to read into the numa code of the kernel.
> Any good pointers for that?
I would assume that the "page" allocation logic itself should live inside the
graphics driver, since graphics in particular may have different
requirements (alignment, etc.).

>
> Regards,
> Christian.
>
>> Wrt isolation: There's a sliding scale of what different users expect,
>> from full auto everything, including migrating pages around if needed to
>> full isolation all seems to be on the table. As long as we keep vram
>> nodes
>> out of any default allocation numasets, full isolation should be
>> possible.
>> -Daniel
>
>

Sincerely yours,
Serguei Sagalovitch

2016-11-23 19:59:06

by Sagalovitch, Serguei

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 2016-11-23 02:12 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 10:40:47AM -0800, Dan Williams wrote:
>
>> I don't think that was designed for the case where the backing memory
>> is a special/static physical address range rather than anonymous
>> "System RAM", right?
> The hardware doesn't care where the memory is. ODP is just a generic
> mechanism to provide demand-fault behavior for a mirrored page table.
>
> ODP has the same issue as everything else, it needs to translate a
> page table entry into a DMA address, and we have no API to do that
> when the page table points to peer-peer memory.
>
> Jason
I would like to note that for graphics applications (especially for VR
support) we should avoid the ODP case at any cost during graphics command
execution, because playback has to be smooth and predictable. We want to
load / "pin" all required resources before the graphics processor begins to
touch them. This is not so critical for compute applications. Because only
the graphics / compute stack knows which resources will be in use, as well
as all the statistics, only the graphics stack is capable of making the
correct decision about when and _where_ to evict, as well as when and
_where_ to put memory back.

2016-11-23 20:33:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote:

> We do not want to have "highly" dynamic translation due to
> performance cost. We need to support "overcommit" but would
> like to minimize impact. To support RDMA MRs for GPU/VRAM/PCIe
> device memory (which is must) we need either globally force
> pinning for the scope of "get_user_pages() / "put_pages" or have
> special handling for RDMA MRs and similar cases.

As I said, there is no possible special handling. Standard IB hardware
does not support changing the DMA address once a MR is created. Forget
about doing that.

Only ODP hardware allows changing the DMA address on the fly, and it
works at the page table level. We do not need special handling for
RDMA.

> Generally it could be difficult to correctly handle "DMA in
> progress" due to the facts that (a) DMA could originate from
> numerous PCIe devices simultaneously including requests to
> receive network data.

We handle all of this today in kernel via the page pinning mechanism.
This needs to be copied into peer-peer memory and GPU memory schemes
as well. A pinned page means the DMA address cannot be changed and
there is active non-CPU access to it.

Any hardware that does not support page table mirroring must go this
route.

> (b) in HSA case DMA could originated from user space without kernel
> driver knowledge. So without corresponding h/w support
> everywhere I do not see how it could be solved effectively.

All true user triggered DMA must go through some kind of coherent page
table mirroring scheme (e.g. this is what CAPI does, presumably AMD's HSA
is similar). A page table mirroring scheme is basically the same as
what ODP does.

Like I said, this is the direction the industry seems to be moving in,
so any solution here should focus on VMAs/page tables as the way to link
the peer-peer devices.

To me this means at least items #1 and #3 should be removed from
Alexander's list.

Jason

2016-11-23 21:11:36

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 23/11/16 01:33 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote:
>
>> We do not want to have "highly" dynamic translation due to
>> performance cost. We need to support "overcommit" but would
>> like to minimize impact. To support RDMA MRs for GPU/VRAM/PCIe
>> device memory (which is must) we need either globally force
>> pinning for the scope of "get_user_pages() / "put_pages" or have
>> special handling for RDMA MRs and similar cases.
>
> As I said, there is no possible special handling. Standard IB hardware
> does not support changing the DMA address once a MR is created. Forget
> about doing that.

Yeah, that's essentially the point I was trying to make. Not to mention
all the other unrelated hardware that can't DMA to an address that might
disappear mid-transfer.

> Only ODP hardware allows changing the DMA address on the fly, and it
> works at the page table level. We do not need special handling for
> RDMA.

I am aware of ODP but, as noted by others, it doesn't provide a general
solution to the points above.

> Like I said, this is the direction the industry seems to be moving in,
> so any solution here should focus on VMAs/page tables as the way to link
> the peer-peer devices.

Yes, this was the appeal to us of using ZONE_DEVICE.

> To me this means at least items #1 and #3 should be removed from
> Alexander's list.

It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
really the same option. iopmem is really just one way to get BAR
addresses to user-space while inside the kernel it's ZONE_DEVICE.

Logan

2016-11-23 22:32:14

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
> > As I said, there is no possible special handling. Standard IB hardware
> > does not support changing the DMA address once a MR is created. Forget
> > about doing that.
>
> Yeah, that's essentially the point I was trying to make. Not to mention
> all the other unrelated hardware that can't DMA to an address that might
> disappear mid-transfer.

Right, it is impossible to ask for generic page migration with ongoing
DMA. That is simply not supported by any of the hardware at all.

> > Only ODP hardware allows changing the DMA address on the fly, and it
> > works at the page table level. We do not need special handling for
> > RDMA.
>
> I am aware of ODP but, noted by others, it doesn't provide a general
> solution to the points above.

How do you mean?

Perhaps I am not following what Serguei is asking for, but I
understood the desire was for a complex GPU allocator that could
migrate pages between GPU and CPU memory under control of the GPU
driver, among other things. The desire is for DMA to continue to work
even after these migrations happen.

Page table mirroring *is* the general solution for this problem. The
GPU driver controls the VMA and the DMA driver mirrors that VMA.

Do you know of another option that doesn't just degenerate to page
table mirroring??

Remember, there are two facets to the RDMA ODP implementation, I feel
there is some confusion here..

The crucial part for this discussion is the ability to fence and block
DMA for a specific range. This is the hardware capability that lets
page migration happen: fence&block DMA, migrate page, update page
table in HCA, unblock DMA.

Without that hardware support the DMA address must be unchanging, and
there is nothing we can do about it. This is why standard IB hardware
must have fixed MRs - it lacks the fence&block capability.
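
Spelled out as a toy Python model (not driver code), the sequence looks like this. `MirroredTable` and `migrate_page` are invented names; the model assumes a device that can refuse DMA to a fenced range, which is exactly the capability standard IB hardware lacks.

```python
class MirroredTable:
    """Toy device-side copy of a process page table (e.g. in an HCA)."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)   # virtual page -> DMA address
        self.blocked = set()           # virtual pages with DMA fenced

    def dma(self, vpage):
        """Return the DMA address, or None if the range is fenced."""
        if vpage in self.blocked:
            return None                # device must stall or retry
        return self.mapping[vpage]


def migrate_page(table, vpage, new_dma_addr):
    """Fence and block DMA, move the page, update the mirror, unblock."""
    table.blocked.add(vpage)             # 1. fence & block DMA
    # 2. whoever owns the VMA would copy the page contents here
    table.mapping[vpage] = new_dma_addr  # 3. update the device page table
    table.blocked.discard(vpage)         # 4. unblock DMA
```

Without step 1 there is no safe point at which step 3 can happen, so the address has to stay fixed.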

The other part is the page faulting implementation, but that is not
required, and to Serguei's point, is not desired for GPU anyhow.

> > To me this means at least items #1 and #3 should be removed from
> > Alexander's list.
>
> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
> really the same option. iopmem is really just one way to get BAR
> addresses to user-space while inside the kernel it's ZONE_DEVICE.

Seems fine for RDMA?

Didn't we just strike off everything on the list except #2? :\

Jason

2016-11-23 22:42:25

by Dan Williams

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 1:55 PM, Jason Gunthorpe
<[email protected]> wrote:
> On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
>> > As I said, there is no possible special handling. Standard IB hardware
>> > does not support changing the DMA address once a MR is created. Forget
>> > about doing that.
>>
>> Yeah, that's essentially the point I was trying to make. Not to mention
>> all the other unrelated hardware that can't DMA to an address that might
>> disappear mid-transfer.
>
> Right, it is impossible to ask for generic page migration with ongoing
> DMA. That is simply not supported by any of the hardware at all.
>
>> > Only ODP hardware allows changing the DMA address on the fly, and it
>> > works at the page table level. We do not need special handling for
>> > RDMA.
>>
>> I am aware of ODP but, noted by others, it doesn't provide a general
>> solution to the points above.
>
> How do you mean?
>
> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.
>
> Page table mirroring *is* the general solution for this problem. The
> GPU driver controls the VMA and the DMA driver mirrors that VMA.
>
> Do you know of another option that doesn't just degenerate to page
> table mirroring??
>
> Remember, there are two facets to the RDMA ODP implementation, I feel
> there is some confusion here..
>
> The crucial part for this discussion is the ability to fence and block
> DMA for a specific range. This is the hardware capability that lets
> page migration happen: fence&block DMA, migrate page, update page
> table in HCA, unblock DMA.

Wait, ODP requires migratable pages, ZONE_DEVICE pages are not
migratable. You can't replace a PCIe mapping with just any other
System RAM physical address, right? At least not without a filesystem
recording where things went, but at that point we're no longer talking
about the base P2P-DMA mapping mechanism and are instead talking about
something like pnfs-rdma to a DAX filesystem.

2016-11-23 23:25:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 02:42:12PM -0800, Dan Williams wrote:
> > The crucial part for this discussion is the ability to fence and block
> > DMA for a specific range. This is the hardware capability that lets
> > page migration happen: fence&block DMA, migrate page, update page
> > table in HCA, unblock DMA.
>
> Wait, ODP requires migratable pages, ZONE_DEVICE pages are not
> migratable.

Does it? I didn't think so.. Does ZONE_DEVICE break MMU notifiers/etc
or something? There is certainly nothing about the hardware that cares
about ZONE_DEVICE vs System memory.

I used 'migration' in the broader sense of doing any transformation to
the page such that the DMA address changes - not the specific kernel
MM process...

> You can't replace a PCIe mapping with just any other System RAM
> physical address, right?

I thought that was exactly what HMM was trying to do? Migrate pages
between CPU and GPU memory as needed. As Serguei has said this process
needs to be driven by the GPU driver.

The peer-peer issue is how do you do that while RDMA is possible on
those pages, because when the page migrates to GPU memory you want the
RDMA to follow it seamlessly.

This is why page table mirroring is the best solution - use the
existing mm machinery to link the DMA driver and whatever is
controlling the VMA.

> At least not without a filesystem recording where things went, but
> at point we're no longer talking about the base P2P-DMA mapping

In the filesystem/DAX case, it would be the filesystem that initiates
any change in the page physical address.

ODP *follows* changes in the VMA it does not cause any change in
address mapping. That has to be done by whoever is in charge of the
VMA.
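
A toy Python model of that division of labour, with callbacks fired before and after each change: the owner of the mapping decides to move it, and the DMA side only follows. `MappingOwner`, `DmaFollower` and their methods are invented for illustration and are not the MMU notifier API.

```python
class DmaFollower:
    """Toy DMA driver that mirrors an address and reacts to changes."""

    def __init__(self):
        self.cached = None
        self.events = []

    def invalidate(self, region):
        # Called before the move: stop using the old address.
        self.cached = None
        self.events.append("pre")

    def refresh(self, region):
        # Called after the move: pick up the new address.
        self.cached = region["location"]
        self.events.append("post")


class MappingOwner:
    """Toy owner of a mapping (GPU driver, DAX filesystem, ...)."""

    def __init__(self):
        self.followers = []

    def register(self, follower):
        self.followers.append(follower)

    def move(self, region, new_location):
        for follower in self.followers:
            follower.invalidate(region)
        region["location"] = new_location   # only the owner changes this
        for follower in self.followers:
            follower.refresh(region)
```

The follower never initiates a move; it just keeps its cached address valid.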

> something like pnfs-rdma to a DAX filesystem.

Something in the kernel (ie nfs-rdma) would be entirely different. We
generally don't do long lived mappings in the kernel for RDMA
(certainly not for NFS), so it is much more like your basic every day
DMA operation: map, execute, unmap. We probably don't need to use page
table mirroring for this.
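
For contrast, the short-lived pattern can be sketched as follows. This is a Python toy, not kernel code: `dma_mapped`, `run_transfer` and the `active_mappings` set are invented stand-ins for the real map/unmap calls and mapping state.

```python
from contextlib import contextmanager

active_mappings = set()  # toy stand-in for IOMMU/DMA mapping state


@contextmanager
def dma_mapped(buffer_id):
    """Map a buffer for the duration of one DMA and always unmap it."""
    active_mappings.add(buffer_id)          # map
    try:
        yield buffer_id                     # caller executes the DMA
    finally:
        active_mappings.discard(buffer_id)  # unmap, even on error


def run_transfer(buffer_id):
    """Report whether the buffer was mapped while the DMA ran."""
    with dma_mapped(buffer_id) as handle:
        return handle in active_mappings
```

Because the mapping never outlives one operation, the pin is short and nothing has to track later changes to the page tables.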

ODP comes in when userspace mmaps a DAX file and then tries to use it
for RDMA. Page table mirroring lets the DAX filesystem decide to move
the backing pages at any time. When it wants to do that it interacts
with the MM in the usual way which links to ODP and makes sure the
migration is seamless.

Jason

2016-11-24 00:43:09

by Sagalovitch, Serguei

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:

> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.

The main issue is how to solve use cases where p2p is
requested/initiated via CPU pointers, where such pointers could
point to a non-system memory location, e.g. VRAM.

It would provide a consistent working model in which the user deals only
with pointers (HSA, CUDA, OpenCL 2.0 SVM), and it would also improve
performance by avoiding double-buffering and extra special code
when dealing with PCIe device memory.

Examples are:

- RDMA network operations: RDMA MRs where the registered memory
  could be e.g. VRAM. Currently this is solved using the so-called PeerDirect
  interface, which is out-of-tree and provided as part of OFED.
- File operations (fread/fwrite) when the user wants to transfer file data
  directly to/from e.g. VRAM.


Challenges are:
- Because the graphics sub-system must support overcommit (at least each
  application/process should independently see all resources), ideally
  such memory should be movable without changing the CPU pointer value,
  and it should be possible to page it out, with "page fault" support at
  least on access from the CPU.
- We must co-exist with the existing DRM infrastructure, as well as
  support sharing VRAM memory between different processes.
- We should be able to deal with large allocations: tens or hundreds of
  MBs, or maybe GBs.
- We may have PCIe devices where p2p may not work.
- Potentially any GPU memory should be supported, including
  memory carved out from system RAM (e.g. allocated via
  get_free_pages()).


Note:
- In the case of RDMA MRs, the life-span of the "pinning"
  ("get_user_pages"/"put_page") may be defined/controlled by the
  application, not the kernel, so it may need to be
  treated differently as a special case.


The original proposal was to create "struct pages" for VRAM memory
to allow "get_user_pages" to work transparently, similar to
how it is/was done for the "DAX Device" case. Unfortunately,
based on my understanding, the "DAX Device" implementation
deals only with permanently "locked" memory (fixed location),
unrelated to the "get_user_pages"/"put_page" scope,
which doesn't satisfy the requirements for "eviction" / "moving" of
memory while keeping the CPU address intact.

> The desire is for DMA to continue to work
> even after these migrations happen
At least some kind of mm notifier callback to inform about a change
in location (pre- and post-), similar to how it is done for system pages.
My understanding is that it will not solve the RDMA MR issue, where the "lock"
could last for the whole application life, but (a) it will not make the
RDMA MR case worse and (b) it should be enough for all other cases where
"get_user_pages"/"put_page" is controlled by the kernel.



2016-11-24 01:25:29

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
>>> Only ODP hardware allows changing the DMA address on the fly, and it
>>> works at the page table level. We do not need special handling for
>>> RDMA.
>>
>> I am aware of ODP but, noted by others, it doesn't provide a general
>> solution to the points above.
>
> How do you mean?

I was only saying it wasn't general in that it wouldn't work for IB
hardware that doesn't support ODP or other hardware that doesn't do
similar things (like an NVMe drive).

It makes sense for hardware that supports ODP to allow MRs to not pin
the underlying memory and provide for migrations that the hardware can
follow. But most DMA engines will require the memory to be pinned and
any complex allocators (GPU or otherwise) should respect that. And that
seems like it should be the default way most of this works -- and I
think it wouldn't actually take too much effort to make it all work now
as is. (Our iopmem work is actually quite small and simple.)

>> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
>> really the same option. iopmem is really just one way to get BAR
>> addresses to user-space while inside the kernel it's ZONE_DEVICE.
>
> Seems fine for RDMA?

Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
memory working for some time. I'd say it's a good fit. The main question
we've had is how to expose PCIe bars to userspace to be used as MRs and
such.


Logan

2016-11-24 09:45:42

by Christian König

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:
> There is certainly nothing about the hardware that cares
> about ZONE_DEVICE vs System memory.
Well that is clearly not so simple. When your ZONE_DEVICE pages describe
a PCI BAR and another PCI device initiates a DMA to this address the DMA
subsystem must be able to check if the interconnection really works.

E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE.
Now PCI device B (a SATA device) can directly read/write to it because
it is on the same bus segment, but PCI device C (a network card for
example) can't because it is on a different bus segment and the bridge
can't handle P2P transactions.

We need to be able to handle such cases and fall back to bouncing
buffers, but I don't see that in the DMA subsystem right now.
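
The decision I have in mind could look roughly like the following toy Python function. It is only a sketch of the policy: `choose_dma_path` and its parameters are invented, and a real check would have to walk the actual PCI topology.

```python
def choose_dma_path(initiator_segment, target_segment, bridge_supports_p2p):
    """Pick a transfer strategy for DMA from one PCI device to a peer BAR.

    Returns "p2p" when the devices share a bus segment or the bridge
    between them forwards peer-to-peer traffic, else "bounce" (stage the
    data through a buffer in system memory).
    """
    if initiator_segment == target_segment:
        return "p2p"
    if bridge_supports_p2p:
        return "p2p"
    return "bounce"
```

In the example above, device B would get "p2p" and device C would get "bounce".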

Regards,
Christian.

2016-11-24 16:24:34

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Thu, Nov 24, 2016 at 12:40:37AM +0000, Sagalovitch, Serguei wrote:
> On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
>
> > Perhaps I am not following what Serguei is asking for, but I
> > understood the desire was for a complex GPU allocator that could
> > migrate pages between GPU and CPU memory under control of the GPU
> > driver, among other things. The desire is for DMA to continue to work
> > even after these migrations happen.
>
> The main issue is how to solve use cases when p2p is
> requested/initiated via CPU pointers where such pointers could
> point to non-system memory location e.g. VRAM.

Okay, but your list is conflating a whole bunch of problems..

1) How to go from a __user pointer to a p2p DMA address
   a) How to validate, setup iommu and maybe worst case bounce buffer
      these p2p DMAs
2) How to allow drivers (ie GPU allocator) dynamically
   remap pages in a VMA to/from p2p DMA addresses
3) How to expose uncachable p2p DMA address to user space via mmap

> to allow "get_user_pages" to work transparently similar
> how it is/was done for "DAX Device" case. Unfortunately
> based on my understanding "DAX Device" implementation
> deal only with permanently "locked" memory (fixed location)
> unrelated to "get_user_pages"/"put_page" scope
> which doesn't satisfy requirements for "eviction" / "moving" of
> memory keeping CPU address intact.

Hurm, isn't that issue with DAX only to do with being coherent with
the page cache?

A GPU allocator would not use the page cache, it would have to
construct VMAs some other way.

> My understanding is that It will not solve RDMA MR issue where "lock"
> could be during the whole application life but (a) it will not make
> RDMA MR case worse (b) should be enough for all other cases for
> "get_user_pages"/"put_page" controlled by kernel.

Right. There is no solution to the RDMA MR issue on old hardware. Apps
that are using GPU+RDMA+Old hardware will have to use short lived MRs
and pay that performance cost, or give up on migration.

Jason

2016-11-24 16:26:28

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:
> Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:
> >There is certainly nothing about the hardware that cares
> >about ZONE_DEVICE vs System memory.
> Well that is clearly not so simple. When your ZONE_DEVICE pages describe a
> PCI BAR and another PCI device initiates a DMA to this address the DMA
> subsystem must be able to check if the interconnection really works.

I said the hardware doesn't care.. You are right, we still have an
outstanding problem in Linux of how to generically DMA map a P2P
address - which is a different issue from getting the P2P address from
a __user pointer...

Jason

2016-11-24 16:43:04

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:
>
>
> On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
> >>> Only ODP hardware allows changing the DMA address on the fly, and it
> >>> works at the page table level. We do not need special handling for
> >>> RDMA.
> >>
> >> I am aware of ODP but, noted by others, it doesn't provide a general
> >> solution to the points above.
> >
> > How do you mean?
>
> I was only saying it wasn't general in that it wouldn't work for IB
> hardware that doesn't support ODP or other hardware that doesn't do
> similar things (like an NVMe drive).

There are three cases to worry about:
- Coherent long lived page table mirroring (RDMA ODP MR)
- Non-coherent long lived page table mirroring (RDMA MR)
- Short lived DMA mapping (everything else)

Like you say below we have to handle short lived in the usual way, and
that covers basically every device except IB MRs, including the
command queue on a NVMe drive.

> any complex allocators (GPU or otherwise) should respect that. And that
> seems like it should be the default way most of this works -- and I
> think it wouldn't actually take too much effort to make it all work now
> as is. (Our iopmem work is actually quite small and simple.)

Yes, absolutely, some kind of page pinning like locking is a hard
requirement.

> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
> memory working for some time. I'd say it's a good fit. The main question
> we've had is how to expose PCIe bars to userspace to be used as MRs and
> such.

Is there any progress on that?

I still don't quite get what iopmem was about.. I thought the
objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
over iopmem and still ending up with uncacheable mmaps still seems
like a non-starter to me...

Serguei, what is your plan in GPU land for migration? Ie if I have a
CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
- do you still allow the CPU to access it? Or do you swap it back to
cachable memory if the CPU touches it?

One approach might be to mmap the uncachable ZONE_DEVICE memory and
mark it inaccessible to the CPU - DMA could still translate. If the
CPU needs it then the kernel migrates it to system memory so it
becomes cachable. ??

Jason

2016-11-24 17:00:35

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices


On 2016-11-24 11:26 AM, Jason Gunthorpe wrote:
> On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:
>> Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:
>>> There is certainly nothing about the hardware that cares
>>> about ZONE_DEVICE vs System memory.
>> Well that is clearly not so simple. When your ZONE_DEVICE pages describe a
>> PCI BAR and another PCI device initiates a DMA to this address the DMA
>> subsystem must be able to check if the interconnection really works.
> I said the hardware doesn't care.. You are right, we still have an
> outstanding problem in Linux of how to generically DMA map a P2P
> address - which is a different issue from getting the P2P address from
> a __user pointer...
>
> Jason
I agree, but the problem is that solving one issue immediately introduces
another one to solve, and so on (if we do not want to cut corners). I would
think that a lot of them are interconnected, because the way one problem is
solved may impact the solution for another.

btw: about "DMA map a p2p address": Right now, to enable p2p between devices
it is required/recommended to disable iommu support (e.g. the intel iommu
driver has special logic for graphics and the comment "Reserve all PCI MMIO
to avoid peer-to-peer access").

2016-11-24 17:55:46

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

Hey,

On 24/11/16 02:45 AM, Christian König wrote:
> E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE.
> Now PCI device B (a SATA device) can directly read/write to it because
> it is on the same bus segment, but PCI device C (a network card for
> example) can't because it is on a different bus segment and the bridge
> can't handle P2P transactions.

Yeah, that could be an issue but in our experience we have yet to see
it. We've tested with two separate PCI buses on different CPUs connected
through QPI links and it works fine. (It is rather slow but I understand
Intel has improved the bottleneck in newer CPUs than the ones we tested.)

It may just be older hardware that has this issue. As long as a failed
transfer can be handled gracefully by the initiator, I don't see a need
to predetermine whether a device can see another device's memory.


Logan

2016-11-24 18:11:40

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 24/11/16 09:42 AM, Jason Gunthorpe wrote:
> There are three cases to worry about:
> - Coherent long lived page table mirroring (RDMA ODP MR)
> - Non-coherent long lived page table mirroring (RDMA MR)
> - Short lived DMA mapping (everything else)
>
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.

Yes, this makes sense to me. Though I thought regular IB MRs with
regular memory currently pinned the pages (despite being long lived);
that's why we can run up against the "max locked memory" limit. It
doesn't seem so terrible if GPU memory had a similar restriction until
ODP-like solutions get implemented.

>> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
>> memory working for some time. I'd say it's a good fit. The main question
>> we've had is how to expose PCIe bars to userspace to be used as MRs and
>> such.

> Is there any progress on that?

Well, I guess there's some consensus building to do. The existing
options are:

* Device DAX: which could work but the problem I see with it is that it
only allows one application to do these transfers. Or there would have
to be some user-space coordination to figure out which application gets
what memory.

* Regular DAX in the FS doesn't work at this time because the FS can
move the file you think you're transferring to out from under you. Though I
understand there's been some work with XFS to solve that issue.

Though, we've been considering that the backing memory would be
non-volatile, which adds some of this complexity. If the memory were
volatile the kernel would just need to do some relatively straightforward
allocation to user-space when asked. For example, with NVMe, the
kernel could give chunks of the CMB buffer to userspace via an mmap call
to /dev/nvmeX. Though I think there's been some push back against things
like that as well.

> I still don't quite get what iopmem was about.. I thought the
> objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
> over iopmem and still ending up with uncacheable mmaps still seems
> like a non-starter to me...

The latest incarnation of iopmem simply created a block device backed by
ZONE_DEVICE memory on a PCIe BAR. We then put a DAX FS on it and
user-space could mmap the files and send them to other devices to do P2P
transfers.

I don't think there was a hard objection to uncachable ZONE_DEVICE and
DAX. We did try our experimental hardware with cached ZONE_DEVICE and it
did work but the performance was beyond unusable (which may be a
hardware issue). In the end I feel the driver would have to decide the
most appropriate caching for the hardware and I don't understand why WC
or UC wouldn't work with ZONE_DEVICE.

Logan

2016-11-25 08:21:42

by Christoph Hellwig

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Thu, Nov 24, 2016 at 11:11:34AM -0700, Logan Gunthorpe wrote:
> * Regular DAX in the FS doesn't work at this time because the FS can
> move the file you think you're transferring to out from under you. Though I
> understand there's been some work with XFS to solve that issue.

The file system will never move anything under locked down pages;
locking down pages is used exactly to protect against that. So as long
as we have page structures available, RDMA to/from device memory _from
kernel space_ is trivial, although for file systems to work properly you
really want a notification to the consumer if the file system wants
to remove the mapping. We have implemented that using FL_LAYOUTS locks
for NFSD, but only XFS supports it so far. Without that, a long term
locked down region of memory (e.g. a kernel MR) would prevent various
file operations, which would simply hang.

2016-11-25 13:06:51

by Christian König

Subject: Re: Enabling peer to peer device transactions for PCIe devices

Am 24.11.2016 um 18:55 schrieb Logan Gunthorpe:
> Hey,
>
> On 24/11/16 02:45 AM, Christian König wrote:
>> E.g. it can happen that PCI device A exports its BAR using ZONE_DEVICE.
>> Now PCI device B (a SATA device) can directly read/write to it because
>> it is on the same bus segment, but PCI device C (a network card for
>> example) can't because it is on a different bus segment and the bridge
>> can't handle P2P transactions.
> Yeah, that could be an issue but in our experience we have yet to see
> it. We've tested with two separate PCI buses on different CPUs connected
> through QPI links and it works fine. (It is rather slow but I understand
> Intel has improved the bottleneck in newer CPUs than the ones we tested.)

Well, Serguei sent me a couple of documents about QPI when we started to
discuss this internally as well, and that's exactly one of the cases I
had in mind when writing this.

If I understood it correctly, for such systems P2P is technically possible,
but not necessarily a good idea. Usually it is faster to just use a
bounce buffer when the peers are a bit "farther" apart.

That this problem is solved on newer hardware is good, but doesn't help
us at all if we want to support at least systems from the last five
years or so.

> It may just be older hardware that has this issue. I expect that as long
> as a failed transfer can be handled gracefully by the initiator I don't
> see a need to predetermine whether a device can see another devices memory.

I don't want to predetermine whether a device can see another device's
memory at get_user_pages() time.

My thinking was going more in the direction of a whitelist to figure
out during dma_map_single()/dma_map_sg() time if we should use a
bounce buffer or not.
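
A minimal sketch of what such a map-time decision could look like, written
as a userspace toy model in C rather than as kernel code. Every name here
(struct toy_dev, toy_choose_path(), the whitelist contents) is hypothetical
and only illustrates the idea of "check topology against a whitelist, else
bounce"; nothing like this exists in the DMA API as discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of where a device sits in the PCIe topology. */
struct toy_dev {
	int root_complex;	/* CPU socket / root complex the device hangs off */
	int switch_id;		/* PCIe switch the device sits behind, -1 if none */
};

enum toy_map_path { TOY_MAP_DIRECT, TOY_MAP_BOUNCE };

/* Hypothetical whitelist: root complexes known to route P2P acceptably. */
static const int toy_p2p_ok_root_complexes[] = { 2 };

static bool toy_root_complex_whitelisted(int rc)
{
	size_t n = sizeof(toy_p2p_ok_root_complexes) /
		   sizeof(toy_p2p_ok_root_complexes[0]);

	for (size_t i = 0; i < n; i++)
		if (toy_p2p_ok_root_complexes[i] == rc)
			return true;
	return false;
}

/* Decide at mapping time whether the initiator may DMA straight to the
 * target's BAR or should go through a bounce buffer in system memory. */
static enum toy_map_path toy_choose_path(const struct toy_dev *initiator,
					 const struct toy_dev *target)
{
	/* Both behind the same switch: traffic never reaches the root complex. */
	if (initiator->switch_id >= 0 &&
	    initiator->switch_id == target->switch_id)
		return TOY_MAP_DIRECT;

	/* Same root complex, and that root complex is whitelisted. */
	if (initiator->root_complex == target->root_complex &&
	    toy_root_complex_whitelisted(initiator->root_complex))
		return TOY_MAP_DIRECT;

	/* Anything else (e.g. crossing QPI) falls back to bouncing. */
	return TOY_MAP_BOUNCE;
}
```

The point of deciding here instead of at get_user_pages() time is that only
the mapping step knows both the initiator and the target.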

Christian.

>
>
> Logan


2016-11-25 14:57:06

by Christian König

Subject: Re: Enabling peer to peer device transactions for PCIe devices

Am 24.11.2016 um 17:42 schrieb Jason Gunthorpe:
> On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:
>>
>> On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
>>>>> Only ODP hardware allows changing the DMA address on the fly, and it
>>>>> works at the page table level. We do not need special handling for
>>>>> RDMA.
>>>> I am aware of ODP but, noted by others, it doesn't provide a general
>>>> solution to the points above.
>>> How do you mean?
>> I was only saying it wasn't general in that it wouldn't work for IB
>> hardware that doesn't support ODP or other hardware that doesn't do
>> similar things (like an NVMe drive).
> There are three cases to worry about:
> - Coherent long lived page table mirroring (RDMA ODP MR)
> - Non-coherent long lived page table mirroring (RDMA MR)
> - Short lived DMA mapping (everything else)
>
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.

Well a problem which wasn't mentioned so far is that while GPUs do have
a page table to mirror the CPU page table, they usually can't recover
from page faults.

So what we do is make sure that all memory accessed by the GPU jobs
stays in place while those jobs run (pretty much the same pinning you do
for the DMA).

But since this can lock down huge amounts of memory, the whole command
submission to GPUs is bound to the memory management. So when too much
memory would get blocked by the GPU we block further command submissions
until the situation resolves.
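
As a rough illustration of that coupling, here is a userspace toy model in C.
The structure, the byte budget and the function names are all assumptions
made up for this sketch; the real driver logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting of memory held in place by running GPU jobs. */
struct toy_gpu_mm {
	size_t pinned_bytes;	/* memory currently kept in place */
	size_t pin_limit;	/* assumed budget before submissions block */
};

/* Try to submit a job that needs job_bytes kept in place.
 * Returns false when the caller has to wait for running jobs to finish. */
static bool toy_submit(struct toy_gpu_mm *mm, size_t job_bytes)
{
	/* pinned_bytes never exceeds pin_limit, so this cannot underflow. */
	if (job_bytes > mm->pin_limit - mm->pinned_bytes)
		return false;
	mm->pinned_bytes += job_bytes;
	return true;
}

/* A job finished: its memory may move again. */
static void toy_job_done(struct toy_gpu_mm *mm, size_t job_bytes)
{
	assert(job_bytes <= mm->pinned_bytes);
	mm->pinned_bytes -= job_bytes;
}
```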

>> any complex allocators (GPU or otherwise) should respect that. And that
>> seems like it should be the default way most of this works -- and I
>> think it wouldn't actually take too much effort to make it all work now
>> as is. (Our iopmem work is actually quite small and simple.)
> Yes, absolutely, some kind of page pinning like locking is a hard
> requirement.
>
>> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
>> memory working for some time. I'd say it's a good fit. The main question
>> we've had is how to expose PCIe bars to userspace to be used as MRs and
>> such.
> Is there any progress on that?
>
> I still don't quite get what iopmem was about.. I thought the
> objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
> over iopmem and still ending up with uncacheable mmaps still seems
> like a non-starter to me...
>
> Serguei, what is your plan in GPU land for migration? Ie if I have a
> CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
> - do you still allow the CPU to access it? Or do you swap it back to
> cachable memory if the CPU touches it?

Depends on the policy in command, but currently it's the other way
around most of the time.

E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids
reading because that is slow, the GPU in turn can access it with full speed.

When we run out of VRAM we move those allocations to system memory and
update both the CPU as well as the GPU page tables.

So that move is transparent for both userspace as well as shaders
running on the GPU.

> One approach might be to mmap the uncachable ZONE_DEVICE memory and
> mark it inaccessible to the CPU - DMA could still translate. If the
> CPU needs it then the kernel migrates it to system memory so it
> becomes cachable. ??

The whole purpose of this effort is that we can do I/O on VRAM directly
without migrating everything back to system memory.

Allowing this, but then doing the migration on the first touch by the
CPU is clearly not a good idea.

Regards,
Christian.

>
> Jason


2016-11-25 16:46:08

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 25/11/16 06:06 AM, Christian König wrote:
> Well, Serguei sent me a couple of documents about QPI when we started to
> discuss this internally as well, and that's exactly one of the cases I
> had in mind when writing this.
>
> If I understood it correctly, for such systems P2P is technically possible,
> but not necessarily a good idea. Usually it is faster to just use a
> bounce buffer when the peers are a bit "farther" apart.
>
> That this problem is solved on newer hardware is good, but doesn't help
> us at all if we want to support at least systems from the last five
> years or so.

Well we have been testing with Sandy Bridge, I think the problem was
supposed to be fixed in Ivy but we never tested it so I can't say what
the performance turned out to be. Ivy is nearly 5 years old. I expect
this is something that will be improved more and more with subsequent
generations.

A white list may end up being rather complicated if it has to cover
different CPU generations and system architectures. I feel this is a
decision user space could easily make.

Logan

2016-11-25 17:22:23

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices


> A white list may end up being rather complicated if it has to cover
> different CPU generations and system architectures. I feel this is a
> decision user space could easily make.
>
> Logan
I agree that it is better to leave it up to user space to check what is
working and what is not. I found that write practically always works but
read very often does not. Also, sometimes a system BIOS update can fix the
issue.

2016-11-25 17:59:41

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices


> Well, I guess there's some consensus building to do. The existing
> options are:
>
> * Device DAX: which could work but the problem I see with it is that it
> only allows one application to do these transfers. Or there would have
> to be some user-space coordination to figure out which application gets
> what memory.
About the one-application restriction: so it is per memory mapping? I assume
that it should not be a problem for one application to do transfers to
several devices simultaneously? Am I right?

Maybe we should follow the RDMA MR design and register memory for p2p
transfer from user space?

What about the following:

a) Device DAX is created
b) "Normal" (movable, etc.) allocation will be done for PCIe memory and
   CPU pointer/access will be requested.
c) p2p_mr_register() will be called and CPU pointer (mmap() on DAX
   Device) will be returned. Accordingly such memory will be marked as
   "unmovable" by e.g. graphics driver.
d) When p2p is no longer needed, p2p_mr_unregister() will be called.

What do you think? Will it work?
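
To make steps c) and d) concrete, a userspace toy model in C of the
register/unregister bookkeeping might look like the sketch below.
p2p_mr_register()/p2p_mr_unregister() are only proposed names from this mail
and do not exist anywhere; the toy_ prefixed functions and struct toy_bo are
made up for illustration, and a counter is just one possible way to allow
several registrations of the same buffer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical buffer object living in PCIe device memory. */
struct toy_bo {
	unsigned int p2p_regs;	/* outstanding p2p registrations */
};

/* Step c): registering for p2p makes the buffer "unmovable". */
static void toy_p2p_mr_register(struct toy_bo *bo)
{
	bo->p2p_regs++;
}

/* Step d): drop one registration when p2p is no longer needed. */
static void toy_p2p_mr_unregister(struct toy_bo *bo)
{
	assert(bo->p2p_regs > 0);
	bo->p2p_regs--;
}

/* The allocator (e.g. a graphics driver) may only evict or move the
 * buffer while nobody has it registered for p2p. */
static bool toy_bo_can_move(const struct toy_bo *bo)
{
	return bo->p2p_regs == 0;
}
```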


2016-11-25 18:54:19

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 2016-11-25 08:22 AM, Christian König wrote:
>
>> Serguei, what is your plan in GPU land for migration? Ie if I have a
>> CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
>> - do you still allow the CPU to access it? Or do you swap it back to
>> cachable memory if the CPU touches it?
>
> Depends on the policy in command, but currently it's the other way
> around most of the time.
>
> E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids
> reading because that is slow, the GPU in turn can access it with full
> speed.
>
> When we run out of VRAM we move those allocations to system memory and
> update both the CPU as well as the GPU page tables.
>
> So that move is transparent for both userspace as well as shaders
> running on the GPU.
I would like to add more in relation to CPU access:

a) We could have a CPU-accessible part of VRAM ("inside" of PCIe BAR register)
and a non-CPU-accessible part. As a result, if the user needs to have
CPU access then the memory should be located in the CPU-accessible part
of VRAM or in system memory.

Application/user mode driver could specify preference/hints of
locations based on their assumption / knowledge about access
pattern requirements, game resolution, knowledge
about size of VRAM memory, etc. So if CPU access performance
is critical then such memory should be allocated in system memory
as the first (and maybe only) choice.

b) Allocation may not have CPU address at all - only GPU one.
Also we may not be able to have CPU address/accesses for all VRAM
memory, but memory may still be migrated in any case, regardless of
whether we have a CPU address or not.

c) " VRAM, it becomes non-cachable "
Strictly speaking VRAM is configured as WC (write-combined memory) to
provide fast CPU write access. Also it was found that sometimes, if CPU
access is not critical from a performance perspective, it may be useful
to allocate/program system memory also as WC to avoid the need for
extra "snooping" to synchronize with CPU caches during GPU access.
So potentially system memory could be WC too.


2016-11-25 19:33:14

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:

> >Like you say below we have to handle short lived in the usual way, and
> >that covers basically every device except IB MRs, including the
> >command queue on a NVMe drive.
>
> Well a problem which wasn't mentioned so far is that while GPUs do have a
> page table to mirror the CPU page table, they usually can't recover from
> page faults.

> So what we do is making sure that all memory accessed by the GPU Jobs stays
> in place while those jobs run (pretty much the same pinning you do for the
> DMA).

Yes, it is DMA, so this is a valid approach.

But, you don't need page faults from the GPU to do proper coherent
page table mirroring. Basically when the driver submits the work to
the GPU it 'faults' the pages into the CPU and mirror translation
table (instead of pinning).

Like in ODP, MMU notifiers/HMM are used to monitor for translation
changes. If a change comes in the GPU driver checks if an executing
command is touching those pages and blocks the MMU notifier until the
command flushes, then unfaults the page (blocking future commands) and
unblocks the mmu notifier.

The code moving the page will move it and the next GPU command that
needs it will refault it in the usual way, just like the CPU would.

This might be much more efficient since it optimizes for the common
case of unchanging translation tables.
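
A toy model of that sequence, in userspace C, tracking a single page. This is
only a sketch under assumptions: all names are hypothetical, and where a real
MMU notifier callback would block until in-flight commands flush, this model
returns false so the caller can retry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state in the device's mirror translation table. */
struct toy_mirror {
	bool mapped;	/* page present in the mirror table */
	int in_flight;	/* commands currently touching the page */
	int faults;	/* how often the page was (re)faulted in */
};

/* Command submission: fault the page into the mirror table if it is not
 * there (instead of pinning it), then mark it as in use. */
static void toy_cmd_start(struct toy_mirror *m)
{
	if (!m->mapped) {
		m->mapped = true;
		m->faults++;
	}
	m->in_flight++;
}

static void toy_cmd_finish(struct toy_mirror *m)
{
	assert(m->in_flight > 0);
	m->in_flight--;
}

/* Notifier-style invalidation: cannot complete while a command is still
 * touching the page. Once idle, unfault the page so that the next command
 * has to refault it after the move. */
static bool toy_invalidate(struct toy_mirror *m)
{
	if (m->in_flight > 0)
		return false;
	m->mapped = false;
	return true;
}
```

In the unchanged-translation case only the first toy_cmd_start() pays for a
fault, which is the optimization described above.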

This assumes the commands are fairly short lived of course, the
expectation of the mmu notifiers is that a flush is reasonably prompt..

> >Serguei, what is your plan in GPU land for migration? Ie if I have a
> >CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
> >- do you still allow the CPU to access it? Or do you swap it back to
> >cachable memory if the CPU touches it?
>
> Depends on the policy in command, but currently it's the other way around
> most of the time.
>
> E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids reading
> because that is slow, the GPU in turn can access it with full speed.
>
> When we run out of VRAM we move those allocations to system memory and
> update both the CPU as well as the GPU page tables.
>
> So that move is transparent for both userspace as well as shaders running on
> the GPU.

That makes sense to me, but the objection that came back for
non-cachable CPU mappings is that it basically breaks too much stuff
subtly, eg atomics, unaligned accesses, the CPU threading memory
model, all change on various architectures and break when caching is
disabled.

IMHO that is OK for specialty things like the GPU where the mmap comes
in via drm or something and apps know to handle that buffer specially.

But it is certainly not OK for DAX where the application is coded for
normal file open()/mmap() and is not prepared for a mmap where (eg)
unaligned read accesses or atomics don't work depending on how the
filesystem is setup.

Which is why I think iopmem is still problematic..

At the very least I think a mmap flag or open flag should be needed to
opt into this behavior and by default non-cacheable DAX mmaps should
be paged into system ram when the CPU accesses them.

I'm hearing most people say ZONE_DEVICE is the way to handle this,
which means the missing remaining piece for RDMA is some kind of DMA
core support for p2p address translation..

Jason

2016-11-25 19:34:26

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:

> b) Allocation may not have CPU address at all - only GPU one.

But you don't expect RDMA to work in that case, right?

GPU people need to stop doing this windowed memory stuff :)

Jason

2016-11-25 19:41:55

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Thu, Nov 24, 2016 at 11:58:17PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 24, 2016 at 11:11:34AM -0700, Logan Gunthorpe wrote:
> > * Regular DAX in the FS doesn't work at this time because the FS can
> > move the file you think you're transferring to out from under you. Though I
> > understand there's been some work with XFS to solve that issue.
>
> The file system will never move anything under locked down pages,
> locking down pages is used exactly to protect against that.

.. And ODP style mmu notifiers work correctly as well, I'd assume.

So this should work with ZONE_DEVICE, if it doesn't it is a filesystem
bug?

> really want a notification to the consumer if the file systems wants
> to remove the mapping. We have implemented that using FL_LAYOUTS locks
> for NFSD, but only XFS supports it so far. Without that a long term
> locked down region of memory (e.g. a kernel MR) would prevent various
> file operations that would simply hang.

So you imagine a signal back to user space asking user space to drop
any RDMA MRs so the FS can relocate things?

Do we need that, or should we encourage people to use either short
lived MRs or ODP MRs when working with scenarios that need FS
relocation?

Jason

2016-11-25 19:50:14

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 2016-11-25 02:34 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:
>
>> b) Allocation may not have CPU address at all - only GPU one.
> But you don't expect RDMA to work in that case, right?
>
> GPU people need to stop doing this windowed memory stuff :)
GPU could perfectly access all VRAM. It is only an issue for p2p without
special interconnect and CPU access. Strictly speaking as long as we
have "bus address" we could have RDMA but I agree that for
RDMA we could/should(?) always "request" CPU address (I hope that we
could forget about 32-bit application :-)).

BTW/FYI: About CPU access: Some user-level APIs are mainly handle based
so there is no need for CPU access by default.

About "visible" / non-visible VRAM parts: I assume that going
forward we will be able to get rid of it completely as soon as support
for resizable PCI BAR is implemented and/or old/current h/w
becomes obsolete.

2016-11-25 20:19:53

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Fri, Nov 25, 2016 at 02:49:50PM -0500, Serguei Sagalovitch wrote:

> GPU could perfectly access all VRAM. It is only an issue for p2p without
> special interconnect and CPU access. Strictly speaking as long as we
> have "bus address" we could have RDMA but I agree that for
> RDMA we could/should(?) always "request" CPU address (I hope that we
> could forget about 32-bit application :-)).

At least on x86 if you have a bus address you have a CPU address. All
RDMAable VRAM has to be visible in the BAR.

> BTW/FYI: About CPU access: Some user-level APIs are mainly handle based
> so there is no need for CPU access by default.

You mean no need for the memory to be virtually mapped into the
process?

Do you expect to RDMA from this kind of API? How will that work?

Jason

2016-11-25 20:40:28

by Christian König

Subject: Re: Enabling peer to peer device transactions for PCIe devices

Am 25.11.2016 um 20:32 schrieb Jason Gunthorpe:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
> Yes, it is DMA, so this is a valid approach.
>
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
>
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.

Yeah, we have a function to "import" anonymous pages from a CPU pointer
which works exactly that way as well.

We call this "userptr" and it's just a combination of get_user_pages()
on command submission and making sure the returned list of pages stays
valid using a MMU notifier.

The "big" problem with this approach is that it is horribly slow. I mean
seriously horribly slow, so that we actually can't use it for some of the
purposes we wanted to use it for.

> The code moving the page will move it and the next GPU command that
> needs it will refault it in the usual way, just like the CPU would.

And here comes the problem. CPUs do this on a page by page basis, so they
fault only what is needed and everything else gets filled in on demand.
As a result, faulting a page is a relatively lightweight operation.

But for GPU command submission we don't know beforehand which pages might
be accessed, so what we do is walk all possible pages and
make sure all of them are present.

Now as far as I understand it, the I/O subsystem for example assumes that
it can easily change the CPU page tables without much overhead. So for
example when a page can't be modified it is temporarily marked as read-only
AFAIK (you are probably way deeper into this than me, so please confirm).

That absolutely kills any performance for GPU command submissions. We
have use cases where we practically ended up playing ping-pong between
the GPU driver trying to grab the page with get_user_pages() and somebody
else in the kernel marking it read-only.

> This might be much more efficient since it optimizes for the common
> case of unchanging translation tables.

Yeah, completely agree. It works perfectly fine as long as you don't
have two drivers trying to mess with the same page.

> This assumes the commands are fairly short lived of course, the
> expectation of the mmu notifiers is that a flush is reasonably prompt

Correct, this is another problem. GFX command submissions usually don't
take longer than a few milliseconds, but compute command submission can
easily take multiple hours.

I can easily imagine what would happen when kswapd is blocked by a GPU
command submission for an hour or so while the system is under memory
pressure :)

I've been thinking about this problem for about a year now and have been
going in circles for quite a while. So if you have ideas on this, even if
they sound totally crazy, feel free to bring them up.

Cheers,
Christian.
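
The cost pattern described above can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions, not the actual driver code: `UserptrBuffer`, `invalidate()`, `submit()` and `run()` are hypothetical names, `submit()` stands in for the get_user_pages() walk at command submission, and `invalidate()` stands in for the MMU notifier callback.

```python
# Toy model of "userptr": grab pages at submission, revalidate via a notifier.
# All names are hypothetical; this is not the real GPU driver implementation.

class UserptrBuffer:
    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.pages = None        # cached result of the last page walk
        self.page_walks = 0      # total number of pages walked so far

    def invalidate(self):
        """Stand-in for the MMU notifier callback: the cached list is stale."""
        self.pages = None

    def submit(self):
        """Stand-in for command submission: every page must be present."""
        if self.pages is None:
            # Stand-in for get_user_pages(): the driver cannot know which
            # pages the job will touch, so it walks all of them.
            self.pages = list(range(self.num_pages))
            self.page_walks += self.num_pages
        return len(self.pages)


def run(buf, submissions, invalidate_every):
    """Submit repeatedly; some other kernel user invalidates the mapping
    every `invalidate_every` submissions (0 means it never changes)."""
    for i in range(submissions):
        if invalidate_every and i % invalidate_every == 0:
            buf.invalidate()
        buf.submit()
    return buf.page_walks
```

In this model, 100 submissions on a 1024-page buffer cost a single walk of 1024 pages when the translation never changes, and 102400 page visits when something invalidates the mapping before every submission, which is the ping-pong case.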

2016-11-25 20:49:09

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices


On 2016-11-25 03:26 PM, Felix Kuehling wrote:
> On 16-11-25 12:20 PM, Serguei Sagalovitch wrote:
>>> A white list may end up being rather complicated if it has to cover
>>> different CPU generations and system architectures. I feel this is a
>>> decision user space could easily make.
>>>
>>> Logan
>> I agreed that it is better to leave up to user space to check what is
>> working
>> and what is not. I found that write is practically always working but
>> read very
>> often not. Also sometimes system BIOS update could fix the issue.
>>
> But is user mode always aware that P2P is going on or even possible? For
> example you may have a library reading a buffer from a file, but it
> doesn't necessarily know where that buffer is located (system memory,
> VRAM, ...) and it may not know what kind of the device the file is on
> (SATA drive, NVMe SSD, ...). The library will never know if all it gets
> is a pointer and a file descriptor.
>
> The library ends up calling a read system call. Then it would be up to
> the kernel to figure out the most efficient way to read the buffer from
> the file. If supported, it could use P2P between a GPU and NVMe where
> the NVMe device performs a DMA write to VRAM.
>
> If you put the burden of figuring out the P2P details on user mode code,
> I think it will severely limit the use cases that actually take
> advantage of it. You also risk a bunch of different implementations that
> get it wrong half the time on half the systems out there.
>
> Regards,
> Felix
>
>
I agree with you in theory, but I must admit that I do not know how the
kernel could effectively collect all the information without running
pretty complicated tests on each boot-up (if any configuration
changed, including BIOS settings) and on PnP events. Also, to pick the
efficient path the kernel needs to know performance results (which could
also depend on clock / power mode) for read/write between each pair of
devices; for double-buffering it needs to know / detect on which NUMA node
to allocate, etc. Also, a device might be fully configured only on the
first request for access, so initialization sequences may need to
change.
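
To make the bookkeeping concrete, here is a minimal sketch of a per-pair capability table, assuming the results come from some probe that is not shown. `PairCaps`, `P2PTable` and the device names are hypothetical and do not correspond to any existing kernel interface. Reads and writes are tracked separately because, as noted earlier in the thread, writes work far more often than reads.

```python
# Toy per-pair P2P capability table. Names and probe results are made up.
from dataclasses import dataclass


@dataclass(frozen=True)
class PairCaps:
    write_ok: bool   # initiator can DMA-write into the target's memory
    read_ok: bool    # initiator can DMA-read from the target's memory


class P2PTable:
    def __init__(self):
        self._caps = {}

    def record(self, initiator, target, write_ok, read_ok):
        """Store the outcome of a (hypothetical) probe for one ordered pair."""
        self._caps[(initiator, target)] = PairCaps(write_ok, read_ok)

    def path(self, initiator, target, op):
        """Return "p2p" if the probed pair supports `op`, else "staging"."""
        caps = self._caps.get((initiator, target))
        if caps is None:
            return "staging"      # never probed: bounce through system memory
        ok = caps.write_ok if op == "write" else caps.read_ok
        return "p2p" if ok else "staging"
```

The table is keyed by ordered pairs, so a result for NVMe writing to a GPU says nothing about the GPU writing to the NVMe device. A real table would also need to be refreshed on the configuration changes listed above.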

2016-11-25 20:51:23

by Felix Kuehling

Subject: Re: Enabling peer to peer device transactions for PCIe devices


On 16-11-25 03:40 PM, Christian König wrote:
> On 25.11.2016 at 20:32, Jason Gunthorpe wrote:
>> This assumes the commands are fairly short lived of course, the
>> expectation of the mmu notifiers is that a flush is reasonably prompt
>
> Correct, this is another problem. GFX command submissions usually
> don't take longer than a few milliseconds, but compute command
> submission can easily take multiple hours.
>
> I can easily imagine what would happen when kswapd is blocked by a GPU
> command submission for an hour or so while the system is under memory
> pressure :)
>
> I'm thinking on this problem for about a year now and going in circles
> for quite a while. So if you have ideas on this even if they sound
> totally crazy, feel free to come up.

Our GPUs (at least starting with VI) support compute-wave-save-restore
and can swap out compute queues with fairly low latency. Yes, there is
some overhead (both memory usage and time), but it's a fairly regular
thing with our hardware scheduler (firmware, actually) when we need to
preempt running compute queues to update runlists or we overcommit the
hardware queue resources.

Regards,
Felix

2016-11-25 21:00:05

by Felix Kuehling

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 16-11-25 12:20 PM, Serguei Sagalovitch wrote:
>
>> A white list may end up being rather complicated if it has to cover
>> different CPU generations and system architectures. I feel this is a
>> decision user space could easily make.
>>
>> Logan
> I agreed that it is better to leave up to user space to check what is
> working
> and what is not. I found that write is practically always working but
> read very
> often not. Also sometimes system BIOS update could fix the issue.
>
But is user mode always aware that P2P is going on or even possible? For
example you may have a library reading a buffer from a file, but it
doesn't necessarily know where that buffer is located (system memory,
VRAM, ...) and it may not know what kind of the device the file is on
(SATA drive, NVMe SSD, ...). The library will never know if all it gets
is a pointer and a file descriptor.

The library ends up calling a read system call. Then it would be up to
the kernel to figure out the most efficient way to read the buffer from
the file. If supported, it could use P2P between a GPU and NVMe where
the NVMe device performs a DMA write to VRAM.

If you put the burden of figuring out the P2P details on user mode code,
I think it will severely limit the use cases that actually take
advantage of it. You also risk a bunch of different implementations that
get it wrong half the time on half the systems out there.

Regards,
Felix


2016-11-25 21:19:09

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Fri, Nov 25, 2016 at 09:40:10PM +0100, Christian König wrote:

> We call this "userptr" and it's just a combination of get_user_pages() on
> command submission and making sure the returned list of pages stays valid
> using a MMU notifier.

Doesn't that still pin the page?

> The "big" problem with this approach is that it is horrible slow. I mean
> seriously horrible slow so that we actually can't use it for some of the
> purposes we wanted to use it.
>
> >The code moving the page will move it and the next GPU command that
> >needs it will refault it in the usual way, just like the CPU would.
>
> And here comes the problem. CPU do this on a page by page basis, so they
> fault only what needed and everything else gets filled in on demand. This
> results that faulting a page is relatively light weight operation.
>
> But for GPU command submission we don't know which pages might be accessed
> beforehand, so what we do is walking all possible pages and make sure all of
> them are present.

Little confused why this is slow? So you fault the entire user MM into
your page tables at start of day and keep track of it with mmu
notifiers?

> >This might be much more efficient since it optimizes for the common
> >case of unchanging translation tables.
>
> Yeah, completely agree. It works perfectly fine as long as you don't have
> two drivers trying to mess with the same page.

Well, the idea would be to not have the GPU block the other driver
beyond hinting that the page shouldn't be swapped out.

> >This assumes the commands are fairly short lived of course, the
> >expectation of the mmu notifiers is that a flush is reasonably prompt
>
> Correct, this is another problem. GFX command submissions usually don't take
> longer than a few milliseconds, but compute command submission can easily
> take multiple hours.

So, that won't work - you have the same issue as RDMA with work loads
like that.

If you can't somehow fence the hardware then pinning is the only
solution. Felix has the right kind of suggestion for what is needed -
globally stop the GPU, fence the DMA, fix the page tables, and start
it up again. :\

> I can easily imagine what would happen when kswapd is blocked by a GPU
> command submission for an hour or so while the system is under memory
> pressure :)

Right. The advantage of pinning is that it tells the other stuff not to
touch the page and doesn't block it; MMU notifiers have to be able to
block & fence quickly.

> I'm thinking on this problem for about a year now and going in circles for
> quite a while. So if you have ideas on this even if they sound totally
> crazy, feel free to come up.

Well, it isn't a software problem. From what I've seen in this thread
the GPU application requires coherent page table mirroring, so the
only full & complete solution is going to be to actually implement
that somehow in GPU hardware.

Everything else is going to be deeply flawed somehow. Linux just
doesn't have the support for this kind of stuff - and I'm honestly not
sure something better is even possible considering the hardware
constraints....

This doesn't have to be faulting, but really anything that lets you
pause the GPU DMA and reload the page tables.

You might look at trying to use the IOMMU and/or PCI ATS in very new
hardware. IIRC the physical IOMMU hardware can do the fault and fence
and block stuff, but I'm not sure about software support for using the
IOMMU to create coherent user page table mirrors - that is something
Linux doesn't do today. But there is demand for this kind of capability..

Jason
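
The "stop the GPU, fence the DMA, fix the page tables, start it up again" sequence can be sketched as a toy model. Everything here is an assumption for illustration: `ToyGpu` and `invalidate_range()` are hypothetical names, and `preempt()` merely stands in for whatever mechanism the hardware offers to take running work off the device.

```python
# Toy model of handling an invalidation without recoverable GPU page faults.
# Hypothetical names; real hardware and drivers are far more involved.

class ToyGpu:
    def __init__(self, mappings):
        self.page_table = dict(mappings)   # virtual page -> physical page
        self.running = True
        self.log = []

    def preempt(self):
        # Stand-in for saving state and taking queues off the hardware.
        self.running = False
        self.log.append("preempt")

    def fence_dma(self):
        # Stand-in for waiting until in-flight DMA has drained.
        assert not self.running, "fence only makes sense once the GPU is stopped"
        self.log.append("fence")

    def resume(self):
        self.running = True
        self.log.append("resume")


def invalidate_range(gpu, first, last):
    """Stand-in for an MMU notifier callback covering pages first..last."""
    gpu.preempt()
    gpu.fence_dma()
    for vpage in range(first, last + 1):
        gpu.page_table.pop(vpage, None)    # the next job must refault these
    gpu.log.append("unmap")
    gpu.resume()
```

The point of the ordering is that the notifier only waits for the preempt and the fence, not for the submitted command to finish, so its latency does not depend on how long a compute job runs.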

2016-11-25 23:41:33

by Alex Deucher

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Fri, Nov 25, 2016 at 2:34 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:
>
>> b) Allocation may not have CPU address at all - only GPU one.
>
> But you don't expect RDMA to work in the case, right?
>
> GPU people need to stop doing this windowed memory stuff :)
>

Blame 32 bit systems and GPUs with tons of vram :)

I think resizable BARs are finally coming in a useful way so this
should go away soon.

Alex

2016-11-27 14:07:53

by Christian König

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 27.11.2016 at 15:02, Haggai Eran wrote:
> On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
>> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>>
>>>> Like you say below we have to handle short lived in the usual way, and
>>>> that covers basically every device except IB MRs, including the
>>>> command queue on a NVMe drive.
>>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>>> page table to mirror the CPU page table, they usually can't recover from
>>> page faults.
>>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>>> in place while those jobs run (pretty much the same pinning you do for the
>>> DMA).
>> Yes, it is DMA, so this is a valid approach.
>>
>> But, you don't need page faults from the GPU to do proper coherent
>> page table mirroring. Basically when the driver submits the work to
>> the GPU it 'faults' the pages into the CPU and mirror translation
>> table (instead of pinning).
>>
>> Like in ODP, MMU notifiers/HMM are used to monitor for translation
>> changes. If a change comes in the GPU driver checks if an executing
>> command is touching those pages and blocks the MMU notifier until the
>> command flushes, then unfaults the page (blocking future commands) and
>> unblocks the mmu notifier.
> I think blocking mmu notifiers against something that is basically
> controlled by user-space can be problematic. This can block things like
> memory reclaim. If you have user-space access to the device's queues,
> user-space can block the mmu notifier forever.
Really good point.

I think this means the bare minimum if we don't have recoverable page
faults is to have preemption support like Felix described in his answer
as well.

Going to keep that in mind,
Christian.

2016-11-27 14:36:08

by Haggai Eran

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>>
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
>
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
>
> Yes, it is DMA, so this is a valid approach.
>
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
>
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.
I think blocking mmu notifiers against something that is basically
controlled by user-space can be problematic. This can block things like
memory reclaim. If you have user-space access to the device's queues,
user-space can block the mmu notifier forever.

On PeerDirect, we have some kind of a middle-ground solution for pinning
GPU memory. We create a non-ODP MR pointing to VRAM but rely on
user-space and the GPU not to migrate it. If they do, the MR gets
destroyed immediately. This should work on legacy devices without ODP
support, and allows the system to safely terminate a process that
misbehaves. The downside of course is that it cannot transparently
migrate memory but I think for user-space RDMA doing that transparently
requires hardware support for paging, via something like HMM.

...

> I'm hearing most people say ZONE_DEVICE is the way to handle this,
> which means the missing remaining piece for RDMA is some kind of DMA
> core support for p2p address translation..

Yes, this is definitely something we need. I think Will Davis's patches
are a good start.

Another thing I think is that while HMM is good for user-space
applications, for kernel p2p use there is no need for that. Using
ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
pages for the short duration as you wrote above could work fine for
kernel uses in which we can guarantee they are short.

Haggai
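
The middle-ground scheme described above (a non-ODP MR over VRAM that gets destroyed if the peer migrates the memory) can be sketched as a toy model. This is not the PeerDirect API: `ToyPeerMemory`, `ToyMR`, `InvalidatedMR` and the addresses are hypothetical, and the callback merely stands in for whatever notification the peer driver delivers.

```python
# Toy model of a non-ODP MR backed by peer (GPU) memory.
# Hypothetical names and addresses; not the PeerDirect interface.

class InvalidatedMR(Exception):
    pass


class ToyPeerMemory:
    """Stand-in for the GPU driver side that owns the VRAM pages."""

    def __init__(self):
        self._callbacks = []

    def get_pages(self, on_invalidate):
        self._callbacks.append(on_invalidate)
        return [0x1000, 0x2000]            # made-up bus addresses

    def migrate(self):
        # The peer is expected NOT to do this while an MR exists; if it does,
        # every registered user is told its addresses are gone.
        for cb in self._callbacks:
            cb()
        self._callbacks.clear()


class ToyMR:
    def __init__(self, peer):
        self.valid = True
        self.addrs = peer.get_pages(self._invalidate)

    def _invalidate(self):
        self.valid = False
        self.addrs = []

    def post_write(self, index):
        """Stand-in for posting work that targets the MR."""
        if not self.valid:
            raise InvalidatedMR("MR was destroyed by the peer")
        return self.addrs[index]
```

The model also shows the application-side cost of this design: after `migrate()`, every later use of the MR fails and the application has to detect that and register again.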

2016-11-28 05:49:07

by Zhou, David(ChunMing)

Subject: Re: Enabling peer to peer device transactions for PCIe devices

+Qiang, who is working on it.

On 2016-11-27 22:07, Christian König wrote:
> On 27.11.2016 at 15:02, Haggai Eran wrote:
>> On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
>>> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>>>
>>>>> Like you say below we have to handle short lived in the usual way,
>>>>> and
>>>>> that covers basically every device except IB MRs, including the
>>>>> command queue on a NVMe drive.
>>>> Well a problem which wasn't mentioned so far is that while GPUs do
>>>> have a
>>>> page table to mirror the CPU page table, they usually can't recover
>>>> from
>>>> page faults.
>>>> So what we do is making sure that all memory accessed by the GPU
>>>> Jobs stays
>>>> in place while those jobs run (pretty much the same pinning you do
>>>> for the
>>>> DMA).
>>> Yes, it is DMA, so this is a valid approach.
>>>
>>> But, you don't need page faults from the GPU to do proper coherent
>>> page table mirroring. Basically when the driver submits the work to
>>> the GPU it 'faults' the pages into the CPU and mirror translation
>>> table (instead of pinning).
>>>
>>> Like in ODP, MMU notifiers/HMM are used to monitor for translation
>>> changes. If a change comes in the GPU driver checks if an executing
>>> command is touching those pages and blocks the MMU notifier until the
>>> command flushes, then unfaults the page (blocking future commands) and
>>> unblocks the mmu notifier.
>> I think blocking mmu notifiers against something that is basically
>> controlled by user-space can be problematic. This can block things like
>> memory reclaim. If you have user-space access to the device's queues,
>> user-space can block the mmu notifier forever.
> Really good point.
>
> I think this means the bare minimum if we don't have recoverable page
> faults is to have preemption support like Felix described in his
> answer as well.
>
> Going to keep that in mind,
> Christian.

2016-11-28 15:03:43

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 2016-11-27 09:02 AM, Haggai Eran wrote
> On PeerDirect, we have some kind of a middle-ground solution for pinning
> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> user-space and the GPU not to migrate it. If they do, the MR gets
> destroyed immediately. This should work on legacy devices without ODP
> support, and allows the system to safely terminate a process that
> misbehaves. The downside of course is that it cannot transparently
> migrate memory but I think for user-space RDMA doing that transparently
> requires hardware support for paging, via something like HMM.
>
> ...
Maybe I am wrong, but my understanding is that the PeerDirect logic basically
follows the "RDMA register MR" logic, so nothing prevents us from "terminating"
the process in the "MMU notifier" case when we are very low on memory,
which makes it similar to (not worse than) the PeerDirect case.
>> I'm hearing most people say ZONE_DEVICE is the way to handle this,
>> which means the missing remaining piece for RDMA is some kind of DMA
>> core support for p2p address translation..
> Yes, this is definitely something we need. I think Will Davis's patches
> are a good start.
>
> Another thing I think is that while HMM is good for user-space
> applications, for kernel p2p use there is no need for that.
About HMM: I do not think that HMM in its current form would fit the
requirements of the generic P2P transfer case. My understanding is that at
the current stage HMM is good for "caching" system memory
in device memory for fast GPU access, but in the RDMA MR non-ODP case
it will not work because the location of the memory must not
change, so the memory should be allocated directly in PCIe memory.
> Using ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
> pages for the short duration as you wrote above could work fine for
> kernel uses in which we can guarantee they are short.
Potentially there is another issue related to pin/unpin. If the memory may
be used many times, then there is no sense in rebuilding and reprogramming
the s/g tables each time if the location of the memory has not changed.


2016-11-28 16:58:13

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:

> > Like in ODP, MMU notifiers/HMM are used to monitor for translation
> > changes. If a change comes in the GPU driver checks if an executing
> > command is touching those pages and blocks the MMU notifier until the
> > command flushes, then unfaults the page (blocking future commands) and
> > unblocks the mmu notifier.

> I think blocking mmu notifiers against something that is basically
> controlled by user-space can be problematic. This can block things like
> memory reclaim. If you have user-space access to the device's queues,
> user-space can block the mmu notifier forever.

Right, I mentioned that..

> On PeerDirect, we have some kind of a middle-ground solution for pinning
> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> user-space and the GPU not to migrate it. If they do, the MR gets
> destroyed immediately.

That sounds horrible. How can that possibly work? What if the MR is
being used when the GPU decides to migrate? I would not support that
upstream without a lot more explanation..

I know people don't like requiring new hardware, but in this case we
really do need ODP hardware to get all the semantics people want..

> Another thing I think is that while HMM is good for user-space
> applications, for kernel p2p use there is no need for that. Using

From what I understand we are not really talking about kernel p2p,
everything proposed so far is being mediated by a userspace VMA, so
I'd focus on making that work.

Jason

2016-11-28 18:20:47

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 28/11/16 09:57 AM, Jason Gunthorpe wrote:
>> On PeerDirect, we have some kind of a middle-ground solution for pinning
>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>> user-space and the GPU not to migrate it. If they do, the MR gets
>> destroyed immediately.
>
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate? I would not support that
> upstream without a lot more explanation..

Yup, this was our experience when playing around with PeerDirect. There
was nothing we could do if the GPU decided to invalidate the P2P
mapping. It just meant the application would fail or need complicated
logic to detect this and redo just about everything. And given that it
was a reasonably rare occurrence during development it probably means
not a lot of applications will be developed to handle it and most would
end up being randomly broken in environments with memory pressure.

Logan

2016-11-28 18:36:21

by Haggai Eran

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, 2016-11-28 at 09:48 -0500, Serguei Sagalovitch wrote:
> On 2016-11-27 09:02 AM, Haggai Eran wrote
> >
> > On PeerDirect, we have some kind of a middle-ground solution for
> > pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately. This should work on legacy devices without
> > ODP
> > support, and allows the system to safely terminate a process that
> > misbehaves. The downside of course is that it cannot transparently
> > migrate memory but I think for user-space RDMA doing that
> > transparently
> > requires hardware support for paging, via something like HMM.
> >
> > ...
> May be I am wrong but my understanding is that PeerDirect logic
> basically
> follow "RDMA register MR" logic
Yes. The only difference from regular MRs is the invalidation process I
mentioned, and the fact that we get the addresses not from
get_user_pages but from a peer driver.

> so basically nothing prevent to "terminate"
> process for "MMU notifier" case when we are very low on memory
> not making it similar (not worse) then PeerDirect case.
I'm not sure I understand. I don't think any solution prevents
terminating an application. The paragraph above is just trying to
explain how a non-ODP device/MR can handle an invalidation.

> > > I'm hearing most people say ZONE_DEVICE is the way to handle this,
> > > which means the missing remaining piece for RDMA is some kind of DMA
> > > core support for p2p address translation..
> > Yes, this is definitely something we need. I think Will Davis's
> > patches
> > are a good start.
> >
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that.
> About HMM: I do not think that in the current form HMM would fit in
> requirement for generic P2P transfer case. My understanding is that at
> the current stage HMM is good for "caching" system memory
> in device memory for fast GPU access but in RDMA MR non-ODP case
> it will not work because the location of memory should not be
> changed so memory should be allocated directly in PCIe memory.
The way I see it there are two ways to handle non-ODP MRs. Either you
prevent the GPU from migrating / reusing the MR's VRAM pages for as long
as the MR is alive (if I understand correctly you didn't like this
solution), or you allow the GPU to somehow notify the HCA to invalidate
the MR. If you do that, you can use mmu notifiers or HMM or something
else, but HMM provides a nice framework to facilitate that notification.

> >
> > Using ZONE_DEVICE with or without something like DMA-BUF to pin and
> > unpin
> > pages for the short duration as you wrote above could work fine for
> > kernel uses in which we can guarantee they are short.
> Potentially there is another issue related to pin/unpin. If memory
> could
> be used a lot of time then there is no sense to rebuild and program
> s/g tables each time if location of memory was not changed.
Is this about the kernel use or user-space? In user-space I think the MR
concept captures a long-lived s/g table so you don't need to rebuild it
(unless the mapping changes).

Haggai
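
The idea that an MR holds a long-lived s/g table which is rebuilt only when the mapping changes can be sketched as follows. This is a toy model under stated assumptions: `ToyBuffer`, `SgCache`, `build_sg()` and the generation counter are hypothetical, and page numbers stand in for bus addresses.

```python
# Toy model of a cached scatter/gather table. Hypothetical names throughout.

def build_sg(pages):
    """Merge consecutive page numbers into (first_page, page_count) runs."""
    runs = []
    for p in pages:
        if runs and runs[-1][0] + runs[-1][1] == p:
            runs[-1][1] += 1               # extends the previous run
        else:
            runs.append([p, 1])            # starts a new run
    return [tuple(r) for r in runs]


class ToyBuffer:
    def __init__(self, pages):
        self.pages = list(pages)
        self.generation = 0                # bumped whenever the pages move

    def move(self, pages):
        self.pages = list(pages)
        self.generation += 1


class SgCache:
    def __init__(self):
        self._table = None
        self._generation = None
        self.builds = 0                    # how often the table was rebuilt

    def get(self, buf):
        """Return the s/g table, rebuilding it only if the buffer moved."""
        if self._table is None or self._generation != buf.generation:
            self._table = build_sg(buf.pages)
            self._generation = buf.generation
            self.builds += 1
        return self._table
```

Repeated `get()` calls on an unmoved buffer reuse the same table; only a `move()` forces a rebuild, which matches the "unless the mapping changes" condition above.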

2016-11-28 19:03:02

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
> > > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > > user-space and the GPU not to migrate it. If they do, the MR gets
> > > destroyed immediately.
> > That sounds horrible. How can that possibly work? What if the MR is
> > being used when the GPU decides to migrate?
> Naturally this doesn't support migration. The GPU is expected to pin
> these pages as long as the MR lives. The MR invalidation is done only as
> a last resort to keep system correctness.

That just forces applications to handle horrible unexpected
failures. If this sort of thing is needed for correctness then OOM
kill the offending process, don't corrupt its operation.

> I think it is similar to how non-ODP MRs rely on user-space today to
> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
> non-ODP MR's pages, you can still get yourself into a data corruption
> situation (HCA sees one page and the process sees another for the same
> virtual address). The pinning that we use only guarantees the HCA's page
> won't be reused.

That is not really data corruption - the data still goes where it was
originally destined. That is an application violating the
requirements of a MR. An application cannot munmap/mremap a VMA
while a non ODP MR points to it and then keep using the MR.

That is totally different from a GPU driver wanting to mess with
translation to physical pages.

> > From what I understand we are not really talking about kernel p2p,
> > everything proposed so far is being mediated by a userspace VMA, so
> > I'd focus on making that work.

> Fair enough, although we will need both eventually, and I hope the
> infrastructure can be shared to some degree.

What use case do you see for in kernel?

Presumably in-kernel could use a vmap or something and the same basic
flow?

Jason
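
The madvise(MADV_DONTNEED) case discussed above can be walked through with a toy model. It is only an illustration under simplifying assumptions: `ToyProcess` and `PinnedMR` are hypothetical, page tables are plain dictionaries, and a freshly faulted page is modelled as empty bytes. No real pinning or DMA takes place.

```python
# Toy model: a pinned page survives MADV_DONTNEED, but the process no longer
# maps it. Hypothetical names; not how the kernel represents any of this.

class ToyProcess:
    def __init__(self):
        self.page_table = {}     # virtual page -> physical page
        self.memory = {}         # physical page -> contents
        self._next_phys = 0

    def touch(self, vpage):
        """Fault the page in if it is not mapped; return the physical page."""
        if vpage not in self.page_table:
            self.page_table[vpage] = self._next_phys
            self.memory[self._next_phys] = b""
            self._next_phys += 1
        return self.page_table[vpage]

    def write(self, vpage, data):
        self.memory[self.touch(vpage)] = data

    def read(self, vpage):
        return self.memory[self.touch(vpage)]

    def dontneed(self, vpage):
        # Drop the mapping. The pinned physical page is neither freed nor
        # reused in this model, so its contents stay in self.memory.
        self.page_table.pop(vpage, None)


class PinnedMR:
    def __init__(self, proc, vpage):
        self.proc = proc
        self.phys = proc.touch(vpage)      # translation fixed at registration

    def dma_write(self, data):
        self.proc.memory[self.phys] = data
```

After `dontneed()`, the MR still writes to the page it was registered with, so the data goes where it was originally destined, while the process faults in a different page for the same virtual address and never sees that data.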

2016-11-28 19:35:30

by Sagalovitch, Serguei

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 2016-11-28 01:20 PM, Logan Gunthorpe wrote:
>
> On 28/11/16 09:57 AM, Jason Gunthorpe wrote:
>>> On PeerDirect, we have some kind of a middle-ground solution for pinning
>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>>> user-space and the GPU not to migrate it. If they do, the MR gets
>>> destroyed immediately.
>> That sounds horrible. How can that possibly work? What if the MR is
>> being used when the GPU decides to migrate? I would not support that
>> upstream without a lot more explanation..
> Yup, this was our experience when playing around with PeerDirect. There
> was nothing we could do if the GPU decided to invalidate the P2P
> mapping.
As soon as the PeerDirect mapping is called, the GPU must not "move"
such memory. This is by PeerDirect design. It is similar to how it works
with system memory and an RDMA MR: when "get_user_pages" is called, the
memory is pinned.

2016-11-28 19:54:05

by Haggai Eran

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, 2016-11-28 at 09:57 -0700, Jason Gunthorpe wrote:
> On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:
> > I think blocking mmu notifiers against something that is basically
> > controlled by user-space can be problematic. This can block things like
> > memory reclaim. If you have user-space access to the device's queues,
> > user-space can block the mmu notifier forever.
> Right, I mentioned that..
Sorry, I must have missed it.

> > On PeerDirect, we have some kind of a middle-ground solution for pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately.
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate?
Naturally this doesn't support migration. The GPU is expected to pin
these pages as long as the MR lives. The MR invalidation is done only as
a last resort to keep system correctness.

I think it is similar to how non-ODP MRs rely on user-space today to
keep them correct. If you do something like madvise(MADV_DONTNEED) on a
non-ODP MR's pages, you can still get yourself into a data corruption
situation (HCA sees one page and the process sees another for the same
virtual address). The pinning that we use only guarantees the HCA's page
won't be reused.

> I would not support that
> upstream without a lot more explanation..
>
> I know people don't like requiring new hardware, but in this case we
> really do need ODP hardware to get all the semantics people want..
>
> >
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that. Using
> From what I understand we are not really talking about kernel p2p,
> everything proposed so far is being mediated by a userspace VMA, so
> I'd focus on making that work.
Fair enough, although we will need both eventually, and I hope the
infrastructure can be shared to some degree.
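
A minimal user-space sketch of the process-side half of the
MADV_DONTNEED scenario described above (Linux only). It involves no HCA
and no MR, and the function name is made up for illustration. It only
shows that after madvise(MADV_DONTNEED) on a private anonymous mapping
the process sees a fresh zero-filled page at the same virtual address; a
device that had pinned the old page would still be looking at the old
one.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the page reads back as zeros after
 * MADV_DONTNEED, 0 if the old contents survived, -1 on setup failure.
 */
int dontneed_replaces_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Fill the page so it is backed by a real physical page. */
    memset(p, 0xAB, len);

    /* Drop the backing page; the virtual address stays mapped. */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* The next access faults in a new zero-filled page. */
    int zeroed = (p[0] == 0 && p[len - 1] == 0);
    munmap(p, len);
    return zeroed;
}
```

With a non-ODP MR registered on that range, the HCA would keep writing
to the original page while the CPU reads the new one, which is the
mismatch both sides of this exchange are describing.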

2016-11-28 21:37:02

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices


On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:
> As soon as PeerDirect mapping is called then GPU must not "move" the
> such memory. It is by PeerDirect design. It is similar how it is works
> with system memory and RDMA MR: when "get_user_pages" is called then the
> memory is pinned.

We haven't touched this in a long time and perhaps it has changed, but
there definitely was a callback in the PeerDirect API to allow the GPU to
invalidate the mapping. That's what we don't want.

Logan

2016-11-28 21:55:44

by Sagalovitch, Serguei

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices


On 2016-11-28 04:36 PM, Logan Gunthorpe wrote:
> On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:
>> As soon as PeerDirect mapping is called then GPU must not "move" the
>> such memory. It is by PeerDirect design. It is similar how it is works
>> with system memory and RDMA MR: when "get_user_pages" is called then the
>> memory is pinned.
> We haven't touch this in a long time and perhaps it changed, but there
> definitely was a call back in the PeerDirect API to allow the GPU to
> invalidate the mapping. That's what we don't want.
I assume that you are talking about the "invalidate_peer_memory()" callback?
I was told that it is the "last resort" because the HCA (and driver) is not
able to handle it in a safe manner, so it is basically an "abort" of
everything.

2016-11-28 22:25:14

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Nov 28, 2016 at 04:55:23PM -0500, Serguei Sagalovitch wrote:

> >We haven't touch this in a long time and perhaps it changed, but there
> >definitely was a call back in the PeerDirect API to allow the GPU to
> >invalidate the mapping. That's what we don't want.

> I assume that you are talking about "invalidate_peer_memory()' callback?
> I was told that it is the "last resort" because HCA (and driver) is not
> able to handle it in the safe manner so it is basically "abort" everything.

If it is a last resort to save system stability then kill the impacted
process, that will release the MRs.

Jason

2016-11-30 10:46:23

by Haggai Eran

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
>>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>>>> user-space and the GPU not to migrate it. If they do, the MR gets
>>>> destroyed immediately.
>>> That sounds horrible. How can that possibly work? What if the MR is
>>> being used when the GPU decides to migrate?
>> Naturally this doesn't support migration. The GPU is expected to pin
>> these pages as long as the MR lives. The MR invalidation is done only as
>> a last resort to keep system correctness.
>
> That just forces applications to handle horrible unexpected
> failures. If this sort of thing is needed for correctness then OOM
> kill the offending process, don't corrupt its operation.
Yes, that sounds fine. Can we simply kill the process from the GPU driver?
Or do we need to extend the OOM killer to manage GPU pages?

>
>> I think it is similar to how non-ODP MRs rely on user-space today to
>> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
>> non-ODP MR's pages, you can still get yourself into a data corruption
>> situation (HCA sees one page and the process sees another for the same
>> virtual address). The pinning that we use only guarantees the HCA's page
>> won't be reused.
>
> That is not really data corruption - the data still goes where it was
> originally destined. That is an application violating the
> requirements of a MR.
I guess it is a matter of terminology. If you compare it to the ODP case
or the CPU case then you usually expect a single virtual address to map to
a single physical page. Violating this causes some of your writes to be dropped,
which is data corruption in my book, even if the application caused it.

> An application cannot munmap/mremap a VMA
> while a non ODP MR points to it and then keep using the MR.
Right. And it is perfectly fine to have some similar requirements from the application
when doing peer to peer with a non-ODP MR.

> That is totally different from a GPU driver wanting to mess with
> translation to physical pages.
>
>>> From what I understand we are not really talking about kernel p2p,
>>> everything proposed so far is being mediated by a userspace VMA, so
>>> I'd focus on making that work.
>
>> Fair enough, although we will need both eventually, and I hope the
>> infrastructure can be shared to some degree.
>
> What use case do you see for in kernel?
Two cases I can think of are RDMA access to an NVMe device's controller
memory buffer, and O_DIRECT operations that access GPU memory.
Also, HMM's migration between two GPUs could use peer to peer in the kernel,
although that is intended to be handled by the GPU driver if I understand
correctly.

> Presumably in-kernel could use a vmap or something and the same basic
> flow?
I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
of MMIO pfns, and ZONE_DEVICE allows that.

Haggai

2016-11-30 16:24:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:

> > That just forces applications to handle horrible unexpected
> > failures. If this sort of thing is needed for correctness then OOM
> > kill the offending process, don't corrupt its operation.

> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
> Or do we need to extend the OOM killer to manage GPU pages?

I don't know..

> >>> From what I understand we are not really talking about kernel p2p,
> >>> everything proposed so far is being mediated by a userspace VMA, so
> >>> I'd focus on making that work.
> >
> >> Fair enough, although we will need both eventually, and I hope the
> >> infrastructure can be shared to some degree.
> >
> > What use case do you see for in kernel?

> Two cases I can think of are RDMA access to an NVMe device's controller
> memory buffer,

I'm not sure on the use model there..

> and O_DIRECT operations that access GPU memory.

This goes through user space so there is still a VMA..

> Also, HMM's migration between two GPUs could use peer to peer in the
> kernel, although that is intended to be handled by the GPU driver if
> I understand correctly.

Hum, presumably these migrations are VMA backed as well...

> > Presumably in-kernel could use a vmap or something and the same basic
> > flow?
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
> of MMIO pfns, and ZONE_DEVICE allows that.

Well, if there is no virtual map then we are back to how do you do
migrations and other things people seem to want to do on these
pages. Maybe the loose 'struct page' flow is not for those users.

But I think if you want kGPU or similar then you probably need vmaps
or something similar to represent the GPU pages in kernel memory.

Jason

2016-11-30 17:29:36

by Sagalovitch, Serguei

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:
>> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
>> Or do we need to extend the OOM killer to manage GPU pages?
> I don't know..
We could use send_sig_info to send a signal from the kernel to user space.
So theoretically the GPU driver could issue a KILL signal to some process.

> On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:
>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
>> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
>> of MMIO pfns, and ZONE_DEVICE allows that.
I do not think that using DMA-API as it is is the best solution (at
least in the current form):

- It deals with handles/fd for the whole allocation, but a client
  could/will use sub-allocations, and it is also theoretically possible
  to "merge" several allocations into one from the GPU's perspective.
- It requires knowledge of what to export, but because "sharing" is
  controlled from user space, it means that we must "export" all
  allocations by default.
- It deals with 'fd'/handles, but a user application may work with
  addresses/pointers.

Also, the current DMA-API forces all DMA table programming to be redone
each time, regardless of whether the location has changed. With vma / mmu
we are able to install a notifier to intercept changes in location and
update translation tables only as needed (we do not need to keep the
get_user_pages() lock).
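
The "update translation tables only as needed" idea can be modelled in
plain C. This is a toy model, not kernel code: every name below
(xlate_entry, on_invalidate, get_bus_addr) is hypothetical, and the
"location" is simply passed in by the caller instead of being looked up
through page tables or an IOMMU. It only illustrates that a
notifier-style invalidation lets the expensive programming step be
skipped while nothing has moved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one cached virtual-to-bus translation. */
struct xlate_entry {
    uintptr_t vaddr;     /* user virtual address of the buffer */
    uint64_t  bus_addr;  /* address currently programmed into the device */
    bool      valid;     /* cleared by the notifier when the page moves */
    unsigned  programs;  /* how often the tables were (re)programmed */
};

/* Stand-in for the expensive step of programming DMA tables. */
static void program_tables(struct xlate_entry *e, uint64_t location)
{
    e->bus_addr = location;
    e->valid = true;
    e->programs++;
}

/* Notifier-style callback: mark the entry stale if [start, end) covers it. */
void on_invalidate(struct xlate_entry *e, uintptr_t start, uintptr_t end)
{
    if (e->vaddr >= start && e->vaddr < end)
        e->valid = false;
}

/* Called before each DMA: reprogram only if the entry went stale. */
uint64_t get_bus_addr(struct xlate_entry *e, uint64_t location)
{
    if (!e->valid)
        program_tables(e, location);
    return e->bus_addr;
}
```

Two back-to-back transfers with no invalidation in between program the
tables once; an invalidation that covers the buffer forces exactly one
reprogramming on the next transfer, and one that misses it forces none.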

2016-11-30 17:45:28

by Deucher, Alexander

[permalink] [raw]
Subject: RE: Enabling peer to peer device transactions for PCIe devices

> -----Original Message-----
> From: Haggai Eran [mailto:[email protected]]
> Sent: Wednesday, November 30, 2016 5:46 AM
> To: Jason Gunthorpe
> Cc: [email protected]; [email protected]; linux-
> [email protected]; Koenig, Christian; Suthikulpanit, Suravee; Bridgman,
> John; Deucher, Alexander; [email protected];
> [email protected]; [email protected]; dri-
> [email protected]; Max Gurtovoy; [email protected];
> Sagalovitch, Serguei; Blinzer, Paul; Kuehling, Felix; Sander, Ben
> Subject: Re: Enabling peer to peer device transactions for PCIe devices
>
> On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
> >>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> >>>> user-space and the GPU not to migrate it. If they do, the MR gets
> >>>> destroyed immediately.
> >>> That sounds horrible. How can that possibly work? What if the MR is
> >>> being used when the GPU decides to migrate?
> >> Naturally this doesn't support migration. The GPU is expected to pin
> >> these pages as long as the MR lives. The MR invalidation is done only as
> >> a last resort to keep system correctness.
> >
> > That just forces applications to handle horrible unexpected
> > failures. If this sort of thing is needed for correctness then OOM
> > kill the offending process, don't corrupt its operation.
> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
> Or do we need to extend the OOM killer to manage GPU pages?

Christian sent out an RFC patch a while back that extended the OOM killer to cover memory allocated for the GPU:
https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html

Alex

>
> >
> >> I think it is similar to how non-ODP MRs rely on user-space today to
> >> keep them correct. If you do something like madvise(MADV_DONTNEED)
> on a
> >> non-ODP MR's pages, you can still get yourself into a data corruption
> >> situation (HCA sees one page and the process sees another for the same
> >> virtual address). The pinning that we use only guarantees the HCA's page
> >> won't be reused.
> >
> > That is not really data corruption - the data still goes where it was
> > originally destined. That is an application violating the
> > requirements of a MR.
> I guess it is a matter of terminology. If you compare it to the ODP case
> or the CPU case then you usually expect a single virtual address to map to
> a single physical page. Violating this cause some of your writes to be dropped
> which is a data corruption in my book, even if the application caused it.
>
> > An application cannot munmap/mremap a VMA
> > while a non ODP MR points to it and then keep using the MR.
> Right. And it is perfectly fine to have some similar requirements from the
> application
> when doing peer to peer with a non-ODP MR.
>
> > That is totally different from a GPU driver wanting to mess with
> > translation to physical pages.
> >
> >>> From what I understand we are not really talking about kernel p2p,
> >>> everything proposed so far is being mediated by a userspace VMA, so
> >>> I'd focus on making that work.
> >
> >> Fair enough, although we will need both eventually, and I hope the
> >> infrastructure can be shared to some degree.
> >
> > What use case do you see for in kernel?
> Two cases I can think of are RDMA access to an NVMe device's controller
> memory buffer, and O_DIRECT operations that access GPU memory.
> Also, HMM's migration between two GPUs could use peer to peer in the
> kernel,
> although that is intended to be handled by the GPU driver if I understand
> correctly.
>
> > Presumably in-kernel could use a vmap or something and the same basic
> > flow?
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API
> support
> for peer to peer. I'm not sure we need vmap. We need a way to have a
> scatterlist
> of MMIO pfns, and ZONE_DEVICE allows that.
>
> Haggai

2016-11-30 18:02:26

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 30/11/16 09:23 AM, Jason Gunthorpe wrote:
>> Two cases I can think of are RDMA access to an NVMe device's controller
>> memory buffer,
>
> I'm not sure on the use model there..

The NVMe fabrics stuff could probably make use of this. It's an
in-kernel system to allow remote access to an NVMe device over RDMA. So
they ought to be able to optimize their transfers by DMAing directly to
the NVMe's CMB -- no userspace interface would be required but there
would need to be some kernel infrastructure.

Logan

2016-12-04 11:11:17

by Haggai Eran

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 11/30/2016 7:28 PM, Serguei Sagalovitch wrote:
> On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:
>>> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
>>> Or do we need to extend the OOM killer to manage GPU pages?
>> I don't know..
> We could use send_sig_info to send signal from kernel to user space. So theoretically GPU driver
> could issue KILL signal to some process.
>
>> On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:
>>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
>>> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
>>> of MMIO pfns, and ZONE_DEVICE allows that.
> I do not think that using DMA-API as it is is the best solution (at least in the current form):
>
> - It deals with handles/fd for the whole allocation but client could/will use sub-allocation as
> well as theoretically possible to "merge" several allocations in one from GPU perspective.
> - It require knowledge to export but because "sharing" is controlled from user space it
> means that we must "export" all allocation by default
> - It deals with 'fd'/handles but user application may work with addresses/pointers.

Aren't you confusing DMABUF and DMA-API? DMA-API is how you program the IOMMU (dma_map_page/dma_map_sg/etc.).
The comment above is just about the need to extend this API to allow mapping peer device pages to bus addresses.

In the past I sent an RFC for using DMABUF for peer to peer. I think it had some
advantages for legacy devices. I agree that working with addresses and pointers through
something like HMM/ODP is much more flexible and easier to program from user-space.
For legacy, DMABUF would have allowed you a way to pin the pages so the GPU knows not to
move them. However, that can probably also be achieved simply via the reference count
on ZONE_DEVICE pages. The other nice thing about DMABUF is that it migrates the buffer
itself during attachment according to the requirements of the device that is attaching,
so you can automatically decide in the exporter whether to use p2p or a staging buffer.

>
> Also current DMA-API force each time to do all DMA table programming unrelated if
> location was changed or not. With vma / mmu we are able to install notifier to intercept
> changes in location and update translation tables only as needed (we do not need to keep
> get_user_pages() lock).
I agree.

2016-12-04 12:13:44

by Haggai Eran

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 11/30/2016 8:01 PM, Logan Gunthorpe wrote:
>
>
> On 30/11/16 09:23 AM, Jason Gunthorpe wrote:
>>> Two cases I can think of are RDMA access to an NVMe device's controller
>>> memory buffer,
>>
>> I'm not sure on the use model there..
>
> The NVMe fabrics stuff could probably make use of this. It's an
> in-kernel system to allow remote access to an NVMe device over RDMA. So
> they ought to be able to optimize their transfers by DMAing directly to
> the NVMe's CMB -- no userspace interface would be required but there
> would need some kernel infrastructure.

Yes, that's what I was thinking. The NVMe/f driver needs to map the CMB for
RDMA. I guess if it used ZONE_DEVICE like in the iopmem patches it would be
relatively easy to do.

2016-12-04 13:09:03

by Stephen Bates

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

>>
>> The NVMe fabrics stuff could probably make use of this. It's an
>> in-kernel system to allow remote access to an NVMe device over RDMA. So
>> they ought to be able to optimize their transfers by DMAing directly to
>> the NVMe's CMB -- no userspace interface would be required but there
>> would need some kernel infrastructure.
>
> Yes, that's what I was thinking. The NVMe/f driver needs to map the CMB
> for RDMA. I guess if it used ZONE_DEVICE like in the iopmem patches it
> would be relatively easy to do.
>

Haggai, yes that was one of the use cases we considered when we put
together the patchset.

2016-12-04 13:30:14

by Haggai Eran

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 11/30/2016 6:23 PM, Jason Gunthorpe wrote:
>> and O_DIRECT operations that access GPU memory.
> This goes through user space so there is still a VMA..
>
>> Also, HMM's migration between two GPUs could use peer to peer in the
>> kernel, although that is intended to be handled by the GPU driver if
>> I understand correctly.
> Hum, presumably these migrations are VMA backed as well...
I guess so.

>>> Presumably in-kernel could use a vmap or something and the same basic
>>> flow?
>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
>> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
>> of MMIO pfns, and ZONE_DEVICE allows that.
> Well, if there is no virtual map then we are back to how do you do
> migrations and other things people seem to want to do on these
> pages. Maybe the loose 'struct page' flow is not for those users.
I was thinking that kernel use cases would disallow migration, similar to how
non-ODP MRs would work. Either they are short-lived (like an O_DIRECT transfer)
or they can be long-lived but non-migratable (like perhaps a CMB staging buffer).

> But I think if you want kGPU or similar then you probably need vmaps
> or something similar to represent the GPU pages in kernel memory.
Right, although sometimes the GPU pages are simply inaccessible to the CPU.
In any case, I haven't thought about kGPU as a use-case.

2016-12-04 13:47:34

by Stephen Bates

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

Hi All

This has been a great thread (thanks to Alex for kicking it off) and I
wanted to jump in and maybe try and put some summary around the
discussion. I also wanted to propose we include this as a topic for LFS/MM
because I think we need more discussion on the best way to add this
functionality to the kernel.

As far as I can tell the people looking for P2P support in the kernel fall
into two main camps:

1. Those who simply want to expose static BARs on PCIe devices that can be
used as the source/destination for DMAs from another PCIe device. This
group has no need for memory invalidation and is happy to use
physical/bus addresses and not virtual addresses.

2. Those who want to support devices that suffer from occasional memory
pressure and need to invalidate memory regions from time to time. This
camp also would like to use virtual addresses rather than physical ones to
allow for things like migration.

I am wondering if people agree with this assessment?

I think something like the iopmem patches Logan and I submitted recently
come close to addressing use case 1. There are some issues around
routability but based on feedback to date that does not seem to be a
show-stopper for an initial inclusion.

For use-case 2 it looks like there are several options and some of them
(like HMM) have been around for quite some time without gaining
acceptance. I think there needs to be more discussion on this use case and
it could be some time before we get something upstreamable.

I for one, would really like to see use case 1 get addressed soon because
we have consumers for it coming soon in the form of CMBs for NVMe devices.

Long term I think Jason summed it up really well. CPU vendors will put
high-speed, open, switchable, coherent buses on their processors and all
these problems will vanish. But I ain't holding my breath for that to
happen ;-).

Cheers

Stephen

2016-12-05 17:18:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Sun, Dec 04, 2016 at 07:23:00AM -0600, Stephen Bates wrote:
> Hi All
>
> This has been a great thread (thanks to Alex for kicking it off) and I
> wanted to jump in and maybe try and put some summary around the
> discussion. I also wanted to propose we include this as a topic for LFS/MM
> because I think we need more discussion on the best way to add this
> functionality to the kernel.
>
> As far as I can tell the people looking for P2P support in the kernel fall
> into two main camps:
>
> 1. Those who simply want to expose static BARs on PCIe devices that can be
> used as the source/destination for DMAs from another PCIe device. This
> group has no need for memory invalidation and are happy to use
> physical/bus addresses and not virtual addresses.

I didn't think there was much on this topic except for the CMB
thing.. Even that is really a mapped kernel address..

> I think something like the iopmem patches Logan and I submitted recently
> come close to addressing use case 1. There are some issues around
> routability but based on feedback to date that does not seem to be a
> show-stopper for an initial inclusion.

If it is kernel only with physical addresses we don't need a uAPI for
it, so I'm not sure #1 is at all related to iopmem.

Most people who want #1 probably can just mmap
/sys/../pci/../resourceX to get a user handle to it, or pass around
__iomem pointers in the kernel. This has been asked for before with
RDMA.

I'm still not really clear what iopmem is for, or why DAX should ever
be involved in this..

> For use-case 2 it looks like there are several options and some of them
> (like HMM) have been around for quite some time without gaining
> acceptance. I think there needs to be more discussion on this usecase and
> it could be some time before we get something upstreamable.

AFAIK, hmm makes parts easier, but isn't directly addressing this
need..

I think you need to get ZONE_DEVICE accepted for non-cacheable PCI BARs
as the first step.

From there it is pretty clear that the DMA API needs to be updated to
support that use, and work can be done to solve the various problems
there on the basis of using ZONE_DEVICE pages to figure out the
PCI-E end points.

Jason
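
The "just mmap /sys/../pci/../resourceX" route mentioned above can be
sketched in user-space C as follows. The helper name map_pci_resource is
made up, the caller must supply the real sysfs path and BAR length
(neither is filled in here), and opening the file normally needs root.
As noted later in the thread, such a mapping is not by itself
DMA-capable; this only yields a CPU mapping of the BAR.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of the file at `path` (for
 * example a PCI resourceN file in sysfs) read/write and shared.
 * Returns NULL on failure.
 */
void *map_pci_resource(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
```

Accesses through the returned pointer should go through a volatile
pointer so the compiler does not merge or drop them, and the region is
released with munmap() when done.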

2016-12-05 17:40:45

by Dan Williams

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Dec 5, 2016 at 9:18 AM, Jason Gunthorpe
<[email protected]> wrote:
> On Sun, Dec 04, 2016 at 07:23:00AM -0600, Stephen Bates wrote:
>> Hi All
>>
>> This has been a great thread (thanks to Alex for kicking it off) and I
>> wanted to jump in and maybe try and put some summary around the
>> discussion. I also wanted to propose we include this as a topic for LFS/MM
>> because I think we need more discussion on the best way to add this
>> functionality to the kernel.
>>
>> As far as I can tell the people looking for P2P support in the kernel fall
>> into two main camps:
>>
>> 1. Those who simply want to expose static BARs on PCIe devices that can be
>> used as the source/destination for DMAs from another PCIe device. This
>> group has no need for memory invalidation and are happy to use
>> physical/bus addresses and not virtual addresses.
>
> I didn't think there was much on this topic except for the CMB
> thing.. Even that is really a mapped kernel address..
>
>> I think something like the iopmem patches Logan and I submitted recently
>> come close to addressing use case 1. There are some issues around
>> routability but based on feedback to date that does not seem to be a
>> show-stopper for an initial inclusion.
>
> If it is kernel only with physical addresses we don't need a uAPI for
> it, so I'm not sure #1 is at all related to iopmem.
>
> Most people who want #1 probably can just mmap
> /sys/../pci/../resourceX to get a user handle to it, or pass around
> __iomem pointers in the kernel. This has been asked for before with
> RDMA.
>
> I'm still not really clear what iopmem is for, or why DAX should ever
> be involved in this..

Right, by default remap_pfn_range() does not establish DMA capable
mappings. You can think of iopmem as remap_pfn_range() converted to
use devm_memremap_pages(). Given the extra constraints of
devm_memremap_pages() it seems reasonable to have those DMA capable
mappings be optionally established via a separate driver.

2016-12-05 18:02:51

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Dec 05, 2016 at 09:40:38AM -0800, Dan Williams wrote:

> > If it is kernel only with physical addresses we don't need a uAPI for
> > it, so I'm not sure #1 is at all related to iopmem.
> >
> > Most people who want #1 probably can just mmap
> > /sys/../pci/../resourceX to get a user handle to it, or pass around
> > __iomem pointers in the kernel. This has been asked for before with
> > RDMA.
> >
> > I'm still not really clear what iopmem is for, or why DAX should ever
> > be involved in this..
>
> Right, by default remap_pfn_range() does not establish DMA capable
> mappings. You can think of iopmem as remap_pfn_range() converted to
> use devm_memremap_pages(). Given the extra constraints of
> devm_memremap_pages() it seems reasonable to have those DMA capable
> mappings be optionally established via a separate driver.

Except the iopmem driver claims the PCI ID, and presents a block
interface which is really *NOT* what people who have asked for this in
the past have wanted. IIRC it was embedded stuff eg RDMA video
directly out of a capture card or a similar kind of thinking.

It is a good point about devm_memremap_pages limitations, but maybe
that just says to create a /sys/.../resource_dmableX ?

Or is there some reason why people want a filesystem on top of BAR
memory? That does not seem to have been covered yet..

Jason

2016-12-05 18:08:44

by Dan Williams

[permalink] [raw]
Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Dec 5, 2016 at 10:02 AM, Jason Gunthorpe
<[email protected]> wrote:
> On Mon, Dec 05, 2016 at 09:40:38AM -0800, Dan Williams wrote:
>
>> > If it is kernel only with physical addresses we don't need a uAPI for
>> > it, so I'm not sure #1 is at all related to iopmem.
>> >
>> > Most people who want #1 probably can just mmap
>> > /sys/../pci/../resourceX to get a user handle to it, or pass around
>> > __iomem pointers in the kernel. This has been asked for before with
>> > RDMA.
>> >
>> > I'm still not really clear what iopmem is for, or why DAX should ever
>> > be involved in this..
>>
>> Right, by default remap_pfn_range() does not establish DMA capable
>> mappings. You can think of iopmem as remap_pfn_range() converted to
>> use devm_memremap_pages(). Given the extra constraints of
>> devm_memremap_pages() it seems reasonable to have those DMA capable
>> mappings be optionally established via a separate driver.
>
> Except the iopmem driver claims the PCI ID, and presents a block
> interface which is really *NOT* what people who have asked for this in
> the past have wanted. IIRC it was embedded stuff eg RDMA video
> directly out of a capture card or a similar kind of thinking.
>
> It is a good point about devm_memremap_pages limitations, but maybe
> that just says to create a /sys/.../resource_dmableX ?
>
> Or is there some reason why people want a filesystem on top of BAR
> memory? That does not seem to have been covered yet..
>

I've already recommended that iopmem not be a block device and instead
be a device-dax instance. I also don't think it should claim the PCI
ID, rather the driver that wants to map one of its bars this way can
register the memory region with the device-dax core.

I'm not sure there are enough device drivers that want to do this to
have it be a generic /sys/.../resource_dmableX capability. It still
seems to be an exotic one-off type of configuration.

2016-12-05 18:39:30

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On 05/12/16 11:08 AM, Dan Williams wrote:
> I've already recommended that iopmem not be a block device and instead
> be a device-dax instance. I also don't think it should claim the PCI
> ID, rather the driver that wants to map one of its bars this way can
> register the memory region with the device-dax core.
>
> I'm not sure there are enough device drivers that want to do this to
> have it be a generic /sys/.../resource_dmableX capability. It still
> seems to be an exotic one-off type of configuration.

Yes, this is essentially my thinking. Except I think the userspace
interface should really depend on the device itself. Device dax is a
good choice for many and I agree the block device approach wouldn't be
ideal.

Specifically for NVME CMB: I think it would make a lot of sense to just
hand out these mappings with an mmap call on /dev/nvmeX. I expect CMB
buffers would be volatile and thus you wouldn't need to keep track of
where in the BAR the region came from. Thus, the mmap call would just be
an allocator from BAR memory. If device-dax were used, userspace would
need to lookup which device-dax instance corresponds to which nvme drive.
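
A minimal userspace sketch of what "mmap as an allocator" could look like, assuming the nvme driver implemented mmap on its char device (nothing in this thread says it does today). The helper names cmb_map() and cmb_unmap() are hypothetical, and the path passed in would be something like /dev/nvmeX.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes from the device node at `path`.
 * Assumes the driver behind `path` implements mmap and treats each call
 * as an allocation from its BAR. Returns NULL on failure.
 */
static void *cmb_map(const char *path, size_t len)
{
	void *buf;
	int fd;

	assert(path != NULL && len > 0);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	/* MAP_SHARED so stores reach the backing memory, not a private copy. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping stays valid after the descriptor is closed. */
	close(fd);

	return buf == MAP_FAILED ? NULL : buf;
}

/* Dropping the mapping is what would hand the region back. */
static int cmb_unmap(void *buf, size_t len)
{
	return munmap(buf, len);
}
```

With this shape userspace never names an offset in the BAR; it only asks for a length, which is why no lookup of a separate device-dax instance would be needed.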

Logan


2016-12-05 18:49:04

by Dan Williams

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Dec 5, 2016 at 10:39 AM, Logan Gunthorpe <[email protected]> wrote:
> On 05/12/16 11:08 AM, Dan Williams wrote:
>>
>> I've already recommended that iopmem not be a block device and instead
>> be a device-dax instance. I also don't think it should claim the PCI
>> ID, rather the driver that wants to map one of its bars this way can
>> register the memory region with the device-dax core.
>>
>> I'm not sure there are enough device drivers that want to do this to
>> have it be a generic /sys/.../resource_dmableX capability. It still
>> seems to be an exotic one-off type of configuration.
>
>
> Yes, this is essentially my thinking. Except I think the userspace interface
> should really depend on the device itself. Device dax is a good choice for
> many and I agree the block device approach wouldn't be ideal.
>
> Specifically for NVME CMB: I think it would make a lot of sense to just hand
> out these mappings with an mmap call on /dev/nvmeX. I expect CMB buffers
> would be volatile and thus you wouldn't need to keep track of where in the
> BAR the region came from. Thus, the mmap call would just be an allocator
> from BAR memory. If device-dax were used, userspace would need to lookup
> which device-dax instance corresponds to which nvme drive.
>

I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial
to accomplish in sysfs through /sys/dev/char to find the sysfs path of
the device-dax instance under the nvme device, or if you already have
the nvme sysfs path the dax instance(s) will appear under the "dax"
sub-directory.

2016-12-05 19:15:46

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Dec 05, 2016 at 10:48:58AM -0800, Dan Williams wrote:
> On Mon, Dec 5, 2016 at 10:39 AM, Logan Gunthorpe <[email protected]> wrote:
> > On 05/12/16 11:08 AM, Dan Williams wrote:
> >>
> >> I've already recommended that iopmem not be a block device and instead
> >> be a device-dax instance. I also don't think it should claim the PCI
> >> ID, rather the driver that wants to map one of its bars this way can
> >> register the memory region with the device-dax core.
> >>
> >> I'm not sure there are enough device drivers that want to do this to
> >> have it be a generic /sys/.../resource_dmableX capability. It still
> >> seems to be an exotic one-off type of configuration.
> >
> >
> > Yes, this is essentially my thinking. Except I think the userspace interface
> > should really depend on the device itself. Device dax is a good choice for
> > many and I agree the block device approach wouldn't be ideal.
> >
> > Specifically for NVME CMB: I think it would make a lot of sense to just hand
> > out these mappings with an mmap call on /dev/nvmeX. I expect CMB buffers
> > would be volatile and thus you wouldn't need to keep track of where in the
> > BAR the region came from. Thus, the mmap call would just be an allocator
> > from BAR memory. If device-dax were used, userspace would need to lookup
> > which device-dax instance corresponds to which nvme drive.
>
> I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial
> to accomplish in sysfs through /sys/dev/char to find the sysfs path
> of

But CMB sounds much more like the GPU case where there is a
specialized allocator handing out the BAR to consumers, so I'm not
sure a general purpose chardev makes a lot of sense?

Jason

2016-12-05 19:27:35

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 05/12/16 12:14 PM, Jason Gunthorpe wrote:
> But CMB sounds much more like the GPU case where there is a
> specialized allocator handing out the BAR to consumers, so I'm not
> sure a general purpose chardev makes a lot of sense?

I don't think it will ever need to be as complicated as the GPU case.
There will probably only ever be a relatively small amount of memory
behind the CMB and really the only users are those doing P2P work. Thus
the specialized allocator could be pretty simple and I expect it would
be fine to just return -ENOMEM if there is not enough memory.

Also, if it was implemented this way, if there was a need to make the
allocator more complicated it could easily be added later as the
userspace interface is just mmap to obtain a buffer.
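
To illustrate how simple such an allocator could be, here is a rough sketch in plain C of a first-fit allocator over fixed-size chunks that returns -ENOMEM when nothing fits. Everything in it is an assumption for illustration: the names (cmb_pool, cmb_pool_alloc, ...), the 4 KiB chunk size and the 1024-chunk cap. A real in-kernel version would also need locking, which is left out here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define CMB_CHUNK_SIZE 4096u	/* assumed allocation granularity */
#define CMB_MAX_CHUNKS 1024u	/* assumed upper bound on the region */

struct cmb_pool {
	size_t nchunks;			/* chunks actually backed by the BAR */
	bool used[CMB_MAX_CHUNKS];	/* true while a chunk is handed out */
};

static void cmb_pool_init(struct cmb_pool *pool, size_t region_bytes)
{
	size_t i;

	assert(pool != NULL);

	pool->nchunks = region_bytes / CMB_CHUNK_SIZE;
	if (pool->nchunks > CMB_MAX_CHUNKS)
		pool->nchunks = CMB_MAX_CHUNKS;
	for (i = 0; i < CMB_MAX_CHUNKS; i++)
		pool->used[i] = false;
}

/* First-fit search for `len` bytes; returns a byte offset or -errno. */
static long cmb_pool_alloc(struct cmb_pool *pool, size_t len)
{
	size_t need, start, run = 0, i;

	if (len == 0)
		return -EINVAL;
	if (len > pool->nchunks * CMB_CHUNK_SIZE)
		return -ENOMEM;

	need = (len + CMB_CHUNK_SIZE - 1) / CMB_CHUNK_SIZE;

	for (i = 0; i < pool->nchunks; i++) {
		/* Count consecutive free chunks ending at i. */
		run = pool->used[i] ? 0 : run + 1;
		if (run == need) {
			start = i + 1 - need;
			for (i = start; i < start + need; i++)
				pool->used[i] = true;
			return (long)(start * CMB_CHUNK_SIZE);
		}
	}
	return -ENOMEM;
}

/* Return a range previously obtained from cmb_pool_alloc(). */
static void cmb_pool_free(struct cmb_pool *pool, long offset, size_t len)
{
	size_t start = (size_t)offset / CMB_CHUNK_SIZE;
	size_t need = (len + CMB_CHUNK_SIZE - 1) / CMB_CHUNK_SIZE;
	size_t i;

	for (i = start; i < start + need && i < pool->nchunks; i++)
		pool->used[i] = false;
}
```

Since callers only ever see an offset (or, through mmap, a pointer), the search strategy could be swapped for something smarter later without touching the interface.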

Logan

2016-12-05 19:46:32

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Dec 05, 2016 at 12:27:20PM -0700, Logan Gunthorpe wrote:
>
>
> On 05/12/16 12:14 PM, Jason Gunthorpe wrote:
> >But CMB sounds much more like the GPU case where there is a
> >specialized allocator handing out the BAR to consumers, so I'm not
> >sure a general purpose chardev makes a lot of sense?
>
> I don't think it will ever need to be as complicated as the GPU case. There
> will probably only ever be a relatively small amount of memory behind the
> CMB and really the only users are those doing P2P work. Thus the specialized
> allocator could be pretty simple and I expect it would be fine to just
> return -ENOMEM if there is not enough memory.

NVMe might have to deal with pci-e hot-unplug, which is a similar
problem-class to the GPU case..

In any event the allocator still needs to track which regions are in
use and be able to hook 'free' from userspace. That does suggest it
should be integrated into the nvme driver and not a bolt on driver..

Jason

2016-12-05 19:59:46

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices



On 05/12/16 12:46 PM, Jason Gunthorpe wrote:
> NVMe might have to deal with pci-e hot-unplug, which is a similar
> problem-class to the GPU case..

Sure, but if the NVMe device gets hot-unplugged it means that all the
CMB mappings are useless and need to be torn down. This probably means
killing any process that has mappings open.

> In any event the allocator still needs to track which regions are in
> use and be able to hook 'free' from userspace. That does suggest it
> should be integrated into the nvme driver and not a bolt on driver..

Yup, that's correct. And yes, I've never suggested this be a bolt-on
driver -- I always expected it to get integrated into the nvme
driver. (iopmem was not meant for this.)

Logan

2016-12-05 20:07:19

by Christoph Hellwig

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Mon, Dec 05, 2016 at 12:46:14PM -0700, Jason Gunthorpe wrote:
> In any event the allocator still needs to track which regions are in
> use and be able to hook 'free' from userspace. That does suggest it
> should be integrated into the nvme driver and not a bolt on driver..

Two totally different use cases:

- a card that exposes directly byte addressable storage as a PCI-e
bar. Think of it as an nvdimm on a PCI-e card. That's the iopmem
case.
- the NVMe CMB which exposes a byte addressable indirection buffer for
I/O, but does not actually provide byte addressable persistent
storage. This is something that needs to be added to the NVMe driver
(and the block layer for the abstraction probably).

2016-12-06 08:28:45

by Stephen Bates

Subject: Re: Enabling peer to peer device transactions for PCIe devices

>>> I've already recommended that iopmem not be a block device and
>>> instead be a device-dax instance. I also don't think it should claim
>>> the PCI ID, rather the driver that wants to map one of its bars this
>>> way can register the memory region with the device-dax core.
>>>
>>> I'm not sure there are enough device drivers that want to do this to
>>> have it be a generic /sys/.../resource_dmableX capability. It still
>>> seems to be an exotic one-off type of configuration.
>>
>>
>> Yes, this is essentially my thinking. Except I think the userspace
>> interface should really depend on the device itself. Device dax is a
>> good choice for many and I agree the block device approach wouldn't be
>> ideal.

I tend to agree here. The block device interface has seen quite a bit of
resistance and /dev/dax looks like a better approach for most. We can look
at doing it that way in v2.

>>
>> Specifically for NVME CMB: I think it would make a lot of sense to just
>> hand out these mappings with an mmap call on /dev/nvmeX. I expect CMB
>> buffers would be volatile and thus you wouldn't need to keep track of
>> where in the BAR the region came from. Thus, the mmap call would just be
>> an allocator from BAR memory. If device-dax were used, userspace would
>> need to lookup which device-dax instance corresponds to which nvme
>> drive.
>>
>
> I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial
> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> device-dax instance under the nvme device, or if you already have the nvme
> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
>

Personally I think mapping the dax resource in the sysfs tree is a nice
way to do this and a bit more intuitive than mapping a /dev/nvmeX.


2016-12-06 16:39:12

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

> > I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial
> > to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> > device-dax instance under the nvme device, or if you already have the nvme
> > sysfs path the dax instance(s) will appear under the "dax" sub-directory.
>
> Personally I think mapping the dax resource in the sysfs tree is a nice
> way to do this and a bit more intuitive than mapping a /dev/nvmeX.

It is still not at all clear to me what userspace is supposed to do
with this on nvme.. How is the CMB usable from userspace?

Jason

2016-12-06 16:52:12

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

Hey,

On 06/12/16 09:38 AM, Jason Gunthorpe wrote:
>>> I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial
>>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
>>> device-dax instance under the nvme device, or if you already have the nvme
>>> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
>>
>> Personally I think mapping the dax resource in the sysfs tree is a nice
>> way to do this and a bit more intuitive than mapping a /dev/nvmeX.
>
> It is still not at all clear to me what userspace is supposed to do
> with this on nvme.. How is the CMB usable from userspace?

The flow is pretty simple. For example to write to NVMe from an RDMA device:

1) Obtain a chunk of the CMB to use as a buffer (either by mmapping
/dev/nvmeX, the device dax char device or through a block layer interface,
which sounds like a good suggestion from Christoph, but I'm not really
sure how it would look).

2) Create an MR with the buffer and use an RDMA function to fill it with
data from a remote host. This will cause the RDMA hardware to write
directly to the memory in the NVMe card.

3) Using O_DIRECT, write the buffer to a file on the NVMe filesystem.
When the address reaches hardware the NVMe will recognize it as local
memory and copy it directly there.

Thus we are able to transfer data to any file on an NVMe device without
going through system memory. This has benefits on systems with lots of
activity in system memory but step 3 is likely to be slowish due to the
need to pin/unpin the memory for every transaction.
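
For concreteness, step 3 might look roughly like the sketch below from userspace. It only covers the write itself; step 2 needs the RDMA verbs library and is left out. The helper name write_buffer() is made up for illustration, and whether an O_DIRECT write can actually take a buffer that lives in a BAR is exactly the part that needs kernel support.

```c
#define _GNU_SOURCE	/* exposes O_DIRECT in <fcntl.h> on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `len` bytes from `buf` to `path`, starting at offset 0.
 * With `direct` set the file is opened O_DIRECT, so the page cache is
 * bypassed; buffer address and length then generally have to be aligned
 * to the logical block size of the device.
 * Returns 0 on success, -1 on error.
 */
static int write_buffer(const char *path, const void *buf, size_t len,
			bool direct)
{
	int flags = O_WRONLY | O_CREAT;
	const char *p = buf;
	int fd;

	assert(path != NULL && buf != NULL);

#ifdef O_DIRECT
	if (direct)
		flags |= O_DIRECT;
#else
	(void)direct;	/* no O_DIRECT on this platform */
#endif

	fd = open(path, flags, 0600);
	if (fd < 0)
		return -1;

	/* write() may be short, so loop until everything is out. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return close(fd);
}
```

Here `buf` would be the pointer obtained in step 1, already filled by the RDMA hardware in step 2.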

Logan

2016-12-06 17:12:28

by Christoph Hellwig

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Tue, Dec 06, 2016 at 09:38:50AM -0700, Jason Gunthorpe wrote:
> > > I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial
> > > to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> > > device-dax instance under the nvme device, or if you already have the nvme
> > > sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> >
> > Personally I think mapping the dax resource in the sysfs tree is a nice
> > way to do this and a bit more intuitive than mapping a /dev/nvmeX.
>
> It is still not at all clear to me what userspace is supposed to do
> with this on nvme.. How is the CMB usable from userspace?

I don't think trying to expose it to userspace makes any sense.
Exposing it to in-kernel storage targets on the other hand makes a lot
of sense.

2016-12-06 17:29:00

by Jason Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Tue, Dec 06, 2016 at 09:51:15AM -0700, Logan Gunthorpe wrote:
> Hey,
>
> On 06/12/16 09:38 AM, Jason Gunthorpe wrote:
> >>> I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial
> >>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> >>> device-dax instance under the nvme device, or if you already have the nvme
> >>> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> >>
> >> Personally I think mapping the dax resource in the sysfs tree is a nice
> >> way to do this and a bit more intuitive than mapping a /dev/nvmeX.
> >
> > It is still not at all clear to me what userspace is supposed to do
> > with this on nvme.. How is the CMB usable from userspace?
>
> The flow is pretty simple. For example to write to NVMe from an RDMA device:
>
> 1) Obtain a chunk of the CMB to use as a buffer (either by mmapping
> /dev/nvmeX, the device dax char device or through a block layer interface,
> which sounds like a good suggestion from Christoph, but I'm not really
> sure how it would look).

Okay, so clearly this needs a kernel side NVMe specific allocator
and locking so users don't step on each other..

Or as Christoph says some kind of general mechanism to get these
bounce buffers..

> 2) Create an MR with the buffer and use an RDMA function to fill it with
> data from a remote host. This will cause the RDMA hardware to write
> directly to the memory in the NVMe card.
>
> 3) Using O_DIRECT, write the buffer to a file on the NVMe filesystem.
> When the address reaches hardware the NVMe will recognize it as local
> memory and copy it directly there.

Ah, I see.

As a first draft I'd stick with some kind of API built into the
/dev/nvmeX that backs the filesystem. The user app would fstat the
target file, open /dev/block/MAJOR(st_dev):MINOR(st_dev), do some
ioctl to get a CMB mmap, and then proceed from there..
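
The lookup half of that could be sketched as below. The function names are hypothetical, the code assumes udev (or similar) populates /dev/block/MAJOR:MINOR, and the ioctl itself does not exist yet, so it is left out rather than invented.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/*
 * Build the /dev/block/MAJOR:MINOR path of the device holding the file
 * open on `file_fd`. Returns 0 on success, -1 if fstat() fails or `out`
 * is too small.
 */
static int backing_blockdev_path(int file_fd, char *out, size_t outlen)
{
	struct stat st;
	int n;

	assert(out != NULL);

	if (fstat(file_fd, &st) != 0)
		return -1;

	/* st_dev is the device containing the file (st_rdev would be the
	 * device a special file refers to, which is not what we want). */
	n = snprintf(out, outlen, "/dev/block/%u:%u",
		     major(st.st_dev), minor(st.st_dev));

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Open the block device backing `file_fd`. Returns a descriptor or -1.
 * This normally needs privilege; a CMB ioctl and mmap would follow here.
 */
static int open_backing_blockdev(int file_fd)
{
	char path[64];

	if (backing_blockdev_path(file_fd, path, sizeof(path)) != 0)
		return -1;

	return open(path, O_RDWR);
}
```

On filesystems that report an anonymous st_dev (overlayfs or btrfs, for example) there is no such node and the open simply fails, which is one more reason to treat this as a first draft only.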

When that is all working kernel-side, it would make sense to look at a
more general mechanism that could be used unprivileged??

> Thus we are able to transfer data to any file on an NVMe device without
> going through system memory. This has benefits on systems with lots of
> activity in system memory but step 3 is likely to be slowish due to the
> need to pin/unpin the memory for every transaction.

This is similar to the GPU issues too.. On NVMe you don't need to pin
the pages, you just need to lock that VMA so it doesn't get freed from
the NVMe CMB allocator while the IO is running...

Probably in the long run the get_user_pages is going to have to be
pushed down into drivers.. Future MMU coherent IO hardware also does
not need the pinning or other overheads.

Jason

2016-12-06 21:47:12

by Logan Gunthorpe

Subject: Re: Enabling peer to peer device transactions for PCIe devices

Hey,

> Okay, so clearly this needs a kernel side NVMe specific allocator
> and locking so users don't step on each other..

Yup, ideally. That's why device dax isn't ideal for this application: it
doesn't provide any way to prevent users from stepping on each other.

> Or as Christoph says some kind of general mechanism to get these
> bounce buffers..

Yeah, I imagine a general allocate from BAR/region system would be very
useful.

> Ah, I see.
>
> As a first draft I'd stick with some kind of API built into the
> /dev/nvmeX that backs the filesystem. The user app would fstat the
> target file, open /dev/block/MAJOR(st_dev):MINOR(st_dev), do some
> ioctl to get a CMB mmap, and then proceed from there..
>
> When that is all working kernel-side, it would make sense to look at a
> more general mechanism that could be used unprivileged??

That makes a lot of sense to me. I suggested mmapping the char device
because it's really easy, but I can see that an ioctl on the block
device does seem more general and device agnostic.

> This is similar to the GPU issues too.. On NVMe you don't need to pin
> the pages, you just need to lock that VMA so it doesn't get freed from
> the NVMe CMB allocator while the IO is running...
> Probably in the long run the get_user_pages is going to have to be
> pushed down into drivers.. Future MMU coherent IO hardware also does
> not need the pinning or other overheads.

Yup. Yup.

Logan

2016-12-06 22:02:25

by Dan Williams

Subject: Re: Enabling peer to peer device transactions for PCIe devices

On Tue, Dec 6, 2016 at 1:47 PM, Logan Gunthorpe <[email protected]> wrote:
> Hey,
>
>> Okay, so clearly this needs a kernel side NVMe specific allocator
>> and locking so users don't step on each other..
>
> Yup, ideally. That's why device dax isn't ideal for this application: it
> doesn't provide any way to prevent users from stepping on each other.

On this particular point I'm in the process of posting patches that
allow device-dax sub-division, so you could carve up a bar into
multiple devices of various sizes.