Hey all,
Currently, the RDMA subsystem can only work with dynamic dmabuf
attachments, which requires the RDMA device to support on-demand paging
(ODP), a capability that is uncommon (currently only supported by mlx5).
While the dynamic requirement makes sense for certain GPUs, some devices
(such as habanalabs) have device memory that is always "pinned" and do
not need/use the move_notify operation.
The motivation of this RFC is to use habanalabs as the dmabuf exporter,
and EFA as the importer to allow for peer2peer access through libibverbs.
This draft patch changes the dmabuf driver to differentiate between
static/dynamic attachments by looking at the move_notify op instead of
the importer_ops struct, and allowing the peer2peer flag to be enabled
in case of a static exporter.
Thanks
Signed-off-by: Gal Pressman <[email protected]>
---
drivers/dma-buf/dma-buf.c | 5 +++--
drivers/infiniband/core/umem_dmabuf.c | 2 +-
include/linux/dma-buf.h | 2 +-
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 511fe0d217a0..e3faad8f492c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -727,7 +727,8 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
if (WARN_ON(!dmabuf || !dev))
return ERR_PTR(-EINVAL);
- if (WARN_ON(importer_ops && !importer_ops->move_notify))
+ if (WARN_ON(importer_ops && !importer_ops->move_notify &&
+ dma_buf_is_dynamic(dmabuf)))
return ERR_PTR(-EINVAL);
attach = kzalloc(sizeof(*attach), GFP_KERNEL);
@@ -1048,7 +1049,7 @@ void dma_buf_move_notify(struct dma_buf *dmabuf)
dma_resv_assert_held(dmabuf->resv);
list_for_each_entry(attach, &dmabuf->attachments, node)
- if (attach->importer_ops)
+ if (attach->importer_ops && attach->importer_ops->move_notify)
attach->importer_ops->move_notify(attach);
}
EXPORT_SYMBOL_GPL(dma_buf_move_notify);
diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
index c6e875619fac..c502ae828bd3 100644
--- a/drivers/infiniband/core/umem_dmabuf.c
+++ b/drivers/infiniband/core/umem_dmabuf.c
@@ -118,7 +118,7 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device,
if (check_add_overflow(offset, (unsigned long)size, &end))
return ret;
- if (unlikely(!ops || !ops->move_notify))
+ if (unlikely(!ops))
return ret;
dmabuf = dma_buf_get(fd);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index efdc56b9d95f..4b2e99012cb1 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -473,7 +473,7 @@ static inline bool dma_buf_is_dynamic(struct dma_buf *dmabuf)
static inline bool
dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
{
- return !!attach->importer_ops;
+ return attach->importer_ops && attach->importer_ops->move_notify;
}
struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
--
2.32.0
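For context, the userspace flow this targets is roughly the sketch below.
It assumes rdma-core's ibv_reg_dmabuf_mr() verb; export_device_memory() is
a hypothetical stand-in for whatever exporter-specific ioctl hands back a
dmabuf fd, and error handling is omitted.

#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

/* Hypothetical wrapper around the exporter's "export this memory range
 * as a dmabuf" ioctl. */
extern int export_device_memory(int device_fd, uint64_t dev_addr, size_t len);

static struct ibv_mr *register_peer_mem(struct ibv_pd *pd, int device_fd,
					uint64_t dev_addr, size_t len)
{
	int dmabuf_fd = export_device_memory(device_fd, dev_addr, len);

	/* ibv_reg_dmabuf_mr() reaches the driver's reg_user_mr_dmabuf
	 * handler, which uses ib_umem_dmabuf_get() - the function whose
	 * importer check this RFC relaxes. */
	return ibv_reg_dmabuf_mr(pd, 0 /* offset */, len, 0 /* iova */,
				 dmabuf_fd,
				 IBV_ACCESS_LOCAL_WRITE |
				 IBV_ACCESS_REMOTE_READ);
}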
Am 18.08.21 um 09:43 schrieb Gal Pressman:
> Hey all,
>
> Currently, the RDMA subsystem can only work with dynamic dmabuf
> attachments, which requires the RDMA device to support on-demand-paging
> (ODP) which is not common on most devices (only supported by mlx5).
>
> While the dynamic requirement makes sense for certain GPUs, some devices
> (such as habanalabs) have device memory that is always "pinned" and do
> not need/use the move_notify operation.
>
> The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> and EFA as the importer to allow for peer2peer access through libibverbs.
>
> This draft patch changes the dmabuf driver to differentiate between
> static/dynamic attachments by looking at the move_notify op instead of
> the importer_ops struct, and allowing the peer2peer flag to be enabled
> in case of a static exporter.
Well NAK to the general approach, this can be solved much easier.
If you can't support dynamic moves while using the buffer then just pin
all buffers during import/export.
This avoids the move notification and the framework/exporter can still
correctly account for pinned buffers.
But please note that at least amdgpu never uses P2P support for pinned
buffers since we want to avoid unmoveable buffers cluttering video
memory and creating conflicts with V4L and scanout.
If you don't have such concerns in habanalabs then you can implement the
pinning there while keeping P2P still enabled.
Regards,
Christian.
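For illustration only, the exporter half of this can be very small; a
sketch for device memory that is always resident (names invented, the
usual map/unmap/release callbacks omitted):

#include <linux/dma-buf.h>

/* Hypothetical exporter callbacks for memory that never moves: pin and
 * unpin have nothing to migrate, they only let the dma-buf core account
 * for pinned attachments. */
static int example_dmabuf_pin(struct dma_buf_attachment *attach)
{
	/* Always resident, nothing to do. */
	return 0;
}

static void example_dmabuf_unpin(struct dma_buf_attachment *attach)
{
}

static const struct dma_buf_ops example_dmabuf_ops = {
	/* .map_dma_buf, .unmap_dma_buf, .release, ... as usual (omitted) */
	.pin	= example_dmabuf_pin,
	.unpin	= example_dmabuf_unpin,
};

If I read dma-buf.c correctly, providing .pin also makes
dma_buf_is_dynamic() true for the exporter, so the core pins and caches
the mapping at attach time for importers that attach without importer ops.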
>
> Thanks
>
> Signed-off-by: Gal Pressman <[email protected]>
> ---
> drivers/dma-buf/dma-buf.c | 5 +++--
> drivers/infiniband/core/umem_dmabuf.c | 2 +-
> include/linux/dma-buf.h | 2 +-
> 3 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 511fe0d217a0..e3faad8f492c 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -727,7 +727,8 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
> if (WARN_ON(!dmabuf || !dev))
> return ERR_PTR(-EINVAL);
>
> - if (WARN_ON(importer_ops && !importer_ops->move_notify))
> + if (WARN_ON(importer_ops && !importer_ops->move_notify &&
> + dma_buf_is_dynamic(dmabuf)))
> return ERR_PTR(-EINVAL);
>
> attach = kzalloc(sizeof(*attach), GFP_KERNEL);
> @@ -1048,7 +1049,7 @@ void dma_buf_move_notify(struct dma_buf *dmabuf)
> dma_resv_assert_held(dmabuf->resv);
>
> list_for_each_entry(attach, &dmabuf->attachments, node)
> - if (attach->importer_ops)
> + if (attach->importer_ops && attach->importer_ops->move_notify)
> attach->importer_ops->move_notify(attach);
> }
> EXPORT_SYMBOL_GPL(dma_buf_move_notify);
> diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
> index c6e875619fac..c502ae828bd3 100644
> --- a/drivers/infiniband/core/umem_dmabuf.c
> +++ b/drivers/infiniband/core/umem_dmabuf.c
> @@ -118,7 +118,7 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device,
> if (check_add_overflow(offset, (unsigned long)size, &end))
> return ret;
>
> - if (unlikely(!ops || !ops->move_notify))
> + if (unlikely(!ops))
> return ret;
>
> dmabuf = dma_buf_get(fd);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index efdc56b9d95f..4b2e99012cb1 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -473,7 +473,7 @@ static inline bool dma_buf_is_dynamic(struct dma_buf *dmabuf)
> static inline bool
> dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
> {
> - return !!attach->importer_ops;
> + return attach->importer_ops && attach->importer_ops->move_notify;
> }
>
> struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
On 18/08/2021 11:00, Christian König wrote:
> Am 18.08.21 um 09:43 schrieb Gal Pressman:
>> Hey all,
>>
>> Currently, the RDMA subsystem can only work with dynamic dmabuf
>> attachments, which requires the RDMA device to support on-demand-paging
>> (ODP) which is not common on most devices (only supported by mlx5).
>>
>> While the dynamic requirement makes sense for certain GPUs, some devices
>> (such as habanalabs) have device memory that is always "pinned" and do
>> not need/use the move_notify operation.
>>
>> The motivation of this RFC is to use habanalabs as the dmabuf exporter,
>> and EFA as the importer to allow for peer2peer access through libibverbs.
>>
>> This draft patch changes the dmabuf driver to differentiate between
>> static/dynamic attachments by looking at the move_notify op instead of
>> the importer_ops struct, and allowing the peer2peer flag to be enabled
>> in case of a static exporter.
>
> Well NAK to the general approach, this can be solved much easier.
>
> If you can't support dynamic moves while using the buffer then just pin all
> buffers during import/export.
>
> This avoids the move notification and the framework/exporter can still correctly
> account for pinned buffers.
>
> But please note that at least amdgpu never uses P2P support for pinned buffers
> since we want to avoid that unmoveable buffers clutter video memory and create
> conflicts with V4L and scanout.
>
> If you don't have such concerns in habanalabs then you can implement the pinning
> there while keeping P2P still enabled.
Thanks Christian!
Are you suggesting passing an empty move_notify callback instead of NULL?
Also, doesn't the pin operation move the memory from the device to host memory?
On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <[email protected]> wrote:
>
> Hey all,
>
> Currently, the RDMA subsystem can only work with dynamic dmabuf
> attachments, which requires the RDMA device to support on-demand-paging
> (ODP) which is not common on most devices (only supported by mlx5).
>
> While the dynamic requirement makes sense for certain GPUs, some devices
> (such as habanalabs) have device memory that is always "pinned" and do
> not need/use the move_notify operation.
>
> The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> and EFA as the importer to allow for peer2peer access through libibverbs.
>
> This draft patch changes the dmabuf driver to differentiate between
> static/dynamic attachments by looking at the move_notify op instead of
> the importer_ops struct, and allowing the peer2peer flag to be enabled
> in case of a static exporter.
>
> Thanks
>
> Signed-off-by: Gal Pressman <[email protected]>
Given that habanalabs dma-buf support is very firmly in limbo (at
least it's not yet in linux-next or anywhere else) I think you want to
solve that problem first before we tackle the additional issue of
making p2p work without dynamic dma-buf. Without that it just doesn't
make a lot of sense really to talk about solutions here.
-Daniel
> ---
> drivers/dma-buf/dma-buf.c | 5 +++--
> drivers/infiniband/core/umem_dmabuf.c | 2 +-
> include/linux/dma-buf.h | 2 +-
> 3 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 511fe0d217a0..e3faad8f492c 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -727,7 +727,8 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev,
> if (WARN_ON(!dmabuf || !dev))
> return ERR_PTR(-EINVAL);
>
> - if (WARN_ON(importer_ops && !importer_ops->move_notify))
> + if (WARN_ON(importer_ops && !importer_ops->move_notify &&
> + dma_buf_is_dynamic(dmabuf)))
> return ERR_PTR(-EINVAL);
>
> attach = kzalloc(sizeof(*attach), GFP_KERNEL);
> @@ -1048,7 +1049,7 @@ void dma_buf_move_notify(struct dma_buf *dmabuf)
> dma_resv_assert_held(dmabuf->resv);
>
> list_for_each_entry(attach, &dmabuf->attachments, node)
> - if (attach->importer_ops)
> + if (attach->importer_ops && attach->importer_ops->move_notify)
> attach->importer_ops->move_notify(attach);
> }
> EXPORT_SYMBOL_GPL(dma_buf_move_notify);
> diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c
> index c6e875619fac..c502ae828bd3 100644
> --- a/drivers/infiniband/core/umem_dmabuf.c
> +++ b/drivers/infiniband/core/umem_dmabuf.c
> @@ -118,7 +118,7 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device,
> if (check_add_overflow(offset, (unsigned long)size, &end))
> return ret;
>
> - if (unlikely(!ops || !ops->move_notify))
> + if (unlikely(!ops))
> return ret;
>
> dmabuf = dma_buf_get(fd);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index efdc56b9d95f..4b2e99012cb1 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -473,7 +473,7 @@ static inline bool dma_buf_is_dynamic(struct dma_buf *dmabuf)
> static inline bool
> dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
> {
> - return !!attach->importer_ops;
> + return attach->importer_ops && attach->importer_ops->move_notify;
> }
>
> struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
> --
> 2.32.0
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
> On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <[email protected]> wrote:
> >
> > Hey all,
> >
> > Currently, the RDMA subsystem can only work with dynamic dmabuf
> > attachments, which requires the RDMA device to support on-demand-paging
> > (ODP) which is not common on most devices (only supported by mlx5).
> >
> > While the dynamic requirement makes sense for certain GPUs, some devices
> > (such as habanalabs) have device memory that is always "pinned" and do
> > not need/use the move_notify operation.
> >
> > The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> > and EFA as the importer to allow for peer2peer access through libibverbs.
> >
> > This draft patch changes the dmabuf driver to differentiate between
> > static/dynamic attachments by looking at the move_notify op instead of
> > the importer_ops struct, and allowing the peer2peer flag to be enabled
> > in case of a static exporter.
> >
> > Thanks
> >
> > Signed-off-by: Gal Pressman <[email protected]>
>
> Given that habanalabs dma-buf support is very firmly in limbo (at
> least it's not yet in linux-next or anywhere else) I think you want to
> solve that problem first before we tackle the additional issue of
> making p2p work without dynamic dma-buf. Without that it just doesn't
> make a lot of sense really to talk about solutions here.
I have been thinking about adding a dmabuf exporter to VFIO, for
basically the same reason habana labs wants to do it.
In that situation we'd want to see an approach similar to this as well
to have a broad usability.
The GPU drivers also want this for certain sophisticated scenarios
with RDMA, the intree drivers just haven't quite got there yet.
So, I think it is worthwhile to start thinking about this regardless
of habana labs.
Jason
On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <[email protected]> wrote:
> On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <[email protected]> wrote:
> > >
> > > Hey all,
> > >
> > > Currently, the RDMA subsystem can only work with dynamic dmabuf
> > > attachments, which requires the RDMA device to support on-demand-paging
> > > (ODP) which is not common on most devices (only supported by mlx5).
> > >
> > > While the dynamic requirement makes sense for certain GPUs, some devices
> > > (such as habanalabs) have device memory that is always "pinned" and do
> > > not need/use the move_notify operation.
> > >
> > > The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> > > and EFA as the importer to allow for peer2peer access through libibverbs.
> > >
> > > This draft patch changes the dmabuf driver to differentiate between
> > > static/dynamic attachments by looking at the move_notify op instead of
> > > the importer_ops struct, and allowing the peer2peer flag to be enabled
> > > in case of a static exporter.
> > >
> > > Thanks
> > >
> > > Signed-off-by: Gal Pressman <[email protected]>
> >
> > Given that habanalabs dma-buf support is very firmly in limbo (at
> > least it's not yet in linux-next or anywhere else) I think you want to
> > solve that problem first before we tackle the additional issue of
> > making p2p work without dynamic dma-buf. Without that it just doesn't
> > make a lot of sense really to talk about solutions here.
>
> I have been thinking about adding a dmabuf exporter to VFIO, for
> basically the same reason habana labs wants to do it.
>
> In that situation we'd want to see an approach similar to this as well
> to have a broad usability.
>
> The GPU drivers also want this for certain sophisticated scenarios
> with RDMA, the intree drivers just haven't quite got there yet.
>
> So, I think it is worthwhile to start thinking about this regardless
> of habana labs.
Oh sure, I've been having these for a while. I think there's two options:
- some kind of soft-pin, where the contract is that we only revoke
when absolutely necessary, and it's expected to be catastrophic on the
importer's side. The use-case would be a single user that fully
controls all accelerator local memory, and so the kernel driver evicting
stuff should essentially never happen. I haven't implemented it, but the
idea is that essentially in eviction we check whom we're evicting for
(by checking owners of buffers maybe, atm those are not tracked in
generic code but not that hard to add), and if it's the same userspace
owner we don't ever pick these buffers as victims for eviction,
preferring -ENOMEM/-ENOSPC. If a new user comes around then we'd still
throw these out to avoid abuse, and it would be up to sysadmins to make
sure this doesn't happen at an awkward time, maybe with the next thing.
- cgroups for (pinned) buffers. Mostly because cgroups for local
memory is somewhere on the plans anyway, but that one could also take
forever since there's questions about overlap with memcg and things
like that, plus thus far everyone who cares has made an incompatible
proposal about how it should be done :-/
A variant of the first one would be device-level revoke, which is a
concept we already have in drm for the modesetting side and also for
like 20 year old gpu drivers. We could brush that off and close some
of the gaps (a student is fixing the locking right now, the thing left
to do is mmap revoke), and I think that model of exclusive device
ownership with the option to revoke fits pretty well for at least some
of the accelerators floating around. In that case importers would
never get a move_notify (maybe we should call this revoke_notify to
make it clear it's a bit different) callback, except when the entire
thing has been yanked. I think that would fit pretty well for VFIO,
and I think we should be able to make it work for rdma too as some
kind of auto-deregister. The locking might be fun with both of these
since I expect some inversions compared to the register path, we'll
have to figure these out.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Fri, Aug 20, 2021 at 09:25:30AM +0200, Daniel Vetter wrote:
> On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <[email protected]> wrote:
> > On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
> > > On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <[email protected]> wrote:
> > > >
> > > > Hey all,
> > > >
> > > > Currently, the RDMA subsystem can only work with dynamic dmabuf
> > > > attachments, which requires the RDMA device to support on-demand-paging
> > > > (ODP) which is not common on most devices (only supported by mlx5).
> > > >
> > > > While the dynamic requirement makes sense for certain GPUs, some devices
> > > > (such as habanalabs) have device memory that is always "pinned" and do
> > > > not need/use the move_notify operation.
> > > >
> > > > The motivation of this RFC is to use habanalabs as the dmabuf exporter,
> > > > and EFA as the importer to allow for peer2peer access through libibverbs.
> > > >
> > > > This draft patch changes the dmabuf driver to differentiate between
> > > > static/dynamic attachments by looking at the move_notify op instead of
> > > > the importer_ops struct, and allowing the peer2peer flag to be enabled
> > > > in case of a static exporter.
> > > >
> > > > Thanks
> > > >
> > > > Signed-off-by: Gal Pressman <[email protected]>
> > >
> > > Given that habanalabs dma-buf support is very firmly in limbo (at
> > > least it's not yet in linux-next or anywhere else) I think you want to
> > > solve that problem first before we tackle the additional issue of
> > > making p2p work without dynamic dma-buf. Without that it just doesn't
> > > make a lot of sense really to talk about solutions here.
> >
> > I have been thinking about adding a dmabuf exporter to VFIO, for
> > basically the same reason habana labs wants to do it.
> >
> > In that situation we'd want to see an approach similar to this as well
> > to have a broad usability.
> >
> > The GPU drivers also want this for certain sophisticated scenarios
> > with RDMA, the intree drivers just haven't quite got there yet.
> >
> > So, I think it is worthwhile to start thinking about this regardless
> > of habana labs.
>
> Oh sure, I've been having these for a while. I think there's two options:
> - some kind of soft-pin, where the contract is that we only revoke
> when absolutely necessary, and it's expected to be catastrophic on the
> importer's side.
Honestly, I'm not very keen on this. We don't really have HW support
in several RDMA scenarios for even catastrophic unpin.
Gal, can EFA even do this for a MR? You basically have to resize the
rkey/lkey to zero length (or invalidate it like a FMR) under the
catastrophic revoke. The rkey/lkey cannot just be destroyed as that
opens a security problem with rkey/lkey re-use.
I think I saw EFA's current out of tree implementations had this bug.
> to do is mmap revoke), and I think that model of exclusive device
> ownership with the option to revoke fits pretty well for at least some
> of the accelerators floating around. In that case importers would
> never get a move_notify (maybe we should call this revoke_notify to
> make it clear it's a bit different) callback, except when the entire
> thing has been yanked. I think that would fit pretty well for VFIO,
> and I think we should be able to make it work for rdma too as some
> kind of auto-deregister. The locking might be fun with both of these
> since I expect some inversions compared to the register path, we'll
> have to figure these out.
It fits semantically nicely, VFIO also has a revoke semantic for BAR
mappings.
The challenge is the RDMA side which doesn't have a 'dma disabled
error state' for objects as part of the spec.
Some HW, like mlx5, can implement this for MR objects (see revoke_mr),
but I don't know if anything else can, and even mlx5 currently can't
do a revoke for any other object type.
I don't know how useful it would be, need to check on some of the use
cases.
The locking is tricky as we have to issue a device command, but that
device command cannot run concurrently with destruction or the tail
part of creation.
Jason
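Just to give that a concrete shape (nothing like this exists in-tree; the
revoke_notify semantics and the revoke_umem_mr device op are invented
purely for illustration), a revoke-style importer callback for the dmabuf
umem might look roughly like:

/* Hypothetical sketch: step 1 keeps the lkey/rkey allocated but points
 * the HW translation at nothing (mlx5's revoke_mr style), step 2 drops
 * the dmabuf mapping.  'revoke_umem_mr' is an invented ib_device op. */
static void ib_umem_dmabuf_revoke_notify(struct dma_buf_attachment *attach)
{
	struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;
	struct ib_device *ibdev = umem_dmabuf->umem.ibdev;

	dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);

	/* 1) Tell the HW to stop using the pages without freeing the
	 *    key, so the lkey/rkey cannot be re-used unsafely. */
	ibdev->ops.revoke_umem_mr(umem_dmabuf);

	/* 2) Only now is it safe to unmap the dmabuf pages. */
	ib_umem_dmabuf_unmap_pages(umem_dmabuf);
}

As Jason notes above, the tricky part is exactly step 1: that device
command must be serialized against MR creation and destruction.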
On 20/08/2021 15:33, Jason Gunthorpe wrote:
> On Fri, Aug 20, 2021 at 09:25:30AM +0200, Daniel Vetter wrote:
>> On Fri, Aug 20, 2021 at 1:06 AM Jason Gunthorpe <[email protected]> wrote:
>>> On Wed, Aug 18, 2021 at 11:34:51AM +0200, Daniel Vetter wrote:
>>>> On Wed, Aug 18, 2021 at 9:45 AM Gal Pressman <[email protected]> wrote:
>>>>>
>>>>> Hey all,
>>>>>
>>>>> Currently, the RDMA subsystem can only work with dynamic dmabuf
>>>>> attachments, which requires the RDMA device to support on-demand-paging
>>>>> (ODP) which is not common on most devices (only supported by mlx5).
>>>>>
>>>>> While the dynamic requirement makes sense for certain GPUs, some devices
>>>>> (such as habanalabs) have device memory that is always "pinned" and do
>>>>> not need/use the move_notify operation.
>>>>>
>>>>> The motivation of this RFC is to use habanalabs as the dmabuf exporter,
>>>>> and EFA as the importer to allow for peer2peer access through libibverbs.
>>>>>
>>>>> This draft patch changes the dmabuf driver to differentiate between
>>>>> static/dynamic attachments by looking at the move_notify op instead of
>>>>> the importer_ops struct, and allowing the peer2peer flag to be enabled
>>>>> in case of a static exporter.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Signed-off-by: Gal Pressman <[email protected]>
>>>>
>>>> Given that habanalabs dma-buf support is very firmly in limbo (at
>>>> least it's not yet in linux-next or anywhere else) I think you want to
>>>> solve that problem first before we tackle the additional issue of
>>>> making p2p work without dynamic dma-buf. Without that it just doesn't
>>>> make a lot of sense really to talk about solutions here.
>>>
>>> I have been thinking about adding a dmabuf exporter to VFIO, for
>>> basically the same reason habana labs wants to do it.
>>>
>>> In that situation we'd want to see an approach similar to this as well
>>> to have a broad usability.
>>>
>>> The GPU drivers also want this for certain sophisticated scenarios
>>> with RDMA, the intree drivers just haven't quite got there yet.
>>>
>>> So, I think it is worthwhile to start thinking about this regardless
>>> of habana labs.
>>
>> Oh sure, I've been having these for a while. I think there's two options:
>> - some kind of soft-pin, where the contract is that we only revoke
>> when absolutely necessary, and it's expected to be catastrophic on the
>> importer's side.
>
> Honestly, I'm not very keen on this. We don't really have HW support
> in several RDMA scenarios for even catastrophic unpin.
>
> Gal, can EFA even do this for a MR? You basically have to resize the
> rkey/lkey to zero length (or invalidate it like a FMR) under the
> catstrophic revoke. The rkey/lkey cannot just be destroyed as that
> opens a security problem with rkey/lkey re-use.
I had some discussions with our hardware guys about such functionality in the
past, I think it should be doable (not necessarily the same way that FMR does it).
Though it would've been nicer if we could agree on a solution that could work
for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has.
That's why I tried to approach this by denying such attachments for non-ODP
importers instead of exposing a "limited" dynamic importer.
On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
> Though it would've been nicer if we could agree on a solution that could work
> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has.
I don't think it can really be done, revoke is necessary, and isn't a
primitive we have today.
Revoke is sort of like rereg MR, but with a guaranteed no-change to
the lkey/rkey
Then there is the locking complexity of linking the mr creation and
destruction to the lifecycle of the pages, which is messy and maybe
not general. For instance mlx5 would call its revoke_mr, disconnect
the dmabuf then destroy the mkey - but this is only safe because mlx5
HW can handle concurrent revokes.
> That's why I tried to approach this by denying such attachments for non-ODP
> importers instead of exposing a "limited" dynamic importer.
That is fine if there is no revoke - once revoke exists we must have
driver and HW support.
Jason
On 20/08/2021 17:32, Jason Gunthorpe wrote:
> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>
>> Though it would've been nicer if we could agree on a solution that could work
>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has.
>
> I don't think it can really be done, revoke is necessary, and isn't a
> primitive we have today.
>
> Revoke is sort of like rereg MR, but with a guaranteed no-change to
> the lkey/rkey
>
> Then there is the locking complexity of linking the mr creation and
> destruction to the lifecycle of the pages, which is messy and maybe
> not general. For instance mlx5 would call its revoke_mr, disconnect
> the dmabuf then destroy the mkey - but this is only safe because mlx5
> HW can handle concurrent revokes.
Thanks, that makes sense.
>> That's why I tried to approach this by denying such attachments for non-ODP
>> importers instead of exposing a "limited" dynamic importer.
>
> That is fine if there is no revoke - once revoke exists we must have
> driver and HW support.
Agree.
IIUC, we're talking about three different exporter "types":
- Dynamic with move_notify (requires ODP)
- Dynamic with revoke_notify
- Static
Which changes do we need to make the third one work?
Am 21.08.21 um 11:16 schrieb Gal Pressman:
> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>
>>> Though it would've been nicer if we could agree on a solution that could work
>>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem has.
>> I don't think it can really be done, revoke is necessary, and isn't a
>> primitive we have today.
>>
>> Revoke is sort of like rereg MR, but with a guaranteed no-change to
>> the lkey/rkey
>>
>> Then there is the locking complexity of linking the mr creation and
>> destruction to the lifecycle of the pages, which is messy and maybe
>> not general. For instance mlx5 would call its revoke_mr, disconnect
>> the dmabuf then destroy the mkey - but this is only safe because mlx5
>> HW can handle concurrent revokes.
> Thanks, that makes sense.
>
>>> That's why I tried to approach this by denying such attachments for non-ODP
>>> importers instead of exposing a "limited" dynamic importer.
>> That is fine if there is no revoke - once revoke exists we must have
>> driver and HW support.
> Agree.
> IIUC, we're talking about three different exporter "types":
> - Dynamic with move_notify (requires ODP)
> - Dynamic with revoke_notify
> - Static
>
> Which changes do we need to make the third one work?
Basically none at all in the framework.
You just need to properly use the dma_buf_pin() function when you start
using a buffer (e.g. before you create an attachment) and the
dma_buf_unpin() function after you are done with the DMA-buf.
Regards,
Christian.
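To make that concrete on the RDMA side, here is a rough sketch of a pinned
dmabuf umem (hypothetical, not part of this RFC: the helper name, the stub
move_notify (essentially the "empty move_notify callback" asked about
earlier) and the error handling are invented; it builds on the existing
ib_umem_dmabuf_get()/ib_umem_dmabuf_map_pages() helpers plus
dma_buf_pin()/dma_buf_unpin()):

/* Sketch only: pin the exporter's memory for the lifetime of the MR so
 * that no move_notify is ever expected. */
static void ib_umem_dmabuf_stub_move_notify(struct dma_buf_attachment *attach)
{
	/* Should never fire for a pinned attachment. */
	WARN_ON_ONCE(1);
}

static struct dma_buf_attach_ops ib_umem_dmabuf_pinned_ops = {
	.allow_peer2peer = true,
	.move_notify = ib_umem_dmabuf_stub_move_notify,
};

struct ib_umem_dmabuf *ib_umem_dmabuf_get_pinned(struct ib_device *device,
						 unsigned long offset,
						 size_t size, int fd,
						 int access)
{
	struct ib_umem_dmabuf *umem_dmabuf;
	int err;

	umem_dmabuf = ib_umem_dmabuf_get(device, offset, size, fd, access,
					 &ib_umem_dmabuf_pinned_ops);
	if (IS_ERR(umem_dmabuf))
		return umem_dmabuf;

	dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
	err = dma_buf_pin(umem_dmabuf->attach);
	if (err)
		goto err_release;

	err = ib_umem_dmabuf_map_pages(umem_dmabuf);
	if (err)
		goto err_unpin;
	dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);

	return umem_dmabuf;

err_unpin:
	dma_buf_unpin(umem_dmabuf->attach);
err_release:
	dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
	ib_umem_release(&umem_dmabuf->umem);
	return ERR_PTR(err);
}

A driver like EFA could then call such a helper from its
reg_user_mr_dmabuf path instead of requiring ODP, with the release path
unmapping, unpinning and detaching in the reverse order.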
On 23/08/2021 13:43, Christian König wrote:
> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>>
>>>> Though it would've been nicer if we could agree on a solution that could work
>>>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem
>>>> has.
>>> I don't think it can really be done, revoke is necessary, and isn't a
>>> primitive we have today.
>>>
>>> Revoke is sort of like rereg MR, but with a guaranteed no-change to
>>> the lkey/rkey
>>>
>>> Then there is the locking complexity of linking the mr creation and
>>> destruction to the lifecycle of the pages, which is messy and maybe
>>> not general. For instance mlx5 would call its revoke_mr, disconnect
>>> the dmabuf then destroy the mkey - but this is only safe because mlx5
>>> HW can handle concurrent revokes.
>> Thanks, that makes sense.
>>
>>>> That's why I tried to approach this by denying such attachments for non-ODP
>>>> importers instead of exposing a "limited" dynamic importer.
>>> That is fine if there is no revoke - once revoke exists we must have
>>> driver and HW support.
>> Agree.
>> IIUC, we're talking about three different exporter "types":
>> - Dynamic with move_notify (requires ODP)
>> - Dynamic with revoke_notify
>> - Static
>>
>> Which changes do we need to make the third one work?
>
> Basically none at all in the framework.
>
> You just need to properly use the dma_buf_pin() function when you start using a
> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
> after you are done with the DMA-buf.
I replied to your previous mail, but I'll ask again.
Doesn't the pin operation migrate the memory to host memory?
Am 24.08.21 um 11:06 schrieb Gal Pressman:
> On 23/08/2021 13:43, Christian König wrote:
>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>>>
>>>>> Though it would've been nicer if we could agree on a solution that could work
>>>>> for more than 1-2 RDMA devices, using the existing tools the RDMA subsystem
>>>>> has.
>>>> I don't think it can really be done, revoke is necessary, and isn't a
>>>> primitive we have today.
>>>>
>>>> Revoke is sort of like rereg MR, but with a guaranteed no-change to
>>>> the lkey/rkey
>>>>
>>>> Then there is the locking complexity of linking the mr creation and
>>>> destruction to the lifecycle of the pages, which is messy and maybe
>>>> not general. For instance mlx5 would call its revoke_mr, disconnect
>>>> the dmabuf then destroy the mkey - but this is only safe because mlx5
>>>> HW can handle concurrent revokes.
>>> Thanks, that makes sense.
>>>
>>>>> That's why I tried to approach this by denying such attachments for non-ODP
>>>>> importers instead of exposing a "limited" dynamic importer.
>>>> That is fine if there is no revoke - once revoke exists we must have
>>>> driver and HW support.
>>> Agree.
>>> IIUC, we're talking about three different exporter "types":
>>> - Dynamic with move_notify (requires ODP)
>>> - Dynamic with revoke_notify
>>> - Static
>>>
>>> Which changes do we need to make the third one work?
>> Basically none at all in the framework.
>>
>> You just need to properly use the dma_buf_pin() function when you start using a
>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>> after you are done with the DMA-buf.
> I replied to your previous mail, but I'll ask again.
> Doesn't the pin operation migrate the memory to host memory?
Sorry missed your previous reply.
And yes at least for the amdgpu driver we migrate the memory to host
memory as soon as it is pinned and I would expect that other GPU drivers
do something similar.
This is intentional since we don't want any P2P to video memory with
pinned objects and want to avoid running into a situation where one
device is doing P2P to video memory while another device needs the
DMA-buf in host memory.
You can still do P2P with pinned objects, it's just up to the exporting
driver whether it is allowed or not.
The other option is what Daniel suggested that we have some kind of
revoke. This is essentially what our KFD is doing as well when doing
interop with 3D GFX, but from Jason's responses I have a bit of doubt
that this will actually work on the hardware level for RDMA.
Regards,
Christian.
On 8/24/21 2:32 AM, Christian König wrote:
> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>> On 23/08/2021 13:43, Christian König wrote:
>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
...
>>>> IIUC, we're talking about three different exporter "types":
>>>> - Dynamic with move_notify (requires ODP)
>>>> - Dynamic with revoke_notify
>>>> - Static
>>>>
>>>> Which changes do we need to make the third one work?
>>> Basically none at all in the framework.
>>>
>>> You just need to properly use the dma_buf_pin() function when you start using a
>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>> after you are done with the DMA-buf.
>> I replied to your previous mail, but I'll ask again.
>> Doesn't the pin operation migrate the memory to host memory?
>
> Sorry missed your previous reply.
>
> And yes at least for the amdgpu driver we migrate the memory to host memory as soon as it is pinned
> and I would expect that other GPU drivers do something similar.
Well...for many topologies, migrating to host memory will result in a
dramatically slower p2p setup. For that reason, some GPU drivers may
want to allow pinning of video memory in some situations.
Ideally, you've got modern ODP devices and you don't even need to pin.
But if not, and you still hope to do high performance p2p between a GPU
and a non-ODP Infiniband device, then you would need to leave the pinned
memory in vidmem.
So I think we don't want to rule out that behavior, right? Or is the
thinking more like, "you're lucky that this old non-ODP setup works at
all, and we'll make it work by routing through host/cpu memory, but it
will be slow"?
thanks,
--
John Hubbard
NVIDIA
>
> This is intentional since we don't want any P2P to video memory with pinned objects and want to
> avoid to run into a situation where one device is doing P2P to video memory while another device
> needs the DMA-buf in host memory.
>
> You can still do P2P with pinned object, it's just up to the exporting driver if it is allowed or not.
>
> The other option is what Daniel suggested that we have some kind of revoke. This is essentially what
> our KFD is doing as well when doing interop with 3D GFX, but from Jasons responses I have a bit of
> doubt that this will actually work on the hardware level for RDMA.
>
> Regards,
> Christian.
On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
...
>>> And yes at least for the amdgpu driver we migrate the memory to host
>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>> do something similar.
>>
>> Well...for many topologies, migrating to host memory will result in a
>> dramatically slower p2p setup. For that reason, some GPU drivers may
>> want to allow pinning of video memory in some situations.
>>
>> Ideally, you've got modern ODP devices and you don't even need to pin.
>> But if not, and you still hope to do high performance p2p between a GPU
>> and a non-ODP Infiniband device, then you would need to leave the pinned
>> memory in vidmem.
>>
>> So I think we don't want to rule out that behavior, right? Or is the
>> thinking more like, "you're lucky that this old non-ODP setup works at
>> all, and we'll make it work by routing through host/cpu memory, but it
>> will be slow"?
>
> I think it depends on the user, if the user creates memory which is
> permanently located on the GPU then it should be pinnable in this way
> without force migration. But if the memory is inherently migratable
> then it just cannot be pinned in the GPU at all as we can't
> indefinately block migration from happening eg if the CPU touches it
> later or something.
>
OK. I just want to avoid creating any API-level assumptions that dma_buf_pin()
necessarily implies or requires migrating to host memory.
thanks,
--
John Hubbard
NVIDIA
On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
> On 8/24/21 2:32 AM, Christian König wrote:
> > Am 24.08.21 um 11:06 schrieb Gal Pressman:
> > > On 23/08/2021 13:43, Christian König wrote:
> > > > Am 21.08.21 um 11:16 schrieb Gal Pressman:
> > > > > On 20/08/2021 17:32, Jason Gunthorpe wrote:
> > > > > > On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
> ...
> > > > > IIUC, we're talking about three different exporter "types":
> > > > > - Dynamic with move_notify (requires ODP)
> > > > > - Dynamic with revoke_notify
> > > > > - Static
> > > > >
> > > > > Which changes do we need to make the third one work?
> > > > Basically none at all in the framework.
> > > >
> > > > You just need to properly use the dma_buf_pin() function when you start using a
> > > > buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
> > > > after you are done with the DMA-buf.
> > > I replied to your previous mail, but I'll ask again.
> > > Doesn't the pin operation migrate the memory to host memory?
> >
> > Sorry missed your previous reply.
> >
> > And yes at least for the amdgpu driver we migrate the memory to host
> > memory as soon as it is pinned and I would expect that other GPU drivers
> > do something similar.
>
> Well...for many topologies, migrating to host memory will result in a
> dramatically slower p2p setup. For that reason, some GPU drivers may
> want to allow pinning of video memory in some situations.
>
> Ideally, you've got modern ODP devices and you don't even need to pin.
> But if not, and you still hope to do high performance p2p between a GPU
> and a non-ODP Infiniband device, then you would need to leave the pinned
> memory in vidmem.
>
> So I think we don't want to rule out that behavior, right? Or is the
> thinking more like, "you're lucky that this old non-ODP setup works at
> all, and we'll make it work by routing through host/cpu memory, but it
> will be slow"?
I think it depends on the user, if the user creates memory which is
permanently located on the GPU then it should be pinnable in this way
without force migration. But if the memory is inherently migratable
then it just cannot be pinned in the GPU at all as we can't
indefinitely block migration from happening, e.g. if the CPU touches it
later or something.
Jason
On Wed, 25 Aug 2021 at 03:36, John Hubbard <[email protected]> wrote:
>
> On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
> ...
> >>> And yes at least for the amdgpu driver we migrate the memory to host
> >>> memory as soon as it is pinned and I would expect that other GPU drivers
> >>> do something similar.
> >>
> >> Well...for many topologies, migrating to host memory will result in a
> >> dramatically slower p2p setup. For that reason, some GPU drivers may
> >> want to allow pinning of video memory in some situations.
> >>
> >> Ideally, you've got modern ODP devices and you don't even need to pin.
> >> But if not, and you still hope to do high performance p2p between a GPU
> >> and a non-ODP Infiniband device, then you would need to leave the pinned
> >> memory in vidmem.
> >>
> >> So I think we don't want to rule out that behavior, right? Or is the
> >> thinking more like, "you're lucky that this old non-ODP setup works at
> >> all, and we'll make it work by routing through host/cpu memory, but it
> >> will be slow"?
> >
> > I think it depends on the user, if the user creates memory which is
> > permanently located on the GPU then it should be pinnable in this way
> > without force migration. But if the memory is inherently migratable
> > then it just cannot be pinned in the GPU at all as we can't
> > indefinately block migration from happening eg if the CPU touches it
> > later or something.
> >
>
> OK. I just want to avoid creating any API-level assumptions that dma_buf_pin()
> necessarily implies or requires migrating to host memory.
I'm not sure we should be allowing dma_buf_pin at all on
non-migratable memory, what's to stop someone just pinning all the
VRAM and making the GPU unusable?
I understand not considering more than a single user in these
situations is enterprise thinking, but I do worry about "pinning is
always fine" type of thinking when things are shared or multi-user.
My impression from this is we've designed hardware that didn't
consider the problem, and now to let us use that hardware in horrible
ways we should just allow it to pin all the things.
Dave.
On Wed, Aug 25, 2021 at 05:15:52AM +1000, Dave Airlie wrote:
> On Wed, 25 Aug 2021 at 03:36, John Hubbard <[email protected]> wrote:
> >
> > On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
> > ...
> > >>> And yes at least for the amdgpu driver we migrate the memory to host
> > >>> memory as soon as it is pinned and I would expect that other GPU drivers
> > >>> do something similar.
> > >>
> > >> Well...for many topologies, migrating to host memory will result in a
> > >> dramatically slower p2p setup. For that reason, some GPU drivers may
> > >> want to allow pinning of video memory in some situations.
> > >>
> > >> Ideally, you've got modern ODP devices and you don't even need to pin.
> > >> But if not, and you still hope to do high performance p2p between a GPU
> > >> and a non-ODP Infiniband device, then you would need to leave the pinned
> > >> memory in vidmem.
> > >>
> > >> So I think we don't want to rule out that behavior, right? Or is the
> > >> thinking more like, "you're lucky that this old non-ODP setup works at
> > >> all, and we'll make it work by routing through host/cpu memory, but it
> > >> will be slow"?
> > >
> > > I think it depends on the user, if the user creates memory which is
> > > permanently located on the GPU then it should be pinnable in this way
> > > without force migration. But if the memory is inherently migratable
> > > then it just cannot be pinned in the GPU at all as we can't
> > > indefinately block migration from happening eg if the CPU touches it
> > > later or something.
> > >
> >
> > OK. I just want to avoid creating any API-level assumptions that dma_buf_pin()
> > necessarily implies or requires migrating to host memory.
>
> I'm not sure we should be allowing dma_buf_pin at all on
> non-migratable memory, what's to stop someone just pinning all the
> VRAM and making the GPU unuseable?
IMHO the same thinking that prevents pinning all of system RAM and
making the system unusable? GPU isn't so special here. The main
restriction is the pinned memory ulimit. For most out-of-the-box cases
this is set to something like 64k
For the single-user HPC use cases it is made unlimited.
> My impression from this is we've designed hardware that didn't
> consider the problem, and now to let us use that hardware in horrible
> ways we should just allow it to pin all the things.
It is more complex than that, HW that can support dynamic memory under
*everything* is complicated (and in some cases slow!). As there is
only a weak rationale to do this, we don't see it often in the
market.
Jason
On Tue, Aug 24, 2021 at 3:16 PM Dave Airlie <[email protected]> wrote:
>
> On Wed, 25 Aug 2021 at 03:36, John Hubbard <[email protected]> wrote:
> >
> > On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
> > ...
> > >>> And yes at least for the amdgpu driver we migrate the memory to host
> > >>> memory as soon as it is pinned and I would expect that other GPU drivers
> > >>> do something similar.
> > >>
> > >> Well...for many topologies, migrating to host memory will result in a
> > >> dramatically slower p2p setup. For that reason, some GPU drivers may
> > >> want to allow pinning of video memory in some situations.
> > >>
> > >> Ideally, you've got modern ODP devices and you don't even need to pin.
> > >> But if not, and you still hope to do high performance p2p between a GPU
> > >> and a non-ODP Infiniband device, then you would need to leave the pinned
> > >> memory in vidmem.
> > >>
> > >> So I think we don't want to rule out that behavior, right? Or is the
> > >> thinking more like, "you're lucky that this old non-ODP setup works at
> > >> all, and we'll make it work by routing through host/cpu memory, but it
> > >> will be slow"?
> > >
> > > I think it depends on the user, if the user creates memory which is
> > > permanently located on the GPU then it should be pinnable in this way
> > > without force migration. But if the memory is inherently migratable
> > > then it just cannot be pinned in the GPU at all as we can't
> > > indefinately block migration from happening eg if the CPU touches it
> > > later or something.
> > >
> >
> > OK. I just want to avoid creating any API-level assumptions that dma_buf_pin()
> > necessarily implies or requires migrating to host memory.
>
> I'm not sure we should be allowing dma_buf_pin at all on
> non-migratable memory, what's to stop someone just pinning all the
> VRAM and making the GPU unuseable?
In a lot of cases we have GPUs with more VRAM than system memory, but
we allow pinning in system memory.
Alex
>
> I understand not considering more than a single user in these
> situations is enterprise thinking, but I do worry about pinning is
> always fine type of thinking when things are shared or multi-user.
>
> My impression from this is we've designed hardware that didn't
> consider the problem, and now to let us use that hardware in horrible
> ways we should just allow it to pin all the things.
>
> Dave.
> -----Original Message-----
> From: Alex Deucher <[email protected]>
> Sent: Tuesday, August 24, 2021 12:44 PM
> To: Dave Airlie <[email protected]>
> Cc: John Hubbard <[email protected]>; Jason Gunthorpe <[email protected]>; Christian König <[email protected]>; Gal Pressman
> <[email protected]>; Daniel Vetter <[email protected]>; Sumit Semwal <[email protected]>; Doug Ledford
> <[email protected]>; open list:DMA BUFFER SHARING FRAMEWORK <[email protected]>; dri-devel <dri-
> [email protected]>; Linux Kernel Mailing List <[email protected]>; linux-rdma <[email protected]>; Gabbay,
> Oded (Habana) <[email protected]>; Tayar, Tomer (Habana) <[email protected]>; Yossi Leybovich <[email protected]>; Alexander
> Matushevsky <[email protected]>; Leon Romanovsky <[email protected]>; Xiong, Jianxin <[email protected]>
> Subject: Re: [RFC] Make use of non-dynamic dmabuf in RDMA
>
> On Tue, Aug 24, 2021 at 3:16 PM Dave Airlie <[email protected]> wrote:
> >
> > On Wed, 25 Aug 2021 at 03:36, John Hubbard <[email protected]> wrote:
> > >
> > > On 8/24/21 10:32 AM, Jason Gunthorpe wrote:
> > > ...
> > > >>> And yes at least for the amdgpu driver we migrate the memory to
> > > >>> host memory as soon as it is pinned and I would expect that
> > > >>> other GPU drivers do something similar.
> > > >>
> > > >> Well...for many topologies, migrating to host memory will result
> > > >> in a dramatically slower p2p setup. For that reason, some GPU
> > > >> drivers may want to allow pinning of video memory in some situations.
> > > >>
> > > >> Ideally, you've got modern ODP devices and you don't even need to pin.
> > > >> But if not, and you still hope to do high performance p2p between
> > > >> a GPU and a non-ODP Infiniband device, then you would need to
> > > >> leave the pinned memory in vidmem.
> > > >>
> > > >> So I think we don't want to rule out that behavior, right? Or is
> > > >> the thinking more like, "you're lucky that this old non-ODP setup
> > > >> works at all, and we'll make it work by routing through host/cpu
> > > >> memory, but it will be slow"?
> > > >
> > > > I think it depends on the user, if the user creates memory which
> > > > is permanently located on the GPU then it should be pinnable in
> > > > this way without force migration. But if the memory is inherently
> > > > migratable then it just cannot be pinned in the GPU at all as we
> > > > can't indefinately block migration from happening eg if the CPU
> > > > touches it later or something.
> > > >
> > >
> > > OK. I just want to avoid creating any API-level assumptions that
> > > dma_buf_pin() necessarily implies or requires migrating to host memory.
> >
> > I'm not sure we should be allowing dma_buf_pin at all on
> > non-migratable memory, what's to stop someone just pinning all the
> > VRAM and making the GPU unuseable?
>
> In a lot of cases we have GPUs with more VRAM than system memory, but we allow pinning in system memory.
>
> Alex
>
In addition, the dma-buf exporter can be a non-GPU device.
Jianxin
> >
> > I understand not considering more than a single user in these
> > situations is enterprise thinking, but I do worry about pinning is
> > always fine type of thinking when things are shared or multi-user.
> >
> > My impression from this is we've designed hardware that didn't
> > consider the problem, and now to let us use that hardware in horrible
> > ways we should just allow it to pin all the things.
> >
> > Dave.
Am 24.08.21 um 19:32 schrieb Jason Gunthorpe:
> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>> On 8/24/21 2:32 AM, Christian König wrote:
>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>> ...
>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>> - Dynamic with revoke_notify
>>>>>> - Static
>>>>>>
>>>>>> Which changes do we need to make the third one work?
>>>>> Basically none at all in the framework.
>>>>>
>>>>> You just need to properly use the dma_buf_pin() function when you start using a
>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>>>> after you are done with the DMA-buf.
>>>> I replied to your previous mail, but I'll ask again.
>>>> Doesn't the pin operation migrate the memory to host memory?
>>> Sorry missed your previous reply.
>>>
>>> And yes at least for the amdgpu driver we migrate the memory to host
>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>> do something similar.
>> Well...for many topologies, migrating to host memory will result in a
>> dramatically slower p2p setup. For that reason, some GPU drivers may
>> want to allow pinning of video memory in some situations.
>>
>> Ideally, you've got modern ODP devices and you don't even need to pin.
>> But if not, and you still hope to do high performance p2p between a GPU
>> and a non-ODP Infiniband device, then you would need to leave the pinned
>> memory in vidmem.
>>
>> So I think we don't want to rule out that behavior, right? Or is the
>> thinking more like, "you're lucky that this old non-ODP setup works at
>> all, and we'll make it work by routing through host/cpu memory, but it
>> will be slow"?
> I think it depends on the user, if the user creates memory which is
> permanently located on the GPU then it should be pinnable in this way
> without force migration. But if the memory is inherently migratable
> then it just cannot be pinned in the GPU at all as we can't
> indefinately block migration from happening eg if the CPU touches it
> later or something.
Yes, exactly that's the point. Especially GPUs have a great variety of
setups.
For example we have APUs where the local memory is just stolen system
memory and all buffers must be migrate-able because you might need all
of this stolen memory for scanout or page tables. In this case P2P only
makes sense to avoid the migration overhead in the first place.
Then you got dGPUs where only a fraction of the VRAM is accessible from
the PCIe BUS. Here you also absolutely don't want to pin any buffers
because that can easily crash when we need to migrate something into the
visible window for CPU access.
The only real option where you could do P2P with buffer pinning is
those compute boards where we know that everything is always accessible
to everybody and we will never need to migrate anything. But even then
you want some mechanism like cgroups to take care of limiting this.
Otherwise any runaway process can bring down your whole system.
Key question at least for me as GPU maintainer is if we are going to see
modern compute boards together with old non-ODP setups. Since those
compute boards are usually used with new hardware (like PCIe v4 for
example) the answer I think is most likely "no".
Christian.
>
> Jason
On 8/24/21 11:17 PM, Christian König wrote:
...
>> I think it depends on the user, if the user creates memory which is
>> permanently located on the GPU then it should be pinnable in this way
>> without force migration. But if the memory is inherently migratable
>> then it just cannot be pinned in the GPU at all as we can't
>> indefinately block migration from happening eg if the CPU touches it
>> later or something.
>
> Yes, exactly that's the point. Especially GPUs have a great variety of setups.
>
> For example we have APUs where the local memory is just stolen system memory and all buffers must be
> migrate-able because you might need all of this stolen memory for scanout or page tables. In this
> case P2P only makes sense to avoid the migration overhead in the first place.
>
> Then you got dGPUs where only a fraction of the VRAM is accessible from the PCIe BUS. Here you also
> absolutely don't want to pin any buffers because that can easily crash when we need to migrate
> something into the visible window for CPU access.
>
> The only real option where you could do P2P with buffer pinning are those compute boards where we
> know that everything is always accessible to everybody and we will never need to migrate anything.
> But even then you want some mechanism like cgroups to take care of limiting this. Otherwise any
> runaway process can bring down your whole system.
>
> Key question at least for me as GPU maintainer is if we are going to see modern compute boards
> together with old non-ODP setups. Since those compute boards are usually used with new hardware
> (like PCIe v4 for example) the answer I think is most likely "no".
>
That is a really good point. Times have changed and I guess ODP is on most (all?) of
the new Infiniband products now, and maybe we don't need to worry so much about
providing first-class support for non-ODP setups.
I've got to drag my brain into 2021+! :)
thanks,
--
John Hubbard
NVIDIA
On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
> The only real option where you could do P2P with buffer pinning are those
> compute boards where we know that everything is always accessible to
> everybody and we will never need to migrate anything. But even then you want
> some mechanism like cgroups to take care of limiting this. Otherwise any
> runaway process can bring down your whole system.
Why? It is not the pin that is the problem, it was allocating GPU
dedicated memory in the first place. Pinning it just changes the
sequence to free it. No different than CPU memory.
> Key question at least for me as GPU maintainer is if we are going to see
> modern compute boards together with old non-ODP setups.
You should stop thinking about it as 'old non-ODP setups'. ODP is
very much a special case that allows a narrow set of functionality to
work in a narrow situation. It has a high performance penalty and
narrow HW support.
So yes, compute boards are routinely used in scenarios where ODP is
not available, today and for the foreseeable future.
Jason
Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
>
>> The only real option where you could do P2P with buffer pinning are those
>> compute boards where we know that everything is always accessible to
>> everybody and we will never need to migrate anything. But even then you want
>> some mechanism like cgroups to take care of limiting this. Otherwise any
>> runaway process can bring down your whole system.
>
> Why? It is not the pin that is the problem, it was allocating GPU
> dedicated memory in the first place. pinning it just changes the
> sequence to free it. No different than CPU memory.
Pinning makes the memory un-evictable.
In other words, as long as we don't pin anything we can support as many
processes as we want until we run out of swap space. Swapping sucks
badly because your applications become pretty much unusable, but you
can easily recover from it by killing some process.
With pinning, on the other hand, somebody sooner or later receives an
-ENOMEM or -ENOSPC and there is no guarantee that this goes to the right
process.
>> Key question at least for me as GPU maintainer is if we are going to see
>> modern compute boards together with old non-ODP setups.
> You should stop thinking about it as 'old non-ODP setups'. ODP is
> very much a special case that allows a narrow set of functionality to
> work in a narrow situation. It has a high performance penalty and
> narrow HW support.
>
> So yes, compute boards are routinely used in scenarios where ODP is
> not available, today and for the foreseeable future.
Yeah, that's the stuff I'm not sure about.
Christian.
>
> Jason
On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
> > On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
> >
> > > The only real option where you could do P2P with buffer pinning are those
> > > compute boards where we know that everything is always accessible to
> > > everybody and we will never need to migrate anything. But even then you want
> > > some mechanism like cgroups to take care of limiting this. Otherwise any
> > > runaway process can bring down your whole system.
> > Why? It is not the pin that is the problem, it was allocating GPU
> > dedicated memory in the first place. pinning it just changes the
> > sequence to free it. No different than CPU memory.
>
> Pinning makes the memory un-evictable.
>
> In other words as long as we don't pin anything we can support as many
> processes as we want until we run out of swap space. Swapping sucks badly
> because your applications become pretty much unuseable, but you can easily
> recover from it by killing some process.
>
> With pinning on the other hand somebody sooner or later receives an -ENOMEM
> or -ENOSPC and there is no guarantee that this goes to the right process.
It is not really different; you have the same failure mode once the
system runs out of swap.
This is really the kernel side trying to push a policy onto the user
side that the user side doesn't want.
Dedicated systems are a significant use case here and should be
supported, even if the same solution wouldn't be applicable to someone
running a desktop.
Jason
Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
> On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
>> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
>>> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
>>>
>>>> The only real option where you could do P2P with buffer pinning are those
>>>> compute boards where we know that everything is always accessible to
>>>> everybody and we will never need to migrate anything. But even then you want
>>>> some mechanism like cgroups to take care of limiting this. Otherwise any
>>>> runaway process can bring down your whole system.
>>> Why? It is not the pin that is the problem, it was allocating GPU
>>> dedicated memory in the first place. pinning it just changes the
>>> sequence to free it. No different than CPU memory.
>> Pinning makes the memory un-evictable.
>>
>> In other words as long as we don't pin anything we can support as many
>> processes as we want until we run out of swap space. Swapping sucks badly
>> because your applications become pretty much unuseable, but you can easily
>> recover from it by killing some process.
>>
>> With pinning on the other hand somebody sooner or later receives an -ENOMEM
>> or -ENOSPC and there is no guarantee that this goes to the right process.
> It is not really different - you have the same failure mode once the
> system runs out of swap.
>
> This is really the kernel side trying to push a policy to the user
> side that the user side doesn't want..
But it is still the right thing to do as far as I can see. After all,
userspace also doesn't want proper process isolation since it takes
extra time.
Kernel development is driven by exposing the hardware functionality in a
safe and manageable manner to userspace, and not by fulfilling userspace
requirements.
This is very important because otherwise you create a specialized
system and not a general purpose kernel.
> Dedicated systems are a significant use case here and should be
> supported, even if the same solution wouldn't be applicable to someone
> running a desktop.
And exactly that approach is not acceptable.
Christian.
>
> Jason
Am 25.08.21 um 16:47 schrieb Jason Gunthorpe:
> On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote:
>> Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
>>> On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
>>>> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
>>>>> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
>>>>>
>>>>>> The only real option where you could do P2P with buffer pinning are those
>>>>>> compute boards where we know that everything is always accessible to
>>>>>> everybody and we will never need to migrate anything. But even then you want
>>>>>> some mechanism like cgroups to take care of limiting this. Otherwise any
>>>>>> runaway process can bring down your whole system.
>>>>> Why? It is not the pin that is the problem, it was allocating GPU
>>>>> dedicated memory in the first place. pinning it just changes the
>>>>> sequence to free it. No different than CPU memory.
>>>> Pinning makes the memory un-evictable.
>>>>
>>>> In other words as long as we don't pin anything we can support as many
>>>> processes as we want until we run out of swap space. Swapping sucks badly
>>>> because your applications become pretty much unuseable, but you can easily
>>>> recover from it by killing some process.
>>>>
>>>> With pinning on the other hand somebody sooner or later receives an -ENOMEM
>>>> or -ENOSPC and there is no guarantee that this goes to the right process.
>>> It is not really different - you have the same failure mode once the
>>> system runs out of swap.
>>>
>>> This is really the kernel side trying to push a policy to the user
>>> side that the user side doesn't want..
>> But which is still the right thing to do as far as I can see. See userspace
>> also doesn't want proper process isolation since it takes extra time.
> Why? You are pushing a policy of resource allocation/usage which
> more properly belongs in userspace.
>
>> Kernel development is driven by exposing the hardware functionality in a
>> save and manageable manner to userspace, and not by fulfilling userspace
>> requirements.
> I don't agree with this, that is a 1980's view of OS design. So much
> these days in the kernel is driven entirely by boutique userspace
> requirements and is very much not about the classical abstract role of
> an OS.
But it's still true nevertheless. Otherwise you would have libraries
for filesystem access and no system security to speak of.
>>> Dedicated systems are a significant use case here and should be
>>> supported, even if the same solution wouldn't be applicable to someone
>>> running a desktop.
>> And exactly that approach is not acceptable.
> We have knobs and settings all over the place to allow Linux to
> support a broad set of use cases from Android to servers, to HPC. So
> long as they can co-exist and the various optional modes do not
> critically undermine the security of the kernel, it is well in line
> with how things have been evolving in the last 15 years.
Yeah, that's exactly what I'm talking about with adding a cgroup or
similar. You need a knob to control this.
> Here you are talking entirely about policy to control memory
> allocation, which is already well trodden ground for CPU memory.
>
> There are now endless boutique ways to deal with this, it is a very
> narrow view to say that GPU memory is so special and different that
> only one way can be the correct/allowed way.
Well, I'm not talking about GPU memory in particular here. This is
mandatory for any memory, or more generally for any resource.
E.g. you are not allowed to pin large amounts of system memory on a
default installation for exactly those reasons as well.
That you can have a knob to disable this behavior for your HPC system is
perfectly fine, but I think, as Dave notes here as well, this is most
likely not the default behavior we want.
Christian.
>
> Jason
On Wed, Aug 25, 2021 at 05:14:06PM +0200, Christian König wrote:
> Yeah, that's exactly what I'm talking about by adding cgroup or similar. You
> need a knob to control this.
We have the pinned memory ulimit today.
A pinned memory cgroup might be interesting, but even containers are
covered under the ulimit (IIRC), so the driver to do this work might
not be so strong.
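Roughly speaking, the ulimit accounting for pinned pages looks
something like the sketch below today (paraphrased from memory from
the umem code, not an exact copy; the helper name is made up):

/* Rough sketch only; needs e.g. <linux/mm_types.h>,
 * <linux/sched/signal.h> and <linux/capability.h>. */
static int example_charge_pinned_pages(struct mm_struct *mm,
				       unsigned long npages)
{
	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	unsigned long new_pinned;

	/* charge the pages against the task's pinned counter */
	new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
	if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
		/* over the memlock ulimit and no CAP_IPC_LOCK: undo and fail */
		atomic64_sub(npages, &mm->pinned_vm);
		return -ENOMEM;
	}
	return 0;
}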
Jason
On Wed, Aug 25, 2021 at 6:14 PM Christian König
<[email protected]> wrote:
>
> Am 25.08.21 um 16:47 schrieb Jason Gunthorpe:
> > On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote:
> >> Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
> >>> On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
> >>>> Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
> >>>>> On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
> >>>>>
> >>>>>> The only real option where you could do P2P with buffer pinning are those
> >>>>>> compute boards where we know that everything is always accessible to
> >>>>>> everybody and we will never need to migrate anything. But even then you want
> >>>>>> some mechanism like cgroups to take care of limiting this. Otherwise any
> >>>>>> runaway process can bring down your whole system.
> >>>>> Why? It is not the pin that is the problem, it was allocating GPU
> >>>>> dedicated memory in the first place. pinning it just changes the
> >>>>> sequence to free it. No different than CPU memory.
> >>>> Pinning makes the memory un-evictable.
> >>>>
> >>>> In other words as long as we don't pin anything we can support as many
> >>>> processes as we want until we run out of swap space. Swapping sucks badly
> >>>> because your applications become pretty much unuseable, but you can easily
> >>>> recover from it by killing some process.
> >>>>
> >>>> With pinning on the other hand somebody sooner or later receives an -ENOMEM
> >>>> or -ENOSPC and there is no guarantee that this goes to the right process.
> >>> It is not really different - you have the same failure mode once the
> >>> system runs out of swap.
> >>>
> >>> This is really the kernel side trying to push a policy to the user
> >>> side that the user side doesn't want..
> >> But which is still the right thing to do as far as I can see. See userspace
> >> also doesn't want proper process isolation since it takes extra time.
> > Why? You are pushing a policy of resource allocation/usage which
> > more properly belongs in userspace.
> >
> >> Kernel development is driven by exposing the hardware functionality in a
> >> save and manageable manner to userspace, and not by fulfilling userspace
> >> requirements.
> > I don't agree with this, that is a 1980's view of OS design. So much
> > these days in the kernel is driven entirely by boutique userspace
> > requirements and is very much not about the classical abstract role of
> > an OS.
>
> But it's still true never the less. Otherwise you would have libraries
> for filesystem accesses and no system security to speak of.
>
> >>> Dedicated systems are a significant use case here and should be
> >>> supported, even if the same solution wouldn't be applicable to someone
> >>> running a desktop.
> >> And exactly that approach is not acceptable.
> > We have knobs and settings all over the place to allow Linux to
> > support a broad set of use cases from Android to servers, to HPC. So
> > long as they can co-exist and the various optional modes do not
> > critically undermine the security of the kernel, it is well in line
> > with how things have been evolving in the last 15 years.
>
> Yeah, that's exactly what I'm talking about by adding cgroup or similar.
> You need a knob to control this.
>
> > Here you are talking entirely about policy to control memory
> > allocation, which is already well trodden ground for CPU memory.
> >
> > There are now endless boutique ways to deal with this, it is a very
> > narrow view to say that GPU memory is so special and different that
> > only one way can be the correct/allowed way.
>
> Well I'm not talking about GPU memory in particular here. This is
> mandatory for any memory or saying more general any resource.
>
> E.g. you are not allowed to pin large amount of system memory on a
> default installation for exactly those reasons as well.
Unless you are working on a h/w architecture that is designed
specifically for a single user.
At least in our h/w architecture, we aim for 100% utilization of the
h/w when you are running DL training workloads, as opposed to what is
happening inside a GPU when you are running training.
Therefore, it is counter-productive to serve multiple users from a
performance perspective.
In fact, the training problem is so large that a single ASIC is
hardly sufficient, and you need to run on anywhere from 8 to
64-128 ASICs to get a reasonable training time.
So there is no point in serving multiple users in that case and that's
why our device memory (HBM) is always pinned.
This is totally different from the usual role of a GPU, where you
serve the desktop and multiple other applications that draw on the
screen.
I wouldn't want some knob in the kernel to control how much memory I
can pin. I just don't think it is useful and/or user friendly.
BTW, regarding ODP, we don't support it in h/w (yet) because 99% of
the use-cases in DL training will suffer from performance issues if we
don't pin the host memory.
There are extreme topologies, with 10TB of parameters, that may require
this, but like I said, it's very rare and not worth the effort.
Thanks,
Oded
>
> That you can have a knob to disable this behavior for your HPC system is
> perfectly fine, but I thing what Dave notes here as well that this is
> most likely not the default behavior we want.
>
> Christian.
>
> >
> > Jason
>
On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote:
> Am 25.08.21 um 14:38 schrieb Jason Gunthorpe:
> > On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
> > > Am 25.08.21 um 14:18 schrieb Jason Gunthorpe:
> > > > On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:
> > > >
> > > > > The only real option where you could do P2P with buffer pinning are those
> > > > > compute boards where we know that everything is always accessible to
> > > > > everybody and we will never need to migrate anything. But even then you want
> > > > > some mechanism like cgroups to take care of limiting this. Otherwise any
> > > > > runaway process can bring down your whole system.
> > > > Why? It is not the pin that is the problem, it was allocating GPU
> > > > dedicated memory in the first place. pinning it just changes the
> > > > sequence to free it. No different than CPU memory.
> > > Pinning makes the memory un-evictable.
> > >
> > > In other words as long as we don't pin anything we can support as many
> > > processes as we want until we run out of swap space. Swapping sucks badly
> > > because your applications become pretty much unuseable, but you can easily
> > > recover from it by killing some process.
> > >
> > > With pinning on the other hand somebody sooner or later receives an -ENOMEM
> > > or -ENOSPC and there is no guarantee that this goes to the right process.
> > It is not really different - you have the same failure mode once the
> > system runs out of swap.
> >
> > This is really the kernel side trying to push a policy to the user
> > side that the user side doesn't want..
>
> But which is still the right thing to do as far as I can see. See userspace
> also doesn't want proper process isolation since it takes extra time.
Why? You are pushing a policy of resource allocation/usage which
more properly belongs in userspace.
> Kernel development is driven by exposing the hardware functionality in a
> save and manageable manner to userspace, and not by fulfilling userspace
> requirements.
I don't agree with this; that is a 1980s view of OS design. So much
in the kernel these days is driven entirely by boutique userspace
requirements and is very much not about the classical abstract role of
an OS.
> > Dedicated systems are a significant use case here and should be
> > supported, even if the same solution wouldn't be applicable to someone
> > running a desktop.
>
> And exactly that approach is not acceptable.
We have knobs and settings all over the place to allow Linux to
support a broad set of use cases from Android to servers, to HPC. So
long as they can co-exist and the various optional modes do not
critically undermine the security of the kernel, it is well in line
with how things have been evolving in the last 15 years.
Here you are talking entirely about policy to control memory
allocation, which is already well trodden ground for CPU memory.
There are now endless boutique ways to deal with this, it is a very
narrow view to say that GPU memory is so special and different that
only one way can be the correct/allowed way.
Jason
Am 01.09.21 um 13:20 schrieb Gal Pressman:
> On 24/08/2021 20:32, Jason Gunthorpe wrote:
>> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>>> On 8/24/21 2:32 AM, Christian König wrote:
>>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>> ...
>>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>>> - Dynamic with revoke_notify
>>>>>>> - Static
>>>>>>>
>>>>>>> Which changes do we need to make the third one work?
>>>>>> Basically none at all in the framework.
>>>>>>
>>>>>> You just need to properly use the dma_buf_pin() function when you start using a
>>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>>>>> after you are done with the DMA-buf.
>>>>> I replied to your previous mail, but I'll ask again.
>>>>> Doesn't the pin operation migrate the memory to host memory?
>>>> Sorry missed your previous reply.
>>>>
>>>> And yes at least for the amdgpu driver we migrate the memory to host
>>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>>> do something similar.
>>> Well...for many topologies, migrating to host memory will result in a
>>> dramatically slower p2p setup. For that reason, some GPU drivers may
>>> want to allow pinning of video memory in some situations.
>>>
>>> Ideally, you've got modern ODP devices and you don't even need to pin.
>>> But if not, and you still hope to do high performance p2p between a GPU
>>> and a non-ODP Infiniband device, then you would need to leave the pinned
>>> memory in vidmem.
>>>
>>> So I think we don't want to rule out that behavior, right? Or is the
>>> thinking more like, "you're lucky that this old non-ODP setup works at
>>> all, and we'll make it work by routing through host/cpu memory, but it
>>> will be slow"?
>> I think it depends on the user, if the user creates memory which is
>> permanently located on the GPU then it should be pinnable in this way
>> without force migration. But if the memory is inherently migratable
>> then it just cannot be pinned in the GPU at all as we can't
>> indefinately block migration from happening eg if the CPU touches it
>> later or something.
> So are we OK with exporters implementing dma_buf_pin() without migrating the memory?
I think so, yes.
> If so, do we still want a move_notify callback for non-dynamic importers? A noop?
Well we could make the move_notify callback optional, e.g. so that you
get the new locking approach but still pin the buffers manually with
dma_buf_pin().
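Just to sketch the idea, the importer side could then look something
like this (names invented for the example, and the locking and the
dynamic/static checks would obviously still need to be sorted out):

static const struct dma_buf_attach_ops example_pinned_importer_ops = {
	.allow_peer2peer = true,
	/* no .move_notify: the attachment is treated as pinned/static */
};

static struct sg_table *example_import_and_pin(struct dma_buf *dmabuf,
					       struct device *dev)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	int ret;

	attach = dma_buf_dynamic_attach(dmabuf, dev,
					&example_pinned_importer_ops, NULL);
	if (IS_ERR(attach))
		return ERR_CAST(attach);

	dma_resv_lock(dmabuf->resv, NULL);
	ret = dma_buf_pin(attach);	/* exporter keeps the buffer in place */
	if (ret) {
		dma_resv_unlock(dmabuf->resv);
		dma_buf_detach(dmabuf, attach);
		return ERR_PTR(ret);
	}

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		dma_buf_unpin(attach);
		dma_resv_unlock(dmabuf->resv);
		dma_buf_detach(dmabuf, attach);
		return sgt;
	}
	dma_resv_unlock(dmabuf->resv);

	return sgt;	/* unmap/unpin/detach at teardown, not shown */
}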
Regards,
Christian.
On 24/08/2021 20:32, Jason Gunthorpe wrote:
> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>> On 8/24/21 2:32 AM, Christian König wrote:
>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>> ...
>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>> - Dynamic with revoke_notify
>>>>>> - Static
>>>>>>
>>>>>> Which changes do we need to make the third one work?
>>>>> Basically none at all in the framework.
>>>>>
>>>>> You just need to properly use the dma_buf_pin() function when you start using a
>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin() function
>>>>> after you are done with the DMA-buf.
>>>> I replied to your previous mail, but I'll ask again.
>>>> Doesn't the pin operation migrate the memory to host memory?
>>>
>>> Sorry missed your previous reply.
>>>
>>> And yes at least for the amdgpu driver we migrate the memory to host
>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>> do something similar.
>>
>> Well...for many topologies, migrating to host memory will result in a
>> dramatically slower p2p setup. For that reason, some GPU drivers may
>> want to allow pinning of video memory in some situations.
>>
>> Ideally, you've got modern ODP devices and you don't even need to pin.
>> But if not, and you still hope to do high performance p2p between a GPU
>> and a non-ODP Infiniband device, then you would need to leave the pinned
>> memory in vidmem.
>>
>> So I think we don't want to rule out that behavior, right? Or is the
>> thinking more like, "you're lucky that this old non-ODP setup works at
>> all, and we'll make it work by routing through host/cpu memory, but it
>> will be slow"?
>
> I think it depends on the user, if the user creates memory which is
> permanently located on the GPU then it should be pinnable in this way
> without force migration. But if the memory is inherently migratable
> then it just cannot be pinned in the GPU at all as we can't
> indefinately block migration from happening eg if the CPU touches it
> later or something.
So are we OK with exporters implementing dma_buf_pin() without migrating the memory?
If so, do we still want a move_notify callback for non-dynamic importers? A noop?
On 01/09/2021 14:24, Christian König wrote:
>
>
> Am 01.09.21 um 13:20 schrieb Gal Pressman:
>> On 24/08/2021 20:32, Jason Gunthorpe wrote:
>>> On Tue, Aug 24, 2021 at 10:27:23AM -0700, John Hubbard wrote:
>>>> On 8/24/21 2:32 AM, Christian König wrote:
>>>>> Am 24.08.21 um 11:06 schrieb Gal Pressman:
>>>>>> On 23/08/2021 13:43, Christian König wrote:
>>>>>>> Am 21.08.21 um 11:16 schrieb Gal Pressman:
>>>>>>>> On 20/08/2021 17:32, Jason Gunthorpe wrote:
>>>>>>>>> On Fri, Aug 20, 2021 at 03:58:33PM +0300, Gal Pressman wrote:
>>>> ...
>>>>>>>> IIUC, we're talking about three different exporter "types":
>>>>>>>> - Dynamic with move_notify (requires ODP)
>>>>>>>> - Dynamic with revoke_notify
>>>>>>>> - Static
>>>>>>>>
>>>>>>>> Which changes do we need to make the third one work?
>>>>>>> Basically none at all in the framework.
>>>>>>>
>>>>>>> You just need to properly use the dma_buf_pin() function when you start
>>>>>>> using a
>>>>>>> buffer (e.g. before you create an attachment) and the dma_buf_unpin()
>>>>>>> function
>>>>>>> after you are done with the DMA-buf.
>>>>>> I replied to your previous mail, but I'll ask again.
>>>>>> Doesn't the pin operation migrate the memory to host memory?
>>>>> Sorry missed your previous reply.
>>>>>
>>>>> And yes at least for the amdgpu driver we migrate the memory to host
>>>>> memory as soon as it is pinned and I would expect that other GPU drivers
>>>>> do something similar.
>>>> Well...for many topologies, migrating to host memory will result in a
>>>> dramatically slower p2p setup. For that reason, some GPU drivers may
>>>> want to allow pinning of video memory in some situations.
>>>>
>>>> Ideally, you've got modern ODP devices and you don't even need to pin.
>>>> But if not, and you still hope to do high performance p2p between a GPU
>>>> and a non-ODP Infiniband device, then you would need to leave the pinned
>>>> memory in vidmem.
>>>>
>>>> So I think we don't want to rule out that behavior, right? Or is the
>>>> thinking more like, "you're lucky that this old non-ODP setup works at
>>>> all, and we'll make it work by routing through host/cpu memory, but it
>>>> will be slow"?
>>> I think it depends on the user, if the user creates memory which is
>>> permanently located on the GPU then it should be pinnable in this way
>>> without force migration. But if the memory is inherently migratable
>>> then it just cannot be pinned in the GPU at all as we can't
>>> indefinately block migration from happening eg if the CPU touches it
>>> later or something.
>> So are we OK with exporters implementing dma_buf_pin() without migrating the
>> memory?
>
> I think so, yes.
>
>> If so, do we still want a move_notify callback for non-dynamic importers? A noop?
>
> Well we could make the move_notify callback optional, e.g. so that you get the
> new locking approach but still pin the buffers manually with dma_buf_pin().
Thanks, Christian!
So the end result will look similar to the original patch I posted, where
peer2peer can be enabled without providing move_notify, correct?
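And on the exporter side, I imagine an always-pinned exporter would end
up with something roughly like the sketch below, i.e. .pin/.unpin that
never migrate anything since the memory is permanently resident (names
are completely made up for illustration, not taken from any real
driver):

static struct sg_table *example_map_dma_buf(struct dma_buf_attachment *attach,
					    enum dma_data_direction dir)
{
	/* placeholder: a real exporter would build an sg_table describing
	 * its PCIe-accessible device memory here */
	return ERR_PTR(-EOPNOTSUPP);
}

static void example_unmap_dma_buf(struct dma_buf_attachment *attach,
				  struct sg_table *sgt,
				  enum dma_data_direction dir)
{
}

static void example_release(struct dma_buf *dmabuf)
{
}

static int example_pin(struct dma_buf_attachment *attach)
{
	/* device memory is permanently resident, nothing to migrate */
	return 0;
}

static void example_unpin(struct dma_buf_attachment *attach)
{
}

static const struct dma_buf_ops example_pinned_exporter_ops = {
	.pin		= example_pin,
	.unpin		= example_unpin,
	.map_dma_buf	= example_map_dma_buf,
	.unmap_dma_buf	= example_unmap_dma_buf,
	.release	= example_release,
};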